Talk:File comparison

Latest comment: 7 years ago by Yaris678 in topic Article title RFC

Other file comparison programs find block moves.

I have just tagged the sentence "Other file comparison programs find block moves." for clarification. It's in the "method types" section. Are we saying that several programs use Heckel's algorithm? Are we saying that other algorithms have been developed? Either way we could change the wording here... either to make it clear that the IBM history flow tool is nothing special or that Heckel's algorithm has been... improved upon? Superseded? Yaris678 (talk) 14:21, 4 January 2012 (UTC)

Article title

I note that Cacycle moved this article from File comparison to Data comparison with the comment "Article is not only about files, but also about texts and other data objects".

I agree with the comment. However, I think that the title "data comparison" is a bit too general. For example, the term could be used to refer to statistical comparison, which the article is definitely not about. Perhaps the title should be Text comparison. This title is arguably too specific, but I think it may be the best option. We can talk about how text comparison techniques can be applied to non-text data... which I think is fair. The techniques were basically invented for text but have been applied in a few other contexts.

Yaris678 (talk) 12:29, 8 May 2015 (UTC)

Yaris678: I think text comparison is again way too specific. I don't think that the title has to be changed as the intro clearly states the scope of the article. But I will add a hatnote to Comparison. Cacycle (talk) 14:35, 8 May 2015 (UTC)
Hi Cacycle. I think the hatnote is a good idea. It improves the situation, but I'm not convinced that it is sufficient. I have left messages at WT:WikiProject Computing and WT:WikiProject Statistics to see if anyone on those WikiProjects has an opinion on this. Yaris678 (talk) 13:18, 9 May 2015 (UTC)
I have read through the article and don't find the original justification for the move by Cacycle to be compelling. The article, as it stands, is primarily concerned with comparison of files. The move creates other problems. The move was not a net improvement and should be reverted. We can continue discussion trying to find a title that is better than File comparison. ~Kvng (talk) 16:56, 9 May 2015 (UTC)
We would have to find a title that is better than file comparison as well as text comparison and covers the article's content, i.e. text, file, and data structure comparisons. Cacycle (talk) 18:23, 9 May 2015 (UTC)
Data structure comparison was only mentioned in the lead and since there is no supporting discussion in the body, I have removed that. Non-text comparison is discussed in the body so I don't think Text comparison is a good title. We don't need to find a title better than File comparison, we just need to find one better than the current title. ~Kvng (talk) 01:52, 10 May 2015 (UTC)

This discussion seems to have stalled so I thought I would try to summarise the arguments made so far in the table below. I have also added, in grey text, arguments that haven't been made until now, but I thought were worth mentioning.

| | File comparison | Data comparison | Text comparison |
| --- | --- | --- | --- |
| Too specific | Methods can be used for things other than files, e.g. two pieces of text stored in the same database. | | Methods can also be used with text-like data, such as genome sequences. Article includes some info on the check-sum approach, which applies to any file, not just a text file. |
| Too general | Methods wouldn't be much use in comparing some types of file, e.g. two image files (exception is the check-sum approach, which can be used for any type of file). | Many other ways to compare data, e.g. statistical comparison. Most of the article is only about comparing text and text-like data. | Texts can be compared in other ways, e.g. comparing the meaning. |

I think looking at it like this should help us to decide what to do next.

Does anyone think there is anything missing from the table? Anything in the table that people think is not true? We can get on to the relative strengths of these arguments in a while, but I think it is worth having all the arguments set out first.

Yaris678 (talk) 12:38, 16 May 2015 (UTC)

I'm afraid this is not helping me much. Should we consider doing an WP:RFC? ~Kvng (talk) 14:05, 18 May 2015 (UTC)
I'm happy to have an RfC. What do you mean by not helping you much? The table is not intended to lead directly to any conclusions, just to present the arguments that have been made. Yaris678 (talk) 11:23, 20 May 2015 (UTC)
From where I stand, the table is not any more clear than the original statements from individual editors. Also, the fact that you've introduced new arguments (in summary form only) has complicated things. ~Kvng (talk) 14:52, 25 May 2015 (UTC)

Article title RFC

Should the article title be restored to File comparison? See discussion above. ~Kvng (talk) 15:27, 25 May 2015 (UTC)

Yes - Looking through the text of the whole article, "file comparison" is always used over "data comparison" with no confusion. It seems odd to have the title of the article as "data comparison" then have it not mention that name anywhere in the article. More anecdotally, "file comparison" is tech jargon that refers to all of the types of comparison encompassed by this article, whether they're done on an actual node stored on a filesystem or not. The only other appropriate term for the same thing is "diff", which I'm not sure is the best title here. This is a difficult point to find reliable secondary sources for, but the language already used in the article plus doing some searching of the various terms on Google shows the common usage of "file comparison". Arathald (talk) 05:22, 26 May 2015 (UTC)
Comment - I was going to suggest moving the article to the title File comparison before having an RfC. That was the name until recently and, as Arathald points out, most of the article uses the term "file comparison". Plus we've not heard from Cacycle recently, and he was the only one arguing for "data comparison". I suggest we WP:SNOW close this RfC and move the article back to File comparison and then have a wider discussion, possibly in the form of an RfC, about whether or not there is a better title for the article. Yaris678 (talk) 07:46, 27 May 2015 (UTC)
Comment - While I think the right conclusion is pretty clear, the objections to "file comparison" were not unreasonable or in violation of policy (per the guidelines in WP:SNOW). I think we should let this RFC gather a few more comments and see if there's any further objection or if there are reliable sources that support "data comparison" as a better title. Arathald (talk) 18:45, 28 May 2015 (UTC)
Are we there yet? ~Kvng (talk) 22:22, 14 June 2015 (UTC)
Yes definitely. Or more. Data comparison is hopelessly too wide a term for the topic, which here amounts to string comparison. I checked the history, and the justification given was "Cacycle moved page File comparison to Data comparison: Article is not only about files, but also about texts and other data objects". In common usage in the IT field items of all those types and more are stored and processed in data media in structures called files, whereas data comparison is a far wider concept and might concern statistical, structural, or information theoretical concepts or operations and might refer to semantic, implicit, linguistic or teleological comparisons, as well as comparisons between media, formats, structures, terminology, times, dimensions, or locations. Anyone competent in the field who consults this article on the basis of a title such as "Data comparison" would get a very bad impression indeed of the triviality of WP. The comparisons in this article are of or within files, and typically they are character-encoded, linear files at that. Whether the files contain texts or other data objects is irrelevant; any such items could appear in files.
This said, though I think Yaris678 has the right general idea, it does not go far enough; we here have something of a rat's nest in terms of the range of topics. Looking at the "See also" section I find Computer-assisted reviewing, Data differencing, Delta encoding, Edit distance and so on. On inspection of these articles, their topics overlap and the texts are mutually inconsistent. The whole lot needs to be rationalised, which might not be easy unless we can identify a suitably widely acceptable terminology. They might be combined into one article (still not "data comparison", which is too wide a term, as I have pointed out) or they might be re-written individually to improve the articulateness of the material with the aid of suitable links. As they stand the topic is a mess.
In passing I note that I might have missed something on skimming the associated articles, but the importance of these concepts to comparison of DNA or peptide chains seems to me to have been overlooked. JonRichfield (talk) 08:06, 11 June 2015 (UTC)
Yes. The software in the article compares files (or things equivalent to files, such as text fields). It does not compare "data" in general. For instance - I have a database which includes two tables that ought to contain identical data, but don't, and I would like to compare them: I can't do this comparison using any of the software mentioned in the article. Comment: the article could also make it clearer that 'diff' only does pairwise comparisons, while some software can compare three or more files at a time. Maproom (talk) 07:04, 12 June 2015 (UTC)

Implemented. Everyone said yes, so I've made the move, even if it is nearly 2 years since the last comment. Yaris678 (talk) 13:29, 25 March 2017 (UTC)