CorenSearchBot seems to have become over-elaborate and may be causing problems. A recent check of Wikipedia:Suspected copyright violations showed 73 entries, of which 6 had already been checked and marked as false positives. It was very hard to see why the bot had flagged these six. A stern but incomprehensible copyright violation warning on the first article a new editor creates may discourage that editor from contributing further. Such a warning is disconcerting even to experienced editors. The rules applied by this bot should be reviewed, and logic that often creates false positives should be removed.

Hypothetical examples

Copyright violations are presumably common and often harmless, and they may be hard to detect mechanically.

Case 1: A fan of a bar-band singer decides she should have a Wikipedia article. He creates it by copying her bio from her website.

Case 2: An established editor creates a stub for a scout camp in Michigan. The article makes it through new page review. The scout leader responsible for the camp spots the article and copies in the prospectus he sends to parents describing activities, mission and values.

Case 3: A popular author releases a new book. An editor copies a magazine review of the book verbatim into the article on the author, without attribution. Two days later Wikipedia receives a letter from the magazine's lawyers demanding immediate removal of the plagiarized content.

The first two violations are unlikely to expose Wikipedia to legal costs. The first may be caught by CorenSearchBot if the singer's website shows up in the search results for a phrase in the article. But if the fan first creates a stub article and then adds the content, it will not be caught. The second cannot be detected by any bot, since the content exists nowhere else online. The third is by far the most serious, yet CorenSearchBot would not detect it because the bot checks only new articles.

Argument

Wikipedia cannot prevent or be held responsible for copyright violations created by editors, but must quickly remove them when the problem is pointed out. Wikipedia should also make reasonable efforts to detect and remove violations proactively, but has no more obligation than the owner of a public noticeboard.

Some copyright violations are blatant: long sentences or whole paragraphs that are almost identical to the source. Most will be obvious to reviewers, but not all. When CorenSearchBot finds and reports blatant violations it is providing a useful service. If practical, the bot's scope should be expanded to look for blatant violations in edits to existing articles, not just new ones. However, any changes to the bot must be tested carefully. It must not discourage improvements and waste reviewers' time by frequently reporting false positives.
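To make "blatant" concrete, here is a minimal sketch of one way to test for long runs of identical wording. It is not CorenSearchBot's actual algorithm, which this essay does not describe. The function names and the 12-word threshold are hypothetical choices for illustration only.

```python
import re
from difflib import SequenceMatcher

# Hypothetical threshold: a shared run of at least this many
# consecutive words is treated as a "blatant" copy.
MIN_RUN_WORDS = 12


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_shared_run(article, source):
    """Return the longest run of consecutive words found in both texts."""
    a, b = words(article), words(source)
    # autojunk=False stops difflib from ignoring very common words
    # in long inputs, which would break up genuine runs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def is_blatant_copy(article, source, min_run=MIN_RUN_WORDS):
    """True if the two texts share a run of at least min_run words."""
    return len(longest_shared_run(article, source)) >= min_run


source = ("The camp offers canoeing, archery and orienteering for scouts "
          "aged eleven to fifteen during six weeks each summer.")
copied = "History. " + source + " It opened in 1950."
paraphrase = ("Scouts between eleven and fifteen can try archery, canoeing "
              "and orienteering at the camp each summer.")

print(is_blatant_copy(copied, source))      # pasted sentence is flagged
print(is_blatant_copy(paraphrase, source))  # reworded text is not
```

A check this simple misses paraphrased copying by design. That is the trade suggested here: report only what a reviewer could confirm at a glance, and leave subtler cases to human judgement.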

It seems that, although CorenSearchBot may be accurate in reporting blatant copyright violations, it is less accurate and often causes problems when it attempts to detect and report more subtle violations. The logic needs review, starting with an analysis of false positive reports and their causes. There should be no concern about publishing the algorithm. There is no conspiracy to introduce wholesale copyright violations into Wikipedia, just editors who do not know they are breaking the rules.

CorenSearchBot should focus on the most obvious patterns. The more sophisticated the logic becomes, the more likely it is to do more harm than good. It may have reached that stage already. Since it looks only at new articles and checks against only a limited number of online sources, it can at best detect a fraction of all violations. False positive reports are damaging to the project. It is better to aim to find 4.7% of violations with no false positives than 4.8% of violations (the maximum that can be detected) with frequent false positives.

False positives discourage editors from contributing.

Warning

 

This is an automated message from CorenSearchBot. I have performed a web search with the contents of User:Aymatth2/CorenSearchBot, and it appears to include a substantial copy of http://en.wikipedia.org/w/index.php?title=User:Aymatth2/CorenSearchBot. For legal reasons, we cannot accept copyrighted text or images borrowed from other web sites or printed material; such additions will be deleted. You may use external websites as a source of information, but not as a source of sentences. See our copyright policy for further details. (If you own the copyright to the previously published content and wish to donate it, see Wikipedia:Donating copyrighted materials for the procedure.)

This message was placed automatically, and it is possible that the bot is confused and found similarity where none actually exists. If that is the case, you can remove the tag from the article and it would be appreciated if you could drop a note on the maintainer's talk page. CorenSearchBot (talk) 01:01, 12 May 2024 (UTC)

Comments