Wikipedia:WikiProject Military history/News/June 2011/Op-ed





Hunting for plagiarism: a case study

By Fifelfoo

I asked Fifelfoo to review one of the articles I authored, South American dreadnought race, for a Milhist A-class review. There were two minor instances with similar wording, which he listed at the article's review page, but I did not realize how in-depth he went until he posted a detailed summary on his talk page. He gave me permission to post it in this op-ed [1], and I hope it will give reviewers ideas on what and how to look for instances of plagiarized text as they go through articles at GAN, A-class, or FAC. Ed [talk] [majestic titan]




Ed. note: all text below was written by Fifelfoo in these two edits

I reviewed South American dreadnought race for MILHIST A Class. To check for copyvios, I first determined how I should approach the issue:

  1. Automated testing and individual source manual verification
  2. Full text reading for plagiarism through analysis of style changes, exceptional turns of phrase, etc. and individual source manual verification

revision reviewed

Firstly, in either technique, read the article history and look at the change logs. It is mostly one editor's work (so ad hoc plagiarism is unlikely), so I then look at the first version of the page.

  • Good sign: the editor knows how to cite in general, and the works are predominantly scholarly. Citations like "Scheina, Naval History, 45–49, 297–298, 347." are great, because they indicate that the data is a result of WP:Close paraphrase's suggestions: to read broadly, internalise, and then write. Single page citations are much more likely to be paraphrase or plagiarism.
  • Mixed sign: ""Minas Geraes I," Serviço de Documentação da Marinha - Histórico de Navios." Citations from sources other than English may involve translated plagiarism or close paraphrase. Or they could mean that the author has fundamentally changed the method of expression entirely due to the need to internalise the non-English facts, and then present them in high quality written encyclopaedic English.
  • Danger sign: ""Brazil," Naval Engineers, 883." A single page from a single specialised chapter in a specialised work. While most editors use these well, some, accidentally or through poor encyclopaedic editing, lift clauses, phrases and sentences entire, or closely paraphrase already terse technical expressions. Check the final version, and examine that.

The revision history shows the article grew at a steady pace in byte count for 14 days, then levelled off (obviously undergoing copyediting!). This is a good sign. Steady growth indicates normal authorial work. So do sudden spurts of growth, provided each spurt is of a similar size. Sudden increases in size which are out of pattern can indicate things being lifted whole.
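
The growth check can be roughed out in code. The following is only a minimal sketch: the function name, the byte counts, and the threshold of three times the median growth are all assumptions made for illustration, not part of the review itself. It flags revisions whose growth is far larger than the article's typical growth.

```python
from statistics import median

def flag_unusual_growth(sizes, factor=3.0):
    """Return indices of revisions whose byte growth is far above typical growth.

    sizes  -- article size in bytes after each revision, oldest first
              (hypothetical input, gathered by hand from the page history)
    factor -- multiple of the median positive growth that counts as
              "out of pattern" (an assumed threshold, not an established rule)
    """
    deltas = [later - earlier for earlier, later in zip(sizes, sizes[1:])]
    positive = [d for d in deltas if d > 0]
    if not positive:
        return []
    typical = median(positive)
    # Index i + 1 is the revision that introduced the jump.
    return [i + 1 for i, d in enumerate(deltas) if d > factor * typical]

# Made-up byte counts: steady growth, one sudden jump, then copyediting.
history = [1000, 1900, 2700, 3600, 9800, 10600, 10500, 10550]
print(flag_unusual_growth(history))  # prints [4]
```

A flagged revision is only a prompt to read that diff closely; a large jump can equally be a sandbox draft pasted into the article.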

Finally, glancing over the final revision for review:

  • The Endnotes are full """The Brazilian Battleship "Minas Geraes"," Scientific American, 241." and give appropriate page references—I can spot check these!
  • The Bibliography is full ""Brazil." Journal of the American Society of Naval Engineers 20, no. 3 (1909): 833–836. ISSN 0099-7056. OCLC 3227025." and shows a clear use of scholarly sources—plagiarism is likely to sound "scholarly" for this article, or "military technical", rather than "encyclopaedic" or "journalistic". Plus, knowing that scholarly sources have been used deeply, this article is probably written by a well developed editor—either the editor is a habitual plagiarist, or the editor may make the occasional mistakes we all do when writing up a fact, or the editor is an excellent editor who internalises and writes fresh excellent prose. There would be no problem with clear copyright violations of online blogs, journalism, or scholarly papers (and this is demonstrated below with the tools).

To conduct automated testing I used http://toolserver.org/~earwig/cgi-bin/copyvio.py and http://en.wikipedia.org/wiki/User:CorenSearchBot/manual on the article. Earwig showed clear. CorenBot showed clear. This is only the start, however, as close paraphrase and "do the sources support their conclusions" need to be checked using this method. As Full text reading proceeds to individual source checking, I'll deal with Full text reading next.

Full text reading is the process of closely reading the style and expression of a work, to look for jarring changes in style, very unusual verbs, verbal clauses or adjectival constructions, material worded far more poorly than average, material worded much better than average, and styles which appear academic or journalistic (etc.) rather than encyclopaedic. So I started at the top of SAdr. For example:

  • jarring changes
  • unusual expressions:
    • "A South American dreadnought arms race between the countries of Argentina, Brazil, and Chile was kindled in 1907." => Google: ""dreadnought race" "was kindled" -wikipedia" => no hits, probably clear
    • "After canceling a naval-limiting pact between them" unusual tense => Google "naval-limiting pact" -wikipedia => clear, with wikipedia, ah, the article was started on wikipedia in a sandbox, check the sandbox history "The deletion and move log for the page are provided below for reference.", no worries, we can use the history check above.
    • "warships seized merchant ships which had been licensed to operate" => Google as a phrase => Ah, this article includes wikipedia material that has been previously published in other featured articles. A good sign, multiple reviews are better (but notice that FAC has had some rather upsetting incidents, for a "full" review like MILHIST A or FAC, spotchecks must be done)
  • Identifying the author's own style: "By this time, however, the First World War had broken out in Europe, so dreadnoughts for foreign countries were suspended while the shipbuilders assisted the war effort. " ah, the author brackets clauses too much, and speaks in a particular kind of tense "so…were suspended while" while writing history... I can use this to identify unusual tenses.
  • There's a block quote in the article, cited thus, "(John H. Biles, "Argentina," Navy (Washington) 4, no. 7 (1910): 30, quoted in Scheina, Naval History, 84)." The author knows how to cite, so they're less likely to violate copyright or plagiarise; but close paraphrasing by habit or accident could still exist.
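
The phrase searches above all follow one pattern: quote each unusual phrase for exact matching, then exclude Wikipedia itself. As a small sketch (the helper name is hypothetical, and it only builds the query string; it does not call any search service):

```python
def build_phrase_query(phrases, exclude=("wikipedia",)):
    """Quote each phrase for exact matching and append -term exclusions."""
    quoted = " ".join(f'"{phrase}"' for phrase in phrases)
    excluded = " ".join(f"-{term}" for term in exclude)
    return f"{quoted} {excluded}".strip()

# The two searches used in the examples above.
print(build_phrase_query(["dreadnought race", "was kindled"]))
print(build_phrase_query(["naval-limiting pact"]))
```

Dropping the exclusion (`exclude=()`) gives the "with wikipedia" variant, which is how the sandbox history and the reuse in other featured articles turned up.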

By this stage I've determined the editor's own prose style, and have read the rest of the article, not noticing any sudden stylistic changes. Thus I need to move to spot checking.

Spot checking relies on picking sources, footnotes, or sentences which are likely to be close paraphrase:

  • The smaller the page range cited, the more likely paraphrase
  • The smaller the content cited to the source (i.e. a single clause, phrase or sentence), the more likely a paraphrase
  • The fewer sources cited to a fact, the more likely a paraphrase
  • The more unique the fact, the more likely a paraphrase
  • The more often a single source is relied upon, particularly in the manner above, the more likely that plagiarism has occurred.

Now saying this doesn't mean that editors acting in such ways are plagiarising; but these are the signs which I have found when dealing with Humanities encyclopaedia articles. When I see these signs, I concentrate spot checking on sources and sentences which display this behaviour. If a source is particularly relied upon in this way I check every usage of that source.
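
Two of these signs, page range and reliance on a single source, are easy to rank mechanically. The sketch below is one possible ordering under stated assumptions: the `Footnote` record, the sort key, and the sample entries are all invented for illustration, and the other signs (uniqueness of the fact, amount of content cited) still need a human reader.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Footnote:
    label: str       # footnote label (hypothetical)
    source: str      # short source name
    first_page: int
    last_page: int

def rank_for_spotcheck(footnotes):
    """Order footnotes so narrow page ranges from heavily used sources come first."""
    uses = Counter(f.source for f in footnotes)

    def key(f):
        pages = f.last_page - f.first_page + 1
        # Fewer pages first; among equals, the more heavily cited source first.
        return (pages, -uses[f.source])

    return sorted(footnotes, key=key)

# Illustrative entries only; labels and page numbers are made up.
notes = [
    Footnote("A", "Scheina", 45, 49),
    Footnote("B", "Livermore", 39, 39),
    Footnote("C", "Livermore", 40, 40),
    Footnote("D", "Evening Post", 13, 13),
]
print([f.label for f in rank_for_spotcheck(notes)])  # prints ['B', 'C', 'D', 'A']
```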

Consider, "The United States' Fore River Shipbuilding Company tendered the lowest bid—in part due to the high availability of cheap steel—and was awarded the contract.[32]" Endnote 32: "Livermore, "Battleship Diplomacy," 39." Bibliography: "Livermore, Seward W. "Battleship Diplomacy in South America: 1905–1925." The Journal of Modern History 16: no. 1 (1944), 31–44. JSTOR 1870986. ISSN 0022-2801. OCLC 62219150."

  • This is an excellent example. A short sentence with a unique fact, unusual expression, cited to a single page, in a source which is available online (JSTOR).
  • I opened the JSTOR link (blessed my access) and proceeded to page 39.
  • I need to check that the source
    1. Supports the claim
      • To support the claim the text must support "Fore River's lowest tender" and "due to cheap steel" and "awarded contract"
      • It does: over a long paragraph! The source supports the claims, all of them. Though I have to read "lowest cost result" and "Power of the US Steel Trust" "successful vendor Fore River"—the editor has distilled the encyclopaedic facts from the more expansive scholarly discussion.
    2. Is not copyvio, plagiarism, or close paraphrase.
      • Does the source use the exact order "Fore lowest bid, cheap steel, awarded contract"? Does the source use the same expressions, or precise turns of phrase? Is this a sentence lifted from the source with merely new adjectives, "high availability" instead of "cheaply available"? Is this three sentences presented in another order, with very similar language, and the order of the material simply reversed?
      • It does not: In the long paragraph the unique terms "lowest bid" "high availability" "cheap steel" "awarded contract" do not appear.
      • The source uses completely different turns of phrase, and orders of presentation, and spreads them out over the entire page. The article condenses these, uses a different order, a different tense and style of expression entirely.

I then repeat this for every citation of Livermore.

  • For example, "Chile's naval tonnage was 36,896 long tons (37,488 t), Argentina's 34,425 long tons (34,977 t), and Brazil's 27,661 long tons (28,105 t)." is risky, as it is technical. I check the source. The source reads instead, "The tonnage of the Chilean navy was...; of the Argentine,…; of the Brazilian….". This clearly isn't close paraphrase. The editor provides converted values. The editor's sentence follows the presentation order (Chile, Argentina, Brazil) but this is clearly a natural order of simple fact being a list by size. Additionally the editor expressed themself with the nations as possessive singulars (Brazil's as if Brazil were a person), whereas the source expresses using adjectival national descriptions of the navies, then contracting the navy out in the last two examples of the clause. This isn't close paraphrase because it is obviously a unique re-expression of the fact in a manner I know to be the editor's prose style (a modern contemporary casual yet encyclopaedic style, which often uses direct expressions grounded in colloquial ways of behaving with abstract nouns).
  • [8d], population size, for example, is a combination of the source indicating that population size is relevant in economic production by comparing Brazil to both rivals together; whereas our editor uses the source's own figures to trivially calculate this in terms of Brazil compared to each in turn. It substantiates and it is in no way plagiarised.
  • [17b]
    • Article: "The Argentine government made a last-ditch attempt to preclude an arms race by offering to purchase one of the Brazilian ships, but when this was rebuffed, they sent a naval delegation to Europe to solicit tenders from armament companies to build warships for Argentina."
    • Source: "Argentina made a final effort to secure naval parity with Brazil by offering to purchase one of the dreadnoughts; when this proposal was rejected, an Argentine naval commission sailed for Europe to receive tenders for the construction of two dreadnoughts and a number of destroyers."
    • So phrase ordering is the same, "Argentina" "avoid" "ship buy" "refusal" "naval" "to Europe" "build" "plural vessels" but is this close paraphrase?
    • Unfortunately it appears to be so. The main point of difference is that preclusion of an arms race is a fundamentally different meaning to securing naval parity. However, in all other aspects the expression follows that of the original. How close is close? I'm particularly concerned by the similarity of the clause "by offering to purchase one of the".
    • Phrase searching on Google Scholar like so indicates the worrying similarity.
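
The shared clause in [17b] can also be found mechanically once both sentences are in hand. This is a minimal sketch, not the method used in the review: it lowercases both texts, splits them into words, and uses Python's standard `difflib.SequenceMatcher` to find the longest run of consecutive words they share. The function names are my own.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9'-]+", text.lower())

def longest_shared_run(article, source):
    """Return the longest run of consecutive words found in both texts."""
    a, b = words(article), words(source)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])

article = ("The Argentine government made a last-ditch attempt to preclude an "
           "arms race by offering to purchase one of the Brazilian ships, but "
           "when this was rebuffed, they sent a naval delegation to Europe to "
           "solicit tenders from armament companies to build warships for Argentina.")
source = ("Argentina made a final effort to secure naval parity with Brazil by "
          "offering to purchase one of the dreadnoughts; when this proposal was "
          "rejected, an Argentine naval commission sailed for Europe to receive "
          "tenders for the construction of two dreadnoughts and a number of destroyers.")

print(longest_shared_run(article, source))  # prints: by offering to purchase one of the
```

A seven-word shared run is a prompt for judgement rather than a verdict; the matching order of ideas around it still has to be read by eye.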

So I reported it in my review:

    • Spotcheck for copyvio, plagiarism, close paraphrase and citations supporting facts: issues, close paraphrase
      • Earwig copyvio: clear
      • CorenBot: clear
      • Sudden stylistic changes and random unique turns of phrase manually checked: clear
      • Spotchecked as Livermore; issues
          • Endnote [17b] Article: "The Argentine government made a last-ditch attempt to preclude an arms race by offering to purchase one of the Brazilian ships, but when this was rebuffed, they sent a naval delegation to Europe to solicit tenders from armament companies to build warships for Argentina." ; Source: "Argentina made a final effort to secure naval parity with Brazil by offering to purchase one of the dreadnoughts; when this proposal was rejected, an Argentine naval commission sailed for Europe to receive tenders for the construction of two dreadnoughts and a number of destroyers." ; phrase ordering is the same, "Argentina" "avoid" "ship buy" "refusal" "naval" "to Europe" "build" "plural vessels" ; this appears as close paraphrase: The main point of difference is that preclusion of an arms race is a fundamentally different meaning to securing naval parity. However, in all other aspects the expression follows that of the original. How close is close? I'm particularly concerned by the similarity of clause "by offering to purchase one of the"
  • So why didn't the bots pick this up? Who knows.
  • So why didn't I pick it up in a style review? Because the source author's style on this paragraph was very close to the editor's style.

Now that I've found close paraphrase I need to overcome my sadness and run a detailed check of Livermore, thorough and suspicious.

  • Oh no! The article and the source present the same material at:
    • [84b]
      • Article: "In the end, Chile only bought Canada and four destroyers in April 1920, all of which had been ordered by Chile prior to the war's outbreak and requisitioned by the British for the war.[84]"
      • Source: "The Chileans, however, contented themselves with the purchase of a single dreadnought and a few destroyers, all of which had been ordered before the war and had been taken over by the British Navy in 1914"
      • Again, the verb clause, "all of which had been ordered" and the common presentation order.
  • So I reported it again, similarly.
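
For a suspicious pass over a whole source, listing every shared word sequence of a fixed length is more useful than finding only the longest one. The sketch below assumes a window of four words, which is an arbitrary choice of mine, and applies it to the [84b] pair; the function names are hypothetical.

```python
import re

def ngrams(text, n):
    """Return the set of all n-word sequences in the text, lowercased."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngrams(article, source, n=4):
    """Return the n-word sequences that occur in both texts, sorted."""
    return sorted(ngrams(article, n) & ngrams(source, n))

article_84b = ("In the end, Chile only bought Canada and four destroyers in April "
               "1920, all of which had been ordered by Chile prior to the war's "
               "outbreak and requisitioned by the British for the war.")
source_84b = ("The Chileans, however, contented themselves with the purchase of a "
              "single dreadnought and a few destroyers, all of which had been "
              "ordered before the war and had been taken over by the British Navy "
              "in 1914")

for phrase in shared_ngrams(article_84b, source_84b):
    print(phrase)
```

The three overlapping four-word sequences it reports together make up the verb clause "all of which had been ordered" noted above.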

Livermore was cited 20 times. On two occasions close paraphrase occurred. In both cases it was where a single sentence in the text displayed the same information that a single sentence in the source displayed. In both cases the verb clause remained identical. In both cases the order of presentation was sufficiently similar. This appears to be accidental close paraphrase, and not a matter of style or habit for the editor. The editor's work is fantastic, but they need to heed WP:Close paraphrase on internalisation and re-expression.

The first close paraphrase appeared in the initial article version, the second must have appeared later. This clearly indicates that these were natural slip-ups and not a matter of fundamentally bad habits.

I then repeat this for two or three other sources. I chose to check the "weakest" sources, because looking through Livermore exhausted me; and, the style review passed clearly.

  • "British and Foreign News," Evening Post, 12 September 1908, 13. A web available source, with a single cite to a single page, from a journalistic source.
  • "Acorazado Almirante Latorre," Armada de Chile, archived 8 June 2008. A modern web source, with a single cite, in a foreign language I can dog translate, from a non-scholarly source.

I saved my report, and let the author know that they need to watch when they write single article sentences from single source sentences.

It is also obvious that we need automated tools which identify verb clauses and extensively search Google Scholar.

It took me 90 minutes, including writing this up, to run the automated tests, read for style, search Google and Google Scholar for unusual turns of phrase, and close read Livermore and two web citations. I estimate that documenting the process accounted for about 25-50% of that time, and that I spent 60 minutes reading Livermore closely. Thus, I'd estimate the cost of spotchecking a FACable article to be approximately 60 minutes.

Fifelfoo, who wrote the bulk of this article, is an Australian Wikipedian with a keen interest in the history of Hungary. Ed, who only wrote an introduction because Fifelfoo is on a wikibreak, is an American university student who has a borderline obsession with early 20th-century battleships.