The moss project seeks to find and remove the furry green typos that have been growing on Wikipedia articles. It uses software written by User:Beland to automatically find misspellings, mistakes in English grammar, violations of the Wikipedia:Manual of Style, and confusing or broken wiki markup.

Dearth to tyops!

QUICK LINK TO THE BEST PAGE FOR NEW PARTICIPANTS

About misspellings

How the lists are made

The moss spell checker is run against a recent set of database dumps, which are generated on the 1st and 20th of every month (but take a few days to process). All the articles in the English Wikipedia are examined. The following are ignored:

  • Text inside references, templates, tables, quotation marks, sections like "External links" and "Works", and some other weird places.
  • Capitalized words (which are presumed to be correctly-spelled proper nouns)
  • Words that appear in titles in the English Wiktionary (which has definitions of all words in all languages, excluding proper nouns and systematic words like chemical names and large numbers)
  • Words that appear in titles in the English Wikipedia (which explains some things that don't appear in the dictionary)
  • Words that appear in titles in the Wikispecies (which has many technical words that don't appear in the dictionary or encyclopedia)
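The filtering rules above can be sketched roughly as follows. This is only an illustration, not the actual moss code: the function name `candidate_typos`, the regular expressions, and the tiny `known_titles` set (standing in for the combined Wiktionary, Wikipedia, and Wikispecies title lists) are all assumptions, and references, templates, tables, and end-matter sections are not handled here.

```python
import re

# Letters only, optionally joined by apostrophes or hyphens (an assumed tokenization).
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def candidate_typos(text, known_titles):
    """Return words that are neither capitalized nor found in known_titles."""
    # Text inside straight double quotation marks is ignored.
    text = re.sub(r'"[^"]*"', " ", text)
    candidates = []
    for word in WORD_RE.findall(text):
        if word[0].isupper():
            continue  # presumed to be a correctly spelled proper noun
        if word in known_titles:
            continue  # appears as a title in one of the reference projects
        candidates.append(word)
    return candidates


# Hypothetical stand-in for the real title lists.
known = {"says", "moss", "grows", "on", "old", "walls"}
print(candidate_typos('Beland says moss grows on old wallz "like thiss"', known))
```

Only "wallz" survives in this example: "Beland" is skipped as capitalized, and "like thiss" is skipped because it sits inside quotation marks.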

Many mistakes are not (yet) caught:

  • Improper addition of 's (possessives are not added to Wiktionary, so these are excluded systematically)
  • Incorrect capitalization
  • Incorrect multi-word phrases
  • Wrong word used in context
  • Non-English language words not tagged with {{lang}} or where an English misspelling happens to be the same as a word in another language. (These are counted as correct spellings if they are in the English Wiktionary, which lists words in all languages – only the definitions are restricted to English.)
  • Other situations listed in #False negatives below

New statistics

From 2018-09-20 to 2019-03-01, the number of typos classified as T1 (edit distance 1 from an English word, the most likely to be actual misspellings) dropped by 35,488, or 32%, and this appears to be due to the hard work of editors participating in the moss project fixing typos on the T1 lists. Amazing progress! The numbers for categories we aren't fixing have remained relatively stable, though for all categories there is some bouncing around as new typos are created and fixed in the normal course of writing and editing articles.

While processing the 2019-03-01 dump, I made a major change to how typos are classified. (You can see the old method in the archived statistics.) I've dropped categories with an edit distance greater than 3 from an English word (T4 through T16) since these are quite unlikely to be misspellings. Most of the reported typos that are not likely English misspellings are either compound words or non-English words. (Some of the non-English words are also misspelled.) Some English compounds end up as TS, if they are caught by a conventional spell checker; the rest are now classified as ME. (There are various other categories for compounds, all starting with M, and these will all need to be refined later because a fair number of words end up there that don't belong.) In an effort to exclude as many non-English words as possible, I've started looking at non-English Wiktionaries; any words found there but not in the English Wiktionary are classified as W. Romanizations are not eligible for Wiktionary; words native to non-Latin writing systems are entered under those other systems. I've written some code that attempts to perform transliteration from any given writing system. It's starting to catch a few thousand words (classified as L) but is obviously missing a lot and so will need to be further refined. I've also added some categories for bad HTML tags and similar problems.
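To illustrate what "edit distance 1 from an English word" means, here is a minimal sketch using plain Levenshtein distance (insertions, deletions, and substitutions each cost 1). Whether moss uses exactly this metric is an assumption, and the `classify` helper and its small word list are hypothetical; the other categories (TS, ME, W, L, and so on) are not modeled.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def classify(word, dictionary):
    """Return "T1".."T3" by distance to the nearest word in a non-empty dictionary.

    Returns None if the word is already known, and "R" if nothing is within 3.
    """
    nearest = min(edit_distance(word, known) for known in dictionary)
    if nearest == 0:
        return None
    return f"T{nearest}" if nearest <= 3 else "R"


print(classify("wallz", ["walls", "moss"]))  # one substitution away from "walls"
```

Comparing every word against every dictionary entry is far too slow for a full dump; a real run would need an index or a spell-checking library, which this sketch leaves out.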

Since the classification changes make the new numbers incomparable with the old numbers, I've started a new table below. I've started posting some TS typos as well as T1s, so expect to see both those numbers improve significantly in the coming months. -- Beland (talk) 07:30, 23 March 2019 (UTC)

Reporting symbol | Explanation | Instances, 2019-03-01 dump (692642d) | Instances, 2019-03-20 dump (802b6c0) | Instances, 2019-04-01 dump (ab3fabd) | Instances, 2019-04-20 dump (7bb97ba) | Instances, 2019-05-01 dump (dcb388a) | Instances, 2019-05-20 dump (dcb388a) | Instances, 2019-06-01 dump (30a59f6) | Instances, 2019-07-01 dump (2fc381f) | Instances, 2019-07-20 dump (41f99ab) | Instances, 2019-08-01 dump (bc954d6) | Instances, 2019-08-20 dump (c600526) | Instances, 2019-09-01 dump (4660042) | Instances, 2019-09-20 dump (18f7307)
TS | Missing or extra whitespace or dash (or new compound) | 183795 | 182018 (-1777/.97%) | 178591 (-3427/1.9%) | 177391 | 176266 | 175163 | 173312 | 170828 | 168401 | 166966 | 164205 | 161344 | 160707
T1 | Edit distance 1 from common English word | 75941 | 73600 (-2341/3.1%) | 70756 (-2844/3.9%) | 69261 | 68790 | 66099 | 64732 | 61255 | 57141 | 55160 | 51987 | 48904 | 45926
T2 | Edit distance 2 from common English word | 72093 | 71615 (-478/.66%) | 70949 (-666/.93%) | 70909 | 70684 | 70247 | 69741 | 69629 | 69365 | 69266 | 69146 | 68748 | 68657
T3 | Edit distance 3 from common English word | 79609 | 78925 (-684/.86%) | 78209 (-716/.91%) | 78139 | 78046 | 77541 | 76954 | 76887 | 76672 | 76691 | 76663 | 75998 | 76061
R | Regular word (A-Z only) not near a common English word | 101178 | 100067 (-1111/1.1%) | 99491 (-576/.58%) | 99722 | 99694 | 99236 | 98856 | 98788 | 98646 | 98498 | 98411 | 97438 | 97588
I | Definitely not English (International) due to accents or mixed with punctuation (other than hyphen) | 93902 | 90875 (-3027/3.2%) | 88564 (-2311/2.5%) | 87748 | 87925 | 84690 | 81042 | 81284 | 82263 | 82412 | 82431 | 71982 | 71240
W | Not in English Wiktionary, in non-English Wiktionary | 82548 | 82519 (-29/.04%) | 80041 (-2478/3.0%) | 79664 | 79486 | 77888 | 76310 | 76309 | 76224 | 76177 | 76142 | 75508 | 76248
L | Probable Romanization (transLiteration) | 4294 | 4306 (+12/.28%) | 4206 (-100/2.3%) | 4219 | 4237 | 4197 | 4168 | 4181 | 4189 | 4188 | 4191 | 4191 | 4234
ME | Probable coMpound, English (with and without dash) | 51279 | 51052 (-227/.44%) | 50845 (-207/.41%) | 50932 | 50902 | 50659 | 50263 | 50352 | 50439 | 50419 | 50700 | 50606 | 50708
MI | Probable coMpound, non-English (International) in English Wiktionary (both A-Z and non-ASCII characters, with and without dash) | 194949 | 192743 (-2206/1.1%) | 189661 (-3082/1.6%) | 189758 | 190172 | 187870 | 184497 | 185101 | 185733 | 185960 | 186074 | 175904 | 176069
MW | Probable coMpound, found in non-English Wiktionary | 51656 | 51240 (-416/.81%) | 50288 (-952/1.9%) | 50026 | 49785 | 48728 | 47641 | 47642 | 47544 | 47831 | 47555 | 46854 | 46850
ML | Probable coMpound, transLiteration | 4010 | 3964 (-46/1.1%) | 3925 (-39/.98%) | 3881 | 3892 | 3835 | 3829 | 3827 | 3826 | 3857 | 3853 | 3849 | 3852
C | Chemistry words | 1853 | 1855 (+2/.11%) | 1863 (+8/.43%) | 1862 | 1858 | 1864 | 1569 | 1559 | 1554 | 1560 | 1561 | 1552 | 1551
D | DNA sequences (a, c, g, t) | 0 | 0 (-) | 0 (-) | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0
N | A-Z plus numbers and hyphens | 26620 | 25854 (-766/2.8%) | 25711 (-143/.56%) | 25739 | 26263 | 26134 | 25945 | 25841 | 25703 | 25650 | 25664 | 26664 | 25776
P | Patterns (e.g. rhyme schemes) | 47 | 50 (+3/6.4%) | 49 (-1/2.0%) | 50 | 48 | 47 | 50 | 49 | 45 | 42 | 38 | 37 | 39
H | HTML/XML/SGML tag | 3519 | 3459 (-60/1.7%) | 3423 (-36/1.0%) | 3420 | 3404 | 3237 | 3197 | 3160 | 3173 | 3180 | 3190 | 3059 | 3078
HB | Known bad HTML tag, like <font> | 15366 | 14837 (-529/3.4%) | 14541 (-296/2.0%) | 14776 | 14622 | 16313 | 16286 | 16818 | 16816 | 3180 | 15558 | 14620 | 15525
HL | Bad HTML-like linking, like <http://...> | 516 | 510 (-6/1.2%) | 501 (-9/1.8%) | 500 | 497 | 492 | 491 | 496 | 492 | 493 | 492 | 474 | 482
U | URL | - | 1284 | 1242 (-42/3.3%) | 1235 | 1222 | 1225 | 1218 | 1225 | 1227 | 1213 | 1200 | 1219 | 1213
BC | Bad characters | - | - | - | - | - | - | - | - | - | - | - | 205046* | 196231
BW | Bad words | - | - | - | - | - | - | - | - | - | - | - | 306181* | 120289*
Total | | 1043175 instances | 1030773 instances (-12402/1.2%) | 1012856 instances (-17917/1.7%) | 1009232 | 1007793 | 995465 | 980102 | 975232 | 969454 | 964828 | 959061 | 1440178* instances | 1242324* instances
Parse failure | Mismatched punctuation | 199130 articles | 200032 articles (+902/.45%) | 195598 articles (-4434/2.2%) | 195995 articles | 196330 articles | 196566 articles | 196882 articles | 197380 articles | 197810 articles | 198086 articles | 198442 articles | 158283 articles + 40465 MOS:STRAIGHT violations | 158564 articles + 40523 MOS:STRAIGHT violations

* Affected by significant algorithm changes. 1 Sep 2019: Added BC and BW. (Parse failures dropped due to JWB-powered MOS:STRAIGHT cleanup.) 20 Sep 2019: BC and BW restricted to lowercase; added TS+COMMA, TS+BRACKET, TS+EXTRA.
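The parenthesized figures in the table, such as "(-1777/.97%)", give the change from the previous dump and that change as a percentage of the previous count. A small sketch of the arithmetic follows; the function name `format_change` is hypothetical and the number formatting only approximates the table's style.

```python
def format_change(previous, current):
    """Format a count with its change from the previous dump."""
    if previous == 0:
        return f"{current} (-)"  # a percentage of zero is undefined
    difference = current - previous
    percent = abs(difference) / previous * 100
    # Two significant figures, similar to the figures in the table.
    return f"{current} ({difference:+d}/{percent:.2g}%)"


# TS row, 2019-03-01 dump to 2019-03-20 dump.
print(format_change(183795, 182018))
```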

  • red = Probably need to fix
  • yellow = Unsorted
  • blue = Probably OK (but may need to verify)
  • bold = actively working on fixing

Instructions for editors

Just like a regular spell checker, sometimes a word that's highlighted is really a misspelling and should be changed, but sometimes it is a correct spelling that needs to be added to the spell checker's dictionary (which in this case is the English Wiktionary and Wikispecies). For the below lists, here's how you can help:

  • For spelling mistakes: Click on the links to the individual Wikipedia articles, and edit them to correct the misspelling. Make sure this is actually a misspelling, and not a technical term that needs to be better explained, or an alternate spelling (possibly from a different regional variety of English).
  • For non-English words (including words from Old English and Middle English, since they are pronounced differently): Edit the article and use the {{lang}} or {{transl}} templates to mark all non-English passages. Template contents are ignored, so they will not show up in the next report. If you can define the word, it would still be helpful to add the non-English word to the English Wiktionary or the same-language Wiktionary if you speak that language. As of the March 20, 2019 dump, only words not found in any Wiktionary are reported by moss as misspellings. (The "home" Wiktionary for Old and Middle English words is the modern English one.) NEW: If you don't know which language is being used, you can tag it with {{which lang}}. If you add a "reason=" parameter, that will change the pop-up tooltip text readers will see when they hover over "what language is this?". If you have a guess as to which language it might be, or any other question or comment, you can leave that here to help future editors. If you use this tag, you can delete the article from the moss listing; the article will be added to Category:Articles with unidentified words instead, and ignored by future runs of moss until the mystery is solved.
  • For incorrect spellings in direct quotes:
    • These shouldn't be picked up by the spell checker, as text in double quotes "" is ignored. The article probably has incorrect punctuation.
    • Regardless of punctuation problems, you can add {{sic}} around the word or phrase. See Wikipedia:Manual of Style#Quotations for guidance.
  • For correct spellings that belong in the dictionary: Click on the word to add it to the English Wiktionary. Remember the word might not be English (though the definition must be), and be sure to check capitalization!
  • For correct spellings already in the dictionary: Delete from the list or strike through; other editors have added these to the dictionary since the database dump was taken. They do not automatically turn red as internal Wikipedia links do.
  • For correct spellings not appropriate for Wiktionary:
    • For DNA sequences, add {{DNA sequence}} around it.
    • For species, add the whole name to Wikispecies:Wikispecies:Requested articles#From_Wikipedia and it will be suppressed from future runs.
    • For proper nouns (including non-English titles) that aren't capitalized, put them inside a {{proper name}} tag.
    • Use <code></code> or similar tags for computer programs; see Wikipedia:WikiProject_Computer_science/Manual_of_style#Code_samples.
    • For terms that are only relevant to one Wikipedia article (and for which the article makes clear the definition) consider creating a redirect to the article. As long as the "typo" word is in the title (as a whole word), it won't show up as a mistake in future spell checks.
    • Anything else, add {{not a typo}} around it (for example, nonsense series of letters used as examples in puzzles).
    • For bird calls: Treat these as foreign-language words or words-as-words and put them in italics, following MOS:ITALICS. Put the call inside {{not a typo}} so it won't show up on moss spell check reports. (It doesn't matter if the double apostrophes that make the italics go inside or outside the template.)
  • Correct or incorrect, when finished delete or strike out the entry for the word from the lists on this page (or subpages), so work won't be duplicated. It is preferred to delete the entry for sections that rotate through specific letters, and strikethrough for sections where the whole thing gets updated (to prevent duplicating work done while the dumps were being processed, which can take more than a week).
  • If an article or section has generally bad grammar, and you don't have time to fix the whole thing, just add {{copyedit}} at the top of the article or {{copyedit|section}} at the top of the affected section. If it's just a sentence or two, {{copy edit inline}} or {{incomprehensible inline}} can go at the end of the problem passage.
  • If you see errors being reported from footnotes or bibliographies, check to make sure the section is titled with a standard name following MOS:APPENDIX conventions. Standard end-matter sections like "References" and "Further reading" and "Works" are ignored.
  • If it helps to leave a message on the article's talk page asking if the word is correct or incorrect, you can use Template:Typo help like this when editing the bottom of the talk page (leave the section header blank; it will automatically be added):
{{subst:typo help|PUT WORD HERE}} -- ~~~~
  • NEW: If you are uncertain whether a word is spelled correctly or not, you can add {{typo help inline}} immediately after it. If you add a "reason=" parameter, that will change the pop-up tooltip text readers will see when they hover over "check spelling". You can add a specific question or comment that may help identification. If you use this tag, you can delete the article from the moss listing; the article will be added to Category:Articles with unidentified words instead, and ignored by future runs of moss until the mystery is solved.

Don't worry if you miss something; it will reappear in a future report if there are still mistakes.

Suggested edit summaries

If you want to help publicize this project, you can copy-and-paste these into your edit summary, if appropriate.

For Wikipedia edits:

Fix misspelling found by [[Wikipedia:Typo Team/moss]] – you can help!
Tag non-English text found by [[Wikipedia:Typo Team/moss]] – you can help!
Tag correct text as {{not a typo}} for automated spell checkers (including [[Wikipedia:Typo Team/moss]])
Fix mismatched quote marks found by [[Wikipedia:Typo Team/moss]] – you can help!

For Wiktionary edits:

Add word identified by [[w:Wikipedia:Typo Team/moss]] – you can help!

Wiktionary cheat sheet

Need to add a word to Wiktionary? The Wiktionary cheat sheet has copy-and-paste templates that make it easy for the types of words commonly encountered here, even if you've never done it before.

Misspellings - lists of things to fix

Likely misspellings by article (main listing)

The most efficient list to work on if all you want to do is fix misspellings. All typos from a given article are shown, but only typos that are very close to known words are shown. The algorithm is not perfect, so some of these may still be words that need to be added to Wiktionary. A different part of the alphabet is posted on each run to avoid duplicate work, and because the whole list is too long to post all at once.

See subpages due to length:

Cases that require investigation are being moved to Category:Articles with unidentified words. -- Beland (talk) 21:33, 9 March 2019 (UTC)

Likely misspellings by frequency (a-m)

(Updated from 2019-08-20 dump.)

The best list to work on if you want to eliminate all instances of a specific typo. Only typos that are very close to known words are shown. The algorithm is not perfect, so some of these may still be words that need to be added to Wiktionary. For each run, only words from half of the alphabet are shown, to avoid duplicate work from when new dumps are being processed.

Legitimate misspellings are candidates for Wikipedia:Lists of common misspellings. If there is an obvious correction, adding that to Wikipedia:Lists of common misspellings/For machines will help editors who use automated tools to fix cases faster.

Likely new compounds by frequency (a-m)

(Updated from 2019-08-20 dump.)

The best list to work on if you want to add variations of known words to Wiktionary, mostly compound words. The algorithm is not perfect, so some of these might be common mistakes that need to be corrected. For each run, only words from half of the alphabet are shown, to avoid duplicate work from when new dumps are being processed.

Most common words with slashes

This is a special manual report from the 2019-08-20 dump. -- Beland (talk) 00:37, 31 August 2019 (UTC)

Compound units of measure, probably eligible for Wiktionary:

Actually, I made a "science word" filter that just needs to be expanded to cover these, though some are suffering mu/micro confusion. -- Beland (talk) 05:39, 2 September 2019 (UTC)

Probably need correcting or tagging in articles, per MOS:SLASH:

Likely new words by frequency (a-m)

(Updated from 2019-08-20 dump.)

The best list to work on if you want to add completely new words to Wiktionary. The algorithm is not perfect, so some of these might be common mistakes that need to be corrected. For each run, only words from half of the alphabet are shown, to avoid duplicate work from when new dumps are being processed.

Some of the words might not be from English. To get these words off this list, you can either add an entry to the English Wiktionary (which provides English definitions for words in all languages) or tag all instances of the word on the English Wikipedia with {{lang}}. Wiktionary does not accept Romanizations for some languages, so those cases must be tagged as {{transl}} or {{lang}}.

Compounds and technical words:

Probably non-English words:

These all appear to be the Greek word 'οτι', which does not appear in wikt without breath marks. That is, see wikt:ότι, which then mentions forms wikt:τι, wikt:ὅτι, wikt:ό,τι.
It would appear then that the proper action is to mark all these quoted Greek texts with {{lang}}? Also, I think I'll ask over at wikt if it would be reasonable for them to have an entry for wikt:οτι. They do have an entry for wikt:oti, which mentions at least wikt:ότι and wikt:ό,τι, but not wikt:ὅτι. (sigh) Oh what a tangled web we wind, when first we endeavor these defined. Shenme (talk) 04:30, 13 October 2019 (UTC)
Additionally, many (all?) of these appear to be 'biblical' == classical == ancient Greek, which has ISO 639-2 code 'grc'. Modern Greek is ISO 639-1 code 'el', ISO 639-2 code 'gre'. Shenme (talk) 04:49, 18 October 2019 (UTC)
Ah, but not all. Some found with search are modern Greek, so lang|el, and some 'oti' found having breath marks. Currently searching using "οτι" -insource:"lang|grc" -insource:"lang|el" -insource:"lang|gre" and working on labelling any form of Greek. Shenme (talk) 02:27, 20 October 2019 (UTC)

Weird cases:

Likely misspellings by frequency (n-z)

(Waiting for next dump; only items with manual notes are listed below.)

'vlne' is Violone wikt:violone Abbreviations in the column "instrumentation" String Instrument Names;
there are music instrument abbreviations used in these articles that are either old or derived from German/etc. e.g. 'vle' is 'wikt:viole' viole;
so it would seem the whole group of abbreviations needs addressing in these articles, not just 'vlne' ? Shenme (talk) 05:30, 15 July 2019 (UTC)
I have changed the two pages that had this as abbreviation, but there is also vlně for Czech (wool) and vlne in Slovak that are not yet in Wiktionary. Graeme Bartlett (talk) 05:55, 20 July 2019 (UTC)

Likely new compounds by frequency (n-z)

(Waiting for next dump; only words with manual notes are shown below.)

woge (I'm scared of this one Shenme (talk) 04:49, 16 July 2019 (UTC))

Likely new words by frequency (n-z)

(Waiting for next dump; only items with manual notes are listed below.)

Likely new words by frequency (non-English)

These are good candidates to add to the English Wiktionary (which provides English definitions for words in all languages), as it seems English Wikipedia readers will frequently encounter them. This is a special manually generated report.

(Note: Overlaps with future reports for "Likely new words by frequency"; need to clear out manual cases here and deal with numbers and slashes in debug-most-common-misspellings-intl.txt. -- Beland (talk) 01:25, 30 August 2019 (UTC))

From 2019-02-01 dump:

From 2019-02-01 dump, but clearly not foreign words (need to figure out what to do with them):

Cases with notes from 2018-09-20 dump:

For Wiktionary

This is a special section; putting a Wiktionary link here will cause a word to be ignored by the spell checker everywhere it appears (on the assumption it will soon be added to Wiktionary.)

1-B

I think it just needs an entry in Wiktionary, then. -- Beland (talk) 19:34, 20 September 2018 (UTC)

C

Most common non-English, missing from English Wiktionary

These words are commonly found in English Wikipedia, are present in a non-English Wiktionary, but are missing from English Wiktionary. Word counts are from English Wikipedia. This is a special report from the 2019-08-20 dump.

Mineral words

Several pages with lists of minerals are showing up as some of the pages with the most detected typos. Below is a list of words from these pages. I'm pretty sure some of them are misspelled, so they all require verification. I don't see anything in wikt:Wiktionary:CFI that would exclude these names; some but not all of them are IUPAC systematic. We could also add Wikipedia stubs or redirects as needed if Wiktionary doesn't want them. -- Beland (talk) 15:36, 30 May 2019 (UTC)

Needs Wikipedia article instead?

Archived notes

See Wikipedia:Typo Team/moss/Archive.

Articles with the most possibly misspelled words

These are likely to be lists using non-English-language or technical words.

  • For articles that are just lists of species names, please link to the article from Wikispecies:Wikispecies:Requested articles#From_Wikipedia and delete the entry here. Those are now automatically suppressed.
  • For non-English-language words, add {{lang}} around the foreign passages and delete the row. Articles that don't do this often have formatting of non-English words that is inconsistent either internally or with the Manual of Style, so this is an easy way to fix that at the same time as helping the spell checker and screen readers do the right things.

100+ words

50-99 words

Possible typos by length

Longest or shortest in certain categories are shown, sometimes just for fun and sometimes because they form a useful group. Please use strikethrough (or leave a note) for this section rather than removing lines, to avoid repeating work done while the dumps were being processed. Thanks!

Likely chemistry words

Missing articles on single characters

Every character should have either a Wikipedia article, a redirect to a Wikipedia article, or a Wiktionary entry.

Probable DNA sequences

If you're sure this is a DNA or RNA sequence, tag it {{DNA sequence}}.
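As a rough illustration of how a word might be recognized as a probable DNA sequence (category D covers words made only of a, c, g, and t), here is a minimal sketch. The function name and the minimum length of 8 are assumptions, not taken from the moss code; a short word like "cat" or "tag" would otherwise be caught.

```python
import re

DNA_RE = re.compile(r"[acgt]+", re.IGNORECASE)


def looks_like_dna(word, min_length=8):
    """True if the word is long enough and contains only a, c, g, and t."""
    return len(word) >= min_length and DNA_RE.fullmatch(word) is not None


print(looks_like_dna("gattacagattaca"))
```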

(All fixed from 2019-08-20; waiting for next dump!)

Repeating patterns - easy fixes

For rhyme schemes, they probably need to be re-styled to follow Wikipedia:WikiProject Poetry#Style for rhyme schemes. If this ends up making them all-caps, they won't show up here on the next run. For mixed-case rhyme scheme notations, use {{not a typo}} after making sure dashes, commas, and spaces follow the recommended style.

Alphabetical

Poem rhyme schemes just need to be capitalized as explained in the article Rhyme scheme. All the bird noises can be put inside {{not a typo}}. -- Beland (talk) 05:42, 21 October 2018 (UTC)

From 2019-08-20 dump:

From 2019-03-01 dump:

Notes for editors

To be tagged

Patterny words possibly for Wiktionary

  • 33 hexipentisteriruncicantitruncated - a nest of specialized geometrical form names; what to do? → since this is a part of several compound names, it may need a set index or disambig page. If it has use in books, it could go in Wiktionary, but Wikipedia seems to be the source of these geometric terms.

For Beland todo

  • Ignore lines beginning with spaces, or do these need actual markup tags if they are code or something?
  • Rhyme scheme hunting:
    • Sync style for articles in Category:Stanzaic form and Category:Rhyme and add to rhyme scheme list if appropriate.
    • Sync annotation style for articles that mark up poems line-by-line (use tables, not column divs or parens)
    • Manually search for patterns like:
      • a-b-a-b-a-b-c-c
      • AB,CD,AB (internal rhyme)
      • "aa", "ab", "aaa", "aab", "aba", "abb", "abc", "aaaa", "aaba", "aabb", "aabc", "abaa", "abab", "abba", "abca", "abcb", "abcc", "abcd" - probable rhyme sequences where there's an article present so it's not detected as a misspelling
  • Hmm, looks like the chemistry word detector could use some enhancement. -- Beland (talk) 16:27, 15 August 2018 (UTC)
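For the rhyme scheme hunting above, one possible heuristic is to accept a token only if, after dropping hyphens, commas, and spaces, each letter is either one already seen or the next unused letter of the alphabet. This is a sketch of that idea, not the detector moss actually uses, and `is_rhyme_scheme` is a hypothetical name.

```python
def is_rhyme_scheme(token):
    """Heuristic: letters must be introduced in order a, b, c, ..."""
    letters = [ch for ch in token.lower() if ch not in "-, "]
    if len(letters) < 2 or not all("a" <= ch <= "z" for ch in letters):
        return False
    next_new = ord("a")  # the next letter allowed to appear for the first time
    for ch in letters:
        if ord(ch) == next_new:
            next_new += 1      # a new rhyme sound, introduced in order
        elif ord(ch) > next_new:
            return False       # skipped ahead, e.g. "ac" or "moss"
        # otherwise the letter repeats an earlier rhyme, which is fine
    return True


for token in ["a-b-a-b-a-b-c-c", "AB,CD,AB", "abba", "moss"]:
    print(token, is_rhyme_scheme(token))
```

Real words such as "abba" or "aba" also pass this test, so anything it flags still needs a human look.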

False positives

Is there a word that is correctly used in an article, but which shouldn't be added to Wiktionary? List it here, and Beland will fix the problem.

Archived solutions: Wikipedia:Typo Team/moss/Archive

False negatives

Is there a misspelled word in an article mentioned here that was not reported? Feel free to list it below and Beland will try to improve the code if appropriate.

These are currently over-ignored, but could be used to suggest correct spellings:

  • Wikipedia articles with {{R from misspelling}}, {{R from incorrect name}}, {{R from miscapitalisation}}, and redirects to these templates
  • Wiktionary entries that are known misspellings (e.g. wikt:anticiliary)
  • In cases where there are variant spellings of the same word or phrase, Wikipedia should probably pick one and stick to it except to mention the variants. This happens with:
    • Compound words - whether to use a space, dash, or nothing, as in "junebug" vs. "june bug" or "email" vs. "e-mail".
    • Words with multiple transliterations from another language (often there are multiple systems, no particular system, or a modern system different from historical systems).
    • Redirects with {{R from alternate spelling}} and redirects to that template.
  • Article Ana Recio Harvey | detected misspelling: appoinment | additional, undetected misspelling: enterpreneur
    • Looks like this was because of redirects with "enterpreneur" in the title. I have tagged them all {{R from misspelling}}, but I'll have to change the code to ignore those, as noted above. Thanks for catching that! -- Beland (talk) 23:52, 18 October 2018 (UTC)

  • Kenya, as of April 2019: "1963.Their". Probably the NLTK tokenizer found the word boundary despite the missing whitespace? -- Beland (talk) 03:52, 14 April 2019 (UTC)
  • Missing spaces after commas. Probably the NLTK tokenizer is finding the word boundary anyway? -- Beland (talk) 04:02, 14 April 2019 (UTC)
  • wikt:archibishop
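Cases like "1963.Their" and missing spaces after commas could be caught before tokenization with a simple pattern. The sketch below uses a plain regular expression rather than NLTK, and the pattern is an assumption chosen to skip common non-errors such as "U.S.", "e.g.", domain names, and "1,000".

```python
import re

# A period between a digit or lowercase letter and an uppercase letter,
# or a comma directly between a letter or digit and a letter.
MISSING_SPACE_RE = re.compile(r"[0-9a-z]\.[A-Z]|[^\W_],[A-Za-z]")


def find_missing_spaces(text):
    """Return the three-character snippets where a space seems to be missing."""
    return [match.group(0) for match in MISSING_SPACE_RE.finditer(text)]


print(find_missing_spaces("Independence came in 1963.Their flag is red,green and black."))
```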

Mismatched markup and punctuation

Errors in punctuation (mostly quotation marks) and wiki markup generally cause confusion for readers, and also prevent the spell checker from running on these articles.

Inches and feet should not use " and ', per Wikipedia:Manual of Style/Dates and numbers#Specific units; use letters instead. (See MOS:UNITS for general guidance.) Where conversions are needed, use {{convert}}, for example: 2 feet 3 inches (69 cm)


WORK IN PROGRESS

  • Integrating these with main listings
  • Filter only unmatched " for now
    • Filter articles with non-ASCII quote marks to a separate list for JWB processing
    • Filter \d" and \d' to a separate sublist for inch/feet style conversion
  • Explain ✂ or skip snippets showing this
  • Bracketbot web UI seems to be down

-- Beland (talk) 19:03, 4 September 2019 (UTC)
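One way to read the filtering plan above is as a small triage function: only text with an unmatched straight double quote is considered, and it is then routed to a sublist. The ordering of the checks, the label strings, and the function name are all assumptions for illustration.

```python
import re

FEET_INCH_RE = re.compile(r"\d[\"']")        # the \d" and \d' patterns
CURLY_QUOTES = "\u201c\u201d\u2018\u2019"    # non-ASCII quote marks


def triage_quotes(text):
    """Sort a passage into one of the lists described in the plan."""
    if text.count('"') % 2 == 0:
        return "ok"  # straight double quotes are balanced
    if any(ch in text for ch in CURLY_QUOTES):
        return "non-ascii-quotes"  # candidate for JWB processing
    if FEET_INCH_RE.search(text):
        return "feet-inches"       # candidate for inch/feet style conversion
    return "unmatched-quote"


print(triage_quotes('A 6\' 2" player'))
```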

cquote

MOS:BLOCKQUOTE says {{cquote}} should be replaced by {{quote}} in articles.

Find all current instances. (35500 transclusions as of October 2019)

Gender-neutral language

Manned

The word "manned" and related forms like "unmanned" are used in many articles, but are not gender-neutral as required by MOS:S/HE and the NASA style guide. Gender-neutral alternatives include:

  • Crewed, uncrewed
  • Staffed, unstaffed
  • Human spaceflight
  • Defended

Not all instances need to be changed.

  • Proper nouns should remain the same, like Manned Orbiting Laboratory
  • Titles of sources and quotes should remain unchanged.
  • If the term itself is being discussed, it should remain unchanged, for example to say that "manned spaceflight" is another way of saying human spaceflight.
  • There seems to be consensus on unmanned aerial vehicle that this and related phrases (like unmanned aerial system) should remain intact, since it is much more frequent than "uncrewed aerial vehicle" at the moment. However, when using Wikipedia's voice it is preferred to describe a UAV as "uncrewed" when not using the whole phrase.
  • Non-article pages that are retained for historical interest shouldn't be modified if they won't be visible to readers.
  • Redirects with this title should be left alone if they are redirecting readers to a gender-neutral title

If the word is found in the names of articles and categories (except those with names directly related to UAVs), those should be renamed, and the links changed. Many articles have already been renamed, and the links just need to be updated. (Remember that to rename a category, all the articles in that category must be edited to change their pointers.)

Borderline cases

These may need to be discussed before being potentially renamed.

These are generic terms, like Human mission to Mars, as opposed to proper names like Manned Orbiting Laboratory. -- Beland (talk) 19:41, 21 May 2019 (UTC)

  • Manned Venus flyby - Based on the NASA style guide, NASA probably would now refer to this as "human Venus flyby" but historical sources say "manned Venus flyby" so that's what the majority of editors commenting on the talk page currently favor. There is some question as to whether the scope of the article concerns a specific mission or this type of mission in general, which is related to the proper name exception (but then the title would be "Manned Venus Flyby"). Compare Colonization of Venus and Human mission to Mars. -- Beland (talk) 19:41, 21 May 2019 (UTC)

Objections in specific cases:

Marriage

Wikipedia:Writing about women § Marriage points out:

Ladies

Wikipedia:Writing about women § Girls, ladies prefers "women" to "ladies" except where part of set phrases or traditional titles (like first lady). (Find all lowercase "ladies".)

Instructional and presumptuous language

MOS:NOTE says to avoid the following phrases when they address the reader directly. Not all instances are problematic, such as those in direct quotations.

HTML tags

Updated from 2019-08-20 dump.

You can do one of two things for these articles:

  • Remove, repair, or convert the HTML markup to wiki markup yourself.
  • Tag the article {{cleanup HTML}} and it will show up under Category:Articles with HTML markup but not on this list. Use the "tags" parameter to indicate which tags are present on the page; many editors find it hard to locate the offending HTML. For example: {{cleanup HTML|tags=table, cite}}

How to clean up

See Category:Articles with HTML markup for instructions on how to find the offending tags and what to do about them.

Find all articles by tag

Can't wait for the next database dump? Want to look for or fix all instances of a specific tag? Use the links below!

Not HTML

Sometimes editors use angle brackets (< and >) for other purposes. Though these are not HTML markup, they often need to be fixed.

<<...>> (find all) can indicate:

  • French quotation marks rendered as <<quoted text>>. These should be normalized to "quoted text" or 'quoted text', even in quotations, per MOS:CONFORM.
  • A broken citation that should be converted to {{cite web}}

Other weirdness:

  • <the> - find all - More French quoting style, bad linking, bad citation style, etc.

Known bad HTML tags

Bad link formatting

Angle brackets are not used for external links (per Wikipedia:Manual of Style/Computing § Exposed URLs); "tags" like <https> and <www> are actually just bad link formatting. See Wikipedia:External links#How to link for external link syntax; use {{cite web}} for footnotes.
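A quick way to spot this kind of bad link formatting in wikitext is a regular expression that looks for a URL wrapped in angle brackets. The pattern and function name below are assumptions for illustration and are not the check moss itself performs.

```python
import re

# An http(s) URL or a www. address wrapped directly in angle brackets.
ANGLE_LINK_RE = re.compile(r"<((?:https?://|www\.)[^<>\s]+)>")


def find_angle_links(wikitext):
    """Return the URLs that are wrapped in angle brackets."""
    return ANGLE_LINK_RE.findall(wikitext)


print(find_angle_links("See <https://example.org/page> and <www.example.com>, not <ref>."))
```

Ordinary tags such as <ref> are not matched, since they do not start with a URL scheme or "www.".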

Unsorted

Need debugging

Probably there are other broken tags on the page, since these tags should be ignored. Need to run reports on these pages manually with debugging turned on.

Notification of new dumps

"Most likely misspellings by articles" should always have work to do (if not, ping Beland to add more from the current dump). Some of the other sections are occasionally waiting for a new dump to get a useful list, either because they are ranked by frequency or a code change has been made to clean up noise in the next run. New runs are generally posted twice a month. The database snapshot from the first day of the month generally takes about 9-13 days to process, and the snapshot from the twentieth day of the month might take 4-6 days until it can be posted.

All that said, if you want to get a ping when results from a new dump are posted, you can add your name to the list below. If you are only interested in a particular section, include a note to that effect.

moss source code

moss is written in Python, and is available on GitHub at: https://github.com/cdbeland/moss

Data is obtained from XML database backup dumps.
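For anyone curious how article text can be pulled out of an XML dump, here is a minimal sketch using only the Python standard library. It streams the file instead of loading it all at once and keeps only main-namespace pages (ns 0). The tiny sample dump is made up for illustration, and this is not necessarily how moss reads the dumps.

```python
import io
import xml.etree.ElementTree as ET

# Made-up two-page sample in the shape of a MediaWiki XML export.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Moss</title>
    <ns>0</ns>
    <revision><text>Mosses are small plants.</text></revision>
  </page>
  <page>
    <title>Talk:Moss</title>
    <ns>1</ns>
    <revision><text>Discussion.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the XML namespace, so "{...}title" becomes "title"."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, wikitext) for each main-namespace page in the dump."""
    title = ns = text = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if ns == "0":
                yield title, text
            title = ns = text = None
            elem.clear()  # free memory used by the finished page


for title, text in iter_articles(io.BytesIO(SAMPLE_DUMP)):
    print(title, "->", text)
```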