Long-term todo items

  • JavaScript automation to make fixing typos faster and more fun (possibly using JWB or AWB)
    • First pass through all articles was completed on 23 September 2019, in under 1 year. Algorithmic improvements found more typos on the second pass, so it's taking longer.
  • Use Wikipedia categories like Category:Redirects from misspellings and Wiktionary categories like wikt:Category:English misspellings to highlight English misspellings that moss is currently ignoring.
  • Spell-check against only English words in the dictionary (will highlight all non-English words not inside {{lang}} and friends)
  • Spell-check non-English languages (requires {{lang}} or similar to know which language to spell-check against)

Checking specific languages

Just brainstorming... you may have already thought of this and/or it may not be feasible...
If/When language-recognition functionality is added (i.e. a string on WP which is not tagged as belonging to a foreign language will only count as having an entry on en.Wikt if that entry has an ==English== header), it might be useful to also check foreign-language text on en.WP for typos. For French text, one could probably replicate the current setup except with fr.Wikt being checked instead of en.Wikt, since fr.Wikt's coverage of French is as extensive as en.Wikt's coverage of English. Other major languages' Wiktionaries are much less extensive, e.g. de.Wikt has an order of magnitude fewer entries, so in the case of those languages it might make sense to check the corresponding Wikipedia instead. (I.e., if the program finds that the word magdeburgischen is used on en.WP and tagged as German, it checks if the word is used on de.WP, perhaps even requiring it to be used more than once, to help ensure the use on de.WP isn't a typo.) -sche (talk) 16:08, 24 January 2015 (UTC)

Hmm, that's an interesting idea. Certainly it would drive the number of false alarms down, but it would no doubt also increase the number of non-English misspellings we miss if we match against Wikipedias in addition to Wiktionaries. Part of the point is also to encourage creation of dictionary entries that are needed anyway, and it's nice to have humans providing a reliable ground truth for correctness determinations. It's also a bit difficult for me to tell if such a thing is actually working if I don't speak the language, but maybe some fluent speakers of other languages could help out if I ever get around to dealing with non-English stuff in detail. (It's going to be a lot of work just to get English spelling, grammar, punctuation, and style rules coded up completely.) -- Beland (talk) 09:11, 14 April 2018 (UTC)
@-sche: After a few more years of experience, I'd say the main barrier to doing this (other than the huge backlog of English typos) is that most non-English text in the English Wikipedia isn't tagged with {{lang}} or {{transl}} or similar, which is necessary to know which language to spell-check the text against. It's possible to guess the language, but we actually need {{lang}} for all non-English text anyway, so screen readers and search engines and whatnot function properly. In the long run, I will definitely want to spell-check all languages if volunteers are available. In the meantime, the system is prioritizing cleaning up two types of non-English text. The first is words that are not in any Wiktionary, which are the most likely to be actual misspellings and thus the most beneficial to give attention to first. These are found in the "Case notes" of most of the main page listings, and also the most important ones are found in the newly consolidated "Highest-frequency words missing from dictionary" section. The second is the new Wikipedia:Typo Team/moss/not English report, which focuses on articles with the longest passages of non-English text. These are most likely to be in need of translation, but may also represent some long non-English quotations that need tagging before they can be spell-checked against a specific language. -- Beland (talk) 07:15, 3 June 2022 (UTC)

Exclude terms Wiktionary only records as misspellings

Wiktionary gives some misspellings entries (originally only very common ones, but the threshold has been steadily whittled down by various users), where they are labelled as such and readers are directed to the usual spellings. Example: wikt:tassled. I don't know if those "misspelling entries" are already excluded from counting as Wiktionary entries for the purposes of this project; if not, that's something to add to the todo list alongside the goal of excluding non-English Wiktionary entries. Or, actually, perhaps it would be simpler to search a dump of en.WP for all 1756 terms in wikt:Category:English misspellings, and make a separate list of articles that contain them, since they're known misspellings. -sche (talk) 09:06, 26 January 2015 (UTC)

It's not urgent, but it might be useful to periodically check WP for entries which Wiktionary has but categorizes as either misspellings or obsolete spellings/forms (e.g. wikt:laterly, wikt:kinge). -sche (talk) 06:04, 1 March 2019 (UTC)
Also, the ~60 words categorized as wikt:Category:Non-native speakers' English (although in some cases other senses of some of those words are valid English; those could be pruned off when checking for words on en.WP). -sche (talk) 08:19, 11 March 2019 (UTC)
@-sche: This is a good idea which has been in the back of my mind since you made the suggestion. I think some more recent category-processing improvements in service of the experimental grammar checker will make implementing it a lot easier. I'll still have to find a solid chunk of time to work on implementation, but...yes! Especially now as we have cleaned up decades of English typos, it's getting harder to find actual misspellings, and these are an easy target. -- Beland (talk) 07:29, 3 June 2022 (UTC)
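
For illustration only, the "search a dump for known misspellings" idea from this thread could be sketched roughly as follows. This is not moss code: the function name, the in-memory `articles` dictionary standing in for a parsed dump, and the three sample words are all assumptions made for the example.

```python
import re
from collections import defaultdict


def find_known_misspellings(articles, misspellings):
    """Map each known misspelling to the sorted titles of articles that contain it.

    `articles` is a hypothetical {title: plain text} mapping standing in for a
    parsed dump; `misspellings` is any iterable of words, e.g. the members of
    wikt:Category:English misspellings.
    """
    known = {word.lower() for word in misspellings}
    hits = defaultdict(set)
    for title, text in articles.items():
        # Compare whole alphabetic words only, case-insensitively.
        for word in re.findall(r"[A-Za-z]+", text):
            if word.lower() in known:
                hits[word.lower()].add(title)
    return {word: sorted(titles) for word, titles in hits.items()}


# Made-up sample data, just to show the shape of the output.
sample_articles = {
    "Curtain": "A tassled curtain hung in the hall.",
    "Crown": "The kinge wore a crown.",
    "Rope": "The rope was plain.",
}
result = find_known_misspellings(sample_articles, ["tassled", "laterly", "kinge"])
print(result)
```

A real run would presumably still need the pruning mentioned above, since some listed words have other senses that are valid English.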

Limit search to article namespace

In the "Likely misspellings by frequency" (etc) sections, can you construct the "find all" links to use a search URL that only looks at namespace 0? I saw a lot of Wikipedia: pages in the list for one of the entries, and I don't think that namespace is a priority for spellcheck. David Brooks (talk) 02:45, 23 September 2018 (UTC)

Whoops, looks like I did that on one code branch and not the other. Fixed; it should show up in the Oct 1 run. -- Beland (talk) 22:39, 29 September 2018 (UTC)
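
As a rough sketch of what a namespace-restricted "find all" link might look like (not the actual moss implementation, which isn't shown here): the function name is hypothetical, and the `fulltext=1` and `ns0=1` query parameters are an assumption based on the URLs that Wikipedia's own search form produces.

```python
from urllib.parse import urlencode


def article_search_url(term):
    """Build a full-text search URL limited to namespace 0 (articles).

    Assumption: `ns0=1` restricts the search to the article namespace and
    `fulltext=1` forces a results page instead of jumping to a title match.
    """
    params = {
        "search": f'"{term}"',  # quote the term so it is searched as an exact phrase
        "fulltext": "1",
        "ns0": "1",
    }
    return "https://en.wikipedia.org/w/index.php?" + urlencode(params)


print(article_search_url("Brittanica"))
```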
I am also finding errors in templates. When they are transcluded on many pages, the search finds the pages, but if you try to edit, the error is not there! But for some kinds of templates, e.g. DYK nominations, spelling errors are of little importance and should not be fixed. Graeme Bartlett (talk) 03:09, 30 September 2018 (UTC)
@Graeme Bartlett: Did you have an example typo with this problem? I can try to diagnose and find a workaround. -- Beland (talk) 20:53, 1 October 2018 (UTC)
Well I fixed Template:Roman Catholic Diocese of Buffalo, and Template:Reach plc as part of Typo Team/moss. Now that they are fixed you won't see the error in the search. Do you want me to leave the next error that I find so that you can investigate? (That is better than deliberately putting in an error). Graeme Bartlett (talk) 21:40, 1 October 2018 (UTC)
Hmm, yes, that would be helpful. I found a relatively rare correctly-spelled phrase to test with, but I couldn't reproduce this problem. I'm wondering if the namespaces being searched are being controlled by the selection you make, which is stored in your browser's cookies? Do you see a list of namespaces under the search text bar? -- Beland (talk) 14:17, 2 October 2018 (UTC)
I am just using the default. The errors are visible on article pages, so searching articles should show the errors. But when you edit you don't see the text. These kinds of errors do need to be fixed, so I am not complaining. At least these errors are only nested one deep, unlike some other problems I have seen with speedy delete nominations, or unclosed formatting inside templates. So there is no need to search template space. Graeme Bartlett (talk) 23:40, 2 October 2018 (UTC)

"Brittanica" -> "Britannica"

I'm wondering why I don't see the above goof in lists here. At more than 200 occurrences it ought to show up somewhere, or eventually. Is it not now in the lists because it would be a 'T2' thing? Shenme (talk) 06:39, 27 May 2019 (UTC)

It would be excluded because of the capital "B". We can still fix it though. AWB is a good tool to do this. Graeme Bartlett (talk) 22:39, 1 June 2019 (UTC)

"A*" mis-spelling

I've been away from MOSS for a while, so apologies if I'm not up to date with current conventions. But I just came across the word Archibishop, which occurs in about 100 pages (I accidentally put it in a search). It doesn't appear in the a-m misspellings by frequency; can I expect it to appear in the next dump? I'm just leaving on a trip, unfortunately, or I'd attack it myself. David Brooks (talk) 17:01, 1 June 2019 (UTC)

Graeme Bartlett partly answered my question immediately above. The majority of uses have a capital A, but there are still about 25 archibishop appearances. Should that be enough to get it on the lists? (yes, I'm on the trip, sitting on Heathrow WiFi after a red-eye :-) ) David Brooks (talk) 06:51, 2 June 2019 (UTC)
The spelling "archibishop" is used in some Wikipedia article titles, so that would exclude it from the typo lists. They are all redirects, for example Catholic Archibishop. It's unclear to me if this is a common misspelling or an alternate spelling. If it's a misspelling, we'd want to add {{R from misspelling}} to all the offending redirects. I could either add code to exclude redirects so tagged, or I could blacklist the bad spelling. -- Beland (talk) 17:57, 11 June 2019 (UTC)
Marking {{R from misspelling}} redirects as not (necessarily) valid sounds like a good general rule to me. I'll let you decide on the priority order when the word also appears in a "regular" title. David Brooks (talk) 20:25, 17 June 2019 (UTC) ETA: Sorry, that wasn't meant to be sardonic, just not properly thought through. Of course plenty of valid words can appear in mis-spellings of multi-word titles. David Brooks (talk) 07:24, 18 June 2019 (UTC)

Congratulations!

With the completion of the "V" subpage in the main listing today, after jumping around a bit, we've completed a first pass through an entire alphabet of articles with misspellings, in under a year. Around fifty thousand thanks are due to the volunteer editors who have been helping out - that's how many spelling and punctuation errors you've fixed! The typo counts seem to indicate we have at least that many left to fix, but they are getting a bit harder to find automatically because they are in articles which also have suspect words which are less likely to be incorrect. To keep us on this roll, I've cooked up some code that unlocks the next layer of typos. As we circle back around to the beginning of the alphabet, you'll see the main listings for A-I will have a lot of missing-spaces-after-period errors (TS+DOT) which we haven't fixed because we only started doing that on the first pass when we got to J. You'll also see more listings where moss has found likely misspellings alongside suspect words it thinks are probably OK. These should be almost entirely in articles we haven't touched yet; my goal is to have to look at each article only once (at least until new errors sneak in). I've also been experimenting with detecting other types of Manual of Style errors beyond spelling issues. I'll be posting some updated advice and a fresh batch of typos shortly. Your feedback is always welcome, and thanks again for making this tremendous accomplishment possible! -- Beland (talk) 04:14, 23 September 2019 (UTC)

Citation errors

Hey, not sure if this is suitable for the project, but there are a few hundred instances of incorrect citation last names ([1] [2]) due to a bug in the Zotero backend that WP:VE and WP:ProveIt use. @Beland: Could you check to see if this is suitable, and add it to the main page if it is? Thanks. Darylgolden(talk) Ping when replying 00:58, 15 April 2020 (UTC)

@Darylgolden: moss code ignores templates, so it would be a bit of a pain to check for these problems and include them in the main listings. But the links you provided work quite well; you can see in some cases I just post links like those to the main project page. Is there a pattern to how to fix them? It looks like we'd need to manually inspect the source to find the author's last name? -- Beland (talk) 08:08, 21 April 2020 (UTC)
@Beland: Unless someone comes up with a better way, it appears that manually inspecting the website would be the best way. Since the project already has listings on tasks that aren't strictly typos (templates and HTML tags), and I can't find anywhere else on Wikipedia to post this kind of task, it would be helpful to include on the main page if you think that it is within scope. Darylgolden(talk) Ping when replying 08:19, 21 April 2020 (UTC)

Example: product(s)

wikt:product(s), for example - I don't think this should be a typo. Is this a Manual of Style thing? How should I handle it? Sct72 (talk) 21:30, 17 October 2020 (UTC)

@Sct72: You can leave those in the case notes for me. I exclude most of those from the listings but I need to make some algorithmic improvements to keep them all out. -- Beland (talk) 00:44, 23 January 2021 (UTC)
Ok cool, thank you! Sct72 (talk) 01:00, 24 January 2021 (UTC)

Bird calls, special ISO codes, empty uses of not a typo, and D&D

Look I know this is many things but this is what I'm like.

While cruising around learning about ISO codes (which I hate), I found that there's a few codes for basically all the stuff that doesn't fit in a code. (quotes are from ISO 639)

  • mis -> for uncoded languages; for example, many Australian languages don't have an ISO code. I'd still use the macrolanguage one, aus, but arguably this isn't appropriate, so I can understand someone choosing instead to use mis.
  • mul -> when a segment has "multilingual content (includes at least two languages in separatable parts)." This isn't super useful for us, it'd typically get used to indicate, for example, that a website as a whole contains multiple languages.
  • und -> "content includes zero, one or many languages, in arbitrary combination". Probably not the most useful for us, unless there's a section of text we simply cannot figure out. If we did that, there would probably need to be a tracking category for it, and it'd probably need to be discussed on the lang talk page.
  • zxx -> "No linguistic information at all". It can also mean not applicable. This has a bunch of uses, idk how many are covered already by other templates though. So I think this one can possibly be used for bird calls. I can't for the life of me remember where I saw it, but at least one non-wiki place recommended the code as an option for how to indicate animal calls.

Another option for bird calls may be Template:Respell, as it's a standardised English language way to represent pronunciation, which is the same as bird calls. It would require learning the rules of how to do it, but hey editing wikipedia has a bunch of learning curves anyway. The only downside I can think of is if there's actually a standardised way that bird calls are written, in which case it just kinda needs its own template.

Then, Template:Not a typo. Something I've noticed (and have done myself at least once when on the visual editor) is that the template has been placed next to the relevant word and doesn't have anything in it. Is there any way to track empty uses of not a typo? (re: bird calls, I'm not super happy with using not a typo to indicate those as it may be confusing to people using screen readers. That's an issue I have generally with the template tbh. But not the current problem.)

I've noticed a fair few articles that are coming up because they mention things from the Dungeons and Dragons games. I separated these out from other entries on the pages I noticed them on, because I think these will keep being a problem. All the time. I explained on Wikipedia:Typo_Team/moss/S#Dungeons_&_Dragons, "Proposal: remove detailed D&D articles and lists from being checked. Reasoning: every single named object in d&d has a ridiculous name, and everyone that plays it is a freak that'd immediately correct typos anyway." Having thought about it since then, there is also the option of using one of those lists as a spellcheck dictionary - a good option for that would be List of Advanced Dungeons & Dragons 2nd edition monsters. Most of the non-English words that people won't capitalise would be under monsters, and it's the most comprehensive/best written of the monsters lists. It's also an old version, so it shouldn't change too much. This method may miss some of the D&D words, but it would probably take less time than finding all the D&D articles and would mean the typos on all those pages won't get ignored automatically.

Also if it's not an issue, I may try to reorder the instructions for editors across the project page and the specific letter pages, bc at the moment trying to find the right instructions is very hard for my atrocious brain (it's not clearly organised by type, I'm guessing it's more done by when the solution was figured out). --Xurizuri (talk) 07:35, 31 January 2021 (UTC)

Oh I forgot to say, one of the special codes (I think it was zxx) can be used for fragments, like suffixes or whatever. --Xurizuri (talk) 09:16, 2 February 2021 (UTC)
@Xurizuri:
  • For D&D terminology, my recommendation would be to make a redirect pointing at whichever article gives the best explanation, even if that's just a list. That will let the spell checker know these are correctly spelled words, and will also help readers who are using search engines find the best article.
  • You can get a live list of empty uses of {{not a typo}} with a search for insource:/\{\{not a typo\}\}/. This does look like it is unfortunately proliferating; I'll add a link to this search as a thing to check up on.
  • Sorry about the disorganized instructions; they have indeed just accreted over time. Is there any particular top-level organizational scheme that might make more sense? It's difficult for me to get a sense of it because I'm too close to the content.
  • Having an ISO code for fragments would actually solve a lot of annoying problems, that's great! I'll have to document that.
  • Is there like an international notation standard for birdcalls or something? That would be handy to track down.
-- Beland (talk) 19:00, 6 February 2021 (UTC)
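
For anyone who wants to run the same check offline against saved wikitext, a minimal Python equivalent of the insource: search in the list above might look like this. The function name and sample text are made up for the example, and the tolerance for capitalization and stray spaces is an addition that the original search pattern does not have.

```python
import re

# Matches {{not a typo}} with no parameters. Unlike the insource: pattern,
# this version also accepts different capitalization and padding spaces
# (an assumption about what "empty" should cover).
EMPTY_NOT_A_TYPO = re.compile(r"\{\{\s*not a typo\s*\}\}", re.IGNORECASE)


def count_empty_uses(wikitext):
    """Count parameterless uses of the template in a chunk of wikitext."""
    return len(EMPTY_NOT_A_TYPO.findall(wikitext))


sample = (
    "The {{not a typo|teh}} band and the stray {{not a typo}} marker, "
    "plus {{Not a typo }}."
)
print(count_empty_uses(sample))
```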

TS+EXTRA decimal fractions and file extensions; contractions

(moved from main listings section on project page)

    • Beland are we still doing TS+EXTRA+? And can I remove the ones that are correct as is, which fall under these categories, from case notes? --Xurizuri (talk) 02:27, 5 February 2021 (UTC)
      • @Xurizuri: Yes, TS+EXTRA+ are included on reports, assuming there are any for that letter. (There were for J.) If you see some that are correct as written in the article, that might indicate a bug in the code which may or may not have already been fixed. Were there any in particular you were concerned about? I can take a look and come up with a diagnosis. -- Beland (talk) 08:17, 5 February 2021 (UTC)
        • @Beland: I can't always tell the categories apart tbh. But I've been through all the letters multiple times looking for specific types of entries, so there's a few types of false positives I've noticed. There's about 1000000 instances of it picking up bullet calibres, sports scores, and software suffixes (e.g. .jpg): across A, C and D you can see huge lists of those. The case notes also have some issues with "it's"/"she's"/etc picking up the possessive of longer words (e.g. Kuwait's), and spellings using brackets to indicate singular+plural, which is explicitly okay in MOS. I can't remember there being any huge lists of those, but they were kind of sprinkled throughout. --Xurizuri (talk) 09:36, 5 February 2021 (UTC)
        • @Xurizuri: Ah, yes, so for calibers and sports scores and anything else that looks like a decimal fraction with no leading zero, as of the 2020-05-01 dump those have been reclassified from TS to Z, and I'm not including any Zs in reports until I figure out how to better separate the good from the bad. The upshot is, you can remove any correct-as-is decimal fractions from the case note listings. The report subsections are a bit weird in that all the typos for a given article appear on one line, so this results in some being "promoted" to a potentially incongruous section. (Like if an article has a mix of T1s and T2s, all of those will show up under the T1+ section; that's what the plus is hinting at.) For file extensions, it looks like there was a bug I fixed for those as of the 2020-04-01 dump or thereabouts. If there's an article or redirect for them, they are fine to remove from the listings; otherwise I'd add a redirect. (For example, .jpg is a blue link so those can just be removed, but .uci doesn't exist and should probably be made into a redirect.) Around the same time I also disabled reporting of BWs, because of that issue with contractions you mentioned. If those are OK as-is, they can just be dropped from the listings. (I need to improve the code to handle those more intelligently.) Thanks for your attention to the case notes, by the way. Volunteers have been making such fast progress fixing typos and finding non-typo cases that I haven't been able to keep up with the output, especially for the dumps where the bugs in my code resulted in a very large number. If you happen to see any other patterns, it would be good to know so I can tweak the code as needed. Feel free to ping me. -- Beland (talk) 18:36, 6 February 2021 (UTC)
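
To make the two false-positive families in this thread concrete, here is a small sketch of how a flagged token could be sorted into "decimal fraction with no leading zero" versus "file extension". It assumes the flagged token is the bare string (such as ".45" or ".jpg"); the function name and category labels are invented for the example and are not moss's internal codes.

```python
import re

# A dot followed only by digits, e.g. a caliber such as ".45".
DECIMAL_NO_LEADING_ZERO = re.compile(r"\.\d+")
# A dot followed by a letter and then letters or digits, e.g. ".jpg".
FILE_EXTENSION = re.compile(r"\.[A-Za-z][A-Za-z0-9]*")


def classify_dot_token(token):
    """Return a rough label for a token that begins with a period."""
    if DECIMAL_NO_LEADING_ZERO.fullmatch(token):
        return "decimal"
    if FILE_EXTENSION.fullmatch(token):
        return "extension"
    return "other"


for token in [".45", ".jpg", ".uci", "end.Start"]:
    print(token, classify_dot_token(token))
```

Whether an "extension" token can simply be dropped would still depend on the article-or-redirect check described above, which this sketch does not attempt.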

Chemical formulas

I am going through the new section labelled Wikipedia:Typo Team/moss#Chemical formulas. There are very few unformatted chemical formulae in that list. Mostly they are not chemical formulae at all. I would suggest that these elements be removed from the match: Es|Fm|Md|No|Lr|Rf|Db|Sg|Bh|Hs|Mt|Ds|Rg|Cn|Nh|Fl|Mc|Lv|Ts|Og|R as their compounds are hardly known. (And R is not an element). Also if it is just one letter followed by digits, it is very unlikely to be a chemical formula. So I suggest that these are not included. R with digits is likely to be Rand for example. These also include postcodes, page numbers, bus routes, mutation identifiers. More complex ones starting with H can be epigenetic modifications. We do have students over the years writing articles on all of these. Where it is a meaningful chemical formula, I am redirecting. But it would be good if we could get the list of articles with unformatted chemical formula (where it is really one and not a false positive!). Graeme Bartlett (talk) 21:59, 25 March 2021 (UTC)

@Graeme Bartlett: Hmm, it does sound like I need to do a better job distinguishing between anything that could be a chemical formula vs. only the things that actually are a chemical formula. I included "R" because it is used in chemical formulas to represent a substituent. Since it would be helpful to report article names, I think I'll have to script this properly rather than simply grepping what was ignored. (Chemical formulas start with a capital letter, so they are normally assumed to be proper nouns and ignored.) That will also allow me to make use of a lot more context, though I will also try your suggestions. It will probably take me a few days to fine-tune; in the meantime, I will post some articles with a huge number of probably-misformatted chemical formulas. -- Beland (talk) 03:03, 1 April 2021 (UTC)
Thanks, I would like to fix up misformatted formulas. But so far it has mostly only been reference titles that are misformatted. There are also chemical formula fragments that go up to a "(". The formula includes the brackets part too, so the fragment does not make a useful redirect. Graeme Bartlett (talk) 03:37, 1 April 2021 (UTC)
  • Thanks Beland for working on Wikipedia:Typo Team/moss#Known chemical formulas that don't use subscripts. I guess you are onto the issues already - pattern strings showing up in image names, urls, or InChI strings. But it has also highlighted completely non-standard use of <chem> or <ce> html markup. I can fix these, but where <math> is used, it will be for more complex cases (not easily fixed). I have fixed some things in references. But I reckon there will be a lot of unformatted CO2 and H2O around. Graeme Bartlett (talk) 08:20, 29 October 2021 (UTC)
    • @Beland: I have now been through Wikipedia:Typo Team/moss#Known chemical formulas that don't use subscripts and added subscripts for the chemicals. There are still a lot of other things that aren't chemical formulas though. It would be good to get a new list, particularly for CO2, and secondarily for H2O. Some I enclosed with proper name template -- will that keep them off this list? Graeme Bartlett (talk) 10:25, 4 December 2021 (UTC)
      • @Graeme Bartlett: Oh, excellent! I had limited the listings to 25 articles per entry because I thought that cleanup would go relatively slowly. I just posted all the remaining articles for the above-25 entries, so you can tackle as much or as little of that as you care to. The next snapshot will be taken on Dec 20, so I'll try and remember to update this again after that's processed, and feel free to ping me if I forget. I'll take a look at the leftovers now to see what should be done about them. And yes, adding pretty much any template should suppress instances from being reported on this list. -- Beland (talk) 23:58, 4 December 2021 (UTC)
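
The filtering suggestions earlier in this thread (drop the heaviest elements and "R" from the match, and reject a single letter followed by digits) could be sketched roughly as below. This is only an illustration, not the moss matcher: the function name is hypothetical, the symbol list is elements 1 to 98 (everything before Es), and the requirement that a token contain at least one digit is an extra assumption, added because the report is about missing subscripts.

```python
import re

# Element symbols for atomic numbers 1-98; Es and heavier are left out, as is "R".
SYMBOLS = (
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni "
    "Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe "
    "Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg "
    "Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf"
).split()

# Longest symbols first so that "Co" is tried before "C".
_alternatives = "|".join(sorted(SYMBOLS, key=len, reverse=True))
FORMULA = re.compile(rf"(?:(?:{_alternatives})\d*)+")
SINGLE_LETTER_PLUS_DIGITS = re.compile(r"[A-Z]\d+")


def looks_like_formula(token):
    """Heuristic check for an unformatted formula such as CO2 or H2O."""
    if not any(ch.isdigit() for ch in token):
        return False  # assumption: only tokens with would-be subscripts matter
    if SINGLE_LETTER_PLUS_DIGITS.fullmatch(token):
        return False  # e.g. R350, K9: postcodes, bus routes, currency amounts
    return FORMULA.fullmatch(token) is not None


for token in ["CO2", "H2O", "NO3", "R350", "K9", "Fm2O3", "H3K27"]:
    print(token, looks_like_formula(token))
```

Tokens like H3K27 still pass this check, which fits the remark above that more complex strings starting with H can be epigenetic modifications; separating those would presumably need the extra article context Beland mentions.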

Large group treatment

There has been some discussion on the use of {{notatypo}} and methods to remove some spellings from future pickup. Would be interested in hearing opinion on treating exceptions in larger groups such as 'CO2'? Tag them with 'notatypo' (and obfuscate) or let them ride (on the merry-go-round)? Exceptions can include band name, external titles and such, and the numbers will be increasing. Perhaps use of {{text}} or {{proper name}} when it's other than in a title or similar? Neils51 (talk) 01:52, 3 February 2022 (UTC)

Well "CO2" is a mistake, even if it is not a typo, but due to ineptness or negligence. I asked for more CO2 to appear in the list, but then got distracted and have not corrected many more. CO2 is almost always corrected to CO₂, but not always, as sometimes it is not carbon dioxide. Using AWB it is possible to do some mass editing as you suggest, but checking by a person is required. Graeme Bartlett (talk) 10:56, 3 February 2022 (UTC)
Thanks for responding Graeme Bartlett, however, if a ref/cite title uses 'CO2', albeit a mistake, I am not going to correct it. Thought that I might seek agreement as to how to treat these so they don't appear in future lists, though I'm happy to do my own thing. Can get through these quite quickly with AWB. Neils51 (talk) 13:18, 3 February 2022 (UTC)
I suppose you should just "let them ride". We are volunteers and do not have to do something. I have noticed that quite a few typos that appear on the list should have been fixed the previous time around, so we are not actually correcting everything, but as long as there is improvement, it's going in the right direction. Graeme Bartlett (talk) 21:16, 3 February 2022 (UTC)

Do we still have a backlog?

The quick link takes you to X, where it seems like most of the typos that were actually typos have already been covered. But maybe I'm missing a backlog somewhere else. If that's the case, I'm willing to help out with it. It's been a while since I've helped out here and I miss it; I think it's a really cool concept. Clovermoss (talk) 16:54, 17 May 2022 (UTC)

Most of these are not yet fixed. Some have added comments, but in reality definitions need to be added to Wiktionary, species added to wikispecies, proper names marked, non-standard things made standard etc. Graeme Bartlett (talk) 11:40, 18 May 2022 (UTC)
@Graeme Bartlett: Thanks for getting back to me so quickly. I guess my quick glance wasn't enough. I haven't really done much beyond fixing typos when I was involved with the project before. How simple is moving definitions to Wiktionary? Do articles with these words actually contain definitions a lot of the time or is there something else you have to do? How would you mark proper names? Is there a list somewhere that prevents them from being re-added as typos? Or am I misunderstanding what you meant by that? Clovermoss (talk) 15:12, 18 May 2022 (UTC)
The instructions are at the top of Wikipedia:Typo Team/moss. If the words are not dealt with, then they should be re-added to the list. Otherwise no one will sort them out. The main exception is the species that you can get away with adding to a wanted list. (I think perhaps we need something like this for chemicals, as unlimited chemical names can be constructed. Chemical formulas and maths formulas just need a smarter detection algorithm.) You can also write articles or create redirects. Don't expect to find definitions for Wiktionary in the articles. Take a look at some Wiktionary entries to get an idea of the format. Graeme Bartlett (talk) 21:57, 18 May 2022 (UTC)
@Clovermoss: Oh yes, there is still a huge backlog of English misspellings to fix and other tasks to take care of. I just noticed the "X" listings were getting a bit stale, so I dealt with the remaining few actual misspellings and dumped the rest (which need tagging or investigation) into the "Case notes" section on the X subpage, which is the usual practice. A lot of folks seem to have more fun just fixing actual English misspellings and leave the non-English cleanup and linguistic research to others. Since there's always plenty of easy misspellings ready and waiting, if you're one of those people and ever want fresh listings, feel free to ping me. Recently I've been distracted not only by real life but also by maintaining other moss reports, and forgot to check the main listings. I just posted fresh typos at Wikipedia:Typo Team/moss/Y, so hopefully with your help and that of many others, we'll soon finish our second pass through the alphabet. If you're also still interested in adding Wiktionary entries (so many piled up wanting to go over!), I made Wikipedia:Typo Team/Wiktionary cheat sheet for myself, and you may find that speeds the process. -- Beland (talk) 07:25, 3 June 2022 (UTC)

Nothing wrong with "№"

There’s nothing wrong with using № in place of "No." in table headings. Why is № being targeted? Jeff in CA (talk) 07:13, 29 May 2022 (UTC)

@Jeff in CA: This symbol is contrary to the manual of style, at MOS:NUMERO. -- John of Reading (talk) 07:52, 29 May 2022 (UTC)

Fresh listings

@Jake The Great 908, Puddleglum2.0, Schazjmd, Bradleyagin, Darylgolden, MarkZusab, Amiodarone, Zojomars, Anarhistička Maca, Clovermoss, JaAlDo, Creativecreatr, Voidify, Doghouse09, Spazure, Idell, Fehufanga, Triethylborane, Littleb2009, Normal Name, Amazomagisto, TreeReader, and Alivemussel:

This is your official notice that I've just refreshed most of the sections on the main page, and posted fresh typos to Wikipedia:Typo Team/moss/Y. We're almost done with our second pass through the alphabetical listing of all articles with typos! This update has combined several sections all looking at the most frequent words which are missing from the dictionary. As we've fixed over a decade worth of existing typos, it's becoming increasingly rare to encounter actual English misspellings that appear many times in Wikipedia, and increasingly hard to automatically distinguish English from non-English words, so I stopped trying to do that and just made one report for the highest-frequency typos.

The backlog of words to move to Wiktionary was getting very long, so I moved that to its own subpage. Folks working on that queue should feel free to rearrange it if it suits them, or let me know if there's anything that I could improve that would speed things along.

Thanks to Jonesey95, we also have a new report, Wikipedia:Typo Team/moss/not English, which finds long passages of non-English text (and other non-English garbage sometimes), if you're interested in helping tag or clean up or translate those articles.

For those working on the chemistry lists (I see Graeme Bartlett charging through a giant pile most recently!) since those had a lot of active edits recently and the new results look very similar to the old ones, I decided it might be easier for everyone not to update those lists quite yet. But if you're encountering stale listings, let me know and I'll be happy to refresh those too.

In fact, anyone should feel free to ping me if updating anything would provide you with a more satisfying experience in your favorite work queue, or if you spot any potential for improvement or have any questions. -- Beland (talk) 08:14, 3 June 2022 (UTC)

Just Curious: Why do so many Indian pages appear on TS+Dot lists?

(Moved from Wikipedia talk:Typo Team/moss/K.)

It seems like a lot. Elfabet (talk) 19:13, 8 March 2019 (UTC)

@Elfabet: Oh, hey, I was just centralizing talk page discussions so I wouldn't miss important questions and comments, and I came across this from a while ago.
That's an interesting question! I can't give you an entirely scientific answer, but based on some casual observations of demographics and typo patterns, my guess would be a combination of factors:
  • India is a large, heavily populated country, so there is a lot to write about.
  • There are hundreds of millions of people in India who speak English, so there is good coverage of India-related topics compared to countries with very few English speakers.
  • It might be that lots of words describing India-related topics begin with the letter "K", because those words are borrowed or transliterated from languages that use that as an initial sound a lot.
  • Punctuation-related errors sometimes occur because of hard-to-read wiki syntax or just because someone who knows the standard English punctuation rules has mistyped. But statistically, I do see punctuation errors disproportionately in articles where there is a reasonably long passage of at least somewhat grammatically incorrect English. I suspect this happens when someone who is not familiar with the rules of standard written English contributes new sentences and paragraphs. This happens to the degree that I actually require at least one punctuation error before listing an article on the Wikipedia:WikiProject Guild of Copy Editors/Database Report (which results in it getting a top-to-bottom copyedit for grammar and punctuation and not just spelling).
  • Hundreds of millions of people in India speak English as a second or third language, with varying levels of proficiency for the spoken and written forms. Interestingly, and unlike vocabulary and grammar, spelling and knowledge of punctuation rules are applicable only to the written form and not the spoken form.
  • Non-native speakers who have not yet become proficient at a new language tend to commit errors in systemic ways because of the language(s) they learned first. For example, if your native language(s) does not have plurals or subject-verb agreement, you will tend to make mistakes in those aspects of English, and probably master those aspects last. Similarly, native English speakers who are learning Mandarin tend to have a lot of trouble with tones because English is not a tonal language. Out of curiosity, I checked around and though it looks like sometimes Hindi is written with periods, it is often not, even in formal settings like newspapers. It would then make sense that text written by native Hindi speakers from India would have more period-related errors (which is what the TS+DOT section is reporting) than native Brazilian Portuguese speakers, where the punctuation rules are mostly the same as English. There are many other widely spoken native languages in India which may have similar differences, though I haven't bothered to catalog them.
-- Beland (talk) 02:04, 4 June 2022 (UTC)

Just add Water

I see that for H2O there are many items (search=167) that relate to H2O: Just Add Water. The article Loreto Kirribilli is one example where Graeme Bartlett added subscripting to the display portion (single entry) in December, so the current report pickup is on the link. This could be 'fixed' with the following construct, [[{{as written|H|2O}}: Just Add Water|H<sub>2</sub>O: Just Add Water]], to give H2O: Just Add Water; however, perhaps the better approach would be an exclusion? Neils51 (talk) 22:38, 16 June 2022 (UTC)

Based on the article title, the version with the subscript seems to be more correct. -- Beland (talk) 23:04, 17 June 2022 (UTC)
Umm..Beland, OK, let me re-phrase this. Why is Loreto Kirribilli in the current list and what would you do to fix it? Neils51 (talk) 00:01, 18 June 2022 (UTC)
@Neils51: Ah, I see what you mean now! Sorry, I didn't read that as carefully as I needed to, perhaps because I hadn't eaten breakfast. After a hearty meal, I'm now remembering the algorithm for this report actually doesn't even look at link targets, only the display text. So Loreto Kirribilli is already fixed; it doesn't need {{as written}}. The only reason it's still on the list is that it wasn't removed when it was fixed, and I haven't updated the list since the 2021-11-01 dump. Feel free to delete any articles for which the only apparent problem is the link target. I was holding off on updating the list because it's actively being edited, but since it's started to cause confusion, why don't I go ahead and do that with the next dump. 2022-06-20 is coming soon, and it's a fast one so I should be able to sneak it in relatively easily without disrupting the cleanup process. -- Beland (talk) 00:50, 18 June 2022 (UTC)Reply
Thanks @Beland: that makes it clear. Based on what you have stated here I would like to update your "Instructions to Editors" to suggest strikethrough prior to edits and deletion once edits are completed, along with the reasoning. When you say 'list', I'm assuming that means the published project page. Need to ask as sometimes assumptions... Neils51 (talk) 01:23, 18 June 2022 (UTC)
Yes, I mean the lists on the project page. Mmm, well it seems like a bit of unnecessary work to do both strikethru and delete. As long as one of them happens, it will prevent other editors from attempting to fix something that is already fixed. The instructions recommend strikethru for sections that get updated all at once, which is what happens for the chemistry formula report, just so I can make sure to manually delete items from the new report that got fixed in the day or two between the dump snapshot and when the report is ready. Does that make sense? -- Beland (talk) 02:31, 18 June 2022 (UTC)Reply
I also may have fixed many pages that appear on other lists. Also I have found when I strike some off, then other editors might remove my striking. It does result in more checking than required! Graeme Bartlett (talk) 02:59, 18 June 2022 (UTC)
Yeah, having multiple lists as we do makes unfreshness particularly problematic when they collide. I hope I haven't been the one accidentally undoing your strikethroughs? Sometimes I delete struck-through text when I'm preparing for an update but the snapshot hasn't happened yet, to reduce the amount of cross-checking I have to do later. If strikethrough really isn't working out, we could just say, always delete, and I will try to use page history diffs instead of strikethroughs to prevent duplicate work across updates. I don't want to confuse people who have gotten used to a particular way of doing things, but I'm open to whatever people find the most efficient and pain-free. -- Beland (talk) 08:24, 18 June 2022 (UTC)Reply
My 'concern' was that an item dealt with in December is something I have revisited recently as still in the list. I was suggesting that strikethrough and later delete would prevent that (reappearance) from happening, based on your response. I am not sure how that would work for you however what can we do that will prevent, or assist you to prevent, the ongoing presence, or return, of actioned items? Perhaps I have a misunderstanding as to how you generate the lists? On the project page you say that you run fresh scans against a recent dump to produce new lists. That implies that any item previously dealt with, no longer meeting selection criteria, should disappear, particularly when 6 months or so have elapsed. The 'implication' seems at odds with manual manipulation of items that are flagged as processed on the project page? The 'logic' doesn't make sense to me so what am I missing? The purpose here is to have you provide the best information and the editors using it to do so in an efficient manner. I know that writing a book is not always the best way to go so happy to be involved in a Zoom (or similar) call. Neils51 (talk) 09:50, 18 June 2022 (UTC)Reply
Now reading back through the comments I am getting the notion that you are adding to the lists and probably manually deleting any that have been struck so not a list refresh as such, just an append, as your original base was late last year. Do I have that correct? So some items have been actioned and either not been struck or if struck the strike was at some point removed? If this is the case then maybe the editor who has completed the item(s) should be responsible for their deletion from the list (at some point). Neils51 (talk) 10:11, 18 June 2022 (UTC)Reply
Yeah, in this case, the listings were based on a dump that was six months old. I just did a complete refresh from the 2022-06-20 dump, so staleness shouldn't be a problem anymore. The only manual inspection I really need to do is to prevent items from re-appearing if they have been completed between the time the dump is snapshotted and the time I post the update. For first-of-the-month dumps, that's typically a couple weeks, but for 20th-of-the-month dumps, that's typically a couple of days. I think the easiest thing to do going forward might be to just say "delete things as you finish them to avoid duplicate work" (unless you need to leave a note documenting a weird case) and I will take care of the rest (and try to use 20th-of-the-month dumps to make it easy on myself). -- Beland (talk) 18:58, 22 June 2022 (UTC)Reply
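
Beland's point above, that the chemistry report looks only at a link's display text and never at its target, can be illustrated with a short Python sketch. This is not the moss code; the regular expression and the function name `visible_text` are assumptions made for illustration, and nested links or file captions are not handled.

```python
import re

# Matches [[target]] and [[target|display]]. Hypothetical pattern for this sketch;
# it does not handle nested links, file captions, or the pipe trick.
WIKILINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]")


def visible_text(wikitext: str) -> str:
    """Replace each wikilink with the text a reader actually sees."""
    def repl(match: re.Match) -> str:
        target, display = match.group(1), match.group(2)
        # With a pipe, only the display part is shown; otherwise the target is.
        return display if display is not None else target
    return WIKILINK.sub(repl, wikitext)


sample = "Filmed for [[H2O: Just Add Water|H<sub>2</sub>O: Just Add Water]]."
print(visible_text(sample))
```

A checker that scans `visible_text(sample)` for an unsubscripted "H2O" finds nothing, which is why a link fixed this way needs no {{as written}} wrapper on its target.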

"convert special characters found by Wikipedia:Typo Team/moss"

Can anyone please explain the policy behind converting "special characters" to templates (e.g. https://en.wikipedia.org/w/index.php?title=DR_Class_130_family&diff=next&oldid=1095852548)? In particular, what is the policy basis for this?

AIUI, for some time the push has been to move "special markup" (i.e. HTML entity references) to Unicode characters, e.g. &deg; to °. Is that correct, or has that now changed?

If so, why are other Unicode characters like ′ now being replaced with a template?

In particular (@Beland:), why is this being done within links, where it obviously breaks the link, and without any form of checking afterwards? We've been here for years, we already knew that bulk operations by simple regex are a really bad idea, just because they cause this havoc to wikicode. Andy Dingley (talk) 15:00, 12 July 2022 (UTC)Reply

@Andy Dingley: Whoops, my bad. Changing to an equivalent Unicode character or HTML entity works in links, and changing to an equivalent template generally works in the display text part of links, but as it turns out not in the target part of links. I check all these changes manually, and I forgot the distinction between display and target part when deciding not to preview this particular page. Thanks for catching that, and I'll be more careful to double-check links in the future.
Yes, in general, consensus favors converting HTML entities to Unicode characters for ease of use, and that's what I've been doing. There are some exceptions, and in the case of characters that can be visually confused for one another, many editors prefer to have the name of the character in the markup (especially when <math>...</math> markup is used on the same page). My general assumption is that if a template exists, it's preferred over the equivalent HTML entity, because templates can have documentation attached, most editors don't know how to use HTML entities, and wikitext is generally preferred over HTML. I've assembled a partial list of visually confusing characters here, along with some notes for myself on how to sort them out: Wikipedia:Manual of Style/Character Table 2.
The prime character in particular is often misused, either as an apostrophe, single quote mark, ʻokina, or other similar character. Legitimate uses often occur right next to an italic letter, and because wikitext uses apostrophes to italicize, the markup gets hard to read with a raw character. As I've been going through and fixing misuses, I've been converting the raw character to {{prime}} both for clarity and to indicate that this instance has been checked for correctness. Since I can't do that for this article, I've just added an HTML comment to clarify. -- Beland (talk) 19:04, 12 July 2022 (UTC)Reply
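
The distinction described above (a template is generally fine in the display part of a link but breaks the target part) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `templatize_prime` and the link pattern are hypothetical, the sample link is made up, and other places where a template would be unsafe (URLs, file names, template parameters) are not handled.

```python
import re

# Hypothetical pattern: group 1 is the link target, group 2 the optional "|display" part.
WIKILINK = re.compile(r"\[\[([^\[\]|]*)(\|[^\[\]]*)?\]\]")
PRIME = "\u2032"  # the prime character
TEMPLATE = "{{prime}}"


def templatize_prime(wikitext: str) -> str:
    """Swap raw primes for {{prime}} everywhere except inside link targets."""
    out = []
    pos = 0
    for match in WIKILINK.finditer(wikitext):
        # Plain text before the link can be converted freely.
        out.append(wikitext[pos:match.start()].replace(PRIME, TEMPLATE))
        target = match.group(1)          # left untouched so the link still resolves
        display = match.group(2) or ""   # display text may carry the template
        out.append("[[" + target + display.replace(PRIME, TEMPLATE) + "]]")
        pos = match.end()
    out.append(wikitext[pos:].replace(PRIME, TEMPLATE))
    return "".join(out)


print(templatize_prime("[[Co\u2032Co\u2032|Co\u2032Co\u2032]] and x\u2032"))
```

A link with no pipe has only a target, so it is left exactly as written; previewing the page remains the real safeguard.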

Manual attention needed

  • 2 - 99th United States Congress - wikt:you, wikt:you. It looks like these two were caught because of another typo, which I corrected. They clearly address the readers, a no-no, but I left this entry as a reminder for now because they're part of a boilerplate paragraph used in articles covering the 38th through 111th Congresses, all of which need the same rewriting. (The other articles starting with '9' didn't have the other typo, and I suspect that's why they didn't show up on this list.) A template paragraph might be better - and easier. Ira Leviton (talk) 17:01, 30 September 2019 (UTC)Reply
User:Ira Leviton, I honestly don't think the paragraph needs to be there at all... it just explains how to click on a link and how to read the document it opens to which I think we can assume denizens of the internet to be capable of without instruction. All of the relevant information on the document is summarised in the article anyway, which is also true of the handful of other random articles I checked within the range you listed. Xurizuri (talk) 10:35, 27 December 2020 (UTC)Reply
@Xurizuri: This is still on my ever-expanding list of things to do, but I had forgotten about it. I think that having the link is useful; it's much tougher to find this information without it. But the paragraph can be rewritten as "Complete lists of members and staff for all House and Senate standing, select, and special committees and subcommittees appear in the annual congressional directory listed at the bottom of the page in the external links section." I'm willing to insert this on all of these pages. (But I'll wait to hear from you, or if you can improve my wording.)
Ira
Ira Leviton (talk) 14:54, 28 December 2020 (UTC)
That sounds solid. I'm not very across MOS for external links so I honestly don't know if there are any issues with that plan. I guess an alternative to directly mentioning the directory in the text is to have "Complete lists of members and staff for all House and Senate standing, select, and special committees and subcommittees as they appear in the annual congressional directory" and cite that to the ACD itself. Either way, seems like an improvement to just be getting that 2nd person language out of there. Xurizuri (talk) 15:27, 28 December 2020 (UTC)
I have created and implemented a template along these lines at Template:List of Congressional Committees instructions. Wording tweaks are, of course, welcome. BD2412 T 02:20, 9 July 2022 (UTC) Sct72 (talk) 01:51, 13 July 2022 (UTC)Reply

Dashes

I wouldn't normally be this pedantic but since this is a typo team project... The project pages consistently use hyphen incorrectly where a dash is required, for example in all the dump lists. Example: Wikipedia:Typo Team/moss/I#Case notes. All those items that start with "1 -" should start with "1 –". GA-RT-22 (talk) 15:49, 30 July 2022 (UTC)

@GA-RT-22: Hmm, the dashes are actually used in the Python and shell scripts a lot to split apart lines and whatnot. It would be difficult to use non-ASCII characters because they are not on my keyboard, it would be a big change that could break some things, and it's difficult to tell the difference between the different kinds. I think I'll leave things as they are. -- Beland (talk) 18:35, 12 August 2022 (UTC)
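
The actual moss scripts are not shown on this page, so the following Python sketch is only a guess at the kind of delimiter-based parsing Beland describes. The function name `parse_listing` and the exact " - " delimiter are assumptions; the sample line is the listing quoted in the "Manual attention needed" section above.

```python
def parse_listing(line: str) -> tuple[int, str, list[str]]:
    """Split a 'count - title - words' listing on the ASCII ' - ' delimiter."""
    # maxsplit=2 keeps any later ' - ' inside the word list; a title that itself
    # contains ' - ' would still be split in the wrong place.
    count, title, words = line.split(" - ", 2)
    return int(count), title, [word.strip() for word in words.split(",")]


print(parse_listing("2 - 99th United States Congress - wikt:you, wikt:you"))
```

If the first delimiter were typed as an en dash, `line.split(" - ", 2)` would return only two fields and the unpacking would raise `ValueError`, which is the sort of breakage that makes changing the listings' punctuation risky.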

Time for more

Wikipedia:Typo Team/moss/before A is almost completely finished, just case notes really now. Please @Beland: can you post the next batch? Graeme Bartlett (talk) 12:19, 12 August 2022 (UTC)

@Graeme Bartlett: Thanks for the ping! "A" is now posted!
Reading through the recently resolved case notes, I see a lot of the "probably OK" words are actually real typos. It looks like these are coming from the moss "ME" class which are "coMpound English" words. I think we made a lot of progress adding legitimate compounds to Wiktionary, and a lot of what's left are instances where there's a missing hyphen or space, or a misspelling that happens to look like two unrelated words smashed together. So, I'll be posting those after the next run, which should be for "B". Thanks to everyone who has been conscientiously putting those in case notes! -- Beland (talk) 00:59, 13 August 2022 (UTC)

Trivial, not typo

This edit by Beland appears to be trivial, and so to be avoided. It is not a typo. Note that character-by-entity entry is useful to achieve script support; converting to unsupported script characters is not helpful in this. Is there more background involved? Does it relate to "This page contains script XYZ character"-categorisation & reader help? DePiep (talk) 07:49, 16 October 2022 (UTC)

@DePiep: Greetings! In the spirit of MOS:MARKUP (keep markup simple) and in my past experience, there seems to be general support for replacing numeric HTML entities with the equivalent Unicode characters, or named HTML entities, or templates. There are a few exceptions, such as private use characters, where it's clearly necessary to keep the numeric entity. These changes are not trivial in the sense of WP:COSMETICBOT; they are intended to make it easier for downstream consumers to parse wikitext. For example, simplifying the representation of special characters makes it easier for search engines to find all the relevant pages to a query that contains special characters, and in my case, for my spelling, grammar, and style checker to validate content. Keeping markup simple is also supposed to make life easier for editors, since it makes the wikitext more WYSIWYG. I'm not exactly sure what you mean by achieving "script support"? If you're saying it's easier for you to input wikitext using HTML entities, you're still welcome to do that. I'm hoping the conversion makes it easier for you to read and edit the wikitext subsequently, but if that's not the case, we can discuss and make alternate arrangements. -- Beland (talk) 17:56, 17 October 2022 (UTC)Reply
Thanks, clear now. Basically, the "trivial" was the question (i.e., not an issue). My background is Unicode, so I am more interested in the numerical code :-). Long term, we want to improve script support (support rare scripts). But anyway, I am also working on a character-analysis tool (return character properties), so the edits we are talking about are not a hindrance. DePiep (talk) 20:24, 25 October 2022 (UTC)
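
As a rough illustration of the conversion discussed in this thread (numeric HTML entities replaced by the equivalent Unicode characters, with exceptions such as private use characters), here is a minimal Python sketch. It is not Beland's tool: the function name, the `SYNTAX_CHARS` set, and the choice to also keep whitespace and control characters as entities are all assumptions made for the example.

```python
import re
import unicodedata

NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

# Assumption for this sketch: characters with meaning in wikitext or HTML stay
# escaped, since decoding them could change how the page parses.
SYNTAX_CHARS = set("[]{}|<>&'=*#:;")


def decode_numeric_entities(wikitext: str) -> str:
    """Replace numeric entities with literal characters where that looks safe."""
    def repl(match: re.Match) -> str:
        code = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        if code > 0x10FFFF:
            return match.group(0)  # not a valid code point; leave as written
        char = chr(code)
        # General categories starting with C cover private use, control, format,
        # surrogate, and unassigned code points; Z covers spaces and separators.
        if unicodedata.category(char)[0] in "CZ" or char in SYNTAX_CHARS:
            return match.group(0)
        return char
    return NUMERIC_ENTITY.sub(repl, wikitext)


print(decode_numeric_entities("45&#176; and &#xE000; and&#160;more"))
```

Named entities and templates are deliberately left out; which replacement is preferred for a given character is a style question this sketch does not try to settle.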

Needless italics around quoted speech

This is a problem I see very often around Wikipedia, e.g.

The author later denied the claims: ''"I never wrote such a thing."''

Of course the italics are not necessary (and in fact merely confusing) when we already have quotation marks. Could the fixing of this very common problem be automated, or included here, too? Equinox 12:36, 29 October 2022 (UTC)
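
One way such a check might be approached is sketched below in Python. It only flags candidates for human review, because italics around a quotation are sometimes legitimate (for example, quoted non-English text). The pattern and the function name `find_italic_quotes` are assumptions made for illustration, not part of any existing moss report.

```python
import re

# Wikitext italics ('' ... '') wrapped directly around a quoted passage.
# The lookarounds skip bold (''') and bold-italic (''''') markup.
ITALIC_QUOTE = re.compile(r"(?<!')''([\"\u201c][^\"\u201d\n]+[\"\u201d])''(?!')")


def find_italic_quotes(wikitext: str) -> list[str]:
    """Return quoted passages that are also wrapped in italics."""
    return ITALIC_QUOTE.findall(wikitext)


def strip_italics(wikitext: str) -> str:
    """Drop the italic markers but keep the quotation marks."""
    return ITALIC_QUOTE.sub(r"\1", wikitext)


sample = "The author later denied the claims: ''\"I never wrote such a thing.\"''"
print(find_italic_quotes(sample))
print(strip_italics(sample))
```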

B finishing up

@Beland: Wikipedia:Typo Team/moss/B is very close to finished. So it could be time to break out the "C" typos. Graeme Bartlett (talk) 06:37, 20 February 2023 (UTC)

@Graeme Bartlett: Thanks for the note! Wikipedia:Typo Team/moss/C is posted. -- Beland (talk) 03:05, 22 February 2023 (UTC)

These pages have become too big to work with conveniently.

Wikipedia:Typo Team/moss/C started at well over 1 million bytes, which makes it slow to load, slow to edit, and hard to find things. Can we start splitting these up into subpages, either by sectioning issues by type, or by first letter combinations? Also, can we perhaps centrally archive the previous case notes somewhere? BD2412 T 03:35, 12 April 2023 (UTC)Reply

I notice it takes quite a few seconds to edit, save, or load these pages. I thought of taking out a section to work on and merging the updates later, but that risks duplicate effort. But please keep things simple, otherwise we will deter helpers! So updating the one page is simpler. But we also need people to action the case notes. Old case notes in a separate page sounds OK, as they gradually fill up the page. Also splitting sounds good. Graeme Bartlett (talk) 22:25, 12 April 2023 (UTC)
I think the easiest thing would be to split into smaller letter ranges, e.g. Ca-Cd, Ce-Ck, Cl-Co, Cp-Cz (or Că, if we're counting special characters). If we can generate a page by first letter, we can generate pages by first letter ranges. Still split out the old case notes. BD2412 T 22:59, 12 April 2023 (UTC)
I have created Wikipedia:Typo Team/moss/Old case notes, and moved case notes from B to E there so far, with appropriate redirects added to sections of the original pages. BD2412 T 18:43, 30 April 2023 (UTC)
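
The letter-range split suggested above could be sketched along these lines in Python. The ranges are the ones from BD2412's example; the function name `subpage_for`, the "other" bucket, and the sample titles are hypothetical, and how moss actually assigns articles to pages is not shown on this talk page.

```python
from collections import defaultdict

# Ranges taken from the example in the discussion above.
RANGES = [("Ca", "Cd"), ("Ce", "Ck"), ("Cl", "Co"), ("Cp", "Cz")]


def subpage_for(title: str) -> str:
    """Pick the letter-range subpage for an article title."""
    # capitalize() normalises case so that "CD-ROM" compares as "Cd".
    prefix = title[:2].capitalize()
    for low, high in RANGES:
        if low <= prefix <= high:
            return f"{low}-{high}"
    # Digits, punctuation, and letters such as "ă" fall outside the ASCII ranges.
    return "other"


def split_titles(titles: list[str]) -> dict[str, list[str]]:
    """Group titles by subpage, keeping the input order within each group."""
    pages = defaultdict(list)
    for title in titles:
        pages[subpage_for(title)].append(title)
    return dict(pages)


print(split_titles(["Cabbage", "Cello", "Coal", "Cyan", "C\u0103l\u0103ra\u0219i"]))
```

The "other" bucket is one answer to the "Că" question raised in the comment: titles whose second character sorts after "z" would otherwise match no range at all.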

In-text external links

Maybe this is something we could add to the database dumps if there's a way of automatically determining if an external link is outside the external links section? Clovermoss🍀 (talk) 21:25, 26 April 2023 (UTC)
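
A rough Python sketch of how such a check might work on raw wikitext follows. It assumes the section is headed exactly "External links", treats everything before that heading as body text, and ignores links inside <ref> tags; the function name and patterns are hypothetical, and bare URLs, citation templates, and infobox parameters are not handled.

```python
import re

SECTION = re.compile(r"^==\s*External links\s*==\s*$", re.MULTILINE | re.IGNORECASE)
# Footnotes are stripped first, since external links inside references are expected.
REF = re.compile(r"<ref\b[^>/]*>.*?</ref>|<ref\b[^>]*/>", re.DOTALL | re.IGNORECASE)
EXT_LINK = re.compile(r"\[(https?://[^\s\]]+)")


def in_text_external_links(wikitext: str) -> list[str]:
    """List bracketed external links that appear before the External links section."""
    heading = SECTION.search(wikitext)
    body = wikitext[:heading.start()] if heading else wikitext
    body = REF.sub("", body)
    return EXT_LINK.findall(body)


sample = (
    "See [https://example.org the site].<ref>[https://example.com/src Source]</ref>\n"
    "\n"
    "== External links ==\n"
    "* [https://example.net Official]\n"
)
print(in_text_external_links(sample))
```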

We should revise the Quickstart and Instructions for Editors sections

The Instructions for Editors section on the individual list pages and, to a lesser extent, the Quickstart section on the main MOSS page need a revision. Some things in these sections are phrased unclearly (like the matter of what constitutes a proper name). The formatting and phrasing of the Instructions page need a copyedit anyway (e.g. the extra bullet point before "For DNA sequences" and the misplacement of the word "titles" in the line about proper names). And there are some matters that aren't listed here, like an example edit summary for adding a proper name tag (which would be the same as for the not a typo tag, but it's annoying to have to edit it after pasting every time). The main issue that led me to write this talk page section is the matter of plurals of the subject of a page. As stated here by Beland, if e.g. "fexprs" is used on a page "fexpr", a redirect should be created. I've come across this situation multiple times and only now have figured out what to do with it. It'd be a good idea to include this in the Quickstart and Instructions sections. 110521sgl (talk) 09:13, 30 April 2023 (UTC)