Wikipedia talk:AutoWikiBrowser/Typos

Incorrectly changing en dashes to hyphens in violation of MOS:SUFFIXDASH

I was recently doing page clean-up with AWB, and AWB suggested changing "Academy Award–winning" (with an en dash) to "Academy Award-winning" (with a hyphen) on the article "Society of the Snow" (see this edit[1]). User Nardog noticed my edit and reverted it[2] citing MOS:SUFFIXDASH. I started a discussion about this (see: Wikipedia talk:Manual of Style#Clarification on the use of a hyphen or an en dash for "Academy Award winning"), and it appears all users who commented are in agreement that en dashes should be used in these situations and not hyphens. Thanks! Wikipedialuva (talk) 09:45, 27 January 2024 (UTC)

When using the typo fixes, it's important not to save an edit unless you know it is correct. The typo list should be thought of as suggestions that are usually correct but cannot foresee all scenarios. That said, I've probably saved this with a hyphen, forgetting the guideline. At any rate, this may or may not be fixable in the typo rules because not all of these scenarios can be covered programmatically with regular expressions. My hunch is we can make it skip more of these false positives than before. Stefen Towers among the rest! GabGruntwerk 09:55, 27 January 2024 (UTC)
@StefenTower: Thanks for looking into this. I want to acknowledge that I understand that the user is ultimately responsible for the edits made with AWB and that this erroneous edit was my fault for saving the edit. As was noted on MOS talk, violations of the MOS:SUFFIXDASH rule with regard to awards are extremely widespread on the project (extending into article titles, which I have been working on fixing). I also understand that it will be impossible to come up with every possible variation of this rule. I don't know if it would be appropriate or even possible to make a rule for some of the most common awards that cause the error. Thanks again for responding and looking into the issue! Wikipedialuva (talk) 10:24, 27 January 2024 (UTC)
I'm glad there is understanding of the use of AWB Typos and how it works. To the crux of the matter: perhaps if we didn't place the hyphen when "Award" is capitalized and preceded by another capitalized word, it would cut out most false positives. That should be a simple fix. Stefen Towers among the rest! GabGruntwerk 20:58, 27 January 2024 (UTC)
@Wikipedialuva: Done. The simple fix I described is now complete. Stefen Towers among the rest! GabGruntwerk 00:47, 28 January 2024 (UTC)
@StefenTower: Fix works great! Thanks for your help! Wikipedialuva (talk) 07:56, 28 January 2024 (UTC)
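A minimal sketch of the exception described in this thread, written in Python for illustration. The actual AWB rule is not quoted in the discussion, so the pattern and the function name `hyphenate_compound` are hypothetical; only the behavior (leave the en dash alone when a capitalized "Award" follows another capitalized word) comes from the comments above.

```python
import re

EN_DASH = "\u2013"

# Hypothetical stand-in for a rule that hyphenates "...–winning" compounds.
# Group 1 is the optional preceding word, group 2 the word before the dash.
COMPOUND = re.compile(r"\b(?:(\w+) )?(\w+)" + EN_DASH + r"winning\b")


def hyphenate_compound(text):
    """Replace the en dash with a hyphen, except after '<Capitalized> Award'."""

    def fix(match):
        previous, word = match.group(1), match.group(2)
        # Exception from the thread: capitalized "Award" preceded by another
        # capitalized word (e.g. "Academy Award") keeps its en dash.
        if word == "Award" and previous is not None and previous[:1].isupper():
            return match.group(0)
        return match.group(0).replace(EN_DASH, "-")

    return COMPOUND.sub(fix, text)
```

With this sketch, "an Academy Award–winning film" is left untouched, while "an award–winning film" has its en dash replaced by a hyphen.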

Backslash in regex name

Greetings! I'm recycling AWB regexes in a script that can efficiently scan database dumps on Linux. The use of backslashes in word="you'(d\ve\re\ll)_" creates some weird escaping issues. Any objection to renaming it "you'(d/ve/re/ll)_"? That seems more like standard slash usage to me anyway. Thanks, Beland (talk) 01:05, 29 January 2024 (UTC)

@Beland: That shouldn't cause any issues. I believe AWB itself doesn't use the names for anything. -- John of Reading (talk) 07:42, 29 January 2024 (UTC)
Beland: Done. As John of Reading said, changing this won't impact AWB typo correction. I wish the rule name did display in the log, but alas it does not (I need to request that :) ). Stefen Towers among the rest! GabGruntwerk 08:43, 29 January 2024 (UTC)
@John of Reading and StefenTower: Thanks for the quick fix! -- Beland (talk) 20:17, 29 January 2024 (UTC)
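One way such escaping issues can arise, sketched in Python. This is an assumption for illustration: the thread does not say which language or tooling the scanning script uses. Only the two rule names are taken from the discussion; the helper `has_control_chars` is hypothetical.

```python
# Rule names from the thread. The raw-string prefix keeps backslashes literal.
OLD_NAME = r"you'(d\ve\re\ll)_"
NEW_NAME = "you'(d/ve/re/ll)_"  # slashes need no special handling

# Without the raw prefix, \v and \r are read as vertical tab and
# carriage return, so part of the old name silently changes.
COOKED_FRAGMENT = "d\ve\re"


def has_control_chars(text):
    """Return True if the text contains any ASCII control character."""
    return any(ord(ch) < 32 for ch in text)
```

Here `has_control_chars(COOKED_FRAGMENT)` is true while the same check on `OLD_NAME` and `NEW_NAME` is false, which is why the slash form is easier to pass between tools that apply their own escape processing.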

Tie scores being treated as false positives

Tie scores such as 1-1 and 2-2 are being overlooked for en dash replacement, and with all the sports articles I work on, this is getting a little too frustrating. Tom.Reding: I found out you had made a change to make this a false positive in March 2020. Is there any way this can be made to bypass fewer of these (i.e. look for additional text to match)? A *lot* of tie scores aren't getting a correction (unless I catch it and perform the correction manually). Stefen Towers among the rest! GabGruntwerk 23:02, 3 February 2024 (UTC)

@StefenTower: the relevant part of rule "0–0", (?<!\b\1[-—]\1\b), was moved the next day into rule "2–1", precisely so that "0–0" can find draws and ties. If "0–0" isn't finding ties now, it's not because of that lookbehind; it's because "0–0" needs to be expanded with more relevant keywords. ~ Tom.Reding (talk ⋅ dgaf) 13:33, 4 February 2024 (UTC)
Tom.Reding Thanks for the reply. (?<!\b\1[-—]\1\b) is still in "2–1", and it is stopping corrections of draws/ties that don't meet "0–0". "0–0" currently doesn't cover a lot of scenarios I'm seeing in my typo correction work, but I don't know why the draws/ties need to be avoided in the general case of "2–1" in the first place. Is there any harm in removing that code? That's what I'm driving at. Stefen Towers among the rest! GabGruntwerk 17:37, 4 February 2024 (UTC)
@StefenTower: the reason that lookbehind exists is that the rule was incorrectly catching journal volume numbers, e.g. "Some Journal 5-5", preceded by "5-4" and succeeded by "5-6" etc., which should not be changed/en-dashed. ~ Tom.Reding (talk ⋅ dgaf) 17:46, 4 February 2024 (UTC)
I can see why we don't want a journal volume number en-dashed, but I don't quite understand why volume numbers that look like draws/ties are more problematic than those where the numbers aren't equal (e.g. "5-4"). If the "5-4" is in the middle of the series instead of "5-5", wouldn't it be falsely corrected? Stefen Towers among the rest! GabGruntwerk 17:58, 4 February 2024 (UTC)
Tom.Reding Is it that a draw/tie is guaranteed not to be a number range if referring to a volume? If that's the case, I understand it better now after thinking more about it. At any rate, I am working on more cases to convert the ties to use an en dash outside of this false positive check. But I wish this check weren't needed: we really miss a lot of genuine draws/ties as it stands. Stefen Towers among the rest! GabGruntwerk 00:01, 5 February 2024 (UTC)
@StefenTower: do you have a list of pages where ties are known to have been missed? ~ Tom.Reding (talk ⋅ dgaf) 11:59, 5 February 2024 (UTC)
@Tom.Reding I don't keep a list of these. I just keep seeing these missed, and I have to correct them manually. It's any draw/tie that the "2–1" rule would ordinarily catch but doesn't due to the false positive catch. If you run AWB typo checks in sports articles especially, it can get rather frustrating. Stefen Towers among the rest! GabGruntwerk 16:26, 5 February 2024 (UTC)
Here is an example. While it shows "17-17" and "7-7" being corrected, that happened only because I saw in my AWB viewer that they weren't corrected, and I manually placed the en dash there. Stefen Towers among the rest! GabGruntwerk 16:56, 5 February 2024 (UTC)
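The trade-off discussed in this thread can be sketched in Python. This is a simplified illustration, not the "2–1" rule itself, whose full pattern is not quoted here. Instead of the lookbehind (?<!\b\1[-—]\1\b), the sketch uses a replacement callback to get a comparable effect for hyphens only; the pattern, the function name `endash_scores`, and the `skip_ties` flag are all hypothetical, and there are no guards for dates or other contexts.

```python
import re

EN_DASH = "\u2013"

# Hypothetical, simplified score pattern: two short numbers joined by a hyphen.
SCORE = re.compile(r"\b(\d{1,3})-(\d{1,3})\b")


def endash_scores(text, skip_ties=True):
    """En-dash hyphenated scores; optionally leave equal pairs (ties) alone."""

    def fix(match):
        left, right = match.group(1), match.group(2)
        # Equal numbers may be a tie score or something like a journal
        # volume/issue pair ("Some Journal 5-5"), so they can be skipped.
        if skip_ties and left == right:
            return match.group(0)
        return left + EN_DASH + right

    return SCORE.sub(fix, text)
```

With `skip_ties=True`, "won 2-1, drew 7-7" becomes "won 2–1, drew 7-7", which protects "Some Journal 5-5" but misses the genuine tie. With `skip_ties=False` both scores are en-dashed, and so is the journal number.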

False Positive

Vice-President gets corrected to Vice-president; it should be Vice President or vice president (I think). DarmaniLink (talk) 18:13, 11 February 2024 (UTC)

With respect to companies/organizations, I defer to them using a hyphen in the title per their choice. At the same time, titles like "Vice President of the United States" definitely don't have a hyphen. At any rate, Chris the speller created rules related to this, so maybe he has some thoughts here. In the meantime, feel free to skip any typo corrections you don't feel comfortable with, or do a manual edit if you so choose. Stefen Towers among the rest! GabGruntwerk 17:38, 12 February 2024 (UTC)
Stefen is correct. Note that in Europe the hyphen is generally used, while in the US it is less often used, but this has to be handled case by case. Chris the speller yack 02:06, 13 February 2024 (UTC)
Is it "Vice-president" or "Vice-President" for British English/Europe? DarmaniLink (talk) 02:09, 13 February 2024 (UTC)
Most resources seem to suggest that for 'Vice President' both are capitalized when used as a title. Neils51 (talk) 23:54, 13 February 2024 (UTC)

Kuty

"ua" was flagged as a typo of "uk" (sorry about the previous mistakes, I'm sleep deprived). I'm not sure if this was the typos or something else that flagged this, though. I'm not seeing a regex that would have done that; I just saw the attempted diff. DarmaniLink (talk) 02:10, 13 February 2024 (UTC)

@DarmaniLink: That was a general fix, not a typo fix. There is an entry in Wikipedia:AutoWikiBrowser/Template redirects that tells the software to replace instances of the redirect {{lang-ua}} with {{lang-uk}}. -- John of Reading (talk) 07:59, 13 February 2024 (UTC)
Ah, I assumed lang-uk was British English; I really don't know why.
Should have checked that, sorry. DarmaniLink (talk) 15:11, 13 February 2024 (UTC)
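The general fix described above amounts to a lookup from redirect name to target template. A rough Python sketch follows, assuming a plain dictionary and exact name matching; AWB's actual implementation and the format of the Template redirects page are not shown in this thread, so the names `TEMPLATE_REDIRECTS` and `bypass_template_redirects` are hypothetical. Only the lang-ua to lang-uk pair comes from the discussion.

```python
import re

# Only this pair is taken from the discussion above.
TEMPLATE_REDIRECTS = {"lang-ua": "lang-uk"}

# Matches the opening braces and the template name, up to the first "|" or "}".
TEMPLATE_NAME = re.compile(r"\{\{(\s*)([^|{}]+?)(\s*)(?=[|}])")


def bypass_template_redirects(wikitext, redirects=TEMPLATE_REDIRECTS):
    """Replace known redirect names with their target template names."""

    def fix(match):
        leading, name, trailing = match.groups()
        target = redirects.get(name)
        if target is None:
            return match.group(0)  # not a known redirect: leave unchanged
        return "{{" + leading + target + trailing

    return TEMPLATE_NAME.sub(fix, wikitext)
```

Given "{{lang-ua|example}}", the sketch returns "{{lang-uk|example}}"; templates that are not in the table pass through unchanged.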

Typos restructuring

CoolieCoolster: As this is a complicated tool rather than an article, any major restructuring needs to be discussed. Please use this topic to explain what you think ought to be done here. Also, if changes are to be done, they need to be done more piecemeal, so editors can readily see what is moving where. A lot of difficult work has gone into building the list over time, and we need to be extra careful. Stefen Towers among the rest! GabGruntwerk 04:35, 12 April 2024 (UTC)

Apologies, and noted. As it stands, the current structure of the list is highly disorganized, with many fixes lasting more than a year in a general section at the top of the page. This not only makes finding which issues have or haven't been addressed difficult, but also results in situations where the same issue may be covered twice via different means, such as km² via its own Unicode-to-sup-tag listing and one that also addresses m² and cm². I don't mean to remove recent additions from the top of the list, as I understand the importance of testing them extensively before they are integrated into the main lists, but I also think that it's important to sort listings into sections after the year time period has passed in order to maintain an effective organization scheme. In terms of organization, while the Capitalisation, Grammar, and other sections at the top do a pretty good job of dividing rules into groups (improving navigability, facilitating standardization, and making it easy to see what's not represented), ones with too many listings inevitably become bloated and should be divided into meaningful subsections. To illustrate my point, here's a restructured collection of the current sections, to be amended when the listings at the top are sorted:
   4 Typo list
       4.1 Recent additions
           4.1.1 Unsorted
           4.1.x Common subsections (TBD)
       4.2 Academia
           4.2.1 Academic titles
           4.2.2 Academic fields
           4.2.3 College degrees
       4.3 Capitalisation
           4.3.1 Brand names
               4.3.1.1 Colleges and universities
               4.3.1.2 Companies and organizations
               4.3.1.3 Products
               4.3.1.4 Technology
               4.3.1.5 Websites
               4.3.1.6 Unsorted
           4.3.2 Placenames (high-level)
               4.3.2.1 Continents and subcontinents
               4.3.2.2 Oceans
               4.3.2.3 Geographical proper names
           4.3.3 Placenames (low-level)
               4.3.3.1 Canada
               4.3.3.2 France
               4.3.3.3 United Kingdom
               4.3.3.4 United States (states)
               4.3.3.5 United States (cities)
           4.3.4 Time
               4.3.4.1 Calendrical proper nouns
               4.3.4.2 Holidays
               4.3.4.3 Epochs, ages and dynasties
           4.3.5 Society
               4.3.5.1 Cultures, languages, and ethnic groups
               4.3.5.2 Ethnicity & language
               4.3.5.3 Religious
           4.3.6 Unsorted
       4.4 Decapitalisation
           4.4.1 Medals
           4.4.2 Miscellaneous
       4.5 Misspellings
           4.5.1 A
           4.5.2 B
           4.5.3 C
           4.5.4 D
           4.5.5 E
           4.5.6 F
           4.5.7 G
           4.5.8 H
           4.5.9 I
           4.5.10 J
           4.5.11 K
           4.5.12 L
           4.5.13 M
           4.5.14 N
           4.5.15 O
           4.5.16 P
           4.5.17 Q
           4.5.18 R
           4.5.19 S
           4.5.20 T
           4.5.21 U
           4.5.22 V
           4.5.23 W
           4.5.24 X
           4.5.25 Y
           4.5.26 Z
       4.6 Accents and diacritics
           4.6.1 Proper nouns
       4.7 Formatting
           4.7.1 Calendar dates
           4.7.2 SI unit symbols
           4.7.3 Symbols and HTML entities
       4.8 Grammar
           4.8.1 Articles
           4.8.2 Contractions
           4.8.3 Replace space by hyphen
           4.8.4 Joined words
           4.8.5 Split words
           4.8.6 Duplicated words
           4.8.7 Redundant words
           4.8.8 Euphemisms
           4.8.9 Preposition usage
           4.8.10 Punctuation
           4.8.11 Remove hyphens after adverbs ending in -ly
           4.8.12 Remove other hyphens (replace with space)
       4.9 General rules
           4.9.1 Unsorted
           4.9.2 Beginnings
           4.9.3 Middles
           4.9.4 Endings
               4.9.4.1 A
               4.9.4.2 B
               4.9.4.3 C
               4.9.4.4 D
               4.9.4.5 E
               4.9.4.6 F
               4.9.4.7 G
               4.9.4.8 H
               4.9.4.9 I
               4.9.4.10 J–K
               4.9.4.11 L
               4.9.4.12 M
               4.9.4.13 N
               4.9.4.14 O
               4.9.4.15 P
               4.9.4.16 Q
               4.9.4.17 R
               4.9.4.18 S
               4.9.4.19 T
               4.9.4.20 U–V
               4.9.4.21 W
       4.10 Incorrect phrases
While it would make the TOC longer, I think it would make it much easier for people to find issues they'd like to address (using the RegEx replacement rules to search for articles that fulfill those criteria), make it easier for people to make new rules to fill in the gaps of existing ones (such as the 'cubed' rule I previously added on the basis of the 'squared' rule), and facilitate the adding of new rules (by categorizing similar rules together, it's easier to see what's missing). For instance, by grouping together university capitalization rules into their own subsection, people may think of additional entries to add to that specific section that they might not have otherwise in looking at a more generalized list.
Apologies again for making such a drastic edit without consultation; it just seems like the potential of RegEx typo fixing would be doubled if there were a greater deal of clarity and structure in the rule categorization scheme. CoolieCoolster (talk) 05:16, 12 April 2024 (UTC)
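The duplicate-coverage concern raised above (km² handled by its own rule and by a shared m²/cm² rule) could be checked mechanically. A minimal Python sketch, assuming two made-up patterns as stand-ins: the real rules are not quoted in this discussion, so both patterns, the rule names, and `overlapping_rules` are hypothetical.

```python
import re

# Hypothetical stand-ins for two rules that both cover "km²".
RULES = {
    "km2 (own rule)": re.compile(r"\bkm²"),
    "m2/cm2/km2 (shared rule)": re.compile(r"\b(?:k|c)?m²"),
}


def overlapping_rules(sample, rules=RULES):
    """Return the names of every rule whose pattern matches the sample text."""
    return [name for name, pattern in rules.items() if pattern.search(sample)]
```

Running each rule's test strings through every other rule in this way would list candidate duplicates regardless of which section they are filed under.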
CoolieCoolster I'm looking through this and it's quite overwhelming. It would really help if you made a list of "change x to y" proposals, possibly separated out so they can be discussed individually. It's really difficult to grasp the value of this restructuring as a whole. What we had wasn't not working, and I can't yet see that the potential would be doubled with the proposed changes. Stefen Towers among the rest! GabGruntwerk 21:26, 12 April 2024 (UTC)
It's just my proposed solution; the overall problem is that while the page mentions sorting listings into sections after a year, it doesn't appear that that is occurring in an organized manner on a consistent basis, making it difficult to interpret what is or isn't present without using the browser's search function for individual words. I don't mean to be blunt, but given that an effective reorganization would involve moving hundreds of lines to new or existing sections, discussing the moving of any individual line would be missing the forest for its trees. While moving any one line has no inherent value on its own, and should only be done if it functions identically when sorted as it did initially, the value of the sorted list as a whole is the ability to see what functions are currently unaccounted for, particularly for common typos that one might not have considered otherwise.
My intention is to help, not harm the project, so until consensus on the matter is reached I'll stick to just organizing any replacement rules I add myself to subcategories of the existing New additions section. However, given that organization is already listed as being part of the list-making process, as long as new organization keeps rules above the General rules section to avoid rule interference, it seems a shame to forgo the benefits of a sorted list for the sake of avoiding change for change's sake. CoolieCoolster (talk) 21:56, 12 April 2024 (UTC)
There is no "avoiding change for change's sake" going on here. I asked for a detailed explanation of the specific x-to-y changes. That's all. I already have read your overall contention about the structure. I'm not inclined to agree to such massive changes unless and until I (and hopefully others) can understand that they make sense. I am not against change or improvement. Stefen Towers among the rest! GabGruntwerk 22:01, 12 April 2024 (UTC)
Need a project approach here. Firstly, is the proposal a good idea? Namely, create a redefined list and clean up entries. In the interests of consensus, yes. Sure, what's there works; however, it can be tricky to navigate and there is duplication. Next, how to approach it? It makes sense to me that a parallel (new) list is built on a subpage and material copied to it, removing the potential for harm to the active page. A list of editors willing to be involved would be gathered on that page, and a work list created where editors can put their name against specific components and thus spread the load. A lot of the work may be sheer copy/paste, which if divvied up won't be so onerous. I would suggest that any entries that in the interim are added or revised in the active list have a date stamp against them (that seems to be happening) and perhaps the editor's guesstimate as to classification against the new list. There are other aspects that will need addressing (testing, etc.); however, at this stage there is not much point in writing a book. Neils51 (talk) 01:33, 13 April 2024 (UTC)
To make sure that there are enough people who both think list restructuring would be worthwhile and are willing to help with the restructure on a parallel page (enabling everyone involved to review it so that a consensus on a functional list can be established), I'll make a signup list below. Per my statements above, it won't involve making any modifications to the current list until consensus can be reached that the structure of the parallel list meets the needs of all users involved. I think having at least three people willing to work on the list would help in splitting the workload and ensuring that the structure is a product of consensus and not unilateral editing. CoolieCoolster (talk) 22:57, 15 April 2024 (UTC)
1. CoolieCoolster (talk) 22:57, 15 April 2024 (UTC)
2. Neils51 (talk) 09:03, 16 April 2024 (UTC)
3.
I have doubts most folks who use the typo list are looking at the structure. They're just pulling the whole lot into AWB and using them for typo hunting. At any rate, I'd be happy to review what's done and offer feedback. My main concern is the typo rules themselves staying intact in the result and that the list is no less readable than it is now. I didn't put my name on the list because I don't entirely agree with the premises as stated and my time is eaten up with too many other things. Stefen Towers among the rest! GabGruntwerk 06:35, 18 April 2024 (UTC)