User talk:Trappist the monk/Archive 18



Module:Smallem

Hello! :)

As you know, lately I've been trying to write a script capable of making Smallem update itself automatically. To make that process easier, it would be good to have some anchors in its module results to allow for easier regexing. For example, in the results generated here: {{#invoke:Smallem|lang_lister|list=y}}, it would be good to have the list of language codes start and end with some specific strings. Given your experience, what would be the best approach here, considering both a good graphical solution and the easiest form to regex?

The same could be said about the list of regexes generated. That list, as you know, comes with an added problem: sometimes it generates this warning: skipped: got: 𐌲𐌿𐍄𐌹𐍃𐌺; from: got.wiki, which could theoretically be accompanied by other similar warnings in the future. Those warnings pollute the list of commands, and whatever anchors we choose should also take care of leaving the warnings out. - Klein Muçi (talk) 22:49, 4 March 2021 (UTC)Reply

Something human readable? So, for example, you might do something like begin_code_list ... end_code_list and begin_regex_list ... end_regex_list.
Trappist the monk (talk) 23:31, 4 March 2021 (UTC)Reply
Maybe something only alphanumeric? Underscores may add an extra layer of difficulty in regex, no? - Klein Muçi (talk) 23:39, 4 March 2021 (UTC)Reply
Why would you think that? Underscores don't have special meaning in regex.
Trappist the monk (talk) 23:56, 4 March 2021 (UTC)Reply
Yes, but lately I'm scared of using symbols (or even plain whitespace) because even when regex is not a problem, they start interfering with Linux commands and you end up Googling for ages for escape mechanisms. But as far as I know, underscores are not generally used in Linux commands either, so... - Klein Muçi (talk) 23:59, 4 March 2021 (UTC)Reply
So don't include underscores. Or make the delimiter strings all uppercase (Linux is case-sensitive, and commands are typically lowercase). Or make the delimiter strings distinctly non-English.
Trappist the monk (talk) 00:13, 5 March 2021 (UTC)Reply

Maybe camel case in English, BeginCodeList, etc.? That would make it a bit harder to read, but it's not like we'd spend much time reading it anyway. Albanian strings were the first that came to my mind, but we have everything in English there and I wanted to preserve that. - Klein Muçi (talk) 00:24, 5 March 2021 (UTC)Reply

Ok, so do that.
Trappist the monk (talk) 01:16, 5 March 2021 (UTC)Reply
As usual, can you do it? Without wanting to bother you. I'm reluctant to experiment with Lua now to find the correct place and everything with regard to the generated results. - Klein Muçi (talk) 01:27, 5 March 2021 (UTC)Reply
sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 12:34, 5 March 2021 (UTC)Reply
Thanks a lot! Can (should?) the delimiters be without a newline separation? I believe it would make for easier data extraction in bash, even though I'm not sure how that would look with the regex list. - Klein Muçi (talk) 12:49, 5 March 2021 (UTC)Reply

Here's what I get as data at the moment:

CodeListBegin</p><div about="#mwt1">aa, ab, ace, ady, ... zh-tw, zh-yue, zu</div><p about="#mwt1">CodeListEnd

This comes from me getting the HTML file of this page and grepping between CodeListBegin and CodeListEnd. We want only the codes. I can, of course, not include the endpoints, but the problem remains with that markup in between. Can it be "fixed" somehow? Or is it a necessary part of the module? Sadly, that kind of defeats the purpose of having those strings as delimiters. - Klein Muçi (talk) 13:27, 5 March 2021 (UTC)Reply

Really? When I look at the source for sq:Përdoruesi:Trappist the monk/Livadhi personal, I see:
CodeListBegin</p><div>aa, ab, ace, ady, ... zh-tw, zh-yue, zu</div><p>CodeListEnd
How is it that smallem works at all with the unordered list markup that formats the regex list? For the same page I see this:
<div class="div-col columns column-width" style="column-width:30em">
<ul><li>(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq-latn\2"),</li>
<li>(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq\2"),</li>
...
<li>(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)힌디어(\s*[\|\}])", r"\1hi\2"),</li></ul>
</div>
I wrapped the code list with <div>...</div> tags so that it would be somewhat similar to the regex list which has always been wrapped with <div>...</div> tags and which (apparently) smallem has been able to read and use.
We could, I suppose, change sq:Moduli:Smallem to accept a parameter, perhaps |plain=yes that would instruct the module to render the lists without the 'prettifying' markup and whitespace that make the lists human-readable. But, the keywords would change to include a colon just because humans will read the lists in their machine-readable forms:
CodeListBegin: ... :CodeListEnd
RegexListBegin: ... :RegexListEnd
producing:
CodeListBegin:aa,ab,ace,ady,...zh-tw,zh-yue,zu:CodeListEnd
and:
RegexListBegin:(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq-latn\2"),(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Abasinisch(\s*[\|\}])", r"\1abq\2"),...(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)힌디어(\s*[\|\}])", r"\1hi\2"),:RegexListEnd
Trappist the monk (talk) 14:23, 5 March 2021 (UTC)Reply
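For what it's worth, extracting the codes from that machine-readable form becomes trivial on the consumer side: anything outside the delimiters (warnings, HTML markup) is ignored automatically. A minimal Python sketch, assuming a fetched page text (the variable name page_text is hypothetical) that contains the CodeListBegin: ... :CodeListEnd span proposed above:

```python
import re

def extract_code_list(page_text):
    # Grab everything between the proposed delimiters and split it on commas.
    # Warnings and markup outside the delimiters never reach the result.
    match = re.search(r'CodeListBegin:(.*?):CodeListEnd', page_text, re.DOTALL)
    if match is None:
        raise ValueError('delimiters not found in page text')
    return [code.strip() for code in match.group(1).split(',') if code.strip()]
```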
Nice catch! That's because it doesn't. Until now, the update of its source code has been done manually by me. I copy the regex lines by simply copy-pasting them from whatever wiki page they're generated on, after they're rendered. Now I'm trying to make the process automatic. To do that, I'm using curl to get data from a wiki page (Smallem's personal sandbox) and, according to the instructions mentioned here, getting the HTML format was the best solution, the only other option being the source format, which brought up far more extra data than the HTML option. I wish there was a way to get only the plain text we see rendered on wiki pages, without any markup, but... The first step was to get the code list and use it in batches of 50 items to get the regex lists. Then I'd have to work on getting the regex lists and concatenating them together. Of course, both steps need to be done in a row, in a cycle, until the code list is exhausted. But given that I was dealing with cleaning up the code list data now, I made up my mind to worry later about the regex lists. So what you just noticed was a problem that was on the to-do list to be fixed. If I succeed in making the process automatic, we can afford to lose some aesthetic elements (so far, the results look promising). So I guess you can go on with your suggestion? I'd still like to somehow retain the plain-list (no bullet points) format of the regexes, considering their sheer size, but I'm forced to accept whatever I can get. I like the idea of having a switch in our hands, though, to, well, switch from machine- to human-readable form.
Everything would be perfect as it currently is if I could just get the plain text somehow, but I know of no way of doing that. :( - Klein Muçi (talk) 15:22, 5 March 2021 (UTC)Reply
sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 16:17, 5 March 2021 (UTC)Reply
Works very well except for... Can we add back a single space after the commas? That would be the perfect formatting bash-wise, I believe. - Klein Muçi (talk) 20:48, 5 March 2021 (UTC)Reply
sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 22:55, 5 March 2021 (UTC)Reply
Thank you kind sir! - Klein Muçi (talk) 00:07, 6 March 2021 (UTC)Reply

I was able to utilize the regexes and create a loop to eventually get the whole list. The loop works as intended, but I'm running into a strange problem. I keep getting edit conflicts every once in a while, even though no one is editing that page other than me with these requests. You need to send the latest revision id of the page you're trying to edit with every request you send, and I've written a function to get it and send it automatically. Sometimes it glitches, it doesn't update fast enough, or... I don't know really, but it appears as though it runs into an edit conflict, basically with itself (just because it's trying to edit an older version of the page). This would happen even when sending requests manually (before creating the loop), but then it wouldn't cause any harm because I could just re-send the request and it would get fixed. The problem now is that I send 5 codes per request and, if the edit conflict happens, those 5 codes are lost and the loop goes on with the next 5 codes, which means that even 1 edit conflict corrupts the whole end result. I tried making Smallem sleep after every request (starting with 5 seconds per request) and more edits got through before running into conflicts, so I tried increasing the sleeping time. Eventually I got to a point where increasing it wouldn't help anymore (I reached 10 minutes per request) and the conflicts were rare, but they still persisted in what appears to be a random manner. You could get 5 requests through and then get 1-2 conflicts in a row, for example. At this point, I don't really know what's causing them. Maybe some problem with Wikipedia's caches or job queues... Do you have any experience with this kind of problem? I've been reading the bot help pages here and the ones on MediaWiki about API requests but... - Klein Muçi (talk) 10:18, 7 March 2021 (UTC)Reply

I guess that I'm confused. It appears to me that you are attempting to recreate the regexes at a rate far in excess of the rate of change of the code list. The language codes are updated only occasionally, so it isn't necessary to recreate the regexes until the code list changes. Shouldn't you just create the regexes once and let smallem use that set until sometime, maybe months from now, when MediaWiki changes the language codes?
Trappist the monk (talk) 12:19, 7 March 2021 (UTC)Reply
Yes, that's what I do. Every few months, I go and work with module Smallem and recreate the regexes. Now I want Smallem to go to that module and recreate the regexes by itself, again every few months. When I do it manually, I feed around 50 codes to the module in one go, get the results, and repeat until the list is exhausted. Smallem can't work with 50 codes in one go (one request) because it can't get that much information in a single request, so, by trial and error, I found out that it must work with 5 codes per request. So, considering we have around 330 codes, Smallem needs to send 66 POST requests (to add codes) and 66 GET requests (to get the regexes for those codes) in a row to complete the update when the time comes (every 3 months, for example) [66*5=330]. This is done by creating a loop which, to put it simply, works like this: get the total list of codes, send 1 request with 5 of them, delete those 5 codes from the list, save the regexes for the 5 codes sent, repeat until the list is exhausted. The problem is that every 5-6 requests it encounters an edit conflict, for unknown reasons, even though no one other than Smallem is editing that page. This is what I described in detail above. The loop doesn't stop; it still deletes those 5 codes from the list and goes on to send another request, so the results for those 5 codes, whatever they were, will be missing in the end. This is the problem I'm describing. The big number of requests is just because it works in small batches to complete 1 full regex list update (which will be programmed to run periodically every few months), not because I'm updating it over and over again all the time. - Klein Muçi (talk) 12:45, 7 March 2021 (UTC)Reply
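One way around losing a batch is to detect the conflict, re-fetch the base revision, and retry the same batch instead of moving on. A Python sketch of that retry structure only, under stated assumptions: get_base_revid and submit_edit are caller-supplied stand-ins for whatever curl/API calls the script actually makes, and the 'editconflict' string is just the value those hypothetical helpers are assumed to return on a conflict.

```python
import time

def edit_with_retry(get_base_revid, submit_edit, batch_text, max_tries=5):
    # Retry a batched edit instead of dropping the batch on a conflict.
    # submit_edit(batch_text, revid) is assumed to return 'success',
    # 'editconflict', or some other error string.
    for attempt in range(max_tries):
        revid = get_base_revid()              # re-fetch the latest revision id
        result = submit_edit(batch_text, revid)
        if result == 'success':
            return True
        if result != 'editconflict':
            raise RuntimeError('unexpected edit result: ' + result)
        time.sleep(5 * (attempt + 1))         # back off a little, then retry this same batch
    return False
```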
Side quest: Can the regex list be generated with the ASCII tab character separating the regexes instead of a plain space? I believe it would help me with code formatting in the final stage. - Klein Muçi (talk) 13:22, 7 March 2021 (UTC)Reply
If you want ...
Trappist the monk (talk) 13:46, 7 March 2021 (UTC)Reply
Seems extraordinarily odd to me that something that you can do manually cannot be done through the api – the api should be able to fetch the largest article so fetching the regexes for 50 codes shouldn't be a problem. Perhaps your first step is to find an api expert. I'm pretty sure that I've suggested this before: pywikibot is written using python; surely sq:Moduli:Smallem can be ported to python so that all it needs to fetch is the code list and it makes the regexes itself...
Trappist the monk (talk) 13:46, 7 March 2021 (UTC)Reply
Well, even if it did work with 50 codes (I tried it only twice and then went on to lower the number, not giving enough thought to why it wasn't working, because I figured "anyway, Smallem doesn't get tired"), it would still need 7 turns to finish all the codes, and it would take only 1 edit conflict for the final list to miss all the regexes for the 50 codes in that iteration. But why not do all the codes in 1 turn then? You already know the answer to that question. (The MediaWiki limit.) So in one way or another I need to deal with the edit conflict problem. Making Smallem understand when an edit conflict happens and retry that iteration would take too much effort, and I didn't want to follow that road given that running into edit conflicts should be a very, very rare occasion for Smallem. Ironically, that's what's stopping it now. I can try to make it work with 50 codes, but the problem persists if I don't understand correctly what's going on behind the curtain. I thought maybe you had encountered something similar with Monkbot or knew anything about phenomena like these, but it looks like Smallem will keep sailing alone on that problem. And yes, please, change it accordingly. Also, if you don't mind, tell me the correct name of that character. You're good with nomenclature. I only know it as "the tab character" or "the 5 spaces". :P - Klein Muçi (talk) 14:13, 7 March 2021 (UTC)Reply
Which is why smallem / pywikibot should be building the regexes string; no splitting the job into pieces and hoping that you got all of the pieces.
Tab key § Tab characters
sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 14:42, 7 March 2021 (UTC)Reply

Thank you! You were right. It works even with 100 entries (150 entries reach the MediaWiki limit). Unfortunately, the problem persists. You need 4 iterations to finish the list and I can do only 1 or 2 iterations before it runs into an edit conflict and 100 entries are lost. :P I'll experiment more to try and solve it with good ol' trial and error. Thank you for your persistent help! :)) - Klein Muçi (talk) 15:19, 7 March 2021 (UTC)Reply

I was able to get past that problem with a very crude hack. Now I'm reaching the final steps of the update, where I need to sort the total list and remove the duplicates. To do that, I have to have a unique delimiter in the list of regexes that separates each regex from the others, so that when I get the machine-like form, I can rearrange them back into a sortable list and then remove the duplicates. Can you add a character after the comma (maybe replacing the whitespace; we're talking about the machine-readable list) to use as a delimiter? Anything that is not used in the regexes themselves. I was thinking of adding a hash character, but that's used for comments and I'm not really sure it would be a good idea. - Klein Muçi (talk) 23:21, 10 March 2021 (UTC)Reply
Isn't that why you had me put in the tab character?
Trappist the monk (talk) 00:15, 11 March 2021 (UTC)Reply
Yes. The original thought wasn't precisely to use it as a delimiter but more to have the tab character shown before every regex, because in the final bash syntax that's how those lines are arranged (cosmetic indentation). But I was disappointed when it wasn't rendered anywhere: neither the wiki page itself nor the bash results generated from it had any tab space in them. But please, since you reminded me, let me try something else before you make any change to it. My guess is I'll still have to put another symbol as a delimiter, but first... (I'll tell you very soon.) - Klein Muçi (talk) 00:26, 11 March 2021 (UTC)Reply
Oops! It does show in the bash results even though it doesn't show on the wiki page. And it works like a charm as a delimiter. My bad, my friend. :P :)) Bash scripting is a new world for me and working with it feels like re-inventing the wheel with every command I give (basically I need to search everything on Google). Putting everything together with Wikimedia's API (again, something totally new for me) makes that new world feel like a minefield, albeit still a nice adventure. At least you're helping me with Lua (even though, ironically, that's the only one of the three I know a bit about). Thank you for that. And sorry for the annoyance I may bring. - Klein Muçi (talk) 00:43, 11 March 2021 (UTC)Reply
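With the tab as separator, the later sort-and-dedupe step also stays tiny. A Python sketch of the idea (machine_list is a hypothetical name for the machine-readable regex string the script fetches):

```python
def dedupe_regexes(machine_list):
    # Split the machine-readable output on the tab separators, drop empties,
    # then sort and remove duplicates in one pass.
    regexes = [r.strip() for r in machine_list.split('\t') if r.strip()]
    return sorted(set(regexes))
```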
Some more beauty subjects... Can we apply some formatting to the language codes list like we did with the regex one? In the "human form" (not plain), given that entries are to be used in batches of 100, can we format it in 4 groups of at least 100 entries each? I always spend quite a few minutes counting them one by one until I'm able to select, copy and paste exactly 100 codes. - Klein Muçi (talk) 21:02, 11 March 2021 (UTC)Reply
Like this? sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 23:20, 11 March 2021 (UTC)Reply

Yes, thank you! Now one technical question: Smallem autoupdates itself by first copying the codes from here, then using them in batches on another page and getting the results back (and removing the duplicates at the end). Now assume MediaWiki changes that code list. Would the change be reflected automatically, after a while, in the HTML source of the page I mentioned above? Or would the page need to be edited somehow before it reflects the change? Because if it's the latter, I should teach Smallem to do a dummy edit before starting the process. (A dummy edit would solve that problem, no?) - Klein Muçi (talk) 23:43, 11 March 2021 (UTC)Reply

A null edit should be all that is required to ensure that the code list is up to date. I don't know if non-mainspace pages are cached. I suspect that they are (at least for a while) so a null edit would make sense to ensure that the code list is up to date.
Trappist the monk (talk) 00:12, 12 March 2021 (UTC)Reply
Okay then. :) - Klein Muçi (talk) 00:22, 12 March 2021 (UTC)Reply
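For the record, a purge request is one way to script that refresh. A hedged Python sketch using the action API; it assumes an API purge is enough to refresh the cached #invoke output here, and that the sq.wikipedia.org endpoint is the right one; if that assumption is wrong, an actual null edit (re-saving the same wikitext) would be needed instead.

```python
import requests

API = 'https://sq.wikipedia.org/w/api.php'  # assumed endpoint

def purge_page(title):
    # Ask MediaWiki to throw away the cached rendering of the page so the
    # next read of the code list re-expands the #invoke with fresh data.
    r = requests.post(API, data={
        'action': 'purge',
        'titles': title,
        'forcelinkupdate': 1,
        'format': 'json',
    })
    r.raise_for_status()
    return r.json()
```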
Can we add some kind of small mark (not a newline) every 10 and 50 entries? The groups of 100 work very well when doing the update manually, but when testing things I still waste too much time counting to 50 one by one (and starting over 3 times when I'm not sure whether I counted a given code or not). Maybe the comma becomes bold after 50 entries? Do you think that's noticeable enough for the eyes? Something small that doesn't "pollute" the current format too much. - Klein Muçi (talk) 10:30, 12 March 2021 (UTC)Reply
What if, instead of a marker character, |list= takes a number and uses that number to create lines of codes that are that long?
Trappist the monk (talk) 13:16, 12 March 2021 (UTC)Reply
So, I still get the full list but I'm able to specify how many entries each batch/group of codes has? - Klein Muçi (talk) 14:41, 12 March 2021 (UTC)Reply
Yes. If the value assigned to |list= is not a number, or the number is greater than the number of codes, the module defaults to lists of 100 codes. When the value assigned to |list= is a number less than or equal to the number of language codes, it creates lists of that many codes.
Trappist the monk (talk) 15:05, 12 March 2021 (UTC)Reply

Seems pretty good. - Klein Muçi (talk) 15:22, 12 March 2021 (UTC)Reply

sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 15:49, 12 March 2021 (UTC)Reply
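The same grouping is easy to reproduce on the script side, which spares the manual counting when batches are fed into requests. A small Python sketch (the batch size of 100 is just the value discussed above):

```python
def chunk(codes, size=100):
    # Split the full code list into groups of `size`, mirroring what
    # |list=<number> does in the module's human-readable output.
    return [codes[i:i + size] for i in range(0, len(codes), size)]

# chunk(['aa', 'ab', 'ace', 'ady'], 2) -> [['aa', 'ab'], ['ace', 'ady']]
```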
Even more requests... Can the machine regex list show, at the end of the list, the total number of entries for the specific code(s) rendered? The reason I want that is a very specific problem. I was able to finish the whole autoupdate process yesterday, but I have inconsistencies between the total number that I get manually (what the Lua logs give at the module) and the total number that Smallem gets automatically. My plan is to try each entry individually and see how many lines it should have and how many Smallem gets from it automatically. Maybe somewhere something is missing; I have no other idea how to debug it. For now, I'm doing the counting manually by copying everything into Notepad++, but if I had the number shown below it would be one less step to take. Unfortunately, even though I believed it was working with batches of 100 entries, the results it was returning were empty. Turns out Smallem can't handle requests with more than 10 entries per request; if you go past that, it starts returning empty results. When I completed the process with requests of 10 entries, the discrepancy was bigger. I tried comparing each group of 10 entries manually and automatically, and it turned out there were some groups which again returned empty results when the number of lines went above 5000. So I tried lowering the number, and the discrepancy got smaller and smaller until I got to 1 code per request, at which point I got 1100 entries less than the total I should be getting, the smallest inconsistency I could reach. Even then, though, I compared both lists (the automatic one and the manual one) for unique entries, and my hope was that it would give us more or less 1100 unique entries (only the missing ones), but much to my dismay it gave back 15000 unique entries, so not only are the final results missing some lines, they're also not correct. And I'm sort of out of ideas what to try for debugging now. The only idea I had was to compare the length of each code's list individually, manually and automatically, and see if I can find anything. It's a really tedious process but... I've got nothing else. At least removing one step would make it a bit easier. - Klein Muçi (talk) 10:59, 13 March 2021 (UTC)Reply
sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 12:25, 13 March 2021 (UTC)Reply
Hmm... My first instinct is to suggest to add the word "lines" after the number. Is there any particular reason you haven't added it? - Klein Muçi (talk) 12:30, 13 March 2021 (UTC)Reply
Because there is only one line. Each regex is concatenated onto the next separated by a tab character so the number reflects the number of regexes not lines.
Trappist the monk (talk) 12:38, 13 March 2021 (UTC)Reply
Oh! I was thinking now... Maybe we should have that even on the human form? Can't think of a particular reason now, but it's extra functionality, so why not? And for you it would be one possible request less from me in the future. I mean, if you can think of any downsides, then tell me, but if not... - Klein Muçi (talk) 12:46, 13 March 2021 (UTC)Reply
sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 13:51, 13 March 2021 (UTC)Reply

No lines? XD I'm mostly kidding now. Thank you! :) Sometimes I look back and think that I'd probably still be adding regex lines by hand if you hadn't created that. I don't believe I'd be even close to finishing half of those by now. :P - Klein Muçi (talk) 14:04, 13 March 2021 (UTC)Reply

I believe there's a sort of shortcoming in the change we made. In human form it works as it should, but in machine form you get an error if you leave the |list= parameter empty or don't include it at all. You must give it an arbitrary value for the list to be generated, and the value does nothing because the format is already fixed. Can it be made to behave like the |plain= parameter? - Klein Muçi (talk) 16:13, 14 March 2021 (UTC)Reply
What that message was telling you is that |lang= is missing or empty. When |list= has a value, sq:Moduli:Smallem renders the language list; when |list= is empty or missing, Smallem renders the regex list according to the codes in |lang=. I've tweaked it so that Smallem still emits an error message, but isn't so strident.
Trappist the monk (talk) 17:23, 14 March 2021 (UTC)Reply
No, the error was good enough on its own; the problem is that there shouldn't be an error. So, basically it should work like this: all the codes are generated whether that parameter is there or not, without any errors. If the parameter is there, it changes only the formatting of how the codes are generated (the number of entries per group). If it's not, it defaults to 100 per group (for the human form / plain=no) or all the codes on one line (for the machine form / plain=yes). That's how I thought it would work from your explanation in the beginning, but maybe I'm overlooking something in my reasoning. - Klein Muçi (talk) 19:36, 14 March 2021 (UTC)Reply
The default case for sq:Moduli:Smallem is to create regexes. We shortstop that to create the language-codes list by setting |list= to some value. At sq:Përdoruesi:Trappist the monk/Livadhi personal there are two {{#invoke:Smallem|lang_lister|...}} invokes. They both call the same function lang_lister(). In the first one, |list= is not set, so lang_lister() expects to find a list of language codes in |lang=. That parameter is missing, so lang_lister() cannot proceed except to emit an error message. In the second invoke, |list= is missing but |lang= has a list of language codes, so lang_lister() renders the regexes. |list= is the parameter that decides what smallem will do. If you want 100-item lists from smallem, you must set |list=100 so that it knows that that is what you want. When you don't tell it to list codes, smallem will assume that you want the regexes list.
Trappist the monk (talk) 20:05, 14 March 2021 (UTC)Reply
Also, I just noticed something totally unexpected now that we've added counting. This: {{#invoke:Smallem|lang_lister|plain=yes|lang=aa }} generates 857 results. This: {{#invoke:Smallem|lang_lister|plain=yes|lang=aa, }} generates 1246 results. Why is that? I'm afraid that's one of the factors that changes my expected total number of lines in the end. - Klein Muçi (talk) 19:46, 14 March 2021 (UTC)Reply
But... I'm confused... :/ It used to work without the |list= parameter. We only added that to be able to further affect the formatting of the generated codes. Shouldn't it work again without it and just default to a specified default state? Or, to ask in a more practical way, what am I supposed to write in |list= when generating the list of codes with plain=yes? - Klein Muçi (talk) 20:13, 14 March 2021 (UTC)Reply
Nope. To get a language-code list has always required |list=. See this diff. The old form was |list=yes and it always shortstopped lang_lister(). The new form uses a number but if you give it something that is not a number (like |list=yes) it defaults to 100-item lists.
|plain=yes selects the machine readable output forms. When |plain=yes and |list=<anything> you get the machine readable language-codes list. When |plain=yes and |list= (empty or omitted), you get the machine readable regexes list.
Trappist the monk (talk) 21:11, 14 March 2021 (UTC)Reply
Ooh... I "should" put "yes"... Now it makes sense. What about the comma problem? What's happening there? And... Can we make the plain list of codes generated without commas? Now that I see that they do make a difference... - Klein Muçi (talk) 21:24, 14 March 2021 (UTC)Reply
I have added a line that strips trailing commas from the end of the |lang= code list before handing the list to the part of lang_lister() that makes the regexes. sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 22:24, 14 March 2021 (UTC)Reply
Sweet! - Klein Muçi (talk) 22:56, 14 March 2021 (UTC)Reply

So, my autoupdate script works fine overall, more or less. It still has some problems with accessing the Wikimedia API, but I can't do much about those now, because I can't find anything to help me in the documentation. It requires a lot of trial-and-error to reverse engineer it, and hopefully with time I'll be able to solve those too. (I've been talking with someone about the API problems, but it seems like even that conversation is coming to an end without solving all of the problems. Do take a look if you have time.) Meanwhile, I have one problem I wanted to ask you about; maybe you can orient me a bit in the right direction. I don't get consistent results. Most of the time I get 41063 lines, but the number can vary a lot (40603, 40778, 40842, 41413, 40735, etc.). This is with the list of regexes + 9 lines of commands. Now, the variation and the lack of consistency are part of the API problems. (Sometimes I also get errors.) But I'd like to know more specifically what's behind the variation in lines. Do you have any ideas how I might understand more about what's going on behind the curtains here? What's changing between runs, what's getting left behind or ignored? The exact number should be 42597 + 9 command lines. Some common debugging tactics I can use? Don't worry if you can't help much though. As you can see above, I haven't been able to get help properly even on MediaWiki so... :P :) - Klein Muçi (talk) 12:15, 17 March 2021 (UTC)Reply

I suspect that I'm not going to be much help. So many things being done by so many different processes make finding the one or few that are causing the variations in the number of returned regexes really difficult. One would think that the difference between the expected and what you actually got would be the same given that the inputs are always the same. So I guess I would start there. Are the inputs to the auto update process chain always the same? If you force the inputs to always be the same, do you get consistent output? If you break the chain at the midpoint and insert known-good data at the beginning, is the midpoint output always correct?
Aside from this, I don't think that I have any suggestions.
Trappist the monk (talk) 15:07, 17 March 2021 (UTC)Reply
Yes, that was my hope. I was trying to somehow make it always give the same result, even if that result was wrong, but I couldn't manage even that. What I did try, though, was getting results for only 1 code and comparing the script-generated results with those gotten manually. It always had the same number of lines; I tried it for more than 30 codes. Then I got the list of results for the code aa and compared both lists (script/manual) together. I mashed them up and sorted them to leave only the unique lines, meaning that if they were the same, the result would be an empty list. What I did get, though, was a short list with some entries. There I noticed that the manually generated list had some entries with empty characters in them which the script one didn't have (this worried me a bit, to be honest, because I understood that that's a problem I might have been having all this time while generating those codes, and basically there's no way to solve it) AND that the script couldn't deal with characters like single quotes, etc. For example, N'go became Ngo in the script list. That's all I could understand, but I don't know what to do with information like that. I mean, should I deduce that the missing entries are all because of problems with special characters like these? Leaving the number inconsistencies aside, I mean, and treating 41063 as the only result. Just brainstorming now, mostly. - Klein Muçi (talk) 21:15, 17 March 2021 (UTC)Reply
"manually generated list had some entries with empty characters" – what do you mean by that? What does an [entry] with empty characters look like? What should it look like?
sq:Moduli:Smallem doesn't do anything special for language names that contain single quote/apostrophe characters. Should it? In the aa regex list I found: Fe'Fe', Lamnso', Mi'kmaq, Mka'a, Nda'Nda', N’Ko (U+2019 – single comma quotation mark), and O'odham. Didn't find N'go.
Trappist the monk (talk) 22:56, 17 March 2021 (UTC)Reply
Oh, not empty. Sorry, I meant to say invisible. Maybe it was N’Ko, or maybe it was from another test I did with the first 10 codes. I've done many tests trying to "crack the mystery" any way I can these days. But anyway, you got the general idea. Apparently my script just plainly removes those characters whenever it finds them. I don't know if that's coming from my bash code, the Wikipedia API or the curl command, and I don't know its implications. Would Smallem's regex still find N’Ko if it was written as Nko and apply the change? I guess not. Can Module Smallem help with that? I would answer if I knew. I don't know what's causing Smallem not to care about these characters, so I don't know how you could tweak the module to keep that phenomenon from happening again. :/ - Klein Muçi (talk) 23:11, 17 March 2021 (UTC)Reply
Just checked the full list. There's no N’go. It's N’Ko. Doesn't make any difference but just for clarification. :P - Klein Muçi (talk) 23:54, 17 March 2021 (UTC)Reply
Ok, so what does an [entry] with invisible characters look like? What should it look like?
If something removes single quote/apostrophe characters, then a regex that is modified from:
(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mi'kmaq(\s*[\|\}])", r"\1mic\2")
to:
(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mikmaq(\s*[\|\}])", r"\1mic\2")
won't work as desired because the regex won't match |language=Mi'kmaq and fix it to |language=mic
But, that doesn't explain the variable regex line counts. Is it possible that something in the processing chain matches and consumes single quotes and everything between them? If you have this (from the aa regex generation):
... (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mi'kmaq(\s*[\|\}])", r"\1mic\2"), ... <twelve regexes> ... (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mka'a(\s*[\|\}])", r"\1bqz\2") ...
then, if something consumes single quotes and their content, you would get this:
... (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mia(\s*[\|\}])", r"\1bqz\2") ...
which is a valid regex (in form if not in function).
Trappist the monk (talk) 00:19, 18 March 2021 (UTC)Reply
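That failure mode is easy to confirm in isolation. A quick Python check using the Mi'kmaq regex from the aa list (the sample citation string is made up for the test):

```python
import re

citation = "{{cite book |title=X |language=Mi'kmaq}}"   # hypothetical test input

good = (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mi'kmaq(\s*[\|\}])", r"\1mic\2")
bad  = (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Mikmaq(\s*[\|\}])",  r"\1mic\2")

print(re.sub(good[0], good[1], citation))  # apostrophe intact in the pattern: |language=mic
print(re.sub(bad[0], bad[1], citation))    # apostrophe stripped from the pattern: no match, text unchanged
```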
An invisible character doesn't look like anything because it's invisible. :P The terminal perceives it as a U+something code or sometimes just shows a carriage return arrow. This only happens when you copy the regexes manually from the wikipage. Sometimes invisible characters will show up in some regex lines... apparently. I didn't know that; I only noticed it now, and it got me worried about Smallem in general, because apparently they have always been there, modifying the regexes "silently". I really like your thinking about mis-consumption of lines, but I don't believe that to be the case. The reason I say that is because, as I said, I get the same number of lines when I do codes one by one; there are no lines missing. To be practical, I'll run the script for aa, get the results manually, concatenate the lists and keep only the unique lines, which I'll show here. - Klein Muçi (talk) 00:38, 18 March 2021 (UTC)Reply
  1. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)español\ \(formal\)(\s*[\|\}])", r"\1es-formal\2"),
  2. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)español\ \(formal\)‎(\s*[\|\}])", r"\1es-formal\2"),
  3. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)magyar\ \(formal\)(\s*[\|\}])", r"\1hu-formal\2"),
  4. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)magyar\ \(formal\)‎(\s*[\|\}])", r"\1hu-formal\2"),
  5. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Nederlands\ \(informeel\)(\s*[\|\}])", r"\1nl-informal\2"),
  6. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Nederlands\ \(informeel\)‎(\s*[\|\}])", r"\1nl-informal\2"),
  7. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)N’Ko(\s*[\|\}])", r"\1nqo\2"),
  8. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)NKo(\s*[\|\}])", r"\1nqo\2"),
  9. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ᬩᬲᬩᬮ(\s*[\|\}])", r"\1ban-bali\2"),
  10. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ᬩᬲᬩᬮᬶ(\s*[\|\}])", r"\1ban-bali\2"),

So, both methods showed the same number of lines: script 858, manual 857 (the script adds 1 empty line at the end of the results, which is removed in the last concatenation). I then concatenated both lists, sorted them lexicographically and kept only the lines which weren't duplicated (if a line had a symmetrical friend, they both got deleted), which were the lines above. I was wrong about the U+code: it only shows the carriage return arrow, which here, strangely enough, gets converted to that red dot (check the source code to see what I mean). So, as you can see, the lines above were the only ones which weren't fully symmetrical, and that's about it in regard to the change. There are no totally new entries in one list which don't appear at all in the other list. I believe the red dot is coming from the script list, judging by the N’Ko/Nko case in which the manual entry gets positioned in second place. - Klein Muçi (talk) 01:03, 18 March 2021 (UTC)Reply

The red dot is U+200E left-to-right mark. It is present in the MediaWiki data though I don't know why it is there. I know this because I can see it when I copy these magic word renderings to https://r12a.github.io/uniview/:
  • {{#language:es-formal|aa}} → Spanish (formal address)
  • {{#language:hu-formal|aa}} → Hungarian (formal address)
  • {{#language:nl-informal|aa}} → Dutch (informal address)
Changing the rendered language to something other than aa doesn't change anything
  • {{#language:es-formal|es}} → Spanish (formal address)
  • {{#language:es-formal}} → español (formal)
I have added a snippet of code to sq:Moduli:Smallem that replaces U+200E LRM with an empty string. Doing that changes the total number of lines to 42596 (from 42597).
No idea where you are losing the U+2019 single comma quotation mark in nqo or losing the U+1B36 Balinese vowel sign ulu in ban-bali.
Trappist the monk (talk) 14:24, 18 March 2021 (UTC)Reply
There are a handful of other codes that include U+200E left-to-right mark:
  • {{#language:be-tarask}} → беларуская (тарашкевіца)
  • {{#language:be-x-old}} → беларуская (тарашкевіца)
  • {{#language:zh-cn}} → 中文(中国大陆)
  • {{#language:zh-tw}} → 中文(臺灣)
Trappist the monk (talk) 15:02, 18 March 2021 (UTC)Reply
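In script terms the fix amounts to stripping the mark from each language name before the regex is built. A Python rendering of the same idea (the Lua snippet actually added to sq:Moduli:Smallem is not shown here, so this is only an illustration):

```python
def strip_lrm(name):
    # U+200E left-to-right mark carries no visible text and would make the
    # generated regex fail to match what a human actually types.
    return name.replace('\u200e', '')

assert strip_lrm('espa\u00f1ol (formal)\u200e') == 'espa\u00f1ol (formal)'
```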
Thank you! I'm curious why the total changes by 1 line with that change. I'll try the same test now and see what happens. And then try to see if there's any change in the grand total at the end. - Klein Muçi (talk) 20:39, 18 March 2021 (UTC)Reply

As expected, this is what I got now after redoing the same steps mentioned yesterday:

  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)N’Ko(\s*[\|\}])", r"\1nqo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)NKo(\s*[\|\}])", r"\1nqo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ᬩᬲᬩᬮ(\s*[\|\}])", r"\1ban-bali\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ᬩᬲᬩᬮᬶ(\s*[\|\}])", r"\1ban-bali\2"), - Klein Muçi (talk) 20:53, 18 March 2021 (UTC)Reply
I tried making a full run a couple of times. Unfortunately, I never get the right amount, but again, as expected, if I run it with 1 code instead of 10 codes per request I get "better" results in general. I don't know what makes the difference. You have suggested in the past to run it with 50 or 100 codes per request, but that just doesn't work: if you raise the number past 20, you start getting empty responses.
These are the results with 10 codes per request: 41063 40063 40778 40351 41063 40842 41413 40735 40543 40603 41063
These are the results with 1 code per request: 41008 41413 41008 41015 41015 41420 41013 41421
I think a good idea would be to compare the script-generated and manual lists for 10 codes at a time, until the codes are exhausted, and see what differences there are in each comparison. Maybe sometimes we'll see codes missing in the script-generated list, or anything strange (apart from the cases we've discussed above). What do you think? - Klein Muçi (talk) 23:57, 18 March 2021 (UTC)Reply
After that we can also compare a full script list and a full manual list and see if something changes when you do one full automatic run compared with doing 35 requests manually with the script. I remember I tried doing this in the past, but the results were too vast to check because of the U+200E character. Now that that problem is solved, maybe the list will be manageable. - Klein Muçi (talk) 01:46, 19 March 2021 (UTC)Reply
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Gwich’in(\s*[\|\}])", r"\1gwi\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Gwichin(\s*[\|\}])", r"\1gwi\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)K’iche’(\s*[\|\}])", r"\1quc\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Kiche(\s*[\|\}])", r"\1quc\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)meta’(\s*[\|\}])", r"\1mgo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)meta(\s*[\|\}])", r"\1mgo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Meta’(\s*[\|\}])", r"\1mgo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Meta(\s*[\|\}])", r"\1mgo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)n’ko(\s*[\|\}])", r"\1nqo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)nko(\s*[\|\}])", r"\1nqo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)N’Ko(\s*[\|\}])", r"\1nqo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ᬩᬲᬩᬮ(\s*[\|\}])", r"\1ban-bali\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ᬩᬲᬩᬮᬶ(\s*[\|\}])", r"\1ban-bali\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ዕብራይስጥ(\s*[\|\}])", r"\1he\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ዕብራይስጥ(\s*[\|\}])", r"\1he\2"),

These are the comparisons of the first 10 codes. I thought there would only be some apostrophes missing, but the final results surprised me: the red dot is back. I'll continue with the other comparisons and hopefully only post the unexpected results, if there are any. The length of the lists continues to be the same with both methods. - Klein Muçi (talk) 11:46, 19 March 2021 (UTC)Reply

This red dot is U+FEFF zero width no-break space. It only occurs in the Amharic rendering of he:
{{#language:he|am}} → ዕብራይስጥ
I have added a snippet of code to strip U+FEFF ZWNBSP when encountered.
Here is a list of language names that contain U+2019 right single quotation mark:
  • Ale’uhtesch
  • Creole ta’ Haiti
  • Cànan Hawai’i
  • Du’ala
  • D’zongqa
  • Ge’ez
  • Ghomala’
  • Ghomálá’
  • Gwich’in
  • Ha’iihtesch
  • Ikigaluwa cy’Igisweduwa
  • Ingusch’sch
  • Je’orjesch
  • Ji’is-Ahl-Ättejohpesch
  • Jöödsch-Pers’sch
  • Kasach’sch
  • Kige’ez
  • Kon’kahnesch
  • Koro’eeshiyaan
  • K’iche’
  • K’ische’
  • Ma’ohresch
  • Meta’
  • Middelpers’sch
  • Mi’kmaq
  • N’Ko
  • N’ko
  • Ooldpers’sch
  • Pers’sch
  • Russ’sch
  • Sorbjan ta’ Fuq
  • Susbaint nach eil ’na chànan
  • Swahili’r Congo
  • Tiếng H’Mông
  • Tiếng Meta’
  • Tiếng N’Ko
  • Tschech’sch
  • Tschuwasch’sch
  • Uj’juhresch
  • Wittruss’sch
  • Yup’ik
  • alemán d’Austria
  • amazighe de l’Atlas central
  • cajun’i
  • crioll d’Haití
  • espagnol d’Amérique latine
  • espagnol d’Espagne
  • español d’América Llatina
  • flandrezeg ar c’hornôg
  • fris da l’ost
  • ge’ez
  • gheg d’Albania
  • gwich’in
  • gwich’inkielâ
  • inglés d’Australia
  • inglés d’Estaos Xuníos
  • isi-Meta’
  • isi-N’Ko
  • ki’che’
  • k’iche
  • k’iche’
  • meta’
  • mi’kmaq
  • noma’lum til
  • n’Ko
  • n’ko
  • n’koera
  • n’kó
  • quechua dell’altopiano del Chimborazo
  • same d’Inari
  • sami d’Inari
  • slavon d’église
  • s’čchuanská iovčina
  • vepsän kel’
  • xitoy (an’anaviy)
  • árabe d’Arxelia
  • árabe d’Exiptu
  • Ν’Κο
  • в’етнамская
  • гуіч’ін
  • комі-перм’яцька
  • нг’ембон
  • пап’яменту
  • т’яп
  • مېتاچە’
  • ग्विच’इन
  • অ’চিটান
  • এন’কো
  • কোৱাছিঅ’
  • গওইচ্’ইন
  • জোলা-ফ’নি
  • প’লিচ
  • ফ’ন
  • ল’ জাৰ্মান
  • ল’ৱাৰ ছোৰ্বিয়ান
  • ਗਵਿਚ’ਇਨ
  • એન’કો
  • ગ્વિચ’ઇન
  • ଗୱିଚ’ଇନ୍
  • ᎺᎳ’
  • 曼德文字 (N’Ko)
Trappist the monk (talk) 12:52, 19 March 2021 (UTC)Reply
Hmm, at this point I should probably stop the general testing and see if I can fix the missing right single quotation mark. I'll try copy-pasting a single entry from that list (a language name) into a wiki page, curl it, and see what really happens and how I can make it render like it should in bash. If I don't succeed, I'll go on with the general testing until the end, so we can at least fix other errors along the way and look at the remaining ones at the end. - Klein Muçi (talk) 13:08, 19 March 2021 (UTC)Reply
These are the language names that use the simple keyboard apostrophe/single quote:
  • Avañe'ẽ
  • Fe'Fe'
  • Gvich'in
  • Lamnso'
  • Mi'kmaq
  • Mka'a
  • Nda'Nda'
  • O'odham
Trappist the monk (talk) 13:38, 19 March 2021 (UTC)Reply
I did some experiments with N'ko and I finally found out what was happening, which will be hard to explain, so I'll try to be as clear as possible. Turns out that the script list gets the name of the language correctly, without missing any marks, and so does the manual way. The copy-paste to Notepad++ (the text editor that I use) works as it should, without leaving anything behind. The problem comes when I copy from Notepad++ to the terminal: as soon as I copy the text there, marks like these disappear. There is no problem with languages like Nda'Nda' or O'odham, but you get the same error if you try any of the languages you've mentioned above them (and for languages at the bottom of that list, you get squares with question marks because the terminal can't render those languages at all - even though it still understands them, strangely enough, as far as I've seen from some testing with the corrections for languages like those). My plan for the moment is to somehow make the terminal render strings like these correctly, and maybe then do a final test between the total results from the script and the total results gotten manually and see what's changing. If I'm able to get past the terminal problem. - Klein Muçi (talk) 00:50, 20 March 2021 (UTC)Reply
Ok, so, update: turns out the Windows command line (cmd.exe) can't handle the characters we've been discussing, so I switched to Git Bash. I reverted your changes on module Smallem to test, and apparently Git Bash can "understand" the "red dot" characters we had removed and doesn't show a change when compared to the manual list. I'm in a dilemma whether it would still be better to remove them or not. What do you think? For the moment, just so we don't get confused, I'm leaving them removed; tell me later what we should do. I'm in the process of checking all the codes in batches of 10 entries. When I finish, I'll compare the total lists and report back here what's causing the change in the numbers. - Klein Muçi (talk) 02:56, 20 March 2021 (UTC)Reply

(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ইগ্<200c>বো(\s*[\|\}])", r"\1ig\2)), Even though I'm still not finding any changes between lists, I found some lines appear like this (this was only one of many) when I was on these codes be-tarask, be-x-old, bg, bh, bi, bjn, bm, bn, bo, bpy. Usually I get that symbol (<200c>) whenever there's a "strange" character. I used to get the same with what we've talked above until you removed those. What do you think is happening there? - Klein Muçi (talk) 03:07, 20 March 2021 (UTC)Reply

(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)گﻩ<200c>ﻟیﻙی\ ﺲﻛۆﺖﻠﻫ<200c>ﻧﺩی(\s*[\|\}])", r"\1gd\2"), Another line, don't know if it is the same character in both of the examples. This one was on codes cho, chr, chy, ckb, co, cr, crh, cs, csb, cu
U+200C zero width non-joiner is required by some languages to prevent adjacent unicode codepoints from combining. For ig in Bengali, the language name contains গ্ and বো (both of which are combinations of 2 and 3 codepoints). Without U+200C ZWNJ you get গ্বো but with it you get গ্‌বো. Your second example (gd – from Central Kurdish ckb) has two U+200C ZWNJ (the flipflop between right-to-left Arabic script and left-to-right Latin script is confusing). U+200C ZWNJ is not a codepoint that should be removed or replaced.
Unless there is some way that you can replace the <200c> text with the U+200C ZWNJ codepoint in the regexes, the best that we can do is have sq:Moduli:Smallem skip language names that contain U+200C ZWNJ. I have tweaked Moduli:Smallem to skip them. See sq:Përdoruesi:Trappist the monk/Livadhi personal, which uses the same ten language codes that you listed above. Before the skip code was added, Moduli:Smallem returned 5250 regexes; after, 5245 (Bengali (bn) and Bishnupriya (bpy) both use Bengali script, so for the codes ig, kpe, hmn, and lv they produce the same regexes).
Trappist the monk (talk) 14:01, 20 March 2021 (UTC)Reply
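Expressed outside Lua, the skip amounts to filtering out any language name that contains the offending codepoint before a regex is generated for it. A Python sketch of that logic (illustration only; the actual module code may differ):

```python
ZWNJ = '\u200c'   # U+200C zero width non-joiner

def names_safe_for_regexes(language_names):
    # Skip names containing ZWNJ: downstream they arrive mangled as '<200c>',
    # so a regex built from them could never match real parameter values.
    return [name for name in language_names if ZWNJ not in name]
```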
Hmm, so much to learn about languages. But... I believe that even though the terminal doesn't show it, the replacement works out fine. I don't think we should remove those codes altogether. How can I test whether it works or not? What change, specifically, should Smallem be able to make? - Klein Muçi (talk) 14:49, 20 March 2021 (UTC)Reply
So, not sure about what I asked above; nonetheless, I'm going on with the comparisons to see if I can find any more "problematic" characters in other codes. And I did find these three (well, one) during the rendering of regexes for gor, got, gsw, gu, gv, ha, hak, haw, he, hi:
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)<200f>אינטרלינגואה(\s*[\|\}])", r"\1ia\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)<200f>וולאפיק(\s*[\|\}])", r"\1vo\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)<200f>נורדית\ עתיקה(\s*[\|\}])", r"\1non\2"),
Maybe time to make the module skip more codes? :P Although, as I said, I'm still wondering whether Smallem really doesn't know how to make these changes or not. The reason I say this is, pardon my ignorance, because there are some other cases when I don't get the <number-something> part but instead get some squares or just plain "space" (which is not space, per se, but more like invisible characters), and I've noticed that in those cases the bot works fine. I mean, I've only done 1 test, some time ago. I must disclose, though, that in those cases, when you copy the line into a text editor, the squares and the spaces are converted to what the string really is, while the <number-something> cases stay like that even when copied somewhere else, so maybe there is a difference. Again, sorry for the lack of terminology. - Klein Muçi (talk) 23:04, 20 March 2021 (UTC)Reply
And some more...
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ಟಷೆಲ್<200d>ಹಿಟ್(\s*[\|\}])", r"\1shi\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)<200b>កៀហ្ស៊ីស(\s*[\|\}])", r"\1ky\2"),
These ones on codes kg, ki, kj, kk, kl, km, kn, ko, koi, kr - Klein Muçi (talk) 23:41, 20 March 2021 (UTC)Reply
(edit conflict)
U+200F right-to-left mark. The three that you found are (apparently) the only language names with U+200F rtl mark. This 'character' should probably be removed just as we remove U+200E left-to-right mark so I have added that to sq:Moduli:Smallem.
I don't think that it is necessary to worry too much about square-with-question-mark characters; that is just an indication that whatever tool is displaying the codepoint doesn't have a font for that codepoint; the data behind the codepoint is likely not modified. But, we need to know about the plain "space" (which is not space, per se, but more like, invisible characters). We need to know about these characters because they might prevent a regex match if they are in the MediaWiki list but not in |language= as a human would write the parameter value.
Whatever is converting codepoints like U+200C zero width non-joiner from the native codepoint to the text string '<200c>' is a problem, because that 'string' won't be appearing in a language name in a cs1|2 parameter value as a human would write it. You wrote: "when you copy the line into a text editor, the squares and the spaces are converted to what the string really is". When I copy regexes that have U+200F rtl mark from Wikipedia's Moduli:Smallem rendering to Notepad++, the U+200F rtl mark is not converted to '<200f>'.
Trappist the monk (talk) 00:44, 21 March 2021 (UTC)Reply
U+200B zero width space, U+200D zero width joiner. I'll look into those tomorrow.
There are about thirty language names that use U+200D zero width joiner. These occur in mr, si, and tcy language sets. U+200D ZWJ is required to join certain codepoints together into a single character so we can do nothing with them. There are five U+200B ZWSP all from the Khmer language set. These are typographic indicators that can be used to indicate when a line break is permitted; I do not know if they are required or if humans writing these language names would include them.
Trappist the monk (talk) 15:04, 21 March 2021 (UTC)Reply
Trappist the monk (talk) 00:44, 21 March 2021 (UTC)Reply

Thank you for all the information! I have a clarification to make, though: regexes that have U+xxxX in them DO NOT get converted when you change the working space. The regexes that get converted are those with squares or "plain space". That's what I was saying, and why I believe the U+xxxX cases behave differently from the square/question mark AND plain space/invisible character cases. At this point we're suggesting the same thing, apart from the plain space/invisible character cases, which I believe behave the same as the square/question mark cases; in other words, the tool (Git Bash in my case) doesn't have a font to display them properly. But I don't know why some cases get squares/question marks and some get plain spaces/invisible characters. There are A LOT of both of these cases and, as might be expected by now, they all happen with non-Latin characters. I'll try to paste some of them here from the K-codes mentioned above that I was already working on, to give some examples. - Klein Muçi (talk) 01:18, 21 March 2021 (UTC)Reply

  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)토켈라우제도어(\s*[\|\}])", r"\1tkl\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)토크\ 피신어(\s*[\|\}])", r"\1tpi\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)통가어(\s*[\|\}])", r"\1to\2"),

For me, all the Korean characters above (it is Korean, no?) are shown as empty squares in Git Bash. Just 1 example.

  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)ᬩᬲᬩᬮᬶ(\s*[\|\}])", r"\1ban-bali\2"),

For me, the empty squares in the regex above (squares with question marks, if you look at the source code) are shown as just space/void. These cases are rarer than the empty square cases, and after searching through more than 30 codes I was able to find only this one again. But there are codes in which they appear more often.

So to make it clearer, there are 3 cases where the strings aren't rendered as they should be:

  1. The U+xxxX cases (which stay like that everywhere)
  2. The empty square cases (which change to text strings in other workspaces, including here and on Notepad++)
  3. The empty space cases (which change to empty squares or squares with question marks if you see the source code in here and on other workspaces like Notepad++... I believe. Can't be too sure because I was only able to find 1 of those for the moment.) - Klein Muçi (talk) 01:41, 21 March 2021 (UTC)Reply
So where do the <200c> etc text strings come from?
Trappist the monk (talk) 15:04, 21 March 2021 (UTC)Reply
I noticed that the 2nd case mentioned above only happens when using the manual method, so, in other words, when copying the results from the wikipage to the terminal manually. When you get the results there using the script, they render properly. This is referring only to the second case.
Apparently this is not true; it appears to be random. I got the empty squares even in script-generated results:
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Kʼabilan\ Scots\ Gaelic(\s*[\|\}])", r"\1gd\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Kʼicheʼ(\s*[\|\}])", r"\1quc\2"),
The single quote got rendered as an empty square in both of these script generated results. Strangely enough, on the next result it gets rendered properly:
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)K’iche’(\s*[\|\}])", r"\1quc\2"),
Hence, I believe it is random.
Side request: I was wondering... Should Module Smallem report an error if you put in a langcode with a typo or if you put the same code twice? For example, I want to put xal, xh, xmf, yi, yo, yue, za, zea, zh, zh-classical but instead, accidentally, I put xal, xh, xmf, yi, yo, yue, za, zea, zh, zh- or maybe even xal, xh, xmf, yi, yo, yue, za, zea, zh, zh. The way I think it could work is basically to understand what codes are possible and, if it finds alphabetical characters that don't form a code, or that form the same code twice, render an error accordingly. - Klein Muçi (talk) 02:19, 21 March 2021 (UTC)Reply
The characters 'ʼ' and '’' are not the same codepoints. The first one (from the group of three regexes) is U+02BC modifier letter apostrophe and the second is U+2019 right single quotation mark. I suspect that the U+02BC is probably the more semantically correct codepoint (the Kʼicheʼ language article here uses that codepoint). Nothing for us to do with these code points. Because they are different, sq:Moduli:Smallem will include both in the regex list.
I've added a check for language codes in |lang= that end with a hyphen. See sq:Përdoruesi:Trappist the monk/Livadhi personal.
Trappist the monk (talk) 15:04, 21 March 2021 (UTC)Reply
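For reference, a check like that can be as simple as a pattern test on each code taken from |lang=; a minimal sketch only (the args.lang and errors names here are illustrative, not necessarily how sq:Moduli:Smallem actually does it):
	local errors = {};
	for _, code in ipairs (mw.text.split (args.lang or '', '%s*,%s*')) do		-- |lang= is a comma-separated list of codes
		if code:match ('%-$') then												-- code ends with a hyphen, e.g. 'zh-'
			table.insert (errors, 'malformed language code: ' .. code);
		end
	end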
The empty space cases (which change to empty squares or squares with question marks if you see the source code here) don't always hold either. I just got a whole lot of these cases in the codes krc, ks, ksh, ku, kv, kw, ky, la, lad, lb. 4 examples:
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)קריאולית\ \(האיטי\)(\s*[\|\}])", r"\1ht\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)קריאולית\ \(סיישל\)(\s*[\|\}])", r"\1crs\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)קריאולית\ לואיזיאנית(\s*[\|\}])", r"\1lou\2"),
  • (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)קריאולית\ מאוריציאנית(\s*[\|\}])", r"\1mfe\2"),
Everything in Hebrew (it is Hebrew, no?) for me gets rendered as only empty space. I'm nearly sure I've seen Hebrew getting rendered properly before in my terminal so I'm starting to think this is random too. :/ - Klein Muçi (talk) 02:45, 21 March 2021 (UTC)Reply
In the above regexes, the Hebrew text is right-to-left so the escape character '\' precedes spaces and the parentheses on the right. There are no 'peculiar' code points in those language names.
Trappist the monk (talk) 15:04, 21 March 2021 (UTC)Reply
This is the first time I've dealt with Unicode characters at this level, so only now am I really starting to understand what's going on with all those non-printing characters and marks. Yesterday I was reading a bit about Unicode and codepoints and I hope I'll have a better vocabulary when referring to these problems in the future. For that I thank you, because it was only after your explanations that I started to dig further into the computing work behind scripts other than Latin ones. I have a naive question: shouldn't we make Smallem skip the languages that use the 2 characters mentioned above, like we did in other cases? What's different in this case? - Klein Muçi (talk) 15:32, 21 March 2021 (UTC)Reply
PS: The error is good. Shouldn't there be an error even for cases like this though: xal, xh, xmf, yi, yo, yue, za, zea, zh, z? Or is that not possible? - Klein Muçi (talk) 15:37, 21 March 2021 (UTC)Reply
You mean: should sq:Moduli:Smallem skip U+200B zero width space and U+200D zero width joiner like it skips U+200C zero width non-joiner? Don't know. Perhaps, perhaps not. You didn't answer my question: So where do the <200c> etc text strings come from? If that text string is created by you actually editing the regex, then we should not bother to skip those language names. If those strings are created by a machine, then, yes, language names with U+200B ZWS and U+200D ZWJ should also be skipped.
Here is an interesting oddity:
  1. {{#language:sq}} → shqip – 'to' language tag not specified so returns autonym
  2. {{#language:sq|sq}} → shqip – 'to' language tag specified as Albanian so returns Albanian language name in Albanian
  3. {{#language:sq|sq-}} → Albanian – malformed 'to' language tag not ignored as one would expect but instead, returns Albanian language name in English
  4. {{#language:sq|sq-L}} → Albanian – malformed 'to' language tag not ignored as one would expect but instead, returns Albanian language name in English
  5. {{#language:sq|s}} → shqip – malformed 'to' language tag is ignored as expected so returns Albanian autonym
I've changed sq:Moduli:Smallem so that all codes in |lang= must be found in MediaWiki's English list of language codes.
Trappist the monk (talk) 17:54, 21 March 2021 (UTC)Reply
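A lookup like that can lean on Scribunto's built-in language-name list; a rough sketch (the function and variable names are illustrative only, not the module's actual code):
	local known = mw.language.fetchLanguageNames ('en', 'all');				-- table keyed by language code; values are English names
	local function is_known_code (code)
		return nil ~= known[code];												-- true for 'sq', false for 'zh-' or 'zz'
	end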

Hmm, I thought I had made that clear already... Those kinds of text strings, together with the other oddities I mentioned above (the 3 cases I explained), all show up in regexes that appear in my bash (shell). I use Git Bash (I used to use cmd.exe) to SSH to ToolForge, where I have an account. I've written a script (part of a bigger script aiming to deal with the whole update of Smallem's source code - the list of regexes, basically) that gets the results from this page using cURL and saves them in a file. When I look at those results, the ones I've sent here appear exactly as I sent them, with the <200c> etc. text strings.

Interesting results, the ones you showed. I'm curious whether other languages may show problems like these as well. I don't know who exactly "looks after" these codes. Are they all handled by MediaWiki developers not directly related to the languages they deal with? Or are there global volunteers somehow helping with this? I believe we've already talked about this in the past. Either way, I expect small wikis' languages to suffer more from inaccuracies like this. - Klein Muçi (talk) 18:53, 21 March 2021 (UTC)Reply

Still isn't crystal-clear – too many words talking about something else. Regardless, I have tweaked sq:Moduli:Smallem so that it skips language names that have U+200B zero width space, U+200C zero width non-joiner, or U+200D zero width joiner codepoints.
Trappist the monk (talk) 13:29, 22 March 2021 (UTC)Reply
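A skip test along those lines might look something like this (a sketch only; the has_invisible name is made up and the real module may do it differently):
	local invisibles = table.concat ({
		mw.ustring.char (0x200B),												-- zero width space
		mw.ustring.char (0x200C),												-- zero width non-joiner
		mw.ustring.char (0x200D),												-- zero width joiner
		});
	local function has_invisible (name)
		return nil ~= mw.ustring.find (name, '[' .. invisibles .. ']');		-- true when the language name contains any of the three
	end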
I'm really sorry. Let me rephrase: I've created a bash script that basically logs in to Wikipedia's API through Smallem's account, gets the regex lines from that page I mentioned above and saves them in a file on the ToolForge servers. When this happens, some results in the saved file look normal and some look like what I've sent you above. That's it. If I manage to make the process work fine, Smallem will ultimately use this file to do the regex replacements it usually does, so that's basically an autoupdate for it. Is it more clear now? - Klein Muçi (talk) 13:39, 22 March 2021 (UTC)Reply
I finished comparing the lists (script/manual) for every code. There are no U+xxxX characters anywhere anymore and every list compared is exactly the same. So basically the script method is as good as the manual one. But I have yet to compare the total script list with the total manual list, which is where the errors usually happen, no idea why. I will do that tomorrow and report back with the results. Meanwhile, some feature requests related to the module:
  • I believe module Smallem should report an error if we're using |list=<number> and |plain=yes. That's because it literally is an error. You can't group the codes if you have plain mode activated and it is confusing if you forget that detail because nothing changes.
  • Is it possible somehow to have the total number of regex lines rendered at the bottom of the language codes? (In both modes.) That would be a gamechanger. I'm talking about the grand total.
  • Would it be better if, when module Smallem skips a certain language, the "skipped x language, from x.wiki" message also showed the reason why it was skipped? That way we could easily remember which characters are problematic now and possibly try to find ways to handle them in the future instead of plainly skipping them.
    • This reminds me of the first languages that we skipped because my command line couldn't read them. I recently switched back from cmd.exe to Git Bash and, unlike cmd, Git Bash gets updated more frequently by its maintainers, so maybe it could handle them now. On top of that, maybe it would be wise to simplify the "skipping mechanism" and have it stand somewhere on its own in the module code, so I can make changes to it easily for testing purposes. Maybe adding new languages to be skipped could stay harder, but removing them should be easy enough to use it as a switch: maybe you just remove the name from a certain "list" or comment out a certain line. I'm thinking of something similar to Module:CS1/Configuration. Maybe it already is like that; I'm just brainstorming without much prior knowledge of the actual code arrangement in Module:Smallem. When only 1 language was skipped, things looked a bit different. But now we're skipping many languages and the number could grow when trying to fix the problems with the total number. Couple that with the fact that Git Bash may well improve in the near future and the whole thing takes on more importance now. - Klein Muçi (talk) 04:05, 23 March 2021 (UTC)Reply
Is |list=<number> and |plain=yes really an error? |list= has to have some sort of an assigned value else sq:Moduli:Smallem attempts to make a regex list from the values assigned to |lang=.
The only way to get the maximum-possible count of regexes that Moduli:Smallem thinks it will produce is to run lang_lister() twice; once against the whole list of possible language codes (output suppressed – no error or skipped messages unless something is so wrong that it can't continue) and a second time against the list of language codes in |lang=. I can imagine Moduli:Smallem overrunning its allocated time if we try to do that.
I have thought about that now that the list of things-to-skip is growing.
I'll think about that.
Trappist the monk (talk) 14:17, 23 March 2021 (UTC)Reply
Hmm, but maybe a warning at least? Call me stupid if you want, but I set |list=10 in the past and spent a couple of minutes not understanding why nothing had changed. Then I noticed |plain=yes and I remembered... In general the |list= parameter confuses me because, at least in my mind, it "doubles up" in function given the values it can take: those can be either yes or no OR numbers. I understand I can put anything instead of yes/no, I'm just simplifying here. So you basically have 2 concepts merged into one parameter, at least in my mind. Coupling that with the |plain=yes situation makes it more confusing. It's sad that there's no way to show the total number, but... what to do? - Klein Muçi (talk) 15:53, 23 March 2021 (UTC)Reply

Hey man! I'm sorry that I kinda disappeared, especially when I was supposed to do basically the last step in testing, the reason why we've been making all the aforementioned adjustments. I made some changes to my internet at home and that required some days to take effect. I made Smallem do a full autoupdate run. The total I get is 40635 lines. From that, remove 17 lines which are other regexes and command lines, and you get 40618 regex lines. I believe the number is still off from the right total, no? Even after making the module skip some "problematic" entries. I'm just asking now because I haven't checked the new total after we made the changes. - Klein Muçi (talk) 10:23, 30 March 2021 (UTC)Reply

sq:Moduli:Smallem says 42446.
Trappist the monk (talk) 12:50, 30 March 2021 (UTC)Reply
I got 40864 results now with 1 code per request. The first run was with 10 codes per request. I'm gonna try creating the list manually and see what's missing. Is there any way I can show you the missing results here? I mean, we're talking about 2k results and I don't wanna overflow your page. Any other way? Maybe you can get an idea of why those codes are being left behind when you see what's actually being left behind. Because it is really strange: we basically tried every code together and there is literally no discrepancy between the script list and the manual one. Only when I make the script run all the codes automatically do I start having problems, which seems sort of bizarre to me. - Klein Muçi (talk) 21:28, 30 March 2021 (UTC)Reply
Apparently we're talking about exactly 406 lines. Still too many to copy-paste here. I'll try to find a webpage to host them temporarily and I'll post the link here when I can do that. Meanwhile, I saw the new messages you get when languages are skipped by Module Smallem. I like them a lot. - Klein Muçi (talk) 22:54, 30 March 2021 (UTC)Reply
These are the lines that are missing. (At least in this case. Unfortunately I still get different results every time.) Notice any pattern whatsoever in them? - Klein Muçi (talk) 23:00, 30 March 2021 (UTC)Reply
Zulu (zu) last code in the list of codes. Perhaps the results of the last code aren't being saved?
Trappist the monk (talk) 23:34, 30 March 2021 (UTC)Reply
So all those lines are coming from 1 single code? Pretty interesting! But Zulu has 859 lines. You're saying that only a part of those isn't being saved? I'm redoing the full run now (1 code per request) and I'll see what discrepancies I get this time. I so hope they're again from the Zulu language. That would mean I'd have finally found a pattern. - Klein Muçi (talk) 00:02, 31 March 2021 (UTC)Reply
I can't say that they all come from code zu. Most of the language names begin with 'isi-' so I spotchecked several of them; all that I checked were zu. I have looked at all of the language names at the top and bottom of the list and all of them are zu except 'Patois' (jam):
Patois; – {{#language:jam|zu}} → Jamaican Creole English
Don't know where Patois comes from.
It isn't surprising that Zulu by itself has 859 lines but the list you gave me has only 406. I suspect that there are a lot of codes that fall back to another language's name (probably English) so those would be duplicates that should be removed. That suggests that these 406 are not duplicated by any other language.
Trappist the monk (talk) 01:02, 31 March 2021 (UTC)Reply
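Duplicate removal of that sort is usually just a 'seen' table; a minimal sketch (the language_names list here is assumed, not a name from the module):
	local seen, unique_names = {}, {};
	for _, name in ipairs (language_names) do									-- names gathered from every language code
		if not seen[name] then
			seen[name] = true;
			table.insert (unique_names, name);									-- keep only the first occurrence of each name
		end
	end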

Second list of discrepancies. Much to my amazement, the change between iterations wasn't that big. The same results were missing except for (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Patois(\s*[\|\}])", r"\1jam\2"), which wasn't missing in the second iteration. 405 missing results in total in the second run. I'll try it again because I believe the close numbers were just a coincidence. I'll post the results here in the same way when I have them. - Klein Muçi (talk) 00:56, 31 March 2021 (UTC)Reply

Hmm, missed that detail. But that's even better. I'm currently running a third test as I said, but after it completes, if the results again stay in the "Zulu realm", can we make module Smallem not generate the zu code temporarily? I can then do 2-3 tests and see what happens. If we get no discrepancies in any of those tests, we will have successfully isolated the problem. My belief though is that the test I'm running now won't stay in that realm. :/ - Klein Muçi (talk) 01:22, 31 March 2021 (UTC)Reply
Completed test 3. Got exactly the same results as test 2. This is strange. Could it be that we fixed the problem of results changing in every test by fixing the problems with those 3 non-printing characters? I still believe this is just pure luck. I'll run 2 more tests tomorrow and see what happens. If it really doesn't change then we can go on with the removal of zu from Module:Smallem. But this is really strange and inspiring at the same time. - Klein Muçi (talk) 02:40, 31 March 2021 (UTC)Reply
Completed test 4. Again exactly the same results. I'm genuinely surprised that we might have solved the inconsistency problem. After some hours I'll try the last test. - Klein Muçi (talk) 11:12, 31 March 2021 (UTC)Reply
Instead of dropping zu, perhaps it would be better to invert the language code sort so that zu is first and aa is last. If the result is the same then that confirms your something-wrong-with-zu-handling hypothesis. If the result shows that all of the lost regexes are from aa then that suggests that somewhere the last groups of regexes isn't being correctly handled.
Easy to invert the sort; let me know when you're ready for that.
Trappist the monk (talk) 11:22, 31 March 2021 (UTC)Reply
Test 5 complete. Again no difference whatsoever. Apparently we've solved the inconsistency problem. That was one of my 2 main problems with the script, so it seems unbelievable to me. You can go on with the code inversion now. - Klein Muçi (talk) 23:42, 31 March 2021 (UTC)Reply
Done.
Trappist the monk (talk) 00:00, 1 April 2021 (UTC)Reply

Test 1 with inversion complete. Things are starting to go into the unknown again. I got 41269 lines in total. Remove 17 "other" lines and you get 41252 lines of regexes. I'll soon post here the exact missing lines. - Klein Muçi (talk) 01:20, 1 April 2021 (UTC)Reply

Soo... I'm feeling extra confused by now. I concatenated the manual full list with the script full list and asked to get only the unique lines, which is what I've been doing all this time. And... I get only 1 unique line:
(r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Qafár\ af(\s*[\|\}])", r"\1aa\2"),
But that doesn't make much sense considering the total number I have and the total number I should be getting. Does that mean anything to you or is it a glitch and I should redo the whole procedure somehow? - Klein Muçi (talk) 02:05, 1 April 2021 (UTC)Reply
Test 2 complete. Again 41269 lines. At least we can be sure that the inconsistency is gone for good. I'll try sorting and comparing again. Could it be that the manual list I have is outdated? Could it become outdated in just 2 days? - Klein Muçi (talk) 11:01, 1 April 2021 (UTC)Reply
And again, only that 1 line above gets chosen as unique. I'm confused. - Klein Muçi (talk) 11:26, 1 April 2021 (UTC)Reply
I think that what it means is that aa has only one language name written in Afar, its autonym. All other language codes fall back to some other language (likely English):
  • {{#language:aa|aa}} → Qafár af
  • {{#language:de|aa}} → German
  • {{#language:en|aa}} → English
  • {{#language:fr|aa}} → French
  • {{#language:sq|aa}} → Albanian
So, this means that there is nothing wrong with zu and nothing wrong with aa. Your aa results tend to lend credence to the notion that the last group of regexes isn't being included in the final list of regexes (it's the automated list that is coming up short, right?)
Trappist the monk (talk) 15:12, 1 April 2021 (UTC)Reply

Yes, that was my initial belief too, ever since you mentioned that all the problems arise with zu, which was the last code. The reason for that lies in the way the script works. Speaking in general terms, it gets the list of codes from a wiki page and then transforms that concatenated one-line list into a one-code-per-line list (using the commas as delimiters). Then it starts a loop that takes 1 line from that list (so 1 code), creates an API request with it, saves the results (the list of regexes generated) in a file and removes that 1 line (so 1 code) from the overall code list. Then it takes the next line from that list (which is the next code in line, because the line before it just got deleted), creates an API request with it... and so on until it runs out of codes. Since I'm not good at scripting, I may have designed the loop badly so that the results from the last API request fail to be saved, or whatever it is. So, you arriving at the same conclusion is a good sign, because that would mean I just need to fix the loop and one of my 2 problems would be solved. The thing that confuses me though is that the automated list, which is the one coming up short, is not short by just 1 entry compared to the manual list. If we do a simple subtraction, around 1000 lines are missing, no? Or am I being misled by the logic I'm using? Either way, in order for us to be totally sure that this really is the problem we're having, I have to ask: how easy is it to tweak module Smallem to create the environment for this experiment, making it produce only 2 codes? One being en and the other a "popular" language that we know doesn't have any fallbacks. Maybe Italian (it)? Then we can see what happens. If all the regexes from it go missing, then we're sure the problem is in the loop and I have to fix that. I don't know how easy or hard that is for you to do, though. If it is hard, maybe I can tweak my script to work with only those 2 codes, but I'm a bit reluctant to make further changes to it before being sure what the current problem is, for fear of introducing even more bugs. - Klein Muçi (talk) 15:40, 1 April 2021 (UTC)Reply

If I understand you, you want the code list returned from {{#invoke:Smallem|lang_lister|list=yes |plain=yes}} to be CodeListBegin:en, it:CodeListEnd. In that case, sq:Përdoruesi:Trappist the monk/Livadhi personal.
Trappist the monk (talk) 16:21, 1 April 2021 (UTC)Reply
Yes, make the code list generate only 2 hardcoded codes: en and another one that doesn't have fallbacks. I'm currently on the move and can't properly check what you've put together, but I'll write as soon as I have my laptop back. - Klein Muçi (talk) 16:28, 1 April 2021 (UTC)Reply
Redid the test. (You had done exactly what I had asked.) I get 876 results (with 17 added "other" lines). Manually I get 1464 (without the 17 added "other" lines). I'll see what lines are missing now. - Klein Muçi (talk) 20:00, 1 April 2021 (UTC)Reply
The missing results. 605 results. It says Regex count: 859 though. Do you think fallbacks are to blame for the discrepancy? - Klein Muçi (talk) 20:07, 1 April 2021 (UTC)Reply
When I compare the Italian total results with the 605 results above, these are the unique results: Here. Basically, these are the fallbacks, and when you see the results it makes sense to treat them as fallbacks. My hope was that it wouldn't have any fallbacks, so the total number of regex lines from that code would also be the number of lines missing from the grand total (given that there would be only 2 codes and only en would be used). And that would confirm that the problem is the loop not saving the results from the last code. But maybe the experiment above is enough as proof nonetheless? - Klein Muçi (talk) 20:16, 1 April 2021 (UTC)Reply
876 (the total list) - 17 = 859 though, the exact total of en so I guess... - Klein Muçi (talk) 20:18, 1 April 2021 (UTC)Reply

I was able to fix the loop and now I get 1481 lines every time, which is 1464 + 17, precisely how many lines there should be. I'm excited! Can you revert your hardcoding change to module:Smallem so I can do a full run with all the codes now and see what happens? - Klein Muçi (talk) 22:53, 1 April 2021 (UTC)Reply

done; ascending order sort restored.
Trappist the monk (talk) 23:00, 1 April 2021 (UTC)Reply
Much to my dismay, I get 40865 lines in total and not 42463. Judging by that number it does look like I haven't changed anything. But I did try it 3 times and, with the changes I made to the script, it gave me the right total when only 2 codes were involved. :/ I'll try to see what specific lines are missing now that I've fixed the problem we had. - Klein Muçi (talk) 00:39, 2 April 2021 (UTC)Reply
Aaand, no. I get the zu regex lines missing again. :/ I'll retry switching the module to 2 codes/debug mode to be totally secure it works fine in that mode. :/ - Klein Muçi (talk) 01:03, 2 April 2021 (UTC)Reply
No, it works fine like that. Tried it some more times now. Any suggestions before I'm forced to try the brute force way by adding the codes one by one in debug mode to see when it starts malfunctioning? - Klein Muçi (talk) 01:13, 2 April 2021 (UTC)Reply
So, I went on with the "brute force" method I mentioned above. I first added 10 codes in debug mode and got no discrepancies. Then I went on and added 10 more. Here things start behaving not as you'd expect them to. The manual list gives 4484 results. I get 4483. (r"(\{\{\s*cit[aeio][^\}]*\|\s*language\s*=\s*)Qafár\ af(\s*[\|\}])", r"\1aa\2"), This one goes missing. Any idea what might be happening? Do you think it's still the same problem that we're seeing? (The results from the last code are being ignored?) These were the codes used: aa, ab, ace, ady, af, ak, als, alt, am, an, ang, ar, arc, ary, arz, as, ast, atj, av, avk Keep in mind that it worked perfectly fine with aa, ab, ace, ady, af, ak, als, alt, am, an. - Klein Muçi (talk) 10:20, 2 April 2021 (UTC)Reply

Meh, I redid some more runs with the same codes. Sometimes I get that line missing, sometimes I don't. I'm disappointed that the inconsistencies between runs are back. I know they are related to non-printing characters somehow confusing my script, but I don't know how, because the problems started happening as soon as the module started skipping results for the first time. Is there anything special about that single line? I believe whatever is happening in this case is happening throughout the whole process, and the missing lines accumulate in the end to give the total that is 1k short. There must be another character somewhere (not necessarily in that line) that we also need to skip. I'm gonna take a close look at the whole list of results and see if I find anything strange. - Klein Muçi (talk) 10:40, 2 April 2021 (UTC)Reply

Nope, I couldn't find anything too strange. :/ At this point I'm out of suggestions. I don't know why I start getting inconsistencies with these specific codes and why I get a total that is 1k short in the end. I think I did fix the part of the script that was responsible for failing to save the last results, but somehow... :/ Maybe I haven't? - Klein Muçi (talk) 11:00, 2 April 2021 (UTC)Reply
In that group of twenty language codes, there are 59 regexes that use non-ASCII characters in their names. In those 59, the only 'invisible' character used is U+200E left-to-right mark which sq:Moduli:Smallem removed. So the problem is not with invisible characters.
I can mimic the results that you got by removing aa from the group and from the |lang= parameter. The Qafár af regex is the only unique language name contributed by aa (the autonym); all the other language codes fall back to English names, which are contributed by the other languages in the group. Perhaps your 'fix' to the tail end disrupted the head end?
I have tweaked Moduli:Smallem so that |lang= gets the same language codes that are listed in the debug lang_codes (line 104). What happens if you swap aa with ab?
Trappist the monk (talk) 14:08, 2 April 2021 (UTC)Reply
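Stripping that mark is essentially a one-line gsub; a sketch (not necessarily the module's exact line):
	name = mw.ustring.gsub (name, mw.ustring.char (0x200E), '');				-- remove U+200E left-to-right mark from the language name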
Hmm, I'm not sure I quite understand your last paragraph. If I've understood correctly what you have changed, I'm not sure my script can work properly with module:Smallem anymore. And what exactly do you mean by swapping? :/ - Klein Muçi (talk) 20:41, 2 April 2021 (UTC)Reply
Apparently it kept working. I removed aa from Module:Smallem (that's what you meant, no?) and tried a full run. I got no problems. 4500 results in total: 4483+17 "other" lines. What does this mean? I'm following blindly now. - Klein Muçi (talk) 21:06, 2 April 2021 (UTC)Reply
Did 3 runs. I always got 4500 lines. No inconsistencies. - Klein Muçi (talk) 21:24, 2 April 2021 (UTC)Reply
I meant change line 103 to:
lang_codes = {'ab', 'aa', 'ace', 'ady', 'af', 'ak', 'als', 'alt', 'am', 'an', 'ang', 'ar', 'arc', 'ary', 'arz', 'as', 'ast', 'atj', 'av', 'avk'}; -- debug
Trappist the monk (talk) 21:29, 2 April 2021 (UTC)Reply

Okay then. I'll do that now. And what's the hypothesis we're testing with that? - Klein Muçi (talk) 21:35, 2 April 2021 (UTC)Reply

Perhaps your 'fix' to the tail end disrupted the head end?
Trappist the monk (talk) 21:42, 2 April 2021 (UTC)Reply
I get 4501 lines. Consistently. So I've ruined the head? :P - Klein Muçi (talk) 22:00, 2 April 2021 (UTC)Reply
I don't know. But since I could mimic the problem by removing aa, perhaps you did...
Trappist the monk (talk) 23:03, 2 April 2021 (UTC)Reply
I switched to lang_codes = {'aa', 'ab', 'ace', 'ady', 'af', 'ak', 'als', 'alt', 'am', 'an', 'ang', 'ar', 'arc', 'ary', 'arz', 'as', 'ast', 'atj', 'av', 'avk'}; -- debug. I still get 4501 results now. The correct number. Without making any changes to the script. :/ Should we revert the debug mode now and try a full normal run? Though this doesn't make much sense: we basically didn't change anything, no? :P - Klein Muçi (talk) 23:22, 2 April 2021 (UTC)Reply
I'm confused. Are you suggesting that something, somehow, fixed itself?
Trappist the monk (talk) 23:40, 2 April 2021 (UTC)Reply

Well... Basically, yes... I mean, my initial idea was to keep adding codes to the "old" debug mode gradually until I started having problems (which happened when I added the second group of 10 codes). Even then, the problems weren't consistent; they happened only half of the time. Then you changed module:Smallem to introduce the "new" debug mode which automatically takes care of the code and regex lists. I tried a run with that with aa missing, then tried it with aa in second place (after ab) and then tried it with aa at the beginning. All the tests brought the correct, expected results, without any change to the script. Now I don't know what to do. :/ We either switch it back to the normal form, or continue adding other codes in this mode and see what happens. The problem with the second option though is that you have changed module:Smallem to add the codes automatically to |lang=. When more than 10 codes are used simultaneously, the script tends to malfunction in getting the generated regex lines and starts saving empty results. I don't know how it will behave if we add 10 more codes and are basically forced to do 30 codes simultaneously. - Klein Muçi (talk) 23:53, 2 April 2021 (UTC)Reply

And, yes. I tried adding 10 more codes (changing the codes added to |lang= from 20 to 30) and the end result after the script completed its run was just those 17 "other" lines, empty of any regex lines, as expected. :/ If we want to keep experimenting and add codes gradually in groups of 10, we need to go back to the old debug mode where codes weren't added automatically to |lang=. I don't know what else to try now, really, because everything seems to be working fine at the moment. :P In this mode, with only those 20 codes, that is. - Klein Muçi (talk) 00:32, 3 April 2021 (UTC)Reply
I made sq:Moduli:Smallem automatically render the code and regex lists from the same set of codes so that you don't have to edit the module and then edit whatever page it is that Smallem gets the code and regex lists from for each test you wanted to do. You can disable the debug mode by simply commenting out lines 103 & 104. You can fix the |lang= value to some subset of lang_codes by writing a replacement for line 104:
args.lang = 'ab, ab, ace';
or whatever you want.
Perhaps some time away is in order. Do something else for a while and then come back to Smallem...
Trappist the monk (talk) 00:39, 3 April 2021 (UTC)Reply
Yeah, I fully understand your intention and it does help a lot when doing it manually, but when you put the script to work, it starts bringing out more problems than it should. And yes, soon I'll be forced to accept that suggestion. My last try is to switch back to the old debug mode and retry entering the codes gradually, to see if I spot any pattern or anything else of the sort I can grab onto, if you understand what I mean. If that fails as well, I'll give up, at least for now. To be honest, this part of the Smallem project has taken me way more time than I thought it would and I have been ready to abandon it a couple of times, only going forward because of the help you gave me and the thought that I've already written a full script able to do 90% of the job correctly. Maybe I'll annoy you with some last messages as I run the last tests over these days (what I described above), but I'd like to clarify that you shouldn't take all my messages as requests. Sometimes, as you may have already seen, I'm really just brainstorming, and other times I just share the information here so that if I need to ask for help in the future, we're both on the same page. Only answer whenever you can. I'm sorry for the continuous red +1 (or more) you might get because of me. Every once in a while I forget and try using emails, but you don't have that option. :P - Klein Muçi (talk) 01:06, 3 April 2021 (UTC)Reply
Hey there! I tried everything I could but I noticed that the results still weren't consistent all the time. As a last resort, I tried asking for help on Stack Overflow. Here, if you want to take a look. I don't think I'll really find help there (24 hours have passed and I still haven't got a single answer) but at least I tried everything I could, everywhere. I wanted to ask a question out of personal curiosity, feel free not to answer it. What do you work on/deal with in everyday life apart from Wikimedia? Programming? I really find it surprising that you find the time and nerve to deal in detail with technical requests from me and others from all around the world. :P I'm, of course, thankful for that, but I was also curious. As I said though, feel free not to answer if you don't want to. :) - Klein Muçi (talk) 09:36, 5 April 2021 (UTC)Reply

SS Jacona (1918) edit

I have noticed that you have some interest in this article I created. I am on a quest to make as many Good Articles as I can this year. If you think this is a possibility, would you be interested in copy editing the article and nominating it at GAN? You would get credit for a Good Article as the nominator and I would get credit as the creator of the article when it gets promoted. I will be glad to solve most of the issues the reviewer brings up.--Doug Coldwell (talk) 14:34, 6 April 2021 (UTC)Reply

I don't have any interest in SS Jacona (1918) except to fix deficiencies when I see them. I don't think that I've looked at that article since I made the one edit that I did make 5-ish years ago. And, alas, I am wholly immune to enticements like GA credit.
You might want to look at the article's referencing ... Business week, Electric Journal, Hearst Magazines, and Marine Engineering are not authors so do not belong in |last=.
Trappist the monk (talk) 15:16, 6 April 2021 (UTC)Reply
Thanks for hints on authors. I'll look into that.--Doug Coldwell (talk) 12:04, 7 April 2021 (UTC)Reply

developer-discouraged edit

Regarding: I don't know what to think about developer discouraged. Outside of the RFC, is there such a thing?
No, I literally just made it up. I'm not exactly wedded to the term, and I just wanted people to understand what the exact status of these parameters was (to replace the old definition of deprecated but also describe something where support won't necessarily ever be removed).
I was thinking exactly that it would be used for maintenance category names (figured it would be something like Category:CS1 maint: nonhyphenated parameter), but that's not really for me to weigh in on. –MJLTalk 18:14, 7 April 2021 (UTC)Reply

A cupcake for you :D edit

  How long have you been working, and this is the only way I know how to send a message haha, and thanks for being here, I saw you working on tons of articles before! Ilikememes128 (talk) 14:23, 13 April 2021 (UTC)Reply

Template:Cite letter edit

Hello, Trappist,

All of a sudden, this template is showing a red link category but the most recent edit was by you in May 2020 so I'm not sure what caused things to change. Typically when red link categories appear, they are due to a recent edit by a new editor but that's not the case here. I don't like to edit templates, well, except userboxes that are causing problems, so I was hoping you could look this over and see what the problem is. Thank you in advance. Liz Read! Talk! 21:13, 13 April 2021 (UTC)Reply

(talk page stalker) It looks like this edit to the documentation caused the change. If the edit was accurate, it appears that the category should be created. – Jonesey95 (talk) 21:23, 13 April 2021 (UTC)Reply
Created.
Trappist the monk (talk) 21:44, 13 April 2021 (UTC)Reply
Thanks to you both, Jonesey95 and Trappist the monk. That was a swift response! Liz Read! Talk! 23:42, 13 April 2021 (UTC)Reply

Date validation again... edit

Hi, Trappist. I hate to bother you about this, but would you mind re-adding this code you added a while back (or a revised version of it) into the current version of the sandbox? I tried to add it back, but the local month names didn't work. Here are two testcases pages. – Srđan (talk) 21:45, 12 April 2021 (UTC)Reply

The supported date formats listed on your testcases pages are working.
Trappist the monk (talk) 23:44, 12 April 2021 (UTC)Reply
Could you check why some date ranges don't work here? Stuff like "1–2 March 2021" works if it's in English, for that example, but doesn't if I use the local format. – Srđan (talk) 16:26, 16 April 2021 (UTC)Reply

Category:CS1 maint: discouraged parameter has been nominated for deletion edit

 

Category:CS1 maint: discouraged parameter has been nominated for deletion. A discussion is taking place to decide whether this proposal complies with the categorization guidelines. If you would like to participate in the discussion, you are invited to add your comments at the category's entry on the categories for discussion page. Thank you. Fram (talk) 17:14, 16 April 2021 (UTC)Reply

If you ever have time edit

@Trappist the monk:, hello from el.wiktionary. I am so grateful for your lessons on Lua's a-b-c and how to make data modules. They helped make possible many modules for our small wiki, like this one! My understanding of Lua is limited to "if bla then xxx end" but unfortunately we have no Luaists around anymore, so we have to proceed with whatever we can.
If you ever have time, could you help with an extra question (it is not urgent at all, and things function OK as they are now)? Some modules are becoming SO big! I do not know how to extract a large part and place it in an outside module, or at a subpage. These parts are not data; they have lots of "ifs". For example:

  • This declension module (which makes tables like this one) has a large section for Articles (from line 215 to 418). Is it possible to put that in a separate module? I have done so at Module:el-articles, but I do not know how to phrase the necessary introduction or how to call its args.

Thank you, and excuse my bothering you with such questions! Sarri.greek (talk) 11:03, 26 April 2021 (UTC)Reply

Thank you, thank you for your help @Trappist the monk: and your edits at the above modules. I will study them and use them! Sarri.greek (talk) 14:47, 26 April 2021 (UTC)Reply
Don't be in such a hurry. Next step for you is to replace lines 215–418 with this:
require ('Module:el-articles').articles (args)								-- modifies args{}
I did that and previewed el:wikt:Πρότυπο:el-κλίση-'αγρός' with the modification and it looked to me like it worked but I don't read Greek so I don't really know.
I added the is_set() function so that tests like this:
if args['ακε'] ~= '' and args['ακε'] ~= nil then args['ακε'] = args['ακε'] else args['ακε'] = '' end
can be reduced to:
if not is_set (args['ακε']) then args['ακε'] = '' end
That can be further reduced but I'll save that for another day.
If the utility functions create_link(), stem_color(), etc are used by multiple modules, you might consider creating a separate utilities module that other modules in this family of modules might use.
Trappist the monk (talk) 15:15, 26 April 2021 (UTC)Reply
Thank you again @Trappist the monk: for taking so much time to teach. And you understand Greek articles very well, because you immediately spotted my mistake: I forgot one function (tin).
For tiny functions like create_link, it would be nice to see them in the module (because elsewhere, a language link is involved). For larger sections, like these articles (which could also be used in other modules: for nouns, for adjectives, ...) it would be helpful to have them as standalone.
Of course I have tried to use m_art, but I called it in the wrong way. I am going there to test again. I am also testing Module:grc-articles-test which is not operating at the moment, so any mistake would be of no consequence.
I do not understand the is_set thing, but never mind. PS That 'local' thing has been driving me crazy for many months. Finally, I decided never to use it again in my life.... yes yes I know it is needed... Thank you, mentor! Sarri.greek (talk) 15:49, 26 April 2021 (UTC)Reply
It works!!! hooray, hooray, m_art = require... was not needed. Thanks. This is going to make things easier for many modules!! You are great. Sarri.greek (talk) 16:32, 26 April 2021 (UTC)Reply
because elsewhere, a language link is involved What does that mean? There is no language link involved in require ('Module:el-articles').articles (args) and there would be no language link involved if you created el:wikt:Module:Utilities. Functions in Module:Utilities would be required into the module that needs them just as el:wikt:Module:tin is required into el:wikt:Module:el-nouns-decl and also into el:wikt:Module:el-articles. The advantage is that these utility function live in a single place so when changes are necessary, those changes occur in only one place not in many places.
is_set() is a function that returns a boolean (true or false). Here is the example I used above:
if not is_set (args['ακε']) then args['ακε'] = '' end
The purpose of the example code snippet is to ensure that args['ακε'] has a non-nil value because the value assigned to a_klenstr is a concatenation of args['ακε'] and \n. In Lua, you can't concatenate a nil type to a string type. When |ακε= is included without an assigned value in the template or module call, args['ακε'] gets an empty string value. When |ακε= is omitted from the template or module call, args does not have a key ['ακε'] so args['ακε'] returns nil.
Lua calls is_set (args['ακε']) to determine if args['ακε'] has an assigned value that is anything but blank. Blank means empty string or nil. If args['ακε'] is empty string or nil, is_set() returns false indicating that args['ακε'] is not set. In the snippet, not inverts the value returned from is_set() so a false return becomes true for the if ... then test indicating that args['ακε'] is not set so args['ακε'] = '' ensures that args['ακε'] is not nil for the concatenation that makes a_klenstr.
I didn't read the Greek to discover that tinti was missing. I hacked a crude regex to search Module:el-article for text that looks like a function call. I think that the regex was [a-z] *\(.
Trappist the monk (talk) 17:04, 26 April 2021 (UTC)Reply
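To make the is_set() idea and the shared-utilities suggestion concrete, a hypothetical el:wikt:Module:Utilities might be no more than this (a sketch only; the module and function bodies here are examples, not the real code):
	local p = {};
	function p.is_set (var)														-- true when var is neither nil nor empty string
		return not (var == nil or var == '');
	end
	function p.create_link (target, label)										-- simple wikilink builder
		return '[[' .. target .. '|' .. (label or target) .. ']]';
	end
	return p;
Any module that needs these would then do local utils = require ('Module:Utilities'); and call utils.is_set (args['ακε']) instead of repeating the nil/empty test everywhere.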

Citation template for medRxiv preprints edit

What is the process for making new citation templates? medRxiv preprints are usually put in the cite journal template, which triggers a citation error, or cite web, which seems clumsy. There is Template:Cite bioRxiv, so it may be beneficial for there to also be Template:Cite medRxiv. Velayinosu (talk) 01:51, 3 May 2021 (UTC)Reply

The best place to raise this issue is at Help talk:Citation Style 1.
Trappist the monk (talk) 13:15, 3 May 2021 (UTC)Reply

Category:CS1 errors: unrecognized parameter edit

Hello Trappist the monk, the category is empty. Yipee !!! Lotje (talk) 12:24, 16 May 2021 (UTC)Reply

Well, yes, the category is empty now, but I think it unlikely that the category will stay empty. Not everyone checks their edits before publishing so, sad to say, there will always be a need for this category and others like it.
Trappist the monk (talk) 12:32, 16 May 2021 (UTC)Reply
To encourage contributors, do you think adding Last fully cleared: May 16, 2021 to the category would make sense?   Lotje (talk) 12:37, 16 May 2021 (UTC)Reply
Like in Category:Articles with missing files Lotje (talk) 12:38, 16 May 2021 (UTC)Reply
I'm skeptical. If the clearance-date could be automated then, perhaps. But, these categories, especially those that are empty or nearly empty all the time, tend to stay empty or nearly empty because (I think) somehow it is easier and more rewarding for editors to fix templates in categories that are nearly empty than to fix templates in categories with thousands or even hundreds of articles. I don't think that it is possible to automate the clearance-date because anytime an empty category is refreshed, the clearance-date will update to the current date. To be meaningful in any way, the clearance-date must be manually set. I think that editors will ignore or forget to update the clearance-date.
Trappist the monk (talk) 13:37, 16 May 2021 (UTC)Reply
Maybe no one asked for my 2 cents, or this is not the correct place to say it, but I've been saying for quite some time now that categories may need some small technical changes. First of all, it would be nice to be notified somehow when a category starts getting filled. This would help with basic clean-up and maintenance categories without needing robots to "third-party" the notification part. Also, the clearance-date feature mentioned above would be pretty nice to have, for the reasons mentioned above. In general, IMO, it would be good if the categorized entries were treated as part of the category page somehow and not as something "transcendental" to it. People are interested in categories primarily because of the entries and would like to have functions that deal with said entries and their categorization process, not just the ability to change the category description. - Klein Muçi (talk) 00:48, 19 May 2021 (UTC)Reply

Danish Wikpedia module: da:Modul:Citation/CS1 edit

Hi! I noticed that you have made many edits to Module:Citation/CS1/Configuration so I hope you have a little time to help me/Danish Wikipedia.

Some time ago the English module was copied to Danish Wikipedia and modified whenever there was a problem. When InternetArchiveBot was introduced, a local user made some edits and we started up the bot. Sadly the setup was not 100% correct and the user that used to fix it has not been active for a few months. The last comment was something about bad health, so we do not know if the user can ever return. So I started to look at it.

I think it would perhaps be best to copy the modules from enwiki to dawiki and make the conversions from scratch. So I have copied the modules to da:Modul:Citation/CS1/sandkasse etc. (sandkasse = sandbox).

So my question for you is: where should I make edits to localize it?

I know I have to uncomment 3 lines like here: da:Special:Diff/10754534 and to make a lot of edits in da:Modul:Citation/CS1/Configuration/sandkasse (so far I have only made a few adjustments related to dates, like da:Special:Diff/10754497).

I know da:Modul:Citation/CS1/Date validation/sandkasse should be modified if we want to allow 'yMd'. I have not done this yet because I got errors right from the start: da:Speciel:PermanentLink/10754556. And the problem does not seem to be related to ymd.

Should I set language to 'da' somewhere? As far as I can tell the module should be able to figure out that it is da.wiki.

Do I need to change any of the other sub modules to make it work?

My plan is to document the changes needed on da.wiki so that more than one user knows how to set it up locally. --MGA73 (talk) 18:21, 30 April 2021 (UTC)Reply

Because you are using 'sandkasse' instead of 'sandbox', you need to change da:Modul:Citation/CS1/sandkasse to use that module name when referring to the sandbox version of the module suite. da:Modul:Citation/CS1/sandkasse/styles.css doesn't exist so you should create it and make sure that Sidens indhold er is set to Sanitized CSS.
Most localization is done in ~/Configuration. If you find something that isn't localized there but could/should be, let me know so that I can fix it in future releases. Date validation localization is done in ~/Date validation especially when a wiki uses date formats that are not supported at en.wiki.
You should not need to tell the module suite that the local language is Danish; that is done in ~/Configuration.
I applaud your intent to document this. When you are done, can I have a copy so that I might use it as a basis for a generic document for other wikis in the same situation?
If you need further assistance, ask.
Trappist the monk (talk) 19:08, 30 April 2021 (UTC)Reply
Thank you very much!
Doh! I knew it could find out by itself whether it was live or in the sandbox, but I forgot that the Danish word is sandkasse.
I created da:Modul:Citation/CS1/sandkasse/styles.css but I'm not sure what you mean by "Sanitized CSS"? You are welcome to fix it if it is not correct now.
Sure, I would be happy to share it with you. It will make it so much easier for all other wikis if there is a guide on en.wiki. About the 3 lines that should be uncommented: it is perhaps not super clear whether it is actually 4 lines, or 3 lines after the next line. Another thing that would make it easy is if there were a specific term used every time something can or should be localized, for example "On xx.wiki: <whatever>". That way we can search for xx.wiki and easily find the places where we can change stuff.
There are a few things that are different on dawiki. One is date_hyphen_to_dash where it is reversed in Denmark :-) But I will try to make a list once we get it fixed so I do not have to spam your talk page. --MGA73 (talk) 19:43, 30 April 2021 (UTC)Reply
Listed at :da:Oplysninger om siden. It used to be that when creating a css page for TemplateStyles, a sysop had to change the page content model setting. I guess that they have fixed that.
Trappist the monk (talk) 22:16, 30 April 2021 (UTC)Reply
I have a question. In Denmark we usually write months with lowercase letters and I can make the module accept them by duplicating months like da:Special:Diff/10754679, or I could use [Dd], but is there a better way to indicate that the preferred way on dawiki is lowercase? And what is the best way to make it accept dates with a dot, like in 11. december 2020? --MGA73 (talk) 20:51, 30 April 2021 (UTC)Reply
I just found out that we should always use lowercase letters, so I simply made changes like da:Special:Diff/10754717. We also use a period/dot when we shorten the months. If there is a better way please let me know, --MGA73 (talk) 21:45, 30 April 2021 (UTC)Reply
(edit conflict)
If the correct way to write month names in a date is to use lowercase, write the name in lowercase. The date_names['local']['long'] and date_names['local']['short'] are inverted for use by reformatter(). It is not possible to invert ['December'] = 12, ['december'] = 12 to [12] = 'December', [12] = 'december' and have access to both forms of the month name. If both 'December' and 'december' are needed, some sort of special code will have to be written to support that. If all that you need is the lowercase form, that is all that should be included in the date_names['local']['long'] and date_names['local']['short'] tables.
Create a pattern in patterns{}; perhaps something like this:
																				-- day-initial: day. month year
	['d.My'] = {'^([1-9]%d?)%. +(%D-) +((%d%d%d%d?)%a?)$', 'd', 'm', 'a', 'y'},
and create a test in check_date(); perhaps something like this:
	elseif mw.ustring.match(date_string, patterns['d.My'][1]) then				-- day-initial: day. month year
		day, month, anchor_year, year = mw.ustring.match(date_string, patterns['d.My'][1]);
		month = get_month_number (month);
		if 0 == month then return false; end									-- return false if month text isn't one of the twelve months
Trappist the monk (talk) 22:16, 30 April 2021 (UTC)Reply
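A quick illustration of the inversion problem described above (plain Lua, nothing module-specific):
	local long_names = {['December'] = 12, ['december'] = 12};					-- name → number, both capitalizations
	local inverted = {};
	for name, number in pairs (long_names) do
		inverted[number] = name;												-- the second assignment overwrites the first
	end
	-- inverted[12] is now either 'December' or 'december', never both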
Hi! Thank you very much. I'm super happy that you are helping me. Regarding the months, I think they should always be lowercase, so I simply changed the local values to lowercase. Regarding the inversion, that was just a mistake.
I created da:Special:Diff/10754667 because the Danish month May is Maj (also 3 letters), but now I think perhaps I do not need that because of the translation above?
Regarding the pattern ['d.My'], I think it should perhaps be ['d.my'] because we use lowercase months. That makes me wonder if I should change all the codes with the "M" variants to lowercase or if the code can handle capital letters?
Also, about using a dot like I did in ['dec.'] = 12: I guess that is a bad idea? It would be better to create a pattern for that, like when we use a dot after the day? So we should have a ['d.m.y']? --MGA73 (talk) 08:27, 1 May 2021 (UTC)Reply
da:Special:Diff/10754667 is correct. I suspect that most European names for 'May' are similarly short but I don't know nor do I know about non-European names. Perhaps I'll do some research today.
In ['d.My'] the uppercase 'M' is intended to mean 'month-as-name' while lowercase 'm' is intended to mean 'month-as-digit' so I think that you should not change these.
['dec.'] = 12 is correct. Those patterns that look for month names, are looking for anything that is not a digit (['My'] is an oddity). The patterns don't care if the month name has punctuation. The whole month-name capture is used to index into date_names['local']['short'].
Trappist the monk (talk) 13:28, 1 May 2021 (UTC)Reply
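In other words, the month-name lookup is just a table index; a simplified sketch (the table contents are abbreviated here, and the real get_month_number() presumably also consults the long-name table):
	local date_names = {
		['local'] = {
			['short'] = {['jan.'] = 1, ['feb.'] = 2, ['mar.'] = 3, ['dec.'] = 12},	-- abbreviated for this example
			},
		};
	local function get_month_number (month)
		return date_names['local']['short'][month] or 0;						-- 0 when the capture is not a known month name
	end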
Because of your edit at da:Special:Diff/10754667, I have rewritten is_valid_month_range_style() in the en.wiki sandbox; see Module:Citation/CS1/Date validation/sandbox#L-246. The original code there was quite old (December 2014) and, I think, precedes any real notion of i18n. You might want to replace your is_valid_month_range_style() with the en.wiki/sandbox version. You can be our testbed.
Trappist the monk (talk) 16:17, 1 May 2021 (UTC)Reply

Hello again! Great, then things are going the right way :-) I updated my sandbox with a copy from your sandbox (see da:Special:Diff/10755535). I tested it in my own sandbox like da:Speciel:PermanentLink/10755537. It looks like it can handle "maj" like it should, but not "Maj", and that is okay because Danish months use lowercase. I guess the reason it accepts "December" is because it also accepts English months. So it works? --MGA73 (talk) 17:26, 1 May 2021 (UTC)Reply

Those tests don't test is_valid_month_range_style() because none of the dates are ranges. is_valid_month_range_style() is used for these (copy them as you see them here and paste them into your sandbox):
  • *{{cite web/sandkasse |title=Title |url=//example.com |date=oktober–december 2021}} – no error
  • *{{cite web/sandkasse |title=Title |url=//example.com |date=okt.–december 2021}} – error
  • *{{cite web/sandkasse |title=Title |url=//example.com |date=okt.–dec. 2021}} – no error
  • *{{cite web/sandkasse |title=Title |url=//example.com |date=oktober–dec. 2021}} – error
As currently constructed, it is not possible to distinguish between 11. december 2021 and 11. December 2021 but the auto-translation of the month names at least causes the module to render 'december' from December. In your testcases, Maj cannot be translated because that name is not an English month-name. We might tweak the auto-translation code so that it adds a maintenance category whenever a translation is made. One of your gnomes then can monitor that category and fix the dates in the wikisource.
Trappist the monk (talk) 18:09, 1 May 2021 (UTC)Reply
Oops. My mistake. I did da:Special:Diff/10755587 to see how it works. --MGA73 (talk) 18:17, 1 May 2021 (UTC)Reply
Yeah, a category would be great. We have a dude that likes to fix everything with a bot, so he will probably make a script to change English months into Danish months. Same for dates without the dot after the day.
I made this edit da:Special:Diff/10754802 because the old code on dawiki put pages with limited access in special categories. I will hurry up and revert that change, and next time I will not edit at that time of night ;-) --MGA73 (talk) 18:36, 1 May 2021 (UTC)Reply
I have tweaked the en.wiki sandboxen so that the date auto-translator is enabled/disabled by setting a boolean in ~/Configuration (lines 2034 & 2035). I have also added support for a maintenance category when dates are auto-translated (Module:Citation/CS1/sandbox lines 2996–3001). Not tested.
Trappist the monk (talk) 23:30, 1 May 2021 (UTC)Reply
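So in ~/Configuration the two switches presumably end up looking something like this (the names are those mentioned later in this thread, cfg.date_name_auto_xlate_enable and cfg.date_digit_auto_xlate_enable; whether they are declared exactly like this in the file is an assumption):
	local date_name_auto_xlate_enable = true;									-- auto-translate English month names in dates to the local names
	local date_digit_auto_xlate_enable = false;								-- auto-translate digit-form dates (assumed meaning)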
I updated da.wiki da:Special:Diff/10755980. As I understand it there is no longer a need to uncomment and/or change settings true/false because the code can now self detect. If that is incorrect then the help text should mention it.
I also updated da.wiki da:Special:Diff/10755985. I added a few other lines from the sandbox too as I guess they are about to be implemented too?
About the ['d.My'] we talked about earlier: they use a dot too on no:, nn:, fi: and fo: (perhaps other wikis too). So perhaps you could add the support in the code so that the wikis that need it can just uncomment it? --MGA73 (talk) 08:53, 2 May 2021 (UTC)Reply
da:Special:Diff/10755980 does not need to be uncommented because it uses cfg.date_name_auto_xlate_enable and cfg.date_digit_auto_xlate_enable which you have set in ~/Configuration. I do not see this as self detect because a human has to enable/disable the auto-translation – the module knowing that it is on da.wiki is self detect.
I'm sure that there is a way to do what you suggest that does not involve large amounts of dead code in the module and perhaps someday I'll stumble upon that solution.
Trappist the monk (talk) 13:11, 2 May 2021 (UTC)Reply
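A rough fragment showing how the date code might gate the translation on that flag (cfg, auto_xlate_month() and add_maint_cat() are stand-ins carried over from the sketches above, not the actual module's names):
<syntaxhighlight lang="lua">
if cfg.date_name_auto_xlate_enable then						-- only translate when the local wiki has opted in
	local translated;
	date_str, translated = auto_xlate_month (date_str);
	if translated then
		add_maint_cat ('date_auto_xlated');					-- hypothetical maintenance-category key and helper
	end
end
</syntaxhighlight>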
Sorry, my comment was unclear. It is perfectly fine that we have to set true/false in ~/Configuration, so there is no need to write a lot of code just to avoid a simple change of true/false. I was just trying to make sure that you removed the "uncomment the next three lines" note because we (local wikis) no longer have to change anything in the main module with this new code.
Looking at da:Speciel:PermanentLink/10756155 it seems that the basics now work. I get an error related to "maint_date_auto_translated". I made da:Special:Diff/10756079, inspired by your edits in the en.wiki sandbox, but it did not fix it. --MGA73 (talk) 14:46, 2 May 2021 (UTC)Reply
Fixed it.
Trappist the monk (talk) 14:53, 2 May 2021 (UTC)Reply
Oooops thank you!
Okay, now I think we have only one problem left, and it is related to "hyphen to dash". It seems dawiki uses the opposite of enwiki, and to make it even more fun there are special rules for when to use spaces around it. In the old dawiki code it was fixed with da:Special:Diff/9956579 and da:Special:Diff/10181460 in ~/Date validation and da:Special:Diff/10181461 in the main module. (There were also some changes that fixed the problem that "maj" has 3 characters and a shortened month "jan." has 4 characters just like "juni", but your fix eliminated both problems.)
I have no idea if dawiki is the only wiki in the world that is the opposite of enwiki, but I would like your opinion on what the best way to fix our problem is? --MGA73 (talk) 16:03, 2 May 2021 (UTC)Reply
I don't know either; haven't heard of it but then I didn't know that da.wiki is different from en.wiki in this regard so ... If your existing date_hyphen_to_dash() does what you want it to do, then use it.
Trappist the monk (talk) 16:36, 2 May 2021 (UTC)Reply
Thank you! Will do that. --MGA73 (talk) 17:26, 2 May 2021 (UTC)Reply
Hello again! The old dawiki "hyphen to dash" code did not seem to work well with the new code from enwiki. Is the code in ~/Date validation supposed to fix all hyphen/dash issues related to dates, while the code in the main module is supposed to take care of all other hyphen/dash cases? --MGA73 (talk) 17:59, 9 May 2021 (UTC)Reply
You haven't given an example so I don't know what did not seem to work well with the new code means to you. There are two hyphen/dash converters: one in ~/Date validation for dates and the other in the main module for |pages=, |issue=, etc.
Trappist the monk (talk) 12:30, 10 May 2021 (UTC)Reply
Thank you! No, I wanted to see if I could figure it out myself first. I just wanted to be sure that the two converters do not conflict with each other (unless I make a mess lol). --MGA73 (talk) 14:54, 10 May 2021 (UTC)Reply

Hello again! We tried again, and da:Modul:Citation/CS1/Date_validation/sandkasse#L-1059 and onwards now seems to show us what we want :-D

However, it also changes some wrong dates to correct dates without showing us errors. Examples can be seen at da:Skabelon:Citation/testcases#Interval_med_datoer/dansk; "(Forventet: OK)" means "expected: OK" and "(Forventet: fejl)" means "expected: error".

We use "-" without spaces around here:

  • 1963-1968
  • maj-juni 2019
  • 3.-5. juni 2019

We use " – " with spaces around here:

  • 3. maj – 5. juni 2019
  • 3. maj 2018 – 5. juni 2019
  • maj 2018 – juni 2019

Perhaps you could have a look at the code and tell us if we could do something much smarter? For example, I can't help thinking that the original code defines different patterns for dates, so we should perhaps be able to figure out how to use that. --MGA73 (talk) 13:19, 15 May 2021 (UTC)Reply
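A rough Lua sketch of one way to encode the spacing rules listed above (plain hyphen when the left half of the range is a single token, spaced en dash when it is multi-word). The function name and the heuristic are assumptions, not the existing da.wiki code:
<syntaxhighlight lang="lua">
-- normalize the separator in a Danish date range per the spacing rules above
local function da_range_separator (date_str)
	local left, right = date_str:match ('^(.-)%s*%-%s*(.+)$');	-- split on the first hyphen
	if not left then
		return date_str;										-- no hyphen: not a range, nothing to do
	end
	if left:find ('%s') then									-- multi-word left half: "3. maj", "maj 2018"
		return left .. ' – ' .. right;							-- spaced en dash
	end
	return left .. '-' .. right;								-- single-token left half: unspaced hyphen
end

-- da_range_separator ('1963 - 1968') → '1963-1968'
-- da_range_separator ('maj - juni 2019') → 'maj-juni 2019'
-- da_range_separator ('3. maj-5. juni 2019') → '3. maj – 5. juni 2019'
</syntaxhighlight>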

The examples at da:Skabelon:Citation/testcases#Interval_med_datoer/dansk; shouldn't those be using {{Citation/sandkasse |...}}?
Trappist the monk (talk) 13:47, 15 May 2021 (UTC)Reply
Yes thank you! Fixed at da:Special:Diff/10768117. Also some tests were in my opinion wrong so I changed the expected result at da:Special:Diff/10768119. That should help us. --MGA73 (talk) 14:59, 15 May 2021 (UTC)Reply

Hello again! I made da:Special:Diff/10769630 because I suddenly realized that we have variants other than ['d.My'] that include a dot. We also have ['d.-d.My'], ['d.M-d.My'] and ['d.My-d.My']. It eliminated a lot of the problems we had :-) --MGA73 (talk) 17:14, 17 May 2021 (UTC)Reply
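For anyone following along, entries for those dotted formats might look roughly like this in a patterns table (the pattern strings and table layout are illustrative guesses, not the actual da.wiki code):
<syntaxhighlight lang="lua">
-- illustrative Lua patterns for the dotted Danish formats named above
local patterns = {
	['d.My'] = '^(%d%d?)%. +(%D+) +(%d%d%d%d)$',				-- 3. maj 2019
	['d.-d.My'] = '^(%d%d?)%.%-(%d%d?)%. +(%D+) +(%d%d%d%d)$',	-- 3.-5. maj 2019
	-- ['d.M-d.My'] and ['d.My-d.My'] would follow the same scheme
	}
</syntaxhighlight>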

I read the conversation above and I wanted to say that I'm intrigued by the idea of having a kind of manual of CS1 localization. If you pair this idea with a table that keeps track of the specific differences that specific wikis might need (for example, SqWiki has two more tracking categories compared with the EnWiki version) plus their statuses (up to date or obsolete), you'd have the first steps of what I've talked about in the past regarding CS1 globalization, which could lead to better standardization and shorter update times in the future. @MGA73: Needless to say, I do encourage you to go on with your idea. :) - Klein Muçi (talk) 00:22, 19 May 2021 (UTC)Reply
@Klein Muçi: Yes, a table would be a good idea. Perhaps there could also be an intro at the top of each (sub)module telling what the module does and which local edits could be relevant. It does not have to be a long text. Perhaps something like "This is the main module. It calls the relevant submodules. Local wikis may want to make local changes: 1) Change sandbox to the local name for sandbox, 2) Change hyphen/dash, 3) ?" And in places where local changes may be relevant, it could be helpful if there is a specific text that makes it easy to search for. Example: "For those wikis..." (In ~/Configuration there is of course no reason to mention it like 300 times. There it would be enough to mention it in the intro or perhaps at the top of each section if it is relevant to translate something.) --MGA73 (talk) 19:23, 21 May 2021 (UTC)Reply
@MGA73: Yes, adding internationalization information locally could be a good idea. But I still believe it might be better if this whole infrastructure were set up at Meta, not here, paired with a global tracking table and multi-language tech support from different users. CS1 is the de facto module every wiki uses for handling citations. It would be better to treat it like that de jure too. Not to mention the facilities Meta provides for handling multi-language "manual books", if we decide to take that road. - Klein Muçi (talk) 08:57, 22 May 2021 (UTC)Reply

Seeking help editing an article edit

Hello Trappist the monk, I am a student who is new to Wikipedia. One of my courses consists of editing and updating the Kirindy Forest Wikipedia page. I saw that you edited the Kirindy Mitea National Park page and was wondering if you would be able to provide me with some feedback and help me improve it.

Any assistance would be greatly appreciated :) — Preceding unsigned comment added by Marie Salichon (talkcontribs) 01:47, 22 May 2021 (UTC)Reply

Marie Salichon: My robot has edited Kirindy Mitea National Park but I have not. Any questions that you have about editing en.wiki articles are probably best asked at Wikipedia:Help desk, a venue specifically set up to help editors of all skill levels.
Trappist the monk (talk) 11:38, 22 May 2021 (UTC)Reply

stop! edit

monkbot is bad! 141.91.210.58 (talk) 06:45, 4 June 2021 (UTC)Reply

Question edit

Hello Ttm, I was curious to know if you think it would be possible to add the ability to watch a specific talk page thread, as opposed to the entire page itself. Perhaps as a feature whereby individual threads could be added to your watchlist? Or follow along by some other means, where maybe there's some kind of notification each time a post is added to the thread?

I'm asking you first, to see if pursuing this is even worthwhile. Is it impossible? Would you know if this has come up before? Otherwise, I suppose I would start at Phabricator and create a new feature request, then go from there? Any feedback would be appreciated. Cheers - wolf 02:39, 2 June 2021 (UTC)Reply

(talk page stalker) Here are some possibly useful Phab tickets with links to various fun things: T2738, T263820. – Jonesey95 (talk) 03:42, 2 June 2021 (UTC)Reply
Yep, that ↑. There may be other phab tickets. I seem to recall that this is one of the features that is commonly requested.
Trappist the monk (talk) 13:00, 2 June 2021 (UTC)Reply
Yes, either of those tickets would do the job (though one addresses sections of talk pages and the other of articles, I suppose it's 6 one way, ½ dozen t'other). The first one is from 2004... so that doesn't inspire much confidence. The second one is from last year, so maybe there's more hope for that...? Anyway, thank you both, for the replies and links. - wolf 01:48, 5 June 2021 (UTC)Reply

nanvag rajputs edit

You have written on nanvag rajputs; where did you get that from, brother? Kunalpratapsingh35 (talk) 16:07, 6 June 2021 (UTC)Reply

I think that you are mistaken.
Trappist the monk (talk) 16:10, 6 June 2021 (UTC)Reply

I did preview edit

I did preview. It's my opinion that bare URLs do not reflect well. In some cases, the refill tool does a nice job of improving the ref. In some cases it produces a broken citation, often because the underlying URL has a problem. It's my opinion that highlighting the problems, so that editors with subject-matter expertise (SME) can see them and fix them, is better than leaving them as bare URLs. It appears you think leaving bare URLs, even when they are flawed, is a better option. We disagree.--S Philbrick(Talk) 16:24, 27 June 2021 (UTC)Reply

The tool converted this:
[http://www.damligan.se/player.html?id=1517462 ''Player Bio'' damligan.se] {{webarchive|url=https://web.archive.org/web/20100813191257/http://www.damligan.se/player.html?id=1517462 |date=2010-08-13 }}
Player Bio damligan.se Archived 2010-08-13 at the Wayback Machine
to this:
{{Cite web|url=http://www.damligan.se/player.html?id=1517462|archiveurl=https://web.archive.org/web/20100813191257/http://www.damligan.se/player.html?id=1517462|deadurl=y|title=''Player Bio'' damligan.se|archivedate=August 13, 2010}}
"Player Bio damligan.se". Archived from the original on August 13, 2010. {{cite web}}: Unknown parameter |deadurl= ignored (|url-status= suggested) (help)
and this:
[https://archive.is/20121117010216/https://www.eurobasket.com/team.asp?Cntry=Sweden&Team=8211&Page=1 ''Umeå player roster'' eurobasket.com]
Umeå player roster eurobasket.com
to this:
{{Cite web|url=https://www.eurobasket.com/team.asp|archiveurl=http://archive.today/20121117010216/https://www.eurobasket.com/team.asp?Cntry=Sweden&Team=8211&Page=1|deadurl=y|title=Udominate Basket Umea basketball - team details, stats, news, roster - EUROBASKET}}
"Udominate Basket Umea basketball - team details, stats, news, roster - EUROBASKET". {{cite web}}: |archive-url= requires |archive-date= (help); Unknown parameter |deadurl= ignored (|url-status= suggested) (help)
Neither of those conversions is from a bare URL (I understand bare URL to mean something like: https://www.example.com). In both, the base URL is dead, but that isn't a problem because in both cases there is an archive snapshot. Neither of those conversions requires a subject-matter expert to fix. The error messages tell you what is wrong and provide links to help text if you don't understand the messages.
It is your responsibility as the tool operator to ensure that the tool does the right thing; if it doesn't, fix or undo what it did and complain to the tool maintainers so that the tool can be improved – the more operators who do that, the better the tool will become. I don't use the tool but I have grown weary of cleaning up after those who do, especially when, as in this case, the repair is so simple:
{{Cite web |url=http://www.damligan.se/player.html?id=1517462 |archiveurl=https://web.archive.org/web/20100813191257/http://www.damligan.se/player.html?id=1517462 |title=Pamela Rosanio |website=Damligan |language=sv |archive-date=August 13, 2010}}
"Pamela Rosanio". Damligan (in Swedish). Archived from the original on August 13, 2010.
{{Cite web |url=https://www.eurobasket.com/team.asp |archiveurl=http://archive.today/20121117010216/https://www.eurobasket.com/team.asp?Cntry=Sweden&Team=8211&Page=1 |archive-date=2012-11-17 |title=Udominate Basket Umea basketball team |website=Eurobasket}}
"Udominate Basket Umea basketball team". Eurobasket. Archived from the original on 2012-11-17.
Trappist the monk (talk) 17:05, 27 June 2021 (UTC)Reply

:de: edit

I try - after a conversation with Graham87 - to avoid direct links to foreign Wikipedias. How can we do that in cite templates, such as Karl-Günther von Hase? --Gerda Arendt (talk) 20:23, 26 June 2021 (UTC)Reply

You have me at a disadvantage. I am not privy to your conversation with Graham87.
At the top of {{ill}} is a notice box that tells editors that {{ill}} is not to be used within cs1|2 templates. There are a couple of reasons for that. When one template (in this case {{cite book}}) contains another template ({{ill}}), the inner template ({{ill}}) is rendered before the outer template ({{cite book}}). When the page is processed by MediaWiki, the {{cite book}} parameter value |editor2={{ill|Rolf Breitenstein|de}} is processed first, so that what {{cite book}} gets is:
|editor2=[[Rolf Breitenstein]]<span class="noprint" style="font-size:85%; font-style: normal; ">&nbsp;&#91;[[:de:Rolf Breitenstein|de]]&#93;</span>
Everything in and including the <span>...</span> html tags is not part of the editor's name and so does not belong in |editor2=.
The other problem is an issue of accessibility. {{ill}} sets the &nbsp;&#91;[[:de:Rolf Breitenstein|de]]&#93; to a font size of 85% of the current text size. That is ok when {{ill}} is used in the body of an article because 85% of the normal 100% article-body font size yields text at 85% of normal, which is still allowed. But MOS:FONTSIZE instructs us not to set smaller font sizes inside infoboxen, navboxen, and reference sections. {{refbegin}} (used in Karl-Günther von Hase) sets the references to 90% of normal. {{ill}} then sets &nbsp;&#91;[[:de:Rolf Breitenstein|de]]&#93; to 85% of 90%, or 100 × 0.90 × 0.85 = 76.5%, which is smaller than the allowed 85% font size. Here is a simple illustration of how that works (for clarity, you may need to adjust your browser's zoom setting):
[<span style="font-size:90%;>[<span style="font-size:85%;>[]</span>]</span>]
[[[]]]
The outer brackets are at 100%, the middle at 90%, and the inner at 85% of 90% or 76.5%.
Because of these issues, I changed |editor2={{ill|Rolf Breitenstein|de}} to |editor2=Rolf Breitenstein |editor-link2=:de:Rolf Breitenstein.
Trappist the monk (talk) 22:33, 26 June 2021 (UTC)Reply
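The compounding is plain multiplication; a trivial Lua check of the figures quoted above:
<syntaxhighlight lang="lua">
local refbegin_scale = 0.90						-- {{refbegin}}: references at 90% of normal
local ill_scale = 0.85							-- {{ill}}: language tag at 85% of its surroundings
print (100 * refbegin_scale * ill_scale)		-- 76.5, below the 85% floor in MOS:FONTSIZE
</syntaxhighlight>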
@Gerda Arendt: Yeah, it's easiest not to have the interwiki templates in citation templates for those reasons ... even for me. Gerda and I had a conversation about this way back. Graham87 05:39, 27 June 2021 (UTC)Reply

This conversation having apparently died, I am reverting Editor Gerda Arendt's partial revert of my edits.

Trappist the monk (talk) 13:17, 29 June 2021 (UTC)Reply

Sorry, I had other things on my mind. The best thing would be to create stubs for the missing articles. - My name is Gerda, and I don't need the title Editor specified, although I agree with Hammersoft that it is the highest title Wikipedia has to offer ;) - I hope I'll remember the explanation above in the next case - thank you for your patience. --Gerda Arendt (talk) 13:43, 29 June 2021 (UTC)Reply