User talk:JL-Bot/Journal Testing

Run 1

The current contents (22:34, 23 June 2011) consist of a subset of trial output. I included the first 100 rows as an example for review and then added some additional rows for which I had particular comments/questions:

  • 414: This is a single citation with two targets. This is a case where the displayed text is a redirect to one target, but it has a piped link to a different target. The second piped link (BBC History (magazine)) is actually a redirect to the first. On the principle of keeping it simple, I'd rather not have to parse the targets and figure out if they are redirects or not.
  • 592: The journal field is [[CA: a cancer journal for clinicians]] which is an invalid link. It gets parsed as an interwiki link, but fails to display anything (I'm guessing it's not a valid interwiki prefix). Can I just list, without linking, anything that looks like an interwiki link (starts with two letters and is followed by a colon)?
  • 1464: This journal has an invalid character (a character not supported in Wikipedia page titles). I plan on listing any invalid titles as non-links.
  • 1734 & 1889: These are similar to 414 in that the displayed text is a redirect to one target, but it has a piped link to a different target. However in these cases, the redirect goes to a different target than the piped link.
  • 1905: This is also similar to the 414, 1734, & 1889 cases. However in this case, it's not a redirect and the same journal name shows up with two different piped links.
  • 2025: This is a result of a parsing error that I need to fix. I listed it just to remind myself.
  • 2849: This is another journal that has invalid characters (brackets are not supported in Wikipedia page titles). I could assume anything in brackets is not part of the title, but I'm not sure that is the right approach.

Let me know your thoughts on these cases, and also whether you see any issues with the other ones. -- JLaTondre (talk) 22:53, 23 June 2011 (UTC)


  • 414: BBC History is the target, so that's the one that should be mentioned as a target.
  • 592: That's.... weird? I guess in those cases you should probably replace the colon with a spaced endash (CA – A journal for clinicians), but pipe the correct version. Aka [[CA – A journal for clinicians|CA: A journal for clinicians]].
  • 1464: Cool.
  • 1734 & 1889: Screw the Foobar part of [[Foobar|Barfoo]]. We're interested in Barfoo in this case, so linking to Barfoo would be the desired behaviour, regardless of what the citation did.
  • 1905: If you find a "Foobar" that has a corresponding, non-redirected, Foobar (journal/magazine), just behave as if the citation was to Foobar (journal/magazine). AKA link to [[Foobar (journal/magazine)|Foobar]]
  • 2849: Just strip the brackets. Foobar [barfoo] → Foobar barfoo

Headbomb {talk / contribs / physics / books} 10:40, 24 June 2011 (UTC)
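
The title checks discussed in this exchange could be sketched roughly as follows. This is only an illustration, not JL-Bot's actual code: all function names are hypothetical, the two-letters-plus-colon test is the heuristic proposed for case 592, and the en dash rewrite is the alternative suggested in the reply.

```python
import re

# Characters that MediaWiki does not allow in page titles.
INVALID_TITLE_CHARS = set("#<>[]|{}")

# Heuristic from the thread (an assumption, not a real interwiki table lookup):
# two letters followed by a colon at the start of the text.
INTERWIKI_LIKE = re.compile(r"^[A-Za-z]{2}:")


def strip_brackets(text: str) -> str:
    """Drop square brackets but keep what is inside them."""
    return text.replace("[", "").replace("]", "")


def looks_like_interwiki(text: str) -> bool:
    """True if the text starts like an interwiki link, e.g. 'CA: ...'."""
    return bool(INTERWIKI_LIKE.match(text))


def has_invalid_chars(text: str) -> bool:
    """True if the text contains a character not allowed in page titles."""
    return any(ch in INVALID_TITLE_CHARS for ch in text)


def endash_link(text: str) -> str:
    """Replace the first colon with a spaced en dash in the target, piping the original text."""
    prefix, rest = text.split(":", 1)
    return f"[[{prefix} – {rest.strip()}|{text}]]"
```

For example, `strip_brackets("Foobar [barfoo]")` gives `Foobar barfoo`, matching the 2849 suggestion.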

Also, I've been thinking... The current setup handles existing pages (bold), redirects (italics), non-existing pages (redlinks) but disambiguation pages are left unmarked. I suggest bold underlined for direct links to disambiguation pages, and italics underlined for redirects to disambiguation pages.
So in the table, you would have things like AAP. Headbomb {talk / contribs / physics / books} 18:55, 24 June 2011 (UTC)
Okay, I'll look into adding that. -- JLaTondre (talk) 20:15, 24 June 2011 (UTC)
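
One possible way to express the proposed marking scheme in code, as a rough sketch only. The `kind` labels and the function name are hypothetical; the wikitext used is the standard `'''bold'''`, `''italics''` and `<u>underline</u>` markup.

```python
def format_journal_link(title: str, kind: str) -> str:
    """Wrap a wikilink according to the kind of page it points at.

    The kind labels are illustrative assumptions, not names used by the bot.
    """
    link = f"[[{title}]]"
    if kind == "page":             # existing page: bold
        return f"'''{link}'''"
    if kind == "redirect":         # redirect: italics
        return f"''{link}''"
    if kind == "dab":              # disambiguation page: bold underlined
        return f"<u>'''{link}'''</u>"
    if kind == "redirect_to_dab":  # redirect to a dab page: italics underlined
        return f"<u>''{link}''</u>"
    if kind == "missing":          # non-existing page: plain link, shown as a redlink
        return link
    raise ValueError(f"unknown page kind: {kind}")
```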

Your above response has me a bit confused. I need to boil these down to general purpose "rules" so that I can actually code them.
This is what I am gathering should be for the Journal field (ignoring formatting for redirects, etc.):

Case 1: No links
|journal = Text
Journal column = [[Text]]
Case 2: Link with no piping
|journal = Optional Prefix [[Text]] Optional Suffix
Journal column = [[Optional Prefix Text Optional Suffix]]
Case 3: Link with piping
|journal = Optional Prefix [[Other Text|Text]] Optional Suffix
Journal column = [[Optional Prefix Text Optional Suffix]]
Case 4: Link with piping where piping contains "(journal)" or "(magazine)"
|journal = Optional Prefix [[Text (journal)|Text]] Optional Suffix
Journal column = [[Text (journal)|Optional Prefix Text Optional Suffix]]

Where the Target column will only be filled in if the Journal column is a redirect.
Is that all correct? Are there any other cases (other than the invalid characters)? -- JLaTondre (talk) 20:15, 24 June 2011 (UTC)


The general rule is take what the reader sees, and that's what we care about.

Article has...
|journal = Whatever [[Foobar|barfoo]] [[boofar]] {{color|pink|random stuff}} RAndom TeXt!1!!1
Compilation should treat it[1] as...
|journal = Whatever barfoo boofar random stuff RAndom TeXt!1!!1
  1. ^ If template-handling stuff is too complicated, don't bother with it. It shouldn't occur very often anyway.
And place it in the "Journal" column as
[[Whatever barfoo boofar random stuff RAndom TeXt!1!!1]]
Unless there is a corresponding (journal/magazine) page which exists [and which is not a redirect], in that case place it in the column as
[[Whatever barfoo boofar random stuff RAndom TeXt!1!!1 (journal/magazine)|Whatever barfoo boofar random stuff RAndom TeXt!1!!1]]

In all cases the "target" column should be given, regardless of whether it's a redirect or not. Don't bother piping the (journal/magazine) link in this one. AKA

Journal | Target | Citations
Nature | Nature (journal) | 3
Sheng li xue bao: Acta physiologica Sinica | | 1
The Shorter Routledge Encyclopedia of Philosophy | Routledge Encyclopedia of Philosophy | 1

Headbomb {talk / contribs / physics / books} 20:42, 24 June 2011 (UTC)
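
The "take what the reader sees" rule above might be sketched like this. It is a simplified illustration under stated assumptions: the regular expressions only handle non-nested links, templates are reduced to their last unnamed parameter (a crude guess that fits `{{color|pink|random stuff}}` but not templates in general), and `existing_non_redirect_pages` is a hypothetical stand-in for a real page-existence lookup.

```python
import re

PIPED_LINK = re.compile(r"\[\[[^\[\]|]*\|([^\[\]|]*)\]\]")  # [[target|label]] -> label
PLAIN_LINK = re.compile(r"\[\[([^\[\]|]*)\]\]")             # [[target]] -> target
TEMPLATE = re.compile(r"\{\{[^{}]*\|([^{}|]*)\}\}")         # {{name|...|last}} -> last


def displayed_text(journal_field: str) -> str:
    """Approximate the text a reader sees for a |journal= value."""
    text = PIPED_LINK.sub(r"\1", journal_field)
    text = PLAIN_LINK.sub(r"\1", text)
    text = TEMPLATE.sub(r"\1", text)  # templates without parameters are left untouched
    return " ".join(text.split())     # collapse runs of whitespace


def journal_column(text: str, existing_non_redirect_pages: set) -> str:
    """Link to 'Text (journal)' or 'Text (magazine)' when such a page exists, else to 'Text'."""
    for suffix in ("journal", "magazine"):
        candidate = f"{text} ({suffix})"
        if candidate in existing_non_redirect_pages:
            return f"[[{candidate}|{text}]]"
    return f"[[{text}]]"
```

Applied to the example above, `displayed_text` returns `Whatever barfoo boofar random stuff RAndom TeXt!1!!1`, and `journal_column("Nature", {"Nature (journal)"})` returns `[[Nature (journal)|Nature]]`.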

Okay, I think I got it. I'll update the code and post a new run. It will take a day or two as I have some commitments. -- JLaTondre (talk) 21:03, 24 June 2011 (UTC)

Run 2

Here is an updated list consisting of the first 10 (by alphabetical order) citations starting with each letter of the alphabet and also numbers. It includes the handling for dab pages that you requested above. Please review and let me know if you see any issues. There are still some additional cases I need to properly handle (like correctly parsing citations that are external links) and then I need to work on the logic for outputting the results to the proper pages. Once that's done, I'll file the BRFA. -- JLaTondre (talk) 02:06, 29 June 2011 (UTC)

Looks all good with minor tweaks.
Headbomb {talk / contribs / physics / books} 02:46, 29 June 2011 (UTC)

Run 3

Another formatting question: <abbr title="Journal of Algorithms">J. Algorithms</abbr> - This code displays "J. Algorithms" in the citation text, but if you hover your mouse over it, it will then display "Journal of Algorithms" (see Sylow theorems for an example). Which one do you want displayed in the output? -- JLaTondre (talk) 20:13, 3 July 2011 (UTC)

It is pretty much complete. I've had it update this page with the first output page (A1). I'm sure when the whole list is up, there will be some cases that need special handling. I'm going to go ahead and file the bot request. I've updated the number of records per page to 250 (it was 100) to cut down on the number of page writes. Let me know if you think that will be an issue or not. -- JLaTondre (talk) 16:54, 9 July 2011 (UTC)

Bot request filed at Wikipedia:Bots/Requests for approval/JL-Bot 7. -- JLaTondre (talk) 17:24, 9 July 2011 (UTC)

It was approved for a trial run, which has been completed. Please look through the output at WP:JCW and let me know if you want any changes or if you see any issues. A couple of items in particular:

  1. If you are fine with the 250 records per page, I'll delete the old, extra pages. If you prefer it remain at 100, I'll re-run once the trial is approved.
  2. If the journal field starts with a lowercase letter, the bot will not correctly recognize whether a matching page exists, as page titles always start with a capital letter. I could put in a workaround, but the easiest thing would be to always convert the first letter of the journal field to an uppercase letter. Would this be acceptable or do you want it displayed as lowercase if that was the way it was in the article? Always uppercasing the first letter has the added benefit of not getting two different entries for "Title" and "title".

-- JLaTondre (talk) 23:53, 9 July 2011 (UTC)


Headbomb {talk / contribs / physics / books} 04:45, 10 July 2011 (UTC)

  • <abbr title ...>: That was my guess for how to handle it and what is already implemented so no change needed.
  • Extra Pages: I deleted them.
  • First Letter Case: I've changed it to always uppercase.
  • Oxford Dict. of Biographies: This entry consists of Oxford Dictionary of National Biography, {{ODNBsub}} where {{ODNBsub}} generates an external link and causes the internal link to fail. Higher up on the page, you requested that templates be kept, which works in most cases. However, it will fail in some cases like this. I can put in special handling to take care of specific issues. There are several ways to approach it. Let me know if you prefer one of these or something else:
  • Yeast & Zzap!64: I wasn't properly excluding existing pages in my comparison logic for missing popular pages. It's been fixed.
-- JLaTondre (talk) 13:48, 10 July 2011 (UTC)
Ah, I see. I don't really see what logic you could use to determine the pipe target, but as far as I'm concerned, something like "Oxford Dictionary of National Biography, {{ODNBsub}}" would be fine (target = Invalid). Headbomb {talk / contribs / physics / books} 15:14, 10 July 2011 (UTC)
I'll add a special case for this template. I'm sure there will be others that come up over time, but it's easy enough to add more cases if needed. -- JLaTondre (talk) 19:25, 10 July 2011 (UTC)

I notice that (e.g.) Nature is cited 22044 times in 22044 articles. I know for a fact that many articles cite Nature multiple times, so by all logic, the citation count should be greater than the article count. Headbomb {talk / contribs / physics / books} 15:28, 10 July 2011 (UTC)

Silly mistake. It was displaying the citation count when the article count was over 5. Fixed and I should have an updated run posted later today. -- JLaTondre (talk) 19:25, 10 July 2011 (UTC)
Update posted. -- JLaTondre (talk) 02:11, 11 July 2011 (UTC)
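
To illustrate the distinction between the two counts, here is a minimal sketch. It assumes citations are available as (article, journal) pairs, one per citation; that input format and the function name are hypothetical and not taken from the bot.

```python
from collections import defaultdict


def count_citations(citations):
    """Return {journal: (citation_count, article_count)}.

    `citations` is an iterable of (article_title, journal_name) pairs,
    one pair per citation found (an assumed input format).
    """
    citation_count = defaultdict(int)
    citing_articles = defaultdict(set)
    for article, journal in citations:
        citation_count[journal] += 1          # every citation counts
        citing_articles[journal].add(article)  # each article counts once
    return {j: (citation_count[j], len(citing_articles[j])) for j in citation_count}
```

An article citing the same journal twice adds two to the citation count but only one to the article count, so the citation count can never be the smaller of the two.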
Cool. That seems to be working fine. Although I did notice some more weirdness. For example in Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/A1 Academic Medicine is reported twice, once bolded with target, the other unbolded, without target. I think it has to do with the number of spaces (aka some articles have Academic_Medicine and others have Academic__Medicine). Obviously those spaces should be stripped before processing. Headbomb {talk / contribs / physics / books} 03:18, 11 July 2011 (UTC)
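
A small normalization step along these lines would merge such duplicates. This is a sketch under assumptions: it treats underscores as spaces (as MediaWiki does in page titles), collapses runs of whitespace, and also applies the first-letter uppercasing agreed on earlier. The function name is hypothetical.

```python
import re


def normalize_journal(name: str) -> str:
    """Collapse whitespace/underscore runs and uppercase the first letter."""
    collapsed = re.sub(r"[\s_]+", " ", name).strip()
    # Slicing keeps this safe for empty strings.
    return collapsed[:1].upper() + collapsed[1:]
```

With this, `Academic  Medicine` (two spaces) and `Academic Medicine` become the same key, as do `title` and `Title`.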

I also notice that the "rank" is a bit odd. Like you'll have 1/2/3/3/3/4/4/4/4/4/5/6/6... A more "standard" way to rank them would be 1/2/3/3/3/6/6/6/6/6/11/12/12. Headbomb {talk / contribs / physics / books} 06:41, 11 July 2011 (UTC)
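
The two schemes being compared here are usually called dense ranking (1/2/3/3/3/4...) and standard competition ranking (1/2/3/3/3/6...). A short sketch of both, assuming the counts are already sorted in descending order; the function names are illustrative only.

```python
def dense_ranks(counts):
    """Ties share a rank and the next rank follows immediately (no gaps)."""
    ranks = []
    for i, count in enumerate(counts):
        if i == 0:
            ranks.append(1)
        elif count == counts[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks


def competition_ranks(counts):
    """Ties share a rank and the next rank skips ahead by the size of the tie."""
    ranks = []
    for position, count in enumerate(counts, start=1):
        if ranks and count == counts[position - 2]:
            ranks.append(ranks[-1])  # same count as the previous entry
        else:
            ranks.append(position)   # rank equals 1-based position in the list
    return ranks
```

For counts of `[90, 80, 70, 70, 70, 60, 60, 60, 60, 60, 50, 40, 40]`, the first function yields 1/2/3/3/3/4/4/4/4/4/5/6/6 and the second 1/2/3/3/3/6/6/6/6/6/11/12/12.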

Sidenote: The most recent dump is from June 20 I think. Don't know which you are using, but the fresher the data, the sexier the JCW compilation is. Headbomb {talk / contribs / physics / books} 16:33, 12 July 2011 (UTC)
It is the June 20 dump. I fixed the spaces issue and the rank. I don't plan on posting a new version until more issues are found or the next dump is available (or if I find more changes that make a critical mass, as I still need to finish going through the pages myself). Let me know if that's an issue. -- JLaTondre (talk) 21:13, 12 July 2011 (UTC)
Usually [as in when stuff is fleshed-out], one run per dump is all it takes. The rank/spacing issues were the only "common" ones remaining, so AFAICT, any run after the spacing/ranking fixes should be the last for this dump. There may be some other tweaks to be made, but nothing warranting a new run before the next dump.
As a point of curiosity, how hard would it be to do something similar for publishers (using |publisher=) instead of journals? Headbomb {talk / contribs / physics / books} 21:23, 12 July 2011 (UTC)
If the logic is the same, just showing the contents of the |publisher= field instead of the |journal= field, it would be easy. I may have some free time next week to look at it. I assume it would be Wikipedia:WikiProject Academic Journals/Publishers cited by Wikipedia paralleling WP:JCW? -- JLaTondre (talk) 20:30, 14 July 2011 (UTC)
It might be better hosted at Wikipedia:WikiProject Books, since these would also include {{cite book}} in addition to the usual {{cite journal}} and {{citation}}, and the majority of publishers deal in books rather than journals. I'll ask them what they think about it (see Wikipedia talk:WikiProject Books#Publishers cited by Wikipedia).
As for the rest, the only code updates I can think of with regards to the crunching are to take |publisher= instead of |journal=, add {{cite book}} (and redirects) to the list of templates considered, and use the (publisher) disambiguator instead of the (magazine)/(journal) disambiguator fancy logic. Some limited trial would be needed to make sure that we're not overlooking anything. I never really dealt with publishers much, so I don't know the idiosyncrasies that come with them. Headbomb {talk / contribs / physics / books} 21:00, 14 July 2011 (UTC)
Ah, I thought you were interested in publishers of journals. Broadening it is not a technical issue (other than taking longer to run, but that's fine). The publisher field is used more widely than just for books and journals. For example, {{cite web}} has a publisher field. A lot would be driven by which project wants the list and what they intend to do with it. -- JLaTondre (talk) 22:51, 19 July 2011 (UTC)

Found something in need of a tweak. "Most popular" list shouldn't depend on whether the corresponding article exists or not. For example, the most popular missing journal was "Ap J." with 483 citations, which should put it somewhere around #102 on the list of "most popular". Headbomb {talk / contribs / physics / books} 08:01, 16 July 2011 (UTC)

Okay, I'll change that. -- JLaTondre (talk) 22:51, 19 July 2011 (UTC)