Wikipedia:Longstanding unreferenced articles

Proposal for finding longstanding bad articles

Thomas Ranch is about to be deleted, having existed on Wikipedia for over fifteen years without a single inline reference, and without ever having an external link to an independent source. Grub Smith, though a somewhat longer article, has been in a similar state for an equally long time. I happened to come across these two not while searching for suspect articles but while doing general cleanup around the given name Thomas and the surname Smith. How many more articles are out there in this condition? I propose that the best way to find out, if technically feasible, would be to generate a list of articles that have never had an inline ref tag over the course of their existence, sorted by age, and work through them from the oldest forward. If someone has the Wiki-fu to generate such a list, please have at it. BD2412 T 00:20, 5 April 2021 (UTC)[reply]

"Never over the course of their existence" is probably a tall order, but WP:RAQ is a better first stop. Izno (talk) 01:51, 5 April 2021 (UTC)[reply]
Yes, a tall order, but great efforts yield great rewards. Thanks for the pointer. BD2412 T 02:41, 5 April 2021 (UTC)[reply]
  • BD2412, I seem to remember that the database replicas only include metadata, not the actual page contents. To get the contents, you need to go through the API (which is much slower). Somebody needs to check me on that. Just as a proof-of-concept, I wrote a trivial little Python script that iterates over all pages in mainspace and searches for <ref> anywhere in the text. It's processing about 10 pages/second. We've got about 6 million articles (from WP:STATS, which I assume is talking about mainspace when it says, "6,281,819 articles"). So, we could scan every mainspace article in about a week. One could envision doing that. I'm guessing the number of revisions is 2 orders of magnitude higher, so searching every revision of every article would likely be prohibitive. First guess, a couple of years. I'm not sure what value it would add anyway. -- RoySmith (talk) 02:49, 5 April 2021 (UTC)[reply]
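    For illustration, a minimal sketch of the kind of scan described above, iterating over mainspace pages with the MediaWiki API's allpages generator via the requests library. This is a guess at the script's shape, not RoySmith's actual code; the batch size and the bare "<ref" substring test are assumptions:
<syntaxhighlight lang="python">
import requests

API = "https://en.wikipedia.org/w/api.php"
session = requests.Session()

def mainspace_pages():
    """Yield (title, wikitext) for every mainspace page via the API."""
    params = {
        "action": "query",
        "generator": "allpages",
        "gapnamespace": 0,        # mainspace only
        "gaplimit": 50,           # pages per request (assumption)
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": 2,
    }
    while True:
        data = session.get(API, params=params).json()
        for page in data.get("query", {}).get("pages", []):
            revisions = page.get("revisions")
            if revisions:
                yield page["title"], revisions[0]["slots"]["main"]["content"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # API-supplied continuation token

for title, text in mainspace_pages():
    if "<ref" not in text:
        print(title)
</syntaxhighlight>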
    We could cut that down significantly by first removing from the search every article that has a ref tag right now (which should be a substantial majority), and of those that remain, only searching articles created before, say, 2007. The value would just be clearing out old garbage. BD2412 T 03:00, 5 April 2021 (UTC)[reply]
A database scan should work. I tried searching for a ref tag on a much smaller wiki than enwiki and it seems to work. SQL queries would not work, because page texts are not stored in an SQL table.--Snaevar (talk) 08:00, 5 April 2021 (UTC)[reply]
Snaevar, What query did you perform? Can you link to a Quarry page? -- RoySmith (talk) 14:02, 5 April 2021 (UTC)[reply]
They're likely talking about database dumps. Quarry won't work as you say – it doesn't have page texts. Pinging HaeB who I think can comment on the feasibility of processing the dumps for this task. – SD0001 (talk) 16:55, 5 April 2021 (UTC)[reply]
I did a database scan using AWB on a small wiki: I simply searched the main namespace for pages which do not have <ref>. That kind of search is done offline. The dump file contained all pages, with only the most current version of each (pages-meta-current in the filename). The dump file was downloaded from dumps.wikimedia.org.--Snaevar (talk) 19:33, 5 April 2021 (UTC)[reply]
As noted in the WP:RAQ discussion, inquiries can further be narrowed to articles categorized in the Category:Companies category tree, or articles categorized in Category:Living people. That's where these sorts of problems appear most likely to arise. BD2412 T 15:15, 5 April 2021 (UTC)[reply]
@BD2412: why not begin with something like Category:Articles lacking sources from December 2006? For companies specifically use something like Cleanup listing for WikiProject Business (currently does not exist for WikiProject Companies but you can request one). – Finnusertop (talkcontribs) 16:17, 5 April 2021 (UTC)[reply]
Neither of the articles noted above was so tagged, perhaps because they have external links (even though these link to sites unusable as sources). I am looking for things that have really slipped through the cracks. BD2412 T 16:20, 5 April 2021 (UTC)[reply]
Sure, BD2412, but if some type of action is needed on articles with these specific problems, it would make sense to start with the ones that have already been identified. But I get your point. I sometimes run across low-traffic articles with obvious problems that have slipped through the cracks as well and tag them. – Finnusertop (talkcontribs) 16:24, 5 April 2021 (UTC)[reply]

Does this seem plausible? -- RoySmith (talk) 02:27, 7 April 2021 (UTC)[reply]

BTW, https://github.com/roysmith/bad-articles -- RoySmith (talk) 02:31, 7 April 2021 (UTC)[reply]
Thanks. Somewhat remarkably, up until now none of those articles were tagged as needing sources. I suspect that there are more out there, though. BD2412 T 04:04, 8 April 2021 (UTC)[reply]
@RoySmith Scanning 994,949 pages in just under 6 minutes: was that using the API or the dumps? Either way, wow!
The latter sounds implausible, though, as there's a list of unreferenced BLPs at User:SDZeroBot/Unreferenced_BLPs and it's a lot more than 7. Or did you mean pages that don't already have an unsourced tag? – SD0001 (talk) 20:27, 8 April 2021 (UTC)[reply]
SD0001, Through the API. This is running inside a WMF data center, so I assume it's got a lot more bandwidth to the API servers than you or I would have from a remote machine. Yeah, it seemed amazing, which is why I was looking for confirmation that it made sense. Clearly something isn't right; I'll dig more. -- RoySmith (talk) 20:55, 8 April 2021 (UTC)[reply]
Oh, my. Apparently I flunked Programming 101, as well as playing hookey during both semesters of Software Engineering 101 where they introduced the concept of testing one's code before publishing it, compounded by writing code past one's bedtime and further exacerbated by failing to perform even the most rudimentary sanity checks.
I've got a new version now. It should process articles at about 0.01% of the speed of the original, but with the offsetting advantage of actually doing something useful. It finds, for example, Odalys Adams, which makes up for its lack of references by having 22 categories. It was on track to finish in about 24 hours, but I killed it to avoid beating up on the servers too much. I'm working on getting access to the dump files and I'll switch to using those once that's done. -- RoySmith (talk) 21:26, 8 April 2021 (UTC)[reply]
😂. Fetching a million pages in minutes did sound suspicious! As for dumps, I see they're there on toolforge at /mnt/nfs/dumps-labstore1006.wikimedia.org/enwiki/latest. – SD0001 (talk) 13:24, 9 April 2021 (UTC)[reply]

Has an enwiki search with -insource:"<ref" been mentioned? The results include XXX (a disambiguation page) and Hitler Youth (has heaps of references using {{sfn}} but no <ref> tags). To limit the search to a category, use incategory:"Living people" -insource:"<ref" Johnuniq (talk) 00:33, 9 April 2021 (UTC)[reply]
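A query like this can also be run programmatically: the search API (list=search) accepts the same CirrusSearch syntax. A minimal sketch, with the punctuation caveat discussed just below applying equally here:
<syntaxhighlight lang="python">
import requests

API = "https://en.wikipedia.org/w/api.php"

def search_titles(cirrus_query, limit=50):
    """Run a CirrusSearch query through the API and return matching titles."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": cirrus_query,
        "srnamespace": 0,   # mainspace
        "srlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    data = requests.get(API, params=params).json()
    return [hit["title"] for hit in data["query"]["search"]]

# Johnuniq's query: living people with no <ref> in the wikitext.
for title in search_titles('incategory:"Living people" -insource:"<ref"'):
    print(title)
</syntaxhighlight>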

The search timed out at 200k results, but I skimmed some and most did not have an inline ref on the page in question. Izno (talk) 00:47, 9 April 2021 (UTC)[reply]
Johnuniq, Can you actually search for "<ref", with the leading punctuation? I thought all punctuation was ignored by the search indexer. -- RoySmith (talk) 02:17, 9 April 2021 (UTC)[reply]
You are correct—from Help:Searching#insource:, "non-alphanumeric characters are ignored". I'm a newbie at searching and started with insource:/regexp/ (slashes instead of quotes), which (I believe) does allow punctuation following regex rules. However, the regexp timed out, so I switched to quotes without much thought. At any rate, my point was that dab pages need to be ignored (which happens automatically if using Category:Living people) and that some found pages may have no ref tags yet still be good because they use one of the new-fangled referencing methods. Johnuniq (talk) 03:06, 9 April 2021 (UTC)[reply]
  • @BD2412: I just got access to the dump files, so I took another whack at this. This is not by any stretch of the imagination production-quality code, but here's the script I ran. The heuristics it uses for detecting what's a BLP, what's a redirect, what's a reference, etc., are quite crude. I haven't done any performance measurements, but my gut feeling is that most of the time went to plain I/O, decompressing the bzip2 stream, and parsing the XML, so adding more sophisticated checks wouldn't slow things down appreciably. This was done with a brute-force single-threaded process. If we wanted to turn this into a production tool run on a regular schedule, I could easily see 1-2 orders of magnitude speedup.
    This was run against the enwiki-20210320-pages-articles.xml.bz2 dump file (only current revisions). It looked at approximately 21 million pages (all namespaces), of which just shy of 1 million appeared to be BLPs, and flagged 43,399 pages as apparently unreferenced, in about 5.8 hours. It produced this list. -- RoySmith (talk) 16:11, 14 April 2021 (UTC)[reply]
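    The linked script is not reproduced here, but a rough sketch of the same brute-force, single-threaded dump scan might look like the following. The heuristics (redirect check, Living people category, bare "ref" substring) are the crude ones described above, and the schema namespace URI is an assumption that depends on the dump version:
<syntaxhighlight lang="python">
import bz2
import re
import xml.etree.ElementTree as ET

# Namespace for 2021-era dumps; check the <mediawiki> root element of your dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"
LIVING = re.compile(r"\[\[\s*category\s*:\s*living people", re.IGNORECASE)

def probable_unreferenced_blps(dump_path):
    """Stream a pages-articles dump and yield titles of likely unreferenced BLPs."""
    with bz2.open(dump_path, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)  # grab the <mediawiki> root element
        for event, elem in context:
            if event != "end" or elem.tag != NS + "page":
                continue
            ns = elem.findtext(NS + "ns")
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            if (ns == "0"
                    and not text.lstrip().lower().startswith("#redirect")
                    and LIVING.search(text)
                    and "ref" not in text.lower()):
                yield title
            root.clear()  # drop finished pages so memory stays flat

for title in probable_unreferenced_blps("enwiki-20210320-pages-articles.xml.bz2"):
    print(title)
</syntaxhighlight>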
    @BD2412: is this what you were looking for? -- RoySmith (talk) 13:55, 15 April 2021 (UTC)[reply]
    I'm not sure how to read that, honestly. The output I'm looking for is very old articles (BLPs being a priority among them) that have never been sourced. BD2412 T 16:02, 15 April 2021 (UTC)[reply]
    It's a list of article titles that are probably BLPs and probably have no refs in their current revision. (Though without looking at the source I can't imagine how M would be in either of those groups. Why not just look for Category:Living people and absence of <ref> in the wikitext?) Is it any more useful when the page age is added, as at quarry:query/54124? —Cryptic 16:54, 15 April 2021 (UTC)[reply]
    Cryptic, Wow, I have no idea how M found its way in there. I'll work on figuring that out; thanks for spotting it.
    As for looking for "<ref>", the problem was that some articles have things like "< ref >" in them (really, I've seen that). Just looking for "ref" seemed like a reasonable first pass that could be done quickly. I was thinking that if this at least got us close, I could do a second pass with something that actually parsed the wikitext. Similarly for "Category:Living people" vs "Category: Living people" (space vs no space after the colon).
    I'm also thinking once we've got a definitive list of current versions that have no references, then we could delve into the history of each article. But at least start with the current versions and get that right. -- RoySmith (talk) 22:13, 15 April 2021 (UTC)[reply]
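    For that second pass, whitespace-tolerant patterns along these lines would handle both of the variants just mentioned (a hypothetical sketch, not code from the repo):
<syntaxhighlight lang="python">
import re

# Tolerant of "< ref >" and of "Category: Living people" (space after the colon).
REF_TAG = re.compile(r"<\s*ref[\s>/]", re.IGNORECASE)
LIVING_CAT = re.compile(r"\[\[\s*Category\s*:\s*Living people\s*[\]|]", re.IGNORECASE)

def looks_like_unreferenced_blp(wikitext):
    """True if categorized as a living person and lacking any ref tag."""
    return bool(LIVING_CAT.search(wikitext)) and not REF_TAG.search(wikitext)

print(looks_like_unreferenced_blp(
    "Jane Doe is a person.\n[[Category: Living people]]"))    # True
print(looks_like_unreferenced_blp(
    "John Roe.< ref >x</ref>\n[[Category:Living people]]"))   # False
</syntaxhighlight>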
    Well, I've figured out where M came from. It turns out that in the XML dumps, something like:
    <title>Heckler & Koch BASR</title>
    generates a title node with multiple child nodes to hold the text. I was just grabbing the first child and assuming I had the full title. The really weird thing is that I got different results depending on whether I uncompressed the bzip2 stream in-line with the XML parsing or uncompressed it first into a (very large) text file. I guess XML parsers are free to break the text up into smaller chunks or return it as one piece as convenient?
    Life was so much easier when there was no XML, no unicode, and "big data" meant more than one box of cards. -- RoySmith (talk) 00:30, 17 April 2021 (UTC)[reply]
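    For anyone hitting the same pitfall: SAX-style parsers may deliver one element's text across several characters() callbacks (an entity reference such as the ampersand in that title is a typical split point), so a handler has to accumulate chunks rather than keep only the first. A minimal illustration, not RoySmith's actual code:
<syntaxhighlight lang="python">
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.in_title = False

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.chunks = []

    def characters(self, content):
        if self.in_title:
            self.chunks.append(content)   # accumulate, don't overwrite

    def endElement(self, name):
        if name == "title":
            self.in_title = False
            print("".join(self.chunks))   # "Heckler & Koch BASR"

xml.sax.parseString(b"<title>Heckler &amp; Koch BASR</title>", TitleHandler())
</syntaxhighlight>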
  • BTW, does anybody know how to pin this discussion so it doesn't get archived yet? Or is the archiver smart enough to note that the thread is still active? -- RoySmith (talk) 22:16, 15 April 2021 (UTC)[reply]
    You need {{subst:pin section}}. --Redrose64 🌹 (talk) 22:50, 15 April 2021 (UTC)[reply]

Progress

@BD2412: I've made some progress here, mostly on getting my head around how to work with the XML data dumps. I can make a scan through the current version of every article in about 6 hours. Going back through all the old revisions of an article is roughly 50 times more data. How important is it to differentiate between "articles that have no references" and "articles that have never had any references in past revisions"? Note that the way the dumps are organized, you pretty much have to do a linear scan through all the data. Optimizations like "only look at articles in a certain category" don't save you any work.
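If the every-revision variant turns out to be needed, the same streaming approach would extend to the pages-meta-history dumps; a sketch assuming each <page> element carries all of its <revision> children (the filename is only an example, and the namespace URI is again an assumption):
<syntaxhighlight lang="python">
import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed schema version

def never_had_a_ref(dump_path):
    """Yield titles of mainspace pages where no stored revision contains 'ref'."""
    with bz2.open(dump_path, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)
        for event, elem in context:
            if event != "end" or elem.tag != NS + "page":
                continue
            # Caution: a heavily edited page holds all its revisions in memory here.
            if elem.findtext(NS + "ns") == "0" and all(
                    "ref" not in (rev.findtext(NS + "text") or "").lower()
                    for rev in elem.iter(NS + "revision")):
                yield elem.findtext(NS + "title")
            root.clear()

# Example chunk name only; real history dumps are split into many numbered files.
for title in never_had_a_ref("enwiki-20210320-pages-meta-history1.xml.bz2"):
    print(title)
</syntaxhighlight>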

My current list is about 43k articles that are likely to be BLPs and have no references in the current revision. Even that list is too long to be useful if the intent is that a human is going to check them over.

What do you want to do with redirects? I'm assuming we don't care about anything which is currently a redirect, regardless of its previous history? -- RoySmith (talk) 14:38, 18 April 2021 (UTC)[reply]

My interest in finding articles that have never had references is that they likely have never been vetted. An article for which past references have been removed either had bad references or has been vandalized. If the latter, I suppose that too would need fixing, but if someone has removed bad references but kept the article, they might have exercised the judgment to determine that the topic itself is notable. BD2412 T 03:01, 20 April 2021 (UTC)[reply]