Wikipedia:Request a query

(Redirected from Wikipedia:SQL requests)
Latest comment: 5 hours ago by CactiStaccingCrane in topic Query to get all articles without inline citations

This is a page for requesting one-off database queries for certain criteria. Users who are interested and able to perform SQL queries on the projects can provide results from the Quarry website.

You may also be interested in the following:

  • If you are interested in writing SQL queries or helping out here, visit our tips page.
  • If you need to obtain a list of article titles that meet certain criteria, consider using PetScan (user manual) or the default search. Petscan can generate list of articles in subcategories, articles which transclude some template, etc.
  • If you need to make changes to a number of articles based on a particular query, you can post to the bot requests page, depending on how many changes are needed.
  • For long-term review and checking, database reports are available.

Quarry does not have access to page content, so queries which require checking wikitext cannot be answered with Quarry. However, someone may be able to assist by using Quarry in another way (e.g. checking the table of category links rather than the "Category:" text) or suggest an alternative tool.

Women in Red edit

I'm not exactly requesting a query, but help with one we already have: Wikipedia:WikiProject_Women_in_Red/Redlinks/C2. This version ran in 37 seconds. Some time between 20 and 25 February, it stopped working, instead throwing SQL Error: ER_DATA_TOO_LONG: Data too long for column 'subcat' at row 2. This is because it initially populates a table with one row, a fairly short category name, then attempts to add longer names. This used to work but something in the database or engine must have changed, as the same SQL now fails. I've fixed it by seeding the table with a long dummy value which doesn't affect the results, but that version takes nearly two hours to run. Does anyone know what changed? Is part of the forthcoming categorylinks "upgrade" causing the failure and wrecking performance? (A simple search for a link to details of that change timed out too – perhaps there is a more general temporary performance problem but VPT has nothing, and performance was just as bad last night.) Certes (talk) 15:24, 24 March 2024 (UTC)Reply

I poked at this a little when you first asked it, and some more today, and I wasn't able to come up with a version that avoided a full table scan on pagelinks - taken in isolation, the CTE completes instantly, and I can do something like WITH names AS ( /* the long cte */ ) SELECT COUNT(*) FROM names JOIN pagelinks ON pl_namespace = 0 AND pl_title = name GROUP BY name, but every variant of "pl_title starts with name" I could come up with - name=SUBSTRING_INDEX like you used, pl_title LIKE CONCAT(name, '%'), LEFT(pl_title, LENGTH(name)) = name, and so on - was unindexed. Which is why your query taken as a whole is looking through all of page first, when I suspect it was able to make use of the Cn..D range limit before. Irritatingly, putting the CTE results directly in the query instead like quarry:query/81913 does work, but I don't know a workaround to force use of the index when we don't have access to the real pagelinks table, just a view. (And no, I don't know what changed.)
Something like SELECT CAST('Feminine_given_names' AS VARCHAR(256)), 0 would be cleaner than your long 'This category does not exist but...' dummy value, but obviously that's not the real problem here.
What, precisely, are you trying to do? Maybe we can find another way to do it? —Cryptic 17:50, 10 April 2024 (UTC)Reply
I was listing frequently used redlinks which resemble women's names, so that the editors at Women in Red can consider writing articles on them. Of course, this is heuristic and contains false positives such as Ms May Rebellion, but they found it helpful and have created many wanted articles from it. I split it by initial letter to prevent timeouts by allowing index use, and to avoid overwhelming the authors by requesting hundreds of articles at once. Further details: Wikipedia:WikiProject Women in Red/Redlist index#Alphabetical (most frequently redlinked), User talk:Certes#WIR Redlist problem, User talk:Certes/Archive 9#Frequently redlinked women. Certes (talk) 19:05, 10 April 2024 (UTC)Reply
Is WHERE pl_from_namespace IN (0, 10) ... HAVING MAX(pl_from_namespace) = 0 meant only to exclude titles linked from any template? The query completes quickly if the first half of that is changed to WHERE pl_from_namespace = 0. Explain output for query 81446 as written (sql-optimizer seems to choke on queries with ctes):
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+-----------------+---------+---------------------------------+----------+-----------------------------------------------------------+
| id   | select_type     | table         | type   | possible_keys                                                                | key             | key_len | ref                             | rows     | Extra                                                     |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+-----------------+---------+---------------------------------+----------+-----------------------------------------------------------+
|    1 | PRIMARY         | page          | index  | PRIMARY,page_name_title                                                      | page_name_title | 261     | NULL                            | 57680411 | Using where; Using index; Using temporary; Using filesort |
|    1 | PRIMARY         | pagelinks     | ref    | PRIMARY,pl_namespace,pl_backlinks_namespace,pl_backlinks_namespace_target_id | PRIMARY         | 8       | enwiki.page.page_id,const       |        4 | Using where                                               |
|    1 | PRIMARY         | <derived5>    | ref    | key0                                                                         | key0            | 258     | func                            |        1 | Using where                                               |
|    1 | PRIMARY         | page          | eq_ref | page_name_title                                                              | page_name_title | 261     | const,enwiki.pagelinks.pl_title |        1 | Using index                                               |
|    5 | DERIVED         | <derived2>    | ALL    | NULL                                                                         | NULL            | NULL    | NULL                            |        2 | Using where; Using temporary                              |
|    5 | DERIVED         | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey                                              | cl_sortkey      | 258     | deepcat.subcat,const            |       19 | Using where; Using index                                  |
|    5 | DERIVED         | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY         | 4       | enwiki.categorylinks.cl_from    |        1 | Using where                                               |
|    2 | DERIVED         | NULL          | NULL   | NULL                                                                         | NULL            | NULL    | NULL                            |     NULL | No tables used                                            |
|    3 | UNION           | NULL          | NULL   | NULL                                                                         | NULL            | NULL    | NULL                            |     NULL | No tables used                                            |
|    4 | RECURSIVE UNION | <derived2>    | ALL    | NULL                                                                         | NULL            | NULL    | NULL                            |        2 | Using where                                               |
|    4 | RECURSIVE UNION | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey                                              | cl_sortkey      | 258     | deepcat.subcat,const            |       19 | Using where; Using index                                  |
|    4 | RECURSIVE UNION | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY         | 4       | enwiki.categorylinks.cl_from    |        1 | Using where                                               |
| NULL | UNION RESULT    | <union2,3,4>  | ALL    | NULL                                                                         | NULL            | NULL    | NULL                            |     NULL |                                                           |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+-----------------+---------+---------------------------------+----------+-----------------------------------------------------------+

and the same for WHERE pl_from_namespace = 0:

+------+-----------------+---------------+--------+------------------------------------------------------------------------------+------------------------+---------+---------------------------------+----------+-----------------------------------------------------------+
| id   | select_type     | table         | type   | possible_keys                                                                | key                    | key_len | ref                             | rows     | Extra                                                      |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+------------------------+---------+---------------------------------+----------+-----------------------------------------------------------+
|    1 | PRIMARY         | pagelinks     | range  | PRIMARY,pl_namespace,pl_backlinks_namespace,pl_backlinks_namespace_target_id | pl_backlinks_namespace | 265     | NULL                            | 50732398 | Using where; Using index; Using temporary; Using filesort |
|    1 | PRIMARY         | <derived5>    | ref    | key0                                                                         | key0                   | 258     | func                            |        1 | Using where                                               |
|    1 | PRIMARY         | page          | eq_ref | page_name_title                                                              | page_name_title        | 261     | const,enwiki.pagelinks.pl_title |        1 | Using index                                               |
|    1 | PRIMARY         | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY                | 4       | enwiki.pagelinks.pl_from        |        1 | Using where                                               |
|    5 | DERIVED         | <derived2>    | ALL    | NULL                                                                         | NULL                   | NULL    | NULL                            |        2 | Using where; Using temporary                              |
|    5 | DERIVED         | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey                                              | cl_sortkey             | 258     | deepcat.subcat,const            |       19 | Using where; Using index                                  |
|    5 | DERIVED         | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY                | 4       | enwiki.categorylinks.cl_from    |        1 | Using where                                               |
|    2 | DERIVED         | NULL          | NULL   | NULL                                                                         | NULL                   | NULL    | NULL                            |     NULL | No tables used                                            |
|    3 | UNION           | NULL          | NULL   | NULL                                                                         | NULL                   | NULL    | NULL                            |     NULL | No tables used                                            |
|    4 | RECURSIVE UNION | <derived2>    | ALL    | NULL                                                                         | NULL                   | NULL    | NULL                            |        2 | Using where                                               |
|    4 | RECURSIVE UNION | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey                                              | cl_sortkey             | 258     | deepcat.subcat,const            |       19 | Using where; Using index                                  |
|    4 | RECURSIVE UNION | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY                | 4       | enwiki.categorylinks.cl_from    |        1 | Using where                                               |
| NULL | UNION RESULT    | <union2,3,4>  | ALL    | NULL                                                                         | NULL                   | NULL    | NULL                            |     NULL |                                                           |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+------------------------+---------+---------------------------------+----------+-----------------------------------------------------------+

The first doesn't look like it'll look through so many more rows, but it's definitely taking forever, and the table order makes less sense. —Cryptic 20:03, 10 April 2024 (UTC)Reply

quarry:query/81916 if my reading of your intention above was correct. —Cryptic 20:13, 10 April 2024 (UTC)Reply
And quarry:query/81918 is a more (ahem) straightforward fix. —Cryptic 20:29, 10 April 2024 (UTC)Reply
  Fixed @Cryptic: Thank you so much. So it needs a STRAIGHT_JOIN rather than a JOIN, to force the tables to be processed in the suggested order. I've put that into C2 and it works, so I'll update the other queries.
I've also neatened the table creation as you suggest. I had something similar originally but replaced it by a simpler kludge, because the CAST was the only code I'd changed when the timing went off and I suspected it was causing the problem, perhaps by making columns to be joined have incompatible types. Certes (talk) 20:54, 10 April 2024 (UTC)Reply
...and yes, HAVING is to exclude people in templates, who might be linked from hundreds of articles just because their name appears in one navbox. Certes (talk) 20:56, 10 April 2024 (UTC)Reply
I'd bet money that what changed is just that pagelinks is incrementally bigger. The optimizer's wildly overestimating the cost of scanning it, so that looking at page AS Pf became more attractive. Removing namespace 10 helped not because the database horrifically couldn't use two disjoint ranges anymore, but because it lowered the estimated cost just enough to scan it first. What confused me last month is that I thought it was looking at page AS Pt instead, and that it was primarily the cte that was meant to cut the number of rows down. —Cryptic 21:06, 10 April 2024 (UTC)Reply
That sounds very plausible. I remember from my SQL-writing days decades ago that query plans would suddenly tip over from the desired order to something unhelpful when data sizes changed. I've applied the fix to the similar reports with other initials, rerun the broken ones and everything's now working. Thanks again. Certes (talk) 21:20, 10 April 2024 (UTC)Reply
I've also updated the first chapter of Men in Red, though currently no one seems interested in its output. There's a biologist there with over 100 links (from actual articles, not navboxes) and articles in French, German and Spanish. Certes (talk) 15:21, 11 April 2024 (UTC)Reply

@Cryptic: Just as we got this working, the WMF is about to improve the pagelinks table by removing the useful columns. I've tried updating a query but of course it now times out. Any further help would be very welcome. Certes (talk) 20:45, 18 April 2024 (UTC)Reply

On Toolforge, I got results for quarry:query/82121, a straightforward update from 81918 above, in 4 seconds, and the query from your diff in 4 minutes 37. They had identical query plans, modulo the extra union from 'This category does not exist...', except that the first estimated 1743728 rows from linktarget (index lt_namespace_title) and the second 3057442. If SDZeroBot's timing out, I'd try with a narrower title range. —Cryptic 21:50, 18 April 2024 (UTC)Reply
Thanks for investigating. We may be able to complete this task before the columns we need are dropped. If not then I'll take your advice. Certes (talk) 15:12, 19 April 2024 (UTC)Reply

Redirects connected to a Wikidata item that target a disambiguation page edit

I attempted this in quarry:query/82243, but I may well have missed things that would make it more efficient. I’m wondering if the time it takes to run is purely a symptom of having to look through two very large categories, or if anyone’s aware of things that might improve it. All the best :) ‍—‍a smart kitten[meow] 18:14, 24 April 2024 (UTC)Reply

The GROUP_CONCAT() followed by FIND_IN_SET() is going to perform a lot worse than just joining the second category - quarry:query/82254. —Cryptic 20:10, 24 April 2024 (UTC)Reply
Ah, thank you! I don't know why that didn't occur to me. {{self-trout}} ‍—‍a smart kitten[meow] 20:48, 24 April 2024 (UTC)Reply

Check for company articles that use both Parent and Owner field. edit

I'm wondering if it would be possible to check for articles that use both the Parent and Owner field in infobox company. The reason being, I have noticed that many articles incorrectly use the infobox company template to show higher-level ownership. This is not allowed per the infobox documentation. I have fixed this on most pages. However, there are many that I cannot find. WiinterU 22:45, 27 April 2024 (UTC)Reply

@WiinterU: Quarry can't do that, because it doesn't have access to the wikitext. This search should catch most of the offenders but will contain a lot of false positives, such as articles with parent= and owner= in different infoboxes. Certes (talk) 08:35, 28 April 2024 (UTC)Reply
Okay, thank you. WiinterU 12:31, 28 April 2024 (UTC)Reply
(edit conflict) Hah. I typed up almost the same thing as Certes, so will defer to their great answer. –Novem Linguae (talk) 09:05, 28 April 2024 (UTC)Reply

Query to get all articles without inline citations edit

Can somebody here help me write a query for all articles that don't have <ref> or {{sfn or {{sfnm or {{harvnb in the article's wikicode? Here is the context of the discussion: User_talk:CactiStaccingCrane#Untagged_unreferenced_articles CactiStaccingCrane (talk) 13:41, 4 May 2024 (UTC)Reply

I don't believe occurrences of <ref> are stored in any table to which Quarry has access. Certes (talk) 20:09, 4 May 2024 (UTC)Reply
Agreed. According to my sandbox, ref tags aren't placed into any categories, so we can't search the categorylinks table either. The others that OP mentioned can be queried via the templatelinks table. Or, for new pages, you can use Special:NewPagesFeed, and tick the filter for "have no citations", which should detect most of these but maybe not all (I forget which ones). WP:PETSCAN could also be useful since I think it lets you search by templates and/or categories. OP, what direction would you like to go in with this? –Novem Linguae (talk) 20:30, 4 May 2024 (UTC)Reply
It's also possible to query for use of {{Reflist}} and similar, though many articles use a bare <references /> tag. Certes (talk) 20:43, 4 May 2024 (UTC)Reply
We could generate a list of pages without any external links at all, or no external links except to specific sites like imdb. That's functionally the same thing - there's no real difference between a formally unreferenced page that has its subject's official site properly placed in ==External links==, and one that plugs it into ref tags instead. —Cryptic 22:10, 4 May 2024 (UTC)Reply
Or pages that don't transclude anything in the Category:Citation templates tree. Which will miss bare <ref>[external link]</ref>s, of course; I suppose we could mitigate by also excluding pages in Category:All articles with bare URLs for citations, and tag the remaining false positives as they're noticed. —Cryptic 22:19, 4 May 2024 (UTC)Reply
You're looking at quite a lot of pages. For example, there are about 30,000 with titles beginning with A. [1]. Many of them have a References section with plausible looking citations but no inline refs. Certes (talk) 22:56, 4 May 2024 (UTC)Reply
CactiStaccingCrane, I wonder if it would be easier to start with the known 50,000 articles that transclude {{no footnotes}}? Once editors are done with those, come on back and we'll figure out how to find more. – Jonesey95 (talk) 03:39, 5 May 2024 (UTC)Reply
The linked discussion shows the original user was specifically looking for unreferenced pages that aren't already tagged. (I can't imagine why, unless the intent is merely to tag them without actually looking for sources, but whatever.) —Cryptic 07:13, 5 May 2024 (UTC)Reply
Yeah that's exactly the point. Both of us are on WP:WikiProject Unreferenced articles and we want to tag all articles that don't have citations before launching another drive around June-July. CactiStaccingCrane (talk) 09:09, 5 May 2024 (UTC)Reply