Wikipedia:Bots/Requests for approval/AnomieBOT 58

The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

AnomieBOT 58

Operator: Anomie ⚔

Time filed: 03:07, Saturday October 22, 2011 (UTC)

Automatic or Manual: Automatic, unsupervised

Programming language(s): Perl

Source code available: User:AnomieBOT/source/tasks/ReplaceExternalLinks4.pm

Function overview: Replace URL redirector links with direct links to the target URL.

Links to relevant discussions (where appropriate): WP:BOTREQ#Google external link changes, Wikipedia talk:External links#Google redirection URLs

Edit period(s): As needed

Estimated number of pages affected: 5000

Exclusion compliant (Y/N): Yes

Already has a bot flag (Y/N): Yes

Function details: The bot will search for external links using Google tracking-and-redirection URLs; most are of the form http://www.google.com/url?..., but there are many other Google domains involved and a few different path patterns. For any found, it will replace the external link with the URL extracted from the link's url or q parameter.

The same may be done for other URL redirector external links, as necessary.
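The extraction step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the bot's actual Perl code: the function name `bypass_redirect`, the short host list, and the choice to check `url` before `q` are all assumptions, and only the `/url` path is handled even though the request mentions several path patterns.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical and deliberately incomplete host list; the real redirector
# is reachable through many more Google domains.
GOOGLE_HOST_SUFFIXES = ("google.com", "google.co.uk", "google.de")


def bypass_redirect(link):
    """Return the target of a Google redirector link, or None if not usable."""
    parts = urlsplit(link)
    host = (parts.hostname or "").lower()
    is_google = any(
        host == suffix or host.endswith("." + suffix)
        for suffix in GOOGLE_HOST_SUFFIXES
    )
    # Only the common /url path is handled in this sketch.
    if not is_google or parts.path != "/url":
        return None
    params = parse_qs(parts.query)  # percent-decodes the values once
    for key in ("url", "q"):  # lookup order is an assumption
        for value in params.get(key, []):
            # Accept only absolute targets; relative ones point back at Google.
            if value.startswith(("http://", "https://")):
                return value
    return None


print(bypass_redirect(
    "http://www.google.com/url?sa=t&url=http%3A%2F%2Fexample.org%2Fpage&ei=abc"
))
```

Links for which the function returns `None` would be left untouched rather than guessed at.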

Discussion

{{BAGAssistanceNeeded}} It's a little soon for the template, but people are complaining. See Wikipedia:Administrators' noticeboard/Incidents#Google blacklisted. Anomie ⚔ 01:54, 26 October 2011 (UTC)

Well, since no one has raised any objections: Approved for trial (7 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. --Chris 02:09, 26 October 2011 (UTC)

FYI, edit summaries for this trial will match the regular expression /Bypassing \d+ Google redirection URL/. Anomie ⚔ 02:16, 26 October 2011 (UTC)
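For anyone reviewing the trial, the pattern quoted above can be used to pick the trial edits out of a list of edit summaries. The following is only a sketch: the helper name `is_trial_edit` and the sample summaries are made up for illustration, and only the regular expression itself comes from the comment above.

```python
import re

# Pattern as given in the discussion; search() is used because the summary
# may contain further text (for example a plural "s") after the match.
SUMMARY_RE = re.compile(r"Bypassing \d+ Google redirection URL")


def is_trial_edit(summary):
    """Return True if an edit summary looks like one from this trial."""
    return SUMMARY_RE.search(summary) is not None


# Hypothetical summaries, for illustration only.
for summary in ("Bypassing 3 Google redirection URLs", "Fixing a typo"):
    print(summary, "->", is_trial_edit(summary))
```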

Comment - I've mentioned this bot request at meta:Talk:Spam blacklist#Google redirect spam where there's an ongoing discussion. So there may be some more input to this discussion. Hope that's OK with everybody. Best regards. - Hydroxonium (T•C•V) 03:00, 26 October 2011 (UTC)
I'm no expert, but as I commented on meta, a bot is a far less BITE-y solution for this than the spam filter, whose obfuscating message appears to the average newbie to simply ban Google. It seems to me that the original URL is unambiguously encoded in the Google one. Uʔ (talk) 05:01, 26 October 2011 (UTC)
If the spam blacklist entry is removed, I would have no problem continuing to run this task indefinitely. Anomie ⚔ 11:22, 26 October 2011 (UTC)
A question-suggestion that has been implicitly raised on meta by someone else: what happens if, after the rewrite, the URL turns out to be one of the spam-blacklisted ones? Would the bot edit simply fail, leaving the original Google-masked URL in place? Perhaps the bot should simply remove the URL entirely in that case (on a 2nd edit attempt), assuming of course that the bot can determine that its edit failed for this specific reason, i.e. the decoded URL is spam-blacklisted. Or perhaps the bot can/does consult the blacklist before attempting its edit? Uʔ (talk) 05:29, 26 October 2011 (UTC)
The edit will fail, and will be logged (see line 229 in tasks/ReplaceExternalLinks4.pm). Bots have no more ability to bypass the spam blacklist than any other user; not even admins can bypass it, except by editing the blacklist or the whitelist.

If this turns into an ongoing task, I'll have the bot automatically report those somewhere. Anomie ⚔ 11:22, 26 October 2011 (UTC)

What Uʔ says was one of my points - could I please have a list of pages where that is the case? (Obviously, some links will have been added in good faith and were simply lucky not to trip the filter, but the redirector may already have been abused, in which case the evader may need to be warned, tagged, or even sanctioned.)

Another point: in the Wikipedia_talk:External_links#Google_redirection_URLs discussion it was noted that this is not only google.com; many other TLDs offer the same redirector. Are you scanning for those? Thanks. --Dirk Beetstra ^{T C} 07:11, 26 October 2011 (UTC)

Actually, even with my crude understanding of programming, I can see from his source code that he does scan for multiple domains and prefix URLs. I don't see any checking for the spam list though. Uʔ (talk) 07:42, 26 October 2011 (UTC)

Hey, Perl code. I'll take a look.

It is difficult to check for blacklisted sites; the edit will probably just fail on save. Actually, a log will remain: the links that are still there after the bot run is finished. --Dirk Beetstra ^{T C} 07:45, 26 October 2011 (UTC)

It issues a warning on failure to save - I hope Anomie logs that somewhere. The other concern seems to be resolved. --Dirk Beetstra ^{T C} 07:48, 26 October 2011 (UTC)

Yes, it is logged. I'll produce a list after the bot finishes with the existing glut of links. And if this turns into an ongoing task, I'll have the bot automatically report these somewhere. Anomie ⚔ 11:22, 26 October 2011 (UTC)

The bot will now start posting any problem pages it finds at User:AnomieBOT/ReplaceExternalLinks4 problems. Anomie ⚔ 18:31, 27 October 2011 (UTC)

Nicely done. ASCIIn2Bme (talk) 09:16, 30 October 2011 (UTC)

You may want to take a closer look at this one. Is the encoded URL simply too long? Actually, there are quite a few of those reported. Maybe you're missing some variation of the encoding method? ASCIIn2Bme (talk) 09:21, 30 October 2011 (UTC)

In that one, the url doesn't begin with "http" or "https". It appears that the URL points to a relative path at Google, which is itself the URL redirector. There seems to be one other with a relative path. The rest marked "Invalid/obfuscated" have no valid q or url at all. I'd hardly call 2 "quite a few". The 14 pages using the redirector for blacklisted URLs, that's quite a few. Anomie ⚔ 15:29, 30 October 2011 (UTC)

You're right. I didn't pay much attention to the details in the other ones. ASCIIn2Bme (talk) 16:14, 30 October 2011 (UTC)

Just curious. Is it still possible to decode that long one if you skip to where the "http" part starts in the "url=" parameter, i.e. from "http%253A%252F%252Fwww.defensereview.com" onwards? If that works, you may be able to implement a trivial change to skip anything in the parameter before http. If it's too uncommon, it might not be worth the trouble though. ASCIIn2Bme (talk) 16:18, 30 October 2011 (UTC)
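The suggestion above (skip ahead to the "http" part of a nested, doubly encoded parameter) might look something like the sketch below. It is an assumption-laden illustration, not something the bot is known to do: the function name `recover_target`, the `max_rounds` limit, and the leading path in the example are invented, and only the "http%253A%252F%252Fwww.defensereview.com" fragment comes from the comment above.

```python
import re
from urllib.parse import unquote


def recover_target(raw_value, max_rounds=3):
    """Percent-decode repeatedly, then keep everything from the first http(s)://."""
    value = raw_value
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:  # nothing left to decode
            break
        value = decoded
    match = re.search(r"https?://", value)
    return value[match.start():] if match else None


# The "/junk/prefix/" part is a made-up stand-in for the relative Google path.
print(recover_target("/junk/prefix/http%253A%252F%252Fwww.defensereview.com"))
```

One caveat with this approach: decoding more than once can also decode characters that were meant to stay percent-encoded in the target URL, so any result would need checking before it replaced a link.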

Also this other one appears to redirect to a Google Books entry. It's probably worth fixing something like that. ASCIIn2Bme (talk) 16:20, 30 October 2011 (UTC)

I'm still hoping sanity will prevail and the blacklist will be updated to catch all of these Google redirect links instead of just the google.com ones. Even if not, though, those two represent only 0.03% of the links the bot has fixed so far. Anomie ⚔ 19:34, 30 October 2011 (UTC)

I just removed 10 fixed problems from the errors page. The remainder are mostly user talk or AfC/XfD; should we treat those differently? Also, there is a typo in the code: "Invalid/obfuscated Goole redirect". — Train2104 (talk • contribs • count) 21:32, 30 October 2011 (UTC)

Hah, yeah, typo fixed. Anomie ⚔ 23:59, 30 October 2011 (UTC)

I don't know the actual situation, but are other redirect/shortener services already blacklisted? And if not, can you add this as a new feature? mabdul 14:50, 31 October 2011 (UTC)

Most are already blacklisted, and any new redirector discovered is blacklisted on sight. This Google redirector is proving somewhat problematic to blacklist both because some editors don't understand that the blacklist is not for all of Google but just for the redirector and because Google uses at least 70 different domains and several different path patterns within the domains.

If any new redirectors are discovered, I have worded this request such that AnomieBOT could bypass those as well. Anomie ⚔ 15:36, 31 October 2011 (UTC)

I don't think they should be blacklisted; that's a bit WP:BITEy. However, the bot should run, maybe issuing user notices. Also, there is one AfD where a user is reverting the changes. What should we do about closed discussions/archived talk pages? — Train2104 (talk • contribs • count) 17:54, 31 October 2011 (UTC)

OTOH, that's what the blacklist is for: preventing addition of links that should never be used. The AFD issue should be taken care of now too, as I have removed the entry from MediaWiki:Spam-whitelist that was allowing that user to revert (it was marked "Until AnomieBOT 58 removes all the links", which has now been done).

I see no particular reason for avoiding talk page archives or closed discussions either; unless the discussion was specifically about the blacklisted redirector, bypassing the redirector is not going to be changing anything that makes any difference. Anomie ⚔ 18:54, 31 October 2011 (UTC)

Trial complete. The bot made 4568 edits to bypass 6720 redirects. There was one bug reported during the trial involving the URL parser, which was fixed and the affected articles corrected. Anomie ⚔ 10:47, 2 November 2011 (UTC)

Impressive. Rjwilmsi 12:09, 2 November 2011 (UTC)

Approved. Looks good. --Chris 12:39, 2 November 2011 (UTC)
