linkchecker

Hi !

A little bug in your tool :

In Humanitarian response to the 2004 Indian Ocean earthquake, there is the source text

[[Sweden]] || [[Swedish krona|SEK]] 500M (USD 72.2M)<ref>http://www.regeringen.se/sb/d/4823/a/36245</ref> || || [[Swedish krona|SEK]] 1100M (USD 159M)<ref>[http://www.frii.se/index3.shtml Frivilligorganisationernas Insamlingsråd - Aktuellt<!-- Bot generated title -->]</ref> || || 177.2 || 0.5,

and it is converted into

[[Sweden]] || [[Swedish krona|SEK]] 500M (USD 72.2M)<ref>{{cite web |url=http://www.regeringen.se/sb/d/4823/a/36245 |title=]</ref> || || [[Swedish krona|SEK |archiveurl=http://web.archive.org/web/20050101025709/http://www.regeringen.se/sb/d/4823/a/36245 |archivedate=2005-01-01}}] 1100M (USD 159M)<ref>[http://www.frii.se/index3.shtml Frivilligorganisationernas Insamlingsråd - Aktuellt<!-- Bot generated title -->]</ref> || || 177.2 || 0.5

:)

NicDumZ ~ 19:47, 15 February 2008 (UTC)

It's been fixed. It was a problem with me not wanting to capture the space between the URL and title on normal links. — Dispenser 06:21, 16 February 2008 (UTC)

On another note, this is a bug too. The Evil Spartan (talk) 07:13, 17 February 2008 (UTC)

Sure it's not a feature? :) I mean, it looks like a more useful link name than was before ...
... or are you confusing DumZiBoT and linkchecker? (The diff is from DumZiBoT, but this section is about a linkchecker bug, so I almost have to ask.) — the Sidhekin (talk) 07:31, 17 February 2008 (UTC)

Bot droppings

Since other human editors and all other bots (that I've seen) do not label their individual contributions, I think it would be a good idea if the <!--Bot generated title--> comment was not inserted in every single place a reference was converted in the article. The edit summary is the place to disclose the bot has been at work, and anything that increases the amount of dead text we have to wade through while editing is not moving in the right direction. I recognize there may be utility in tagging these changes so that a human being can make sure the change is sensible; from other comments I've read here, that may be a big concern. (Some templates for citations are now amazingly long and complex when you see them in the editor, and it's easy to cut off a tag at the end and scramble half an article. That's the same sort of problem.) --Wtshymanski (talk) 15:47, 19 February 2008 (UTC)

This is fairly standard practice for bots. It is not a signature but rather a comment stating what the content is. This is quite often done with other things: with the introduction of the cite.php system, WP:DOC adds comments about where to put the interwikis, and we have other bots which put images that have been deleted into comments. If you wish to change standard practice I'd recommend that you bring it up at the Bot owners' noticeboard. — Dispenser 20:34, 21 February 2008 (UTC)


Non English sites

The first edit your bot made in this diff [1] is total garbage. I think this is because the bot is reading something in Chinese, or similar. Although the reference is spam anyway (thanks for bringing it to my attention) and I am going to delete it, it does show up a flaw in your bot if the site is not in English. SpinningSpark 20:12, 21 February 2008 (UTC)

The bot correctly handles text in other scripts, but only when the encoding has actually been specified. According to the documentation of UnicodeDammit, if the page doesn't specify the encoding, it should use a statistical method to determine it. This doesn't always work and results in the garbage that you see. — Dispenser 20:34, 21 February 2008 (UTC)
If the bot is not guaranteed to get it right then it should have human review. I thought it was a principle of bots that the operator was responsible for the actions of a bot, not the community at large to correct its mistakes. SpinningSpark 08:48, 22 February 2008 (UTC)
You are right: During the approval process, bot owners are required to prove that their bots get it right most of the time. Antivandalism bots sometimes get it wrong, and so does DumZiBoT !
What happened is simple, and pretty rare : the erroneous link does not give any encoding in its code, hence you can't know what set of characters should be printed. The fact is that no automated tool can be 100% sure that the character set it prints is meaningful (my Firefox is not even able to properly show the title or the page !) : only a human can, and that's why every tool is required, per international standards, to tell users what encoding it is using. Without that information, the behavior of automated tools is unpredictable. DumZiBoT actually is a bit more clever than *usual tools*, because a lot of pages do not give any encoding : when no encoding is specified, it tries to guess which encoding it is. And I have to say that, most of the time, it works (I can give you examples ;) ! ). The fact is, when a non-occidental charset is used, DumZiBoT is sometimes mistaken, true. I'm sorry, but keep in mind that these borderline cases are very rare..! NicDumZ ~ 13:44, 22 February 2008 (UTC)
This is what your bot wrote: ªF½åÅ]³N¤è¶ô¸ê°Tºô. I don't see how that could be mistaken for any western language even with the most simple of tests. SpinningSpark 15:56, 22 February 2008 (UTC)
It is not only about western languages : Some linked pages are in Chinese, Vietnamese, or Japanese, and I don't want to ignore them.
As you can see, all the characters are *valid* : They could be found in, for example, a French title. Only the combination of the characters is garbage, and analysing the semantics of natural languages is very hard, especially when you don't know what the original language was...
But if you have in mind some simple tests, go ahead, the source is online : User:DumZiBoT/reflinks.py :)
NicDumZ ~ 16:02, 22 February 2008 (UTC)
I'm not going to mess with the code. Even if I had the skill, it is not my responsibility. I can think of any number of simple tests which would have rejected that garbage. How about six or more characters in a row that are not simple alphanumerics without diacritics? I have not seen a word in any language which uses diacritics on six successive letters, not even Polish. Anyhow, this particular example would have failed the test even if diacritics were allowed. SpinningSpark 16:36, 22 February 2008 (UTC)

(unindent) You are probably right : With some tweaks, this test would work when applied to occidental alphabets. However, what about "東賢魔術方塊資訊網" ? This title is meaningful (it is the title of the same link, when using the Chinese Big5 encoding) but is not made of alphanumeric characters : Your test would reject it, but actually, it's a valid title ! Also, on the other hand, there are many ways to produce garbage using Chinese or Japanese characters, and such a title could easily result from an erroneous unicode conversion by DumZiBoT, so a test "six or more characters in a row that are not alphanumeric and that are not Japanese, Chinese, or Vietnamese characters" would also raise false positives (i.e. would detect a bad title when it's valid) or false negatives (i.e. detecting a valid title when it's not). Trust me, it's not this easy. NicDumZ ~ 17:07, 22 February 2008 (UTC)
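To make the disagreement concrete, here is a minimal sketch of the "six or more non-alphanumeric characters in a row" test proposed above. The function name and the exact character class are assumptions for illustration only and are not taken from reflinks.py. Under this reading, the test does reject the mojibake title, but it also rejects the correctly decoded Big5 title.

```python
import re

# Hypothetical reading of the proposed test: a run of six or more characters
# that are neither ASCII letters, ASCII digits, nor whitespace.
NON_ALNUM_RUN = re.compile(r"[^A-Za-z0-9\s]{6,}")


def looks_like_garbage(title):
    """Return True if the title contains a long run of non-alphanumerics."""
    return NON_ALNUM_RUN.search(title) is not None


garbled = "ªF½åÅ]³N¤è¶ô¸ê°Tºô"      # what the bot wrote
valid_chinese = "東賢魔術方塊資訊網"    # the same bytes decoded as Big5
western = "Frivilligorganisationernas Insamlingsråd - Aktuellt"

flags = {
    "garbled": looks_like_garbage(garbled),
    "valid_chinese": looks_like_garbage(valid_chinese),
    "western": looks_like_garbage(western),
}
```

The garbled title and the valid Chinese title are both flagged, which is the false positive described in the comment above.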

I cannot understand your reasoning here. If the bot had been capable of recognising the Chinese encoding there would have been no problem (putting aside the issue that my browser will not display it). The fact is, it did not recognise the encoding. In those circumstances, doing nothing is safer than what it did do, presumably assume a western encoding. SpinningSpark 17:30, 22 February 2008 (UTC)
"I cannot understand your reasoning here. If the bot had been capable of recognising the chinese encoding there would have been no problem" You're wrong here. What I'm trying to say is : There is no way, at the end of a decoding process, to know if the text is meaningful or not. In other words, there is no difference for DumZiBoT whether it decodes a byte sequence into "ªF½åÅ]³N¤è¶ô¸ê°Tºô" or into "東賢魔術方塊資訊網" ! Your statement, "If the bot had been capable of recognising the chinese encoding there would have been no problem", assumes that there is a way to know if the tried encoding is correct or not, but such a test does not exist ! Sure, there is a way to know that some encodings cannot decode some byte sequences. For example, UTF-8 does not work here : The result is "�F���]�N����T��", where the question marks show that the byte sequence is not valid UTF-8. But there are a lot of charsets that actually decode this byte sequence, and there is no way to know which one is meaningful. The first encoding is windows-1252, the second is Big5. DumZiBoT could have tried ISO 8859-5 ("ЊFНхХ]ГNЄшЖєИъАTКє"), TCVN ("êFẵồỀ]́NÔốảụáờ̀TẲụ"), ISO 8859-10 ("ŠF―åÅ]ģNĪčķôļę°Tšô"), IBM 864 ("ﺕFﺵﻣﻊ]٣N¤ﻭ٦ﻬ٨ﻳ٠Tﻑﻬ"), HZ ("狥藉臸砃よ遏戈癟呼"), and so on, and so on...
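The point above can be reproduced with Python's standard codecs. This is only an illustrative sketch: it rebuilds the raw bytes from the garbled title quoted in this thread, and the helper name `try_decode` is made up for the example. Strict UTF-8 rejects the bytes, while several other codecs accept them without complaint, so a successful decode says nothing about whether the result is meaningful.

```python
MOJIBAKE = "ªF½åÅ]³N¤è¶ô¸ê°Tºô"   # the title as the bot wrote it
raw = MOJIBAKE.encode("cp1252")    # recover the original byte sequence


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for the codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Codec names are Python's aliases for UTF-8, windows-1252, Big5,
# ISO 8859-5 and KOI8-R.
candidates = {
    enc: try_decode(raw, enc)
    for enc in ("utf-8", "cp1252", "big5", "iso8859_5", "koi8_r")
}

for enc, text in candidates.items():
    print(enc, "decodes" if text is not None else "is rejected")
```

Only the UTF-8 attempt is rejected; every other candidate yields some string, and nothing in the decoding step itself distinguishes the Big5 result from the rest.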
My implementation choice was to fall back to windows-1252 when no encodings are found because it is a flexible representation of occidental alphabets : Since most of the links are occidental (We're on an English-speaking encyclopedia !!), using a default occidental charset is meaningful.. However, before using it :
  1. I try fetching an encoding from the HTTP header
  2. I try fetching an encoding from the HTML source
  3. I try, for domain names that ought to use an exotic charset ( .ru, .zh, .jp, etc...), to use their national charsets
Now, for the very, very small number of links that A) are not compatible with windows-1252, B) do not follow international standards, and C) do not use their national charsets, I don't care. I'm not going to change the behavior of my bot for less than 0,1 % (0,01% ?) of the links. That makes only 60 (6 ?) wrong edits over 60K+ : Again, I'm sorry for the inconvenience, but it's definitely a won't fix.
NicDumZ ~ 14:35, 23 February 2008 (UTC)
Keep in mind that bots are tolerated on Wikipedia, not pandered to. It's very important that bots do not become a nuisance. Converting bare references is not a task that is so important that it's worth frustrating users for. I've read all you've written in this thread and I have to say that I can't tell you how to improve this bot, but please try to be friendlier with people who point out problems rather than telling them you don't care and that you will continue without making any further effort to fix the problem. When trying to think of a solution, keep in mind that it would be better for the bot to give up on 33% of references than for it to aim for 100% and end up sometimes adding garbage (or what looks like vandalism in my Latin-tag case below). --Gronky (talk) 20:50, 4 March 2008 (UTC)
Number 3 is a really smart way to deal with things. In the case at hand, the URL is a .com.tw address. Is .tw currently in the list for "exotic" encodings? It would probably have produced the correct result then. Martijn Hoekstra (talk) 21:09, 4 March 2008 (UTC)
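The three-step lookup described above, followed by the windows-1252 fallback, can be sketched roughly as follows. This is not the code from reflinks.py: the function name, the regular expressions and the contents of `TLD_DEFAULTS` (including the .tw entry asked about above) are assumptions made for the illustration.

```python
import re

# Hypothetical per-domain defaults; the bot's real table may differ.
TLD_DEFAULTS = {".ru": "koi8_r", ".jp": "shift_jis", ".tw": "big5"}
FALLBACK = "cp1252"  # windows-1252

HEADER_CHARSET = re.compile(r"charset\s*=\s*\"?([\w-]+)", re.IGNORECASE)
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)


def pick_encoding(content_type, html, hostname):
    """Choose an encoding name for a fetched page.

    content_type: value of the HTTP Content-Type header (may be None)
    html: the raw page bytes
    hostname: the host part of the URL
    """
    # 1. Encoding announced in the HTTP header.
    match = HEADER_CHARSET.search(content_type or "")
    if match:
        return match.group(1).lower()
    # 2. Encoding announced in a <meta> tag near the top of the HTML source.
    match = META_CHARSET.search(html[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. National default for domains that tend to use a non-Western charset.
    host = hostname.lower()
    for tld, encoding in TLD_DEFAULTS.items():
        if host.endswith(tld):
            return encoding
    # 4. Nothing found: fall back to a Western default.
    return FALLBACK
```

With such a table, a page on a .com.tw host that announces no encoding would be decoded as Big5 rather than windows-1252; whether that is right for a given page is still a guess.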

Report page

It might be a nice idea to have a special page where users can report mistakes the bot has made, something along the lines of what User:ClueBot does (see ClueBot's edit history). Martijn Hoekstra (talk) 21:19, 5 March 2008 (UTC)

Bare references change breaks comments

FYI: This change demonstrates that fixing a bare reference that is commented out will produce undesired results. Cburnett (talk) 06:10, 9 March 2008 (UTC)

Raúl Fernández

Your bot keeps adding a link to fr:Raúl Fernández, who is a completely different person than English Wikipedia's Raúl Fernández. Is there any way to stop this? Cheers, CP 07:46, 18 February 2008 (UTC)

This is just one of several bots doing this. The root of the problem is that the French and the Italian wikipedia both claim these are the same. Presumably one editor (mistakenly) did this, and the bots are now copying it. I've now removed the inter-wiki links from and to the French article, so the bots should have nothing left to copy. Unless they have a cache or something ... :) No, I think this should stop it. :) — the Sidhekin (talk) 07:56, 18 February 2008 (UTC)
Well, thanks for monitoring my talkpage, and fixing these little problems. I've been away these days, and it's a pleasure to have all the problems fixed when back :) NicDumZ ~ 08:16, 18 February 2008 (UTC)
Great! Thanks a lot! Cheers, CP 17:47, 18 February 2008 (UTC)


Hi, not sure how to get in touch with you, but you made some changes to a page I look after and changed some links and misdescribed them. I know no one 'owns' the wikipedia, but please only change things you know about. The entry was for British Baseball Federation. I've changed them back, but it's time I'd prefer not having to waste doing this. Many thanks. John (PS. I'm not overly familiar with this site, but couldn't see any other way to post you a note, so attached it to this one. Maybe the site should be more user-friendly!!, but that's for the wikipedia management I guess). —Preceding unsigned comment added by John Walmsley (talk · contribs) 12:48, 23 March 2008 (UTC)

the bot added some garbage

In this edit: [2] the bot wrongly added a tag to a reference to wrongly say the reference was in Latin. Can you look into fixing this? Thanks. --Gronky (talk) 20:36, 4 March 2008 (UTC)

What really needs fixing is that web server: "Content-Language: lang" indeed! :-P I don't blame the bot for thinking this is Latin!
Still, I guess it would be better if the bot were to ignore it as an invalid language code. Can't expect all the world's misconfigured web servers to shape up, now can we? :) — the Sidhekin (talk) 21:12, 4 March 2008 (UTC)
Yup, Sidhekin got it :)
DumZiBoT only takes the first two letters of the language code, that's why it ended up using Latin.
I'm trying to work out whether your suggestion would work, because there are strange values sometimes. For English, you can find en, en-US, en-UK, en_UK, en US, english, or... whatever, because I can't think of any tools actually using the Content-Language tag.
NicDumZ ~ 14:15, 5 March 2008 (UTC)
Good point, but not necessarily decisive.
I don't think underscores and spaces are permitted, but at least underscores are so common that it may be well worth considering merely the first sequence of alphanumerics, whatever the separator. (The first sequence should always be the language code, except for the grandfathered forms, of which there's a limited number, few of which are very interesting.) What's trickier is english, but on the other hand, on this wikipedia at least, we could well ignore it as invalid as we don't tag English language sources anyways. :) (Alternatively, you could make it an exception, I suppose.)
I think the language subtag registry gives (among other things) all current language codes as well as all grandfathered forms. "Content-Language: i-klingon", anyone? :) — the Sidhekin (talk) 15:23, 5 March 2008 (UTC)
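A minimal sketch of the approach suggested above: take the first run of alphanumerics from the Content-Language value, whatever the separator, and accept it only if it is a known language code. The function name is invented for the example, and `KNOWN_CODES` is a tiny illustrative subset, not the full language subtag registry.

```python
import re

# Illustrative subset only; a real check would load the full registry.
KNOWN_CODES = {"de", "en", "es", "fr", "it", "ja", "la", "ru", "sv", "zh"}

FIRST_SUBTAG = re.compile(r"\s*([A-Za-z0-9]+)")


def primary_language(header_value):
    """Return the primary language code of a Content-Language value, or None."""
    match = FIRST_SUBTAG.match(header_value or "")
    if not match:
        return None
    code = match.group(1).lower()
    # Reject anything that is not a registered code, e.g. "lang" or "english".
    return code if code in KNOWN_CODES else None


results = {value: primary_language(value)
           for value in ("lang", "en", "en-US", "en_UK", "en US", "english")}
```

Under these assumptions "lang" is ignored instead of being read as Latin, and the en-US, en_UK and "en US" variants all reduce to en. The value "english" is dropped as invalid, which matches the suggestion that English sources need no tag here anyway.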
Whether the webserver is misconfigured or not isn't the question. I'm highlighting another fail case for this bot. Hopefully some constructive thinking can be done with the information in these failure reports. --Gronky (talk) 15:47, 5 March 2008 (UTC)
DumZiBoT failed to consider the proper titles for these two bare edits made on 6 February 2008. Neither of them has any language hints for the user agent, but one of them has a .co.jp domain. Not that it matters much, as the content of the <title> element isn't consistent with the actual title and still needs manual correction. --Kakurady (talk) 23:52, 22 March 2008 (UTC)

Request

Hey Nic, would you be able to run your DumZiBoT through the Öser page? Thanks. Khoikhoi 03:55, 23 March 2008 (UTC)

Dispenser just did it ;) NicDumZ ~ 11:10, 24 March 2008 (UTC)

The da Vinci Barnstar

  The da Vinci Barnstar
You are awarded this barnstar for enhancing Wikipedia by programming DumZiBoT, a reliable robot that has both greatly improved Reference lists and increased the productivity of Wikipedians. EconomistBR (talk) 21:32, 24 March 2008 (UTC)
Wooha !
Thanks, I appreciate it ;)
NicDumZ ~ 21:53, 24 March 2008 (UTC)

Chile

Please go through the Chile article again. I was forced to revert your edits as they were made on top of a vandalized version. Thank you. ☆ CieloEstrellado 02:04, 25 March 2008 (UTC)

Your edit got reverted.
Anyway, please consider using http://tools.wikimedia.de/~dispenser/view/Pywikipedia, it's great !
NicDumZ ~ 15:52, 25 March 2008 (UTC)

Question

Hello. I have a question if DumZiBoT can make one fairly simple task, it is quite similar to one he is already fixing so nicely. For details look here. Thank you very much. - Darwinek (talk) 15:39, 25 March 2008 (UTC)

Yes. Adding the refs tag is fairly easy : It's a common script from pywikipedia.
I have not even considered running it, for it is really easy to handle, and I thought that some other bot would already be doing it.
I'm afraid that running my bot on this would require another BRFA, tho.
NicDumZ ~ 15:45, 25 March 2008 (UTC)
Maybe you could create DumZiBoT2 or something like that, it would require another BRFA, too but it is not a matter of time. The work just should be done sooner or later. What do you think? - Darwinek (talk) 15:54, 25 March 2008 (UTC)
It might be better implemented into AWB (see WP:AWB/FR) as it does many other similar types of edits. Revision which implement <references/> appending. — Dispenser 23:07, 25 March 2008 (UTC)
Is there any option how to request this feature to be added to next version of AWB? - Darwinek (talk) 23:57, 25 March 2008 (UTC)

Recognition for a job well done a job well done another job well done

  The Working Man's Barnstar
Every time I check my Watchlist, you and DumZiBoT have improved another article. Your continued work deserves recognition! TheRedPenOfDoom (talk) 03:20, 26 March 2008 (UTC)

Thanks! Great idea for a bot. — Omegatron 04:00, 26 March 2008 (UTC)

Great Bot. Yaki-gaijin (talk) 06:00, 26 March 2008 (UTC)

Thank you... ! :)
Really !
NicDumZ ~

Minor Edits

Is there a way you could make the bot not mark its edits as minor when it modifies a certain number of bytes? It's a relatively minor thing but it came to my attention with this edit where DumZibot added over 2000 bytes but marked the edit as minor. Great bot, by the way! The Dominator (talk) 18:28, 25 March 2008 (UTC)

I could do that easily, but... what for ? xD
NicDumZ ~ 20:06, 25 March 2008 (UTC)
lol, good point, still it does help a little, for example when you're calculating the average number of minor edits or the percentage of edits that are minor, bots marking major edits as minor sort of alter the statistic. The Dominator (talk) 20:19, 25 March 2008 (UTC)
Ah. Would you get offended if I answered you "altering the statistic does not really justify the count of how many bytes are modified for each article" ? :D
Besides, the minor/normal edit limit would be very arbitrary, am I right ?
NicDumZ ~ 20:28, 25 March 2008 (UTC)

Grumble Link to Seroquel website

Hi. DumZiBoT recently generated a title for the link to the Seroquel home page on the quetiapine Wikipedia page. This was the change: diff. Prior to the change the link looked like this http://www.seroquel.com/. It was easy to tell the link was to a corporate web page. Now it looks like this Home. The revised link is not so clear. Perhaps you might modify the DumZiBoT parsing algorithm so that it generates a list of untitled links of the form “http://www.token.com/”. Some of these links may then be changed to the form “Token home page”, or the link may be left unchanged, or the old substitution algorithm might be used depending on the characteristics of the destination url. I have modified the link to look like this: Seroquel website. Regards. KBlott (talk) 20:39, 25 March 2008 (UTC)

I understand your concern. On de:, actually, they considered that problem as so important that it partly was the reason for asking me to stop DumZiBoT.
But actually, I disagree. Surely, and quite sadly, some webmasters don't really get the point of using a descriptive title for their pages. But using "Domain website" as a title really loses some information :
  1. There are more pages that have useful page titles than pages with undescriptive titles, and defaulting to a standard title for a minority of sites is not the way out
  2. Some domain names are actually pretty much uninformative : I'd better use a vague page title than an even more vague domain name.
I understand your concern, but I already thought a lot about that, and as of now I really think that the way DumZiBoT does is quite efficient, looking at how many websites use descriptive titles, and how many don't.
NicDumZ ~ 10:54, 26 March 2008 (UTC)
Actually, I agree with your proposition that page titles generally contain more information than domain names. On the other hand, it is easy to find exceptions to this rule. The problem is confounded by the fact that “information content” is essentially subjective. Ultimately any string in a grammar is meaningless, except in relation to the semantics that we as organics (or bots) may happen to assign to it. I think this problem is inherent to all sufficiently large parsing/editing tasks. I agree that web page designers often forget to give their web pages useful titles. In any case, the Seroquel link is now titled, so DumZiBoT will probably leave it alone from now on. Regards. KBlott (talk) 15:53, 26 March 2008 (UTC)

Sinking_of_Prince_of_Wales_and_Repulse

Hi, Would you run the bot through Sinking_of_Prince_of_Wales_and_Repulse again please, It picked up that two off site referenced pages had poor page titles. I have now corrected these off site pages and if the bot could be run again it would reflect this on the above page references list. Apart from that, looks very interesting indeed. Nice one ;-) --Andy Wade (talk) 19:00, 26 March 2008 (UTC)

I'm glad my bot helped you improving your page titles.
However, as you can read in the [[User:DumZiBoT/refLinks|FAQ]], DumZiBoT only modifies untitled links, so it won't let me update the titles ;) (Or not with that current script) But you can easily =]
Cheers,
NicDumZ ~ 22:58, 26 March 2008 (UTC)
The pages in question are actually part of a frames site so normally they wouldn't show the page title.
But they were still untidy so I'm glad they're sorted out now. Cheers. --Andy Wade (talk) 23:39, 26 March 2008 (UTC)

Bad Title?

As seen in this diff you might want to add titles consisting solely of "test" to your title blacklist. Spiesr (talk) 20:18, 26 March 2008 (UTC)

Yes, you are right, actually. I added "test" as a blacklisted title, and DumZiBoT now runs with that updated list.
Thanks a lot for the report !
NicDumZ ~ 22:51, 26 March 2008 (UTC)
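For readers wondering what such a check might look like, here is a minimal sketch. The names are hypothetical and the set contains only the one entry mentioned in this thread; the bot's real blacklist is not reproduced here.

```python
# Only "test" comes from this discussion; the real list is not shown here.
TITLE_BLACKLIST = {"test"}


def is_blacklisted(title):
    """Return True if the whole title, ignoring case and padding, is blacklisted."""
    return title.strip().lower() in TITLE_BLACKLIST
```

Because the comparison is against the whole title, a page called "Test results 2008" would still be accepted.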

DumZiBoT's edit to Template:Infobox Planet

Your bot's edit to Template:Infobox Planet, [3], broke every page where that part of the template is used, by adding a references block to the end of the template. It might be an idea to restrict your bot to the main article space... Thanks. Mike Peel (talk) 20:39, 26 March 2008 (UTC)

Woops.
I'm truly sorry for that. I brought the bot to a stop, changed the references block code to add it only if it's in the main article namespace.
It still, however, adds titles to the references in every namespace, because I don't see any problem with that (Or please tell me if you can think of any)
Thanks for the kind report.
NicDumZ ~ 22:46, 26 March 2008 (UTC)
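The namespace restriction described above could look roughly like this. It is a sketch, not the bot's code: the function name and the simple substring checks are assumptions. The only fixed fact used is that MediaWiki numbers the main article namespace 0 and the Template namespace 10.

```python
MAIN_NAMESPACE = 0       # MediaWiki's article namespace
TEMPLATE_NAMESPACE = 10  # e.g. Template:Infobox Planet


def may_append_references(namespace, wikitext):
    """Decide whether a references block may be appended to this page."""
    if namespace != MAIN_NAMESPACE:
        return False
    lowered = wikitext.lower()
    has_refs = "<ref" in lowered
    has_block = "<references" in lowered or "{{reflist" in lowered
    return has_refs and not has_block
```

Titling the references themselves is a separate step and is not gated by this check.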

Good title

Sorry but nice name for your bot. I am not exactly sure why it labelled a particular report as Microsoft Word. How exactly is the bot meant to work? Shouldn't it read the title that is inside the document? here Simply south (talk) 21:32, 26 March 2008 (UTC)

Well, what is the title that is inside the document?
<dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Microsoft Word - LIP Chapter 3 - Boro Policy Statement _2_.doc</rdf:li></rdf:Alt></dc:title>
... I cannot quite fault the bot here either. :) — the Sidhekin (talk) 21:41, 26 March 2008 (UTC)
For one thing, it is not a .doc but a .pdf (and where did that code come from, maybe i am in way over my head somehow) here Simply south (talk) 21:54, 26 March 2008 (UTC)
That code came straight out of the PDF file. Just read it in a text editor instead of a PDF viewer, and you'll find it. At least parts of it are human-readable. :) (I'm betting the PDF file has been created from a .doc file, and no one thought to give it a proper title.) — the Sidhekin (talk) 22:11, 26 March 2008 (UTC)
... or, if you have the right tools:
sidhekin@blackbox[23:13:02]~$ pdfinfo lip_chapter_3_-_boro_policy_statement_.pdf 
Title:          Microsoft Word - LIP Chapter 3 - Boro Policy Statement _2_.doc
Author:         
Creator:        PScript5.dll Version 5.2
Producer:       Acrobat Distiller 6.0 (Windows)
CreationDate:   Tue Oct  3 13:23:58 2006
ModDate:        Tue Oct  3 13:23:58 2006
Tagged:         no
Pages:          22
Encrypted:      no
Page size:      595 x 842 pts (A4)
File size:      189739 bytes
Optimized:      yes
PDF version:    1.4
sidhekin@blackbox[23:13:20]~$ 
(It may even be your PDF viewer displays the title; check the window title bar or document info dialog?) — the Sidhekin (talk) 22:16, 26 March 2008 (UTC)
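As a small aside, the key/value output of pdfinfo shown above is easy to read from a script. The sketch below parses already captured output; the function name is made up, and the sample string is an excerpt of the listing above.

```python
def parse_pdfinfo(output):
    """Turn 'Key:   value' lines from pdfinfo into a dictionary."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


SAMPLE = """Title:          Microsoft Word - LIP Chapter 3 - Boro Policy Statement _2_.doc
Author:
Creator:        PScript5.dll Version 5.2
Pages:          22
"""

info = parse_pdfinfo(SAMPLE)
```

Here `info["Title"]` holds the same "Microsoft Word - ..." string that ended up as the link title.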
Thanks a lot, Sidhekin, for your help here, you got it perfectly right ;) NicDumZ ~ 22:55, 26 March 2008 (UTC)
Ah ?! How is "DumZiBoT" a nice name ? Lacking inspiration, I just used the last part of my username, which derives from my real name. Have I unwittingly made it sound... special ? :) [I'm not a native English speaker and sometimes just miss these little things] NicDumZ ~ 22:55, 26 March 2008 (UTC)
Erm... sorry.. unfortunately "dum" (with a b at the end) means stupid. I was not meaning anything bad. It is just a funny and rather ironic name. It is doing good work so... never mind.
Computer language is complete gibberish but i have found the part you are referring to. Simply south (talk) 23:04, 26 March 2008 (UTC)
Ah xD I knew that but didn't quite make the link. Don't worry, I really don't feel offended by that coincidence ;) NicDumZ ~ 23:09, 26 March 2008 (UTC)

Thanks for DumZbot!

The bot is doing a great job. Having said this, once it gets through Wikipedia, it would be nice if it ran "periodically," and not every night maybe? My watchlist gets flooded with entries which are all improvements, I'll say that. And yes, we should have done it right the first time and this wouldn't be happening.

Would once a month be too infrequent once the whole encyclopedia is processed? Again, a brilliant job whoever thought of this. Has been needed for a long time! Student7 (talk) 23:16, 26 March 2008 (UTC)

Thanks for your Thanks ! :)
You may imagine that searching for the pages that contain bad references is a heavy process. (If not, I'm telling you ! :] ) Since only a few pages, compared to 2 million articles, need fixing, DumZiBoT is not retrieving all the pages from the servers just to alter the ones that need alteration. I use XML dumps for my work.
And actually, the en: database is only dumped every two months. DumZiBoT is working on the dump of March 15th, but the last available dump (and last run) is from January 9th.
Yes, as of now, DumZiBoT is only fixing the bare references inserted during that two month timespan : It takes quite a long time :) (DumZiBoT has been continuously running for nearly 80 hours now and I don't really know when it'll stop !)
Now, I'm not sure that I could improve that behavior. Knowing the dumps issue (I only get to know the list of articles needing modification every two months or so) what would you suggest me to do ?
Thanks,
NicDumZ ~ 23:53, 26 March 2008 (UTC)
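To give an idea of what "pages that need fixing" means when scanning dump text, here is a rough sketch of a pattern for bare references, i.e. ref tags that contain only a URL and no title. The regular expression is an assumption written for this illustration and is not the one used by reflinks.py. The sample text reuses the two references quoted in the linkchecker section at the top of this page.

```python
import re

# A <ref> whose whole content is a URL, optionally wrapped in [ ].
BARE_REF = re.compile(
    r"<ref[^>]*>\s*\[?\s*(https?://[^\s\]<|]+)\s*\]?\s*</ref>",
    re.IGNORECASE,
)


def find_bare_references(wikitext):
    """Return the URLs of all untitled references in a page's wikitext."""
    return BARE_REF.findall(wikitext)


SAMPLE = (
    "SEK 500M<ref>http://www.regeringen.se/sb/d/4823/a/36245</ref> "
    "SEK 1100M<ref>[http://www.frii.se/index3.shtml "
    "Frivilligorganisationernas Insamlingsråd - Aktuellt]</ref>"
)

urls = find_bare_references(SAMPLE)
```

Only the first reference is reported, since the second one already has a title. A page would be queued for editing only if this list is non-empty.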
I think you are telling me that, in the future, you will essentially run this "every two months." I appreciate that it has run for several days on January's list and will continue many more days until through. I vote for that! I assume that future runs might be shorter. But either way, assuming I'm looking at an "average sample" I won't see more updates than I'm looking at now. And maybe a lot less once it is through. Yes, it needs to keep going. A real plus for references. Editors are encouraged to update the quality of some references now that they have names. It should help a lot. Student7 (talk) 00:49, 27 March 2008 (UTC)

Japanese characters

In this edit the bot generated a weird title for the ref. I'm guessing it is because the page is in Japanese? Just thought I would let you know. -- Ned Scott 06:54, 27 March 2008 (UTC)

Thanks !
It was not because the title was in Japanese, it was because no encoding is actually specified in the HTML source : It is not standards-compliant, and there is no precise way to tell which encoding it is (My Firefox fails to output the page correctly)
However, I improved a bit the rules on .jp domain names, and it's better now.
Thanks again ;)
NicDumZ ~ 07:16, 27 March 2008 (UTC)

DumZiBoT

Hallelujah! It's about time someone with the know-how stepped up to the plate. Next step: Create a version that works in real time after each edit. If not in real time, then "on-demand" for a particular article. The other day I corrected dozens of such links in 2008 NCAA Men's Division I Basketball Tournament. If I could have "submitted" that article to your bot for processing it would've saved me a boatload of time. davidwr/(talk)/(contribs)/(e-mail) 20:35, 24 March 2008 (UTC)

Dispenser ported my tool to http://tools.wikimedia.de/~dispenser/view/Pywikipedia !
Enjoy !
NicDumZ ~ 20:40, 24 March 2008 (UTC)

Awesome bot! Thanks for this. Tb (talk) 21:28, 24 March 2008 (UTC)

Thanks from me too. In the past I've done a lot of grunt work wrapping external links in ref tags, good to see someone following along and furthering that. :) Bryan Derksen (talk) 18:22, 27 March 2008 (UTC)

Request running of bot?

Is there any way we can request running this bot on a page? Like maybe putting a tag at the top or something?--Paul McDonald (talk) 12:55, 27 March 2008 (UTC)

See http://tools.wikimedia.de/~dispenser/view/Pywikipedia which is an online version of the bot. — Dispenser 14:13, 27 March 2008 (UTC)

Great Page!

Really nice job on your page. I like it, so keep up the good work! =) --Cher <3 (talk) 00:24, 28 March 2008 (UTC)

User:DumZiBoT ? :) NicDumZ ~ 07:45, 28 March 2008 (UTC)

conversion from bare ref, causes the reference to be listed multiple times

I see you have edited a couple of references in the article Dada Kondke. These changes have led to the references section below being populated with the same title multiple times. Is there a way to fix this? I appreciate the conversion process, but if the problem persists then it's annoying :) --Kedar (talk) 06:53, 28 March 2008 (UTC)

Well, just don't use the same reference three times: name it :)
See Wikipedia:Footnotes#Naming a ref tag so it can be used more than once
NicDumZ ~ 07:44, 28 March 2008 (UTC)

Bug report

Hi there! I remember pointing out to you once that the bot does not always handle titles in Russian correctly, and I think you've fixed the problem for the most part, but today this edit popped up in my watchlist, and it seems to have the same problem. Don't know if it's something wrong with the site or with the bot, but I think it's worth a closer look. Cheers,—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 14:36, 27 March 2008 (UTC)

Well, yes, again, it's a website that does not give out any encoding. :( Added a rule for .su websites, and it works
Thanks for the report ! :)
NicDumZ ~ 07:50, 28 March 2008 (UTC)
Actually no, it doesn't. The link you provided shows gibberish in Cyrillic letters. Looks like wrong encoding was selected, and it looks like KOI8-R was involved at some point in the decoding process. Sorry for the bad news!—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 13:56, 28 March 2008 (UTC)
Better then. DumZiBoT now uses KOI8-R instead of windows-1251 as a default Cyrillic charset. But actually, since both of these encodings are 8 bits, *any* text could be decoded using KOI8-R. What I'm saying here is that I've just switched priorities : If someone comes here and tells me that some title got decoded as garbage since DumZiBoT used KOI8-R where windows-1251 would have been better, I won't be able to do anything more...
NicDumZ ~ 16:38, 30 March 2008 (UTC)
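To illustrate the point about 8-bit encodings, here is a minimal Python 3 sketch (not the bot's code; the sample title is made up). It shows that bytes written as windows-1251 decode under KOI8-R without any error, only as different letters, which is why the bot cannot tell from the bytes alone which guess was right.

```python
# Both KOI8-R and windows-1251 are single-byte encodings, so decoding
# with the "wrong" one does not raise an error; it just yields gibberish.
title = "Привет, мир"              # illustrative Russian title (assumption)
raw = title.encode("cp1251")        # bytes as a windows-1251 site would send them

as_cp1251 = raw.decode("cp1251")    # correct guess: original text comes back
as_koi8r = raw.decode("koi8_r")     # wrong guess: still decodes, but as mojibake

print(as_cp1251)
print(as_koi8r)
```

Because neither decode fails, the only options are a per-site rule (as with the .su rule above) or a statistical guess at the charset.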

Thanks

Thanks for converting the source links in relations to the PS2 version of Syphon Filter: Logan's Shadow. I usually have trouble with those. Beem2 (talk) 04:01, 31 March 2008 (UTC)

bot spelling error

Your bot seems to have misspelled "Association" quite a few times. I wonder if you could rerun to fix the error. A specific example is here, and a search shows it's fairly common. Regards, Jpmonroe (talk) 07:14, 31 March 2008 (UTC)

Yeah, basically, the title from the ARIA website is misspelled.
The website is wrong, not my bot, which just copies the title from the website.
NicDumZ ~ 11:16, 31 March 2008 (UTC)

Excessive link text

The bot generally does a good job, but look at this diff. Perhaps it should have some maximum text length? JonHarder talk 13:37, 29 March 2008 (UTC)

Ah, thanks a lot for the report !
Actually, I remember being asked about that problem, but can't remember why I did not fix it :(
Titles longer than 250 characters (arbitrary length) are now skipped.
NicDumZ ~ 16:17, 30 March 2008 (UTC)
If you skip, then please chop like this: "A really long line that is truncated" -> "A really long line that is tr.."
rather than "A really long line that is tr", so that readers will know it's truncated. Electron9 (talk) 21:20, 30 March 2008 (UTC)
Ah, yes. By "skipping", I meant that the reference was not being changed. However, you're right: simply skipping it is not enough. I now process it, appending "..." at the end. Thanks !!!
NicDumZ ~ 15:50, 31 March 2008 (UTC)
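The truncation behaviour described above could look roughly like this in Python 3. This is only a sketch: the function name and constant are hypothetical, and only the 250-character limit and the "..." suffix come from the discussion.

```python
MAX_TITLE_LENGTH = 250  # the arbitrary limit mentioned in the thread


def truncate_title(title, limit=MAX_TITLE_LENGTH):
    """Return the title unchanged if short enough, else cut it and mark the cut."""
    if len(title) <= limit:
        return title
    # Append "..." so that readers can see the title was shortened.
    return title[:limit] + "..."


print(truncate_title("A short title"))
print(len(truncate_title("x" * 400)))
```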

Character encoding issues

From pywikipedia's BeautifulSoup.py

try:
    import chardet
#    import chardet.constants
#    chardet.constants._debug = 1
except:
    chardet = None
chardet = None

Look at the last line: it resets chardet to None unconditionally, so the import above never has any effect. Web version updated, sources at http://tools.wikimedia.de/~dispenser/resources/sources/ Dispenser 22:13, 30 March 2008 (UTC)

Ah, right
Good catch, it might help a lot ! (Sad that the run on the current dump is over, tho !)
I will test how chardet can improve DumZiBoT behavior.
I will commit that fix to the pywikipedia trunk as soon as I am able to.
NicDumZ ~ 15:48, 31 March 2008 (UTC)
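For reference, a guarded import without the stray reset might look like the following Python 3 sketch. The helper `guess_encoding` and its fallback value are assumptions for illustration, not part of BeautifulSoup.py; `chardet.detect` is the library's real entry point and returns a dict with an "encoding" key.

```python
try:
    import chardet
except ImportError:
    # chardet is optional; fall back gracefully when it is not installed.
    chardet = None


def guess_encoding(raw_bytes, fallback="utf-8"):
    """Guess the charset of raw_bytes, using chardet only if it is available."""
    if chardet is None:
        return fallback
    result = chardet.detect(raw_bytes)
    # detect() may report None when it cannot decide.
    return result.get("encoding") or fallback


print(guess_encoding(b"plain ascii text"))
```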

yet another feature request - cite web templating

I'm sure you must've been asked this often; apologies (and just ignore this) if so... but could (perhaps the next revision of) DumZiBot convert the labelled URLs to web cite templates? This would mean that it could put in the retrieved date and would make it easier for other editors to flesh out the refs with richer information later. I appreciate all the work you and your bot have done! Pseudomonas(talk) 10:02, 31 March 2008 (UTC)

Yes, I've been asked that often. However, a few minutes ago, my FAQ page was not answering that FAQ, so don't apologize ;)
Here is your answer.
Thanks ! ;)
NicDumZ ~ 15:43, 31 March 2008 (UTC)

Page not found used as link

Hi, I like the bot idea overall. I just saw that it made this edit; I don't think it should use the title of a 404 page as the title for a link. Perhaps if it gets a 404 response it could just skip trying to label that link? Just an idea. Thanks. MECUtalk 15:53, 31 March 2008 (UTC)

Unfortunately, that's a "200 OK" response. Silly webserver, to give a 200 code with that title.
Though I suppose adding "page you requested could not be" to the exceptions might be an idea? — the Sidhekin (talk) 16:21, 31 March 2008 (UTC)
Yup, and pretty silly webpage which has <div>s *in* the title markup.
However, I improved the part about "page not found" from "page.*not *found" to "page.*not( *be) *found" so that it matches that wrong title. It should be better.
Thanks,
NicDumZ ~ 18:34, 31 March 2008 (UTC)
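A quick way to check such a pattern against sample titles, in Python 3. Note the assumption: the sketch makes the "be" group optional with `?`, since the pattern as quoted above would otherwise require "be" and stop matching a plain "page not found". Whether the bot's actual pattern has the `?` is not shown in this thread.

```python
import re

# Assumed variant of the pattern quoted above, with the "be" group optional.
page_not_found = re.compile(r"page.*not( *be)? *found", re.IGNORECASE)

samples = [
    "The page you requested could not be found",
    "Page not found",
    "Welcome to our home page",
]
for title in samples:
    print(title, "->", bool(page_not_found.search(title)))
```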

Older screw-up

Please investigate [this] unnoticed screw-up. `'Míkka>t 03:40, 1 April 2008 (UTC)

No, DumZiBoT did not screw up. The page was already screwed up. I just tried adding a simple <references/> tag, and the same problem happens. In fact, a <ref> tag was not properly closed. I fixed it, and ran DumZiBoT on the page again.
NicDumZ ~ 09:11, 1 April 2008 (UTC)

Dumzibot: heaven or hell?

Bot made another error. Can't you people just do the links yourselves?

[4]

Death Valley? izaakb ~talk ~contribs 00:10, 29 March 2008 (UTC)

Well. Try. Open the pdf. Title ?
Being aggressive when submitting an invalid bug report is just... ridiculous.
NicDumZ ~ 04:12, 29 March 2008 (UTC)
How 'bout answering the polite and accurate ones then? Equazcion /C 05:07, 29 Mar 2008 (UTC)
No. I answered that annoying question at 5 am local time, because it was just too much. But I need to think more about serious questions to issue an adapted answer :) NicDumZ ~ 09:49, 29 March 2008 (UTC)

The correct name of the PDF is Lethal Lou's: Profile of a Rogue Gun Dealer not "Death Valley". What PDF is called "Death Valley?" Not at the link below:

http://www.gunlawsuits.org/xshare/pdf/reports/lethal-lous.pdf Death Valley -- Bot generated title --

rgds izaakb ~talk ~contribs 21:35, 29 March 2008 (UTC)

The correct name of the PDF notwithstanding, the actual title of the PDF is "DEATH VALLEY". Observe:
sidhekin@blackbox[22:43:15]~$ pdfinfo lethal-lous.pdf 
Title:          DEATH VALLEY
Author:         vice
Creator:        Acrobat PDFMaker 7.0.5 for Word
Producer:       Acrobat Distiller 7.0.5 (Windows)
CreationDate:   Tue Sep  5 13:44:38 2006
ModDate:        Tue Sep  5 14:02:06 2006
Tagged:         yes
Pages:          25
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      3643054 bytes
Optimized:      yes
PDF version:    1.6
sidhekin@blackbox[22:43:43]~$ perl -nle 'print if /title/.../title/' lethal-lous.pdf 
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">DEATH VALLEY</rdf:li>
            </rdf:Alt>
         </dc:title>
sidhekin@blackbox[22:43:46]~$
Computer programs don't generally go around inventing titles. — the Sidhekin (talk) 21:44, 29 March 2008 (UTC)

I've been looking at the bot as doing work for editors who have been too lazy to use half-proper references. Even if the titles aren't so great, they are very valuable in dead link recovery. — Dispenser 06:59, 30 March 2008 (UTC)

Now I understand how the bot got the name, but I think that is problematic, as the writer of the document did not update the file info you posted above with the document name. If I were to go searching elsewhere for a document entitled "Death Valley" it wouldn't do much good, since that's apparently what the writer called it while it was in progress. And I guess there is no way to double-check that? izaakb ~talk ~contribs 01:37, 1 April 2008 (UTC)
One way to increase confidence in a title extracted from a PDF would be to check to see if that string is present in the text stream of the PDF (I realize some PDFs don't have this). I suppose this would work for HTML pages as well, and might prevent some of these issues where the metadata is not, in fact, a good title. —johndburger 02:46, 1 April 2008 (UTC)
No, it wouldn't work. A title is basically a summary of the content; I see no particular reason why the title would be in the page / document body. "User talk:NicDumZ - Wikipedia, the free encyclopedia" is not to be seen anywhere in this page, by the way. NicDumZ ~ 09:14, 1 April 2008 (UTC)
What about pages or PDFs which only contain images? — Dispenser 03:27, 2 April 2008 (UTC)

PDF being called Microsoft Word

http://en.wikipedia.org/w/index.php?title=MPEG_transport_stream&diff=201201845&oldid=199427990 The first reference it titled a PDF document microsoft word. Can you do something about that? Daniel.Cardenas (talk) 01:44, 27 March 2008 (UTC)

see Good title just above ;)
NicDumZ ~ 07:03, 27 March 2008 (UTC)

Are you going to do something about it? A PDF is not microsoft word. I suggest you add a rule to do nothing in this case. Daniel.Cardenas (talk) 11:23, 27 March 2008 (UTC)

A fix may be to check if titles have .foo in the end, and if they do, remove it. Martijn Hoekstra (talk) 07:59, 28 March 2008 (UTC)
Corrected for now. The regex is pretty simple, (?i) *microsoft (word|excel|visio|powerpoint), it probably can be improved.
NicDumZ ~ 14:41, 2 April 2008 (UTC)
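As a rough Python 3 check of the regex quoted above: the thread does not say whether the bot strips the matched text or skips the title altogether, so the helper below (a hypothetical name) only reports whether a title looks like Office export residue.

```python
import re

# The pattern as quoted in the reply above.
office_residue = re.compile(r"(?i) *microsoft (word|excel|visio|powerpoint)")


def looks_like_office_residue(title):
    """True when the title mentions the Office program the PDF was exported from."""
    return office_residue.search(title) is not None


print(looks_like_office_residue("Microsoft Word - mpeg2.doc"))
print(looks_like_office_residue("MPEG transport stream overview"))
```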

DumZbot reference conversion style "problem"

Currently 'DumZbot' converts:

<ref>http://www.energybulletin.net/2389.html</ref>

Into:

<ref>[http://www.energybulletin.net/2389.html Europe Worries Over Russian Gas Giant's Influence | EnergyBulletin.net | Peak Oil News Clearinghouse<!-- Bot generated title -->]</ref>

I think the idea is great, but the destination format isn't really that good. I would suggest the following format:

<ref name=wwwenergybulletinnet2389html>{{cite web|title=Europe Worries Over Russian Gas Giant's Influence | EnergyBulletin.net | Peak Oil News Clearinghouse|url=http://www.energybulletin.net/2389.html}} 080327 energybulletin.net</ref>

(dateformat is YYMMDD)

That way one can see 1) which site the information comes from without clicking on the link 2) the date when the link was retrieved 3) any updates to the template can utilise the collected information structurally 4) the reference can be used several times.

Otherwise all is well :-)

Electron9 (talk) 18:49, 27 March 2008 (UTC)

Here is your answer.
Thanks ! :)
NicDumZ ~ 15:44, 31 March 2008 (UTC)
Also - there is a very good reason NOT to include the retrieved date - this needs to be done by a human who actually checks that the source backs up the statement it is a citation for. --Random832 (contribs) 14:55, 2 April 2008 (UTC)

Half completed

Hello, I think your bot was fixing the page Fiona Sit. I see it converted a lot of the references, but it looks half completed? Any reason why the bot left, lunch break? Benjwong (talk) 01:51, 2 April 2008 (UTC)

Yes, there seems to be a bug for three of the links, as the script matches <ref> but not <Ref> tags. Some of the references have two links; this won't be fixed. And the rest are 404s or aren't HTML files. — Dispenser 02:42, 2 April 2008 (UTC)
Ok that would explain it thanks. Benjwong (talk) 02:48, 2 April 2008 (UTC)
Wow, I suck. Actually I will fix that for the next run.
NicDumZ ~ 10:48, 2 April 2008 (UTC)
fixed ! Thanks !
NicDumZ ~ 14:24, 2 April 2008 (UTC)
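The <ref> / <Ref> issue comes down to case-sensitive matching. A minimal Python 3 sketch of a case-insensitive match follows; the pattern is a simplified stand-in for the bot's much longer linksInRef expression, and the example.org URLs are placeholders.

```python
import re

# Simplified stand-in for the bot's pattern: a bare URL inside a ref tag.
# re.IGNORECASE makes <ref>, <Ref> and <REF> all match.
bare_ref = re.compile(r"<ref\b[^>]*>\s*(https?://[^\s<]+)\s*</ref>", re.IGNORECASE)

text = (
    "<ref>http://example.org/1</ref> "
    "<Ref>http://example.org/2</Ref> "
    "<REF>http://example.org/3</REF>"
)
urls = bare_ref.findall(text)
print(urls)
```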

Mastek

I think there are sufficient references available in the Mastek article; can you remove the unreferenced tag from the page, or shall I do it myself? KuwarOnline (talk) 11:40, 9 April 2008 (UTC)

Suggestion

I think this bot is a great idea and it works perfectly as far as I've seen. I just have one suggestion: I find that the link text this bot generates is actually less descriptive than the URL. The bot just uses the title of the page, which usually doesn't distinguish the link all that well from others in the reflist. For example, if there's an article on Joe Smith, sourced with a few different biographies on that person, the links would all read "Joe Smith bio", "the life of Joe Smith", or something similar. There isn't much to distinguish one from the other, especially as far as which are from reliable sources.

The most important thing about references isn't really the title of the page, but the root site they're located on. I wonder if you'd consider modifying your bot to include the root site address in addition to the page title -- for instance, something like "Title at Site.com" (Joe Smith bio at timemagazine.com). This would allow a casual glance of the reflist to reveal any unreliable sources, any glaring omissions of sources that should be there, etc.

Thanks and please let me know your thoughts. Equazcion /C 23:00, 28 Mar 2008 (UTC)

"The most important thing about references isn't really the title of the page, but the root site they're located on." I don't agree with that at all; I think your bot is doing great work by adding titles. But it would be even better if, for example with this diff, the bot were to include a domain name for the source, so that the added text would look something like this: "University of Illinois at Chicago (UIC) - College of Engineering<!-- Bot generated title -->], www.uic.edu"
If you were interested in this, I'd suggest considering yet one more enhancement - a page of standard sources (nytimes.com, washingtonpost.com, etc.), with matching names (New York Times, Washington Post), which the bot could use rather than posting the domain. By doing this, your bot would be adding two missing elements of citations, not just one. -- John Broughton (♫♫) 20:06, 5 April 2008 (UTC)
I'd like to third this; pagetitle, domain is far more useful than either alone.--Father Goose (talk) 22:52, 5 April 2008 (UTC)
I just can't find what to answer you guys. I understand what you're willing to do. But... To me formatting references in this way or in another (using cite templates for example) is just an editorial choice that I can't do as a bot owner. I try to be as neutral as I can. If I actually add in some way the web address of the link, others will come and ask me not to include it, with arguments as good as yours... :)
My point of view is simple : references should have titles, it's a strong style recommendation. I can simply add titles using the webpage title, so I do it, and no one can reasonably complain about this, because it's a well-accepted recommendation/policy. But if I actually format these titles in a way that is not widely accepted, I'll run into troubles :)
See below, someone thinks I should even remove the simple "bot generated title" comment ! :)
NicDumZ ~ 15:45, 12 April 2008 (UTC)
I don't think anyone would object to simply adding the domain name. When web refs are formatted manually, they always contain some indication of the site they came from. Equazcion /C 15:06, 13 Apr 2008 (UTC)

Maybe there's just some confusion here. Here's a ref your bot did:

Here's how we'd like it:

  • "GBU-43/B / "Mother Of All Bombs" / Massive Ordnance Air Blast Bomb". www.globalsecurity.org.

This is consistent with {{cite web}}'s formatting; in fact, I used "cite web" to generate the example above, putting the domain name of the website in the work field. (Per {{cite web}}'s documentation: "work: If this item is part of a larger "work", such as a book, periodical or website, write the name of that work.") DumbZBot does not need to use "cite web" itself but can just copy its formatting: [URL pagetitle]. domain name.

I cannot imagine that our suggestion to add the domain name to the reference that DumbZBot generates would be in any way controversial.--Father Goose (talk) 21:17, 13 April 2008 (UTC)
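For what it's worth, the suggested "[URL pagetitle]. domain name." layout is simple to produce. The Python 3 sketch below is only an illustration of the proposal, not something the bot does; the function name is hypothetical and the URL is a placeholder.

```python
from urllib.parse import urlsplit


def format_ref_with_domain(url, title):
    """Format a reference as '[URL title]. domain.' following the proposal above."""
    domain = urlsplit(url).netloc  # e.g. 'www.example.org'
    return "[%s %s]. %s." % (url, title, domain)


print(format_ref_with_domain("http://www.example.org/bio.html", "Joe Smith bio"))
```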

Laundry list

Done

  • Add check for PDF so Microsoft Word - ... .doc isn't a valid title (include family, PowerPoint, excel, Visio)
  • Fix language Icon issues (lang -> latin (la), assumed character encoding != language)
    • I'm not supporting language icons anymore. I believe it's too much work for too few pages.
  • Fix issues with <Ref> and <REF>

Won't Do

  • Add checks for DE: where the <h1> element needs to partially match the title
    • That's absolutely not the way I coded my script, I do not intend to change it this way. NicDumZ ~ 15:35, 12 April 2008 (UTC)
  • Fix issues with title which are less than 6 characters
  • Maybe merge identical references
    • Maybe in another script, but not in reflinks. It's a rather complex task. NicDumZ ~ 15:35, 12 April 2008 (UTC)
  • Maybe identify unbalanced <ref> tags
    • No ? I'm not fixing syntax errors ;) NicDumZ ~ 15:35, 12 April 2008 (UTC)
  • Reject title with HTML tags (look for </...>)

??

  • Add the safari like algorithm that shows only the differences between titles
    • If multiple titles on a page are the same, but from different links then skip title
      You mean... only showing what titles got appended ? The current diff scheme only show the lines that got changed, it's not a lot of text, is it ? NicDumZ ~ 15:35, 12 April 2008 (UTC)
  • Add optional support to convert numbered external links into references where there are three or more <ref>s

To Do

  • Add optional support for bullet external links
  • Post the link to the source other than the BRfA
  • Get rid of/merge the meta-data section in the FAQ

Well, that's the stuff that I've been able to come up with. You might want to look at the toolserver source, as I've tried implementing (poorly) a new encoding scheme. — Dispenser 03:38, 2 April 2008 (UTC)

DumZiBoT on long pages

Hey Nic - love the bot, great work. One request: any way to modify it so it does not add the commented text (Bot generated title) when it converts bare references on long pages (>32k)? We are trying to fight long articles per WP:AS, and on those pages every character we save, even non-visible ones like commented text, helps. UnitedStatesian (talk) 14:07, 10 April 2008 (UTC)

Well, no I won't remove the comment. The idea is to let the user know that a robot (non-human) inserted the title, so that, in case of garbage, he can easily know what happened. I also think that I get more bug reports with comment. Editors don't just correct a wrong title, they report it, so I can improve the bot. :)
NicDumZ ~ 15:23, 12 April 2008 (UTC)
Well, thanks anyway; still love the bot. UnitedStatesian (talk) 04:52, 15 April 2008 (UTC)

Sweet work

Thanks again for this bot. It's doing great work. Do you have an estimate on the percentage of articles it has hit? Timneu22 (talk) 21:20, 13 April 2008 (UTC)

Well, yes, it's easy. DumZiBoT has 100K contribs all on this task. On 2,300K articles, it makes something like 4% of the articles. (Over 100K contribs, I think that the amount of articles that got processed several times can be ignored) NicDumZ ~ 21:29, 13 April 2008 (UTC)

Avoid long "titles"?

This page demonstrates the longest bot-generated title I've seen to date (citation 10 is a very detailed error message). To avoid this kind of thing, can the bot be programmed to ignore or truncate "titles" longer than some pre-set length? --Orlady (talk) 03:22, 14 April 2008 (UTC)

Thanks ! I actually fixed this on March 30th. The maximum length is 250 characters. For example, on this website, it gives this, which is not really better, due to the HTML tags in the title. NicDumZ ~ 08:16, 14 April 2008 (UTC)

Nature:access

This bot generated titles for references, which appeared as "nature:access". This may be because it is operating from a computer without access to nature.com. I noticed this on the edits for Monotreme in January. Can it be fixed? Hectorguinness (talk) 12:39, 30 April 2008 (UTC)

Poke :)

Hi! I'm User:Bdamokos from the Hungarian Wikipedia and I would like to let you know that I am going to test your bot on the Hungarian Wikipedia, if it's okay with you. I think it's a great tool. Regards, --Dami (talk) 01:35, 3 May 2008 (UTC)

Hey! I've run a couple of test edits on hu.wikipedia and there are three issues: sometimes it gives a title to a link that already has a title (the third one is an example of this), it repeats the title twice (2nd and 3rd link), and it freezes when trying to get the title of a PDF file (giving an error that another process is using the same file; should I install some extra program that the bot relies on?). If you could help me get to the root of these problems, I think this could be a great tool on huwiki also. Bye, --Dami (talk) 02:23, 3 May 2008 (UTC)
You probably want to try the Feb 12 version as I have been editing it since then. The only benefits I made besides portability to my web framework is that it uses (a modified) chardet correctly. — Dispenser 02:43, 3 May 2008 (UTC)
Hello !
I'm glad you're trying it. Thanks for telling me, too :)
I've just added my up-to-date script to the official pywikipedia SVN, synchronizing it with noreferences.py for more coherence.
You will need :
  • if not already done, to configure noreferences.py for hu:
  • for pdf handling, the unix command pdfinfo. It can output some garbage about a badly formatted or truncated PDF; that's pretty normal: to avoid downloading big PDFs, if the file is bigger than 2 MB, I only get the first 2 MB of it, and pdfinfo does not like that. However, the title is located in the headers, so it should work. If you do not use Unix / cannot get pdfinfo, let me know, and I'll add an option to skip PDF files.
About your first bug, I tried adding the ref to my test page, and running the bot on it. As you can see, the hu: reference is not being modified. This is likely to be a bug that has been fixed now.
About your second bug, well.... Honestly for now I really don't know what it might be. Let me know if it happens again, I'll try to reproduce and to fix it.
Cheers ! :)
NicDumZ ~ 08:11, 3 May 2008 (UTC)
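As a rough idea of what the pdfinfo step involves, here is a Python 3 sketch, not the bot's actual code. Both function names are hypothetical; the output format parsed is the "Key: value" listing shown in the pdfinfo transcript earlier on this page. `pdf_title` needs the pdfinfo binary on the PATH and raises FileNotFoundError without it, which is the situation on a stock Windows install.

```python
import subprocess


def parse_pdfinfo_title(output):
    """Extract the Title field from pdfinfo's 'Key:   value' output, if any."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip() or None
    return None


def pdf_title(path):
    """Run pdfinfo on a local file and return its title (requires pdfinfo)."""
    result = subprocess.run(["pdfinfo", path], capture_output=True, text=True)
    # Warnings about truncated files go to stderr; the title is in stdout.
    return parse_pdfinfo_title(result.stdout)


sample = "Title:          DEATH VALLEY\nAuthor:         vice\nPages:          25\n"
print(parse_pdfinfo_title(sample))
```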
Hey! Thanks for the help! I am using it on Windows, so if the pdf part could be switched off or made optional (if I can get a Linux that works with my laptop in the future) that would be nice.
I have translated to Hungarian what I understood (= everything except the badtitles part; I guess it's OK if it's the same in hu as in en):
Long python source
msg = { 'fr':u'Bot: Correction des refs. mal formatées (cf. [[Utilisateur:DumZiBoT/liensRefs|explications]])',
        'de':u'Bot: Korrektes Referenzformat (siehe [[:en:User:DumZiBoT/refLinks]])',
        'en':u'Bot: Converting bare references, see [[User:DumZiBoT/refLinks|FAQ]]',
        'hu':u'Robot: Forráshivatkozások konvertálása'}
 
lang_template = { 'fr':u'{{%s}}',
                  'en':u'{{%s icon}}'}
 
deadLinkTag = {'fr':u'{{Lien mort}}',
               'de':u'',
               'en':u'{{dead link}}',
               'hu':u'{{halott link}}'}
 
comment = {'fr':u'Titre généré automatiquement',
           'de':u'Automatisch generierter titel',
           'en':u'Bot generated title',
           'hu':u'Robot generálta cím'}
 
stopPage = {'fr':u'Utilisateur:DumZiBoT/EditezCettePagePourMeStopper',
            'de':u'Benutzer:DumZiBoT/EditThisPageToStopMe',
            'en':u'User:DumZiBoT/EditThisPageToStopMe',
            'hu':'User:Damibot/EditThisPageToStopMe'}
 
soft404   = re.compile(ur'\D404(\D|\Z)|error|errdoc|Not.{0,3}Found|sitedown|eventlog|hiba', re.IGNORECASE)
dirIndex  = re.compile(ur'^\w+://[^/]+/((default|index)\.(asp|aspx|cgi|htm|html|phtml|mpx|mspx|php|shtml|var))?$', re.IGNORECASE)
domain    = re.compile(ur'^(\w+)://(?:www.|)([^/]+)')
badtitles = {'en':
                # starts with
                ur'(?is)^\W*(register|registration|(sign|log)[ \-]?in|subscribe|sign[ \-]?up|log[ \-]?on|(untitled|new) *(document|page|$))'
                # anywhere
                +ur'|(404|page|file).*not *found|error'
                # should never be
                +ur'|^JSTOR. Accessing JSTOR$'
                # ends with
                +ur'|(register|registration|(sign|log)[ \-]?in|subscribe|sign[ \-]?up|log[ \-]?on)\W*$',
            }
 
linksInRef = re.compile(
    # bracketed URLs
    ur'(?:<ref[^>]*>)(\s*\[*(?P<url>(?:http|https|ftp)://(?:' +
    # unbracketed with()
    ur'[^\[\]\s<>"]+\([^\[\]\s<>"]+[^\[\]\s\.:;\\,<>\?"]+|'+
    # unbracketed without ()
    ur'[^\[\]\s<>"]+[^\[\]\s\)\.:;\\,<>\?"]+|[^\[\]\s<>"]+))[!?,\s]*\]*\s*)(?:</ref>)')
#'http://www.twoevils.org/files/wikipedia/404-links.txt.gz'
listof404pages = '404-links.txt'
 
# References sections are usually placed before further reading / external
# link sections. This dictionary defines these sections, sorted by priority.
# For example, on an English wiki, the script would place the "References"
# section in front of the "Further reading" section, if that existed.
# Otherwise, it would try to put it in front of the "External links" section,
# or if that fails, the "See also" section, etc.
placeBeforeSections = {
    'de': [              # no explicit policy on where to put the references
        u'Literatur',
        u'Weblinks',
        u'Siehe auch',
        u'Weblink',      # bad, but common singular form of Weblinks
    ],
    'en': [              # no explicit policy on where to put the references
        u'Further reading',
        u'External links',
        u'See also',
        u'Notes'
    ],
    'hu': [
        u'Külső hivatkozások',
        u'Lásd még',
    ],
}
 
# Titles of sections where a reference tag would fit into.
# The first title should be the preferred one: It's the one that
# will be used when a new section has to be created.
referencesSections = {
    'de': [
        u'Einzelnachweise', # The "Einzelnachweise" title is disputed, some people prefer the other variants
        u'Quellen',
        u'Quellenangaben',
        u'Fußnoten',
    ],
    'en': [             # not sure about which ones are preferred.
        u'References',
        u'Footnotes',
        u'Notes',
    ],
    'hu': [
        u'Források és jegyzetek',
        u'Források',
        u'Jegyzetek',
        u'Hivatkozások',
        u'Megjegyzések',
    ]
}
 
referencesTemplates = {
    'wikipedia': {
        'en': [u'Reflist',u'Refs',u'FootnotesSmall',u'Reference',
               u'Ref-list',u'Reference list',u'References-small',u'Reflink',
               u'Footnotes',u'FootnotesSmall'],
        'hu': [u'reflist'],
    },
}

. Maybe if you could put in the Hungarian also into the original source code, it would be easier to update. Best regards, --Dami (talk) 11:42, 3 May 2008 (UTC)

Thanks, I added your hu: translations to both noreferences.py and reflinks.py on the SVN NicDumZ ~ 23:31, 4 May 2008 (UTC)
Hey! I tried the Feb 12 version, but it doesn't work... I think the problem is basically with Hungarian accented characters, and I think the issue might be specific to Windows (Unix handling the encodings better and stuff...). I will try to test with a live CD.--Dami (talk) 12:05, 3 May 2008 (UTC)
Character encoding ? Do some titles print badly ? NicDumZ ~ 23:31, 4 May 2008 (UTC)
I was thinking, that the bot is trying to find titles to links that already have them, because in the title there are accented characters like éáőúűóüöí; but the error happened on an Ubuntu I tested it on, with the same pages, but it doesn't happen always; so I don't know what is the pattern. I will try again with SVN version and report back. --Dami (talk) 11:05, 5 May 2008 (UTC)
Hi again! It has the same bug on Linux as well. Currently I have disabled the PDF part, and set it to manual mode. This way it's quite usable, but not as fast as with automatic mode. Regards, --Dami (talk) 17:38, 3 May 2008 (UTC)
How have you disabled it ? Using -ignorepdf from the last SVN version, or modifying the outdated code found on User:DumZiBoT/reflinks.py ? NicDumZ ~ 23:31, 4 May 2008 (UTC)
This was before SVN version, with commenting out the relevant class. --Dami (talk) 11:05, 5 May 2008 (UTC)
It seems that the number of errors apart from the 5 or 6 articles it always gets wrong is zero. So, don't worry about good old huwiki, and keep developing such great tools!--Dami (talk) 17:52, 3 May 2008 (UTC)

Reproducing bug#2

Sorry for flooding your talk page. Could you run a test to check what does the bot do, if the same link appears more than once on a page, like here [5] or [6]. Ideally it would merge these refs into one named one, but instead it inserts the title as many times as there are identical links. --Dami (talk) 18:06, 3 May 2008 (UTC)

Are you actually using the SVN version ?
I just tried with the SVN version, here and here, and no wrong behavior occurred. ?! NicDumZ ~ 23:17, 4 May 2008 (UTC)
This was before the SVN version, because the translation was not yet included. I'll try with the SVN version and report back. --Dami (talk) 11:02, 5 May 2008 (UTC)

Svn version

Almost all problems solved, at the price of introducing a new one: It doesn't ignore titles such as "Untitled Document" or just "Untitled" (For example at hu:Marosvásárhely if I remember correctly). The ignoring should be enabled for Hungarian (there is no need for extra blockings for Hungarian, apart from maybe for "Névtelen" [meaning untitled]). If this problem could be solved, it would be wonderful. Thanks again, --Dami (talk) 19:02, 5 May 2008 (UTC)
I also made a small change in the translation, that would be nice to have in the SVN version:

msg = { 'fr':u'Bot: Correction des refs. mal formatées (cf. explications)',
        'de':u'Bot: Korrektes Referenzformat (siehe en:User:DumZiBoT/refLinks)',
        'en':u'Bot: Converting bare references, see FAQ',
        'hu':u'Robot: Forráshivatkozások kibővítése a hivatkozott oldal címével'}
--Dami (talk) 19:09, 5 May 2008 (UTC)

Nice !
I'm updating the SVN...
About [7], I don't think that you were using the SVN version at that time, were you ? :) [why ? It is inserting {{en icon}}, and I removed this feature in the SVN version, because most HTTP server don't give out proper language codes, resulting in wrong language icons inserted...]
NicDumZ ~ 20:57, 5 May 2008 (UTC)
I was actually using the SVN version, but just clicked on NO when it asked, whether to commit the change. Now I saved it, just to show you [8]. --Dami (talk) 08:19, 6 May 2008 (UTC)

This error is still present in the latest SVN version [9], I don't know what's causing it as changing the English badtitles list to Hungarian doesn't help. Any ideas?--Dami (talk) 13:12, 9 May 2008 (UTC)

Well thanks for your report. It seems that when adding my script to the repository, one space slipped in the badtitles regular expression. (?! I feel confused about that...)
Thanks again, a lot of blacklisted titles were NOT detected as bad titles because of my mistake. It is now fixed.
NicDumZ ~ 18:42, 10 May 2008 (UTC)
Thank you!--Dami (talk) 20:03, 10 May 2008 (UTC)

Bot producing mojibake

Your bot sometimes produces mojibake. See for example this edit: [10]. If you've already fixed this, or if it's not really your bot's fault, please ignore. —Keenan Pepper 01:51, 8 May 2008 (UTC)

Hello !
I just tested, and yes, apparently it has been fixed.
Thanks however for the message, I didn't know about the mojibake word :þ
NicDumZ ~ 06:49, 8 May 2008 (UTC)

Consolidate duplicate refs

Could this bot consolidate duplicate references? Some pages have multiple references all to the same article, and it would be nice if they could be condensed to give a more accurate picture of the number of articles truly referenced. Novasource (talk) 17:59, 12 May 2008 (UTC)

Looks like it is now consolidating references! Yay! Novasource (talk) 17:19, 14 May 2008 (UTC)
Is it ?
Checking for duplicates was definitely something I intended to code "When I have time". But unfortunately that check does not seem this easy, and I don't have that much time :)
NicDumZ ~ 17:21, 14 May 2008 (UTC)
Yup. It worked at Lupe Valdez and Buffalo Speedway. Are you running the same thing that's at http://tools.wikimedia.de/%7Edispenser/cgi-bin/reflinks.py? That's what I used, and it appears to reference you in the edit summary (which I modified to say "Consolidate links" instead of the usual edit summary), which links the word FAQ to User:DumZiBoT/refLinks. Novasource (talk) 20:33, 14 May 2008 (UTC)
Well. I first wrote reflinks.py for DumZiBoT. At some point Dispenser used the source to convert the script to be web based and started adding new features. Meanwhile I continued developing the shell-based reflinks.py in my own way, and we now have scripts that tend to differ. No, DumZiBoT is not running the same script :þ
Dispenser I know you're reading this: your script is still buggy, (see [11] for example), but you apparently did a good job implementing that duplicate check :þ If you're okay with that, I'd like to take the "good part" to include it in the pywikipedia SVN :)
NicDumZ ~ 20:48, 14 May 2008 (UTC)
I actually implemented the duplicate check because of that bug. I changed the script to search for all free links, but when it double-replaces duplicate links on a page, it causes the problem. I implemented an extremely simple check which looks for duplicates. A better one would compare URLs in the references.
# Convert autonumbered references into cite.php format
# Regex from AWB user
if autonum2ref:
	new_text = re.sub(r'(?m)(?!^[*#:=].*?)(?<!<ref>)(?<!\*)(\s*)\[(https?://[^\] ]*)\](?!.*</ref>)', r'<ref>\2</ref>\1', new_text)
# Merge duplicate refs
for m in re.finditer(r'(?si)(<ref>)(.*?)(</ref>)', new_text):
	# Skip single references 
	if new_text.count(m.group()) <= 1:
		continue

	refname = 'autoref'
	for g in re.split(r'\W', m.group(2)):
		if len(g)>len(refname):
			refname = g
	i = 1
	while refname+str(i) in new_text:
		i+=1
	else:
		refname += str(i)
	new_text = new_text.replace(m.group(), '<ref name="%s">%s</ref>' % (refname, m.group(2)), 1)
	new_text = new_text.replace(m.group(), '<ref name="%s"/>' % refname)
And there was a bug in the version that I was using where it would automatically insert the references section when there were no references. It could happen in the shell operation when not using the database as the source. I would like to remove the web hack and start using switches, the autonum2ref is an example of one. — Dispenser 03:03, 15 May 2008 (UTC)
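The "better" check mentioned above (comparing URLs rather than whole reference strings) might look roughly like the sketch below. This is not the reflinks.py code: the function name `merge_refs_by_url` and the `autoref` naming scheme are assumptions made for illustration, and it only touches unnamed `<ref>` tags that contain exactly one URL.

```python
import re

REF_RE = re.compile(r'(?is)<ref>(.*?)</ref>')
URL_RE = re.compile(r'https?://[^\s\]<>"]+')

def merge_refs_by_url(text):
    """Collapse unnamed <ref>s that cite the same single URL (hypothetical helper)."""
    # First pass: count how often each URL is cited by an unnamed reference.
    counts = {}
    for m in REF_RE.finditer(text):
        urls = URL_RE.findall(m.group(1))
        if len(urls) == 1:
            counts[urls[0]] = counts.get(urls[0], 0) + 1

    names = {}  # URL -> generated ref name

    def repl(m):
        urls = URL_RE.findall(m.group(1))
        # Leave refs alone unless they cite one URL that appears at least twice.
        if len(urls) != 1 or counts[urls[0]] < 2:
            return m.group(0)
        url = urls[0]
        if url in names:
            # Later occurrences become short named references.
            return '<ref name="%s"/>' % names[url]
        # First occurrence: pick a name not already present in the page.
        i = 1
        while 'autoref%d' % i in text or 'autoref%d' % i in names.values():
            i += 1
        names[url] = 'autoref%d' % i
        return '<ref name="%s">%s</ref>' % (names[url], m.group(1))

    return REF_RE.sub(repl, text)
```

Note that this keeps only the first reference body for each URL, so a later reference that gives the same link a different title would lose that title.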

Bot blocked

I've had to block your bot as it was apparently malfunctioning. See WP:ANI#BOT out of control and needs temporary blocking/shutting off for the discussion, and [12] for the malfunction diff. Mangojuicetalk 17:53, 17 May 2008 (UTC)

You shouldn't take it personally: it's a bot. When bots malfunction there's always a calculation about whether the block or leaving the bot unblocked is the correct action. In this case, I judged that since all the bot does is put up interwiki links, which several other bots do, there would be zero harm in blocking it, and potentially non-zero harm in not blocking it. Mangojuicetalk 13:26, 19 May 2008 (UTC)

ANI concern

Bot: User:DumZiBoT or [67]

Confirmation that it is a bot: I'm a bot, I am not able to understand by myself what is the aim of all these basic binary operations that I'm performing

Diff that bot is deleting material, not just adding a link at the bottom: http://en.wikipedia.org/w/index.php?title=Barack_Obama&diff=213072670&oldid=213069913

Please disable bot until repairs can be made. DianeFinn (talk) 17:42, 17 May 2008 (UTC)

Recent edits seem OK. It's not editing that fast either. Do you have time to try asking at User talk:NicDumZ? I'll keep an eye on it for the next few minutes. If you get no response on the talk page, come back here and leave a note. Or simpler still, edit User:DumZiBoT/EditThisPageToStopMe. :-) Carcharoth (talk) 17:49, 17 May 2008 (UTC)
A bot shouldn't be deleting anything. What's another possibility? Sneaky edit summaries and human editing using a bot user? I'm not going to accuse someone of that. So the neutral observation is that the bot is not functioning. DianeFinn (talk) 17:53, 17 May 2008 (UTC)

Copied from ANI DianeFinn (talk) 17:54, 17 May 2008 (UTC)

Peque

Hello! Look at this: [13]; es:Peque (Antioquia) is a town in Colombia, not in Spain. Thank you, XalD (talk) 15:25, 19 May 2008 (UTC)

A little misspelling in replace.py

In replace.py the nn comment reads "blabla teksterstatting". It should read "blabla teksterstatning" (per [14] [15] (nr. 2 reads 'not found')).

To be more precise, it should go like this:

'nn':u'robot: automatisk teksterstatning: %s',

Yeah, I didn't want to upload a whole patch just for this, I hope you understand. :-P

Thanks in advance. --Harald Khan Ճ 17:22, 20 May 2008 (UTC)

Thanks, I just fixed this ;)
NicDumZ ~ 20:22, 20 May 2008 (UTC)

Quote handling

I just saw, on this location, the bot mis-handle a title. I've gone back and fixed it by hand. The issue seems to be a single-quote in the retrieved title. - Denimadept (talk) 22:05, 28 May 2008 (UTC)

Thanks for the kind report ;)
However, the HTML source of the page is <title>World</title> : firefox prints "World" as a title. Surely, in the page, you can find "World's Longest Bridge Spans", but the HTML title is "World". The HTML page is wrong, not my bot ;)
NicDumZ ~ 22:09, 28 May 2008 (UTC)
Ah hah! Report went to the wrong person, then!  :-D - Denimadept (talk) 01:20, 29 May 2008 (UTC)

Reflink updates

I've noticed in the changelog that you've removed the HTTP error logging. I've been meaning to ask you for a while now if you could send the logs over for use on the Toolserver. I have also add the svn version of reflink with modification only for enabling HTML output.

I'd like to discuss rev 5374 of reflinks.py. You changed the syntax of the dead link tag to a bracket-less format. This is a problem for my tool, as the regex is tied to using those brackets.

The regex implementation in checklinks.py is as follows:

# Name of the dead link templates
dead_templates = r'[Dd]ead[ _]*link|Dl|dl|[Dd]l-s|404|[Bb]roken[ _]+link'
...
    # Label dead links
    text = re.sub(ur'\[(\w+://[^][<>\s]+) *([^][]*?)\](\W*?|\W*?<[^<>]*?>\W*?)\{\{(%s)[^{}]*?\}\}' % dead_templates, ur"[\1 '''&#123;&#123;dead link&#125;&#125;''' \2]\3", text)

I would like to come up with a more formal definition, possibly:

text = re.sub(ur'\[(http[s]?://[^\[\]<>"\s]+) *([^\]\n]*)\](?:</ref>)?\{\{[Dd]ead link[^}]*\}\}' , ur"[\1 **dead** \2]", text)

It is probably best to bring the discussion somewhere else to get more input on this matter. — Dispenser 04:06, 29 May 2008 (UTC)

Okay, quick answer in the morning
  • I'm truly sorry about the HTTP logs. I ended up not using them, and I thought it was best to delete them, as no other tool was at the moment able to use them
  • The bracket-less syntax was actually not wanted. r5374 introduced es: enlace roto, which has a different syntax from the other langs, and I had to adapt refDead. The deadLinkTag fix was wrong, because I forgot about brackets. It is now fixed by r5460.
  • With that last revision, I believe that the latter regex would work, wouldn't it? You just have to correct it, for {{dead link}} is being placed inside <ref></ref>. Let me know...
  • I have also add the svn version of reflink with modification only for enabling HTML output. <- sorry, I don't understand the sentence :(
  • You are welcome to bring the matter elsewhere. You can post on pywikipedia-l lists.wikimedia.org or... anywhere else :)
NicDumZ ~ 06:38, 29 May 2008 (UTC)
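As a rough illustration of the point about template placement, the sketch below adapts the proposed "more formal" regex so that it accepts {{dead link}} both inside the reference (directly after the bracketed link) and after the closing </ref>. The function name `label_dead_links` and the "(dead link)" marker are assumptions for the example, not what checklinks.py emits; the optional </ref> is captured and re-emitted so the tag is not lost.

```python
import re

DEAD_LINK_RE = re.compile(
    r'\[(https?://[^\[\]<>"\s]+) *([^\]\n]*)\]'  # [url optional title]
    r'(</ref>)?'                                  # the ref may close before the template
    r'\{\{[Dd]ead link[^}]*\}\}'                  # {{dead link}} with optional parameters
)

def label_dead_links(text):
    """Move the dead-link marker inside the bracketed link (hypothetical marker text)."""
    # \3 is empty when the template sits inside the reference (Python 3.5+).
    return DEAD_LINK_RE.sub(r'[\1 (dead link) \2]\3', text)
```

Both `[url title]{{dead link}}</ref>` and `[url title]</ref>{{Dead link|date=...}}` are handled by the same pattern.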

Interwiki position of sk

The bot appears to think that sk (for Slovenčina) comes all the way at the end in the Interwiki sorting order, rather than between szl and cu.[16][17][18][19]  --Lambiam 08:19, 29 May 2008 (UTC)

hmm... I believe this has been fixed during the day. Is that bug still present? NicDumZ ~ 20:39, 29 May 2008 (UTC)

Bug?

Did something go wrong here?—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 13:23, 29 May 2008 (UTC)

mmm... Not really. In fact, my regular expression searches for references made exclusively of links, ignoring the spaces in the references. And then for every such reference, it tries to get a title. When no title is found, the reference is put with the format <ref(name ?)>(link)</ref> without space.
That's what happened here. It seems that only a space was removed, but actually it means that no title / or a bad title has been found, and that the ref has been normalized. Here, the title of http://www-ns.iaea.org/downloads/rw/waste-safety/north-test-site-final.pdf, "Microsoft Word - [...] .doc" is blacklisted and has been ignored.
So, that looks like a bug but isn't. Not editing the page would have been better, yes, but that's not that easy... Or so I think.
NicDumZ ~ 20:45, 29 May 2008 (UTC)
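The behaviour described above can be sketched as follows. This is a simplified illustration, not the bot's actual code: the blacklist patterns, the `get_title` callback and the function name `convert_bare_refs` are all assumptions. It shows why a reference can appear to lose only a space: when no usable title is found, the reference is still rewritten in normalized form.

```python
import re

# Hypothetical blacklist; the bot's real list is not shown in this thread.
BAD_TITLES = re.compile(
    r'(?i)^\s*(untitled( document)?|microsoft word\b.*|.*\bnot found\b.*)\s*$')

# A reference made exclusively of one link, with optional brackets and spaces.
BARE_REF = re.compile(
    r'(?i)<ref(?P<attrs>(?:\s[^>/]*)?)>\s*\[?(?P<url>https?://[^\s\]<>]+)\]?\s*</ref>')

def convert_bare_refs(text, get_title):
    """get_title(url) returns the page's HTML title, or None if unavailable."""
    def repl(m):
        attrs, url = m.group('attrs'), m.group('url')
        title = get_title(url)
        if not title or BAD_TITLES.match(title):
            # No usable title: the ref is still normalized (spaces removed).
            return '<ref%s>%s</ref>' % (attrs, url)
        return '<ref%s>[%s %s<!-- Bot generated title -->]</ref>' % (
            attrs, url, title.strip())
    return BARE_REF.sub(repl, text)
```

Returning the original match unchanged in the no-title branch would be one way to avoid the cosmetic-only edit.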
WP:AWB/FR Dispenser 00:30, 30 May 2008 (UTC)

Bot - eswiki

Hello! You are now flagged on the Spanish wiki. We expect your bot's edits to fix external links. Muro de Aguas (write me) 14:37, 29 May 2008 (UTC)

Cool bot!

Cool bot, nice work, pretty clever. PhycoFalcon (talk) 22:51, 30 May 2008 (UTC)

Russian letters in references

Hello! When you convert references in Russian (thanks for this), please pay attention to the correct character encoding; otherwise the Cyrillic letters look like strange, senseless symbols after your work!

Regards, Vladimir--Vladimir Historian (talk) 12:24, 31 May 2008 (UTC)

Outhouse article

Dear NicDumZ: This article is chock full of bare links, and actually needs to be converted to footnotes. It was one of the first articles I worked on, and I didn't know what I was doing at the time. I've noticed some of your good work, and you seem to be doing a lot of this with a bot. If you could help, it would be greatly appreciated. Thanks. Merci! 7&6=thirteen (talk) 19:01, 31 May 2008 (UTC) Stan

Inadvertent advertising

The HTML title from webpages belonging to magazines, which might be perfectly appropriate links in articles, can make highly inappropriate references, because they are used to give promotional messages about the magazine. For example, there are few more comprehensive and well updated sources in English for following competitive cycling than Cycling Weekly, and so it was a good source for somebody to use to show that two teams had been offered a late entry into the 2008 Giro d'Italia, but it is not the place of Wikipedia to declare, as DumZiBoT did, this publication to be "Britain's biggest-selling cycling magazine, delivers an exciting mix of fitness advice, bike tests, product reviews, news and ride guides for every cyclist". Maybe the Bot needs a filter to change its action when it comes across boastful superlatives, or maybe the automatic editnote should more explicitly invite editors to check the suitability of the results. Just a thought. Kevin McE (talk) 11:28, 1 June 2008 (UTC)

Bot

Hey, can you run your bot through Roy Hibbert again? Thanks! Bash Kash (talk) 21:37, 1 June 2008 (UTC)

Zimbabwean dollar

Howdy! Please could you fix Zimbabwean dollar plz? 129.215.48.99 (talk) 01:00, 5 June 2008 (UTC)

You might want to try using web reflinks, which automatically converts numbered references into ref tags. — Dispenser 04:31, 12 June 2008 (UTC)

Overenthusiastic archiving

Since your last edit on this page, MiszaBot III has moved 10 items onto the archive page, thus leaving some pertinent issues about this bot apparently unaddressed. This does not seem a very satisfactory response to the comments of other editors. Is it possible to disable the archiving bot's trawls of this page while you are not active (only one edit in more than a week)? Kevin McE (talk) 14:48, 8 June 2008 (UTC)

Yep, archiving disabled. Thanks for the report :/ NicDumZ ~ 20:44, 11 June 2008 (UTC)

Hello

When I turn on "Hide bot edits" in my watchlist, this bot's edits still show up. What is going wrong? Thanks. 189.162.18.70 (talk) 03:29, 9 June 2008 (UTC) eswiki

Bot - a minor point

Hi, and well done on your work with the bot - it's a fantastic idea.

Just a tiny thing..

Would it be possible to make the bot not insert a title, when the title is "Untitled" or "Untitled Document"? It won't happen a lot, but if it is easy to do, it's probably worth it.

Don't waste your own time if it'll be a big job. Many thanks, Drum guy (talk) 15:40, 9 June 2008 (UTC)

I believe my bot should ignore these. If, recently, it inserted such a title into an article, please give me a diff of such an edit, so I can track a possible bug. NicDumZ ~ 20:45, 11 June 2008 (UTC)

HTML in title

The bot made an incorrect edit, adding HTML as a title. Most likely, this was because the HTML was incorrectly nested within the title tag in the original page. However, the bot should know to either ignore such titles, or strip out the HTML. Superm401 - Talk 20:34, 10 June 2008 (UTC)

Yes, I could do that.
However, that edit occurred during another of those periods when DumZiBoT was broken.
This lengthy title, containing " the page you requested could not be found.", is supposedly blacklisted. I tested it again, and now the page is being ignored.
NicDumZ ~ 20:49, 11 June 2008 (UTC)
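For the "strip out the HTML" option suggested above, a minimal sketch could look like this. The function name `clean_title` is hypothetical and the tag-stripping regex is deliberately crude; a real implementation would more likely use an HTML parser.

```python
import html
import re

def clean_title(raw):
    """Reduce a scraped <title> value to plain text (illustrative only)."""
    text = re.sub(r'(?s)<[^>]*>', ' ', raw)    # drop any nested tags
    text = html.unescape(text)                 # decode entities such as &#39;
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace
```

A cleaned title that comes back empty, or that matches the blacklist, would then simply be ignored.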

References in templates

Your bot added <references/> to a transcluded template which it shouldn't do. I have wrapped <references/> in <noinclude>s there for now. (Just letting you know). – sgeureka tc 02:17, 13 June 2008 (UTC)

Thanks for the report.
I added a namespace check before adding <references/> :)
NicDumZ ~ 07:47, 29 June 2008 (UTC)
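A namespace check of the kind mentioned above might be sketched like this. It assumes the caller already knows the page's namespace number (0 is the article namespace, 10 is Template); the function name `add_references_section` and the appended section text are assumptions, and the real script presumably inserts the tag at a more careful position than the end of the page.

```python
import re

def add_references_section(text, namespace):
    """Append <references/> only to articles that use <ref> but have no list."""
    if namespace != 0:
        # Templates and other non-article pages may be transcluded: leave them alone.
        return text
    has_refs = re.search(r'(?i)<ref[\s>]', text)
    has_list = re.search(r'(?i)<references\s*/?>|\{\{\s*reflist', text)
    if has_refs and not has_list:
        return text.rstrip('\n') + '\n\n==References==\n<references/>\n'
    return text
```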

Love Bot

I hate the line noise that is required by Cite.php. I've hated on it elsewhere, and I don't intend to mellow with (r)age. Your bot, however, turns ordinary cites into line noise cites in a way that I can't object to. So now I hate you for coming up with something so utterly wonderful that I can't get my hate on.

In case that's not clear, I'd just like to say, for the record, that I completely and utterly hate you :-)

chocolateboy (talk) 16:51, 14 June 2008 (UTC)

Small problem

Your bot seems to have done something slightly weird here [20]. Maybe you need to strip out formatting from within the title. I have corrected the article page. Keep up the good work. 82.1.57.47 (talk) 05:39, 20 June 2008 (UTC)

How about duplicate refs?

This bot is doing awesome work, still. I get a sense of pride since I feel like I "commissioned" this work in the first place. ;-)

I was just editing a page that the bot had "fixed up", but I realized that the people who cited references didn't understand how to cite the same ref more than once. So the page had a bunch of...

<ref>[http://somewhere Somewhere<!--auto bot gen-->]</ref>
<ref>[http://somewhere Somewhere<!--auto bot gen-->]</ref>
<ref>[http://somewhere Somewhere<!--auto bot gen-->]</ref>

... instead of ...

<ref name="somewhere">[http://somewhere Somewhere<!--auto bot gen-->]</ref>
<ref name="somewhere" />
<ref name="somewhere" />

Is there a way for the bot to check for duplicates while it does its work? Could the bot check for duplicates on pages it has already done? I realize this isn't an easy task (what would the refname be, for example?) but I just thought I'd throw it out there.

Timneu22 (talk) 13:36, 22 June 2008 (UTC)

Yes, I've been asked to do this for a while :)
I wrote something to handle this (when no refname is specified in the article body, refname is "autogenerated#" where # is an id)
I'm testing it on fr:, where I won't get beaten up as much if I make some small mistakes :)
I will then apply it to en :)
NicDumZ ~ 10:20, 29 June 2008 (UTC)

Combines references that aren't the same and orphans references

Something isn't quite right about how it detects/combines duplicate references. See this edit [21]. Specifically the part where it changed: <ref name="update1">IUDs—An Update. [http://www.infoforhealth.org/pr/b6/b6chap1.shtml#top Chapter 1: Background].</ref> into <ref name="population" />

This was wrong in that name=population refers to a different chapter in the referenced item, it also broke other places where name="update1" was referred to.

  • If it is going to rename a ref, it needs to rename all instances, not just the defining instance.
  • Something is wrong in how it decides that two references are the same, the ref with name=population has a different (although similar) URL and a different displayed name, so someplace it made a mistake by deciding to combine them.

(The other combination on the page worked okay.) I repaired the page. Hope this helps improve the bot. Thanks. Zodon (talk) 19:02, 22 July 2008 (UTC)

Hello !
First of all, thanks for the kind (calm) report. These really help to improve the bot, indeed.
However, I think that the page was buggy before my bot :) See this, the ref name "population" referred to two different references, hence the strange behavior.
You are right, however, about your first point (not renaming all the instances), this has been fixed in my script, per this
Thanks a lot, Zodon, this was a very useful report.
NicDumZ ~ 10:27, 23 July 2008 (UTC)
Glad it helped. I missed the bit about there being two items with name=population. Assume you also augmented the bot so that it will detect when there are multiple different references with the same name, and fix the naming issue if it can (e.g. 2 definitions of name with no other uses of it), or flag it if it can't. (I imagine this isn't the only page with that error.) Thanks. Zodon (talk) 05:11, 24 July 2008 (UTC)
True, it was a bit tricky, but I just implemented it: the second duplicate "population" gets changed into autogenerated1 :)
Thanks again !
NicDumZ ~ 09:54, 24 July 2008 (UTC)
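The fix described above (a second, different definition under an already-used name gets an "autogenerated#" name) can be sketched as below. This is an illustration under assumptions, not the committed code: the function name `split_conflicting_names` is hypothetical, only double-quoted names are handled, and short uses such as <ref name="population"/> are left pointing at the first definition because nothing in the wikitext says which definition they meant.

```python
import re

NAMED_REF = re.compile(
    r'(?is)<ref\s+name\s*=\s*"(?P<name>[^"]+)"\s*>(?P<body>.*?)</ref>')

def split_conflicting_names(text):
    """Rename later definitions that reuse a ref name with different content."""
    first_body = {}   # name -> body of the first definition
    renamed = {}      # (name, body) -> replacement name
    counter = [0]

    def fresh_name():
        # Pick the next "autogenerated#" name not already used on the page.
        while True:
            counter[0] += 1
            candidate = 'autogenerated%d' % counter[0]
            if candidate not in text:
                return candidate

    def repl(m):
        name, body = m.group('name'), m.group('body')
        if first_body.setdefault(name, body) == body:
            return m.group(0)          # same content: a genuine duplicate
        key = (name, body)
        if key not in renamed:
            renamed[key] = fresh_name()
        return '<ref name="%s">%s</ref>' % (renamed[key], body)

    return NAMED_REF.sub(repl, text)
```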

A new named extlinks bot

Greetings. My name's Quadell, and I run Polbot. I'm considering creating a bot to perform various improvements in external links and references, and I'm a fan of DumZiBoT's task of naming links. I was wondering if you'd be willing to share your code, or at least the regexps you use, for DumZiBoT. That way I'm only partially recreating the wheel. (I'll be working on specific categories in real time, rather than using a database dump, by the way.) Thanks so much! – Quadell (talk) (random) 00:20, 29 June 2008 (UTC)

Sure ! The code is available in the pywikipedia repository, under GPL. Let me know how successful you are in your quest :p
NicDumZ ~ 07:38, 29 June 2008 (UTC)
Thanks for the link! Very helpful indeed. But I can't see where, or how, you merge duplicate refs. At User:DumZiBoT/refLinks, you say "When duplicate references are found (i.e. references having the exact same content) only the first is kept, and a refname is added to the others ( example )" That's the last part I can't figure out how to do, without making bad errors. Where is the code for this function? By the way, I'm requesting approval at Wikipedia:Bots/Requests for approval/Polbot 8. Thanks again, – Quadell (talk) (random) 01:39, 7 July 2008 (UTC)
That's because it has not been committed yet, for it's untested. (see the above section :) )
From the few pages I modified, it seems to work, but I'd like to test it on the next French dump, and receive feedback on it, before committing it.
NicDumZ ~ 08:03, 7 July 2008 (UTC)

Loading several pages at once with the wikipediabot

Hi, I'm in the process of creating a script that checks for a message that is supposed to be above the interwiki in every article on the nn wikipedia. Now, what I don't get is how to load several pages at once. The getall function in wikipedia.py doesn't return anything; so I don't know how to use it, if I in fact should. Thanks in advance. --Harald Khan Ճ 19:40, 6 July 2008 (UTC)

I apologize if I asked the wrong guy ;-) --Harald Khan Ճ 17:30, 31 July 2008 (UTC)

Changing refs in HTML comments

Re this edit. There is no need to go and edit ref tags when it is in a HTML comment. It adds nothing of value to the article. Mikemill (talk) 20:28, 21 July 2008 (UTC)

Sorry for the disturbance :)
This case is a bit unusual, but it usually does no harm to format the references inside HTML comments, while checking for a wrapping HTML comment can become really, really complicated.
I just upgraded my script to ignore blank (whitespaces only) references.
Thanks for your report ;)
NicDumZ ~ 20:36, 21 July 2008 (UTC)
Thanks for the change. IIRC HTML comments are not allowed to be nested inside of each other so it should be a matter of looking for the <!-- token and the --> token and removing the stuff inbetween. However, I'm not sure what format you are looking at the page code in so if you believe it is too hard then so be it ;) Mikemill (talk) 13:20, 22 July 2008 (UTC)
Sure, nested comments are not allowed. However, I work with regular expressions to detect the different types of references, and the thing is: I can't "exclude" the commented-out wikitext from these matches. I could remove all the commented text before doing any work, but then I would have to re-add it after processing the references, and that might be tricky. Also, when working on a specific part of the text, testing whether it is nested in an HTML comment is easy, so I could test every reference one by one and exclude the ones inside comments. However, the last part of my script is basically: "for each reference found, replace it with the processed one". Now imagine two identical references, the first inside a comment, the second not commented out. If I want to ignore the first one but not the second, it gets really hard, while processing all identical references the same way is really easy.
There are ways, of course, to take special care of these cases, but honestly, I don't think that the very rare cases where it causes problems justify spending so much time on code :)
NicDumZ ~ 13:31, 22 July 2008 (UTC)
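One possible way around the "two identical references" difficulty is to substitute by match position instead of by string, so that each occurrence is decided on its own. The sketch below assumes that approach; `sub_outside_comments` and the `process` callback are hypothetical names, and it is not how the bot is actually written.

```python
import re

COMMENT_RE = re.compile(r'(?s)<!--.*?-->')
REF_RE = re.compile(r'(?is)<ref(?:\s[^>/]*)?>.*?</ref>')

def sub_outside_comments(text, process):
    """Apply process() to every <ref>...</ref> that does not start inside a comment."""
    # HTML comments cannot nest, so a flat list of spans is enough.
    spans = [m.span() for m in COMMENT_RE.finditer(text)]

    def in_comment(pos):
        return any(start <= pos < end for start, end in spans)

    def repl(m):
        if in_comment(m.start()):
            return m.group(0)        # commented-out reference: leave untouched
        return process(m.group(0))

    return REF_RE.sub(repl, text)
```

Because `re.sub` hands each match to the callback together with its position, the commented copy and the live copy of an identical reference can be treated differently.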

Hey

In the word "dumb" the letter 'b' is silent. Cheers. 89.243.32.74 (talk) 12:04, 22 July 2008 (UTC)


Messed up sources in Hamas article

Hello. On the last run of the bot on the Hamas article all references were messed up from ref. no 37 and onwards. Can you please recommend a solution? Thanks. Tkalisky (talk) 04:11, 23 July 2008 (UTC)

Sorry... I just fixed this here !
I have fixed my bot, thanks for the report :)
NicDumZ ~ 10:42, 23 July 2008 (UTC)

Edit summary?

Hey NicDumZ. I noticed DumZiBoT is removing duplicate citations, like (s)he did at Coronary artery bypass surgery here. It took me a second, because the edit summary (s)he left was "(Bot: Converting bare references, see FAQ)". Otherwise, on a cursory glance, it looked like s(he) had just deleted two references... After a second, I realized they were duplicates, when I noticed that DumZiBoT left the "ref name" in place. I see that's mentioned briefly under the "features" list in the FAQ... Is it possible for DumZiBoT's edit summary to say something like "Bot: References cleanup. Converting bare refs, removing and adding ref name for duplicates, adding missing ref lists; see FAQ." I guess that's kind of long, but it's just that s(he) isn't just converting bare references, and that can cause confusion (at least for me?)...   user:j    (aka justen)   15:45, 23 July 2008 (UTC)

Thanks for this comment. The summary was of course the old summary, from the first version of my script.
I just changed it, hoping that it's better.
There was a huge bug, preventing any new titles from being added, it has also just been fixed.
Again, thanks for this one, it has been very useful.
NicDumZ ~ 16:15, 23 July 2008 (UTC)

Bot error

The bot did bizarre things at Mongol Empire, changing valid ref names to "autogenerated" and messing up some other tags,[22] so I went ahead and reverted. This was a few hours ago, so I don't know if there might be other damaged articles, but I would recommend checking. --Elonka 18:02, 23 July 2008 (UTC)

True, this is buggy. The bot has stopped since, so no need to worry about it, but I will make sure to fix that one before restarting it, once back home.
Thanks for the accurate report ;)
NicDumZ ~ 19:11, 23 July 2008 (UTC)
Eventually, I fixed it !
Thanks again :)
NicDumZ ~ 09:07, 24 July 2008 (UTC)

Bot messed up Warren Buffett page

Categories were not showing up after bot's visit. I reverted the changes. —Preceding unsigned comment added by Kaka2008 (talkcontribs) 18:34, 23 July 2008 (UTC)

I believe this had nothing to do with my bot, so I reverted back :)
NicDumZ ~ 18:57, 23 July 2008 (UTC)

Dup refs

It looks like DumZiBot is fixing duplicate refs and doing a great job at it. Huzzah! – Quadell (talk) 19:10, 23 July 2008 (UTC)

Well, this particular edit is broken :P (see above) There's no need to replace existing reference names with "autogenerated". It has something to do with quotes. The script works with name="blah" but replaces name=blah with name="autogenerated#". Minor mistake, yes, but easy fix; I'll look into this tomorrow.
But still, yay ! :)
NicDumZ ~ 19:26, 23 July 2008 (UTC)

Bot is substituting bad names for working references

Please see this diff: [23]

I believe the bot is doing this because it does not see "quotes" around the name. Quotes are not necessary for single-word names in references. As in

<ref name=WORD>

"Quotes" are only needed if there is more than one word. As in

<ref name="WORD1 WORD2"> --Timeshifter (talk) 22:28, 23 July 2008 (UTC)
Yes, you are right, even if it has been reported to me above :p
I just fixed it, sorry for the inconvenience. My next move will be to modify my script, not to add quotes when those weren't there, to avoid what's happening in that last diff :)
NicDumZ ~ 09:21, 24 July 2008 (UTC)
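A pattern that accepts both name=WORD and name="WORD1 WORD2", and that keeps whichever quoting style the editor used, could look like the sketch below. The function name `rename_ref` is hypothetical and the pattern is an assumption about what such a fix might involve, not the script's actual regex.

```python
import re

REF_TAG = re.compile(
    r'''(?i)<ref\s+name\s*=\s*(?P<quote>["']?)(?P<name>[^"'>/]+?)(?P=quote)\s*(?P<slash>/?)>''')

def rename_ref(text, old, new):
    """Rename every <ref name=old> and <ref name=old/> tag, quoted or not."""
    def repl(m):
        if m.group('name').strip() != old:
            return m.group(0)
        quote = m.group('quote')
        if not quote and re.search(r'\s', new):
            quote = '"'              # multi-word names need quotes
        return '<ref name=%s%s%s%s>' % (quote, new, quote, m.group('slash'))
    return REF_TAG.sub(repl, text)
```

Since the defining tag and the short-form tags are matched by the same pattern, all instances of a name are renamed together.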

Bot is combining 2 different references with the same name

Casualties of the Iraq War. Please see this diff: [24]

The bot combined 2 different references into one. If one looks at the page at the above diff link and searches for

<ref name=LAtimes>

one sees that there were 2 references mistakenly using the same name. The easy way for the bot to tell might be to compare the URLs. The bot gave the 2 references the same name. One of the references was then no longer used.

Maybe the bot could give one of the references a different name. Casualties of the Iraq War may be a good beta workout for the bot since it has over 150 references. --Timeshifter (talk) 13:36, 24 July 2008 (UTC)

And if you look a little bit closer, you'll see that the first reference using LAtimes ""War's Iraqi Death Toll Tops 50,000". Louise Roug and Doug Smith. Los Angeles Times. June 25, 2006." is kept, while the second, ""Poll: Civilian Death Toll in Iraq May Top 1 Million". By Tina Susman. Sept. 14, 2007. Los Angeles Times." is converted into "ORB2" :)
This matter had been raised earlier in the day, and has been fixed since.
NicDumZ ~ 13:46, 24 July 2008 (UTC)

(unindent) OK, I figured it out, I think. The bot tried to separate the 2 references and saw that one of the references had been given the name "ORB2" elsewhere (probably by me long ago).

But the bot got mixed up sometimes and put the "ORB2" reference where the "LAtimes" reference was.

It is easiest to see by looking for this line in the article:

A June 25, 2006 ''[[Los Angeles Times]]'' article, "War's Iraqi Death Toll Tops 50,000",<ref name=LAtimes/>

The bot mistakenly substituted the "ORB2" reference at the end of that line. I don't know how the bot could have known what reference name to use there though.

The bot made a mistake at another location too. Look for

A June 25, 2006 ''[[Los Angeles Times]]'' article<ref> [http://www.commondreams.org/headlines06/0625-03.htm "War's Iraqi Death Toll Tops 50,000"]. Louise Roug and Doug Smith. ''[[Los Angeles Times]].'' June 25, 2006.</ref>

The bot substituted the "ORB2" reference for that reference. The bot maybe should have looked at the URL and given it the "LAtimes" reference name instead.

Maybe when there are 2 references using the same name, the bot should be instructed not to do anything unless it can also see the URL. Otherwise the bot is guessing. Since the bot can't think it can't do some tasks. Or the bot should put BOTH names. At least that way the readers might be able to figure out which one is the reference.

Better yet, the bot should flag the mistake somehow, so that the article editors can fix it. It would be nice if the bot could leave a note on the talk page. Edit history comments might not get seen by most editors if a later edit occurs soon after the bot edit. --Timeshifter (talk) 15:36, 24 July 2008 (UTC)

Okay, here is a fixed reference fix on that same article, copied in my fr: userspace, and here is the message added on the talk page of this article.
Better, isn't it?
NicDumZ ~ 18:03, 24 July 2008 (UTC)
That is great! I don't see the PDF URLs though on the talk page notice:
I see the PDF URLs in the diff above the talk page notice, though. --Timeshifter (talk) 19:38, 24 July 2008 (UTC)
Because it uses Tl:PDFlink, which probably does not do the same thing on fr: as on en: :p If not, please explain to me again what you're saying :)
NicDumZ ~ 19:41, 24 July 2008 (UTC)
That link goes to a redirect page with a link. Clicking the link goes to here:
http://fr.wikipedia.org/wiki/Mod%C3%A8le:Pdf
What is it supposed to do exactly? Hide the PDF? I can't figure out what it is. --Timeshifter (talk) 02:39, 25 July 2008 (UTC)
We don't use it the same way. Pdf does not take any parameter on fr, it is just used to "flag" pdf documents, as (in French) flags French documents. Anyway we don't care, do we? What's important is the wiki text; if I copy it back to en:, will it work? :)

(unindent). OK. Here is the same wikicode here on an English talk page:

Great! It is working fine here. --Timeshifter (talk) 15:23, 25 July 2008 (UTC)

Yup, I have opened a BRFA for that particular new task :)
NicDumZ ~ 15:25, 25 July 2008 (UTC)

I think your bot may be misbehaving :-)

Hi, four pages that I watch (because they all relate to Buckinghamshire) have just had their bare refs given an automatic title by your bot, which is fine, except that the name given to them was "Check Browser Settings" ([26] [27] [28] [29]). The external link in all cases was The Office of National Statistics, which I have been able to get into without problem, and have amended the refs accordingly. I recognise it may be a problem with the website itself as the bot doesn't appear to have done the same for other refs, however I'm wondering which other articles it has done this to? -- roleplayer 14:32, 24 July 2008 (UTC)

Bot Philosophy 101

This bot is fantastic when it is working right. It is a very needed bot. Many people just leave URLs as references without filling out the reference details.

I believe though that references are important above almost all else. So I suggest that the bot philosophy should be to err on the side of leaving incomplete references in the articles if the alternative is that the bot in effect removes some references in its efforts to fill in the details on all the references.

In other words when in doubt leave the reference in question as it is. For example; when one name is used for 2 references as described in a previous talk section.

<ref name="WORD or PHRASE">

This way the bot is almost always helping, and never hurting. Otherwise it may be creating many errors that may never get noticed. It already has created such errors. There is almost no way to know how many. I hope you keep bot runs short in order to allow people to comment, and to allow time to fix both the bot and the page errors. --Timeshifter (talk) 15:48, 24 July 2008 (UTC)

My bot only runs when told to, and while it is running, I'm constantly watching my talk page. It is stopped, and I'm working on an upgrade to leave dubious references alone and to notify the editors on the talk page. Don't worry ;)
NicDumZ ~ 15:54, 24 July 2008 (UTC)

Removing ref content

[30] In the diff, the bot removed the content for the named ref. Perhaps it was confused by the other uses of the name, some of which come before the definition in the text? Gimmetrow 14:43, 25 July 2008 (UTC)

Ah, this was a strange one. If you look at the wiki text after the diff, there IS a definition of the ref name WWEBio before the first use of <ref name="WWEBio" />, so I could not understand why the references were broken.
But... the parameter "billed height" did not exist :) so I replaced it with "height", and everything is repaired :)
NicDumZ ~ 14:59, 25 July 2008 (UTC)