User talk:Jnc/EditCount


Database groveling

...it blew out again, at close to the same point (it might have been the exact spot; I didn't keep the other copy to be sure). ... You said you got about 80% of the way through, doing the gunzip | grep on your machine? If you still have the output file from that, or if you're willing to run it again, it would be interesting to try my program on it, to see if it works OK and what the data looks like. ... Noel (talk) 09:58, 22 July 2005 (UTC)
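(The actual command line never appears in this thread; the following is only a sketch of what such a gunzip | grep pass might look like. The dump filename and the <username> pattern are assumptions based on the MediaWiki XML dump format, not anything Noel or Who posted.)

    # Hypothetical sketch; filename and pattern are assumptions. Each
    # revision in the XML dump has a <contributor> block containing a
    # <username> (or <ip>) element, so pulling out the usernames gives
    # the raw material for per-user edit counts.
    gunzip -c pages_full.xml.gz \
      | grep -o '<username>[^<]*</username>' \
      > usernames.txt

    # Tally into a ranked list:
    sort usernames.txt | uniq -c | sort -rn | head -1000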

Yea, I think I was wrong about how far it got; the output is only 25 MB. I had too many complaints about the lag during its run (the server hosts lots of sites), so I couldn't crunch it again. I don't have my local Linux server running here; I was gonna run it on my PC, I just have to get it set up to do so. I see you may have gotten further on the talk page. Sorry I couldn't help further at the moment. If you still aren't getting anywhere, let me know, as I will be trying to get it done on my local PC; I just have to re-download the dump, since I deleted it off my local PC. Who?¿? 19:19, 22 July 2005 (UTC)

Yea, I deleted the dump, so I would have to re-download it. I still have it on my server, but can't re-grep it there. It's no biggy to re-download it though; I got it in about 2 hours, I think. Granted, I won't be here the rest of the night, but it can download in the background. Who?¿? 20:06, 22 July 2005 (UTC)

Let me make sure I understand you. You downloaded the full dump to your server, and still have it there? And to do any actual processing on it, you'd have to copy it to another machine you have, on which you used to have a copy, but from which you have now deleted it? Noel (talk) 13:22, 23 July 2005 (UTC)

Yea, I downloaded the full dump both to the server and my local PC; the server couldn't handle the greps because of how much work goes on for the websites. I'm not sure if my friend deleted the dump on the server yet. And yes, to do any further work on it, I would re-download it to my PC and find the right tools (Windows on the PC, Linux on the server) to work on it. I have grep, but I'm not sure I like how it works, and I would have to write a program to parse the rest of the data, which isn't really that difficult anyway. I wasn't sure how much you had completed, so I didn't re-download the dump again. Which I can still do. Who?¿? 17:47, 23 July 2005 (UTC)

Yea, I believe I still have it, or can get it fairly quickly. I can run your prog on mine if you want. Sorry it took a while to get back; I've been gone all day. Who?¿? 20:59, 24 July 2005 (UTC)
Will have it again in about 1 hour; the download is 55% done. Compiled your prog on my laptop; it's a bit stronger than my PC. Will let you know the outcome in a few hours. Who?¿? 22:33, 24 July 2005 (UTC)

Now I am getting this error:

unexpected char '=' in '== Speculative non-fiction books
about artificial intelligence ==</comment>'

I'm sure it's just an extra character in front of the '=', so I'm gonna decompress the structure and fix the error. If that doesn't work, I'll just edit your prog to look for it and recompile it. Who?¿? 00:18, 25 July 2005 (UTC)
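(For anyone retracing this, a minimal sketch of the kind of pre-pass being described; the dump filename is an assumption, and the sed line only handles comments that fit on one line.)

    # Hypothetical: find where the bad <comment> element sits in the stream.
    gzip -dc pages_full.xml.gz | grep -n 'Speculative non-fiction books'

    # Or blank out comment bodies entirely before feeding the parser
    # (single-line comments only; multi-line ones need more care):
    gzip -dc pages_full.xml.gz \
      | sed 's|<comment>[^<]*</comment>|<comment></comment>|g' \
      > cleaned.xml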

Nope, it's an error in the dump file; I'm going with the full page dump of 16JUL right now. Who?¿? 00:56, 25 July 2005 (UTC)
Ok, same error, I think at the same spot in the new dump. I think the problem is with gzip; I just found a patch on their page discussing long files. Have to compile a new version and try again. Who?¿? 04:28, 25 July 2005 (UTC)
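(If the patch in question is the well-known large-file issue: the gzip trailer stores the uncompressed length as a 32-bit value, i.e. modulo 2^32, so gzip -l misreports anything over 4 GiB and some tools mistake such files for truncated ones. A quick way to check, with the filename again assumed:)

    # gzip -l only reads the 32-bit trailer, so it shows size mod 2^32:
    gzip -l pages_full.xml.gz

    # Streaming the whole file through gives the real uncompressed size:
    gzip -dc pages_full.xml.gz | wc -c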
That didn't work either. It seems the real error is either a missing EOF on the dumps, or the programs are failing. I tried 2 versions of gzip, rar, and WinZip; they all errored out. I finally got it unzipped, so I think this is a fully parsed file, using your program (http://questdesign.net/db.xml). I parsed another copy of the dump and got the same size file, 683k. http://questdesign.net/output.xml.gz is the output, 169 MB uncompressed. So I'm guessing even with the error, it may be all of it? Who?¿? 06:23, 25 July 2005 (UTC)

Yea, sorry, I can tend to be cryptic when I have a thought in my head. Ok, so yea, I still have the full dump on my laptop and the server. I tried two different dumps and got the same error in the same spot. I tried the gzips, rar, and WinZip on the laptop, all errors; so I monitored progress, and it seemed to make it to EOF before the error, but that was way off. On the server, it didn't get the error till later on, so I cheated and did "gzip -c -d -f > output.xml" to get as much of the dump as possible unzipped. I figured since I got a 169 MB output after running your program, I had it all, or most of it. I forgot to tail it. We need to get another dump from somewhere; I thought I saw another mirror for the dumps, but don't remember where. Who?¿? 12:56, 25 July 2005 (UTC)
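(A sketch of the completeness check being described: MediaWiki XML dumps close with a </mediawiki> tag, so tailing the forced-decompression output shows whether the stream actually reached the end. Filenames here are assumptions.)

    # -c -d -f writes the decompressed stream to stdout; even if gzip
    # errors out partway, everything decoded before the error is kept:
    gzip -cdf < pages_full.xml.gz > output.xml

    # A complete dump should end with the closing root element:
    tail -c 200 output.xml    # look for </mediawiki>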

Well, it seems Jamesday did a SQL query against the local db with this script, so I guess that is settled. I updated Wikipedia:List of Wikipedians by number of recent edits with the data from the recent list. Oh well, I can't do anything if they do it from the local db, but at least it's done. Who?¿? 17:32, 26 July 2005 (UTC)
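(Jamesday's script isn't reproduced here; the following is only a sketch of the kind of per-user count such a query might compute, assuming direct access to a MediaWiki 1.5-style database where each edit is a row in the revision table.)

    # Hypothetical sketch only, not Jamesday's script. Assumes shell
    # access to the database host; output is tab-separated.
    mysql --batch wikidb <<'SQL' > editcounts.tsv
    SELECT rev_user_text AS user, COUNT(*) AS edits
    FROM revision
    GROUP BY rev_user_text
    ORDER BY edits DESC
    LIMIT 1000;
    SQL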

re: edit count table

Hi - The positional delta (up/down listed in the table) is currently done based on the "last week's ranking" in the csv. I don't think the "this week's ranking" number is used at all (it's recomputed). We could drop the positional delta, or modify the script to take as input the output from the last time it was run. The version I've written is a shell script, so it's already not what you'd call frighteningly fast. If it needs to keep track of the positional delta, I think I might redo it in nawk or something (perl might be better, but I haven't done a lot of perl). So, do you think we need to keep the positional delta? -- Rick Block (talk) 21:41, July 22, 2005 (UTC)

Looking at the script, it does use the "this week's ranking" to compute the positional delta (it doesn't currently recompute). I'll work on a version that doesn't use the ranking information. -- Rick Block (talk) 21:53, July 22, 2005 (UTC)
Easiest would be the same format as the previous .csv file, but it's only a 50-line shell script so it doesn't really make a lot of difference. Actually, it would be good to have an indicator of whether the user is a bot or not (which I think is in the database; I'm doing this now with an explicit list in the shell script). If the file doesn't have the old/new positions in it (which I think is reasonable), I'm thinking about making the positional delta be "since last posted", probably identifying it by date. The only issue with this is folks newly appearing on the list. What do you think about "new" for these? -- Rick Block (talk) 00:22, July 23, 2005 (UTC)
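(A minimal awk sketch of the "since last posted" idea, with "new" for first-time entries. The field layout, filenames, and rank-equals-line-number assumption are hypothetical; this is not the wp1000 script.)

    # Hypothetical sketch, not the posted script. Assumes two files of
    # "user,count" lines already sorted by count descending, so a user's
    # line number is their rank; prints user, current rank, and the delta
    # against the last posted list ("new" for first-time entries).
    awk -F, '
        NR == FNR { oldrank[$1] = FNR; next }   # first file: last posted
        {
            delta = ($1 in oldrank) ? oldrank[$1] - FNR : "new"
            print $1, FNR, delta                # positive delta = moved up
        }
    ' last_posted.csv this_week.csv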
I posted awk versions of the scripts at user:Rick Block/wp1000. If you want to run these on the csv file you're generating and update the article that'd be fine with me. -- Rick Block (talk) 20:45, July 23, 2005 (UTC)

List of Wikipedians by number of edits

Hi Noel, I added a reply to your message at Computer help desk/Dmcdevit 20050718 regarding the Wikipedia:List of Wikipedians by number of edits. I'd like to centralize communication there (so there is a record and so we don't have to fuss with talk pages), so would you be willing to add that page to your watchlist? I'm sure we can get this one tackled together. Triddle 17:14, July 24, 2005 (UTC)

In case you didn't notice, Jamesday just updated the list and posted an SQL script to regenerate it (which I assume needs a complete copy of the database). -- Rick Block (talk) 23:05, July 25, 2005 (UTC)