Creating a .txt version of Wikipedia - log
Download the English Wikipedia HTML dump of April 2007 (the most recent complete English version available) from: http://dumps.wikimedia.org/
Files:
- wikipedia-en-html.0.7z 1 GB ext.: ~12 GB
- wikipedia-en-html.1.7z 897 MB ext.: 11,3 GB in 999.992 files
- wikipedia-en-html.2.7z 937 MB ext.: ~11 GB
- wikipedia-en-html.3.7z 961 MB ext.: ~11 GB
- wikipedia-en-html.4.7z 886 MB ext.: ~11 GB
- wikipedia-en-html.5.7z 947 MB ext.: ~11 GB
- wikipedia-en-html.6.7z 772 MB ext.: ~12 GB
- wikipedia-en-html.7.7z 846 MB ext.: ~12 GB
- wikipedia-en-html.8.7z 255 MB ext.: ~7 GB
- total size: ~7,5 GB ext.: ~110 GB
Delete folders "images", "raw", "skins" and "upload"
Using Cygwin:
- Delete files smaller than 2KB - delete redirect pages ****
- Delete files with "Talk~" and "_talk~" in their name - delete all talk pages ****
- Delete files with "Image~" in their name - delete all image pages (no images anyway) ****
- Delete files with "User~" and "User_" in their name - delete all user pages ****
- Delete files with "Wikipedia~" in their name - delete archived debates and articles for deletion ****
- Delete files with "WP~" in their name - delete policy pages and remaining redirect pages ****
- Delete files with "Portal~" in their name - delete portal pages ****
- 0: 6,72 GB (7.222.928.116 bytes) (7,42 GB on disk) in 365.064 files
- 1&2: 9,85 GB (10.585.485.390 bytes) (10,9 GB) in 581.674 files
- 3,4,5: 13,8 GB (14.828.113.013 bytes) (15,3 GB) in 797.025 files
- 6,7,8: 7,86 GB (8.447.226.758 bytes) (8,72 GB) in 438.116 files
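
The deletion rules above can be sketched in Python as follows. This is only an illustration, not the Cygwin commands that were actually used (those are not recorded in this log). The names `UNWANTED_FRAGMENTS`, `should_delete` and `prune` are made up for the example, and "2KB" is assumed to mean 2048 bytes.

```python
from pathlib import Path

# Name fragments copied from the list above; each marks a non-article page.
UNWANTED_FRAGMENTS = (
    "Talk~", "_talk~", "Image~", "User~", "User_",
    "Wikipedia~", "WP~", "Portal~",
)
MIN_SIZE_BYTES = 2 * 1024  # assumption: "2KB" read as 2048 bytes


def should_delete(path: Path) -> bool:
    """Return True if the file matches one of the deletion rules."""
    if any(fragment in path.name for fragment in UNWANTED_FRAGMENTS):
        return True
    # Very small files are treated as redirect pages.
    return path.stat().st_size < MIN_SIZE_BYTES


def prune(root: Path, dry_run: bool = True) -> list[Path]:
    """Collect matching .html files below root; delete them unless dry_run."""
    matches = [p for p in root.rglob("*.html") if p.is_file() and should_delete(p)]
    if not dry_run:
        for p in matches:
            p.unlink()
    return matches
```

Running `prune(Path("wikipedia-en-html"))` (a hypothetical folder name) with the default `dry_run=True` only lists the candidates, which is worth doing first: a plain substring test on "User_" would also match ordinary article names such as "User_interface".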
Use the freeware htmlastxt to convert the HTML files to .txt
Note: some files could not be converted: corrupted files, files containing characters too unusual to work in .txt format, and articles whose names are too long
- 0: 1,77 GB (1.905.887.280 bytes) (2,40 GB on disk) in 360.838 files - 4.226 files not converted
- 1&2: 2,24 GB (2.410.544.202 bytes) (3,30 GB) in 573.434 files - 8.240 files lost
- 3,4,5: 3,10 GB (3.338.965.872 bytes) (4,56 GB) in 782.578 files - 14.447 files lost
- 6,7,8: 1,58 GB (1.702.757.532 bytes) (2,28 GB) in 386.153 files - 7.321 files lost (+ tidied)
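
For reference, the HTML-to-text step can be approximated with the Python standard library alone. This is a minimal sketch and does not reproduce what htmlastxt does internally; the class `TextExtractor` and the function `html_to_text` are names invented for the example, and the output layout (one text fragment per line) is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A file would be converted by reading it, passing the contents to `html_to_text`, and writing the result next to it with a .txt extension. Reading with `errors="replace"` is one way to avoid losing files over undecodable characters.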
Delete old .html files
- Delete files with "Category~" in their name - delete all category pages (not the number/history ones) ***
- Delete files with "Template~" in their name - delete all standalone template pages ***
- Delete files smaller than 2KB - delete empty articles (quite a few there)
- Delete empty folders x3
- 3,4,5: 2,9 GB (4,16 GB on disk) in 676.793 files
- Total: 7,29 GB (7.833.173.271 bytes) (10,3 GB on disk) in 1.721.590 files
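
The "delete empty folders x3" step presumably had to be repeated because removing an empty folder can leave its parent empty in turn. A bottom-up walk handles nested empties in a single pass. This is a sketch under that assumption; `remove_empty_dirs` is a name invented for the example.

```python
import os
from pathlib import Path


def remove_empty_dirs(root: Path) -> int:
    """Delete empty directories below root; return how many were removed."""
    removed = 0
    # topdown=False visits children before their parents, so a folder that
    # becomes empty once its subfolders are gone is caught in the same pass.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        directory = Path(dirpath)
        if directory == root:
            continue  # keep the root folder itself
        if not any(directory.iterdir()):
            directory.rmdir()
            removed += 1
    return removed
```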
Tidy up what slipped through the automated deletion:
- create numbers folder
- create special characters folder
- create list folder
- create favourites folder
Delete strings in .txt files:
From Wikipedia, the free encyclopedia
You have new messages (last change).
[edit] = ++
Views
- + Article
- + Discussion
- + Current revision
Navigation
- + Main page
- + Contents
- + Featured content
- + Current events
interaction
- + About Wikipedia
- + Community portal
- + Contact us
- + Make a donation
- + Help
Search
In other languages
- + All text is available under the terms of the GNU Free Documentation License. (See <Copyrights> for details.)
- Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a US-registered 501(c)(3) tax-deductible nonprofit charity.
- + About Wikipedia
- + Disclaimers
- Final: ++ wikipedia.txt : 5,71 GB (6.135.151.270 bytes) (9,22 GB on disk) in 1.638.226 files (about 1 GB once compressed with 7z)
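
The string removal above can be sketched as a whole-line filter. The log does not say which tool was used or whether the strings were matched as whole lines, so both are assumptions here. `strip_boilerplate` is a name invented for the example, and only two unambiguous strings from the list are shown.

```python
# Illustrative subset of the strings listed above.
BOILERPLATE_LINES = {
    "From Wikipedia, the free encyclopedia",
    "You have new messages (last change).",
}


def strip_boilerplate(text: str, phrases: set[str] = BOILERPLATE_LINES) -> str:
    """Drop every line that, ignoring surrounding whitespace, equals a phrase."""
    kept = [line for line in text.splitlines() if line.strip() not in phrases]
    return "\n".join(kept)
```

Matching whole lines rather than substrings keeps article sentences that merely mention one of these phrases intact.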
Junk left:
- Some empty folders
- Some redirect pages
- Various picturesque junk pages
PS: A big thanks to the people at the Computer Ref Desk for the help with Cygwin commands.