Creating a .txt version of Wikipedia - log

Download the English Wikipedia HTML dump of April 2007 (the most recent complete English version available) from: http://dumps.wikimedia.org/

Files (compressed size, then extracted size):

  • wikipedia-en-html.0.7z: 1 GB, extracted: ~12 GB
  • wikipedia-en-html.1.7z: 897 MB, extracted: 11,3 GB in 999.992 files
  • wikipedia-en-html.2.7z: 937 MB, extracted: ~11 GB
  • wikipedia-en-html.3.7z: 961 MB, extracted: ~11 GB
  • wikipedia-en-html.4.7z: 886 MB, extracted: ~11 GB
  • wikipedia-en-html.5.7z: 947 MB, extracted: ~11 GB
  • wikipedia-en-html.6.7z: 772 MB, extracted: ~12 GB
  • wikipedia-en-html.7.7z: 846 MB, extracted: ~12 GB
  • wikipedia-en-html.8.7z: 255 MB, extracted: ~7 GB
Total size: ~7,5 GB, extracted: ~110 GB

Delete folders "images", "raw", "skins" and "upload"
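
A minimal Python sketch of this step. It assumes the four folders sit directly under the directory the archive was extracted into; `dump_root` and the function name are made up for illustration and are not from the original run.

```python
import shutil
from pathlib import Path

# Folder names taken from the log; they hold no article text.
UNWANTED_DIRS = ("images", "raw", "skins", "upload")


def remove_unwanted_dirs(dump_root):
    """Delete the unwanted folders found directly under dump_root.

    Returns the list of folders that were actually removed.
    """
    removed = []
    for name in UNWANTED_DIRS:
        target = Path(dump_root) / name
        if target.is_dir():
            shutil.rmtree(target)  # removes the folder and everything in it
            removed.append(target)
    return removed
```

Folders that are missing are skipped, so the function can be run once per extracted archive.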

Using Cygwin:

Delete files smaller than 2 KB - delete redirect pages ****
Delete files with "Talk~" and "_talk~" in their name - delete all talk pages ****
Delete files with "Image~" in their name - delete all image pages (no images anyway) ****
Delete files with "User~" and "User_" in their name - delete all user pages ****
Delete files with "Wikipedia~" in their name - delete archived debates and articles for deletion ****
Delete files with "WP~" in their name - delete policy pages and remaining redirect pages ****
Delete files with "Portal~" in their name - delete portal pages ****
  • 0: 6,72 GB (7.222.928.116 bytes) (7,42 GB on disk) in 365.064 files
  • 1&2: 9,85 GB (10.585.485.390 bytes) (10,9 GB) in 581.674 files
  • 3,4,5: 13,8 GB (14.828.113.013 bytes) (15,3 GB) in 797.025 files
  • 6,7,8: 7,86 GB (8.447.226.758 bytes) (8,72 GB) in 438.116 files
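
The Cygwin commands used for the deletions above are not recorded in this log. As a stand-in, here is a hedged Python sketch of the same filters. It assumes "2KB" means 2048 bytes and that the name markers are matched as plain substrings; `prune` and `should_delete` are illustrative names.

```python
from pathlib import Path

# Name markers listed in the log (talk, image, user, project and portal pages).
NAME_MARKERS = ("Talk~", "_talk~", "Image~", "User~", "User_",
                "Wikipedia~", "WP~", "Portal~")
MIN_SIZE = 2 * 1024  # assumption: "2KB" is read as 2048 bytes


def should_delete(path, min_size=MIN_SIZE, markers=NAME_MARKERS):
    """True if the file name contains a marker or the file is under min_size."""
    if any(marker in path.name for marker in markers):
        return True
    return path.stat().st_size < min_size


def prune(root, dry_run=True):
    """Return the .html files matching the filters; delete them unless dry_run."""
    doomed = [p for p in Path(root).rglob("*.html")
              if p.is_file() and should_delete(p)]
    if not dry_run:
        for p in doomed:
            p.unlink()
    return doomed
```

Running with `dry_run=True` first is worthwhile: a plain substring match on "User_" would also catch an ordinary article named, say, User_interface.html.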

Use the freeware htmlastxt to convert the HTML files to .txt

Note: some files could not be converted: corrupted files, files containing characters that don't work in .txt format, and some articles whose names are too long.

  • 0: 1,77 GB (1.905.887.280 bytes) (2,40 GB on disk) in 360.838 files - 4.226 files not converted
  • 1&2: 2,24 GB (2.410.544.202 bytes) (3,30 GB) in 573.434 files - 8.240 files lost
  • 3,4,5: 3,10 GB (3.338.965.872 bytes) (4,56 GB) in 782.578 files - 14.447 files lost
  • 6,7,8: 1,58 GB (1.702.757.532 bytes) (2,28 GB) in 386.153 files - 7.321 files lost (+ tidied)
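
htmlastxt is a separate program and its internals are not shown here. The sketch below is not that tool. It is a rough substitute built on Python's standard `html.parser`, meant only to illustrate the conversion and the "files lost" count. It assumes the dump is UTF-8; all names are illustrative.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_tree(root):
    """Write a .txt next to every .html under root; return (converted, failed)."""
    converted = failed = 0
    for src in Path(root).rglob("*.html"):
        try:
            # errors="replace" keeps going on undecodable bytes.
            html = src.read_text(encoding="utf-8", errors="replace")
            src.with_suffix(".txt").write_text(html_to_text(html), encoding="utf-8")
            converted += 1
        except OSError:
            # e.g. path too long or unreadable file: counted as lost
            failed += 1
    return converted, failed
```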

Delete old .html files

Delete files with "Category~" in their name - delete all category pages (not the number/history ones) ***
Delete files with "Template~" in their name - delete all standalone template pages ***
Delete files smaller than 2 KB - delete empty articles (quite a few there)
Delete empty folders x3
  • 3,4,5: 2,9 GB (4,16 GB on disk) in 676.793 files
  • Total: 7,29 GB (7.833.173.271 bytes) (10,3 GB on disk) in 1.721.590 files
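
The empty-folder pass was run three times here, presumably because removing a folder can leave its parent empty. A bottom-up walk handles nested empties in one pass. This is a sketch of that idea, not the command actually used; `remove_empty_dirs` is an illustrative name.

```python
import os


def remove_empty_dirs(root):
    """Remove every empty directory below root (root itself is kept).

    Walking bottom-up means children are handled before their parents,
    so a folder that only contained empty folders is removed as well.
    Returns the number of directories removed.
    """
    root = os.fspath(root)
    removed = 0
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        # Re-list the directory: its children may just have been removed.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed += 1
    return removed
```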

Tidy up what slipped through the automated deletion

Create a "numbers" folder
Create a "special characters" folder
Create a "list" folder
Create a "favourites" folder
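
The log does not say which rule sent a file to which folder. The sketch below is only a guess at a plausible rule, so every part of it is an assumption: names starting with a digit go to "numbers", names starting with "List_of" go to "list", names starting with a non-alphanumeric character go to "special characters". "favourites" is left as a manual choice.

```python
import shutil
from pathlib import Path


def classify(filename):
    """Return the target folder name for a file, or None to leave it in place.

    The rules are hypothetical; the log does not record the real ones.
    """
    first = filename[:1]
    if filename.startswith("List_of"):
        return "list"
    if first.isdigit():
        return "numbers"
    if first and not first.isalnum():
        return "special characters"
    return None


def sort_into_folders(src_dir, dest_root):
    """Move matching .txt files from src_dir into folders under dest_root."""
    moved = {}
    for path in list(Path(src_dir).glob("*.txt")):
        folder = classify(path.name)
        if folder is None:
            continue
        target_dir = Path(dest_root) / folder
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved[path.name] = folder
    return moved
```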


Delete strings in .txt files:

From Wikipedia, the free encyclopedia
You have new messages (last change).
[edit] = ++

Views
+ Article
+ Discussion
+ Current revision

Navigation
+ Main page
+ Contents
+ Featured content
+ Current events

interaction
+ About Wikipedia
+ Community portal
+ Contact us
+ Make a donation
+ Help

Search

In other languages

+ All text is available under the terms of the GNU Free Documentation License. (See <Copyrights> for details.)
Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a US-registered 501(c)(3) tax-deductible nonprofit charity.
+ About Wikipedia
+ Disclaimers
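
A hedged sketch of the string removal, assuming each string above sits on a line of its own in the converted files and can be dropped by exact line match. The tool actually used is not recorded in this log. The `[edit]` entry is left out because its handling is not clear from the log.

```python
# Boilerplate lines copied from the list in this log.
BOILERPLATE = {
    "From Wikipedia, the free encyclopedia",
    "You have new messages (last change).",
    "Views", "+ Article", "+ Discussion", "+ Current revision",
    "Navigation", "+ Main page", "+ Contents", "+ Featured content",
    "+ Current events",
    "interaction", "+ About Wikipedia", "+ Community portal", "+ Contact us",
    "+ Make a donation", "+ Help",
    "Search",
    "In other languages",
    "+ All text is available under the terms of the GNU Free Documentation "
    "License. (See <Copyrights> for details.)",
    "Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., "
    "a US-registered 501(c)(3) tax-deductible nonprofit charity.",
    "+ Disclaimers",
}


def strip_boilerplate(text, boilerplate=BOILERPLATE):
    """Drop every line that, once stripped, equals a boilerplate string."""
    kept = [line for line in text.splitlines()
            if line.strip() not in boilerplate]
    return "\n".join(kept)
```

Exact line matching is blunt: an article line consisting only of "Search" or "Views" would be removed too.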





  • Final: ++ wikipedia.txt : 5,71 GB (6.135.151.270 bytes) (9,22 GB on disk) in 1.638.226 files (about 1 GB once compressed with 7z)

Junk left:

  • Some empty folders
  • Some redirect pages
  • Various picturesque junk pages


PS: A big thanks to the people at the Computer Ref Desk for their help with the Cygwin commands.