Below is the list of the 100+ most often misspelled long words (10 letters or more) in the enwiki dump: enwiki-20070206-pages-articles.xml.bz2

The words are sorted first by word-form distribution ("d" in the list), then by count ("c" in the list). The 9 GB dump file is logically split into 6314 parts of 200'000 tokens each (using a special C program I wrote), and the "distribution" number is the number of these parts in which a misspelling occurs. In other words, the higher the distribution number, the more evenly the misspelling is spread throughout the dump file.
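The two numbers can be illustrated with a minimal Python sketch. This is not the C program mentioned above. The function name `count_and_distribution`, the whitespace tokenization, and the tiny part size used in the demonstration are all assumptions made for illustration.

```python
from collections import Counter

def count_and_distribution(tokens, part_size=200_000):
    """Return (count, distribution) for a list of tokens.

    count[w]        = total number of occurrences of w
    distribution[w] = number of fixed-size parts that contain w at least once
    """
    count = Counter()
    distribution = Counter()
    for start in range(0, len(tokens), part_size):
        part = tokens[start:start + part_size]
        count.update(part)              # every occurrence adds to the count
        distribution.update(set(part))  # each part adds at most 1 per word
    return count, distribution

# Toy demonstration: parts of 4 tokens instead of 200'000 (hypothetical data).
tokens = "a definately b c definately definately d e f g h definately".split()
count, distribution = count_and_distribution(tokens, part_size=4)

# Sort as in the list: by distribution first, then by count, both descending.
ranked = sorted(count, key=lambda w: (-distribution[w], -count[w]))
print(ranked[0], f"{distribution[ranked[0]]}d", f"{count[ranked[0]]}c")
# prints: definately 3d 4c
```

In the toy data, "definately" occurs four times but in only three of the parts, so its count is 4 and its distribution is 3.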

  • Example: definately (suggest: 1. definitely (change i 6)) 1322d 1495c

means that "definately" is a misspelling, the suggested replacement is "definitely", the letter at position 6 should be changed to "i", the distribution of the misspelled word is 1322, and its raw count is 1495.
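For anyone who wants to process the list mechanically, an entry in this format can be split into its fields with two regular expressions. The following is a sketch only: the names `ENTRY_RE`, `SUGGESTION_RE`, and `parse_entry` are made up for this example, and the patterns assume the exact layout shown in the example above.

```python
import re

# One list entry: optional "N. " prefix, the misspelling, the suggestions,
# then the distribution ("d") and the raw count ("c").
ENTRY_RE = re.compile(
    r"^\s*(?:\d+\.\s+)?(?P<word>\S+) \(suggest: (?P<suggestions>.*)\) "
    r"(?P<dist>\d+)d (?P<count>\d+)c\s*$"
)
# One numbered suggestion inside the "suggest:" part, e.g. "1. definitely (change i 6)".
SUGGESTION_RE = re.compile(r"\d+\. (?P<word>\S+) \((?P<op>[^)]*)\)")

def parse_entry(line):
    """Parse one entry into a dict; raise ValueError if the layout differs."""
    m = ENTRY_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    suggestions = [
        (s.group("word"), s.group("op"))
        for s in SUGGESTION_RE.finditer(m.group("suggestions"))
    ]
    return {
        "word": m.group("word"),
        "suggestions": suggestions,
        "distribution": int(m.group("dist")),
        "count": int(m.group("count")),
    }

entry = parse_entry("definately (suggest: 1. definitely (change i 6)) 1322d 1495c")
print(entry["word"], entry["suggestions"], entry["distribution"], entry["count"])
# prints: definately [('definitely', 'change i 6')] 1322 1495
```

Entries with several suggestions yield one (word, operation) pair per suggestion, in the order they appear.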

TODO:


I plan to publish the complete list (or at least a much longer one) in the near future. The raw enwiki-20070206 file of misspelled long words (10 letters or more) with suggested replacements contains about 147'000 different misspellings (57'700 with distribution 2 or more), but some percentage are false positives caused by imperfections in the list of half a million English word forms that I use, mostly missing valid word forms. For example, "microcontrollers" is reported as a misspelling with the suggested replacement "microcontroller", simply because the plural form "microcontrollers" is not in the English word list. Another source of false positives is foreign words that are very close to their English counterparts. For example, "Enciclopedia" (Spanish) is reported as a misspelling of "Encyclopedia" (change y 4), with 495d 632c.
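The "microcontrollers" false positive is easy to reproduce with a toy word list. The sketch below is a guess at how single-edit suggestions of this kind could be generated, not the program actually used for the list. The function name `single_edit_suggestions`, the lowercase-only alphabet, the toy `lexicon`, and the exact wording of the operation labels (1-based positions, as the entries in the list appear to use) are assumptions.

```python
import string

def single_edit_suggestions(word, lexicon):
    """Return [(candidate, operation)] for lexicon words one edit away from word.

    Operations are drop, change, add, and swap of adjacent letters, labeled
    with 1-based positions. Only lowercase ASCII letters are tried.
    """
    found = {}  # candidate -> first operation that produced it
    n = len(word)
    for i in range(n):  # drop the letter at position i+1
        found.setdefault(word[:i] + word[i + 1:], f"drop {word[i]} {i + 1}")
    for i in range(n):  # change the letter at position i+1
        for ch in string.ascii_lowercase:
            if ch != word[i]:
                found.setdefault(word[:i] + ch + word[i + 1:], f"change {ch} {i + 1}")
    for i in range(n + 1):  # add a letter so that it lands at position i+1
        for ch in string.ascii_lowercase:
            found.setdefault(word[:i] + ch + word[i:], f"add {ch} {i + 1}")
    for i in range(n - 1):  # swap the letters at positions i+1 and i+2
        if word[i] != word[i + 1]:
            swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            found.setdefault(swapped, f"swap {word[i]}-{word[i + 1]} {i + 1}-{i + 2}")
    return [(cand, op) for cand, op in found.items() if cand in lexicon]

# Toy lexicon (hypothetical): the plural "microcontrollers" is deliberately missing.
lexicon = {"definitely", "accommodate", "commercial", "microcontroller"}

print(single_edit_suggestions("definately", lexicon))
# prints: [('definitely', 'change i 6')]
print(single_edit_suggestions("microcontrollers", lexicon))
# prints: [('microcontroller', 'drop s 16')]
```

Because the valid plural is absent from the lexicon, it is treated as a misspelling and "corrected" to the singular, which is exactly the kind of false positive described above.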

Top 100+ misspelled long words

  1. definately (suggest: 1. definitely (change i 6)) 1322d 1495c
  2. independant (suggest: 1. independent (change e 9)) 1040d 1187c
  3. Encylopedia (suggest: 1. Encyclopedia (add c 5)) 929d 874c
  4. advertisment (suggest: 1. advertisement (add e 9)) 725d 748c
  5. knowledgable (suggest: 1. knowledgeable (add e 9)) 609d 713c
  6. encylopedic (suggest: 1. encyclopedic (add c 5)) 551d 669c
  7. commerical (suggest: 1. commercial (swap i-c 7-8)) 503d 800c
  8. consistant (suggest: 1. consistent (change e 8)) 450d 495c
  9. infomation (suggest: 1. information (add r 5) 2. infumation (change u 4)) 447d 496c
  10. mispelling (suggest: 1. misspelling (add s 3) 2. dispelling (change d 1) 3. mistelling (change t 4)) 447d 489c
  11. accomodate (suggest: 1. accommodate (add m 5)) 408d 466c
  12. appearence (suggest: 1. apparence (drop e 4) 2. appearance (change a 7)) 401d 474c
  13. departement (suggest: 1. department (drop e 7)) 400d 553c
  14. explaination (suggest: 1. explanation (drop i 6)) 393d 426c
  15. irrelevent (suggest: 1. irrelevant (change a 8)) 376d 441c
  16. seperately (suggest: 1. separately (change a 4)) 345d 387c
  17. salvagable (suggest: 1. salvageable (add e 7)) 337d 369c
  18. neccessary (suggest: 1. necessary (drop c 3)) 327d 382c
  19. resoultion (suggest: 1. resolution (swap u-l 5-6)) 322d 654c
  20. Profesional (suggest: 1. Professional (add s 6)) 320d 378c
  21. pronounciation (suggest: 1. pronunciation (drop o 5)) 311d 453c
  22. appearences (suggest: 1. appearances (change a 7)) 311d 336c
  23. unneccessary (suggest: 1. unnecessary (drop c 5)) 308d 309c
  24. unecessary (suggest: 1. necessary (drop u 1) 2. unnecessary (add n 2)) 307d 314c
  25. arguements (suggest: 1. arguments (drop e 5)) 306d 382c
  26. persistant (suggest: 1. persistent (change e 8)) 283d 330c
  27. overwitten (suggest: 1. overwritten (add r 6) 2. overbitten (change b 5)) 281d 388c
  28. Unfortunatly (suggest: 1. Unfortunately (add e 11)) 275d 177c
  29. particulary (suggest: 1. articulary (drop p 1) 2. particular (drop last y 11) 3. particularly (add l 11) 4. particulars (change s 11)) 274d 278c
  30. apparantly (suggest: 1. apparently (change e 6)) 272d 242c
  31. correspondance (suggest: 1. correspondence (change e 11)) 270d 171c
  32. nonexistant (suggest: 1. nonexistent (change e 9)) 259d 271c
  33. embarassing (suggest: 1. embarrassing (add r 5)) 251d 272c
  34. notablility (suggest: 1. notability (drop l 6)) 250d 254c
  35. particularily (suggest: 1. particularly (drop i 11) 2. particularity (change t 12)) 248d 273c
  36. consistancy (suggest: 1. consistency (change e 8)) 246d 251c
  37. appropiate (suggest: 1. appropriate (add r 7)) 245d 287c
  38. immediatly (suggest: 1. immediately (add e 9)) 239d 259c
  39. assasination (suggest: 1. assassination (add s 5)) 237d 226c
  40. rediculous (suggest: 1. pediculous (change p 1) 2. ridiculous (change i 2)) 232d 262c
  41. resemblence (suggest: 1. resemblance (change a 8)) 231d 236c
  42. harrassing (suggest: 1. harassing (drop r 3)) 230d 338c
  43. Entreprise (suggest: 1. Enterprise (swap r-e 4-5)) 222d 262c
  44. succesfully (suggest: 1. successfully (add s 6)) 221d 243c
  45. consistantly (suggest: 1. consistently (change e 8)) 212d 225c
  46. threshhold (suggest: 1. threshold (drop h 6) 2. threshwold (change w 7)) 209d 222c
  47. independance (suggest: 1. independence (change e 9)) 205d 156c
  48. adminstrator (suggest: 1. administrator (add i 6)) 204d 216c
  49. curiousity (suggest: 1. curiosity (drop u 6)) 193d 200c
  50. indefinately (suggest: 1. indefinitely (change i 8)) 193d 240c
  51. Liscensing (suggest: 1. Licensing (drop s 3)) 187d 252c
  52. politicans (suggest: 1. politicians (add i 8)) 187d 197c
  53. Assocation (suggest: 1. Association (add i 6)) 185d 175c
  54. archaelogical (suggest: 1. archaeological (add o 7)) 185d 119c
  55. occassions (suggest: 1. occasions (drop s 5)) 183d 215c
  56. insistance (suggest: 1. insistence (change e 7)) 183d 195c
  57. implimented (suggest: 1. implemented (change e 5)) 183d 190c
  58. possiblity (suggest: 1. possibility (add i 7)) 182d 192c
  59. propoganda (suggest: 1. propaganda (change a 5)) 176d 196c
  60. somethings (suggest: 1. something (drop last s 10)) 174d 148c
  61. seperation (suggest: 1. severation (change v 3) 2. separation (change a 4)) 173d 158c
  62. targetting (suggest: 1. targeting (drop t 6) 2. pargetting (change p 1)) 172d 195c
  63. signficant (suggest: 1. significant (add i 5)) 171d 189c
  64. comparision (suggest: 1. comparison (drop i 9)) 168d 156c
  65. transfering (suggest: 1. transferring (add r 8)) 167d 263c
  66. mentionned (suggest: 1. mentioned (drop n 7)) 167d 212c
  67. apropriate (suggest: 1. appropriate (add p 2)) 167d 192c
  68. notibility (suggest: 1. notability (change a 4)) 164d 216c
  69. enviroment (suggest: 1. environment (add n 7)) 164d 156c
  70. intersting (suggest: 1. interesting (add e 6)) 164d 155c
  71. successfull (suggest: 1. successful (drop l 10) 2. successfully (add y 12)) 161d 166c
  72. inflamatory (suggest: 1. inflammatory (add m 6)) 159d 201c
  73. neccessarily (suggest: 1. necessarily (drop c 3)) 158d 168c
  74. particuarly (suggest: 1. particularly (add l 8)) 157d 166c
  75. involvment (suggest: 1. involvement (add e 7)) 152d 165c
  76. occurences (suggest: 1. occurrences (add r 5)) 152d 152c
  77. independantly (suggest: 1. independently (change e 9)) 152d 160c
  78. verifyable (suggest: 1. verifiable (change i 6)) 151d 162c
  79. discription (suggest: 1. description (change e 2)) 150d 153c
  80. contruction (suggest: 1. construction (add s 4) 2. contraction (change a 6)) 149d 142c
  81. definetely (suggest: 1. definitely (change i 6)) 148d 143c
  82. compliation (suggest: 1. complication (add c 7) 2. compilation (swap l-i 5-6)) 147d 135c
  83. governement (suggest: 1. government (drop e 7)) 146d 130c
  84. suprisingly (suggest: 1. surprisingly (add r 3)) 146d 130c
  85. preferrably (suggest: 1. preferably (drop r 6)) 146d 134c
  86. perfomance (suggest: 1. performance (add r 6)) 145d 126c
  87. embarassment (suggest: 1. embarrassment (add r 5)) 142d 138c
  88. adminstrators (suggest: 1. administrators (add i 6)) 141d 144c
  89. commisioned (suggest: 1. commissioned (add s 6)) 140d 149c
  90. apperances (suggest: 1. appearances (add a 5)) 140d 107c
  91. precendent (suggest: 1. precedent (drop n 6)) 137d 148c
  92. responsiblity (suggest: 1. responsibility (add i 10)) 136d 142c
  93. committment (suggest: 1. commitment (drop t 6) 2. committent (drop m 8)) 134d 146c
  94. prestigous (suggest: 1. prestigious (add i 8)) 133d 133c
  95. ressources (suggest: 1. resources (drop s 3)) 130d 105c
  96. contraversial (suggest: 1. controversial (change o 6)) 130d 146c
  97. indepedent (suggest: 1. independent (add n 7)) 130d 121c
  98. adminstrative (suggest: 1. administrative (add i 6)) 129d 127c
  99. experiance (suggest: 1. experience (change e 7)) 129d 141c
  100. enyclopedia (suggest: 1. encyclopedia (add c 3)) 128d 122c
  101. developped (suggest: 1. developed (drop p 7)) 128d 141c
  102. embarassed (suggest: 1. embarrassed (add r 5)) 128d 136c
  103. perjorative (suggest: 1. pejorative (drop r 3) 2. perorative (drop j 4) 3. perforative (change f 4)) 127d 146c
  104. attendence (suggest: 1. attendance (change a 7)) 127d 127c
  105. accomodation (suggest: 1. accommodation (add m 5)) 126d 133c


(More soon)