Below is a list of the 100+ most frequently misspelled long words (10 letters or more) in the enwiki dump enwiki-20070206-pages-articles.xml.bz2
The words are sorted first by word-form distribution ("d" in the list), then by count ("c" in the list). The 9GB dump file is logically divided into 6314 parts, each 200'000 tokens long (using a special C program I wrote), so the "distribution" number shows how many of these parts contain each misspelling. In other words, the higher the distribution number, the more evenly the misspelling is spread throughout the dump file.
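The distribution and count figures can be illustrated with a minimal Python sketch. This is not the C program mentioned above; the function names `distribution_and_count` and `rank`, and the assumption that the dump has already been split into a list of tokens, are hypothetical and used only for illustration.

```python
from collections import Counter


def distribution_and_count(tokens, targets, part_size=200_000):
    """Return {word: (distribution, count)} for every word in `targets`.

    `tokens` is assumed to be the already-tokenized text (hypothetical input).
    distribution = number of fixed-size parts that contain the word
    count        = total number of occurrences of the word
    """
    targets = set(targets)
    counts = Counter()
    parts_seen = Counter()
    for start in range(0, len(tokens), part_size):
        part = tokens[start:start + part_size]
        found = Counter(t for t in part if t in targets)
        counts.update(found)             # add the occurrences from this part
        parts_seen.update(found.keys())  # each part counts at most once per word
    return {w: (parts_seen[w], counts[w]) for w in targets}


def rank(stats):
    """Sort by distribution first, then by count, both descending."""
    return sorted(stats.items(), key=lambda kv: (-kv[1][0], -kv[1][1]))
```

With a toy `part_size` of 2, the token list `["definately", "a", "b", "definately", "c", "independant"]` is split into three parts, so "definately" gets distribution 2 and count 2, while "independant" gets distribution 1 and count 1.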
- Example: definately (suggest: 1. definitely (change i 6)) 1322d 1495c
This means that "definately" is misspelled, the suggested replacement is "definitely", the letter at position 6 should be changed to "i", the distribution of the misspelled word is 1322, and its raw count is 1495.
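For readers who want to process the list mechanically, here is a rough Python sketch that parses one entry and applies the described edit to the misspelled word. The names `parse_entry` and `apply_edit` are hypothetical. The reading of positions as 1-based, and of the "add" position as the letter's position in the corrected word, is an assumption inferred from the entries in the list, not a documented specification.

```python
import re

# One list entry: "- word (suggest: 1. fix (op) 2. fix (op)) 1322d 1495c"
ENTRY_RE = re.compile(
    r"^-?\s*(?P<word>\S+) \(suggest: (?P<suggestions>.*)\) "
    r"(?P<dist>\d+)d (?P<count>\d+)c$"
)
SUGGESTION_RE = re.compile(r"\d+\. (\S+) \(([^)]*)\)")


def parse_entry(line):
    """Split an entry into word, [(suggestion, edit)], distribution, count."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    return {
        "word": m.group("word"),
        "suggestions": SUGGESTION_RE.findall(m.group("suggestions")),
        "distribution": int(m.group("dist")),
        "count": int(m.group("count")),
    }


def apply_edit(word, op):
    """Apply an edit such as 'change i 6' (assumed 1-based) to `word`."""
    parts = op.split()
    kind = parts[0]
    if kind == "swap":
        # 'swap i-c 7-8': exchange the letters at the two positions
        i, j = (int(p) - 1 for p in parts[2].split("-"))
        chars = list(word)
        chars[i], chars[j] = chars[j], chars[i]
        return "".join(chars)
    # The last two fields are letter and position (also for 'drop last y 11')
    letter, pos = parts[-2], int(parts[-1]) - 1
    if kind == "change":
        return word[:pos] + letter + word[pos + 1:]
    if kind == "add":
        return word[:pos] + letter + word[pos:]
    if kind == "drop":
        return word[:pos] + word[pos + 1:]
    raise ValueError(f"unknown edit: {op!r}")
```

Under these assumptions, applying each parsed edit to the misspelled word should reproduce the suggested replacement, which gives a simple consistency check for the entries.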
TODO:
Publishing the complete list (or at least a much longer one) in the near future. The raw enwiki-20070206 file of misspelled long words (10 letters or more) with suggested replacements contains about 147'000 different misspellings (57'700 with distribution 2 or more), but some percentage of these are false positives caused by imperfections in the half-million English word-form list that I use, mostly missing valid word forms. For example, "microcontrollers" is reported as a misspelling with the suggested replacement "microcontroller", simply because the plural form "microcontrollers" is not in the English word list. Another source of false positives is foreign words that are very close to their English counterparts. For example, "Enciclopedia" (Spanish) is reported as a misspelling of "Encyclopedia" (change y 4) 495d 632c.
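The false positives described above can be reproduced with a small Python sketch that looks for dictionary words exactly one edit away (add, drop, change, or swap). It only illustrates the general idea and is not the program used to build the list. The function name `single_edit_suggestions`, the tiny vocabulary, and the restriction to lowercase ASCII letters are assumptions made for the example.

```python
import string


def single_edit_suggestions(word, vocabulary):
    """Return the vocabulary words exactly one edit away from `word`.

    Edits considered: add a letter, drop a letter, change a letter,
    swap two adjacent letters. Lowercase ASCII only (an assumption).
    """
    letters = string.ascii_lowercase
    candidates = set()
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        for ch in letters:
            candidates.add(head + ch + tail)                      # add
        if tail:
            candidates.add(head + tail[1:])                       # drop
            for ch in letters:
                candidates.add(head + ch + tail[1:])              # change
        if len(tail) > 1:
            candidates.add(head + tail[1] + tail[0] + tail[2:])   # swap
    candidates.discard(word)
    return sorted(candidates & set(vocabulary))


def looks_misspelled(word, vocabulary):
    """Flag a word that is absent from the list but one edit from an entry."""
    return word not in vocabulary and bool(single_edit_suggestions(word, vocabulary))
```

If the vocabulary contains "microcontroller" but not its plural, `looks_misspelled("microcontrollers", vocabulary)` is true even though the word is valid, which is the kind of false positive a more complete word list would avoid.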
Top 100+ misspelled long words
- definately (suggest: 1. definitely (change i 6)) 1322d 1495c
- independant (suggest: 1. independent (change e 9)) 1040d 1187c
- Encylopedia (suggest: 1. Encyclopedia (add c 5)) 929d 874c
- advertisment (suggest: 1. advertisement (add e 9)) 725d 748c
- knowledgable (suggest: 1. knowledgeable (add e 9)) 609d 713c
- encylopedic (suggest: 1. encyclopedic (add c 5)) 551d 669c
- commerical (suggest: 1. commercial (swap i-c 7-8)) 503d 800c
- consistant (suggest: 1. consistent (change e 8)) 450d 495c
- infomation (suggest: 1. information (add r 5) 2. infumation (change u 4)) 447d 496c
- mispelling (suggest: 1. misspelling (add s 3) 2. dispelling (change d 1) 3. mistelling (change t 4)) 447d 489c
- accomodate (suggest: 1. accommodate (add m 5)) 408d 466c
- appearence (suggest: 1. apparence (drop e 4) 2. appearance (change a 7)) 401d 474c
- departement (suggest: 1. department (drop e 7)) 400d 553c
- explaination (suggest: 1. explanation (drop i 6)) 393d 426c
- irrelevent (suggest: 1. irrelevant (change a 8)) 376d 441c
- seperately (suggest: 1. separately (change a 4)) 345d 387c
- salvagable (suggest: 1. salvageable (add e 7)) 337d 369c
- neccessary (suggest: 1. necessary (drop c 3)) 327d 382c
- resoultion (suggest: 1. resolution (swap u-l 5-6)) 322d 654c
- Profesional (suggest: 1. Professional (add s 6)) 320d 378c
- pronounciation (suggest: 1. pronunciation (drop o 5)) 311d 453c
- appearences (suggest: 1. appearances (change a 7)) 311d 336c
- unneccessary (suggest: 1. unnecessary (drop c 5)) 308d 309c
- unecessary (suggest: 1. necessary (drop u 1) 2. unnecessary (add n 2)) 307d 314c
- arguements (suggest: 1. arguments (drop e 5)) 306d 382c
- persistant (suggest: 1. persistent (change e 8)) 283d 330c
- overwitten (suggest: 1. overwritten (add r 6) 2. overbitten (change b 5)) 281d 388c
- Unfortunatly (suggest: 1. Unfortunately (add e 11)) 275d 177c
- particulary (suggest: 1. articulary (drop p 1) 2. particular (drop last y 11) 3. particularly (add l 11) 4. particulars (change s 11)) 274d 278c
- apparantly (suggest: 1. apparently (change e 6)) 272d 242c
- correspondance (suggest: 1. correspondence (change e 11)) 270d 171c
- nonexistant (suggest: 1. nonexistent (change e 9)) 259d 271c
- embarassing (suggest: 1. embarrassing (add r 5)) 251d 272c
- notablility (suggest: 1. notability (drop l 6)) 250d 254c
- particularily (suggest: 1. particularly (drop i 11) 2. particularity (change t 12)) 248d 273c
- consistancy (suggest: 1. consistency (change e 8)) 246d 251c
- appropiate (suggest: 1. appropriate (add r 7)) 245d 287c
- immediatly (suggest: 1. immediately (add e 9)) 239d 259c
- assasination (suggest: 1. assassination (add s 5)) 237d 226c
- rediculous (suggest: 1. pediculous (change p 1) 2. ridiculous (change i 2)) 232d 262c
- resemblence (suggest: 1. resemblance (change a 8)) 231d 236c
- harrassing (suggest: 1. harassing (drop r 3)) 230d 338c
- Entreprise (suggest: 1. Enterprise (swap r-e 4-5)) 222d 262c
- succesfully (suggest: 1. successfully (add s 6)) 221d 243c
- consistantly (suggest: 1. consistently (change e 8)) 212d 225c
- threshhold (suggest: 1. threshold (drop h 6) 2. threshwold (change w 7)) 209d 222c
- independance (suggest: 1. independence (change e 9)) 205d 156c
- adminstrator (suggest: 1. administrator (add i 6)) 204d 216c
- curiousity (suggest: 1. curiosity (drop u 6)) 193d 200c
- indefinately (suggest: 1. indefinitely (change i 8)) 193d 240c
- Liscensing (suggest: 1. Licensing (drop s 3)) 187d 252c
- politicans (suggest: 1. politicians (add i 8)) 187d 197c
- Assocation (suggest: 1. Association (add i 6)) 185d 175c
- archaelogical (suggest: 1. archaeological (add o 7)) 185d 119c
- occassions (suggest: 1. occasions (drop s 5)) 183d 215c
- insistance (suggest: 1. insistence (change e 7)) 183d 195c
- implimented (suggest: 1. implemented (change e 5)) 183d 190c
- possiblity (suggest: 1. possibility (add i 7)) 182d 192c
- propoganda (suggest: 1. propaganda (change a 5)) 176d 196c
- somethings (suggest: 1. something (drop last s 10)) 174d 148c
- seperation (suggest: 1. severation (change v 3) 2. separation (change a 4)) 173d 158c
- targetting (suggest: 1. targeting (drop t 6) 2. pargetting (change p 1)) 172d 195c
- signficant (suggest: 1. significant (add i 5)) 171d 189c
- comparision (suggest: 1. comparison (drop i 9)) 168d 156c
- transfering (suggest: 1. transferring (add r 8)) 167d 263c
- mentionned (suggest: 1. mentioned (drop n 7)) 167d 212c
- apropriate (suggest: 1. appropriate (add p 2)) 167d 192c
- notibility (suggest: 1. notability (change a 4)) 164d 216c
- enviroment (suggest: 1. environment (add n 7)) 164d 156c
- intersting (suggest: 1. interesting (add e 6)) 164d 155c
- successfull (suggest: 1. successful (drop l 10) 2. successfully (add y 12)) 161d 166c
- inflamatory (suggest: 1. inflammatory (add m 6)) 159d 201c
- neccessarily (suggest: 1. necessarily (drop c 3)) 158d 168c
- particuarly (suggest: 1. particularly (add l 8)) 157d 166c
- involvment (suggest: 1. involvement (add e 7)) 152d 165c
- occurences (suggest: 1. occurrences (add r 5)) 152d 152c
- independantly (suggest: 1. independently (change e 9)) 152d 160c
- verifyable (suggest: 1. verifiable (change i 6)) 151d 162c
- discription (suggest: 1. description (change e 2)) 150d 153c
- contruction (suggest: 1. construction (add s 4) 2. contraction (change a 6)) 149d 142c
- definetely (suggest: 1. definitely (change i 6)) 148d 143c
- compliation (suggest: 1. complication (add c 7) 2. compilation (swap l-i 5-6)) 147d 135c
- governement (suggest: 1. government (drop e 7)) 146d 130c
- suprisingly (suggest: 1. surprisingly (add r 3)) 146d 130c
- preferrably (suggest: 1. preferably (drop r 6)) 146d 134c
- perfomance (suggest: 1. performance (add r 6)) 145d 126c
- embarassment (suggest: 1. embarrassment (add r 5)) 142d 138c
- adminstrators (suggest: 1. administrators (add i 6)) 141d 144c
- commisioned (suggest: 1. commissioned (add s 6)) 140d 149c
- apperances (suggest: 1. appearances (add a 5)) 140d 107c
- precendent (suggest: 1. precedent (drop n 6)) 137d 148c
- responsiblity (suggest: 1. responsibility (add i 10)) 136d 142c
- committment (suggest: 1. commitment (drop t 6) 2. committent (drop m 8)) 134d 146c
- prestigous (suggest: 1. prestigious (add i 8)) 133d 133c
- ressources (suggest: 1. resources (drop s 3)) 130d 105c
- contraversial (suggest: 1. controversial (change o 6)) 130d 146c
- indepedent (suggest: 1. independent (add n 7)) 130d 121c
- adminstrative (suggest: 1. administrative (add i 6)) 129d 127c
- experiance (suggest: 1. experience (change e 7)) 129d 141c
- enyclopedia (suggest: 1. encyclopedia (add c 3)) 128d 122c
- developped (suggest: 1. developed (drop p 7)) 128d 141c
- embarassed (suggest: 1. embarrassed (add r 5)) 128d 136c
- perjorative (suggest: 1. pejorative (drop r 3) 2. perorative (drop j 4) 3. perforative (change f 4)) 127d 146c
- attendence (suggest: 1. attendance (change a 7)) 127d 127c
- accomodation (suggest: 1. accommodation (add m 5)) 126d 133c
(More soon)