Talk:Soundex

This article is rated C-class on Wikipedia's content assessment scale.
It is of interest to the following WikiProjects:

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

This article has not yet received a rating on the project's importance scale.

This article has been automatically rated by a bot or other tool because one or more other projects use this class. Please ensure the assessment is correct before removing the |auto= parameter.


Linguistics: Applied Linguistics

This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on the project's importance scale.
This article is supported by Applied Linguistics Task Force.

Soundex may be used for indexing words


Soundex may be used for indexing words, but it was specifically designed for names, and doesn't always apply well to much else. RossPatterson 21:39, 24 September 2005 (UTC)

Soundex was specifically designed to index names of Western European origin, and not much else. This was a huge bias on the inventors' part since they were American, and in the 18th/19th centuries the vast majority of Americans had surnames of Western European origin. This is not necessarily the case today. PS - Can someone please demonstrate that Margaret Odell had anything to do with this patent? I don't see where she appears anywhere on the patent as a patent holder and I can't find any information on her anywhere... Thx. 12.110.196.19 15:48, 5 April 2006 (UTC)

Knuth describes Soundex in volume 3 (p. 394 in the second edition) as: "... a technique that was originally developed by Margaret K. Odell and Robert C. Russell [see U.S. Patents 1261167 (1918), 1435663 (1922)], ...", and that's good enough for me. I expect just about every reference you'll find on the web goes back to Knuth. It's certainly true that only Russell's name is on those two patents, but I see on re-reading that neither Knuth nor this article says that Odell held the patents. RossPatterson 02:00, 6 April 2006 (UTC)

This page said Odell was a co-inventor back in the March revision. I see someone updated the page to reflect that Daitch-Mokotoff returns up to 32 separate encodings. The range of encodings, however, is actually 000000 to 999999, although many numeric combinations in this range will never be encountered because of restrictions on side-by-side duplicate digits in the rules for the algorithm. 69.116.243.218 02:20, 22 July 2006 (UTC)

C Code removal


I'm sorry Sudipta, because I do believe you were acting in perfectly good faith and put plenty of effort into your C implementation of the algorithm, but unfortunately incorporating your own work contravenes the No Original Research standard, so I removed it. I hope you understand. Fortunately there are plenty of avenues on the internet where publishing original code is positively encouraged, so maybe your implementation can find an audience there? --VinceBowdren 22:41, 22 May 2007 (UTC)

Better algorithm?


This description doesn't seem very good - at least it's not possible to just convert the steps to code one-by-one as they're written here. The main problem is that step 2 tells you to discard the vowels, but then step 4 refers to the original string. Is there a better algorithm somewhere? Interplanet Janet 16:18, 25 September 2007 (UTC)

One modified algorithm may work fine for you:

Retain the first letter of the string
(American census version only): remove any H or W unless it is the first letter
Assign numbers to letters (after the first) as follows:
- b, f, p, v = 1
- c, g, j, k, q, s, x, z = 2
- d, t = 3
- l = 4
- m, n = 5
- r = 6
- a, e, h, i, o, u, w, y = 0
Any runs of 2 or more of the same digit should be replaced with a single copy.
Remove any 0 digits
Return the first four characters, right-padding with zeroes if there are fewer than four.

Does that help? 67.76.205.139 (talk) 23:53, 10 January 2008 (UTC)

This doesn't deal with the case where the second letter of the string has the same code number as the first letter (in which case the second should be ignored). JMG (talk) 02:42, 2 June 2008 (UTC)
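
For illustration, the six steps listed above can be sketched in Python, together with the adjustment JMG describes. This is only a minimal sketch, not a reference implementation: the function name `soundex_sketch` and the flags `census` and `code_first_letter` are made up for this example. With `code_first_letter=False` it follows the steps as written; with `code_first_letter=True` the first letter's own code takes part in the run-collapsing step, so a second letter with the same code is dropped.

```python
# Letter-to-digit table from the steps above; vowels, h, w and y map to "0".
CODES = {
    ch: digit
    for letters, digit in (
        ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
        ("l", "4"), ("mn", "5"), ("r", "6"), ("aehiouwy", "0"),
    )
    for ch in letters
}


def soundex_sketch(name, census=True, code_first_letter=False):
    """Hypothetical helper: encode `name` following the listed steps."""
    # Keep only the letters covered by the table.
    letters = [ch for ch in name.lower() if ch in CODES]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]

    # Census variant: drop H and W unless they are the first letter.
    if census:
        rest = [ch for ch in rest if ch not in "hw"]

    # Optionally let the first letter's code join the run-collapsing step
    # (the adjustment suggested in the reply above; the flag is an assumption).
    coded = ([first] if code_first_letter else []) + rest
    digits = [CODES[ch] for ch in coded]

    # Replace runs of the same digit with a single copy.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)

    # The first letter is kept as a letter, so discard its digit again.
    if code_first_letter:
        collapsed = collapsed[1:]

    # Remove zeroes, then pad or truncate to four characters.
    kept = "".join(d for d in collapsed if d != "0")
    return (first.upper() + kept + "000")[:4]
```

Under these assumptions, `soundex_sketch("Pfister")` gives P123 when the steps are followed literally and P236 with `code_first_letter=True`, which is the case the reply above points out.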

Clarity and correctness of Rules section


The description of the meaning of 'h' and 'w' is not at all clear. The section also states that vowels can affect the coding, but it doesn't say how. Furthermore, one of the examples given is not correct: Ashcraft is coded A261, not A226 (see http://www.archives.gov/genealogy/census/soundex.html ). I suggest the Rules section either be rewritten, or maybe even removed completely. 79.136.60.98 (talk) 23:22, 28 April 2010 (UTC)

I have rewritten the H & W rules according to the US Census Bureau (http://www.archives.gov/research/census/soundex.html). — Preceding unsigned comment added by Copyeditor42 (talk • contribs) 15:46, 8 February 2012 (UTC)

As noted above, the main page gives the incorrect answer for Ashcraft, which should be A261, not A226, following the rules as I understand them. Moreover, this example is used in the US Census Bureau document cited above, which explicitly says that it should produce A261 not A226. I will therefore update it on the main page.

Njr~enwiki (talk) 09:57, 24 July 2019 (UTC)

The American Soundex section seems redundant now. The rules are nearly identical, and they even use the same examples. The article would benefit from combining them, or limiting them to one or the other. Devinmcginty (talk) 21:08, 26 July 2019 (UTC)

SQL Server 2008's implementation of soundex

I noticed that SOUNDEX in SQL Server 2008 returned A226 for Ashcraft instead of A261. It appears to be using the pre-1920 form of the Soundex algorithm, for whatever reason (perhaps it treats the 'h' like a vowel, so that it separates the two 2s in the repeated-digit rule, or perhaps it has some more advanced rule that is not documented here). Or perhaps it's simply a bug.
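
One way to see how both results can arise is the sketch below. It makes no claim about SQL Server's actual code; it only shows that a single rule change is enough to turn A261 into A226 for this name. The function name `soundex` and the flag `hw_transparent` are assumptions made for this example. With `hw_transparent=True`, H and W are skipped without breaking a run of equal codes (the rule in the National Archives description cited above); with `False`, they break a run the way vowels do.

```python
GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
# Map each consonant to its digit; vowels, h, w and y have no entry.
CODE = {ch: str(i + 1) for i, group in enumerate(GROUPS) for ch in group}


def soundex(name, hw_transparent=True):
    """Hypothetical helper: four-character Soundex with a switchable H/W rule."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODE.get(letters[0], "")  # code of the previous significant letter
    for ch in letters[1:]:
        if hw_transparent and ch in "hw":
            continue  # skipped entirely: does not reset `prev`
        code = CODE.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel (or h/w when not transparent) resets the run
    return (out + "000")[:4]
```

With these definitions, `soundex("Ashcraft")` returns A261 and `soundex("Ashcraft", hw_transparent=False)` returns A226.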

Additional Information


It's been a few years since I've even logged in to wiki to post something, so please forgive any inadvertent lapses in wiki etiquette.

I happened to run across this term today on Slashdot (in reference to Tamerlan Tsarnaev) and jumped over here to see what it said about Soundex.

I learned about the term back in 1993 when I started as a police officer. When a user "ran" somebody on the computer system, the user might get a wanted hit on a similar name. That was taught to me as a "Soundex" and further confirmation was needed if the name wasn't an exact match. The joke was that more times than not the "matched" name wasn't even close. I never knew until today exactly how the database actually returned the name.

Anyway, I thought it might be of some interest to those of you that were maintaining this entry.

Trick414 (talk) 02:59, 27 March 2014 (UTC)

NOTE - if it's of any use - for info. [Anonymous]

The UK Home Office had an "advanced" algorithm for UK surnames and street names, which are far more idiomatic in pronunciation than American English. This consisted of a set of rules for name beginnings, name endings, and name 'centres' (i.e. a rolling pass allowed between defined 'begin' and 'end' sequences), which were applied in a defined order of passes, the idea being to remove or reduce known letter 'groups' of common pronunciation; the final pass was then a 'classic' letter-to-number reduction. This system also had an 'exceptions' list of names which were truly idiomatic (an example - "Cholmondeley" is pronounced "Chumley"). I don't know if this system is still used in their computing systems, but it was definitely used from 1980 to the late 90s for criminal records office database(s). I guess with more immigrant/foreign language surnames around now, perhaps this system is no longer viable.

— Preceding unsigned comment added by 219.89.41.218 (talk) 21:32, 2 April 2015 (UTC)

CCA's database system's version of Soundex function ($SNDX)


CCA's database system Model-204 includes a function ($SNDX) which works a little differently and doesn't follow the 3-digit rule at all. It will return as many (or few) characters as there continue to be consonants but ignores W, H, Y and doubles, so Ashcraft is encoded as A22613, while Murray becomes M6 (no extra zeroes). It considers the extra length useful for database indexes. [1] Rogerclarinet (talk) 19:46, 13 November 2017 (UTC)

References

^ Model 204 User Language manual (various editions) $SNDX
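
The behaviour described in the comment above can be approximated as follows. This is a guess reconstructed only from the two examples given (Ashcraft and Murray), not from the Model 204 manual, and the function name `sndx_sketch` is hypothetical. It assumes that "doubles" means the same letter repeated back to back, that vowels are skipped along with W, H and Y, and that every other consonant is coded with no padding or truncation.

```python
GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
# Consonant-to-digit table; vowels, w, h and y have no entry and are ignored.
CODE = {ch: str(i + 1) for i, group in enumerate(GROUPS) for ch in group}


def sndx_sketch(name):
    """Hypothetical approximation of the variable-length coding described above."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev_letter = letters[0]
    for ch in letters[1:]:
        # Skip a letter that repeats the one immediately before it ("doubles"),
        # and skip anything without a digit code.
        if ch != prev_letter and ch in CODE:
            out += CODE[ch]
        prev_letter = ch
    return out  # no zero padding and no length limit
```

Under these assumptions the sketch reproduces both examples from the comment: `sndx_sketch("Ashcraft")` gives A22613 and `sndx_sketch("Murray")` gives M6.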

homophone?


Is the goal really to match homophones? A homophone, according to its Wikipedia entry, is "..a word that is pronounced the same (to varying extent) as another word but differs in meaning". Isn't Soundex typically used to match words that are pronounced the same and have the same meaning? — Preceding unsigned comment added by JonasRosenqvist (talk • contribs) 08:32, 5 June 2018 (UTC)
