Speaker recognition is the identification of a person from characteristics of voices (voice biometrics). It is also called voice recognition. There is a difference between speaker recognition (recognizing who is speaking) and speech recognition (recognizing what is being said). These two terms are frequently confused, and "voice recognition" can be used for both. In addition, there is a difference between the act of authentication (commonly referred to as speaker verification or speaker authentication) and identification. Finally, there is a difference between speaker recognition (recognizing who is speaking) and speaker diarisation (recognizing when the same speaker is speaking). Recognizing the speaker can simplify the task of translating speech in systems that have been trained on specific person's voices or it can be used to authenticate or verify the identity of a speaker as part of a security process.
Speaker recognition has a history dating back some four decades and uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy (e.g., size and shape of the throat and mouth) and learned behavioral patterns (e.g., voice pitch, speaking style). Speaker verification has earned speaker recognition its classification as a "behavioral biometric".
Verification versus identificationEdit
There are two major applications of speaker recognition technologies and methodologies. If the speaker claims to be of a certain identity and the voice is used to verify this claim, this is called verification or authentication. On the other hand, identification is the task of determining an unknown speaker's identity. In a sense speaker verification is a 1:1 match where one speaker's voice is matched to one template (also called a "voice print" or "voice model") whereas speaker identification is a 1:N match where the voice is compared against N templates.
From a security perspective, identification is different from verification. For example, presenting your passport at border control is a verification process: the agent compares your face to the picture in the document. Conversely, a police officer comparing a sketch of an assailant against a database of previously documented criminals to find the closest match(es) is an identification process.
Speaker verification is usually employed as a "gatekeeper" in order to provide access to a secure system (e.g. telephone banking). These systems operate with the users' knowledge and typically require their cooperation. Speaker identification systems can also be implemented covertly without the user's knowledge to identify talkers in a discussion, alert automated systems of speaker changes, check if a user is already enrolled in a system, etc.
In forensic applications, it is common to first perform a speaker identification process to create a list of "best matches" and then perform a series of verification processes to determine a conclusive match.
Variants of speaker recognitionEdit
Each speaker recognition system has two phases: Enrollment and verification. During enrollment, the speaker's voice is recorded and typically a number of features are extracted to form a voice print, template, or model. In the verification phase, a speech sample or "utterance" is compared against a previously created voice print. For identification systems, the utterance is compared against multiple voice prints in order to determine the best match(es) while verification systems compare an utterance against a single voice print. Because of the process involved, verification is faster than identification.
Speaker recognition systems fall into two categories: text-dependent and text-independent.
If the text must be the same for enrollment and verification this is called text-dependent recognition. In a text-dependent system, prompts can either be common across all speakers (e.g.: a common pass phrase) or unique. In addition, the use of shared-secrets (e.g.: passwords and PINs) or knowledge-based information can be employed in order to create a multi-factor authentication scenario.
Text-independent systems are most often used for speaker identification as they require very little if any cooperation by the speaker. In this case the text during enrollment and test is different. In fact, the enrollment may happen without the user's knowledge, as in the case for many forensic applications. As text-independent technologies do not compare what was said at enrollment and verification, verification applications tend to also employ speech recognition to determine what the user is saying at the point of authentication.
Speaker recognition is a pattern recognition problem. The various technologies used to process and store voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, Vector Quantization and decision trees. Some systems also use "anti-speaker" techniques, such as cohort models, and world models. Spectral features are predominantly used in representing speaker characteristics.
Ambient noise levels can impede both collections of the initial and subsequent voice samples. Noise reduction algorithms can be employed to improve accuracy, but incorrect application can have the opposite effect. Performance degradation can result from changes in behavioural attributes of the voice and from enrollment using one telephone and verification on another telephone ("cross channel"). Integration with two-factor authentication products is expected to increase. Voice changes due to ageing may impact system performance over time. Some systems adapt the speaker models after each successful verification to capture such long-term changes in the voice, though there is debate regarding the overall security impact imposed by automated adaptation.
Capture of the biometric is seen as non-invasive. The technology traditionally uses existing microphones and voice transmission technology allowing recognition over long distances via ordinary telephones (wired or wireless).
Digitally recorded audio voice identification and analogue recorded voice identification uses electronic measurements as well as critical listening skills that must be applied by a forensic expert in order for the identification to be accurate.
The first international patent was filed in 1983, coming from the telecommunication research in CSELT (Italy) by Michele Cavazza and Alberto Ciaramella as a basis for both future telco services to final customers and to improve the noise-reduction techniques across the network.
In May 2013 it was announced that Barclays Wealth was to use passive speaker recognition to verify the identity of telephone customers within 30 seconds of normal conversation. The system used had been developed by voice recognition company Nuance (that in 2011 acquired the company Loquendo, the spin-off from CSELT itself for speech technology), the company behind Apple's Siri technology. A verified voiceprint was to be used to identify callers to the system and the system would in the future be rolled out across the company.
The private banking division of Barclays was the first financial services firm to deploy voice biometrics as the primary means to authenticate customers to their call centers. 93% of customer users had rated the system at "9 out of 10" for speed, ease of use and security.
Since then, Nuance Voice Biometrics solutions have been deployed across several financial institutions, including Banco Santander, Royal Bank of Canada, Tangerine Bank, and Manulife.
In August 2014 GoVivace Inc. deployed a speaker identification system that allowed its telecom industry client to positively search for an individual among millions of speakers by using just a single example recording of their voice.
In February 2016 UK high-street bank HSBC and its internet-based retail bank First Direct announced that it would offer 15 million customers its biometric banking software to access online and phone accounts using their fingerprint or voice.
- AI effect
- Applications of artificial intelligence
- Speaker diarisation
- Speech recognition
- Voice changer
- Poddar, Arnab; Sahidullah, Md; Saha, Goutam (March 2018). "Speaker Verification with Short Utterances: A Review of Challenges, Trends and Opportunities". IET Biometrics. 7 (2): 91–101. doi:10.1049/iet-bmt.2017.0065.
- Pollack, Pickett, Sumby (1974). Experimental phonetics. MSS Information Corporation. pp. 251–258. ISBN 0-8422-5149-9.
- Van Lancker and Kreiman (July 3, 1984). "Familiar voice recognition: Patterns and parameters. Part I: Recognition of backward voices" (PDF). Journal of Phonetics. pp. 19–38. Retrieved February 21, 2012.
- "British English definition of voice recognition". Macmillan Publishers Limited. Retrieved February 21, 2012.
- "voice recognition, definition of". WebFinance, Inc. Retrieved February 21, 2012.
- "Linux Gazette 114". Linux Gazette. Retrieved February 21, 2012.
- Lisa Myers (April 19, 2004). "An Exploration of Voice Biometrics".
- Sahidullah, Md.; Kinnunen, Tomi (March 2016). "Local spectral variability features for speaker verification". Digital Signal Processing. 50: 1–11. doi:10.1016/j.dsp.2015.10.011.
- US4752958 A, Michele Cavazza, Alberto Ciaramella, "Device for speaker's verification" http://www.google.com/patents/US4752958?hl=it&cl=en
- International Banking (December 27, 2013). "Voice Biometric Technology in Banking | Barclays". Wealth.barclays.com. Retrieved February 21, 2016.
- Matt Warman (May 8, 2013). "Say goodbye to the pin: voice recognition takes over at Barclays Wealth". Retrieved June 5, 2013.
- "Voice Biometrics for fast, secure authentication in your IVR and mobile apps". Nuance. Retrieved February 21, 2016.
- "Speaker Identification". Archived from the original on August 15, 2014. Retrieved September 3, 2014.
- Ewen MacAskill. "Did 'Jihadi John' kill Steven Sotloff? | Media". The Guardian. Retrieved February 21, 2016.
- Julia Kollewe (February 19, 2016). "HSBC rolls out voice and touch ID security for bank customers | Business". The Guardian. Retrieved February 21, 2016.
- "Biometrics from the movies" –National Institute of Standards and Technology
- Elisabeth Zetterholm (2003), Voice Imitation. A Phonetic Study of Perceptual Illusions and Acoustic Success, Phd thesis, Lund University.
- Md Sahidullah (2015), Enhancement of Speaker Recognition Performance Using Block Level, Relative and Temporal Information of Subband Energies, PhD thesis, Indian Institute of Technology Kharagpur.