Statistical machine translation

Statistical machine translation (SMT) is a machine translation paradigm in which translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation,[1] and has more recently been superseded by neural machine translation in many applications.

The first ideas of statistical machine translation were introduced by Warren Weaver in 1949,[2] including the idea of applying Claude Shannon's information theory. Statistical machine translation was re-introduced in the late 1980s and early 1990s by researchers at IBM's Thomas J. Watson Research Center[3][4][5] and contributed to a significant resurgence of interest in machine translation. Before the introduction of neural machine translation, it was by far the most widely studied machine translation method.


The idea behind statistical machine translation comes from information theory. A document is translated according to the probability distribution $p(e|f)$ that a string $e$ in the target language (for example, English) is the translation of a string $f$ in the source language (for example, French).

The problem of modeling the probability distribution $p(e|f)$, where $e$ is a target-language string and $f$ a source-language string, has been approached in a number of ways. One approach which lends itself well to computer implementation is to apply Bayes' theorem, that is $p(e|f) \propto p(f|e)\,p(e)$, where the translation model $p(f|e)$ is the probability that the source string is the translation of the target string, and the language model $p(e)$ is the probability of seeing that target language string. This decomposition is attractive as it splits the problem into two subproblems. Finding the best translation $\tilde{e}$ is done by picking the one that gives the highest probability:

$$\tilde{e} = \arg\max_{e} p(e|f) = \arg\max_{e} p(f|e)\,p(e)$$

For a rigorous implementation of this one would have to perform an exhaustive search by going through all candidate strings $e$ in the native (target) language. Performing the search efficiently is the work of a machine translation decoder that uses the foreign string, heuristics and other methods to limit the search space while keeping acceptable quality. This trade-off between quality and time usage can also be found in speech recognition.
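
The search described above can be illustrated on a toy scale. The sketch below writes f for the source string and e for a candidate target string, and scores each candidate by log p(f|e) + log p(e). The candidate set and all probability values are invented for illustration; a real decoder does not enumerate candidates like this but prunes the search space heuristically.

```python
import math

# Hypothetical toy probabilities, invented purely for illustration.
# translation_model[(f, e)] stands in for p(f | e),
# language_model[e] stands in for p(e).
translation_model = {
    ("la maison est petite", "the house is small"): 0.20,
    ("la maison est petite", "small the is house"): 0.20,
    ("la maison est petite", "the home is little"): 0.05,
}
language_model = {
    "the house is small": 0.010,
    "small the is house": 0.00001,
    "the home is little": 0.002,
}


def decode(f, candidates):
    """Return the candidate e that maximizes log p(f|e) + log p(e)."""
    def score(e):
        p_f_given_e = translation_model.get((f, e), 0.0)
        p_e = language_model.get(e, 0.0)
        if p_f_given_e == 0.0 or p_e == 0.0:
            return float("-inf")  # impossible under the toy model
        return math.log(p_f_given_e) + math.log(p_e)

    return max(candidates, key=score)


best = decode("la maison est petite", list(language_model))
print(best)  # the house is small
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is why the product is written as a sum of logarithms.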

As translation systems are not able to store all native strings and their translations, a document is typically translated sentence by sentence, but even the set of possible sentences is too large to store. Language models are typically approximated by smoothed n-gram models, and similar approaches have been applied to translation models, but there is additional complexity due to different sentence lengths and word orders in the languages.

The statistical translation models were initially word-based (Models 1-5 from IBM, the hidden Markov model from Stephan Vogel,[6] and Model 6 from Franz Josef Och[7]), but significant advances were made with the introduction of phrase-based models.[8] Later work incorporated syntax or quasi-syntactic structures.[9]


The most frequently cited[citation needed] benefits of statistical machine translation over rule-based approaches are:

  • More efficient use of human and data resources
    • There are many parallel corpora in machine-readable format and even more monolingual data.
    • Generally, SMT systems are not tailored to any specific pair of languages.
    • Rule-based translation systems require the manual development of linguistic rules, which can be costly, and which often do not generalize to other languages.
  • More fluent translations owing to use of a language model


Shortcomings of statistical machine translation include:

  • Corpus creation can be costly.
  • Specific errors are hard to predict and fix.
  • Results may have superficial fluency that masks translation problems.[10]
  • Statistical machine translation usually works less well for language pairs with significantly different word order.
  • The benefits obtained for translation between Western European languages are not representative of results for other language pairs, owing to smaller training corpora and greater grammatical differences.

Word-based translation

In word-based translation, the fundamental unit of translation is a word in some natural language. Typically, the number of words in translated sentences is different, because of compound words, morphology and idioms. The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces. Information theory necessarily assumes that each word covers the same concept; in practice this is not really true. For example, the English word corner can be translated into Spanish as either rincón or esquina, depending on whether it means the internal or the external angle.

Simple word-based translation can't translate between languages with different fertility. Word-based translation systems can relatively simply be made to cope with high fertility, such that they could map a single word to multiple words, but not the other way about[citation needed]. For example, if we were translating from English to French, each word in English could produce any number of French words, sometimes none at all. But there is no way to group two English words so that they produce a single French word.

An example of a word-based translation system is the freely available GIZA++ package (GPLed), which includes training programs for the IBM models, the HMM model and Model 6.[7]

Word-based translation is not widely used today; phrase-based systems are more common. Most phrase-based systems still use GIZA++ to align the corpus[citation needed]. The alignments are used to extract phrases or deduce syntax rules.[11] Matching words in bi-text remains a problem actively discussed in the community. Because of the predominance of GIZA++, there are now several distributed implementations of it online.[12]

Phrase-based translation

In phrase-based translation, the aim is to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. The sequences of words are called blocks or phrases; typically they are not linguistic phrases but phrasemes found using statistical methods from corpora. It has been shown that restricting the phrases to linguistic phrases (syntactically motivated groups of words, see syntactic categories) decreases the quality of translation.[13]

The chosen phrases are further mapped one-to-one based on a phrase translation table, and may be reordered. This table can be learnt based on word alignment, or directly from a parallel corpus. In the latter case, the model is trained using the expectation-maximization algorithm, similarly to the word-based IBM models.[14]
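
As an illustration of learning a phrase table from a word alignment, the sketch below extracts phrase pairs that are consistent with the alignment, meaning that no word inside the pair is linked to a word outside it. It is a simplified version of the usual extraction heuristic: it does not extend phrases over unaligned words, and the function name, the `max_len` limit and the example sentence pair are assumptions made for this sketch.

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """Extract phrase pairs consistent with a word alignment.

    alignment is a collection of (src_index, tgt_index) links. A pair of
    spans is kept if at least one link falls inside it and no link connects
    a word inside one span to a word outside the other.
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start >= max_len:
                continue
            # Reject if a target word in the span links outside the source span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


src = ["la", "maison", "bleue"]
tgt = ["the", "blue", "house"]
links = [(0, 0), (1, 2), (2, 1)]  # maison-house and bleue-blue cross over
phrases = extract_phrases(src, tgt, links)
for pair in sorted(phrases):
    print(pair)
```

Note that "la maison" is not extracted, because its minimal target span "the blue house" also contains "blue", which is linked to "bleue" outside the source span. Counting how often each extracted pair occurs across a corpus would then give relative-frequency estimates for the phrase table.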

Syntax-based translation

Syntax-based translation is based on the idea of translating syntactic units, i.e. (partial) parse trees of sentences or utterances, rather than single words or strings of words (as in phrase-based MT).[15] The idea of syntax-based translation is quite old in MT, though its statistical counterpart did not take off until the advent of strong stochastic parsers in the 1990s. Examples of this approach include DOP-based MT and, more recently, synchronous context-free grammars.

Hierarchical phrase-based translation

Hierarchical phrase-based translation combines the strengths of phrase-based and syntax-based translation. It uses synchronous context-free grammar rules, but the grammars may be constructed by an extension of methods for phrase-based translation without reference to linguistically motivated syntactic constituents. This idea was first introduced in Chiang's Hiero system (2005).[9]

Language models

A language model is an essential component of any statistical machine translation system; it aids in making the translation as fluent as possible. It is a function that takes a translated sentence and returns the probability of it being said by a native speaker. A good language model will, for example, assign a higher probability to the sentence "the house is small" than to "small the is house". Other than word order, language models may also help with word choice: if a foreign word has multiple possible translations, these functions may give better probabilities for certain translations in specific contexts in the target language.[14]
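
A minimal sketch of such a function is a bigram model with add-one smoothing, shown below. The three-sentence training corpus is invented for illustration, and add-one smoothing is chosen only for brevity; practical systems use larger n-gram orders and better smoothing methods.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Collect history and bigram counts from whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        histories.update(tokens[:-1])            # words that act as history
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return histories, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1)
                          / (histories[prev] + vocab_size))
    return total


# Toy training data, invented for illustration.
lm = train_bigram_lm([
    "the house is small",
    "the house is big",
    "the garden is small",
])
fluent = log_prob("the house is small", lm)
scrambled = log_prob("small the is house", lm)
print(fluent > scrambled)  # True
```

The scrambled sentence contains only bigrams that never occur in the training data, so every factor falls back to the smoothed minimum and the sentence receives a much lower score.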

Challenges with statistical machine translation

Problems that statistical machine translation has to deal with include:

Sentence alignment

In parallel corpora, single sentences in one language can be found translated into several sentences in the other and vice versa.[15] Long sentences may be broken up, and short sentences may be merged. There are even some languages that use writing systems without a clear indication of a sentence end (for example, Thai). Sentence alignment can be performed with the Gale-Church alignment algorithm. Through this and other mathematical models, efficient search and retrieval of the highest-scoring sentence alignment is possible.
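
The search for the highest-scoring sentence alignment can be sketched as a dynamic program over sentence lengths. The code below is a simplified stand-in and not the Gale-Church algorithm itself: it allows only 1-1, 1-2 and 2-1 groupings, and it uses an ad hoc cost (relative length mismatch plus an assumed penalty for merging) where Gale and Church use a probabilistic model. The example sentences are invented.

```python
def align_sentences(src, tgt):
    """Align two lists of sentences using character lengths only."""
    # Allowed groupings and their assumed penalties (not Gale-Church values).
    beads = {(1, 1): 0.0, (1, 2): 0.3, (2, 1): 0.3}
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue  # prefix pair not reachable
            for (di, dj), penalty in beads.items():
                if i + di > n or j + dj > m:
                    continue
                len_s = sum(len(s) for s in src[i:i + di])
                len_t = sum(len(t) for t in tgt[j:j + dj])
                mismatch = abs(len_s - len_t) / max(len_s, len_t, 1)
                candidate = cost[i][j] + mismatch + penalty
                if candidate < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = candidate
                    back[i + di][j + dj] = (di, dj)
    if cost[n][m] == inf:
        raise ValueError("no alignment with the allowed groupings")
    # Follow the back-pointers from the end to recover the best path.
    result, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        result.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return result[::-1]


english = [
    "The committee met on Monday.",
    "It approved the budget after a long debate that lasted into the night.",
]
german = [
    "Der Ausschuss tagte am Montag.",
    "Er billigte den Haushalt nach einer langen Debatte.",
    "Die Debatte dauerte bis in die Nacht.",
]
alignment = align_sentences(english, german)
print(alignment)  # [([0], [0]), ([1], [1, 2])]
```

Here the long second English sentence is matched with two German sentences, because their combined length is much closer to it than either one alone.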

Word alignment

Sentence alignment is usually either provided by the corpus or obtained by the aforementioned Gale-Church alignment algorithm. To learn, for example, the translation model, however, we need to know which words align in a source-target sentence pair. Approaches include the IBM models and the HMM approach.
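
To make the idea concrete, the following is a minimal sketch of expectation-maximization training in the style of IBM Model 1, the simplest of the IBM models. It estimates lexical translation probabilities t(f|e) from sentence pairs alone. The sketch omits the empty (NULL) word that the full model includes, and the three-sentence corpus and function name are assumptions made for illustration.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """Estimate t(f | e) by EM from (source_tokens, target_tokens) pairs."""
    src_vocab = {f for fs, _ in corpus for f in fs}
    # Start from a uniform distribution over source words.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned
        # E-step: distribute each source word over the target words.
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: re-estimate t(f | e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Toy German-English corpus, invented for illustration.
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_model1(corpus)
print(t[("Haus", "house")] > t[("das", "house")])  # True
```

Although "das" and "Haus" both co-occur with "house" exactly once, "das" is better explained by "the" in the other sentence pairs, so the probability mass for "house" shifts towards "Haus" over the iterations. A word alignment can then be read off by linking each source word to the target word with the highest t(f|e).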

One problem is posed by function words that have no clear equivalent in the target language. For example, when translating the sentence "John does not live here" from English to German, the word "does" doesn't have a clear alignment in the translated sentence "John wohnt hier nicht." Through logical reasoning, it may be aligned with the words "wohnt" (as in English it contains grammatical information for the word "live") or "nicht" (as it only appears in the sentence because it is negated) or it may be unaligned.[14]

Statistical anomalies

Real-world training sets may override translations of, say, proper nouns. An example would be that "I took the train to Berlin" gets mistranslated as "I took the train to Paris" due to an abundance of "train to Paris" in the training set.


Idioms

Depending on the corpora used, idioms may not translate "idiomatically". For example, using Canadian Hansard as the bilingual corpus, "hear" may almost invariably be translated to "Bravo!" since in Parliament "Hear, Hear!" becomes "Bravo!".[16]

This problem is connected with word alignment, as in very specific contexts the idiomatic expression may align with words that result in an idiomatic expression of the same meaning in the target language. However, this is unlikely, as the alignment usually doesn't work in any other context. For that reason, idioms should only be subjected to phrasal alignment, as they cannot be decomposed further without losing their meaning. This problem is therefore specific to word-based translation.[14]

Different word orders

Word order differs between languages. Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for instance, of SVO or VSO languages. There are also additional differences in word order, for instance, where modifiers for nouns are located, or where the same words are used as a question or a statement.

In speech recognition, the speech signal and the corresponding textual representation can be mapped to each other in blocks in order. This is not always the case with the same text in two languages. For SMT, the machine translator can only manage small sequences of words, and word order has to be taken into account by the program designer. Attempts at solutions have included re-ordering models, where a distribution of location changes for each item of translation is guessed from aligned bi-text. Different location changes can be ranked with the help of the language model and the best can be selected.
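
One very simple way to estimate such a distribution of location changes is to count, over all alignment links in the bi-text, how far each word moved between its source and target positions. The sketch below does this by relative frequency; the function name and the two toy alignments are assumptions, and real re-ordering models are usually conditioned on more context than a single global histogram.

```python
from collections import Counter


def displacement_distribution(alignments):
    """Relative frequency of (target position - source position) over all links."""
    counts = Counter(t - s for links in alignments for (s, t) in links)
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())}


# Word alignments as (source_index, target_index) links, one list per sentence pair.
alignments = [
    [(0, 0), (1, 2), (2, 1)],  # e.g. "la maison bleue" / "the blue house"
    [(0, 0), (1, 1)],          # a pair with identical word order
]
distribution = displacement_distribution(alignments)
print(distribution)  # {-1: 0.2, 0: 0.6, 1: 0.2}
```

A decoder could use the estimated probabilities to penalize unlikely jumps, leaving the final choice between the remaining orderings to the language model.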

Recently, Skype voice communicator started testing speech translation.[17] However, machine translation is following technological trends in speech at a slower rate than speech recognition. In fact, some ideas from speech recognition research have been adopted by statistical machine translation.[18]

Out-of-vocabulary (OOV) words

SMT systems typically store different word forms as separate symbols without any relation to each other, so word forms or phrases that were not in the training data cannot be translated. This might be because of a lack of training data, changes in the human domain where the system is used, or differences in morphology.
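
The effect of treating word forms as unrelated symbols can be seen in a small sketch that checks an input sentence against the training vocabulary. The training sentence, the input and the helper name are invented for illustration.

```python
def find_oov(sentence, vocabulary):
    """Return the tokens of a sentence that are missing from the vocabulary."""
    return [word for word in sentence.split() if word not in vocabulary]


# Vocabulary collected from hypothetical training data.
training_sentences = ["the house is small"]
vocabulary = {word for s in training_sentences for word in s.split()}

oov = find_oov("the houses are small", vocabulary)
print(oov)  # ['houses', 'are']
```

Although "house" was seen in training, the plural "houses" is a different symbol and is reported as unknown, which is the morphology-related case mentioned above.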

Mobile devices

The rapid increase in the computing power of tablets and smartphones, combined with the wide availability of high-speed mobile Internet access, makes it possible for them to run machine translation systems. Experimental systems have already been developed to assist foreign health workers in developing countries. Similar systems are already available on the market. For example, Apple’s iOS 8 allows users to dictate text messages. A built-in ASR system recognizes the speech and the recognition results are edited by an online system.[19]

Projects such as Universal Speech Translation Advanced Research (U-STAR, a continuation of the A-STAR project) and EU-BRIDGE are currently conducting research in translation of full sentences recognized from spoken language. Recent years have seen a growing interest in combining speech recognition, machine translation and speech synthesis. To achieve speech-to-speech translation, n-best lists are passed from the ASR to the statistical machine translation system. However, combining those systems raises problems of how to achieve sentence segmentation, de-normalization and punctuation prediction needed for quality translations.[20]

Notes and references

  1. ^ Philipp Koehn (2009). Statistical Machine Translation. Cambridge University Press. p. 27. ISBN 978-0521874151. Retrieved 22 March 2015. Statistical machine translation is related to other data-driven methods in machine translation, such as the earlier work on example-based machine translation. Contrast this to systems that are based on hand-crafted rules.
  2. ^ W. Weaver (1955). Translation (1949). In: Machine Translation of Languages, MIT Press, Cambridge, MA.
  3. ^ P. Brown; John Cocke; S. Della Pietra; V. Della Pietra; Frederick Jelinek; Robert L. Mercer; P. Roossin (1988). "A statistical approach to language translation". Coling'88. Association for Computational Linguistics. 1: 71–76. Retrieved 22 March 2015.
  4. ^ P. Brown; John Cocke; S. Della Pietra; V. Della Pietra; Frederick Jelinek; John D. Lafferty; Robert L. Mercer; P. Roossin (1990). "A statistical approach to machine translation". Computational Linguistics. MIT Press. 16 (2): 79–85. Retrieved 22 March 2015.
  5. ^ P. Brown; S. Della Pietra; V. Della Pietra; R. Mercer (1993). "The mathematics of statistical machine translation: parameter estimation". Computational Linguistics. MIT Press. 19 (2): 263–311. Retrieved 22 March 2015.
  6. ^ S. Vogel, H. Ney and C. Tillmann. 1996. HMM-based Word Alignment in Statistical Translation. In COLING ’96: The 16th International Conference on Computational Linguistics, pp. 836-841, Copenhagen, Denmark.
  7. ^ a b Och, Franz Josef; Ney, Hermann (2003). "A Systematic Comparison of Various Statistical Alignment Models". Computational Linguistics. 29: 19–51. doi:10.1162/089120103321337421.
  8. ^ P. Koehn, F.J. Och, and D. Marcu (2003). Statistical phrase based translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT/NAACL).
  9. ^ a b D. Chiang (2005). A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05).
  10. ^ Zhou, Sharon (July 25, 2018). "Has AI surpassed humans at translation? Not even close!". Skynet Today. Retrieved 2 August 2018.
  11. ^ P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007, Demonstration Session, Prague, Czech Republic
  12. ^ Q. Gao, S. Vogel, "Parallel Implementations of Word Alignment Tool", Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49-57, June, 2008
  13. ^ Philipp Koehn, Franz Josef Och, Daniel Marcu: Statistical Phrase-Based Translation (2003)
  14. ^ a b c d Koehn, Philipp (2010). Statistical Machine Translation. Cambridge University Press. ISBN 978-0-521-87415-1.
  15. ^ a b Philip Williams; Rico Sennrich; Matt Post; Philipp Koehn (1 August 2016). Syntax-based Statistical Machine Translation. Morgan & Claypool Publishers. ISBN 978-1-62705-502-4.
  16. ^ W. J. Hutchins and H. Somers. (1992). An Introduction to Machine Translation, 18.3:322. ISBN 978-0-12-362830-5
  17. ^ Skype Translator Preview
  18. ^ Wołk, K.; Marasek, K. (2014-04-07). "Real-Time Statistical Speech Translation". Advances in Intelligent Systems and Computing. Springer. 275: 107–114. arXiv:1509.09090. doi:10.1007/978-3-319-05951-8_11. ISBN 978-3-319-05950-1. ISSN 2194-5357. S2CID 15361632.
  19. ^ Wołk K.; Marasek K. (2014). Polish-English Speech Statistical Machine Translation Systems for the IWSLT 2014. Proceedings of the 11th International Workshop on Spoken Language Translation, Lake Tahoe, USA.
  20. ^ Wołk K.; Marasek K. (2013). Polish-English Speech Statistical Machine Translation Systems for the IWSLT 2013. Proceedings of the 10th International Workshop on Spoken Language Translation, Heidelberg, Germany. pp. 113–119. arXiv:1509.09097.
  21. ^ Turovsky, Barak (2016-11-15). "Found in translation: More accurate, fluent sentences in Google Translate". Google. Retrieved 2019-10-03.
  22. ^ "Machine Translation". Microsoft Translator for Business. Retrieved 2019-10-03.
  23. ^ Vashee, Kirti (2016-12-22). "SYSTRAN's Continuing Neural MT Evolution". eMpTy Pages. Retrieved 2019-10-03.
  24. ^ "One model is better than two. Yandex.Translate launches a hybrid machine translation system". Yandex Blog. 2017-09-14. Retrieved 2019-10-03.
