Semantic spaces[note 1] in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: Vocabulary mismatch (the fact that the same meaning can be expressed in many ways) and ambiguity of natural language (the fact that the same term can have several meanings).
The application of semantic spaces in natural language processing (NLP) aims at overcoming limitations of rule-based or model-based approaches operating on the keyword level. The main drawback with these approaches is their brittleness, and the large manual effort required to create either rule-based NLP systems or training corpora for model learning. Rule-based and machine learning based models are fixed on the keyword level and break down if the vocabulary differs from that defined in the rules or from the training material used for the statistical models.
Research in semantic spaces dates back more than 20 years. In 1996, two papers were published that raised a lot of attention around the general idea of creating semantic spaces: latent semantic analysis and Hyperspace Analogue to Language. However, their adoption was limited by the large computational effort required to construct and use those semantic spaces. A breakthrough with regard to the accuracy of modelling associative relations between words (e.g. "spider-web", "lighter-cigarette", as opposed to synonymous relations such as "whale-dolphin", "astronaut-driver") was achieved by explicit semantic analysis (ESA) in 2007. ESA was a novel (non-machine learning) based approach that represented words in the form of vectors with 100,000 dimensions (where each dimension represents an Article in Wikipedia). However practical applications of the approach are limited due to the large number of required dimensions in the vectors.
More recently, advances in neural network techniques in combination with other new approaches (tensors) led to a host of new recent developments: Word2vec from Google, GloVe from Stanford University, and fastText from Facebook AI Research (FAIR) labs.
- also referred to as distributed semantic spaces or distributed semantic memory
- Baroni, Marco; Lenci, Alessandro (2010). "Distributional Memory: A General Framework for Corpus-Based Semantics". Computational Linguistics. 36 (4): 673–721. CiteSeerX 10.1.1.331.3769. doi:10.1162/coli_a_00016. S2CID 5584134.
- Scott C. Deerwester; Susan T. Dumais; Thomas K. Landauer; George W. Furnas; Richard A. Harshen (1990). "Indexing by Latent Semantic Analysis" (PDF). Journal of the American Society for Information Science.
- Xing Wei; W. Bruce Croft (2007). "Investigating retrieval performance with manually-built topic models". Proceeding RIAO '07 Large Scale Semantic Access to Content (Text, Image, Video, and Sound). Riao '07: 333–349.
- "LSA: A Solution to Plato's Problem". lsa.colorado.edu. Retrieved 2016-04-19.
- Lund, Kevin; Burgess, Curt (1996-06-01). "Producing high-dimensional semantic spaces from lexical co-occurrence". Behavior Research Methods, Instruments, & Computers. 28 (2): 203–208. doi:10.3758/BF03204766. ISSN 0743-3808.
- Evgeniy Gabrilovich & Shaul Markovitch (2007). "Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis" (PDF). Proc. 20th Int'l Joint Conf. On Artificial Intelligence (IJCAI). Pp. 1606–1611.
- Tomas Mikolov; Ilya Sutskever; Kai Chen; Greg Corrado; Jeffrey Dean (2013). "Distributed Representations of Words and Phrases and their Compositionality". arXiv:1310.4546 [cs.CL].
- Jeffrey Pennington; Richard Socher; Christopher D. Manning (2014). "GloVe: Global Vectors for Word Representation" (PDF).
- Mannes, John. "Facebook's fastText library is now optimized for mobile". TechCrunch. Retrieved 12 January 2018.