Cosine similarity

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1. From the latter definition, it follows that the cosine similarity depends only on the angle between the two non-zero vectors, but not on their magnitudes. The cosine similarity is bounded in the interval for any angle . For example, two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at right angle relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1. The cosine similarity is particularly used in positive space, where the outcome is neatly bounded in . The name derives from the term "direction cosine": in this case, unit vectors are maximally "similar" if they're parallel and maximally "dissimilar" if they're orthogonal (perpendicular). This is analogous to the cosine, which is unity (maximum value) when the segments subtend a zero angle and zero (uncorrelated) when the segments are perpendicular.

These bounds apply for any number of dimensions, and the cosine similarity is most commonly used in high-dimensional positive spaces. For example, in information retrieval and text mining, each term is notionally assigned a different dimension and a document is characterised by a vector where the value in each dimension corresponds to the number of times the term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.[1]

The technique is also used to measure cohesion within clusters in the field of data mining.[2]

One advantage of cosine similarity is its low-complexity, especially for sparse vectors: only the non-zero dimensions need to be considered.

Other names of cosine similarity are Orchini similarity and the Tucker coefficient of congruence; Otsuka-Ochiai similarity (see below) is cosine similarity applied to binary data.

DefinitionEdit

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

 

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

 

where   and   are components of vector   and   respectively.

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating orthogonality or decorrelation, while in-between values indicate intermediate similarity or dissimilarity.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison.

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using tf–idf weights. The angle between two term frequency vectors cannot be greater than 90°.

If the attribute vectors are normalized by subtracting the vector means (e.g.,  ), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering,  

The term cosine distance is commonly used for the complement of cosine similarity in positive space, that is

 

It is important to note, however, that the cosine distance is not a proper distance metric as it does not have the triangle inequality property—or, more formally, the Schwarz inequality—and it violates the coincidence axiom; to repair the triangle inequality property while maintaining the same ordering, it is necessary to convert to angular distance. Alternatively, the triangular inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.

Angular distance and similarityEdit

The normalized angle, referred to as angular distance, between any two vectors   and   is a formal distance metric and can be calculated from the cosine similarity.[3] The complement of the angular distance metric can then be used to define angular similarity function bounded between 0 and 1, inclusive.

When the vector elements may be positive or negative:

 
 

Or, if the vector elements are always positive:

 
 

Unfortunately, computing the arcus cosinus function is rather slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above.

 -normalised Euclidean distanceEdit

Another effective proxy for cosine distance can be obtained by   normalisation of the vectors, followed by the application of normal Euclidean distance. Using this technique each term in each vector is first divided by the magnitude of the vector, yielding a vector of unit length. Then, it is clear, the Euclidean distance over the end-points of any two vectors is a proper metric which gives the same ordering as the cosine distance for any comparison of vectors, and furthermore avoids the potentially expensive trigonometric operations required to yield a proper metric. Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard dimensionality reduction techniques. This normalised form distance is often used within many deep learning algorithms.

Otsuka-Ochiai coefficientEdit

In biology, there is a similar concept known as the Otsuka-Ochiai coefficient named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka,[4] Japanese: 大塚 弥之助)[5] and Akira Ochiai (Japanese: 落合 明),[6] also known as the Ochiai-Barkman[7] or Ochiai coefficient,[8] which can be represented as:

 

Here,   and   are sets, and   is the number of elements in  . If sets are represented as bit vectors, the Otsuka-Ochiai coefficient can be seen to be the same as the cosine similarity.

In a recent book,[9] the coefficient is misattributed to another Japanese researcher with the family name Otsuka. The confusion arises because in 1957 Akira Ochiai attributes the coefficient only to Otsuka (no first name mentioned)[6] by citing an article by Ikuso Hamai (Japanese: 浜井 生三),[10] who in turn cites the original 1936 article by Yanosuke Otsuka.[5]

PropertiesEdit

The most noteworthy property of Cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any constant   and vector  , the vectors   and   are maximally similar. The measure is thus most appropriate for data where frequency is more important than absolute values; notably, term frequency in documents. However more recent metrics with a grounding in information theory, such as Jensen-Shannon, SED, and Triangular Divergence have been shown to have improved semantics in at least some contexts. [11]

Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual  , and observe that

  (polarization identity)

by expansion. When A and B are normalized to unit length,   so this expression is equal to

 

The Euclidean distance is called the chord distance (because it is the length of the chord on the unit circle) and it is the Euclidean distance between the vectors which were normalized to unit sum of squared values within them.

Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors. This distribution has a mean of zero and a variance of   (where   is the number of dimensions), and although the distribution is bounded between -1 and +1, as   grows large the distribution is increasingly well-approximated by the normal distribution.[12][13] Other types of data such as bitstreams, which only take the values 0 or 1, the null distribution takes a different form and may have a nonzero mean.[14]

Triangle inequality for cosine similarityEdit

The ordinary triangle inequality for angles (i.e., arc lengths on a unit hypersphere) gives us that

 

Because the cosine function decreases as an angle in [0, π] radians increases, the sense of these inequalities is reversed when we take the cosine of each value:

 

Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines,

 
 

This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if the similarities to a reference object C is already known. This is used for example in metric data indexing, but has also been used to accelerate spherical k-means clustering[15] the same way the Euclidean triangle inequality has been used to accelerate regular k-means.

Soft cosine measureEdit

A soft cosine or ("soft" similarity) between two vectors considers similarities between pairs of features.[16] The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in VSM, which help generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity.

For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive. Features such as words, n-grams, or syntactic n-grams[17] can be quite similar, though formally they are considered as different features in the VSM. For example, words “play” and “game” are different words and thus mapped to different points in VSM; yet they are semantically related. In case of n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well).

For calculating soft cosine, the matrix s is used to indicate similarity between features. It can be calculated through Levenshtein distance, WordNet similarity, or other similarity measures. Then we just multiply by this matrix.

Given two N-dimension vectors   and  , the soft cosine similarity is calculated as follows:

 

where sij = similarity(featurei, featurej).

If there is no similarity between features (sii = 1, sij = 0 for ij), the given equation is equivalent to the conventional cosine similarity formula.

The time complexity of this measure is quadratic, which makes it applicable to real-world tasks. Note that the complexity can be reduced to subquadratic.[18] An efficient implementation of such soft cosine similarity is included in the Gensim open source library.

See alsoEdit

ReferencesEdit

  1. ^ Singhal, Amit (2001). "Modern Information Retrieval: A Brief Overview". Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24 (4): 35–43.
  2. ^ P.-N. Tan, M. Steinbach & V. Kumar, Introduction to Data Mining, Addison-Wesley (2005), ISBN 0-321-32136-7, chapter 8; page 500.
  3. ^ "COSINE DISTANCE, COSINE SIMILARITY, ANGULAR COSINE DISTANCE, ANGULAR COSINE SIMILARITY". www.itl.nist.gov. Retrieved 2020-07-11.
  4. ^ Omori, Masae (2004). "Geological idea of Yanosuke Otuka, who built the foundation of neotectonics (geoscientist)". Earth Science. 58 (4): 256–259. doi:10.15080/agcjchikyukagaku.58.4_256.
  5. ^ a b Otsuka, Yanosuke (1936). "The faunal character of the Japanese Pleistocene marine Mollusca, as evidence of the climate having become colder during the Pleistocene in Japan". Bulletin of the Biogeographical Society of Japan. 6 (16): 165–170.
  6. ^ a b Ochiai, Akira (1957). "Zoogeographical studies on the soleoid fishes found in Japan and its neighhouring regions-II". Bulletin of the Japanese Society of Scientific Fisheries. 22 (9): 526–530. doi:10.2331/suisan.22.526.
  7. ^ Barkman, Jan J. (1958). Phytosociology and Ecology of Cryptogamic Epiphytes: Including a Taxonomic Survey and Description of Their Vegetation Units in Europe. Assen: Van Gorcum.
  8. ^ H. Charles Romesburg (1984). Cluster Analysis for Researchers. Belmont, California: Lifetime Learning Publications. p. 149.
  9. ^ Howarth, Richard J. (2017). Dictionary of Mathematical Geosciences: With Historical Notes. Cham, Switzerland: Springer. p. 421. doi:10.1007/978-3-319-57315-1. ISBN 978-3-319-57314-4.
  10. ^ Hamai, Ikuso (1955). "Stratification of community by means of "community coefficient" (continued)". Japanese Journal of Ecology. 5 (1): 41–45. doi:10.18960/seitai.5.1_41.
  11. ^ Connor, Richard (2016). A Tale of Four Metrics. Similarity Search and Applications. Tokyo: Springer.
  12. ^ Spruill, Marcus C. (2007). "Asymptotic distribution of coordinates on high dimensional spheres". Electronic Communications in Probability. 12: 234–247. doi:10.1214/ECP.v12-1294.
  13. ^ "Distribution of dot products between two random unit vectors in RD". CrossValidated.
  14. ^ Graham L. Giller (2012). "The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity". Giller Investments Research Notes (20121024/1). doi:10.2139/ssrn.2167044.
  15. ^ Schubert, Erich; Lang, Andreas; Feher, Gloria (2021). Reyes, Nora; Connor, Richard; Kriege, Nils; Kazempour, Daniyal; Bartolini, Ilaria; Schubert, Erich; Chen, Jian-Jia (eds.). "Accelerating Spherical k-Means". Similarity Search and Applications. Lecture Notes in Computer Science. Cham: Springer International Publishing: 217–231. doi:10.1007/978-3-030-89657-7_17. ISBN 978-3-030-89657-7.
  16. ^ Sidorov, Grigori; Gelbukh, Alexander; Gómez-Adorno, Helena; Pinto, David (29 September 2014). "Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model". Computación y Sistemas. 18 (3): 491–504. doi:10.13053/CyS-18-3-2043. Retrieved 7 October 2014.
  17. ^ Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana (2013). Advances in Computational Intelligence. Lecture Notes in Computer Science. 7630. LNAI 7630. pp. 1–11. doi:10.1007/978-3-642-37798-3_1. ISBN 978-3-642-37798-3.
  18. ^ Novotný, Vít (2018). Implementation Notes for the Soft Cosine Measure. The 27th ACM International Conference on Information and Knowledge Management. Torun, Italy: Association for Computing Machinery. pp. 1639–1642. arXiv:1808.09407. doi:10.1145/3269206.3269317. ISBN 978-1-4503-6014-2.

External linksEdit