Perplexity

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.^[1]

Perplexity of a probability distribution

The perplexity PP of a discrete probability distribution p is a concept widely used in information theory, machine learning, and statistical modeling. It is defined as

{\mathit {PP}}(p):=2^{H(p)}=2^{-\sum _{x}p(x)\log _{2}p(x)}=\prod _{x}p(x)^{-p(x)}

where H(p) is the entropy (in bits) of the distribution, and x ranges over the events. The base of the logarithm need not be 2: The perplexity is independent of the base, provided that the entropy and the exponentiation use the same base. In some contexts, this measure is also referred to as the (order-1 true) diversity.
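The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `perplexity` and the example distribution are assumptions made here. It computes the perplexity with two different logarithm bases and with the product form, which should all agree.

```python
import math

def perplexity(probs, base=2.0):
    """Perplexity of a discrete distribution given as a sequence of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Entropy in the chosen base; zero-probability events contribute nothing.
    entropy = -sum(p * math.log(p, base) for p in probs if p > 0)
    # Exponentiate with the same base that was used for the entropy.
    return base ** entropy

# Hypothetical example distribution (entropy is 1.75 bits).
p = [0.5, 0.25, 0.125, 0.125]

pp_bits = perplexity(p, base=2.0)                  # entropy measured in bits
pp_nats = perplexity(p, base=math.e)               # entropy measured in nats
pp_prod = math.prod(x ** -x for x in p if x > 0)   # product form of the definition
```

All three values coincide (up to floating-point error) at 2^1.75, illustrating that the choice of base does not matter as long as it is used consistently.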

Perplexity of a random variable X may be defined as the perplexity of the distribution over its possible values x. It can be thought of as a measure of uncertainty or "surprise" related to the outcomes.

For the special case of a distribution p in which exactly k outcomes each have probability 1/k and all other outcomes have probability zero, the product can be evaluated simply and the perplexity equals k. For example, this is the case when p models a fair k-sided die, i.e. a uniform distribution over k discrete events. In this sense, a random variable with perplexity k has the same uncertainty as a fair k-sided die. One is said to be "k-ways perplexed" about the value of the random variable. Unless it is a fair k-sided die, more than k values may be possible, but the overall uncertainty is not greater because some values may have a probability greater than 1/k.
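The uniform case can be worked out directly from the product form of the definition. In the sketch below, the index i over the k equally likely outcomes is notation introduced here, and zero-probability outcomes are taken to contribute a factor of 1 (the usual convention 0^0 = 1).

```latex
\mathit{PP}(p)
  = \prod_{x} p(x)^{-p(x)}
  = \prod_{i=1}^{k} \left(\frac{1}{k}\right)^{-1/k}
  = \left(k^{1/k}\right)^{k}
  = k
```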

Perplexity is sometimes used as a measure of the difficulty of a prediction problem. It is, however, generally not a straightforward representation of the relevant probability. For example, if there are two choices, one with probability 0.9, the chance of a correct guess using the optimal strategy is 90 percent. Yet the perplexity is $2^{-0.9\log _{2}0.9-0.1\log _{2}0.1}\approx 1.38$. The inverse of the perplexity, 1/1.38 ≈ 0.72, does not correspond to the 0.9 probability.
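The two-choice example can be reproduced numerically. The following is only a quick check of the arithmetic above; the variable names are chosen here for illustration.

```python
import math

p = [0.9, 0.1]  # the two-choice distribution from the example

entropy_bits = -sum(x * math.log2(x) for x in p)
pp = 2 ** entropy_bits          # perplexity of the distribution

best_guess_accuracy = max(p)    # always guess the more likely outcome
inverse_pp = 1 / pp             # what a naive reading of perplexity would suggest

print(f"perplexity = {pp:.2f}")
print(f"1/perplexity = {inverse_pp:.2f}, best-guess accuracy = {best_guess_accuracy:.2f}")
```

The gap between 1/perplexity and the best-guess accuracy is the point of the example: perplexity summarizes the whole distribution, not just its most likely outcome.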

The perplexity is the exponentiation of the entropy, a more straightforward quantity. Entropy measures the expected or "average" number of bits required to encode the outcome of the random variable using an optimal variable-length code. It can also be regarded as the expected information gain from learning the outcome of the random variable, providing insight into the uncertainty and complexity of the underlying probability distribution.

Perplexity of a probability model

A model of an unknown probability distribution p may be proposed based on a training sample that was drawn from p. Given a proposed probability model q, one may evaluate q by asking how well it predicts a separate test sample x₁, x₂, ..., x_N also drawn from p. The perplexity of the model q is defined as

b^{-{\frac {1}{N}}\sum _{i=1}^{N}\log _{b}q(x_{i})}=\left(\prod _{i}q(x_{i})\right)^{-1/N}

where $b$ is customarily 2. Better models q of the unknown distribution p will tend to assign higher probabilities q(x_i) to the test events. Thus, they have lower perplexity: they are less surprised by the test sample.

The exponent above may be regarded as the average number of bits needed to represent a test event x_i if one uses an optimal code based on q. Low-perplexity models do a better job of compressing the test sample, requiring few bits per test element on average because q(x_i) tends to be high.
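As a rough sketch of this definition, the function below evaluates a model q, represented here as a plain dictionary from events to probabilities, on a test sample. The function name, the dictionary representation, and the two example models are assumptions made for illustration; returning infinity for an event with zero model probability is likewise a choice made here.

```python
import math

def model_perplexity(q, sample, base=2.0):
    """Perplexity of model q (event -> probability) on a test sample."""
    n = len(sample)
    if n == 0:
        raise ValueError("test sample must not be empty")
    log_sum = 0.0
    for x in sample:
        prob = q.get(x, 0.0)
        if prob <= 0.0:
            # An event the model considers impossible makes the perplexity infinite.
            return math.inf
        log_sum += math.log(prob, base)
    # base ** (average negative log-probability per test event)
    return base ** (-log_sum / n)

# Hypothetical test sample and two candidate models.
test_sample = ["a", "a", "b", "c"]
q_good = {"a": 0.5, "b": 0.25, "c": 0.25}
q_uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

pp_good = model_perplexity(q_good, test_sample)
pp_uniform = model_perplexity(q_uniform, test_sample)
```

Here `q_good` assigns higher probabilities to the observed events than `q_uniform` does, so it obtains the lower perplexity.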

The exponent $-{\tfrac {1}{N}}\sum _{i=1}^{N}\log _{b}q(x_{i})$ may also be interpreted as a cross-entropy:

H({\tilde {p}},q)=-\sum _{x}{\tilde {p}}(x)\log _{b}q(x)

where ${\tilde {p}}$ denotes the empirical distribution of the test sample (i.e., ${\tilde {p}}(x)=n/N$ if x appeared n times in the test sample of size N).

By the definition of KL divergence, it is also equal to $H({\tilde {p}})+D_{KL}({\tilde {p}}\|q)$, which is $\geq H({\tilde {p}})$. Consequently, the perplexity is minimized when $q={\tilde {p}}$.
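The decomposition of the cross-entropy into entropy plus KL divergence can be checked numerically. The sketch below uses hypothetical helper names (`empirical`, `entropy`, `cross_entropy`, `kl_divergence`) and a made-up sample, and assumes that q assigns positive probability to every event that occurs in the test sample.

```python
import math
from collections import Counter

def empirical(sample):
    """Empirical distribution: relative frequency of each event in the sample."""
    n = len(sample)
    return {x: count / n for x, count in Counter(sample).items()}

def entropy(p):
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    # Assumes q[x] > 0 for every x with p[x] > 0.
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

sample = ["a", "a", "b", "c"]            # hypothetical test sample
p_tilde = empirical(sample)              # {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.4, "b": 0.4, "c": 0.2}       # hypothetical model

h_pq = cross_entropy(p_tilde, q)
decomposed = entropy(p_tilde) + kl_divergence(p_tilde, q)

pp_model = 2 ** h_pq                              # perplexity of q on the sample
pp_best = 2 ** cross_entropy(p_tilde, p_tilde)    # perplexity when q equals the empirical distribution
```

Because the KL divergence is non-negative, `pp_best` is a lower bound on `pp_model` for this sample.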

Perplexity per word

In natural language processing, a corpus is a set of sentences or texts, and a language model is a probability distribution over entire sentences or texts. Consequently, in NLP, the more commonly used measure is perplexity per word, defined as:

\left(\prod _{i=1}^{n}q(s_{i})\right)^{-1/N}

where $s_{1},...,s_{n}$ are the $n$ sentences in the corpus, but $N$ is the number of words in the corpus. This normalizes the perplexity by the length of the text, allowing for more meaningful comparisons between different texts or models.

Suppose the average sentence $s_{i}$ in the corpus has a probability of $2^{-190}$ according to the language model. This would give a model perplexity of $2^{190}$ per sentence. However, it is more common to normalize for sentence length. Thus, if the test sample's sentences comprised a total of 1,000 words, and could be coded using 7.95 bits per word, one could report a model perplexity of $2^{7.95}\approx 247$ per word. In other words, the model is as confused on test data as if it had to choose uniformly and independently among 247 possibilities for each word.
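The per-word normalization can be illustrated with a short sketch. It assumes that the language model has already produced a base-2 log-probability for each sentence; the function name and the toy numbers are hypothetical and are not taken from any real corpus.

```python
import math

def perplexity_per_word(sentence_log2_probs, word_counts):
    """Per-word perplexity from per-sentence log2 probabilities and sentence lengths."""
    if len(sentence_log2_probs) != len(word_counts):
        raise ValueError("need one word count per sentence")
    total_words = sum(word_counts)            # N: number of words in the corpus
    total_log2 = sum(sentence_log2_probs)     # log2 of the product of sentence probabilities
    bits_per_word = -total_log2 / total_words
    return 2 ** bits_per_word

# Hypothetical corpus of three sentences.
log2_probs = [-40.0, -24.0, -16.0]   # log2 q(s_i) for each sentence
lengths = [5, 3, 2]                  # words per sentence

ppw = perplexity_per_word(log2_probs, lengths)   # 80 bits over 10 words

# The figure quoted in the text: 7.95 bits per word.
quoted = 2 ** 7.95
```

Dividing by the number of words rather than the number of sentences is what makes the result comparable across corpora with different sentence lengths.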

There are two standard evaluation metrics for language models: perplexity and word error rate (WER). The simpler of these measures, WER, is the ratio of erroneously recognized words E (deletions, insertions, substitutions) to the total number of words N in a speech recognition task, expressed as a percentage, i.e.

\mathrm {WER} =\left({\frac {E}{N}}\right)\times 100\%
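A minimal sketch of a WER computation follows. The text above does not say how E is obtained; counting errors as the minimum number of word-level substitutions, deletions and insertions (a Levenshtein alignment) is an assumption made here, as are the function names and the example sentences.

```python
def word_errors(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions between word lists."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate in percent, relative to the number of reference words N."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_errors(ref_words, hyp_words) / len(ref_words)

# Hypothetical recognizer output: one substitution and one deletion.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Note that WER computed this way can exceed 100 percent when the hypothesis contains many insertions.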

The second metric, perplexity (per word), is an information-theoretic measure that evaluates the similarity of a proposed model m to the original distribution p. It can be computed as the inverse of the (geometric) average probability of the test set T:

$\mathit {PPL}(T)={\sqrt[{N}]{1 \over m(T)}}=2^{-{\frac {1}{N}}\log _{2}m(T)}$

where N is the number of words in test set T. This expression can be seen as exponentiated cross-entropy, where the cross-entropy H(p;m) is approximated as

$H(p;m)=-{\frac {1}{N}}\log _{2}m(T)$
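The step from the N-th root to the exponentiated form can be spelled out as follows, writing \log_2 for the base-2 logarithm and assuming m(T) > 0:

```latex
\mathit{PPL}(T)
  = \sqrt[N]{\frac{1}{m(T)}}
  = m(T)^{-1/N}
  = 2^{\log_{2}\left(m(T)^{-1/N}\right)}
  = 2^{-\frac{1}{N}\log_{2} m(T)}
  = 2^{H(p;m)}
```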

In many ways, WER is a better metric, as any improvement on language modeling benchmarks is meaningful only if it translates into improvements in automatic speech recognition (ASR) or machine translation. The problem with WER is that it needs a complete ASR pipeline to evaluate. Also, almost all benchmarking datasets are behind a paywall and hence not readily available for evaluation.

Recent advances in language modeling

Since 2007, significant advancements in language modeling have emerged, particularly with the advent of deep learning techniques. Perplexity per word, a measure that quantifies the predictive power of a language model, has remained central to evaluating models such as transformers, BERT, GPT-4 and others. The measure has been used to compare different models on the same dataset and to guide the optimization of hyperparameters, although it has been found sensitive to factors such as linguistic features and sentence length.^[2] Despite its pivotal role in language model development, perplexity has shown limitations, particularly as an inadequate predictor of speech recognition performance, where it may not correlate well with word error rates,^[3]^[4] raising questions about its accuracy.

Brown Corpus

The lowest perplexity that had been published on the Brown Corpus (1 million words of American English of varying topics and genres) as of 1992 was about 247 per word, corresponding to a cross-entropy of log₂247 = 7.95 bits per word or 1.75 bits per letter,^[5] using a trigram model. While this figure represented the state of the art at the time, advancements in techniques such as deep learning have led to significant improvements in perplexity on other benchmarks, such as the One Billion Word Benchmark.^[6]

In the context of the Brown Corpus, simply guessing that the next word is "the" will achieve an accuracy of 7 percent, contrasting with the 1/247 = 0.4 percent that might be expected from a naive use of perplexity. This difference underscores the importance of the statistical model used and the nuanced nature of perplexity as a measure of predictiveness.^[7] The guess is based on unigram statistics, not on the trigram statistics that yielded the perplexity of 247, and utilizing trigram statistics would further refine the prediction.
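The figures quoted for the Brown Corpus can be checked with a few lines of arithmetic. The snippet below only restates numbers given above (a perplexity of 247 and a 7 percent accuracy for always guessing "the"); the variable names are illustrative.

```python
import math

perplexity_per_word = 247                        # figure quoted above for the Brown Corpus
bits_per_word = math.log2(perplexity_per_word)   # corresponding cross-entropy in bits

# Naive reading of perplexity: a uniform guess among 247 equally likely words.
naive_accuracy = 1 / perplexity_per_word

# Accuracy of always guessing "the", as quoted above.
the_accuracy = 0.07

ratio = the_accuracy / naive_accuracy

print(f"{bits_per_word:.2f} bits per word")
print(f"naive accuracy: {100 * naive_accuracy:.1f}%, guessing 'the': {100 * the_accuracy:.0f}%")
```

The ratio shows by how much the naive reading of perplexity understates the accuracy of the simple unigram guess.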

References

[1] Jelinek, F.; Mercer, R. L.; Bahl, L. R.; Baker, J. K. (1977). "Perplexity—a measure of the difficulty of speech recognition tasks". The Journal of the Acoustical Society of America. 62 (S1): S63–S63. doi:10.1121/1.2016299. ISSN 0001-4966.

[2] Miaschi, Alessio; Brunato, Dominique; Dell'Orletta, Felice; Venturi, Giulia (2021). "What Makes My Model Perplexed? A Linguistic Investigation on Neural Language Models Perplexity". Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. pp. 40–47. doi:10.18653/v1/2021.deelio-1.5. Archived from the original on 2023-10-24. Retrieved 2023-08-24.

[3] Klakow, Dietrich; Peters, Jochen (2002). "Testing the correlation of word error rate and perplexity". Speech Communication. 38 (1–2): 19–28. doi:10.1016/S0167-6393(01)00041-3. ISSN 0167-6393.

[4] Chen, Stanley F; Beeferman, Douglas; Rosenfeld, Roni (2018). "Evaluation Metrics For Language Models". Carnegie Mellon University.

[5] Brown, Peter F.; et al. (March 1992). "An Estimate of an Upper Bound for the Entropy of English" (PDF). Computational Linguistics. 18 (1). Archived (PDF) from the original on 2021-09-17. Retrieved 2007-02-07.

[6] Jozefowicz, Rafal, et al. "Exploring the limits of language modeling." arXiv preprint arXiv:1602.02410 (2016). Archived 2021-05-04 at the Wayback Machine.

[7] Wilcox, Ethan Gotlieb, et al. "On the predictive power of neural language models for human real-time comprehension behavior." arXiv preprint arXiv:2006.01912 (2020). Archived 2023-08-25 at the Wayback Machine.
