# Likelihood function

In statistics, a likelihood function (often simply a likelihood) is a function ${\displaystyle {\mathcal {L}}(\theta ,y)}$ of parameters ${\displaystyle \theta }$ within the parameter space ${\displaystyle \Theta }$ that describes the probability of obtaining the observed data ${\displaystyle y}$.[1][2] It is proportional—up to a function of only the observed data—to the joint probability distribution of ${\displaystyle y}$ given ${\displaystyle \theta }$. The likelihood principle states that all relevant information for inference about ${\displaystyle \theta }$ is contained in the likelihood function for the observed data given the assumed statistical model.[3][4] The case for using likelihood in the foundation of statistics was first made by the founder of modern statistics, R. A. Fisher,[5] who believed it to be a self-contained framework for statistical modelling and inference. But the likelihood function also plays a fundamental role in frequentist and Bayesian statistics.[6]

If the parameter space ${\displaystyle \Theta }$ is one-dimensional, the likelihood can be graphed as a curve over the horizontal axis spanned by the parameter. In general, for a statistical model with ${\displaystyle k}$ parameters, the likelihood function takes the shape of a ${\displaystyle k}$-dimensional surface sitting above a ${\displaystyle k}$-dimensional hyperplane spanned by the parameter vector ${\displaystyle \theta }$.[7] The peak of that surface identifies the point ${\displaystyle {\hat {\theta }}}$ in the parameter space that maximizes the likelihood, ${\displaystyle {\hat {\theta }}=\arg \max _{\theta \in \Theta }{\mathcal {L}}(\theta ,y)}$; that is the combination of values that are most likely to be the parameters of the joint probability distribution underlying the observed data. The procedure of obtaining ${\displaystyle {\hat {\theta }}}$ is known as maximum likelihood estimation.

Since concavity plays a key role in the maximization, and since most common probability distributions (in particular the exponential family) are only logarithmically concave,[8][9] it is usually more convenient to work with a logarithmic transformation of the likelihood function ${\displaystyle \ell (\theta ,y)=\log \left({\mathcal {L}}(\theta ,y)\right)}$, known as the log-likelihood function. Further, if the log-likelihood function is smooth, its gradient with respect to ${\displaystyle \theta }$, known as the score, exists and allows for the application of differential calculus to find a maximum by determining the roots of the first-order conditions, known as the likelihood equations. The negative of the second derivative evaluated at ${\displaystyle {\hat {\theta }}}$, known as the observed Fisher information, determines the curvature of the likelihood surface,[10] and thus indicates the precision of the estimate.[11]

## Definition

The likelihood function is usually defined differently for discrete and continuous probability distributions. A general definition is also possible, as discussed below.

### Discrete probability distribution

Let ${\displaystyle X}$  be a discrete random variable with probability mass function ${\displaystyle p}$  depending on a parameter ${\displaystyle \theta }$ . Then the function

${\displaystyle {\mathcal {L}}(\theta \mid x)=p_{\theta }(x)=P_{\theta }(X=x),}$

considered as a function of ${\displaystyle \theta }$ , is the likelihood function (of ${\displaystyle \theta }$ ), given the outcome ${\displaystyle x}$  of the random variable ${\displaystyle X}$ . Sometimes the probability of "the value ${\displaystyle x}$  of ${\displaystyle X}$  for the parameter value ${\displaystyle \theta }$  " is written as P(X = x | θ) or P(X = x; θ).

### Continuous probability distribution

Let ${\displaystyle X}$  be a random variable following an absolutely continuous probability distribution with density function ${\displaystyle f}$  depending on a parameter ${\displaystyle \theta }$ . Then the function

${\displaystyle {\mathcal {L}}(\theta \mid x)=f_{\theta }(x),\,}$

considered as a function of ${\displaystyle \theta }$ , is the likelihood function (of ${\displaystyle \theta }$ , given the outcome ${\displaystyle x}$  of ${\displaystyle X}$ ). Sometimes the density function for "the value ${\displaystyle x}$  of ${\displaystyle X}$  for the parameter value ${\displaystyle \theta }$  " is written as ${\displaystyle f(x\mid \theta )}$ ; this should not be confused with ${\displaystyle {\mathcal {L}}(\theta \mid x)}$ , which should not be considered a conditional probability density.

### In general

In measure-theoretic probability theory, the density function is defined as the Radon–Nikodym derivative of the probability distribution relative to a common dominating measure.[12] The likelihood function is that density interpreted as a function of the parameter (possibly a vector), rather than the possible outcomes.[13] This provides a likelihood function for any probability model with all distributions, whether discrete, absolutely continuous, a mixture or something else. (Likelihoods will be comparable, e.g. for parameter estimation, only if they are Radon–Nikodym derivatives with respect to the same dominating measure.)

The discussion above of likelihood with discrete probabilities is a special case of this using the counting measure, which makes the probability of any single outcome equal to the probability density for that outcome.

Note that given no event (no data), the probability and thus likelihood is 1; any non-trivial event will have lower likelihood.

## Example 1

Figure 1.  The likelihood function (${\displaystyle p_{\text{H}}^{2}}$ ) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HH.

Figure 2.  The likelihood function (${\displaystyle p_{\text{H}}^{2}(1-p_{\text{H}})}$ ) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HHT.

Consider a simple statistical model of a coin flip: a single parameter ${\displaystyle p_{\text{H}}}$  that expresses the "fairness" of the coin. The parameter is the probability that a coin lands heads up ("H") when tossed. ${\displaystyle p_{\text{H}}}$  can take on any value within the range 0.0 to 1.0. For a perfectly fair coin, ${\displaystyle p_{\text{H}}}$  = 0.5.

Imagine flipping a fair coin twice and observing two heads in two tosses ("HH"). Assuming that successive coin flips are i.i.d., the probability of observing HH is

${\displaystyle P({\text{HH}}\mid p_{\text{H}}=0.5)=0.5^{2}=0.25.}$

Hence, given the observed data HH, the likelihood that the model parameter ${\displaystyle p_{\text{H}}}$  equals 0.5 is 0.25. Mathematically, this is written as

${\displaystyle {\mathcal {L}}(p_{\text{H}}=0.5\mid {\text{HH}})=0.25.}$

This is not the same as saying that the probability that ${\displaystyle p_{\text{H}}=0.5}$ , given the observation HH, is 0.25. (For that, we could apply Bayes' theorem, which implies that the posterior probability is proportional to the likelihood times the prior probability.)

Suppose that the coin is not a fair coin, but instead it has ${\displaystyle p_{\text{H}}=0.3}$ . Then the probability of getting two heads is

${\displaystyle P({\text{HH}}\mid p_{\text{H}}=0.3)=0.3^{2}=0.09.}$

Hence

${\displaystyle {\mathcal {L}}(p_{\text{H}}=0.3\mid {\text{HH}})=0.09.}$

More generally, for each value of ${\displaystyle p_{\text{H}}}$ , we can calculate the corresponding likelihood. The result of such calculations is displayed in Figure 1.

In Figure 1, the integral of the likelihood over the interval [0, 1] is 1/3. That illustrates an important aspect of likelihoods: likelihoods do not have to integrate (or sum) to 1, unlike probabilities.
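The calculations in this example can be reproduced numerically. The following is a minimal sketch in Python; the names `likelihood_hh` and `integrate` are illustrative and not part of any library, and the midpoint rule is only one of many ways to approximate the integral.

```python
def likelihood_hh(p_h):
    """Likelihood of p_H given the observation HH (two independent heads)."""
    return p_h ** 2


def integrate(func, lower, upper, steps=100_000):
    """Approximate the integral of func over [lower, upper] by the midpoint rule."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


# Likelihood at the two parameter values discussed in the text.
print(likelihood_hh(0.5))
print(likelihood_hh(0.3))

# Area under the likelihood curve over [0, 1]; it is close to 1/3, not 1.
area = integrate(likelihood_hh, 0.0, 1.0)
print(area)
```

The area differs from 1 because the likelihood is a function of the parameter with the data held fixed, so nothing forces it to be normalized.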

## Interpretations under different foundations

Among statisticians, there is no consensus about what the foundation of statistics should be. There are four main paradigms that have been proposed for the foundation: frequentism, Bayesianism, likelihoodism, and AIC-based.[6] For each of the proposed foundations, the interpretation of likelihood is different. The four interpretations are described in the subsections below.

### Bayesian interpretation

In Bayesian inference, one can speak about the likelihood of any proposition or random variable given another random variable, for example the likelihood of a parameter value or of a statistical model (see marginal likelihood) given specified data or other evidence.[14][15][16][17] The likelihood function nevertheless remains the same entity, with the additional interpretations of (i) a conditional density of the data given the parameter (since the parameter is then a random variable) and (ii) a measure or amount of information brought by the data about the parameter value or even the model.[14][15][16][17][18] Because a probability structure is introduced on the parameter space or on the collection of models, it is possible for a parameter value or a statistical model to have a large likelihood value for given data and yet have a low probability, or vice versa.[16][18] This is often the case in medical contexts.[19] Following Bayes' rule, the likelihood when seen as a conditional density can be multiplied by the prior probability density of the parameter and then normalized, to give a posterior probability density.[14][15][16][17][18] More generally, the likelihood of an unknown quantity ${\displaystyle X}$  given another unknown quantity ${\displaystyle Y}$  is the probability of ${\displaystyle Y}$  given ${\displaystyle X}$.[14][15][16][17][18]

### AIC-based interpretation

Under the AIC paradigm, likelihood is interpreted within the context of information theory.[20][21][22]

## Likelihood ratio

A likelihood ratio is the ratio of any two specified likelihoods: ${\displaystyle {\mathcal {L}}(\theta _{1}\mid x)/{\mathcal {L}}(\theta _{2}\mid x)}$ . Likelihood ratios are frequently written as ${\displaystyle \Lambda }$ , as follows.

${\displaystyle \Lambda (\theta _{1}:\theta _{2}\mid x)={\frac {{\mathcal {L}}(\theta _{1}\mid x)}{{\mathcal {L}}(\theta _{2}\mid x)}}}$

The likelihood ratio of two models, given the same event, may be contrasted with the odds of two events, given the same model. In terms of a parametrized probability mass function ${\displaystyle p_{\theta }(x)}$ , the likelihood ratio of two values of the parameter ${\displaystyle \theta _{1}}$  and ${\displaystyle \theta _{2}}$ , given an outcome ${\displaystyle x}$  is:

${\displaystyle \Lambda (\theta _{1}:\theta _{2}\mid x)=p_{\theta _{1}}(x):p_{\theta _{2}}(x),}$

while the odds of two outcomes, ${\displaystyle x_{1}}$  and ${\displaystyle x_{2}}$ , given a value of the parameter ${\displaystyle \theta }$ , is:

${\displaystyle O(x_{1}:x_{2}\mid \theta )=p_{\theta }(x_{1}):p_{\theta }(x_{2}).}$

This highlights the difference between likelihood and odds: in likelihood, one compares models (parameters), holding data fixed; while in odds, one compares events (outcomes, data), holding the model fixed.
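The contrast can be made concrete with the coin model from Example 1. The sketch below is illustrative only; `binomial_pmf` is a helper defined here, and the parameter values 0.5 and 0.3 are simply those used earlier in the article.

```python
from math import comb


def binomial_pmf(heads, tosses, p_h):
    """Probability of `heads` heads in `tosses` independent tosses."""
    return comb(tosses, heads) * p_h ** heads * (1 - p_h) ** (tosses - heads)


# Likelihood ratio: two parameter values, one fixed outcome (2 heads in 2 tosses).
likelihood_ratio = binomial_pmf(2, 2, 0.5) / binomial_pmf(2, 2, 0.3)

# Odds: two outcomes (2 heads versus exactly 1 head), one fixed parameter value.
odds = binomial_pmf(2, 2, 0.5) / binomial_pmf(1, 2, 0.5)

print(likelihood_ratio)
print(odds)
```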

The odds ratio is a ratio of two conditional odds (of an event, given another event being present or absent). However, the odds ratio can also be interpreted as a ratio of two likelihood ratios, if one considers one of the events to be more easily observable than the other. See diagnostic odds ratio, where the result of a diagnostic test is more easily observable than the presence or absence of an underlying medical condition.

Given no event (no data), the likelihoods are both 1, and thus the likelihood ratio is also 1: in the absence of data, there is no evidence to distinguish two models.

### Purposes

The likelihood ratio is central to likelihoodist statistics: the law of likelihood states that the degree to which data (considered as evidence) supports one parameter value versus another is measured by the likelihood ratio.

The likelihood ratio is also of central importance in Bayesian inference, where it is known as the Bayes factor, and is used in Bayes' rule. Stated in terms of odds, Bayes' rule is that the posterior odds of two alternatives, ${\displaystyle A_{1}}$  and ${\displaystyle A_{2}}$ , given an event ${\displaystyle B}$ , is the prior odds, times the likelihood ratio. As an equation:

${\displaystyle O(A_{1}:A_{2}\mid B)=O(A_{1}:A_{2})\cdot \Lambda (A_{1}:A_{2}\mid B).}$
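As a small numerical check of the odds form of Bayes' rule, the sketch below compares it with a direct computation of the two posterior probabilities. The alternatives (a coin with heads probability 0.5 versus 0.3), the event HH, and the prior probabilities 0.2 and 0.8 are all hypothetical values chosen for illustration.

```python
# Hypothetical alternatives: A1 is "p_H = 0.5", A2 is "p_H = 0.3"; event B is "HH".
prior_a1, prior_a2 = 0.2, 0.8
likelihood_a1 = 0.5 ** 2  # P(HH | A1)
likelihood_a2 = 0.3 ** 2  # P(HH | A2)

# Odds form: posterior odds = prior odds * likelihood ratio.
prior_odds = prior_a1 / prior_a2
likelihood_ratio = likelihood_a1 / likelihood_a2
posterior_odds = prior_odds * likelihood_ratio

# Direct form: normalize prior * likelihood, then take the ratio of posteriors.
evidence = prior_a1 * likelihood_a1 + prior_a2 * likelihood_a2
posterior_a1 = prior_a1 * likelihood_a1 / evidence
posterior_a2 = prior_a2 * likelihood_a2 / evidence

print(posterior_odds, posterior_a1 / posterior_a2)
```

The normalizing constant cancels in the ratio, which is why the odds form needs only the prior odds and the likelihood ratio.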

The likelihood ratio is also used in frequentist inference as a test statistic in the likelihood-ratio test. By the Neyman–Pearson lemma, this is the most powerful test for comparing two simple hypotheses at a given significance level. The likelihood ratio is thus of great interest in frequentist inference, but is not as central as in Bayesian statistics. Numerous other tests can be viewed as likelihood-ratio tests or approximations thereof. The asymptotic distribution of the log-likelihood ratio, considered as a test statistic, is given by Wilks' theorem.
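A likelihood-ratio test can be sketched for a coin model as follows. The data (60 heads in 100 tosses) and the null value 0.5 are hypothetical, and the p-value relies on the chi-squared approximation from Wilks' theorem with one degree of freedom, which is only asymptotically valid. For one degree of freedom the chi-squared survival function equals `erfc(sqrt(x / 2))`, so only the standard library is needed.

```python
import math


def binomial_log_likelihood(p_h, heads, tosses):
    """Log-likelihood of p_H, dropping the binomial coefficient (constant in p_H)."""
    return heads * math.log(p_h) + (tosses - heads) * math.log(1 - p_h)


heads, tosses = 60, 100  # hypothetical data
p_null = 0.5             # value of p_H under the null hypothesis
p_hat = heads / tosses   # maximum likelihood estimate

# Test statistic: twice the difference in log-likelihoods.
statistic = 2 * (
    binomial_log_likelihood(p_hat, heads, tosses)
    - binomial_log_likelihood(p_null, heads, tosses)
)

# Approximate p-value from a chi-squared distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(statistic / 2))
print(statistic, p_value)
```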

The likelihood ratio is not directly used in AIC-based statistics. Instead, what is used is the relative likelihood of models (see below).

## Products of likelihoods

The likelihood, given two or more independent events, is the product of the likelihoods of each of the individual events:

${\displaystyle \Lambda (A\mid X_{1}\land X_{2})=\Lambda (A\mid X_{1})\cdot \Lambda (A\mid X_{2})}$

This follows from the definition of independence in probability: the probability of two independent events both happening, given a model, is the product of their individual probabilities.

This is particularly important when the events are from independent and identically distributed random variables, such as independent observations or sampling with replacement. In such a situation, the likelihood function factors into a product of individual likelihood functions.

The empty product has value 1, which corresponds to the likelihood, given no event, being 1: before any data, the likelihood is always 1. This is similar to a uniform prior in Bayesian statistics, but in likelihoodist statistics this is not an improper prior because likelihoods are not integrated.

## Log-likelihood

Because one is primarily interested in ratios and products of likelihoods, the logarithm of the likelihood function is often easier to work with, since logarithms convert multiplication to addition: ratios become differences, and products become sums. This is called the log-likelihood, the loglihood[23] or the support.[24] Often the log-likelihood is denoted by a lowercase l or ${\displaystyle \ell }$ , to contrast with the uppercase L or ${\displaystyle {\mathcal {L}}}$  for the likelihood.

In addition to the mathematical convenience, the log-likelihood has an intuitive interpretation, as suggested by the term "support". Given independent events, the overall log-likelihood is the sum of the log-likelihoods of the individual events, just as the overall log-probability is the sum of the log-probability of the individual events. Viewing data as evidence, this is interpreted as "support from independent evidence adds", and the log-likelihood is the "weight of evidence". Interpreting negative log-probability as information content or surprisal, the support (log-likelihood) of a model, given an event, is the negative of the surprisal of the event, given the model: a model is supported by an event to the extent that the event is unsurprising, given the model.

The choice of base b for the logarithm corresponds to a choice of scale;[a] generally the natural logarithm is used and the base is fixed, but sometimes the base is varied, in which case, writing the base as ${\displaystyle b=e^{\beta }}$ , the factor β can be interpreted as the coldness.[b]

A logarithm of a likelihood ratio is equal to the difference of the log-likelihoods:

${\displaystyle \log {\frac {L(A)}{L(B)}}=\log L(A)-\log L(B)=l(A)-l(B).}$

The log-likelihood is particularly convenient for maximum likelihood estimation. Because logarithms are strictly increasing functions, maximizing the likelihood is equivalent to maximizing the log-likelihood. The basic way to maximize a differentiable function is to find the stationary points (the points where the derivative is zero); since the derivative of a sum is just the sum of the derivatives, but the derivative of a product requires the product rule, it is easier to compute the stationary points of the log-likelihood of independent events than for the likelihood of independent events.
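Beyond easier differentiation, sums of logarithms are also better behaved in floating-point arithmetic than long products. The sketch below uses a hypothetical sample of 2000 coin tosses; `bernoulli_pmf` is a helper defined here for illustration.

```python
import math

# Hypothetical data: 2000 independent tosses, alternating heads (1) and tails (0).
observations = [1, 0] * 1000
p_h = 0.5


def bernoulli_pmf(x, p):
    """Probability of a single toss outcome x (1 = heads, 0 = tails)."""
    return p if x == 1 else 1 - p


# Product of 2000 probabilities: underflows to 0.0 in double precision.
likelihood = math.prod(bernoulli_pmf(x, p_h) for x in observations)

# Sum of 2000 log-probabilities: an ordinary finite number.
log_likelihood = sum(math.log(bernoulli_pmf(x, p_h)) for x in observations)

print(likelihood)
print(log_likelihood)
```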

Just as the likelihood, given no event, is 1, the log-likelihood, given no event, is 0, which corresponds to the value of the empty sum: without any data, there is no support for any models.

### Exponential families

The log-likelihood is also particularly useful for exponential families of distributions, which include many of the common parametric probability distributions. The probability distribution function (and thus likelihood function) for exponential families contains products of factors involving exponentiation. The logarithm of such a function is a sum of products, again easier to differentiate than the original function.

An exponential family is one whose probability density function is of the form (for some functions, writing ${\displaystyle \langle -,-\rangle }$  for the inner product):

${\displaystyle p(x\mid {\boldsymbol {\theta }})=h(x)\exp {\Big (}\langle {\boldsymbol {\eta }}({\boldsymbol {\theta }}),\mathbf {T} (x)\rangle -A({\boldsymbol {\theta }}){\Big )}.}$

Each of these terms has an interpretation,[c] but simply switching from probability to likelihood and taking logarithms yields the sum:

${\displaystyle \ell ({\boldsymbol {\theta }}\mid x)=\langle {\boldsymbol {\eta }}({\boldsymbol {\theta }}),\mathbf {T} (x)\rangle -A({\boldsymbol {\theta }})+\log h(x).}$

The ${\displaystyle {\boldsymbol {\eta }}({\boldsymbol {\theta }})}$  and ${\displaystyle h(x)}$  each correspond to a change of coordinates, so in these coordinates, the log-likelihood of an exponential family is given by the simple formula:

${\displaystyle \ell ({\boldsymbol {\eta }}\mid x)=\langle {\boldsymbol {\eta }},\mathbf {T} (x)\rangle -A({\boldsymbol {\eta }}).}$

In words, the log-likelihood of an exponential family is the inner product of the natural parameter ${\displaystyle {\boldsymbol {\eta }}}$  and the sufficient statistic ${\displaystyle \mathbf {T} (x)}$ , minus the normalization factor (log-partition function) ${\displaystyle A({\boldsymbol {\eta }})}$ . Thus for example the maximum likelihood estimate can be computed by taking derivatives of the sufficient statistic T and the log-partition function A.
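As an illustration, the Bernoulli distribution can be written in this form with natural parameter η = log(p / (1 − p)), sufficient statistic T(x) = x, log-partition function A(η) = log(1 + e^η) and h(x) = 1. The sketch below checks numerically that the two expressions for the log-likelihood agree; the function names are illustrative.

```python
import math


def bernoulli_log_likelihood(p, x):
    """Log-likelihood in the usual parametrization: log(p^x * (1 - p)^(1 - x))."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)


def natural_parameter(p):
    """Natural parameter eta of the Bernoulli family (the log-odds)."""
    return math.log(p / (1 - p))


def log_partition(eta):
    """Log-partition function A(eta) of the Bernoulli family."""
    return math.log(1 + math.exp(eta))


def exp_family_log_likelihood(eta, x):
    """<eta, T(x)> - A(eta) with T(x) = x; log h(x) = 0 for the Bernoulli family."""
    return eta * x - log_partition(eta)


for p in (0.2, 0.5, 0.9):
    for x in (0, 1):
        direct = bernoulli_log_likelihood(p, x)
        via_eta = exp_family_log_likelihood(natural_parameter(p), x)
        print(p, x, direct, via_eta)
```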

#### Example: the gamma distribution

The gamma distribution is an exponential family with two parameters, ${\displaystyle \alpha }$  and ${\displaystyle \beta }$ . The likelihood function is

${\displaystyle {\mathcal {L}}(\alpha ,\beta \mid x)={\frac {\beta ^{\alpha }}{\Gamma (\alpha )}}x^{\alpha -1}e^{-\beta x}.}$

Finding the maximum likelihood estimate of ${\displaystyle \beta }$  for a single observed value ${\displaystyle x}$  looks rather daunting. The logarithm of the likelihood is much simpler to work with:

${\displaystyle \log {\mathcal {L}}(\alpha ,\beta \mid x)=\alpha \log \beta -\log \Gamma (\alpha )+(\alpha -1)\log x-\beta x.\,}$

To maximize the log-likelihood, we first take the partial derivative with respect to ${\displaystyle \beta }$ :

${\displaystyle {\frac {\partial \log {\mathcal {L}}(\alpha ,\beta \mid x)}{\partial \beta }}={\frac {\alpha }{\beta }}-x.}$

If there are a number of independent observations ${\displaystyle x_{1},\ldots ,x_{n}}$ , then the joint log-likelihood will be the sum of individual log-likelihoods, and the derivative of this sum will be a sum of derivatives of each individual log-likelihood:

${\displaystyle {\begin{aligned}&{\frac {\partial \log {\mathcal {L}}(\alpha ,\beta \mid x_{1},\ldots ,x_{n})}{\partial \beta }}\\={}&{\frac {\partial \log {\mathcal {L}}(\alpha ,\beta \mid x_{1})}{\partial \beta }}+\cdots +{\frac {\partial \log {\mathcal {L}}(\alpha ,\beta \mid x_{n})}{\partial \beta }}={\frac {n\alpha }{\beta }}-\sum _{i=1}^{n}x_{i}.\end{aligned}}}$

To complete the maximization procedure for the joint log-likelihood, the equation is set to zero and solved for ${\displaystyle \beta }$ :

${\displaystyle {\widehat {\beta }}={\frac {\alpha }{\bar {x}}}.}$

Here ${\displaystyle {\widehat {\beta }}}$  denotes the maximum-likelihood estimate, and ${\displaystyle \textstyle {\bar {x}}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}}$  is the sample mean of the observations.
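The closed-form estimate can be checked numerically. In the sketch below the shape parameter α is treated as known, and both its value and the sample are hypothetical; the function names are illustrative.

```python
import math


def gamma_log_likelihood(alpha, beta, data):
    """Joint log-likelihood of independent gamma observations (rate beta)."""
    n = len(data)
    return (
        n * alpha * math.log(beta)
        - n * math.lgamma(alpha)
        + (alpha - 1) * sum(math.log(x) for x in data)
        - beta * sum(data)
    )


def score_beta(alpha, beta, data):
    """Partial derivative of the joint log-likelihood with respect to beta."""
    return len(data) * alpha / beta - sum(data)


alpha = 2.0                        # assumed known for this illustration
data = [1.2, 0.7, 2.5, 1.9, 0.4]   # hypothetical observations

sample_mean = sum(data) / len(data)
beta_hat = alpha / sample_mean

# The score vanishes at beta_hat, and nearby values give a lower log-likelihood.
print(beta_hat, score_beta(alpha, beta_hat, data))
print(gamma_log_likelihood(alpha, 0.9 * beta_hat, data))
print(gamma_log_likelihood(alpha, beta_hat, data))
print(gamma_log_likelihood(alpha, 1.1 * beta_hat, data))
```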

## Likelihood function of a parameterized model

Among many applications, we consider here one of broad theoretical and practical importance. Given a parameterized family of probability density functions (or probability mass functions in the case of discrete distributions)

${\displaystyle x\mapsto f(x\mid \theta ),\!}$

where ${\displaystyle \theta }$  is the parameter, the likelihood function is

${\displaystyle \theta \mapsto f(x\mid \theta ),\!}$

written

${\displaystyle {\mathcal {L}}(\theta \mid x)=f(x\mid \theta ),\!}$

where ${\displaystyle x}$  is the observed outcome of an experiment. In other words, when ${\displaystyle f(x|\theta )}$  is viewed as a function of ${\displaystyle x}$  with ${\displaystyle \theta }$  fixed, it is a probability density function, and when viewed as a function of ${\displaystyle \theta }$  with ${\displaystyle x}$  fixed, it is a likelihood function.

This is not the same as the probability that those parameters are the right ones, given the observed sample. Attempting to interpret the likelihood of a hypothesis given observed evidence as the probability of the hypothesis is a common error, with potentially disastrous consequences in medicine, engineering or jurisprudence. See prosecutor's fallacy for an example of this.

From a geometric standpoint, if we consider ${\displaystyle f(x|\theta )}$  as a function of two variables then the family of probability distributions can be viewed as a family of curves parallel to the ${\displaystyle x}$ -axis, while the family of likelihood functions is the orthogonal curves parallel to the ${\displaystyle \theta }$ -axis.

### Likelihoods for continuous distributions

The use of the probability density in specifying the likelihood function above is justified as follows. Given an observation ${\displaystyle x_{j}}$ , the likelihood for the interval ${\displaystyle [x_{j},x_{j}+h]}$ , where ${\displaystyle h>0}$  is a constant, is given by ${\displaystyle {\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])}$ . Observe that ${\displaystyle \operatorname {argmax} _{\theta }{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])=\operatorname {argmax} _{\theta }{\frac {1}{h}}{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])}$ ,

since ${\displaystyle h}$  is positive and constant. Because

${\displaystyle \operatorname {argmax} _{\theta }{\frac {1}{h}}{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])=\operatorname {argmax} _{\theta }{\frac {1}{h}}\Pr(x_{j}\leq x\leq x_{j}+h\mid \theta )=\operatorname {argmax} _{\theta }{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx,}$

where ${\displaystyle f(x\mid \theta )}$  is the probability density function, it follows that

${\displaystyle \operatorname {argmax} _{\theta }{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])=\operatorname {argmax} _{\theta }{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx}$ .

The first fundamental theorem of calculus and l'Hôpital's rule together provide that

${\displaystyle {\begin{aligned}&\lim _{h\to 0^{+}}{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx=\lim _{h\to 0^{+}}{\frac {{\frac {d}{dh}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx}{\frac {dh}{dh}}}\\[4pt]={}&\lim _{h\to 0^{+}}{\frac {f(x_{j}+h\mid \theta )}{1}}=f(x_{j}\mid \theta ).\end{aligned}}}$

Then

${\displaystyle {\begin{aligned}&\operatorname {argmax} _{\theta }{\mathcal {L}}(\theta \mid x_{j})=\operatorname {argmax} _{\theta }\left[\lim _{h\to 0^{+}}{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])\right]\\[4pt]={}&\operatorname {argmax} _{\theta }\left[\lim _{h\to 0^{+}}{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx\right]=\operatorname {argmax} _{\theta }f(x_{j}\mid \theta ).\end{aligned}}}$

Therefore,

${\displaystyle \operatorname {argmax} _{\theta }{\mathcal {L}}(\theta \mid x_{j})=\operatorname {argmax} _{\theta }f(x_{j}\mid \theta ),\!}$

and so maximizing the probability density at ${\displaystyle x_{j}}$  amounts to maximizing the likelihood of the specific observation ${\displaystyle x_{j}}$ .
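The limit used in this argument can be observed numerically. The sketch below assumes a normal model with unknown mean θ and unit variance, which is a choice made here purely for illustration; the observation and the parameter value are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(x, mu, sigma=1.0):
    """Cumulative distribution function of the same normal distribution."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def scaled_interval_probability(mu, x_j, h):
    """(1/h) * Pr(x_j <= x <= x_j + h | mu), the quantity whose limit is taken."""
    return (normal_cdf(x_j + h, mu) - normal_cdf(x_j, mu)) / h


x_j, theta = 1.3, 0.5  # hypothetical observation and parameter value
for h in (1.0, 0.1, 0.01, 0.001, 0.0001):
    print(h, scaled_interval_probability(theta, x_j, h))

# The values above approach the density evaluated at the observation.
print(normal_pdf(x_j, theta))
```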

### Likelihoods for mixed continuous–discrete distributions

The above can be extended in a simple way to allow consideration of distributions which contain both discrete and continuous components. Suppose that the distribution consists of a number of discrete probability masses ${\displaystyle p_{k}(\theta )}$  and a density ${\displaystyle f(x|\theta )}$ , where the sum of all the ${\displaystyle p}$ 's added to the integral of ${\displaystyle f}$  is always one. Assuming that it is possible to distinguish an observation corresponding to one of the discrete probability masses from one which corresponds to the density component, the likelihood function for an observation from the continuous component can be dealt with in the manner shown above. For an observation from the discrete component, the likelihood function is simply

${\displaystyle {\mathcal {L}}(\theta \mid x)=p_{k}(\theta ),\!}$

where ${\displaystyle k}$  is the index of the discrete probability mass corresponding to observation ${\displaystyle x}$ , because maximizing the probability mass (or probability) at ${\displaystyle x}$  amounts to maximizing the likelihood of the specific observation.

The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and the probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation ${\displaystyle x}$ , but not with the parameter ${\displaystyle \theta }$ .

## Example 2

Consider a jar containing N lottery tickets numbered from 1 through N. If you pick a ticket randomly, then you get positive integer n, with probability 1/N if n ≤ N and with probability 0 if n > N. This can be written

${\displaystyle P(n\mid N)={\frac {[n\leq N]}{N}}}$

where the Iverson bracket [n ≤ N] is 1 when n ≤ N and 0 otherwise. When considered a function of n for fixed N, this is the probability distribution. When considered a function of N for fixed n, this is a likelihood function. The maximum likelihood estimate for N is n (by contrast, the unbiased estimate is 2n − 1).

This likelihood function is not a probability distribution for ${\displaystyle N}$ . To see this, note that the total

${\displaystyle \sum _{N=1}^{\infty }P(n\mid N)=\sum _{N}{\frac {[n\leq N]}{N}}=\sum _{N=n}^{\infty }{\frac {1}{N}}}$

is a divergent series, and so is ${\displaystyle \infty }$ , not 1 as it would have to be if they were probabilities.

Suppose, however, that you pick two tickets (without replacement), rather than one. Then the probability of the outcome {n1, n2}, where n1 < n2, is

${\displaystyle P(\{n_{1},n_{2}\}\mid N)={\frac {[n_{2}\leq N]}{\binom {N}{2}}}.}$

When considered a function of N for fixed n2, this is a likelihood function. The maximum likelihood estimate for N is n2. The total

${\displaystyle \sum _{N=1}^{\infty }P(\{n_{1},n_{2}\}\mid N)=\sum _{N}{\frac {[N\geq n_{2}]}{\binom {N}{2}}}={\frac {2}{n_{2}-1}}}$

is a convergent series, and so this likelihood function can be normalized into a probability distribution.
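Both claims can be checked with partial sums. In the sketch below the observed ticket numbers (a single ticket 7, or a larger ticket n2 = 7 when two are drawn) and the truncation points are hypothetical choices; the function names are illustrative.

```python
from math import comb


def likelihood_one_ticket(n, N):
    """P(n | N) for a single ticket drawn from a jar of N tickets."""
    return 1 / N if n <= N else 0.0


def likelihood_two_tickets(n2, N):
    """P({n1, n2} | N) for two tickets drawn without replacement, n1 < n2."""
    if N < n2:
        return 0.0
    return 1 / comb(N, 2)


n, n2 = 7, 7  # hypothetical observations

# The maximum likelihood estimate is the largest observed ticket number.
candidates = range(1, 50)
mle_two = max(candidates, key=lambda N: likelihood_two_tickets(n2, N))
print(mle_two)

# Partial sums over N: the one-ticket sum keeps growing (harmonic tail),
# while the two-ticket sum settles near 2 / (n2 - 1).
for upper in (10**3, 10**4, 10**5):
    total_one = sum(likelihood_one_ticket(n, N) for N in range(1, upper + 1))
    total_two = sum(likelihood_two_tickets(n2, N) for N in range(1, upper + 1))
    print(upper, total_one, total_two)

print(2 / (n2 - 1))
```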

If you pick 3 or more tickets, the likelihood function has a well defined mean value, which is larger than the maximum likelihood estimate. If you pick 4 or more tickets, the likelihood function has a well defined standard deviation too.

With 2 or more tickets, the probability distributions just derived match the results from a Bayesian analysis assuming an improper, uniform prior for N over all positive integers. The use of improper priors is often justified by saying that the information from the data dominates the information from the prior. If only a very few tickets are available, and a precise answer is important, this can justify the work of collecting relevant information from other sources to use as an informative prior.

## Relative likelihood

### Relative likelihood function

Suppose that the maximum likelihood estimate for the parameter θ is ${\displaystyle {\hat {\theta }}}$ . Relative plausibilities of other θ values may be found by comparing the likelihoods of those other values with the likelihood of ${\displaystyle {\hat {\theta }}}$ . The relative likelihood of θ is defined to be[25][26][27][28][29]

${\displaystyle {\mathcal {L}}(\theta \mid x)/{\mathcal {L}}({\hat {\theta }}\mid x).}$

Thus, the relative likelihood is the likelihood ratio (discussed above) with the fixed denominator ${\displaystyle {\mathcal {L}}({\hat {\theta }})}$ . This corresponds to normalizing the likelihood to have a maximum of 1.

### Likelihood region

A likelihood region is the set of all values of θ whose relative likelihood is greater than or equal to a given threshold. In terms of percentages, a p% likelihood region for θ is defined to be[25][27]

${\displaystyle \left\{\theta :{\frac {{\mathcal {L}}(\theta \mid x)}{{\mathcal {L}}({\hat {\theta \,}}\mid x)}}\geq {\frac {p}{100}}\right\}.}$

If θ is a single real parameter, a p% likelihood region will usually comprise an interval of real values. If the region does comprise an interval, then it is called a likelihood interval.[25][27][30]

Likelihood intervals, and more generally likelihood regions, are used for interval estimation within likelihoodist statistics: they are similar to confidence intervals in frequentist statistics and credible intervals in Bayesian statistics. Likelihood intervals are interpreted directly in terms of relative likelihood, not in terms of coverage probability (frequentism) or posterior probability (Bayesianism).

Given a model, likelihood intervals can be compared to confidence intervals. If θ is a single real parameter, then under certain conditions, a 14.65% likelihood interval (about 1:7 likelihood) for θ will be the same as a 95% confidence interval (19/20 coverage probability).[25] In a slightly different formulation suited to the use of log-likelihoods (see Wilks' theorem), the test statistic is twice the difference in log-likelihoods and the probability distribution of the test statistic is approximately a chi-squared distribution with degrees-of-freedom (df) equal to the difference in df's between the two models (therefore, the ${\displaystyle e^{-2}}$ likelihood interval is the same as the 0.954 confidence interval; assuming difference in df's to be 1).[30]
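A likelihood interval can be found on a grid of parameter values. The sketch below reuses the HHT observation from Figure 2 with the 14.65% threshold mentioned above; the grid resolution is an arbitrary choice, and with only three tosses the correspondence with a 95% confidence interval should not be expected to hold.

```python
def likelihood_hht(p_h):
    """Likelihood of p_H given the observation HHT."""
    return p_h ** 2 * (1 - p_h)


# Evaluate the likelihood on a grid over the parameter space [0, 1].
grid = [i / 1000 for i in range(1001)]
mle = max(grid, key=likelihood_hht)
max_likelihood = likelihood_hht(mle)


def relative_likelihood(p_h):
    """Likelihood normalized to have a maximum of 1 (on the grid)."""
    return likelihood_hht(p_h) / max_likelihood


# Keep the grid points whose relative likelihood meets the threshold.
threshold = 0.1465
region = [p for p in grid if relative_likelihood(p) >= threshold]
lower, upper = min(region), max(region)
print(mle, lower, upper)
```

Because this likelihood has a single peak, the retained grid points form one contiguous interval.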

### Relative likelihood of models

The definition of relative likelihood can be generalized to compare different statistical models. This generalization is based on AIC (Akaike information criterion), or sometimes AICc (Akaike Information Criterion with correction).

Suppose that, for some dataset, we have two statistical models, M1 and M2. Also suppose that AIC(M1) ≤ AIC(M2). Then the relative likelihood of M2 with respect to M1 is defined as follows.[31]

${\displaystyle \exp \left({\frac {\operatorname {AIC} (M_{1})-\operatorname {AIC} (M_{2})}{2}}\right)}$

To see that this is a generalization of the earlier definition, suppose that we have some model M with a (possibly multivariate) parameter θ. Then for any θ, set M2 = M(θ), and also set M1 = M(${\displaystyle {\hat {\theta }}}$ ). The general definition now gives the same result as the earlier definition.
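The reduction to the earlier definition can be verified numerically, using the standard formula AIC = 2k − 2 log L̂ where k is the number of parameters. The sketch below reuses the HHT coin observation; the comparison value 0.5 and the function names are illustrative, and both models are counted with the same k, as in the argument above.

```python
import math


def aic(num_params, log_likelihood):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * num_params - 2 * log_likelihood


def relative_likelihood_of_models(aic_best, aic_other):
    """Relative likelihood of the other model with respect to the best one."""
    return math.exp((aic_best - aic_other) / 2)


def log_likelihood_hht(p_h):
    """Log-likelihood of p_H given the observation HHT."""
    return 2 * math.log(p_h) + math.log(1 - p_h)


theta_hat = 2 / 3  # maximum likelihood estimate for HHT
theta = 0.5        # some other parameter value

# With equal parameter counts the penalty terms cancel in the difference.
aic_m1 = aic(1, log_likelihood_hht(theta_hat))
aic_m2 = aic(1, log_likelihood_hht(theta))

via_aic = relative_likelihood_of_models(aic_m1, aic_m2)
via_ratio = math.exp(log_likelihood_hht(theta) - log_likelihood_hht(theta_hat))
print(via_aic, via_ratio)
```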

## Likelihoods that eliminate nuisance parameters

In many cases, the likelihood is a function of more than one parameter but interest focuses on the estimation of only one, or at most a few of them, with the others being considered as nuisance parameters. Several alternative approaches have been developed to eliminate such nuisance parameters, so that a likelihood can be written as a function of only the parameter (or parameters) of interest: the main approaches are marginal, conditional, and profile likelihoods.[32][33]

These approaches are useful because standard likelihood methods can become unreliable or fail entirely when there are many nuisance parameters or when the nuisance parameters are high-dimensional. This is particularly true when the nuisance parameters can be considered to be "missing data"; they represent a non-negligible fraction of the number of observations and this fraction does not decrease when the sample size increases. Often these approaches can be used to derive closed-form formulae for statistical tests when direct use of maximum likelihood requires iterative numerical methods. These approaches find application in some specialized topics such as sequential analysis.

### Conditional likelihood

Sometimes it is possible to find a sufficient statistic for the nuisance parameters, and conditioning on this statistic results in a likelihood which does not depend on the nuisance parameters.

One example occurs in 2×2 tables, where conditioning on all four marginal totals leads to a conditional likelihood based on the non-central hypergeometric distribution. This form of conditioning is also the basis for Fisher's exact test.
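As a sketch of this conditional likelihood, the code below evaluates Fisher's non-central hypergeometric probability of the top-left cell count given the table margins, as a function of the odds ratio. The argument names (`row1`, `row2`, `col1`) and the example table are illustrative assumptions.

```python
from math import comb


def conditional_likelihood(psi: float, x: int, row1: int, row2: int, col1: int) -> float:
    """P(top-left cell = x | margins; odds ratio psi) for a 2x2 table.

    row1 and row2 are the row totals and col1 is the first column total.
    """
    def weight(u: int) -> float:
        return comb(row1, u) * comb(row2, col1 - u) * psi ** u

    lo = max(0, col1 - row2)   # smallest cell count compatible with the margins
    hi = min(row1, col1)       # largest cell count compatible with the margins
    return weight(x) / sum(weight(u) for u in range(lo, hi + 1))


# Hypothetical table with row totals 4 and 4, first column total 4, top-left cell 3.
for psi in (0.5, 1.0, 2.0, 5.0, 10.0):
    print(psi, conditional_likelihood(psi, 3, 4, 4, 4))
```

At an odds ratio of 1 the expression reduces to the ordinary hypergeometric probability, which is the null distribution used in Fisher's exact test. No nuisance parameter appears in the function: only the odds ratio and the observed margins enter.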

### Marginal likelihood

Sometimes we can remove the nuisance parameters by considering a likelihood based on only part of the information in the data, for example by using the set of ranks rather than the numerical values. Another example occurs in linear mixed models, where considering a likelihood for the residuals only after fitting the fixed effects leads to residual maximum likelihood estimation of the variance components.

### Concentrated or profile likelihood

When the likelihood function depends on many parameters, the likelihood surface increases in dimensionality, which makes the function difficult to illustrate. It is possible to reduce the dimensions by concentrating the likelihood function for a subset of parameters: the uninteresting (nuisance) parameters are expressed as functions of the parameters of interest and replaced in the likelihood function.[34][35] For instance, if ${\displaystyle L(\mu ,\sigma )}$ is a two-parameter likelihood function, the concentrated likelihood function (in ${\displaystyle \mu }$) is defined as ${\displaystyle L_{p}(\mu )=L(\mu ,{\hat {\sigma }}(\mu ))}$, where ${\displaystyle {\hat {\sigma }}(\mu )}$ is the solution of ${\displaystyle \partial L/\partial \sigma =0}$. In general, for a likelihood function depending on the parameter vector ${\displaystyle \mathbf {\theta } }$ that can be partitioned into ${\displaystyle \mathbf {\theta } =\left(\mathbf {\theta } _{1}:\mathbf {\theta } _{2}\right)}$, and where a correspondence ${\displaystyle \mathbf {\hat {\theta }} _{2}=\mathbf {\hat {\theta }} _{2}\left(\mathbf {\theta } _{1}\right)}$ can be determined explicitly, concentration reduces the computational burden of the original maximization problem.[36]
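For the two-parameter case, a minimal sketch can be given under the assumption that the data are independent draws from a normal distribution with mean μ and standard deviation σ. Solving the first-order condition in σ (on the log scale, which yields the same solution because the logarithm is monotone) gives σ̂(μ)² equal to the mean squared deviation from μ. The data values and function names are hypothetical.

```python
import math


def log_likelihood(mu: float, sigma: float, data: list[float]) -> float:
    """Normal log-likelihood of (mu, sigma) for independent observations."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2.0 * math.pi)
            - squared_error / (2.0 * sigma ** 2))


def sigma_hat(mu: float, data: list[float]) -> float:
    """Value of sigma that maximizes the likelihood for a fixed mu."""
    return math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))


def profile_log_likelihood(mu: float, data: list[float]) -> float:
    """Concentrated log-likelihood: log L(mu, sigma_hat(mu))."""
    return log_likelihood(mu, sigma_hat(mu, data), data)


data = [4.1, 5.3, 4.8, 6.0, 5.1]  # hypothetical sample
for mu in (4.5, 5.0, sum(data) / len(data), 5.5):
    print(round(mu, 2), round(profile_log_likelihood(mu, data), 4))
```

The concentrated function depends on μ alone, so it can be plotted or maximized as a one-dimensional curve; in this example its maximum is at the sample mean.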

For instance, in a linear regression with normally distributed errors, ${\displaystyle \mathbf {y} =\mathbf {X} \beta +u}$, the coefficient vector could be partitioned into ${\displaystyle \beta =\left[\beta _{1}:\beta _{2}\right]}$ (and consequently the design matrix ${\displaystyle \mathbf {X} =\left[\mathbf {X} _{1}:\mathbf {X} _{2}\right]}$). Maximizing with respect to ${\displaystyle \beta _{2}}$ yields an optimal value function ${\displaystyle \beta _{2}(\beta _{1})=\left(\mathbf {X} _{2}^{\mathsf {T}}\mathbf {X} _{2}\right)^{-1}\mathbf {X} _{2}^{\mathsf {T}}\left(\mathbf {y} -\mathbf {X} _{1}\beta _{1}\right)}$. Using this result, the maximum likelihood estimator for ${\displaystyle \beta _{1}}$ can then be derived as

${\displaystyle {\hat {\beta }}_{1}=\left(\mathbf {X} _{1}^{\mathsf {T}}\left(\mathbf {I} -\mathbf {X} _{2}\left(\mathbf {X} _{2}^{\mathsf {T}}\mathbf {X} _{2}\right)^{-1}\mathbf {X} _{2}^{\mathsf {T}}\right)\mathbf {X} _{1}\right)^{-1}\mathbf {X} _{1}^{\mathsf {T}}\left(\mathbf {I} -\mathbf {X} _{2}\left(\mathbf {X} _{2}^{\mathsf {T}}\mathbf {X} _{2}\right)^{-1}\mathbf {X} _{2}^{\mathsf {T}}\right)\mathbf {y} }$

where ${\displaystyle \mathbf {X} _{2}\left(\mathbf {X} _{2}^{\mathsf {T}}\mathbf {X} _{2}\right)^{-1}\mathbf {X} _{2}^{\mathsf {T}}}$  is the projection matrix of ${\displaystyle \mathbf {X} _{2}}$ . This result is known as the Frisch–Waugh–Lovell theorem.
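The result can be checked in the simplest special case, sketched below under the assumption that X₂ is a single column of ones (an intercept) and X₁ is a single regressor. In that case applying I − X₂(X₂ᵀX₂)⁻¹X₂ᵀ to a vector amounts to subtracting its mean. The data and function names are hypothetical.

```python
def slope_full_regression(x: list[float], y: list[float]) -> float:
    """Slope from regressing y on [x, 1], via the 2x2 normal equations."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(a * a for a in x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)


def slope_partialled_out(x: list[float], y: list[float]) -> float:
    """Slope from the concentrated formula with X2 equal to a column of ones."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    x_resid = [a - x_bar for a in x]  # (I - P2) x
    y_resid = [b - y_bar for b in y]  # (I - P2) y
    # Because I - P2 is symmetric and idempotent, x'(I - P2)y equals the
    # inner product of the two residual vectors.
    numerator = sum(a * b for a, b in zip(x_resid, y_resid))
    denominator = sum(a * a for a in x_resid)
    return numerator / denominator


x = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical regressor
y = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical response
print(slope_full_regression(x, y), slope_partialled_out(x, y))
```

Both routes give the same coefficient on the regressor of interest (about 1.99 for these data), which is the content of the theorem in this special case.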

Since graphically the procedure of concentration is equivalent to slicing the likelihood surface along the ridge of values of the nuisance parameter ${\displaystyle \sigma }$  that maximizes the likelihood function, creating an isometric profile of the likelihood function for a given ${\displaystyle \mu }$ , the result of this procedure is also known as profile likelihood.[37] In addition to being graphed, the profile likelihood can also be used to compute confidence intervals that often have better small-sample properties than those based on asymptotic standard errors calculated from the full likelihood.[38][39]

### Partial likelihood

A partial likelihood is an adaptation of the full likelihood such that only some of the parameters (the parameters of interest) occur in it.[40] It is a key component of the proportional hazards model: using a restriction on the hazard function, the likelihood does not contain the shape of the hazard over time.
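A minimal sketch of the partial log-likelihood for the proportional hazards model follows, assuming a single covariate, a hazard of the form h₀(t)·exp(βx), and no tied event times. The function name and the small dataset are hypothetical.

```python
import math


def cox_partial_log_likelihood(beta: float,
                               times: list[float],
                               events: list[int],
                               covariates: list[float]) -> float:
    """Partial log-likelihood of beta (single covariate, no tied event times).

    events[i] is 1 if subject i had the event at times[i] and 0 if censored.
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: subjects still under observation at time t_i.
        risk_sum = sum(math.exp(beta * x_j)
                       for t_j, x_j in zip(times, covariates) if t_j >= t_i)
        total += beta * covariates[i] - math.log(risk_sum)
    return total


times = [2.0, 3.5, 5.0, 7.2]   # hypothetical follow-up times
events = [1, 0, 1, 1]          # 1 = event observed, 0 = censored
x = [0.5, 1.2, -0.3, 0.8]      # hypothetical covariate values

print(cox_partial_log_likelihood(0.4, times, events, x))
# Rescaling time monotonically leaves the value unchanged, since only the
# ordering of the times enters the expression.
print(cox_partial_log_likelihood(0.4, [t ** 3 for t in times], events, x))
```

The baseline hazard h₀(t) never appears in the function: only β and the ordering of the event times matter, which is the sense in which the shape of the hazard over time is eliminated.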

## Historical remarks

The term "likelihood" has been in use in English since at least late Middle English.[41] Its formal use to refer to a specific function in mathematical statistics was proposed by Ronald Fisher,[42] in two research papers published in 1921[43] and 1922.[44] The 1921 paper introduced what is today called a "likelihood interval"; the 1922 paper introduced the term "method of maximum likelihood". Quoting Fisher:

“[I]n 1922, I proposed the term ‘likelihood,’ in view of the fact that, with respect to [the parameter], it is not a probability, and does not obey the laws of probability, while at the same time it bears to the problem of rational choice among the possible values of [the parameter] a relation similar to that which probability bears to the problem of predicting events in games of chance. . . . Whereas, however, in relation to psychological judgment, likelihood has some resemblance to probability, the two concepts are wholly distinct. . . .”[45]

The concept of likelihood should not be confused with probability, as Sir Ronald Fisher stressed: "I stress this because in spite of the emphasis that I have always laid upon the difference between probability and likelihood there is still a tendency to treat likelihood as though it were a sort of probability. The first result is thus that there are two different measures of rational belief appropriate to different cases. Knowing the population we can express our incomplete knowledge of, or expectation of, the sample in terms of probability; knowing the sample we can express our incomplete knowledge of the population in terms of likelihood."[46] Fisher's invention of statistical likelihood was a reaction against an earlier form of reasoning called inverse probability.[47] His use of the term "likelihood" fixed the meaning of the term within mathematical statistics.

A. W. F. Edwards (1972) established the axiomatic basis for use of the log-likelihood ratio as a measure of relative support for one hypothesis against another. The support function is then the natural logarithm of the likelihood function. Both terms are used in phylogenetics, but were not adopted in a general treatment of the topic of statistical evidence.[48]

## Notes

1. ^ The scale factor is ${\displaystyle \log _{a}b}$ ; see Logarithm § Change of base
2. ^ "Coldness" is also known as thermodynamic beta or inverse temperature; See Watanabe–Akaike information criterion and Softmax function § Statistical mechanics for examples of varying the coldness.
3. ^

## References

1. ^ Pickles, Andrew (1985). An Introduction to Likelihood Analysis. Norwich: W. H. Hutchins & Sons. p. 9. ISBN 0-86094-190-6.
2. ^ Davidson, Russell; MacKinnon, James G. (1993). "The Method of Maximum Likelihood : Fundamental Concepts and Notation". Estimation and Inference in Econometrics. New York: Oxford University Press. pp. 247–253. ISBN 0-19-506011-3.
3. ^ Berger, James O.; Wolpert, Robert L. (1988). The Likelihood Principle. Hayward: Institute of Mathematical Statistics. p. 19. ISBN 0-940600-13-7.
4. ^ Edwards, A. W. F. (1992). Likelihood. Baltimore: Johns Hopkins University Press. ISBN 0-8018-4443-6.
5. ^ Fisher, R. A. Statistical Methods for Research Workers. What has now appeared is that the mathematical concept of probability is ... inadequate to express our mental confidence or [lack of confidence] in making ... inferences, and that the mathematical quantity which usually appears to be appropriate for measuring our order of preference among different possible populations does not in fact obey the laws of probability. To distinguish it from probability, I have used the term "likelihood" to designate this quantity.... The quotation is from §1.2 of the book. The wording of the quotation varies slightly among editions of the book; the wording presented here is from the last edition. The phrase "[lack of confidence]" is, in all editions of the book, "diffidence", the usual definition of which makes little sense in the context to modern readers. A rare/obsolete definition of "diffidence", though, is "lack of confidence" (see e.g. SOED), which makes excellent sense in the context. Ergo, the quotation is presented as here.
6. ^ a b Bandyopadhyay, P. S.; Forster, M. R., eds. (2011), Philosophy of Statistics, North-Holland Publishing.
7. ^ Myung, In Jae (2003). "Tutorial on Maximum Likelihood Estimation". Journal of Mathematical Psychology. 47 (1): 90–100. doi:10.1016/S0022-2496(02)00028-7.
8. ^ Kass, Robert E.; Vos, Paul W. (1997). Geometrical Foundations of Asymptotic Inference. New York: John Wiley & Sons. p. 14. ISBN 0-471-82668-5.
9. ^ Papadopoulos, Alecos (September 25, 2013). "Why we always put log() before the joint pdf when we use MLE (Maximum likelihood Estimation)?". Stack Exchange.
10. ^ Rao, B. Raja (1960). "A formula for the curvature of the likelihood surface of a sample drawn from a distribution admitting sufficient statistics". Biometrika. 47 (1–2): 203–207. doi:10.1093/biomet/47.1-2.203.
11. ^ Ward, Michael D.; Ahlquist, John S. (2018). Maximum Likelihood for Social Science : Strategies for Analysis. Cambridge: Cambridge University Press. pp. 25–27. ISBN 978-1-316-63682-4.
12. ^ Billingsley, Patrick (1995). Probability and Measure (Third ed.). John Wiley & Sons. pp. 422–423.
13. ^ Shao, Jun (2003). Mathematical Statistics (2nd ed.). Springer. §4.4.1.
14. ^ a b c d I. J. Good: Probability and the Weighing of Evidence (Griffin 1950), §6.1
15. ^ a b c d H. Jeffreys: Theory of Probability (3rd ed., Oxford University Press 1983), §1.22
16. ^ E. T. Jaynes: Probability Theory: The Logic of Science (Cambridge University Press 2003), §4.1
17. ^ a b c d D. V. Lindley: Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 1: Probability (Cambridge University Press 1980), §1.6
18. ^ a b c d A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin: Bayesian Data Analysis (3rd ed., Chapman & Hall/CRC 2014), §1.3
19. ^ H. C. Sox, M. C. Higgins, D. K. Owens: Medical Decision Making (2nd ed., Wiley, 2013), http://doi.org/10.1002/9781118341544, chapters 3–4
20. ^ Akaike, H. (1985), "Prediction and entropy", in Atkinson, A. C.; Fienberg, S. E. (eds.), A Celebration of Statistics, Springer, pp. 1–24.
21. ^ Sakamoto, Y.; Ishiguro, M.; Kitagawa, G. (1986), Akaike Information Criterion Statistics, D. Reidel, Part I.
22. ^ Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference: A practical information-theoretic approach (2nd ed.), Springer-Verlag, chap. 7.
23. ^ Lee, Youngjo; Nelder, John A. (2005). "Likelihood for random-effect models". Statistics & Operations Research Transactions. 29 (2): 143. ... log likelihood which we shall abbreviate to loglihood (a useful contraction which we owe to Michael Healy).
24. ^ Edwards 1972, p. 12.
25. ^ a b c d Kalbfleisch, J. G. (1985), Probability and Statistical Inference, Springer (§9.3).
26. ^ Azzalini, A. (1996), Statistical Inference—Based on the likelihood, Chapman & Hall, ISBN 9780412606502 (§1.4.2).
27. ^ a b c Sprott, D. A. (2000), Statistical Inference in Science, Springer (chap. 2).
28. ^ Davison, A. C. (2008), Statistical Models, Cambridge University Press (§4.1.2).
29. ^ Held, L.; Sabanés Bové, D. S. (2014), Applied Statistical Inference—Likelihood and Bayes, Springer (§2.1).
30. ^ a b Hudson, D. J. (1971), "Interval estimation from the likelihood function", Journal of the Royal Statistical Society, Series B, 33 (2): 256–262.
31. ^ Burnham K. P. & Anderson D.R. (2002), Model Selection and Multimodel Inference: A practical information-theoretic approach, Springer (§2.8).
32. ^ Pawitan, Yudi (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. ISBN 978-0-19-850765-9.
33. ^ Wen Hsiang Wei. "Generalized Linear Model - course notes". Tunghai University, Taichung, Taiwan. pp. Chapter 5. Retrieved 2017-10-01.
34. ^ Amemiya, Takeshi (1985). "Concentrated Likelihood Function". Advanced Econometrics. Cambridge: Harvard University Press. pp. 125–127. ISBN 978-0-674-00560-0.
35. ^ Davidson, Russell; MacKinnon, James G. (1993). "Concentrating the Loglikelihood Function". Estimation and Inference in Econometrics. New York: Oxford University Press. pp. 267–269. ISBN 978-0-19-506011-9.
36. ^ Gourieroux, Christian; Monfort, Alain (1995). "Concentrated Likelihood Function". Statistics and Econometric Models. New York: Cambridge University Press. pp. 170–175. ISBN 978-0-521-40551-5.
37. ^ Pickles, Andrew (1985). An Introduction to Likelihood Analysis. Norwich: W. H. Hutchins & Sons. pp. 21–24. ISBN 0-86094-190-6.
38. ^ Aitkin, Murray (1982). "Direct Likelihood Inference". GLIM 82: Proceedings of the International Conference on Generalised Linear Models. New York: Springer. pp. 76–86. ISBN 0-387-90777-7.
39. ^ Venzon, D. J.; Moolgavkar, S. H. (1988). "A Method for Computing Profile-Likelihood-Based Confidence Intervals". Journal of the Royal Statistical Society. Series C (Applied Statistics). 37 (1): 87–94. doi:10.2307/2347496.
40. ^ Cox, D. R. (1975). "Partial likelihood". Biometrika. 62 (2): 269–276. doi:10.1093/biomet/62.2.269. MR 0400509.
41. ^ "likelihood", Shorter Oxford English Dictionary (2007).
42. ^
43. ^ Fisher, R.A. (1921), "On the "probable error" of a coefficient of correlation deduced from a small sample", Metron, 1: 3–32.
44. ^ Fisher, R.A. (1922), "On the mathematical foundations of theoretical statistics", Philosophical Transactions of the Royal Society A, 222 (594–604): 309–368, doi:10.1098/rsta.1922.0009, JFM 48.1280.02, JSTOR 91208.
45. ^ Klemens, Ben (2008). Modeling with Data: Tools and Techniques for Scientific Computing. Princeton University Press. p. 329.
46. ^ Fisher, Ronald (1930). "Inverse Probability". Mathematical Proceedings of the Cambridge Philosophical Society. 26 (4): 528–535. doi:10.1017/S0305004100016297.
47. ^ Fienberg, Stephen E (1997). "Introduction to R.A. Fisher on inverse probability and likelihood". Statistical Science. 12 (3): 161. doi:10.1214/ss/1030037905.
48. ^ Royall, R. (1997). Statistical Evidence. Chapman & Hall.