Beta distribution with four parameters (Pearson Type I)

A beta distribution with the two shape parameters α and β is supported on the range [0,1]. It is possible to alter the location and scale of the distribution by introducing two further parameters representing the minimum, a, and the maximum, c (c > a), of the distribution's support,[1] by a linear transformation substituting the non-dimensional variable x in terms of the new variable y (with support [a,c]) and the parameters a and c:

x = \frac{y - a}{c - a}, \qquad \text{equivalently} \qquad y = a + (c - a)\,x

The probability density function of the four-parameter beta distribution is equal to that of the two-parameter distribution divided by the range (c − a) (so that the total area under the density curve equals a probability of one), with the "y" variable shifted and scaled as follows:

f(y; \alpha, \beta, a, c) = \frac{f(x; \alpha, \beta)}{c - a} = \frac{(y - a)^{\alpha - 1}\,(c - y)^{\beta - 1}}{(c - a)^{\alpha + \beta - 1}\,\mathrm{B}(\alpha, \beta)}, \qquad a \le y \le c

That a random variable Y is Beta-distributed with four parameters α, β, a, and c will be denoted by:

Y \sim \operatorname{Beta}(\alpha, \beta, a, c).

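In software this four-parameter form corresponds to applying a location and scale to the standard two-parameter beta distribution. A minimal sketch (assuming NumPy and SciPy are available; the parameter values are purely illustrative) that checks the density identity above and the shift and scale of the first two moments discussed below:

```python
import numpy as np
from scipy import stats

alpha, beta_, a, c = 2.5, 1.5, -3.0, 7.0      # illustrative values

# Four-parameter Beta(alpha, beta, a, c) as a shifted and scaled two-parameter beta
Y = stats.beta(alpha, beta_, loc=a, scale=c - a)

y = np.linspace(a + 1e-6, c - 1e-6, 7)
x = (y - a) / (c - a)                          # map back to the [0, 1] variable

# Density identity: f_Y(y) = f_X((y - a)/(c - a)) / (c - a)
print(np.allclose(Y.pdf(y), stats.beta.pdf(x, alpha, beta_) / (c - a)))      # True

# Mean and variance are the [0, 1] results, scaled by (c - a) and shifted by a
print(np.isclose(Y.mean(), a + (c - a) * alpha / (alpha + beta_)))           # True
print(np.isclose(Y.var(),
                 (c - a) ** 2 * alpha * beta_
                 / ((alpha + beta_) ** 2 * (alpha + beta_ + 1))))            # True
```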
The measures of central location are scaled (by (c-a)) and shifted (by a), as follows:

\text{mode}(Y) = a + (c - a)\,\frac{\alpha - 1}{\alpha + \beta - 2}, \qquad \text{for } \alpha, \beta > 1
\text{median}(Y) = a + (c - a)\,\operatorname{median}(X)
\text{mean}(Y) = a + (c - a)\,\frac{\alpha}{\alpha + \beta}
\text{(geometric mean)} \quad G_Y = a + (c - a)\,G_X = a + (c - a)\,e^{\psi(\alpha) - \psi(\alpha + \beta)}
\text{(harmonic mean)} \quad H_Y = a + (c - a)\,H_X = a + (c - a)\,\frac{\alpha - 1}{\alpha + \beta - 1}, \qquad \text{for } \alpha > 1

The statistical dispersion measures are scaled (they do not need to be shifted because they are already centered around the mean) by the range (c-a), linearly for the mean deviation and nonlinearly for the variance:

\text{(mean deviation around the mean)}(Y) = (c - a)\,\frac{2\,\alpha^{\alpha}\beta^{\beta}}{\mathrm{B}(\alpha, \beta)\,(\alpha + \beta)^{\alpha + \beta + 1}}
\operatorname{var}(Y) = (c - a)^{2}\,\frac{\alpha\beta}{(\alpha + \beta)^{2}(\alpha + \beta + 1)}

Since the skewness and excess kurtosis are non-dimensional quantities (as moments centered around the mean and normalized by the standard deviation), they are independent of the parameters a and c, and therefore equal to the expressions given above in terms of X (with support [0,1]):

\text{skewness}(Y) = \text{skewness}(X) = \frac{2(\beta - \alpha)\sqrt{\alpha + \beta + 1}}{(\alpha + \beta + 2)\sqrt{\alpha\beta}}
\text{excess kurtosis}(Y) = \text{excess kurtosis}(X) = \frac{6\left[(\alpha - \beta)^{2}(\alpha + \beta + 1) - \alpha\beta(\alpha + \beta + 2)\right]}{\alpha\beta(\alpha + \beta + 2)(\alpha + \beta + 3)}
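
A quick numerical illustration of this invariance (a sketch assuming SciPy; the parameter values are illustrative): the skewness and excess kurtosis reported for the shifted and scaled distribution do not change with a and c.

```python
from scipy import stats

alpha, beta_ = 2.5, 1.5
for a, c in [(0.0, 1.0), (-3.0, 7.0), (100.0, 100.5)]:
    mean, var, skew, ex_kurt = stats.beta.stats(
        alpha, beta_, loc=a, scale=c - a, moments="mvsk")
    # mean and var depend on (a, c); skew and ex_kurt are the same in every iteration
    print(float(skew), float(ex_kurt))
```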

Parameter estimation

Method of moments

[Figure: Solutions for the shape-parameter estimates vs. (sample) excess kurtosis and (sample) squared skewness for the beta distribution.]

All four parameters (α̂, β̂, â and ĉ, of a beta distribution supported in the [a, c] interval; see the section "Alternative parametrizations, Four parameters") can be estimated, using the method of moments developed by Karl Pearson, by equating sample and population values of the first four central moments (mean, variance, skewness and excess kurtosis).[1][2][3] The excess kurtosis was expressed in terms of the square of the skewness and the sample size ν = α + β (see the previous section titled "Kurtosis"), as follows:

\text{excess kurtosis} = \frac{6}{3 + \nu}\left[\frac{2 + \nu}{4}\,(\text{skewness})^{2} - 1\right], \qquad \text{with } \nu = \alpha + \beta

One can use this equation to solve for the sample size ν = α + β in terms of the square of the skewness and the excess kurtosis as follows:[2]

\hat{\nu} = \hat{\alpha} + \hat{\beta} = 3\,\frac{(\text{sample excess kurtosis}) - (\text{sample skewness})^{2} + 2}{\frac{3}{2}(\text{sample skewness})^{2} - (\text{sample excess kurtosis})}, \qquad \text{valid if } (\text{sample skewness})^{2} - 2 < \text{sample excess kurtosis} < \frac{3}{2}(\text{sample skewness})^{2}

This is the ratio (multiplied by a factor of 3) between the previously derived limit boundaries for the beta distribution in a space (as originally done by Karl Pearson[4]) defined with coordinates of the square of the skewness in one axis and the excess kurtosis in the other axis (see the previous section titled "Kurtosis bounded by the square of the skewness").

The case of zero skewness can be solved immediately, because for zero skewness α = β, and hence ν = 2α = 2β, therefore α̂ = β̂ = ν̂/2:

\hat{\alpha} = \hat{\beta} = \frac{\hat{\nu}}{2} = \frac{3}{2}\,\frac{(\text{sample excess kurtosis}) + 2}{-(\text{sample excess kurtosis})}, \qquad \text{for } -2 < \text{sample excess kurtosis} < 0

(Excess kurtosis is negative for the beta distribution with zero skewness, ranging from −2 to 0, so that ν̂, and therefore the sample shape parameters, is positive, ranging from zero, when the shape parameters approach zero and the excess kurtosis approaches −2, to infinity, when the shape parameters approach infinity and the excess kurtosis approaches zero.)

For non-zero sample skewness one needs to solve a system of two coupled equations. Since the skewness and the excess kurtosis are independent of the parameters a and c, the shape parameters α and β can be uniquely determined from the sample skewness and the sample excess kurtosis, by solving the coupled equations with two known quantities (the sample skewness and the sample excess kurtosis) and two unknowns (the shape parameters):

(\text{skewness})^{2} = \frac{4(\beta - \alpha)^{2}(\alpha + \beta + 1)}{\alpha\beta(\alpha + \beta + 2)^{2}}
\text{excess kurtosis} = \frac{6\left[(\alpha - \beta)^{2}(\alpha + \beta + 1) - \alpha\beta(\alpha + \beta + 2)\right]}{\alpha\beta(\alpha + \beta + 2)(\alpha + \beta + 3)}

resulting in the following solution[2]:

\hat{\alpha}, \hat{\beta} = \frac{\hat{\nu}}{2}\left(1 \pm \frac{1}{\sqrt{1 + \dfrac{16(\hat{\nu} + 1)}{(\hat{\nu} + 2)^{2}\,(\text{sample skewness})^{2}}}}\right), \qquad \text{with } \hat{\nu} = \hat{\alpha} + \hat{\beta} \text{ as above}

Where one should take the solutions as follows: α̂ > β̂ for (negative) sample skewness < 0, and α̂ < β̂ for (positive) sample skewness > 0.

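This step of the moment solution can be written as a short routine. A minimal sketch (the function name is hypothetical; it simply evaluates the equations above, including the zero-skewness special case):

```python
import math

def shape_params_from_moments(skew, ex_kurt):
    """Method-of-moments estimates (alpha, beta) from the sample skewness
    and sample excess kurtosis, per the equations above."""
    if not (skew ** 2 - 2 < ex_kurt < 1.5 * skew ** 2):
        raise ValueError("excess kurtosis outside the admissible region for a beta distribution")
    # nu = alpha + beta from the kurtosis/skewness relation
    nu = 3.0 * (ex_kurt - skew ** 2 + 2.0) / (1.5 * skew ** 2 - ex_kurt)
    if skew == 0.0:
        return nu / 2.0, nu / 2.0                       # symmetric case: alpha = beta = nu/2
    delta = 1.0 / math.sqrt(1.0 + 16.0 * (nu + 1.0) / ((nu + 2.0) ** 2 * skew ** 2))
    large, small = nu / 2.0 * (1.0 + delta), nu / 2.0 * (1.0 - delta)
    # negative sample skewness -> alpha > beta; positive -> alpha < beta
    return (large, small) if skew < 0.0 else (small, large)
```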
The accompanying plot shows these two solutions as surfaces in a space with horizontal axes of (sample excess kurtosis) and (sample squared skewness) and the shape parameters as the vertical axis. The surfaces are constrained by the condition that the sample excess kurtosis must be bounded by the sample squared skewness as stipulated in the above equation. The two surfaces meet at the right edge defined by zero skewness. Along this right edge, both parameters are equal and the distribution is symmetric: U-shaped for α = β < 1, uniform for α = β = 1, upside-down-U-shaped for 1 < α = β < 2 and bell-shaped for α = β > 2. The surfaces also meet at the front (lower) edge defined by the "impossible boundary" line (excess kurtosis + 2 − skewness² = 0). Along this front (lower) boundary both shape parameters approach zero, and the probability density is concentrated more at one end than the other (with practically nothing in between), with probabilities β/(α + β) at the left end and α/(α + β) at the right end. The two surfaces become further apart towards the rear edge. At this rear edge the shape parameters are quite different from each other. As remarked, for example, by Bowman and Shenton,[5] sampling in the neighborhood of the line (sample excess kurtosis − (3/2)(sample skewness)² = 0) (the just-J-shaped portion of the rear edge where blue meets beige) "is dangerously near to chaos", because at that line the denominator of the expression above for the estimate ν = α + β becomes zero, and hence ν approaches infinity as that line is approached. Bowman and Shenton[5] write that "the higher moment parameters (kurtosis and skewness) are extremely fragile (near that line). However the mean and standard deviation are fairly reliable." The problem therefore arises for four-parameter estimation of very skewed distributions, for which the excess kurtosis approaches (3/2) times the square of the skewness. This boundary line is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. See the section titled "Kurtosis bounded by the square of the skewness" for a numerical example and further comments about this rear-edge boundary line (sample excess kurtosis − (3/2)(sample skewness)² = 0). As remarked by Karl Pearson himself,[6] this issue may not be of much practical importance, as the trouble arises only for very skewed J-shaped (or mirror-image J-shaped) distributions with very different values of the shape parameters, which are unlikely to occur often in practice. The usual skewed bell-shaped distributions that occur in practice do not have this parameter-estimation problem.

The remaining two parameters (â, ĉ) can be determined using the sample mean and the sample variance via a variety of equations.[1][2] One alternative is to calculate the support interval range (ĉ − â) based on the sample variance and the sample kurtosis. For this purpose one can solve, in terms of the range (c − a), the equation expressing the excess kurtosis in terms of the sample variance and the sample size ν (see the sections titled "Kurtosis" and "Alternative parametrizations, four parameters"):

\text{excess kurtosis} = \frac{6}{(\nu + 2)(\nu + 3)}\left[\frac{(c - a)^{2}}{\operatorname{var}(Y)} - 6 - 5\nu\right]

to obtain:

(\hat{c} - \hat{a}) = \sqrt{\text{(sample variance)}}\;\sqrt{6 + 5\hat{\nu} + \frac{(\hat{\nu} + 2)(\hat{\nu} + 3)}{6}\,(\text{sample excess kurtosis})}

Another alternative is to calculate the support interval range (ĉ − â) based on the sample variance and the sample skewness.[2] For this purpose one can solve, in terms of the range (c − a), the equation expressing the squared skewness in terms of the sample variance and the sample size ν (see the sections titled "Skewness" and "Alternative parametrizations, four parameters"):

(\text{skewness})^{2} = \frac{4}{(\nu + 2)^{2}}\left[\frac{(c - a)^{2}}{\operatorname{var}(Y)} - 4(1 + \nu)\right]

to obtain[2]:

(\hat{c} - \hat{a}) = \frac{\sqrt{\text{(sample variance)}}}{2}\;\sqrt{(\hat{\nu} + 2)^{2}\,(\text{sample skewness})^{2} + 16(\hat{\nu} + 1)}

The remaining parameter can be determined from the sample mean and the previously obtained parameters (ĉ − â), α̂ and ν̂ = α̂ + β̂:

\hat{a} = (\text{sample mean}) - \left(\frac{\hat{\alpha}}{\hat{\nu}}\right)(\hat{c} - \hat{a})

and finally, of course, ĉ = â + (ĉ − â).

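Continuing the same sketch, the range and the location follow from the sample variance and the sample mean via the skewness-based equation above (again a hypothetical helper name):

```python
import math

def support_from_moments(mean, var, skew, alpha, beta):
    """Method-of-moments support endpoints (a, c) of a four-parameter beta,
    given the sample mean, sample variance, sample skewness and the shape estimates."""
    nu = alpha + beta
    rng = 0.5 * math.sqrt(var) * math.sqrt((nu + 2.0) ** 2 * skew ** 2 + 16.0 * (nu + 1.0))
    a = mean - (alpha / nu) * rng           # a = sample mean - (alpha/nu)(c - a)
    return a, a + rng                       # c = a + (c - a)
```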
In the above formulas one may take, for example, as estimates of the sample moments:

\text{sample mean} = \bar{y} = \frac{1}{N}\sum_{i=1}^{N} Y_i
\text{sample variance} = \bar{v}_Y = \frac{1}{N - 1}\sum_{i=1}^{N} (Y_i - \bar{y})^{2}
\text{sample skewness} = G_1 = \frac{\sqrt{N(N - 1)}}{N - 2}\;\frac{\tfrac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{y})^{3}}{\left[\tfrac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{y})^{2}\right]^{3/2}}
\text{sample excess kurtosis} = G_2 = \frac{N - 1}{(N - 2)(N - 3)}\left[(N + 1)\left(\frac{\tfrac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{y})^{4}}{\left[\tfrac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{y})^{2}\right]^{2}} - 3\right) + 6\right]

The estimators G1 for sample skewness and G2 for sample kurtosis are used by DAP/SAS, PSPP/SPSS, and Excel. However, they are not used by BMDP and (according to [7]) they were not used by MINITAB in 1998. Actually, Joanes and Gill in their 1998 study[7] concluded that the skewness and kurtosis estimators used in BMDP and in MINITAB (at that time) had smaller variance and mean-squared error in normal samples, but the skewness and kurtosis estimators used in DAP/SAS, PSPP/SPSS, namely G1 and G2, had smaller mean-squared error in samples from a very skewed distribution. It is for this reason that we have spelled out "sample skewness", etc., in the above formulas, to make it explicit that the user should choose the best estimator according to the problem at hand, as the best estimator for skewness and kurtosis depends on the amount of skewness (as shown by Joanes and Gill[7]).
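As a practical note, the G1 and G2 estimators referred to above correspond to the bias-corrected options of SciPy's skew and kurtosis functions, so the inputs to the moment equations can be assembled as in this sketch (which also ties together the hypothetical helpers from the sketches above):

```python
import numpy as np
from scipy import stats

def sample_moments(y):
    """Sample mean, unbiased sample variance, sample skewness G1 and sample excess kurtosis G2."""
    y = np.asarray(y, dtype=float)
    return (y.mean(),
            y.var(ddof=1),                                  # unbiased sample variance
            stats.skew(y, bias=False),                      # G1
            stats.kurtosis(y, fisher=True, bias=False))     # G2 (excess kurtosis)

# Illustrative end-to-end use on simulated data:
rng = np.random.default_rng(0)
y = stats.beta.rvs(2.5, 1.5, loc=-3.0, scale=10.0, size=10_000, random_state=rng)
mean, var, skew, ex_kurt = sample_moments(y)
alpha_hat, beta_hat = shape_params_from_moments(skew, ex_kurt)              # sketch above
a_hat, c_hat = support_from_moments(mean, var, skew, alpha_hat, beta_hat)   # sketch above
```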

Maximum likelihood

The procedure is similar to the one followed in the two-unknown-parameter case. If Y_1, ..., Y_N are independent random variables each having a beta distribution with four parameters, the joint log likelihood function for N iid observations is:

\ln\,\mathcal{L}(\alpha, \beta, a, c \mid Y) = \sum_{i=1}^{N}\left[(\alpha - 1)\ln\frac{Y_i - a}{c - a} + (\beta - 1)\ln\frac{c - Y_i}{c - a}\right] - N\ln\mathrm{B}(\alpha, \beta) - N\ln(c - a)

Finding the maximum with respect to each of the four parameters involves taking the partial derivative with respect to that parameter and setting the resulting expression equal to zero, yielding the maximum likelihood estimating equations:

\frac{\partial \ln\mathcal{L}}{\partial\alpha} = \sum_{i=1}^{N}\ln\frac{Y_i - a}{c - a} - N\left[\psi(\alpha) - \psi(\alpha + \beta)\right] = 0
\frac{\partial \ln\mathcal{L}}{\partial\beta} = \sum_{i=1}^{N}\ln\frac{c - Y_i}{c - a} - N\left[\psi(\beta) - \psi(\alpha + \beta)\right] = 0
\frac{\partial \ln\mathcal{L}}{\partial a} = -(\alpha - 1)\sum_{i=1}^{N}\frac{1}{Y_i - a} + (\alpha + \beta - 1)\frac{N}{c - a} = 0
\frac{\partial \ln\mathcal{L}}{\partial c} = (\beta - 1)\sum_{i=1}^{N}\frac{1}{c - Y_i} - (\alpha + \beta - 1)\frac{N}{c - a} = 0

these equations can be re-arranged as the following system of four coupled equations (the first two of which involve geometric means and the last two of which involve harmonic means) in terms of the maximum likelihood estimates for the four parameters α̂, β̂, â, ĉ:

\psi(\hat{\alpha}) - \psi(\hat{\alpha} + \hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N}\ln\frac{Y_i - \hat{a}}{\hat{c} - \hat{a}} = \ln\hat{G}_X
\psi(\hat{\beta}) - \psi(\hat{\alpha} + \hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N}\ln\frac{\hat{c} - Y_i}{\hat{c} - \hat{a}} = \ln\hat{G}_{(1-X)}
\frac{1}{\frac{1}{N}\sum_{i=1}^{N}\dfrac{\hat{c} - \hat{a}}{Y_i - \hat{a}}} = \frac{\hat{\alpha} - 1}{\hat{\alpha} + \hat{\beta} - 1} = \hat{H}_X
\frac{1}{\frac{1}{N}\sum_{i=1}^{N}\dfrac{\hat{c} - \hat{a}}{\hat{c} - Y_i}} = \frac{\hat{\beta} - 1}{\hat{\alpha} + \hat{\beta} - 1} = \hat{H}_{(1-X)}

with sample geometric means:

\hat{G}_X = \prod_{i=1}^{N}\left(\frac{Y_i - \hat{a}}{\hat{c} - \hat{a}}\right)^{1/N}
\hat{G}_{(1-X)} = \prod_{i=1}^{N}\left(\frac{\hat{c} - Y_i}{\hat{c} - \hat{a}}\right)^{1/N}

The parameters â and ĉ are embedded inside the geometric mean expressions in a nonlinear way (to the power 1/N). This precludes, in general, a closed form solution, even for an initial value approximation for iteration purposes. One alternative is to use as initial values for iteration the values obtained from the method of moments solution for the four parameter case. Furthermore, the expressions for the harmonic means are well-defined only for α̂ > 1 and β̂ > 1 respectively, which precludes a maximum likelihood solution for shape parameters less than unity in the four-parameter case. Fisher's information matrix for the four parameter case is positive-definite only for α > 2 and β > 2 (for further discussion, see the section on the Fisher information matrix, four parameter case), that is, for bell-shaped (symmetric or unsymmetric) beta distributions with inflection points located to either side of the mode. The following Fisher information components (which represent the expectations of the curvature of the log likelihood function) have singularities at the following values: I(a,a) at α = 2; I(c,c) at β = 2; I(α,a) at α = 1; and I(β,c) at β = 1 (for further discussion see the section on the Fisher information matrix). Thus, it is not possible to strictly carry out the maximum likelihood estimation for some well known distributions belonging to the four-parameter beta distribution family, like the uniform distribution (Beta(1, 1, a, c)) and the arcsine distribution (Beta(1/2, 1/2, a, c)). N. L. Johnson and S. Kotz[1] ignore the equations for the harmonic means and instead suggest "If a and c are unknown, and maximum likelihood estimators of a, c, α and β are required, the above procedure [for the two unknown parameter case, with X transformed as X = (Y − a)/(c − a)] can be repeated using a succession of trial values of a and c, until the pair (a, c) for which maximum likelihood (given a and c) is as great as possible, is attained" (where, for the purpose of clarity, their notation for the parameters has been translated into the present notation).
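
A minimal numerical sketch of this approach (assuming NumPy and SciPy; data and starting values are illustrative): the negative of the joint log likelihood above is minimized directly, with the search kept in the region where α, β > 0, â is below the smallest observation and ĉ is above the largest. In practice the method-of-moments estimates would serve as the starting point, as suggested above.

```python
import numpy as np
from scipy import optimize, special, stats

def neg_log_likelihood(params, y):
    """Negative joint log likelihood of a four-parameter beta (expression above)."""
    alpha, beta, a, c = params
    if not (alpha > 0 and beta > 0 and a < y.min() and c > y.max()):
        return np.inf                                  # outside the admissible region
    x = (y - a) / (c - a)
    log_lik = (np.sum((alpha - 1.0) * np.log(x) + (beta - 1.0) * np.log1p(-x))
               - len(y) * (special.betaln(alpha, beta) + np.log(c - a)))
    return -log_lik

# Illustrative data and a crude starting point (method-of-moments values would be better)
rng = np.random.default_rng(1)
y = stats.beta.rvs(3.0, 2.5, loc=1.0, scale=4.0, size=2_000, random_state=rng)
start = np.array([2.0, 2.0, y.min() - 0.5, y.max() + 0.5])

result = optimize.minimize(neg_log_likelihood, start, args=(y,), method="Nelder-Mead")
alpha_hat, beta_hat, a_hat, c_hat = result.x
```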

Fisher information matrix

Let a random variable X have a probability density f(x;α). The partial derivative with respect to the (unknown, and to be estimated) parameter α of the log likelihood function is called the score. The second moment of the score is called the Fisher information:

\mathcal{I}(\alpha) = \operatorname{E}\!\left[\left(\frac{\partial}{\partial\alpha}\ln f(X; \alpha)\right)^{2}\right]

The expectation of the score is zero, therefore the Fisher information is also the second moment centered around the mean of the score: the variance of the score.

If the log likelihood function is twice differentiable with respect to the parameter α, and under certain regularity conditions,[8] then the Fisher information may also be written as follows (which is often a more convenient form for calculation purposes):

\mathcal{I}(\alpha) = -\operatorname{E}\!\left[\frac{\partial^{2}}{\partial\alpha^{2}}\ln f(X; \alpha)\right]

Thus, the Fisher information is the negative of the expectation of the second derivative, with respect to the parameter α, of the log likelihood function. Therefore, Fisher information is a measure of the curvature of the log likelihood function of α. A flatter log likelihood curve, with low curvature (and therefore high radius of curvature), has low Fisher information, while a log likelihood curve with large curvature (and therefore low radius of curvature) has high Fisher information. When the Fisher information matrix is computed at the estimates of the parameters ("the observed Fisher information matrix"), it is equivalent to replacing the true log likelihood surface by a Taylor series approximation taken as far as the quadratic terms.[9] The word information, in the context of Fisher information, refers to information about the parameters, in matters such as estimation, sufficiency, and the variances of estimators. The Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any unbiased estimator of a parameter α:

\operatorname{var}(\hat{\alpha}) \ge \frac{1}{\mathcal{I}(\alpha)}

The precision to which one can estimate a parameter α is limited by the Fisher information of the log likelihood function. The Fisher information is a measure of the minimum error involved in estimating a parameter of a distribution, and it can be viewed as a measure of the resolving power of an experiment needed to discriminate between two alternative hypotheses about a parameter.[10]

When there are N parameters θ_1, ..., θ_N, the Fisher information takes the form of an N×N positive semidefinite symmetric matrix, the Fisher information matrix, with typical element:

(\mathcal{I}(\theta))_{i,j} = \operatorname{E}\!\left[\frac{\partial}{\partial\theta_i}\ln f(X; \theta)\;\frac{\partial}{\partial\theta_j}\ln f(X; \theta)\right]

Under certain regularity conditions,[8] the Fisher Information Matrix may also be written in the following form, which is often more convenient for computation:

(\mathcal{I}(\theta))_{i,j} = -\operatorname{E}\!\left[\frac{\partial^{2}}{\partial\theta_i\,\partial\theta_j}\ln f(X; \theta)\right]

With X_1, ..., X_N iid random variables, an N-dimensional "box" can be constructed with sides X_1, ..., X_N. Costa and Cover[11] show that the (Shannon) differential entropy h(X) is related to the volume of the typical set (having the sample entropy close to the true entropy), while the Fisher information is related to the surface of this typical set.
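
The equivalence of the two forms above (the expected squared score versus the negative expected second derivative) can be verified numerically. A small sketch for a beta density in which β is held fixed and α is the only unknown, so that the Fisher information reduces to the variance of ln X (illustrative values; assuming NumPy and SciPy):

```python
import numpy as np
from scipy import special, stats

alpha, beta_ = 2.0, 3.0
rng = np.random.default_rng(0)
x = stats.beta.rvs(alpha, beta_, size=200_000, random_state=rng)

# Score for alpha: d/d(alpha) ln f(x; alpha) = ln x - [psi(alpha) - psi(alpha + beta)]
score = np.log(x) - (special.digamma(alpha) - special.digamma(alpha + beta_))

fisher_monte_carlo = np.mean(score ** 2)          # E[(score)^2], estimated by simulation
fisher_analytic = special.polygamma(1, alpha) - special.polygamma(1, alpha + beta_)
print(fisher_monte_carlo, fisher_analytic)        # the two values agree closely
```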


Four parameters

[Figure: Fisher information I(a,a) for α = β, vs. range (c − a) and exponent α = β.]
[Figure: Fisher information I(α,a) for α = β, vs. range (c − a) and exponent α = β.]

If Y_1, ..., Y_N are independent random variables each having a beta distribution with four parameters: the exponents α and β, together with "a" (the minimum of the distribution's range) and "c" (the maximum of the distribution's range) (see the section titled "Alternative parametrizations", "Four parameters"), with probability density function:

f(y; \alpha, \beta, a, c) = \frac{(y - a)^{\alpha - 1}\,(c - y)^{\beta - 1}}{(c - a)^{\alpha + \beta - 1}\,\mathrm{B}(\alpha, \beta)}, \qquad a \le y \le c

the joint log likelihood function per N iid observations is:

\frac{1}{N}\ln\,\mathcal{L}(\alpha, \beta, a, c \mid Y) = \frac{\alpha - 1}{N}\sum_{i=1}^{N}\ln\frac{Y_i - a}{c - a} + \frac{\beta - 1}{N}\sum_{i=1}^{N}\ln\frac{c - Y_i}{c - a} - \ln\mathrm{B}(\alpha, \beta) - \ln(c - a)

For the four parameter case, the Fisher information has 4 × 4 = 16 components, of which 12 are off-diagonal (16 total − 4 diagonal). Since the Fisher information matrix is symmetric, half of these off-diagonal components (12/2 = 6) are independent. Therefore the Fisher information matrix has 6 independent off-diagonal + 4 diagonal = 10 independent components. Aryal and Nadarajah[12] calculated Fisher's information matrix for the four parameter case as follows:

\mathcal{I}_{\alpha,\alpha} = \operatorname{var}[\ln X] = \psi_1(\alpha) - \psi_1(\alpha + \beta) = \ln(\operatorname{var}_{GX})
\mathcal{I}_{\beta,\beta} = \operatorname{var}[\ln(1 - X)] = \psi_1(\beta) - \psi_1(\alpha + \beta) = \ln(\operatorname{var}_{G(1-X)})
\mathcal{I}_{\alpha,\beta} = \operatorname{cov}[\ln X, \ln(1 - X)] = -\psi_1(\alpha + \beta) = \ln(\operatorname{cov}_{G\,X,(1-X)})

In the above expressions, the use of X instead of Y in the expressions var[ln X] and cov[ln X, ln(1 − X)] is not an error. The expressions in terms of the log geometric variances and log geometric covariance occur as functions of the two-parameter X ~ Beta(α, β) parametrization because, when taking the partial derivatives with respect to the exponents (α, β) in the four parameter case, one obtains expressions identical to those for the two parameter case: these terms of the four parameter Fisher information matrix are independent of the minimum "a" and maximum "c" of the distribution's range. The only non-zero term upon double differentiation of the log likelihood function with respect to the exponents α and β is the second derivative of the log of the beta function, ∂² ln B(α, β)/∂α ∂β. This term is independent of the minimum "a" and maximum "c" of the distribution's range, and double differentiation of ln B(α, β) results in trigamma functions. The sections titled "Maximum likelihood", "Two unknown parameters" and "Four unknown parameters" also show this fact.

The Fisher information for N i.i.d. samples is N times the individual Fisher information (eq. 11.279, page 394 of Cover and Thomas[13]). (Aryal and Nadarajah[12] take a single observation, N=1, to calculate the following components of the Fisher information, which leads to the same result as considering the derivatives of the log likelihood per N observations):

\mathcal{I}_{a,a} = -\operatorname{E}\!\left[\frac{\partial^{2}\ln f}{\partial a^{2}}\right] = \frac{\beta(\alpha + \beta - 1)}{(\alpha - 2)(c - a)^{2}}, \qquad \text{if } \alpha > 2

(In the above expression the erroneous expression for I(a,a) in Aryal and Nadarajah[12] has been corrected.)

\mathcal{I}_{c,c} = \frac{\alpha(\alpha + \beta - 1)}{(\beta - 2)(c - a)^{2}}, \qquad \text{if } \beta > 2
\mathcal{I}_{a,c} = \frac{\alpha + \beta - 1}{(c - a)^{2}}
\mathcal{I}_{\alpha,a} = \frac{\beta}{(\alpha - 1)(c - a)}, \qquad \text{if } \alpha > 1
\mathcal{I}_{\alpha,c} = \frac{1}{c - a}
\mathcal{I}_{\beta,a} = -\frac{1}{c - a}
\mathcal{I}_{\beta,c} = -\frac{\alpha}{(\beta - 1)(c - a)}, \qquad \text{if } \beta > 1

The lower two diagonal entries of the Fisher information matrix, with respect to the parameter "a" (the minimum of the distribution's range), I(a,a), and with respect to the parameter "c" (the maximum of the distribution's range), I(c,c), are only defined for exponents α > 2 and β > 2 respectively. The Fisher information matrix component I(a,a) for the minimum "a" approaches infinity as the exponent α approaches 2 from above, and the Fisher information matrix component I(c,c) for the maximum "c" approaches infinity as the exponent β approaches 2 from above.

The Fisher information matrix for the four parameter case does not depend on the individual values of the minimum "a" and the maximum "c", but only on the total range (c - a). Moreover, the components of the Fisher information matrix that depend on the range (c - a), depend only through its inverse (or the square of the inverse), such that the Fisher information decreases for increasing range (c - a).

The accompanying images show the Fisher information components I(a,a) and I(α,a). Images for the Fisher information components I(α,α) and I(β,β) are shown in the section titled "Geometric variance". All these Fisher information components look like a basin, with the "walls" of the basin being located at low values of the parameters.

The following four-parameter-beta-distribution Fisher information components can be expressed in terms of the two-parameter (X ~ Beta(α, β)) expectations of the transformed ratio ((1 − X)/X) and of its mirror image (X/(1 − X)), scaled by the range (c − a), which may be helpful for interpretation:

\mathcal{I}_{\alpha,a} = \frac{1}{c - a}\operatorname{E}\!\left[\frac{1 - X}{X}\right] = \frac{\beta}{(\alpha - 1)(c - a)}, \qquad \text{if } \alpha > 1
\mathcal{I}_{\beta,c} = -\frac{1}{c - a}\operatorname{E}\!\left[\frac{X}{1 - X}\right] = -\frac{\alpha}{(\beta - 1)(c - a)}, \qquad \text{if } \beta > 1

These are also the expected values of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI) [1] and its mirror image, scaled by the range (c-a).

Also, the following Fisher information components can be expressed in terms of the harmonic (1/X) variances or of variances based on the ratio transformed variables ((1-X)/X) as follows:

\mathcal{I}_{a,a} = \left(\frac{\alpha - 1}{c - a}\right)^{2}\operatorname{var}\!\left[\frac{1 - X}{X}\right] = \frac{\beta(\alpha + \beta - 1)}{(\alpha - 2)(c - a)^{2}}, \qquad \text{if } \alpha > 2
\mathcal{I}_{c,c} = \left(\frac{\beta - 1}{c - a}\right)^{2}\operatorname{var}\!\left[\frac{X}{1 - X}\right] = \frac{\alpha(\alpha + \beta - 1)}{(\beta - 2)(c - a)^{2}}, \qquad \text{if } \beta > 2
\mathcal{I}_{a,c} = -\frac{(\alpha - 1)(\beta - 1)}{(c - a)^{2}}\operatorname{cov}\!\left[\frac{1 - X}{X}, \frac{X}{1 - X}\right] = \frac{\alpha + \beta - 1}{(c - a)^{2}}

See section "Moments of linearly-transformed, product and inverted random variables" for these expectations.

The determinant of Fisher's information matrix is of interest (for example for the calculation of Jeffreys prior probability). From the expressions for the individual components, it follows that the determinant of Fisher's (symmetric) information matrix for the beta distribution with four parameters is:

 

Using Sylvester's criterion (checking whether the leading principal minors are all positive), and since the diagonal components I(a,a) and I(c,c) have singularities at α = 2 and β = 2, it follows that the Fisher information matrix for the four parameter case is positive-definite for α > 2 and β > 2. Since for α > 2 and β > 2 the beta distribution is (symmetric or unsymmetric) bell-shaped, it follows that the Fisher information matrix is positive-definite only for bell-shaped (symmetric or unsymmetric) beta distributions, with inflection points located to either side of the mode. Thus, important well known distributions belonging to the four-parameter beta distribution family, like the parabolic distribution (Beta(2,2,a,c)) and the uniform distribution (Beta(1,1,a,c)), have Fisher information components (I(a,a) and I(c,c)) that blow up (approach infinity) in the four-parameter case (although their Fisher information components are all defined for the two parameter case). The four-parameter Wigner semicircle distribution (Beta(3/2,3/2,a,c)) and arcsine distribution (Beta(1/2,1/2,a,c)) have negative Fisher information determinants for the four-parameter case.
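
A sketch that assembles the per-observation matrix from the component expressions reconstructed above (using the −E[∂² ln L] convention) and checks positive-definiteness numerically for an illustrative bell-shaped case with α, β > 2:

```python
import numpy as np
from scipy.special import polygamma

def beta4_fisher_information(alpha, beta, a, c):
    """Per-observation Fisher information matrix of Beta(alpha, beta, a, c),
    with parameters ordered (alpha, beta, a, c); valid for alpha > 2 and beta > 2."""
    r = c - a
    trigamma = lambda z: polygamma(1, z)
    I = np.empty((4, 4))
    I[0, 0] = trigamma(alpha) - trigamma(alpha + beta)             # I(alpha, alpha)
    I[1, 1] = trigamma(beta) - trigamma(alpha + beta)              # I(beta, beta)
    I[0, 1] = I[1, 0] = -trigamma(alpha + beta)                    # I(alpha, beta)
    I[2, 2] = beta * (alpha + beta - 1) / ((alpha - 2) * r ** 2)   # I(a, a)
    I[3, 3] = alpha * (alpha + beta - 1) / ((beta - 2) * r ** 2)   # I(c, c)
    I[2, 3] = I[3, 2] = (alpha + beta - 1) / r ** 2                # I(a, c)
    I[0, 2] = I[2, 0] = beta / ((alpha - 1) * r)                   # I(alpha, a)
    I[0, 3] = I[3, 0] = 1.0 / r                                    # I(alpha, c)
    I[1, 2] = I[2, 1] = -1.0 / r                                   # I(beta, a)
    I[1, 3] = I[3, 1] = -alpha / ((beta - 1) * r)                  # I(beta, c)
    return I

info = beta4_fisher_information(3.0, 4.0, 0.0, 10.0)
print(np.linalg.eigvalsh(info))   # all eigenvalues positive: positive-definite for alpha, beta > 2
```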


References

  1. Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1995). Continuous Univariate Distributions, Vol. 2 (2nd ed.). Wiley.
  2. Elderton, William Palin (1906). Frequency-Curves and Correlation. Charles and Edwin Layton (London).
  3. Elderton, William Palin; Johnson, Norman Lloyd (2009). Systems of Frequency Curves. Cambridge University Press. ISBN 978-0521093361.
  4. Pearson, Karl (1916). "Mathematical contributions to the theory of evolution, XIX: Second supplement to a memoir on skew variation". Philosophical Transactions of the Royal Society A. 216: 429–457.
  5. Bowman, K. O.; Shenton, L. R. (2007). "The beta distribution, moment method, Karl Pearson and R.A. Fisher" (PDF). Far East Journal of Theoretical Statistics. 23 (2): 133–164.
  6. Pearson, Karl (1936). "Method of moments and method of maximum likelihood". Biometrika. 28 (1/2): 34–59.
  7. Joanes, D. N.; Gill, C. A. (1998). "Comparing measures of sample skewness and kurtosis". The Statistician. 47 (1): 183–189. doi:10.1111/1467-9884.00122.
  8. Silvey, S. D. (1975). Statistical Inference. Chapman and Hall. p. 40. ISBN 978-0412138201.
  9. Edwards, A. W. F. (1992). Likelihood. The Johns Hopkins University Press. ISBN 978-0801844430.
  10. Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. ISBN 978-0521592710.
  11. Costa, Max; Cover, Thomas (September 1983). On the similarity of the entropy power inequality and the Brunn–Minkowski inequality (PDF). Technical Report 48, Department of Statistics, Stanford University.
  12. Aryal, Gokarna; Nadarajah, Saralees (2004). "Information matrix for beta distributions". Serdica Mathematical Journal. 30 (4): 513–526.
  13. Cover, Thomas M.; Thomas, Joy A. (2006). Elements of Information Theory (2nd ed.). John Wiley & Sons.