Beta regression is a form of regression which is used when the response variable, y, takes values within (0, 1) and can be assumed to follow a beta distribution.[1] It is generalisable to variables which take values in an arbitrary open interval (a, b) through transformations.[1] Beta regression was developed in the early 2000s by two sets of statisticians: Kieschnick and McCullough in 2003, and Ferrari and Cribari-Neto in 2004.[2]

Description

The modern beta regression process is based on the mean/precision parameterisation of the beta distribution. Here the variable y is assumed to be distributed according to B(μ, φ), where μ is the mean and φ is the precision. As the mean of the distribution, μ is constrained to fall within (0, 1), but φ is not. For a given value of μ, higher values of φ result in a beta distribution with a lower variance, hence its description as a precision parameter.[1]
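The mean/precision parameterisation maps onto the standard shape parameters of the beta distribution as p = μφ and q = (1 − μ)φ. A minimal Python sketch of this mapping (illustrative only, not code from the cited sources):

```python
# Mean/precision parameterisation of the beta distribution:
# shape parameters are p = mu * phi and q = (1 - mu) * phi.

def beta_shapes(mu, phi):
    """Convert mean mu in (0, 1) and precision phi > 0 to shape parameters."""
    return mu * phi, (1 - mu) * phi

def beta_mean_var(mu, phi):
    """Mean and variance implied by the (mu, phi) parameterisation."""
    p, q = beta_shapes(mu, phi)
    mean = p / (p + q)                          # recovers mu
    var = p * q / ((p + q) ** 2 * (p + q + 1))  # equals mu*(1-mu)/(1+phi)
    return mean, var

m, v = beta_mean_var(0.3, 10.0)
# For the same mean, a higher precision gives a smaller variance.
_, v_high = beta_mean_var(0.3, 50.0)
```

Here the check that `v_high < v` for the same mean illustrates why φ is called a precision parameter.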

Beta regression has three major motivations. Firstly, beta-distributed variables are usually heteroscedastic, of a form where the scatter is greater closer to the mean value and lesser in the tails, whereas linear regression assumes homoscedasticity.[1] Secondly, while transformations are available to consider beta-distributed dependent variables within the generalised linear regression framework, these transformations mean that the regressions model E(g(y)) rather than E(y), so the interpretation is in terms of the mean of g(y) rather than the mean of y, which is more awkward.[1] Thirdly, values within (0, 1) are generally from skewed distributions.[1]

The basic algebra of the beta regression is linear in terms of the link function, but even in the equal dispersion case presented below, it is not a special case of generalised linear regression:

g(μᵢ) = xᵢ⊤β = ηᵢ

where g(·) is a link function, xᵢ is a vector of covariates, β is a vector of regression coefficients, and ηᵢ is the linear predictor.[1]: 3 
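As a concrete illustration, with the logit link and hypothetical coefficients, the fitted mean is recovered by applying the inverse link to the linear predictor (a sketch, not the cited authors' code):

```python
import math

def logit(mu):
    """Logit link: g(mu) = log(mu / (1 - mu))."""
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse link: maps the linear predictor back into (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients and covariates for one observation i.
beta = [0.5, -1.2, 0.8]   # regression coefficients (first is the intercept)
x_i = [1.0, 0.4, 1.5]     # covariate vector, leading 1.0 for the intercept

eta_i = sum(b * x for b, x in zip(beta, x_i))  # linear predictor eta_i
mu_i = inv_logit(eta_i)                        # fitted mean, always in (0, 1)
```

Whatever value ηᵢ takes, the inverse link guarantees μᵢ falls in (0, 1), which is why a link function is needed at all.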

It is also notable that the variance of y, Var(y) = μ(1 − μ)/(1 + φ), depends on μ in the model, so beta regressions are naturally heteroscedastic.[1]

Variable dispersion beta regression

There is also variable dispersion beta regression, in which φ is modelled for each observation through its own linear predictor, g₂(φᵢ) = zᵢ⊤γ, rather than being held constant. Likelihood ratio tests can be "interpreted as testing the null hypothesis of equidispersion against a specific alternative of variable dispersion"[1] by comparing fixed and variable dispersion models. For example, within the R programming language, the formula "y ~ x + z" describes an equidispersion model, but it might be compared to any of the following three specific variable dispersion alternatives:

  • y ~ x + z | x
  • y ~ x + z | z
  • y ~ x + z | x + z
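Under a log link for the dispersion submodel, each observation gets its own precision φᵢ = exp(zᵢ⊤γ), which is always positive. A sketch with hypothetical dispersion coefficients (the log link and the values are assumptions for illustration):

```python
import math

# Variable dispersion: g2(phi_i) = z_i . gamma, with a log link assumed,
# so phi_i = exp(z_i . gamma) is guaranteed positive.
gamma = [2.0, 0.9]                         # hypothetical dispersion coefficients
z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # dispersion covariates per observation

phi = [math.exp(sum(g * zj for g, zj in zip(gamma, zi))) for zi in z]
# Each observation now has its own precision; equidispersion is the
# special case where gamma contains only an intercept.
```

In this sketch the precision rises with the dispersion covariate, so later observations are modelled with smaller variance around their mean.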

The Breusch–Pagan test can be used to identify candidate dispersion covariates (the z variables).[1]

The choice of link function can render the need for variable dispersion irrelevant, at least when judged in terms of model fit.[1]

A quasi-RESET diagnostic test (inspired by RESET, the regression specification error test) is available for detecting misspecification, particularly in the context of link function choice.[1] If a power of the fitted mean or linear predictor is added as a covariate and this produces a better model than the same formula without the power term, then the original model formula is misspecified. The quasi-RESET diagnostic may also be applied graphically, for example by comparing the absolute raw residuals of the two models observation by observation; the model that more often has the smaller absolute residual is preferred.[1]
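The observation-by-observation comparison described above can be sketched as a simple count of which model's absolute raw residual is smaller (the residual values here are hypothetical):

```python
# Compare two fitted models by absolute raw residuals, one observation
# at a time; the model that "wins" more often is preferred.
resid_a = [0.02, -0.05, 0.01, 0.08, -0.03]   # hypothetical residuals, model A
resid_b = [0.04, -0.02, 0.03, 0.05, -0.06]   # hypothetical residuals, model B

wins_a = sum(abs(a) < abs(b) for a, b in zip(resid_a, resid_b))
wins_b = len(resid_a) - wins_a
preferred = "A" if wins_a > wins_b else "B"
```

In a graphical version, the same information is read off a scatter plot of one model's absolute residuals against the other's, with points below the diagonal favouring the model on the horizontal axis.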

In general, the closer the observed y values are to the extremes of (0, 1), the more significant the choice of link function.[1]

The link function can also affect whether the maximum likelihood estimation (MLE) procedure that statistical programs use to implement beta regression converges. Furthermore, the MLE procedure can tend to underestimate the standard errors, and therefore the significance inferences, in beta regression.[3] In practice, bias correction (BC) and bias reduction (BR) are essentially diagnostic steps: the analyst compares the model with neither BC nor BR to two models, each implementing one of BC and BR.[3]

The assumptions of beta regression are:

  • link appropriateness ("deviance residuals vs. indices of observation", at least for the logit link[2])
  • homogeneous residuals ("deviance residuals vs. linear predictor"[2])
  • normality ("half-normal plot of deviance residuals"[2])
  • no outliers ("Cook's distance to determine outliers"[2])

References

  1. ^ a b c d e f g h i j k l m n Cribari-Neto, Francisco; Zeileis, Achim (2010). "Beta Regression in R" (PDF). cran.r-project.org.
  2. ^ a b c d e Geissinger, Emilie A.; Khoo, Celyn L. L.; Richmond, Isabella C.; Faulkner, Sally J. M.; Schneider, David C. (February 20, 2022). "A case for beta regression in the natural sciences". Ecosphere. 13 (2). doi:10.1002/ecs2.3940. S2CID 247101790.
  3. ^ a b Grün, Bettina; Kosmidis, Ioannis; Zeileis, Achim (2012). "Extended Beta Regression in R: Shaken, Stirred, Mixed, and Partitioned" (PDF). cran.r-project.org.