Quantile-parameterized distribution

A quantile-parameterized distribution (QPD) is a probability distribution that is directly parameterized by data. QPDs were created to meet the need for easy-to-use continuous probability distributions flexible enough to represent a wide range of uncertainties, such as those commonly encountered in business, economics, engineering, and science. Because QPDs are directly parameterized by data, they avoid the intermediate step of parameter estimation, a time-consuming process that typically requires nonlinear iterative methods to estimate probability-distribution parameters from data. Some QPDs also have virtually unlimited shape flexibility and closed-form moments.

History

The development of quantile-parameterized distributions was inspired by the practical need for flexible continuous probability distributions that are easy to fit to data. Historically, the Pearson[1] and Johnson[2][3] families of distributions have been used when shape flexibility is needed. That is because both families can match the first four moments (mean, variance, skewness, and kurtosis) of any data set. In many cases, however, these distributions are either difficult to fit to data or not flexible enough to fit the data appropriately.

For example, the beta distribution is a flexible Pearson distribution that is frequently used to model percentages of a population. However, if the characteristics of this population are such that the desired cumulative distribution function (CDF) should run through certain specific CDF points, there may be no beta distribution that meets this need. Because the beta distribution has only two shape parameters, it cannot, in general, match even three specified CDF points. Moreover, the beta parameters that best fit such data can be found only by nonlinear iterative methods.

Practitioners of decision analysis, needing distributions easily parameterized by three or more CDF points (e.g., because such points were specified as the result of an expert-elicitation process), originally invented quantile-parameterized distributions for this purpose. Keelin and Powley (2011)[4] provided the original definition. Subsequently, Keelin (2016)[5] developed the metalog distributions, a family of quantile-parameterized distributions that has virtually unlimited shape flexibility, simple equations, and closed-form moments.

Definition

Keelin and Powley[4] define a quantile-parameterized distribution as one whose quantile function (inverse CDF) can be written in the form

$$Q(y) = \begin{cases} L_0 & \text{for } y = 0 \\ \sum_{i=1}^{n} a_i g_i(y) & \text{for } 0 < y < 1 \\ L_1 & \text{for } y = 1 \end{cases}$$

where

$$L_0 = \lim_{y \to 0^+} Q(y), \qquad L_1 = \lim_{y \to 1^-} Q(y),$$

and the functions $g_i(y)$ are continuously differentiable and linearly independent basis functions. Here, essentially, $L_0$ and $L_1$ are the lower and upper bounds (if they exist) of a random variable with quantile function $Q(y)$. These distributions are called quantile-parameterized because for a given set of quantile pairs $\{(x_i, y_i) \mid i = 1, \ldots, n\}$, where $x_i = Q(y_i)$, and a set of $n$ basis functions $g_i(y)$, the coefficients $a_i$ can be determined by solving a set of linear equations.[4] If one desires to use more quantile pairs than basis functions, then the coefficients $a_i$ can be chosen to minimize the sum of squared errors between the stated quantiles $x_i$ and $Q(y_i)$. Keelin and Powley[4] illustrate this concept for a specific choice of basis functions that generalizes the quantile function of the normal distribution, $x = \mu + \sigma \Phi^{-1}(y)$, by making the mean $\mu$ and standard deviation $\sigma$ linear functions of cumulative probability $y$:

$$\mu(y) = a_1 + a_4 y$$
$$\sigma(y) = a_2 + a_3 y$$

The result is a four-parameter distribution that can be fit to a set of four quantile/probability pairs exactly, or to any number of such pairs by linear least squares. Keelin and Powley[4] call this the Simple Q-Normal distribution. Some skewed and symmetric Simple Q-Normal PDFs are shown in the figures below.

 
[Figure: Symmetric Simple Q-Normal PDFs]
[Figure: Skewed Simple Q-Normal PDFs]
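
As a minimal sketch, the Simple Q-Normal quantile function can be evaluated directly from its four coefficients. The Python below assumes the basis functions 1, Φ⁻¹(y), yΦ⁻¹(y), and y, which follow from substituting a mean and standard deviation that are linear in y into the normal quantile function. The function names and the example coefficients are illustrative and are not taken from Keelin and Powley.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the inverse CDF


def simple_q_normal_basis(y):
    """Return the assumed basis values [g_1(y), ..., g_4(y)] for 0 < y < 1."""
    z = STD_NORMAL.inv_cdf(y)
    return [1.0, z, y * z, y]


def simple_q_normal_quantile(a, y):
    """Evaluate Q(y) = sum_i a_i * g_i(y) for coefficients a = [a1, a2, a3, a4]."""
    return sum(a_i * g_i for a_i, g_i in zip(a, simple_q_normal_basis(y)))


# With a3 = a4 = 0 the quantile function is that of a normal distribution
# with mean a1 and standard deviation a2 (illustrative values).
a_normal = [10.0, 2.0, 0.0, 0.0]
median = simple_q_normal_quantile(a_normal, 0.5)
q90 = simple_q_normal_quantile(a_normal, 0.9)

# Nonzero a3 and a4 skew the distribution (illustrative values).
a_skewed = [10.0, 2.0, 0.5, 1.0]
skewed_quantiles = [simple_q_normal_quantile(a_skewed, y) for y in (0.1, 0.5, 0.9)]
```

In the skewed example the 90th percentile lies farther from the median than the 10th percentile does, which is the kind of asymmetry the two extra coefficients are meant to capture.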

Properties

QPDs that meet Keelin and Powley’s definition have the following properties.

Probability density function

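Differentiating $x = Q(y) = \sum_{i=1}^{n} a_i g_i(y)$ with respect to $y$ yields $dx/dy$. The reciprocal of this quantity, $dy/dx$, is the probability density function (PDF)

$$f(y) = \left( \sum_{i=1}^{n} a_i \frac{d g_i(y)}{dy} \right)^{-1}$$

where $0 < y < 1$. Note that this PDF is expressed as a function of cumulative probability $y$ rather than $x$. To plot it, as shown in the figures, vary $y \in (0, 1)$ parametrically. Plot $x = Q(y)$ on the horizontal axis and $f(y)$ on the vertical axis.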
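
The parametric plotting recipe above can be sketched in a few lines of Python. This example assumes the Simple Q-Normal basis functions 1, Φ⁻¹(y), yΦ⁻¹(y), and y, and uses the fact that the derivative of Φ⁻¹(y) is 1/φ(Φ⁻¹(y)), where φ is the standard normal density. The helper names and the grid size are hypothetical choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def quantile(a, y):
    """Q(y) for the assumed Simple Q-Normal basis [1, z, y*z, y]."""
    z = STD_NORMAL.inv_cdf(y)
    return a[0] + a[1] * z + a[2] * y * z + a[3] * y


def dq_dy(a, y):
    """Derivative dQ/dy, term by term for the same basis."""
    z = STD_NORMAL.inv_cdf(y)
    dz = 1.0 / STD_NORMAL.pdf(z)  # d/dy of the inverse normal CDF
    return a[1] * dz + a[2] * (z + y * dz) + a[3]


def pdf_curve(a, num_points=99):
    """Return (x, f) pairs obtained by varying y over an interior grid."""
    points = []
    for k in range(1, num_points + 1):
        y = k / (num_points + 1)
        points.append((quantile(a, y), 1.0 / dq_dy(a, y)))
    return points


# Illustrative coefficients: a normal distribution with mean 10, std 2.
curve = pdf_curve([10.0, 2.0, 0.0, 0.0])
x_mid, f_mid = curve[49]  # the grid point at y = 0.5
```

Each pair in `curve` can be passed to any plotting tool as a horizontal and vertical coordinate. No inversion of the CDF is needed at any step.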

Feasibility

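A function of the form of $Q(y)$ is a feasible probability distribution if and only if $f(y) > 0$ for all $y \in (0, 1)$.[4] This implies a feasibility constraint on the set of coefficients $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$:

$$\sum_{i=1}^{n} a_i \frac{d g_i(y)}{dy} > 0 \quad \text{for all } y \in (0, 1)$$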

In practical applications, feasibility must generally be checked rather than assumed.
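
One simple way to check feasibility in practice is to evaluate the derivative of the quantile function on a grid of cumulative probabilities. The sketch below does this for the Simple Q-Normal basis (an assumption, as are the function names and grid size). A grid check can only detect violations at the sampled points, so it is a necessary test rather than a proof of feasibility.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def dq_dy(a, y):
    """dQ/dy for the assumed Simple Q-Normal basis [1, z, y*z, y]."""
    z = STD_NORMAL.inv_cdf(y)
    dz = 1.0 / STD_NORMAL.pdf(z)
    return a[1] * dz + a[2] * (z + y * dz) + a[3]


def is_feasible_on_grid(a, num_points=999):
    """Return True if dQ/dy > 0 at every interior grid point y = k/(num_points+1)."""
    return all(
        dq_dy(a, k / (num_points + 1)) > 0.0
        for k in range(1, num_points + 1)
    )


# A standard normal is feasible; a large negative a3 (illustrative value)
# makes the quantile function decrease near y = 0.5, which is infeasible.
feasible = is_feasible_on_grid([0.0, 1.0, 0.0, 0.0])
infeasible = is_feasible_on_grid([0.0, 1.0, -3.0, 0.0])
```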

Convexity

A QPD’s set of feasible coefficients $S_a = \{ a \in \mathbb{R}^n \mid \sum_{i=1}^{n} a_i \, dg_i(y)/dy > 0 \text{ for all } y \in (0, 1) \}$ is convex. Because convex optimization requires convex feasible sets, this property simplifies optimization applications involving QPDs.

Fitting to data

The coefficients $a$ can be determined from data by linear least squares. Given $m$ data points $(x_i, y_i)$ that are intended to characterize the CDF of a QPD, and the $m \times n$ matrix $Y$ whose elements are $Y_{ij} = g_j(y_i)$, then, so long as $Y^T Y$ is invertible, the column vector of coefficients $a$ can be determined as $a = (Y^T Y)^{-1} Y^T x$, where $m \geq n$ and $x = (x_1, \ldots, x_m)^T$ is the column vector of data values. If $m = n$, this equation reduces to $a = Y^{-1} x$, and the resulting CDF runs through all data points exactly. An alternative method, implemented as a linear program, determines the coefficients by minimizing the sum of absolute distances between the CDF and the data subject to feasibility constraints.[6]
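
The least-squares fit can be sketched with the normal equations and a small linear solver. The example below assumes the Simple Q-Normal basis and uses only the Python standard library; the solver, the function names, and the synthetic data are illustrative. In practice a numerical library's least-squares routine would usually be preferable to forming the normal equations explicitly.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def basis(y):
    """Assumed Simple Q-Normal basis values [1, z, y*z, y]."""
    z = STD_NORMAL.inv_cdf(y)
    return [1.0, z, y * z, y]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_qpd(xs, ys):
    """Return coefficients a solving the normal equations (Y^T Y) a = Y^T x."""
    Y = [basis(y) for y in ys]
    n = len(Y[0])
    YtY = [[sum(row[i] * row[j] for row in Y) for j in range(n)] for i in range(n)]
    Ytx = [sum(row[i] * x for row, x in zip(Y, xs)) for i in range(n)]
    return solve(YtY, Ytx)


# Synthetic data: quantiles generated from known (illustrative) coefficients.
true_a = [10.0, 2.0, 0.5, 1.0]
ys = [0.1, 0.25, 0.5, 0.75, 0.9]
xs = [sum(a_i * g_i for a_i, g_i in zip(true_a, basis(y))) for y in ys]
fitted_a = fit_qpd(xs, ys)
```

Because the synthetic quantiles lie exactly on a Simple Q-Normal curve, the fit recovers the generating coefficients up to rounding error. With noisy or expert-assessed data, the fitted coefficients should still be checked for feasibility.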

Shape flexibility

A QPD with $n$ terms, where $n \geq 2$, has $n - 2$ shape parameters. Thus, QPDs can be far more flexible than the Pearson distributions, which have at most two shape parameters. For example, ten-term metalog distributions parameterized by 105 CDF points from 30 traditional source distributions (including normal, Student's t, lognormal, gamma, beta, and extreme value) have been shown to approximate each such source distribution within a Kolmogorov–Smirnov (K–S) distance of 0.001 or less.[7]

Transformations

QPD transformations are governed by a general property of quantile functions: for any quantile function $x = Q(y)$ and increasing function $t(x)$, $x = t^{-1}(Q(y))$ is a quantile function.[8] For example, the quantile function of the normal distribution, $x = \mu + \sigma \Phi^{-1}(y)$, is a QPD by the Keelin and Powley definition. The natural logarithm, $t(x) = \ln(x - b_l)$, is an increasing function, so $x = b_l + e^{\mu + \sigma \Phi^{-1}(y)}$ is the quantile function of the lognormal distribution with lower bound $b_l$. Importantly, this transformation converts an unbounded QPD into a semi-bounded QPD. Similarly, applying this log transformation to the unbounded metalog distribution[9] yields the semi-bounded (log) metalog distribution;[10] likewise, applying the logit transformation, $t(x) = \ln((x - b_l)/(b_u - x))$, yields the bounded (logit) metalog distribution[10] with lower and upper bounds $b_l$ and $b_u$, respectively. Moreover, when $t(x)$ is taken to be distributed according to $Q(y)$, where $Q(y)$ is any QPD that meets Keelin and Powley’s definition, the transformed variable maintains the above properties of feasibility, convexity, and fitting to data. Such transformed QPDs have greater shape flexibility than the underlying $Q(y)$, which has $n - 2$ shape parameters; the log transformation has $n - 1$ shape parameters, and the logit transformation has $n$ shape parameters. Moreover, such transformed QPDs share the same set of feasible coefficients as the underlying untransformed QPD.[11]
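
The log and logit transformations can be illustrated with the normal quantile function as the underlying QPD. The sketch below inverts each transformation in closed form: the log case gives a lower bound plus the exponential of Q(y), and the logit case gives a weighted combination of the two bounds. Function names, parameter names, and numerical values are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_quantile(mu, sigma, y):
    """Underlying unbounded QPD: Q(y) = mu + sigma * inverse normal CDF."""
    return mu + sigma * STD_NORMAL.inv_cdf(y)


def log_transformed_quantile(mu, sigma, lower, y):
    """Invert t(x) = ln(x - lower): x = lower + exp(Q(y)). Semi-bounded."""
    return lower + math.exp(normal_quantile(mu, sigma, y))


def logit_transformed_quantile(mu, sigma, lower, upper, y):
    """Invert t(x) = ln((x - lower)/(upper - x)). Bounded on (lower, upper)."""
    e = math.exp(normal_quantile(mu, sigma, y))
    return (lower + upper * e) / (1.0 + e)


grid = [k / 10 for k in range(1, 10)]

# Lognormal with lower bound 2 (illustrative parameters).
log_values = [log_transformed_quantile(1.0, 0.5, 2.0, y) for y in grid]

# Bounded distribution on (0, 100) (illustrative parameters).
logit_values = [logit_transformed_quantile(0.0, 1.0, 0.0, 100.0, y) for y in grid]
```

Both transformed quantile functions remain increasing in y, and every value respects the stated bounds, which is what makes the transformed functions valid quantile functions.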


Moments

The $k$th moment of a QPD is:[4]

$$E[x^k] = \int_0^1 \left( \sum_{i=1}^{n} a_i g_i(y) \right)^k dy$$

Whether such moments exist in closed form depends on the choice of QPD basis functions $g_i(y)$. The unbounded metalog distribution and polynomial QPDs are examples of QPDs for which moments exist in closed form as functions of the coefficients $a_i$.
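
As a hedged illustration, the moment integral can be approximated numerically and compared with a closed-form mean. The sketch below assumes the Simple Q-Normal basis. Its closed-form mean, a1 + a3/(2√π) + a4/2, is derived here from the identities ∫Φ⁻¹(y)dy = 0 and ∫yΦ⁻¹(y)dy = 1/(2√π) over (0, 1); it is not quoted from the cited sources. The midpoint rule and the cell count are arbitrary choices.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def quantile(a, y):
    """Q(y) for the assumed Simple Q-Normal basis [1, z, y*z, y]."""
    z = STD_NORMAL.inv_cdf(y)
    return a[0] + a[1] * z + a[2] * y * z + a[3] * y


def numeric_moment(a, k, num_cells=20000):
    """Approximate the integral of Q(y)**k over (0, 1) with the midpoint rule."""
    h = 1.0 / num_cells
    return h * sum(quantile(a, (j + 0.5) * h) ** k for j in range(num_cells))


def closed_form_mean(a):
    """Mean implied by integrating each basis function over (0, 1)."""
    return a[0] + a[2] / (2.0 * math.sqrt(math.pi)) + a[3] / 2.0


a = [10.0, 2.0, 0.5, 1.0]  # illustrative coefficients
mean_numeric = numeric_moment(a, 1)
mean_closed = closed_form_mean(a)
second_moment = numeric_moment(a, 2)
variance_numeric = second_moment - mean_numeric ** 2
```

The two mean values agree to several decimal places. Higher moments follow from the same routine by raising k, although the midpoint rule converges slowly near the endpoints, where the quantile function is unbounded.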

Simulation

Since the quantile function $x = Q(y)$ is expressed in closed form, Keelin and Powley QPDs facilitate Monte Carlo simulation. Substituting uniformly distributed random samples of $y$ produces random samples of $x$ in closed form, thereby eliminating the need to invert a CDF expressed as $y = F(x)$.
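
A minimal inverse-transform sampler shows how little machinery this requires. The sketch assumes the Simple Q-Normal basis; the function names, seed, and sample size are illustrative. Draws of exactly zero are skipped because the inverse normal CDF is undefined there.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def quantile(a, y):
    """Q(y) for the assumed Simple Q-Normal basis [1, z, y*z, y]."""
    z = STD_NORMAL.inv_cdf(y)
    return a[0] + a[1] * z + a[2] * y * z + a[3] * y


def sample_qpd(a, size, seed=0):
    """Draw samples by substituting uniform random y into Q(y)."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < size:
        y = rng.random()  # uniform on [0, 1)
        if y == 0.0:
            continue  # Q(y) is evaluated only for 0 < y < 1
        samples.append(quantile(a, y))
    return samples


# Illustrative coefficients: a normal distribution with mean 10, std 2.
draws = sample_qpd([10.0, 2.0, 0.0, 0.0], size=20000, seed=42)
sample_mean = fmean(draws)
```

For these coefficients the sample mean should fall close to 10, since the coefficients describe a normal distribution with that mean.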

Related distributions

The following probability distributions are QPDs according to Keelin and Powley’s definition:

Like the symmetric-percentile triplet (SPT) metalog distributions, the Johnson Quantile-Parameterized Distributions[14][15] (JQPDs) are parameterized by three quantiles. JQPDs do not meet Keelin and Powley’s QPD definition; they have their own properties. JQPDs are feasible for all SPT parameter sets that are consistent with the rules of probability.

Applications

QPDs were originally applied by decision analysts who wished to convert expert-assessed quantiles (e.g., the 10th, 50th, and 90th percentiles) into smooth continuous probability distributions. QPDs have also been used to fit output data from simulations in order to represent those outputs (both CDFs and PDFs) as closed-form continuous distributions.[16] Used in this way, they are typically more stable and smoother than histograms. Similarly, since QPDs can impose fewer shape constraints than traditional distributions, they have been used to fit a wide range of empirical data in order to represent those data sets as continuous distributions (e.g., reflecting bimodality that may exist in the data in a straightforward manner[17]).

Quantile parameterization enables a closed-form QPD representation of known distributions whose CDFs otherwise have no closed-form expression. Keelin et al. (2019)[18] apply this to the sum of independent, identically distributed lognormal random variables, where quantiles of the sum can be determined by a large number of simulations. Nine such quantiles are used to parameterize a semi-bounded metalog distribution that runs through each of these nine quantiles exactly. QPDs have also been applied to assess the risks of asteroid impact,[19] cybersecurity,[6][20] biases in projections of oil-field production when compared to observed production after the fact,[21] and future Canadian population projections based on combining the probabilistic views of multiple experts.[22] See metalog distributions and Keelin (2016)[5] for additional applications of the metalog distribution.


References

  1. ^ Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions, Vol 1, Second Edition, John Wiley & Sons, Ltd, 1994, pp. 15–25.
  2. ^ Johnson, N. L. (1949). "Systems of Frequency Curves Generated by Methods of Translation". Biometrika. 36 (1/2): 149–176. doi:10.2307/2332539. JSTOR 2332539. PMID 18132090.
  3. ^ Tadikamalla, Pandu R.; Johnson, Norman L. (1982). "Systems of Frequency Curves Generated by Transformations of Logistic Variables". Biometrika. 69 (2): 461–465. doi:10.1093/biomet/69.2.461. JSTOR 2335422.
  4. ^ a b c d e f g Keelin, Thomas W.; Powley, Bradford W. (2011). "Quantile-Parameterized Distributions". Decision Analysis. 8 (3): 206–219. doi:10.1287/deca.1110.0213.
  5. ^ a b Keelin, Thomas W. (2016). "The Metalog Distributions". Decision Analysis. 13 (4): 243–277. doi:10.1287/deca.2016.0338.
  6. ^ a b Faber, Isaac Justin; Paté-Cornell, M. Elisabeth; Lin, Herbert; Shachter, Ross D. (2019). Cyber risk management: AI-generated warnings of threats (Thesis). Stanford University.
  7. ^ Keelin, Thomas W. (2016). "The Metalog Distributions". Decision Analysis. 13 (4). Table 8. doi:10.1287/deca.2016.0338.
  8. ^ Gilchrist, W., 2000. Statistical modelling with quantile functions. CRC Press.
  9. ^ Keelin, Thomas W. (2016). "The Metalog Distributions". Decision Analysis. 13 (4). Section 3, pp. 249–257. doi:10.1287/deca.2016.0338.
  10. ^ a b Keelin, Thomas W. (2016). "The Metalog Distributions". Decision Analysis. 13 (4). Section 4. doi:10.1287/deca.2016.0338.
  11. ^ Powley, B.W. (2013). “Quantile Function Methods For Decision Analysis”. Corollary 12, p 30. PhD Dissertation, Stanford University
  12. ^ Keelin, Thomas W.; Powley, Bradford W. (2011). "Quantile-Parameterized Distributions". Decision Analysis. 8 (3). pp. 208–210. doi:10.1287/deca.1110.0213.
  13. ^ Keelin, Thomas W. (2016). "The Metalog Distributions". Decision Analysis. 13 (4): 253. doi:10.1287/deca.2016.0338.
  14. ^ Hadlock, Christopher C.; Bickel, J. Eric (2017). "Johnson Quantile-Parameterized Distributions". Decision Analysis. 14: 35–64. doi:10.1287/deca.2016.0343.
  15. ^ Hadlock, Christopher C.; Bickel, J. Eric (2019). "The Generalized Johnson Quantile-Parameterized Distribution System". Decision Analysis. 16: 67–85. doi:10.1287/deca.2018.0376. S2CID 159339224.
  16. ^ Keelin, T.W. (2016), Section 6.2.2, pp. 271–274.
  17. ^ Keelin, T.W. (2016), Section 6.1.1, Figure 10, pp. 266–267.
  18. ^ Mustafee, N. (18 May 2020). The metalog distributions and extremely accurate sums of lognormals in closed form. pp. 3074–3085. ISBN 9781728132839.
  19. ^ Reinhardt, Jason C.; Chen, Xi; Liu, Wenhao; Manchev, Petar; Paté-Cornell, M. Elisabeth (2016). "Asteroid Risk Assessment: A Probabilistic Approach". Risk Analysis. 36 (2): 244–261. Bibcode:2016RiskA..36..244R. doi:10.1111/risa.12453. PMID 26215051. S2CID 23308354.
  20. ^ Wang, Jiali; Neil, Martin; Fenton, Norman (2020). "A Bayesian network approach for cybersecurity risk assessment implementing and extending the FAIR model". Computers & Security. 89: 101659. doi:10.1016/j.cose.2019.101659. S2CID 209099797.
  21. ^ Bratvold, Reidar B.; Mohus, Erlend; Petutschnig, David; Bickel, Eric (2020). "Production Forecasting: Optimistic and Overconfident—Over and over Again". Spe Reservoir Evaluation & Engineering. 23 (3): 0799–0810. doi:10.2118/195914-PA. S2CID 219661316.
  22. ^ Developments in Demographic Forecasting (PDF). The Springer Series on Demographic Methods and Population Analysis. Vol. 49. 2020. pp. 43–62. doi:10.1007/978-3-030-42472-5. hdl:20.500.12657/42565. ISBN 978-3-030-42471-8. S2CID 226615299.