User:Rb88guy/sandbox

Introduction

THIS IS A ROUGH DRAFT, NEEDS A LOT OF WORK

Helmert's distribution of s_n

The distribution of the sample standard deviation s_n was derived by Helmert^[1] and is given by

s_{n}\,\,\sim \,\,\,{{n^{{n-1} \over 2}} \over {2^{{n-3} \over 2}\,\,\Gamma \left({{n-1} \over 2}\right)\,\,\sigma ^{n-1}}}\,\,\,\,s_{n}^{n-2}\,\,\exp \left[{{-n\,s_{n}^{2}} \over {2\,\sigma ^{2}}}\right]

where n is the size of a sample drawn from a normally and independently distributed (NID) population whose true standard deviation is σ. The statistic s_n is found using

s_{n}\,\,\,=\,\,\,{\sqrt {{\,\sum \limits _{i\,=\,1}^{n}{\left({x_{i}-\,\,{\bar {x}}}\right)^{2}}} \over n}}

as opposed to the statistic s_n−1, in which the divisor under the square root is n−1. It can be shown^[2] that the expected value (mean) of this distribution is

{\rm {E}}\left[{s_{n}}\right]\,\,\,=\,\,\,\sigma \,\,\left\{{{\sqrt {{2\pi } \over n}}\,\,{1 \over {B\left({{{n-1} \over 2},{1 \over 2}}\right)}}}\right\}

where B( ) is the beta function. Using an identity for the beta and gamma functions^[3]

B\left({z,w}\right)\,\,\,=\,\,\,{{\Gamma \left(z\right)\,\,\Gamma \left(w\right)} \over {\Gamma \left({z+w}\right)}}

it follows that

{\rm {E}}\left[{s_{n}}\right]\,\,\,=\,\,\,\sigma \,\,{\sqrt {\,{2 \over n}}}\,\,\,{{\Gamma \left({n \over 2}\right)} \over {\Gamma \left({{n-1} \over 2}\right)}}\,\,\,\,\,=\,\,\,\sigma \,c_{2}
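As a numerical check of the step from the beta-function form to the gamma-function form, the following minimal Python sketch evaluates c₂ both ways. The function names c2_gamma and c2_beta are illustrative and do not come from the cited sources. Two implementation choices are assumed here: math.lgamma is used instead of math.gamma so that large n does not overflow, and the beta function is built from the gamma identity above because the Python standard library has no beta function.

```python
import math

def c2_gamma(n):
    """c2 = sqrt(2/n) * Gamma(n/2) / Gamma((n-1)/2), for n >= 2."""
    # Work with log-gamma so that large n does not overflow.
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2 / n) * math.exp(log_ratio)

def c2_beta(n):
    """c2 = sqrt(2*pi/n) / B((n-1)/2, 1/2), with B built from gamma functions."""
    z, w = (n - 1) / 2, 0.5
    log_beta = math.lgamma(z) + math.lgamma(w) - math.lgamma(z + w)
    return math.sqrt(2 * math.pi / n) * math.exp(-log_beta)

for n in (2, 5, 10, 25):
    print(n, c2_gamma(n), c2_beta(n))
```

For n = 2 both forms reduce to 1/√π, which gives a convenient hand check.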

The symbol c₂ is used in quality control^[4]. More generally, the r-th moment of this PDF can be found using^[5]

{\rm {E}}\left[{s_{n}^{\,\,r}}\right]\,\,\,=\,\,\,\,\sigma ^{r}\,\,\left({2 \over n}\right)^{r \over 2}\,\,{{\,\Gamma \left({{n+\,\,r-1} \over 2}\right)} \over {\Gamma \left({{n-1} \over 2}\right)}}
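The moment formula can be exercised directly. The sketch below (the function name moment and the values n = 5, σ = 2 are assumptions chosen for illustration) computes the first two moments and the variance of s_n. For r = 2 the gamma ratio equals (n−1)/2, so the second moment should reduce to σ²(n−1)/n.

```python
import math

def moment(n, r, sigma=1.0):
    """E[s_n^r] = sigma^r * (2/n)^(r/2) * Gamma((n+r-1)/2) / Gamma((n-1)/2)."""
    log_ratio = math.lgamma((n + r - 1) / 2) - math.lgamma((n - 1) / 2)
    return sigma**r * (2 / n) ** (r / 2) * math.exp(log_ratio)

n, sigma = 5, 2.0                    # illustrative values
mean = moment(n, 1, sigma)           # equals sigma * c2
second = moment(n, 2, sigma)         # equals sigma^2 * (n - 1) / n
variance = second - mean**2          # variance of s_n
print(mean, second, variance)
```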

Using series expansions, it can be shown that an approximate value for c₂ can be obtained from^[6]

c_{2}\approx \,\,\,1\,\,\,-\,\,\,{3 \over {4\,n}}\,\,\,-\,\,\,{7 \over {32\,n^{2}}}\,\,\,-\,\,\,\cdots
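To see how quickly the truncated series approaches the exact value, this sketch tabulates both for a few sample sizes. The function names and the chosen values of n are illustrative assumptions; only the two correction terms shown above are used.

```python
import math

def c2_exact(n):
    """Exact c2 from the gamma-function expression."""
    return math.sqrt(2 / n) * math.exp(math.lgamma(n / 2) - math.lgamma((n - 1) / 2))

def c2_series(n):
    """Series for c2 truncated after the 1/n^2 term."""
    return 1 - 3 / (4 * n) - 7 / (32 * n**2)

for n in (2, 5, 10, 25, 100):
    exact, approx = c2_exact(n), c2_series(n)
    print(f"n={n:4d}  exact={exact:.6f}  series={approx:.6f}  diff={approx - exact:+.2e}")
```

The difference shrinks rapidly with n, so the truncated series is mainly a concern for very small samples.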

Distribution of normalized s_n

It is useful to have the PDF of the ratio of s_n to σ so that plots, for example, will be scale-independent. This amounts to a simple change of variable in the Helmert distribution. Since σ is a constant, it is straightforward to show that^[7]

{{s_{n}} \over \sigma }\,\,\,\sim \,\,\,{{n^{{n\,-\,1} \over 2}} \over {2^{{n\,-\,3} \over 2}\,\,\,\Gamma \left({{n-1} \over 2}\right)}}\,\,\,\left({{s_{n}} \over \sigma }\right)^{n-2}\exp \left[{-\,\,{n \over 2}\left({{s_{n}} \over \sigma }\right)^{2}}\right]

and the expected value (mean) of this PDF is

{\rm {E}}\left[{{s_{n}} \over \sigma }\right]\,\,\,=\,\,\,c_{2}
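A quick way to check the normalized PDF is to integrate it numerically: the area should be 1 and the mean should equal c₂. The sketch below uses a plain midpoint rule. The helper names, the step count, and the upper cutoff of 6 are assumptions made for this illustration, not part of the cited derivation.

```python
import math

def helmert_normalized_pdf(x, n):
    """PDF of x = s_n / sigma for sample size n >= 2, evaluated at x > 0."""
    log_pdf = (
        (n - 1) / 2 * math.log(n)
        - (n - 3) / 2 * math.log(2)
        - math.lgamma((n - 1) / 2)
        + (n - 2) * math.log(x)
        - n * x * x / 2
    )
    return math.exp(log_pdf)

def midpoint_integral(f, a, b, steps=20000):
    """Midpoint rule; f is never evaluated at the endpoints, so x = 0 is avoided."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

n = 5
upper = 6.0  # assumed cutoff; the PDF is negligible beyond it for n >= 2
area = midpoint_integral(lambda x: helmert_normalized_pdf(x, n), 0.0, upper)
mean = midpoint_integral(lambda x: x * helmert_normalized_pdf(x, n), 0.0, upper)
c2 = math.sqrt(2 / n) * math.exp(math.lgamma(n / 2) - math.lgamma((n - 1) / 2))
print(area, mean, c2)
```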

To illustrate this PDF, consider Figure 1 (the figures are in a gallery at the bottom of the article), which shows the Helmert PDF (solid line) and a histogram of 10,000 sampled s_n values, both normalized to the known standard deviation of the NID population. The vertical dashed line, just visible next to the solid line that marks c₂, shows the observed mean of these s_n values. (The circles plotted on this figure are addressed below.) The histogram agrees well with the PDF, and the observed mean agrees well with c₂.
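A simulation of the kind described for Figure 1 can be sketched with the Python standard library alone. This is not the code used to produce the figure; the sample size n = 5, the population mean 10, σ = 3, and the seed are all assumed values for illustration.

```python
import math
import random

def c2(n):
    """Exact c2 from the gamma-function expression."""
    return math.sqrt(2 / n) * math.exp(math.lgamma(n / 2) - math.lgamma((n - 1) / 2))

def simulate_normalized_sn(n, sigma, trials, rng):
    """Return `trials` values of s_n / sigma, each computed from n normal draws."""
    ratios = []
    for _ in range(trials):
        sample = [rng.gauss(10.0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s_n = math.sqrt(sum((x - xbar) ** 2 for x in sample) / n)  # divisor n
        ratios.append(s_n / sigma)
    return ratios

rng = random.Random(12345)           # fixed seed, chosen arbitrarily
n, sigma = 5, 3.0                    # assumed values for illustration
ratios = simulate_normalized_sn(n, sigma, trials=10000, rng=rng)
observed_mean = sum(ratios) / len(ratios)
print("observed mean:", observed_mean)
print("c2:           ", c2(n))
```

With 10,000 trials the observed mean should typically fall within a few thousandths of c₂; a histogram of ratios could then be compared with the normalized PDF.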

Figure 2 shows the behavior of the PDF of the normalized s_n as the sample size increases. The c₂ values, which are the means of the respective PDFs, are indicated. (The c₂ for n=2 is the leftmost thin vertical line.)

Distribution of normalized s_n−1

Since

s_{n-1}^{\,2}=\,\,\,s_{n}^{2}\left({n \over {n-1}}\right)\,\,\,\,\,\,\,\Rightarrow \,\,\,\,\,s_{n-1}=\,\,s_{n}{\sqrt {\,{n \over {n-1}}}}

then

{{s_{n-1}} \over \sigma }\,\,\,=\,\,\,s_{n}\,\,\left({{1 \over \sigma }\,\,{\sqrt {\,{n \over {n-1}}}}}\right)

and everything in the parentheses is a constant. Returning to the Helmert PDF and again using the change-of-variable calculations, the result is

{{s_{n-1}} \over \sigma }\,\,\,\sim \,\,\,{{\left({n-1}\right)^{{n\,-\,1} \over 2}} \over {2^{{n\,-\,3} \over 2}\,\,\Gamma \left({{n-1} \over 2}\right)}}\,\,\,\left({{s_{n-1}} \over \sigma }\right)^{n-2}\exp \left[{-\,\,{{n-1} \over 2}\left({{s_{n-1}} \over \sigma }\right)^{2}}\right]

The expected value is^[8]

{\rm {E}}\left[{s_{n-1}}\right]\,\,\,=\,\,\,\sigma \,\,{\sqrt {\,{2 \over {n\,\,-1}}}}\,\,\,{{\Gamma \left({n \over 2}\right)} \over {\Gamma \left({{n-1} \over 2}\right)}}\,\,\,=\,\,\,\sigma \,c_{4}\,\,\,\,\,\,\,\,\Rightarrow \,\,\,\,\,\,\,\,{\rm {E}}\left[{{s_{n-1}} \over \sigma }\right]\,\,\,=\,\,c_{4}

where c₄ is likewise a symbol used in statistical quality control; its series approximation is

c_{4}\,\,\approx \,\,\,1\,\,\,-\,\,\,{1 \over {4\,n}}\,\,\,\,-\,\,\,\,{7 \over {32\,n^{2}}}\,\,\,\,-\,\,\,\,\cdots
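The same kind of comparison can be made for c₄. In this sketch (function names and sample sizes are illustrative assumptions) the exact value comes from the gamma-function expression and the approximation from the two correction terms shown above.

```python
import math

def c4_exact(n):
    """c4 = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), for n >= 2."""
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2 / (n - 1)) * math.exp(log_ratio)

def c4_series(n):
    """Series for c4 truncated after the 1/n^2 term."""
    return 1 - 1 / (4 * n) - 7 / (32 * n**2)

for n in (2, 5, 10, 25, 100):
    exact, approx = c4_exact(n), c4_series(n)
    print(f"n={n:4d}  exact={exact:.6f}  series={approx:.6f}  diff={approx - exact:+.2e}")
```

For n = 2 the exact expression reduces to √(2/π), which is a convenient hand check.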

Simulation results for the s_n−1 case are shown in Figures 3 and 4.

Relation of Helmert to Chi distribution

The Chi PDF ^[9] is

\chi \,\,\,\sim \,\,\,{1 \over {2^{{k \over 2}\,\,-\,\,1}\,\,\,\Gamma \left({k \over 2}\right)}}\,\,\,\chi ^{k\,-\,1}\,\,\,\exp \left[{{-\chi ^{2}} \over 2}\right]

where k is the number of degrees of freedom. Taking k = n − 1, making the substitution

\chi \,\,\,=\,\,\,{\sqrt {n}}\,\,{{s_{n}} \over \sigma }

and using the change-of-variable calculations once again,

{{s_{n}} \over \sigma }\,\,\,\sim \,\,\,{1 \over {2^{{n\,-\,3} \over 2}\,\,\Gamma \left({{n-1} \over 2}\right)}}\,\,\,\left({{{\sqrt {n}}\,s_{n}} \over \sigma }\right)^{n-2}\exp \left[{-{1 \over 2}\left({{{\sqrt {n}}\,s_{n}} \over \sigma }\right)^{2}}\right]\left({\sqrt {n}}\right)

which reduces to the previously found Helmert PDF for a normalized s_n

{{s_{n}} \over \sigma }\,\,\,\sim \,\,\,{{n^{{n\,-\,1} \over 2}} \over {2^{{n\,-\,3} \over 2}\,\,\,\Gamma \left({{n-1} \over 2}\right)}}\,\,\,\left({{s_{n}} \over \sigma }\right)^{n-2}\exp \left[{-\,\,{n \over 2}\left({{s_{n}} \over \sigma }\right)^{2}}\right]

A similar process for s_n−1, using the substitution

\chi \,\,\,=\,\,\,{\sqrt {n-1}}\,\,\,{{s_{n-1}} \over \sigma }

can be shown to reproduce the Helmert normalized s_n−1 PDF. The circles on the histogram plots in the figures are obtained from these calculations.
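The change of variable can also be checked pointwise. If χ = a·x with a constant, the density of x is the chi density evaluated at a·x, multiplied by a. The sketch below applies this with a = √n for s_n and a = √(n−1) for s_n−1, taking k = n − 1 in both cases, and compares the result with the two normalized Helmert PDFs. Function names and test points are illustrative assumptions.

```python
import math

def chi_pdf(x, k):
    """Chi PDF with k degrees of freedom, for x > 0."""
    log_pdf = ((k - 1) * math.log(x) - x * x / 2
               - (k / 2 - 1) * math.log(2) - math.lgamma(k / 2))
    return math.exp(log_pdf)

def helmert_pdf(x, n, m):
    """Normalized Helmert PDF at x > 0; m = n for s_n/sigma, m = n - 1 for s_(n-1)/sigma."""
    log_pdf = ((n - 1) / 2 * math.log(m) - (n - 3) / 2 * math.log(2)
               - math.lgamma((n - 1) / 2) + (n - 2) * math.log(x) - m * x * x / 2)
    return math.exp(log_pdf)

def chi_transformed(x, n, m):
    """Density of x when chi = sqrt(m) * x follows a chi PDF with n - 1 degrees of freedom."""
    a = math.sqrt(m)
    return chi_pdf(a * x, n - 1) * a   # the factor a is the Jacobian of the substitution

n = 6
for x in (0.3, 0.8, 1.0, 1.5):
    print(x,
          helmert_pdf(x, n, n), chi_transformed(x, n, n),
          helmert_pdf(x, n, n - 1), chi_transformed(x, n, n - 1))
```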

Summary

The bias-correction constants are defined as

c_{2}\,\equiv \,\,\,{\sqrt {\,{2 \over n}}}\,\,\,{{\Gamma \left({n \over 2}\right)} \over {\Gamma \left({{n-1} \over 2}\right)}}\,\,\,\,\,\,\,\,\,\,\,\,\,\,c_{4}\,\,\,\,\equiv \,\,\,{\sqrt {\,{2 \over {n-1}}}}\,\,\,{{\Gamma \left({n \over 2}\right)} \over {\Gamma \left({{n-1} \over 2}\right)}}

so that

c_{2}\,\,=\,\,{\sqrt {\,{{n-1} \over n}}}\,\,\,c_{4}

While the series approximations

c_{2}\approx \,\,\,1\,\,\,-\,\,\,{3 \over {4\,n}}\,\,\,-\,\,\,{7 \over {32\,n^{2}}}\,\,\,-\,\,\,\cdots \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,c_{4}\approx \,\,\,1\,\,\,-\,\,\,{1 \over {4\,n}}\,\,\,-\,\,\,{7 \over {32\,n^{2}}}\,\,\,-\,\,\,\cdots

are useful, modern software should permit the direct calculation of these correction factors, using the gamma functions. Figure 5 shows the behavior of these factors as a function of sample size.

Finally, to obtain an unbiased estimate of the population standard deviation for NID data, use either

{\hat {\sigma }}=\,\,{{s_{n}} \over {c_{2}}}\,\,\,\,\,\,\,\,\,{\rm {or}}\,\,\,\,\,\,\,\,{\hat {\sigma }}=\,\,\,{{s_{n-1}} \over {c_{4}}}
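Because s_n−1 and c₄ both differ from s_n and c₂ by the same factor, √(n/(n−1)), the two estimates are numerically identical. The following sketch illustrates this with made-up data. The measurements and variable names are hypothetical, and the unbiasedness itself still rests on the NID assumption.

```python
import math

def c4(n):
    """c4 = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), for n >= 2."""
    return math.sqrt(2 / (n - 1)) * math.exp(math.lgamma(n / 2) - math.lgamma((n - 1) / 2))

def c2(n):
    """c2 = sqrt((n-1)/n) * c4."""
    return math.sqrt((n - 1) / n) * c4(n)

data = [9.8, 10.4, 10.1, 9.6, 10.3]   # made-up measurements, for illustration only
n = len(data)
xbar = sum(data) / n
sum_sq = sum((x - xbar) ** 2 for x in data)

s_n = math.sqrt(sum_sq / n)            # divisor n
s_n1 = math.sqrt(sum_sq / (n - 1))     # divisor n - 1

sigma_hat_from_sn = s_n / c2(n)
sigma_hat_from_sn1 = s_n1 / c4(n)
print(s_n, s_n1, sigma_hat_from_sn, sigma_hat_from_sn1)
```

Since c₄ is less than 1, the corrected estimate is always somewhat larger than s_n−1, with the largest adjustment at small n.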

Figure gallery

Figure 1: Histogram and PDF, s_n
Figure 2: PDFs vs n for s_n
Figure 3: Histogram and PDF, s_n−1
Figure 4: PDFs vs n for s_n−1
Figure 5: Correction factors vs n

References

[1] Deming, W. E., Some Theory of Sampling, Wiley (1950), p. 495. Also see pp. 495-7 and all of Chapter 15. The table on p. 530 is useful. A more recent reprint of this text is published by Dover (1984), ISBN 048664684X.
[2] Deming, p. 496.
[3] Abramowitz and Stegun, Handbook of Mathematical Functions, NBS Applied Mathematics Series 55 (1964), p. 258, Eq. 6.2.2. This book is available online, free, in electronic form.
[4] For example, Wheeler, D. J., Advanced Topics in Statistical Process Control, SPC Press (1995), ISBN 0-945320-45-0, p. 58.
[5] Lindgren, B. W., Statistical Theory, 3rd Ed., Macmillan (1976), ISBN 0-02-370830-1, p. 340.
[6] Deming, p. 521.
[7] Meyer, S. L., Data Analysis for Scientists and Engineers, Wiley (1975), ISBN 0-471-59995-6, p. 149, Eq. 20.24.
[8] Duncan, A. J., Quality Control and Industrial Statistics, 4th Ed., Irwin (1974), ISBN 0-256-01558-9, p. 139 and Appendix II, Table M.
[9] Johnson and Kotz, Distributions in Statistics: Continuous Univariate Distributions - I, Wiley (1970), ISBN 0-471-44626-2, p. 197.
