Talk:Pearson correlation coefficient/Archive 1

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

in computer software

Why have the "in computer software" section? It doesn't appear in any other statistics article that I'm aware of. Any opinions? Pdbailey (talk) 03:01, 24 January 2008 (UTC)

Agree. Not encyclopedic. The info is straightforward to find using the help system in any half-decent software. And where would it end? Qwfp (talk) 09:41, 28 February 2008 (UTC)

Merge with Correlation

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.

The result was Keep due to no consensus. -- -Sykko-(talk to me) 02:02, 26 September 2008 (UTC)

This topic is covered in Correlation. I move that we merge the two and have this page redirect to Correlation.

I disagree. It's better to cover the details in a separate page. —Preceding unsigned comment added by Sekarnet (talk • contribs) 08:44, 20 August 2008 (UTC)

I disagree, I think it is fine to cover the details of the specific method in a separate page, allowing the 'central' subject article to be clear for non technical people. However, the stuff about linear regression here should probably be removed. Any reason this page isn't listed under rank correlation coefficient? Hmmm... I guess it isn't a 'rank' method. --Dan|^(talk) 08:02, 8 August 2007 (UTC)

I agree with the proposal, merge it with Correlation, it's actually already covered much better in there --mcld (talk) 10:27, 30 May 2008 (UTC)

I disagree. I suggest moving much of the mathematical stuff pertaining to the correlation coefficient to be under "Pearson... ", and that it should be extended with results about the sampling distribution under the joint-normal case. This would leave the "Correlation" article to give a general description and to compare with other measures of dependence. Melcombe (talk) 11:54, 30 May 2008 (UTC)

Keep I came to Wikipedia to learn more about this subject and I am glad that I was able to find this detailed info on its own. I already knew what correlation was and would have been disappointed to be redirected to that article when looking up this one. Although if it is covered better in the correlation article then perhaps some of the info there should be used to improve this article. -Sykko-(talk to me) 23:45, 12 September 2008 (UTC)

Since there was no consensus I am going to remove the tag and close the discussion now -Sykko-(talk to me) 02:02, 26 September 2008 (UTC)

The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Pseudocode

I moved the pseudocode over from correlation and may do so with a few other sections that are specific to Pearson correlation. I'm leaving this note here to point to an extended debate on the Talk:correlation page regarding this section. Skbkekas (talk) 02:42, 5 June 2009 (UTC)

I came to correlation and this article looking for an algorithm, found the single pass pseudocode, and tried it. The wording in its section seems to suggest that it provides good numerical stability and only requires a single pass. While it might have good numerical stability for a single pass implementation, I found it was not sufficiently stable (compared to say, Excel) for a simple analysis of small series of financial returns. Considering the existing debate on the wisdom of including pseudocode at all, I suggest that someone either add two-pass pseudocode for comparison, or emphasize the limited/specific utility of the single-pass pseudocode. Expyram (talk) 15:25, 30 July 2009 (UTC)

You can see code for a stability analysis on my talk page. Given that it is nontrivial to construct an example exhibiting significant inaccuracy (and hence this almost never happens for real-world financial time series), it seems plausible that Expyram wrote a bug. I would suggest benchmarking against the worked examples. Brianboonstra (talk) 21:46, 3 August 2009 (UTC)
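For comparison with the single-pass approach discussed above, here is a minimal two-pass sketch in Python. It is not the pseudocode from the article; the function name and the sample series are made up for illustration. The first pass computes the means and the second pass accumulates the centered sums.

```python
import math

def pearson_two_pass(xs, ys):
    """Two-pass Pearson correlation: means first, then centered sums."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    # Pass 1: the sample means.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Pass 2: sums of squared deviations and of cross-products
    # (written as three separate sums for readability).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical small series of returns, for illustration only.
returns_a = [0.01, -0.02, 0.015, 0.003, -0.007]
returns_b = [0.008, -0.015, 0.02, 0.001, -0.01]
print(pearson_two_pass(returns_a, returns_b))
```

Running a single-pass implementation and this one on the same data is a simple way to check whether a discrepancy comes from the algorithm or from a bug.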

Section merger proposal

I have commented on the correlation article talk page about why I disagree with the section merger proposal for the "sensitivity to the data distribution" section. Skbkekas (talk) 03:58, 2 November 2009 (UTC)

So have I. JamesBWatson (talk) 12:13, 2 November 2009 (UTC)

Question from IP

The derivation that the coefficient of determination is the square of the correlation coefficient depends on the existence of a "constant" term in the regression, i.e. a model like y = mx + b. To see this break down, consider the data points (-1,0) and (0,1). If you run a linear model of the form y = mx, you will have a negative r^2 value [computing as 1 - RSS / TSS] —Preceding unsigned comment added by 66.227.30.35 (talk) 21:02, 3 February 2010 (UTC)

I don't understand. R-squared is only valid if there is a constant in the regression. 0¹⁸ (talk) 22:44, 3 February 2010 (UTC)

There is no claim that the account given is valid for anything other than the model y = mx + b, so the fact that formulas do not apply to other models is not surprising. JamesBWatson (talk) 09:54, 5 February 2010 (UTC)
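The example from the IP can be checked numerically. The sketch below (function name is hypothetical) fits y = mx with no constant term to the points (-1,0) and (0,1) and computes 1 - RSS/TSS. The fitted slope is 0, so RSS = 1, while TSS = 0.5, which gives 1 - 1/0.5 = -1.

```python
def r_squared_through_origin(xs, ys):
    # Least-squares slope for the model y = m*x (no intercept).
    m = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    mean_y = sum(ys) / len(ys)
    # Residual sum of squares around the fitted line.
    rss = sum((y - m * x) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares around the mean of y.
    tss = sum((y - mean_y) ** 2 for y in ys)
    return 1 - rss / tss

print(r_squared_through_origin([-1, 0], [0, 1]))
```

The negative value arises because the no-intercept line fits worse than the horizontal line through the mean of y, which is the baseline that TSS measures.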

single pass code

The section titled, "Computing correlation accurately in a single pass" could probably use more text. Questions to be answered: what is sweep? what is mean_x ? what is delta_x? what is the formula that shows it works? I also wonder, why not show how to do it in three passes first and then show the single pass code? 0¹⁸ (talk) 00:14, 19 April 2010 (UTC)
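As a possible answer to these questions, here is a Python sketch of a Welford-style single-pass update. It reuses the names sweep, mean_x and delta_x from the question, but whether it matches the article's pseudocode line for line is an assumption. Here mean_x is the running mean of the points seen so far, delta_x is the distance of the new point from that running mean, and sweep = (i-1)/i is the weight that makes the update exact: adding the i-th point increases the sum of squared deviations by delta_x^2 (i-1)/i.

```python
import math

def pearson_single_pass(xs, ys):
    """Single-pass Pearson correlation using running means."""
    n = 0
    mean_x = mean_y = 0.0
    sum_sq_x = sum_sq_y = sum_coproduct = 0.0
    for x, y in zip(xs, ys):
        n += 1
        sweep = (n - 1) / n        # weight of the new point's deviation
        delta_x = x - mean_x       # deviation from the mean of earlier points
        delta_y = y - mean_y
        sum_sq_x += delta_x * delta_x * sweep
        sum_sq_y += delta_y * delta_y * sweep
        sum_coproduct += delta_x * delta_y * sweep
        mean_x += delta_x / n      # update the running means last
        mean_y += delta_y / n
    return sum_coproduct / math.sqrt(sum_sq_x * sum_sq_y)

print(pearson_single_pass([1, 2, 3], [1, 3, 2]))
```

The normalizing factors (n or n-1) cancel in the ratio, so the three accumulated sums can be used directly.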

reflective correlation

It would be helpful if there were references for the "reflective correlation" section. I have this formula in some code and I'm trying to track down its origin, but the phrase "reflective correlation" does not seem to be common when searching on the *net, except in links back to this page. 75.194.255.109 (talk) 20:14, 9 September 2010 (UTC)

You could try searching for "uncentered correlation": this does bring up some uses of these formulae. And this may be a better name to use for the idea here. Melcombe (talk) 08:48, 10 September 2010 (UTC)

Pearson's Correlation Coefficient is Biased

I believe the statement that Pearson's Correlation Coefficient is "asymptotically unbiased" is incorrect, or at least, non-ideal phrasing. The estimator is biased even for bivariate normal data. In the normal case, Fisher showed an approximate solution for the expectation of the sample estimate of correlation is E[r] = ρ − ρ(1 − ρ²)/(2n). Check out the article http://www.uv.es/revispsi/articulos1.03/9.ZUMBO.pdf. This bias goes to zero as n goes to infinity, so perhaps the term "consistent" was meant by "asymptotically unbiased". MrYdobon (talk) 13:47, 24 October 2010 (UTC)

"Asymptotically unbiased" means that the bias goes to zero as n goes to infinity. Consistency is something else entirely. Skbkekas (talk) 17:05, 24 October 2010 (UTC)

No it isn't, it's closely related: An estimator that is asymptotically unbiased is consistent if its variance decreases to zero as the sample size tends to infinity, which in practice it nearly always does. See Talk:Consistent estimator#related to other concepts Qwfp (talk) 20:11, 24 October 2010 (UTC)

Being "asymptotically unbiased" means that the (E[r_n] - ρ)/s_n goes to zero, where s_n is the standard deviation of r_n -- I left out the s_n term before. By linearization, it's easy to see that the variance of r_n decays like 1/n, thus s_n is proportional to n^-1/2, so "asymptotically unbiased" means that n^1/2(E[r_n] - ρ) goes to zero. Also by linearization, it's easy to see that n^b(E[r_n]−ρ) goes to zero, for b<1. Thus the correlation coefficient is asymptotically unbiased. The only condition needed here is that for the linearization arguments to go through, you need some finite moments, four of them I think. Skbkekas (talk) 22:24, 24 October 2010 (UTC)
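To put numbers on the point being argued, the sketch below evaluates the approximation quoted earlier in this thread, E[r] ≈ ρ − ρ(1 − ρ²)/(2n), for a few sample sizes. The function name is hypothetical and the formula is only the approximation cited above, not an exact result. It prints the approximate bias and the bias multiplied by √n, which is the quantity that must go to zero under the definition given in the last comment.

```python
import math

def approx_expected_r(rho, n):
    # Approximate expectation of the sample correlation for
    # bivariate normal data, as quoted in the thread above.
    return rho - rho * (1 - rho ** 2) / (2 * n)

rho = 0.5
for n in (10, 100, 1000, 10000):
    bias = approx_expected_r(rho, n) - rho
    # Under this approximation the bias shrinks like 1/n,
    # so sqrt(n) * bias shrinks like 1/sqrt(n).
    print(n, bias, math.sqrt(n) * bias)
```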

Correlation coefficient and linear transformations

In the section Removing correlation it states "It is always possible to remove the correlation between random variables with a linear transformation" however, in Mathematical properties it is stated that the PPMCC "is invariant to changes in location and scale. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants, without changing the correlation coefficient". These two statements seem contradictory to me, in that we're saying simultaneously that for any linear transformation the correlation coefficient remains unchanged, but there still exists a particular linear transformation which removes correlation. The only way I can make sense of this is if the linear transformation results in a dataset with no correlation, but with a non-zero Pearson product-moment correlation coefficient. - Clarification in the text of this apparent discrepancy would be appreciated. -- 140.142.20.229 (talk) 23:01, 15 December 2010 (UTC)

A clarification has been attempted. I would prefer not to have the Mathematical properties section expanded beyond this as it would be off-topic for that section, but something more might be added to Removing correlation if necessary. But remember this is not a text-book. Melcombe (talk) 09:56, 16 December 2010 (UTC)

I don't know what clarification was attempted, but I came here to comment on the exact same issue as 140.142.20.229: the two statements still appear contradictory (at least to me).

My understanding is admittedly naive, but if the geometric interpretation of the correlation coefficient is the cosine of an angle between the vectors corresponding to each set of centered data, how could a full-blown linear transformation ever hope to preserve the correlation coefficient in general? Yes, translations are fine, but surely a change in slope would change the coefficient? --Saforrest (talk) 08:04, 3 February 2011 (UTC)

My understanding is that the correlation is being removed by mixing X and Y. It's a linear transformation on the vector, not just on the individual variables.
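The distinction in the last comment can be illustrated with a small sketch. The data and helper names below are made up. Transforming each variable separately (a + bX and c + dY with b and d of the same sign) leaves r unchanged, whereas a transformation that mixes the two variables, here V = Y − βX with β = cov(X,Y)/var(X), produces a pair (X, V) with zero correlation.

```python
import math

def centered_sums(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy

def pearson(xs, ys):
    sxx, syy, sxy = centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]
r = pearson(xs, ys)

# Separate location/scale changes of each variable: r is unchanged.
r_scaled = pearson([3 + 2 * x for x in xs], [-1 + 0.5 * y for y in ys])

# A transformation that mixes X and Y: V = Y - beta * X.
sxx, syy, sxy = centered_sums(xs, ys)
beta = sxy / sxx
vs = [y - beta * x for x, y in zip(xs, ys)]
r_mixed = pearson(xs, vs)

print(r, r_scaled, r_mixed)
```

Both operations are linear, but only the second acts on the vector (X, Y) as a whole, which is why the two statements in the article do not contradict each other.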

Correct me if I'm wrong

I'm not extremely familiar with the Pearson Correlation Coefficient, but I needed to implement it as a function for a program I was making, so I implemented this equation from the article:

r={\frac {1}{n-1}}\sum _{i=1}^{n}\left({\frac {X_{i}-{\bar {X}}}{s_{X}}}\right)\left({\frac {Y_{i}-{\bar {Y}}}{s_{Y}}}\right)

This seemed pretty straightforward and was easily implemented. I tested it on the dataset x{1,2,3,4,5} y{1,2,3,4,5}, on which it returned r = 1.25. I was surprised since I thought correlations went from -1 to 1, so I calculated r by hand using the above formula, thinking I had made a mistake in my code, and received r = 1.25. After this I adjusted the formula to be the average of the products of the standard scores:

r={\frac {1}{n}}\sum _{i=1}^{n}\left({\frac {X_{i}-{\bar {X}}}{s_{X}}}\right)\left({\frac {Y_{i}-{\bar {Y}}}{s_{Y}}}\right)

Upon which I received the expected r = 1 both by hand and from the program. Can someone confirm if this is the correct formula? --Guruthegreat0 (talk) 00:02, 15 April 2011 (UTC)

The correlation is a ratio involving a covariance and two variances. You can estimate these quantities using the unbiased estimate (normalizing by N-1), or using the maximum likelihood estimate (normalizing by N), see Sample mean and sample covariance. When calculating the correlation coefficient, it doesn't matter which normalization you use, but you should use the same normalization for all three quantities. You are getting r=1.25 because you are using the N-1 normalization for the covariance but the N normalizations for the variances. Skbkekas (talk) 14:27, 15 April 2011 (UTC)
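The explanation above can be reproduced with Python's statistics module, where pstdev divides by N and stdev divides by N-1. This is only a sketch of the three combinations on the dataset from the question; the variable names are made up.

```python
import statistics

xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
n = len(xs)
mean_x = statistics.mean(xs)
mean_y = statistics.mean(ys)
# Sum of products of deviations from the means.
cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Mixed normalization: 1/(n-1) outside, standard deviations using N.
r_mixed = cross / ((n - 1) * statistics.pstdev(xs) * statistics.pstdev(ys))
# Consistent N-1 normalization throughout.
r_sample = cross / ((n - 1) * statistics.stdev(xs) * statistics.stdev(ys))
# Consistent N normalization throughout.
r_pop = cross / (n * statistics.pstdev(xs) * statistics.pstdev(ys))

print(r_mixed, r_sample, r_pop)
```

The mixed version gives 1.25 = 5/4, which is exactly the ratio N/(N-1) for N = 5, while both consistent versions give 1 up to rounding.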

Odd comments

Correlations are rarely if ever 0? Is it really that hard to find things which are completely uncorrelated? —Preceding unsigned comment added by 169.237.24.238 (talk) 19:46, 12 June 2008 (UTC)

Just as hard as getting a random number that is exactly zero. It just never happens. Jmath666 (talk) 06:27, 5 August 2008 (UTC)

Is there a particular reason that the formula in the article is linked as a separate image, rather than using Wikipedia's TeX markup? I noticed lots of experimenting in the article history. -- DrBob 21:50 May 5, 2003 (UTC)

TeX insisted on putting r = in the numerator. That was probably my fault but having spent enough time on it I decided to put in an image till I could figure it out. The image is less than ideal, though, so I will be trying again to write the formula in TeX. Jfitzg

It's easy when you know how, eh? My TeX looked something like that, but obviously not enough. Thanks.Jfitzg

I remember. I had \sum in the wrong place. I was tired. Really.

It's also the curly braces {} that tell TeX how to do the grouping, rather than showing up in the text like normal braces (). They look rather too similar in some fonts, so it can be hard to spot the difference. See the Tex markup article for more examples. -- Anon.

Thanks, anon. I finally got round to converting the others.

Would it be possible to add a label to the diagram Correlation_examples2.svg? The center picture has an undefined correlation coefficient which I think should be labeled in the diagram, not just the description. I found the lack of a label confusing at first. — Preceding unsigned comment added by 149.76.197.101 (talk) 23:48, 6 June 2011 (UTC)

This article could benefit from a little example with some numbers and an actual calculation of r. AxelBoldt 22:26 22 May 2003 (UTC)

Feel free.

Pearson Distance

Is the Pearson Distance really a distance? In the metric sense. Symmetry, positivity and the null property is obvious but does the triangle inequality for sure hold? What we have is that: 1-r(1,2) \leq 2-[r(1,3)+r(2,3)]. That is r(1,2)\geq r(1,3)+r(2,3)-1. (Let the random variables 1,2,3 be just some vectors in R^n, nothing fancy). I don't know how to see this inequality. And by the way, I also don't know how to imagine that two random variables are the same in the general sense (their distributions are the same? Up to some moments?). --78.104.124.124 (talk) 00:16, 11 March 2012 (UTC)

I agree. $1-r(x,y)$ is not a distance. The correlation coefficient is $r(x,y)=\cos(\alpha )$ with $\alpha$ the angle between $x,y$. For $1-r(x,y)$ to be a distance would require $2\cos({\frac {\alpha }{2}})-\cos(\alpha )\leq 1\ \forall \alpha$, which is not true, thus $1-r(x,y)$ is a semi-metric. — Preceding unsigned comment added by 66.233.184.215 (talk) 18:22, 11 June 2012 (UTC)
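A concrete counterexample to the triangle inequality can be built as follows. This is a sketch with made-up helper names. Zero-mean vectors in R^3 lie in the plane orthogonal to (1,1,1), so three such vectors are placed at angles 0, α/2 and α in that plane with α = 60°. The direct distance is 1 − cos 60° = 0.5, while the detour through the middle vector costs 2(1 − cos 30°) ≈ 0.268.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Orthonormal basis of the zero-mean plane in R^3.
s2, s6 = math.sqrt(2), math.sqrt(6)
e1 = (1 / s2, -1 / s2, 0.0)
e2 = (1 / s6, 1 / s6, -2 / s6)

def at_angle(theta):
    # Unit vector in the zero-mean plane at angle theta from e1.
    return tuple(math.cos(theta) * a + math.sin(theta) * b
                 for a, b in zip(e1, e2))

def pearson_distance(u, v):
    return 1 - pearson(u, v)

alpha = math.pi / 3
x, y, z = at_angle(0.0), at_angle(alpha / 2), at_angle(alpha)

direct = pearson_distance(x, z)
detour = pearson_distance(x, y) + pearson_distance(y, z)
print(direct, detour, direct > detour)
```

Since the direct distance exceeds the detour, the triangle inequality fails for this triple.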

"Direction"

Ok so cool you've mentioned "strength" of linear association - but what about direction? This is what the - and + signs indicate 129.180.166.53 (talk) 08:10, 16 June 2012 (UTC)

Pearson's r being "misleading"

It should be included that Pearson's r can sometimes be misleading, and that data should ALWAYS be visualized. For example, a perfect parabolic shape shows r=0. Does this mean there is "no" relationship? No. It just means there is no linear relationship. 129.180.166.53 (talk) 08:53, 16 June 2012 (UTC)
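The parabola example is easy to verify. In the sketch below (data chosen for illustration), y = x² is evaluated on points symmetric about zero, so y is completely determined by x and yet the products of deviations cancel exactly.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]   # y is an exact function of x
r = pearson(xs, ys)
print(r)
```

The cancellation depends on the x values being symmetric about the vertex of the parabola; on an asymmetric range r would generally be nonzero.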

Scatterplot Graph

I am not convinced that the scatterplot in the second column of the third row of the figure indeed has a correlation of zero (this is the one that looks like a tilted rectangle, but not a diamond). I generated a similar scatter plot with the following R code and obtained a Pearson Correlation Coefficient of .32.

x <- NULL
y <- NULL
for(xrep in 1:100) {
  for(yrep in 1:100) {
    x <- c(x, xrep + (100 - yrep))
    y <- c(y, xrep + yrep * .5)
  }
}
plot(x,y)
cor(x,y)

(Sorry, I cannot figure out how to get the Wikipedia editor to treat the entire sequence of code as code. This is the best I could do.)

Kmarkus (talk) —Preceding undated comment added 18:46, 14 July 2012 (UTC)
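The R snippet above can be cross-checked with a direct Python translation (a sketch; the plotting call is omitted). Writing a for xrep and b for yrep, the points are x = a + 100 − b and y = a + b/2 over a full grid, so a and b are uncorrelated with equal variance V. That gives cov(x,y) = V/2, var(x) = 2V and var(y) = 1.25V, hence r = 0.5/√2.5 ≈ 0.316, in line with the reported .32.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

xs, ys = [], []
for xrep in range(1, 101):
    for yrep in range(1, 101):
        xs.append(xrep + (100 - yrep))
        ys.append(xrep + yrep * 0.5)

r = pearson(xs, ys)
closed_form = 0.5 / math.sqrt(2.5)
print(r, closed_form)
```

This confirms the number for the shape generated by this code; whether that shape matches the panel in the article's figure is a separate question.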

Requested move

The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review. No further edits should be made to this section.

The result of the move request was: nomination withdrawn. Favonian (talk) 07:07, 11 August 2012 (UTC)

Pearson product-moment correlation coefficient → Pearson product–moment correlation coefficient –

Per WP:MOSDASH and many authoritative external style guides (even though not always practised by statisticians). Also needs to match sibling article titles. Tony (talk) 00:38, 11 August 2012 (UTC)

Comment – I presume Tony means MOS:DASH. I might support if I understood the intended meaning of this term. Looks like a fair number of articles and books do use the en dash, so it's probably right. But can someone explain the intended relationship between the words product and moment here. Is it a correlation of a product moment (the moment of a product)? That would need a hyphen. Or a correlation between a product and a moment (or between products and moments)? That would need an en dash. Or something else? The article doesn't make this clear, and so far neither do the sources I've looked at. Once we know what it means, deciding the punctuation per MOS:DASH should be easy. Dicklyon (talk) 02:24, 11 August 2012 (UTC)

Dick, it's a correlation; a correlation coefficient at that. Is that alone not a dash context? Tony (talk) 02:30, 11 August 2012 (UTC)

It certainly sounds like it. An A–B correlation would normally get an en dash. But I'm not sure that's what it is here. As I look at it, it appears to be a correlation computed as the first moment of the product of the mean-compensated random variables, in which case the hyphen might be signalling that it's a correlation based on a "product moment", no? Until I know more, I have to reserve judgement. Maybe some statisticians will know better what the intended reading is. If hyphen is right, then you're certainly not among the first ones to think otherwise, as lots of sources do use the en dash. Dicklyon (talk) 02:36, 11 August 2012 (UTC)

Tony, hate to tell you, but not this one. See this article which includes essentially our formula, with explanation "For the covariance we shall take the product moment of the deviations of the x's and y's..." That is, it's a correlation coefficient based on a product moment, not a correlation between a product and a moment; more specifically, it's an "origin moment of the product" as this paper calls it, in which "the first origin moment is the mean" (so the correlation is the mean of the product, as the equation shows). Suggest you withdraw the RM. Dicklyon (talk) 03:30, 11 August 2012 (UTC)

Dick, thanks for that bit of research; I'd have been incapable of it. And it shows the importance of typographical distinctions. Withdrawing RM. (Since the bot is down, I'm flagging this to Favonian.) Tony (talk) 03:33, 11 August 2012 (UTC)

The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page or in a move review. No further edits should be made to this section.

Image

An image has been added to show the x-y axes, values and correlation lines more clearly. The previous image, with its several sets of points, looks like an astrophysics picture: there are no x-y axes, no line, and the lower sets of points can be confusing. The new image is adapted from an image in Applied Statistics and Probability by Montgomery, and similar images appear in most other statistics books; it is a classic illustration. Some books don't show the line, for example Statistics, 3rd edition, by S. Ross. — Preceding unsigned comment added by Kiatdd (talk • contribs) 20:04, 21 October 2012 (UTC)

interpretation

Recently, redundant material has been inserted into this article. There is a section [Interpretation of the size of a correlation] in the article, with references. It is not necessary, as in recent edits, to insert another section about this at the end of the article. Similar edits have been reverted at least twice. Please read the full article before editing. Mathstat (talk) 14:41, 31 December 2012 (UTC)

Inference

Could we get some examples of applied papers, preferably recent papers, using inference on a correlation? I'm a little curious in particular about the graph showing the minimum sample size to show a correlation significantly different from zero. I don't recall this ever being emphasized or even noted in papers I've read. It would be nice to know whether it is used much in practice and why or why not. — Preceding unsigned comment added by 193.205.23.67 (talk) 13:13, 1 February 2013 (UTC)

Assuming that the data is bivariate normal, then the transformation

t={\frac {r{\sqrt {n-2}}}{\sqrt {1-r^{2}}}}

can be applied and critical values of $t_{n-2}$ used for the test decision. The graph simply plots the inverse of that:

r={\frac {t}{\sqrt {n-2+t^{2}}}}

as a function of sample size n (where $t=t_{n-2,.05}$ is the critical value for n-2 df at level .05). To generate the plot in R:

r <- function(n, sig=.05) {
  q <- qt(1-sig/2, df=n-2)
  q / sqrt(n-2 + q^2)
}
curve(r(x), from=3, to=200)

Most undergraduate statistics textbooks refer to this correlation test for bivariate normal data. See e.g. Larsen and Marx 4e. In practice the plot would not be used for inference because it is easier and more accurate to refer to t critical values. The plot gives an idea about the power of the test - relating the sample size to the critical value in terms of r. Mathstat (talk) 22:30, 1 February 2013 (UTC)
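The two formulas above are inverses of each other, which can be checked with a short Python sketch (function names are hypothetical). Because the standard library has no t quantile function, the example plugs in the commonly tabulated two-sided 5% critical value for 8 degrees of freedom, 2.306, as an assumed input rather than computing it.

```python
import math

def r_to_t(r, n):
    # Test statistic for H0: rho = 0 with n pairs (n - 2 df).
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def t_to_r(t, n):
    # Inverse map: the correlation that yields a given t.
    return t / math.sqrt(n - 2 + t * t)

# Round trip: r -> t -> r.
print(t_to_r(r_to_t(0.3, 20), 20))

# Smallest |r| that is significant for n = 10, using the assumed
# tabulated critical value t = 2.306 for 8 df (two-sided, level .05).
print(t_to_r(2.306, 10))
```

For n = 10 this gives a threshold of about 0.632, which is one point on the curve that the R code plots.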

Needs Basic Reference

If Pearson really invented this, then there should be a citation to a specific paper in which Pearson invented it. What is the citation? Jfgrcar (talk) 19:34, 9 October 2013 (UTC)

Something doesn't seem right...

... it looks to me like you are using the formula for r as the population correlation coefficient, not the sample coefficient. According to my textbook, the formula is:

r={n(\sum {{X_{i}}{Y_{i}}})-(\sum _{i=1}^{n}{X_{i}})(\sum _{i=1}^{n}{Y_{i}}) \over {\sqrt {[n(\sum _{i=1}^{n}{X_{i}}^{2})-(\sum _{i=1}^{n}{X_{i}})^{2}]\times [n(\sum _{i=1}^{n}{Y_{i}}^{2})-(\sum _{i=1}^{n}{Y_{i}})^{2}]}}}

What am I missing here? - 114.76.235.170 (talk) 14:51, 12 August 2010 (UTC)

Note: have updated the sum of notations. - 114.76.235.170 (talk) 08:19, 15 August 2010 (UTC)

A good first step might be to write them in the same notation. You will also have to rearrange the sums and know that (sum of x) = (average of x) times n. I'm also not sure if you are commenting on the article or asking for help with the algebra. If you want help, the math reference desk is the best place on Wikipedia (a teacher's office hours might be the best if you are taking a class). WP:RD/MA 0¹⁸ (talk) 15:59, 12 August 2010 (UTC)

Not taking a class, was commenting here because I was generally curious about the formula. My understanding is that the coefficient is the sum of the product of the z-scores for the dependent and independent variables, over the number of elements in the sample. However, z-scores use the standard deviation - but there are two formulas for the standard deviation, one using Bessel's correction for the unbiased estimator (sample deviation) and the other doesn't (population deviation). I was wondering where this comes into play, if at all. - 114.76.235.170 (talk) 03:05, 14 August 2010 (UTC)

I see, is there an n versus (n-1) issue? No. All correlations have range [-1,1] and so adjustment like that would necessarily move the range. Another way you can think of it is that the correction term would cancel out because it would appear in the numerator and the denominator. Similarly, I think your textbook's version uses a trick that would have been useful in the 1970s and 1980s that multiplies the numerator and denominator by n to make the computation easier. The older statistical methods are often written down in difficult to understand notations like this to "help the reader" because they were easier to compute/used fewer resources on the machines of the time. Now they are just anachronistic. 0¹⁸ (talk) 16:36, 14 August 2010 (UTC)

Ah, I see... I think :-) - 114.76.235.170 (talk) 08:19, 15 August 2010 (UTC)

OK, so the formula here is:

r={\frac {\sum _{i=1}^{n}(X_{i}-{\bar {X}})(Y_{i}-{\bar {Y}})}{{\sqrt {\sum _{i=1}^{n}(X_{i}-{\bar {X}})^{2}}}\times {\sqrt {\sum _{i=1}^{n}(Y_{i}-{\bar {Y}})^{2}}}}}

Now if you multiply the numerator and the denominator by n then this gives you:

r={\frac {n({\sum _{i=1}^{n}(X_{i}-{\bar {X}})(Y_{i}-{\bar {Y}})})}{n[{\sqrt {\sum _{i=1}^{n}(X_{i}-{\bar {X}})^{2}}}]\times n[{\sqrt {\sum _{i=1}^{n}(Y_{i}-{\bar {Y}})^{2}}}]}}

Then you expand the numerator:

r={\frac {n({\sum _{i=1}^{n}({X_{i}}{Y_{i}}-{\bar {X}}{Y_{i}}-{\bar {Y}}{X_{i}}-{\bar {X}}{\bar {Y}}}))}{n[{\sqrt {\sum _{i=1}^{n}(X_{i}-{\bar {X}})^{2}}}]\times n[{\sqrt {\sum _{i=1}^{n}(Y_{i}-{\bar {Y}})^{2}}}]}}

Which is the same as:

r={\frac {n({\sum _{i=1}^{n}({X_{i}}{Y_{i}}-({\sum _{i=1}^{n}X_{i} \over n})\times {Y_{i}}-({\sum _{i=1}^{n}Y_{i} \over n})\times {X_{i}}-({\sum _{i=1}^{n}X_{i} \over n})({\sum _{i=1}^{n}Y_{i} \over n})}))}{n[{\sqrt {\sum _{i=1}^{n}(X_{i}-{\bar {X}})^{2}}}]\times n[{\sqrt {\sum _{i=1}^{n}(Y_{i}-{\bar {Y}})^{2}}}]}}

Which is:

r={\frac {n({\sum _{i=1}^{n}({X_{i}}{Y_{i}}-({\sum _{i=1}^{n}{X_{i}}{Y_{i}} \over n})-({\sum _{i=1}^{n}{Y_{i}}{X_{i}} \over n})-({\sum _{i=1}^{n}X_{i} \over n})({\sum _{i=1}^{n}Y_{i} \over n})}))}{n[{\sqrt {\sum _{i=1}^{n}(X_{i}-{\bar {X}})^{2}}}]\times n[{\sqrt {\sum _{i=1}^{n}(Y_{i}-{\bar {Y}})^{2}}}]}}

This cancels a term...

r={\frac {n({\sum _{i=1}^{n}({X_{i}}{Y_{i}}-({\sum _{i=1}^{n}X_{i} \over n})({\sum _{i=1}^{n}Y_{i} \over n})}))}{n[{\sqrt {\sum _{i=1}^{n}(X_{i}-{\bar {X}})^{2}}}]\times n[{\sqrt {\sum _{i=1}^{n}(Y_{i}-{\bar {Y}})^{2}}}]}}

Leading to the simplified numerator...

r={\frac {n(\sum _{i=1}^{n}{{X_{i}}{Y_{i}}})-(\sum _{i=1}^{n}{X_{i}})(\sum _{i=1}^{n}{Y_{i}})}{n[{\sqrt {\sum _{i=1}^{n}(X_{i}-{\bar {X}})^{2}}}]\times n[{\sqrt {\sum _{i=1}^{n}(Y_{i}-{\bar {Y}})^{2}}}]}}

So expanding the denominator:

r={\frac {n(\sum _{i=1}^{n}{{X_{i}}{Y_{i}}})-(\sum _{i=1}^{n}{X_{i}})(\sum _{i=1}^{n}{Y_{i}})}{n[{\sqrt {\sum _{i=1}^{n}{X_{i}}^{2}-2(\sum _{i=1}^{n}{X_{i}})({\bar {X}})+{\bar {X}}^{2}}}]\times n[{\sqrt {\sum _{i=1}^{n}{Y_{i}}^{2}-2(\sum _{i=1}^{n}{Y_{i}})({\bar {Y}})+{\bar {Y}}^{2}}}]}}

Now ${\bar {X}}$ is ${\sum _{i=1}^{n}{X_{i}}} \over n$ and ${\bar {Y}}$ is similarly ${\sum _{i=1}^{n}{Y_{i}}} \over n$, therefore the denominator can be rewritten further:

r={\frac {n(\sum _{i=1}^{n}{{X_{i}}{Y_{i}}})-(\sum _{i=1}^{n}{X_{i}})(\sum _{i=1}^{n}{Y_{i}})}{n[{\sqrt {\sum _{i=1}^{n}{X_{i}}^{2}-2(\sum _{i=1}^{n}{X_{i}})({{\sum _{i=1}^{n}{X_{i}}} \over n})+({{\sum _{i=1}^{n}{X_{i}}} \over n})^{2}}}]\times n[{\sqrt {\sum _{i=1}^{n}{Y_{i}}^{2}-2(\sum _{i=1}^{n}{Y_{i}})({{\sum _{i=1}^{n}{Y_{i}}} \over n})+({{\sum _{i=1}^{n}{Y_{i}}} \over n})^{2}}}]}}

114.76.235.170 (talk) 09:14, 15 August 2010 (UTC)

Nope, stuck. Can't work out how they get from here:

{n[{\sqrt {\sum _{i=1}^{n}(X_{i}-{\bar {X}})^{2}}}]\times n[{\sqrt {\sum _{i=1}^{n}(Y_{i}-{\bar {Y}})^{2}}}]}

to here:

{\sqrt {[n(\sum _{i=1}^{n}{X_{i}}^{2})-(\sum _{i=1}^{n}{X_{i}})^{2}]\times [n(\sum _{i=1}^{n}{Y_{i}}^{2})-(\sum _{i=1}^{n}{Y_{i}})^{2}]}}

Any ideas? - 114.76.235.170 (talk) 12:54, 15 August 2010 (UTC)

Hi, this discussion is better suited for WP:RD/MA. However, I will tell you that you need to keep track of your subscripts: when you use the subscript i in the outer and inner sum, you can easily get the wrong answers. so if you have

\sum _{i=0}^{n}x_{i}\sum _{j=0}^{n}y_{j}=\sum _{i=0}^{n}\sum _{j=0}^{n}y_{j}x_{i}\neq \sum _{i=0}^{n}\sum _{i=0}^{n}y_{i}x_{i}=n\sum _{i=0}^{n}y_{i}x_{i}

This is the mistake you make, and then recover from, in the numerator. In the denominator, you use $n(A\times B)=nA\times nB$; that is how addition works, not multiplication. 0¹⁸ (talk) 16:42, 15 August 2010 (UTC)
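For anyone retracing the algebra, the step that was stuck relies on the identity n Σ(X_i − X̄)² = n ΣX_i² − (ΣX_i)². Multiplying the numerator and denominator by n once turns the denominator √A × √B into √(nA × nB), and the identity then gives the textbook form. The Python sketch below (made-up data and function names) checks numerically that the two formulas agree.

```python
import math

def pearson_definitional(xs, ys):
    # Formula in terms of deviations from the means.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def pearson_raw_sums(xs, ys):
    # Textbook formula in terms of raw sums.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx * sx) * (n * syy - sy * sy))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]
print(pearson_definitional(xs, ys), pearson_raw_sums(xs, ys))
```

The raw-sums form is algebraically equivalent but subtracts large, nearly equal quantities, so it can lose precision on data with a large mean relative to its spread.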

Uh, under "Mathematical properties", the two last formulas start from the same, but end up being equal to two different expressions, with (n-1) in one and just n in the other. What is this supposed to mean? 146.107.37.111 (talk) 13:16, 2 December 2013 (UTC)

Questionable statement

The article currently says (under the section Interpretation): For centered data (i.e., data which have been shifted by the sample mean so as to have an average of zero), the correlation coefficient can also be viewed as the cosine of the angle $\theta$ between the two vectors of samples drawn from the two random variables (see below).

There is no citation for this statement. In fact, I believe the statement is not correct. Additionally, the example below it is not particularly helpful (mostly because it deals with the special case of r=1, i.e. data that exactly fits a linear function). Either way, the section should be edited, either by finding a reference for the statement or by deleting this statement and the subsequent example. [this edit made by 140.140.160.95 on 27 Feb. 2015]

I'll add a citation. Loraof (talk) 14:57, 17 June 2015 (UTC)
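Independently of the citation, the statement can be checked numerically on a case other than r = 1. The sketch below (made-up data and function names) computes r from raw sums and compares it with the cosine of the angle between the two sample vectors, once after centering and once without centering.

```python
import math

def pearson_raw_sums(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx * sx) * (n * syy - sy * sy))

def cosine(us, vs):
    # Cosine of the angle between two vectors.
    dot = sum(u * v for u, v in zip(us, vs))
    return dot / math.sqrt(sum(u * u for u in us) * sum(v * v for v in vs))

def centered(vals):
    m = sum(vals) / len(vals)
    return [v - m for v in vals]

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]

r = pearson_raw_sums(xs, ys)
cos_centered = cosine(centered(xs), centered(ys))
cos_uncentered = cosine(xs, ys)
print(r, cos_centered, cos_uncentered)
```

The centered cosine matches r, while the uncentered cosine does not, which shows why the qualification "for centered data" matters.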

Assessment comment

The comment(s) below were originally left at Talk:Pearson correlation coefficient/Comments, and are posted here for posterity. Following several discussions in past years, these subpages are now deprecated. The comments may be irrelevant or outdated; if so, please feel free to remove this section.

Hi Everyone,

I am not a mathematician so I learn about statistics and math a lot easier if I can visualize what all the mathematical formulas are representing. Would it be possible to post some sort of graph of this Pearson article. It would also be nice to show a small (real world type) example of how to use Pearson correlation, when it should be used and why it should be used. Actually think that it would be helpful to post graphs and examples on all mathematical items if possible.

Baudeagle

Last edited at 08:27, 16 July 2009 (UTC). Substituted at 20:10, 1 May 2016 (UTC)

Suggestion to help Average reader

An average reader is likely to be browsing for a quick refresher, or to learn more about the Pearson correlation. A VERY helpful addition to the introductory section, i.e. as a second paragraph, would be a concise summary of the conditions/assumptions that must be met in order to use the tool. With the ease of conducting this test in software packages such as R, those assumptions are key for readers to understand. If there is any discussion along those lines, it seems buried in this rather long article. It should be clearly stated in the introductory section.

It's good that most symbols later on are defined. — Preceding unsigned comment added by 172.250.254.17 (talk) 22:12, 29 November 2016 (UTC)

Requested move 5 January 2017

Latest comment: 7 years ago10 comments3 people in discussion

The following is a closed discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review. No further edits should be made to this section.

The result of the move request was: Moved to Pearson correlation coefficient. The consensus seems to be for this title (including "coefficient" at the end), so I'm closing it to that effect and will move the page there. At the time of writing the article has been moved to Pearson correlation, which is not what was agreed below, and would require a separate discussion if someone feels it's warranted to remove "coefficient" from the title. — Amakuru (talk) 11:18, 12 January 2017 (UTC)

Pearson product-moment correlation coefficient → Pearson correlation coefficient – More common name, as attested by number of hits in Google Books: 67,300 vs. 29,000 ([1] vs. [2]). fgnievinski (talk) 04:05, 5 January 2017 (UTC)

Well, the shorter "Pearson correlation" has even more hits: 164,000 ([3]), so I went with it. fgnievinski (talk) 02:31, 7 January 2017 (UTC)

Support "Pearson correlation coefficient" -- apparently a non-admin tried to close to a different result than was proposed, and mangled the close, so it's still listed as open. Closer's rationale of "even more hits" on the subset of the name makes no sense, and got no support that I can see. So let's fix it. Dicklyon (talk) 03:47, 12 January 2017 (UTC)

Note Just noticed that closer = nom, and not 7 days yet. Can I just revert that mess and get on with the original RM discussion/proposal, which was much less bad? Dicklyon (talk) 03:53, 12 January 2017 (UTC)

I decided to withdraw the nomination and go ahead with a regular page move, presumably non-controversial. But it seems I messed up with the templates -- sorry about that. Would you please help to clean up the formal move request. Thanks. fgnievinski (talk) 06:38, 12 January 2017 (UTC)

I closed some RMs a while back, but am not up on the mechanics. But I'll go ahead and move the page to where you originally proposed if that's OK. Dicklyon (talk) 06:48, 12 January 2017 (UTC)

Here is some good suggestive evidence for the three-word topic being commonname. Dicklyon (talk) 06:50, 12 January 2017 (UTC)

Original move closure with incorrect templates

The result of the move request was: page moved. (non-admin closure) fgnievinski (talk) 02:56, 7 January 2017 (UTC)

Seemed non-controversial given the weight of evidence. fgnievinski (talk) 02:56, 7 January 2017 (UTC)

The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page or in a move review. No further edits should be made to this section.

Geometric interpretation

Latest comment: 7 years ago2 comments1 person in discussion

The section Pearson correlation coefficient#Geometric interpretation says

For uncentered data, it is possible to obtain a relation between correlation coefficient and the angle $\varphi$ between both the two regression lines, $y=g_x(x)$ and $x=g_y(y)$, obtained by regressing y on x, and x on y, respectively. One can show ... that $r = \sec \varphi - \tan \varphi$.

This equals ${\frac {1-\sin \varphi }{\cos \varphi }}.$ The numerator is always non-negative, and since the angle is always between –90° and 90° (since both lines have the same sign of their slopes), the denominator is always positive. Thus this passage asserts that r is always non-negative, which of course is not true. Is the left-hand side of the asserted relationship supposed to be |r|, or maybe r²? Loraof (talk) 16:34, 24 February 2017 (UTC)

(cont.) I can't get into the cited article beyond its first page via my JSTOR account. But I've derived that this formula holds only if the standard deviations are equal and the correlation coefficient is positive. If the standard deviations are equal and the correlation coefficient is negative, then the trig expression equals ~~–1/r~~ –r. And if the standard deviations are unequal, the trig expression equals a complicated function of the standard deviations and the correlation coefficient. I'm going to put a caveat into the above-quoted passage to say when it's true. Loraof (talk) 20:43, 24 February 2017 (UTC)
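The caveat above can be checked numerically. The following is a minimal sketch, not taken from the cited article: it assumes the two regression lines have slopes r·s_y/s_x and s_y/(r·s_x) in the (x, y) plane, and takes φ as the absolute difference of their inclination angles. The function name `sec_minus_tan` and its parameters are hypothetical.

```python
import math


def sec_minus_tan(r, sd_x, sd_y):
    """Return sec(phi) - tan(phi), where phi is the angle between the
    regression line of y on x and the regression line of x on y.

    Assumes r is nonzero (the x-on-y line is vertical when r == 0).
    """
    slope_y_on_x = r * sd_y / sd_x    # y = g_x(x)
    slope_x_on_y = sd_y / (r * sd_x)  # x = g_y(y), drawn in the same (x, y) plane
    phi = abs(math.atan(slope_x_on_y) - math.atan(slope_y_on_x))
    return 1.0 / math.cos(phi) - math.tan(phi)


# Equal standard deviations, positive r: the expression reproduces r.
equal_sd_positive = sec_minus_tan(0.5, 1.0, 1.0)

# Equal standard deviations, negative r: the expression gives -r instead.
equal_sd_negative = sec_minus_tan(-0.5, 1.0, 1.0)

# Unequal standard deviations: the expression no longer matches r.
unequal_sd = sec_minus_tan(0.5, 1.0, 2.0)
```

Under these assumptions the three cases behave as described in the comment above: r for positive correlation with equal standard deviations, –r for negative correlation, and some other value when the standard deviations differ.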

Parametric data

Latest comment: 6 years ago1 comment1 person in discussion

I am not a mathematician, so I do not wish to modify this article myself, but I would like to say that my understanding was that Pearson's test was typically used for parametric data, as opposed to Spearman's rank order coefficient, which is used for non-parametric data. This article could amplify this difference between the usages of these two tests. Vorbee (talk) 15:55, 10 August 2017 (UTC)
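One way to illustrate the difference is to note that Spearman's coefficient is the Pearson coefficient computed on ranks. The sketch below is illustrative only; the helper names `pearson_r`, `ranks`, and `spearman_rho` are hypothetical, and the ranking helper assumes there are no tied values.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# A monotone but nonlinear relationship: y = x**3.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

r_linear = pearson_r(x, y)    # below 1, because the relation is not linear
rho_rank = spearman_rho(x, y)  # equals 1, because the relation is monotone
```

Pearson's coefficient measures linear association, so it falls short of 1 here, while the rank-based coefficient treats any strictly increasing relation as perfect.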

Plot of minimum coefficient with sample size

Latest comment: 6 years ago1 comment1 person in discussion

I was looking for a reference to how this figure was generated: https://commons.wikimedia.org/wiki/File:Correlation_significance.svg (it has the Python code in the description, but I can't find a source for the maths behind the code). I found a similar expression in this paper: http://www.tqmp.org/Content/vol10-1/p029/p029.pdf - is this (or something like it) where the figure is derived from? — Preceding unsigned comment added by 129.206.102.91 (talk) 13:14, 28 February 2018 (UTC)
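For reference, one standard route to such a curve, which may or may not be what the figure's code does, is the t-test for a correlation: t = r·sqrt(n − 2)/sqrt(1 − r²) with n − 2 degrees of freedom. Solving for r at a critical t gives the smallest significant |r|. In this sketch the critical t value is passed in by the caller (for example from a t table); the function names are hypothetical.

```python
import math


def t_statistic(r, n):
    """t statistic for testing rho = 0, given sample correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def min_significant_r(n, t_crit):
    """Smallest |r| that reaches the critical t value with n - 2 degrees of freedom."""
    return t_crit / math.sqrt(n - 2 + t_crit * t_crit)


# Example: n = 10, two-sided 5% level; 2.306 is the tabulated critical t for 8 df.
r_min = min_significant_r(10, 2.306)

# Round trip: plugging r_min back into the t statistic recovers the critical value.
t_back = t_statistic(r_min, 10)
```

Evaluating `min_significant_r` over a range of n, with the matching critical t for each n, traces a curve of minimum significant correlation against sample size.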

Title of this page should include "r", and title of page on R^2 should include "R^2" or "R-squared"

Latest comment: 4 years ago1 comment1 person in discussion

Terms like "Pearson product-moment" and "coefficient of determination" are old-fashioned, somewhat obscure and probably specific (even after translation) to English-language sources. A lot of people who do statistics professionally are confused as to which of them refer to r and which to R^2. On the other hand there is a much clearer and more consistent understanding of what "r" and "R^2" are, and having those in the page title and the search box would let users know immediately whether that is the page they are looking for. Every time I have searched for one page or the other I had to click through 1-2 other pages such as correlation coefficient in order to find what I wanted, even though I know this material and was just looking for a reference to something. 73.149.246.232 (talk) 05:59, 14 April 2020 (UTC)

WP:COMMONNAME would be "correlation coefficient" or simply "correlation"

Latest comment: 4 years ago2 comments1 person in discussion

In the literature and in conversation people overwhelmingly call this quantity "r", "(the) correlation", or "correlation coefficient". Things like "Pearson r" or "product-moment" are out of date. If there is no objection I will eventually post a move request to make the primary title "correlation coefficient 'r' ", with the other names given in the lede and available in the search box as redirects. 73.149.246.232 (talk) 03:43, 22 April 2020 (UTC)

Hmmm. Correlation coefficient already exists though arguably that should be "measures of correlation" or "correlation statistic(s)". Notice that in most of the examples the word "coefficient" is added in the article for consistency with the lede, rather than being part of the term as most often used in the literature. I think the trend over time is to drop "coefficient" from most or all of those measures. 73.149.246.232 (talk) 03:49, 22 April 2020 (UTC)

Interpretation as inner product

Latest comment: 4 years ago3 comments2 people in discussion

The Pearson correlation coefficient has an interpretation as the inner product between distributions (having mean = 0). This does not seem to be mentioned in the article, but lies at the heart of understanding what is really going on. I hope someone knowledgeable on this subject will incorporate this into the article. 50.205.142.50 (talk) 03:32, 16 June 2020 (UTC)

The normalized dot product, or cosine interpretation, is in this section: Pearson correlation coefficient#Geometric interpretation. Dicklyon (talk) 03:47, 16 June 2020 (UTC)

Yes, thank you, Dicklyon. But as far as I can tell, that section describes only the dot-product interpretation as regards an *empirical distribution* of samples. I would like to see it address the dot-product interpretation as regards the actual underlying distribution of a random variable. 50.205.142.50 (talk) 14:43, 16 June 2020 (UTC)
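As a concrete illustration of the sample version discussed above, the sketch below computes r as the cosine of the angle between the two mean-centered data vectors. The population analogue replaces the dot product with the expectation E[XY] of zero-mean random variables, giving ρ = E[XY] / sqrt(E[X²]·E[Y²]); only the sample version is coded here, and the function names are hypothetical.

```python
import math


def centered(values):
    """Subtract the mean so the vector has mean zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def correlation_as_cosine(xs, ys):
    """Pearson r as the normalized dot product (cosine) of the centered vectors."""
    u = centered(xs)
    v = centered(ys)
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Perfectly aligned, perfectly opposed, and orthogonal centered vectors.
r_aligned = correlation_as_cosine([1, 2, 3], [10, 20, 30])
r_opposed = correlation_as_cosine([1, 2, 3], [3, 2, 1])
r_orthogonal = correlation_as_cosine([1, 2, 3], [1, -2, 1])
```

Centering matters: without it the same formula gives the cosine similarity of the raw vectors, which is generally not the correlation.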

Naming of correlation coefficient after a documented eugenicist (Pearson) even though the term 'Pearson correlation coefficient' is not in common usage

Latest comment: 4 years ago1 comment1 person in discussion

In light of the facts that Karl Pearson was a eugenicist and anti-Semite (as evidenced by the fact that UCL recently decided to change the name of a building that had been named after him), that the term 'Pearson correlation coefficient' is rarely used, that the coefficient was originally defined by Auguste Bravais, and that there are various alternative names already available (correlation coefficient, bivariate coefficient, r-value), it is proposed to change the title of this article and the name of the coefficient described therein. Gunningo (talk) 06:47, 30 June 2020 (UTC)

Concrete examples

Latest comment: 4 years ago1 comment1 person in discussion

I think our readers would appreciate it if we could include a few concrete examples. Nerd271 (talk) 14:45, 6 July 2020 (UTC)

Confidence Interval

Latest comment: 4 years ago1 comment1 person in discussion

The last paragraph of the first section of the confidence interval article states "A confidence interval does not predict that the true value of the parameter has a particular probability of being in the confidence interval given the data actually obtained. (An interval intended to have such a property, called a credible interval, can be estimated using Bayesian methods; but such methods bring with them their own distinct strengths and weaknesses)." which is inconsistent with "The other aim is to construct a confidence interval around r that has a given probability of containing ρ." from this article. The latter is a common mistake that people make when interpreting the confidence interval. I think this should be changed.

Somebody forgot to sign. Nerd271 (talk) 14:46, 6 July 2020 (UTC)
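For context on the interval being discussed, one common construction uses the Fisher transformation: z = artanh(r) is treated as approximately normal with standard error 1/sqrt(n − 3), and the endpoints are mapped back with tanh. The sketch below assumes that construction and a roughly bivariate-normal sample; the function name `fisher_confidence_interval` is hypothetical. The stated level describes the long-run coverage of the procedure over repeated samples, not a probability statement about ρ given one observed interval.

```python
import math
from statistics import NormalDist


def fisher_confidence_interval(r, n, level=0.95):
    """Approximate confidence interval for rho via the Fisher z-transformation.

    Assumes n > 3 and -1 < r < 1.
    """
    z = math.atanh(r)                 # Fisher transformation of the sample r
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error of z
    q = NormalDist().inv_cdf(0.5 + level / 2.0)  # standard normal quantile
    lower = math.tanh(z - q * se)
    upper = math.tanh(z + q * se)
    return lower, upper


# Example: r = 0.5 observed in a sample of n = 50.
low, high = fisher_confidence_interval(0.5, 50)
```

The interval is not symmetric around r, because tanh compresses values near ±1.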

Source for later reading

Latest comment: 4 years ago1 comment1 person in discussion

http://www.pindling.org/Math/Statistics/Textbook/Chapter3_Regression_Correlation/Chapter3_Regres_Corr_Overview.htm

Somebody forgot to sign. Nerd271 (talk) 14:46, 6 July 2020 (UTC)

Display font

Latest comment: 4 years ago1 comment1 person in discussion

[new unsigned edit moved here from the article]

The format of the formulas is way too small on my display.

Somebody forgot to sign. Nerd271 (talk) 14:46, 6 July 2020 (UTC)