# Talk:Mann–Whitney U

WikiProject Statistics (Rated Start-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

Start  This article has been rated as Start-Class on the quality scale.
High  This article has been rated as High-importance on the importance scale.

## Ties

"All the formulae here are made more complicated in the presence of tied ranks, but if the number of these is small (and especially if there are no large tie bands) these can be ignored when doing calculations by hand. The computer statistical packages will use them as a matter of routine."

First, what does one do with ties? (A link is sufficient if it is described elsewhere.)

Second, what do computer statistical packages use routinely? The ignoring procedure, or the proper (undescribed) way to handle ties?

dfrankow (talk) 19:45, 29 December 2008 (UTC)

Yes, that was not well expressed. I have had a go at rephrasing it - does it make better sense now? The actual formula in the case of ties would be a bit of a pig to enter in Wiki code and I will leave that job for someone more fluent in the coding than I am, though it's certainly true that for completeness we ought to have it here. seglea (talk) 01:40, 30 December 2008 (UTC)
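
For reference, one common form of the tie-corrected standard deviation for the normal approximation is sketched below. The notation is an assumption made here, not taken from the article text quoted above: n = n1 + n2 is the pooled sample size, k is the number of distinct values, and t_i is the number of observations sharing the i-th distinct value. Tied observations each receive the average of the ranks they span (midranks).

```latex
\sigma_{\text{ties}} = \sqrt{\frac{n_1 n_2}{12}\left[(n + 1) - \sum_{i=1}^{k} \frac{t_i^{3} - t_i}{n(n - 1)}\right]}, \qquad n = n_1 + n_2
```

With no ties every t_i equals 1, the sum vanishes, and the expression reduces to the usual sqrt(n1 n2 (n1 + n2 + 1)/12).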

## One-tailed versus two-tailed distributions

"Note that since U1 + U2 = n1 n2, the mean n1 n2/2 used in the normal approximation is the mean of the two values of U. Therefore, you can use U and get the same result, the only difference being between a left-tailed test and a right-tailed test.

Huh? Perhaps it would be clearer to say which value goes with a left-tailed and which with a right-tailed test. dfrankow (talk) 19:45, 29 December 2008 (UTC)
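
A small sketch of the point in the quoted passage, assuming the direct-count definition of U (half credit for ties) and the normal approximation without tie or continuity correction. The data and the function names are made up for illustration only.

```python
import math

def u_statistic(a, b):
    """Count pairs (x, y), x from a and y from b, with x > y; ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def z_score(u, n1, n2):
    """Normal approximation to U (no tie or continuity correction)."""
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean) / sd

# Hypothetical data, chosen only for illustration.
sample1 = [1.1, 2.4, 3.8, 4.0, 5.9]
sample2 = [2.0, 4.5, 6.1, 7.3, 8.8, 9.2]

n1, n2 = len(sample1), len(sample2)
u1 = u_statistic(sample1, sample2)  # how often sample1 beats sample2
u2 = u_statistic(sample2, sample1)  # how often sample2 beats sample1
z1 = z_score(u1, n1, n2)
z2 = z_score(u2, n1, n2)

print(u1, u2, u1 + u2 == n1 * n2)
print(z1, z2)
```

The two z-scores have the same magnitude and opposite signs. Under this counting convention, a small U1 (sample 1 tending to be smaller) sits in the left tail for U1 and, equivalently, in the right tail for U2.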


## Assumptions

In the general formulation, the null hypothesis of the Mann-Whitney U-test is not about the equality of distributions. It is about the symmetry between two populations with respect to the probability of obtaining a larger observation. Of course, two identical distributions possess this symmetry, but two different distributions (for example, 2 normals with the same mean but different variances) can also be perfectly symmetric with respect to the probability of obtaining a larger observation.

The whole issue of correctly formulating the null hypothesis is very important when considering the power of the test. Consider again 2 normal distributions with the same mean and different variances. If the null hypothesis is defined as the equality of the 2 distributions, we are likely to fail to reject the null hypothesis even though we know a priori that it is not true. Only for 2 distributions with similar variances but different means (more specifically, 2 distributions with a location shift) will we have a fair chance (i.e. good power) of rejecting the null hypothesis. So such a formulation of the null hypothesis severely restricts the applicability of the test.

However, if we define the null hypothesis as a hypothesis of symmetry with respect to obtaining a larger observation then everything works perfectly and the power of the test does not depend on diverging variances. Indeed, the inability to reject the null hypothesis for 2 normals with the same means but different variances is not a failure of the test because we know a priori that in this case the null hypothesis is satisfied.

—Preceding unsigned comment added by Marenty (talkcontribs) 02:19, 29 July 2010 (UTC)

The article gives misleading information on the assumptions of the Mann-Whitney U test. It says:

"In a less general formulation, the Wilcoxon-Mann-Whitney two-sample test may be thought of as testing the null hypothesis that the probability of an observation from one population exceeding an observation from the second population is 0.5. This formulation requires the additional assumption that the distributions of the two populations are identical except for possibly a shift (i.e. f1(x) = f2(x + δ) )"

Testing the alternative hypothesis P(A>B) > 0.5 (where A is from population 1 and B is from population 2) does not require the restrictive assumption that both distributions are equal except for a shift in location! How come? What is the basis for this statement? The test statistic in the U-test is just the proportion of pairs, one observation from population 1 and one from population 2, in which the first exceeds the second. The distribution of this test statistic can perhaps be calculated most easily for the special case of shifted distributions, but that does not restrict the use of the test and has nothing to do with the test's assumptions! —Preceding unsigned comment added by Marenty (talkcontribs) 22:06, 24 May 2008 (UTC)

Assumptions seem necessary: in the unequal-variance case, p-values are not uniformly distributed under the null hypothesis (I used two normals with the same mean and different variances). 66.218.169.47 (talk) 23:55, 13 February 2009 (UTC)

Can anyone typeset the formulae better? I am not familiar with TeX. seglea 05:39, 17 Jan 2004 (UTC)


## Assumptions

The hypothesis stated in this article refers both to the testing of equality of central tendency and to equality of distribution. The central tendency hypothesis requires the additional assumption that the distributions of the two samples are the same except for a shift (i.e. f1(X) = f2(X+delta)). The test can also be described as a general test of equality of distribution (H0: f1=f2). In this case the shift alternative is not required; however, the test is used most often as a test of central tendency, so the original formulation (with the addition of the shift assumption) is most appropriate. I have added this assumption to the main page. —Preceding unsigned comment added by 132.239.102.171 (talk) 00:20, 30 January 2008 (UTC)

"The hypothesis stated in this article refers both to the testing of equality of central tendency, and equality of distribution"

Really, this test is precisely only for testing stochastic dominance of two variables A and B, that is, of Prob(A>B) > Prob(B>A). In other words, it tests whether a randomly chosen sample from A is expected to be greater than a sample from B. Look at the test statistic: it is a function of the proportion of pairs A>B where A is from the 1st distribution and B is from the 2nd distribution. For testing the stochastic dominance, no additional assumptions are needed (besides the assumption that the underlying distribution is ordinal).

It is incorrect to use the MW U-test for general testing of "equality of two distributions" as asserted. Two normal distributions A and B with the same mean and different variances are different distributions, yet the test will (incorrectly) never reject the null hypothesis if we are testing for equality of the distributions. But if we are testing for stochastic dominance instead, then the test (correctly) does not reject the null hypothesis.

On the other hand, "central tendency" is a nebulous concept, but in reality testing "equality of central tendency" with the U-test will be nothing more than testing of stochastic dominance. If we want to use the test for detection of a "shift", then we do need to add an assumption about distributions A and B having the same shapes. But this additional (and unnecessary) assumption follows from the very definition of the "shift" rather than from intrinsic requirements of the U test.

In summary, the test should be used in general for testing that Prob(A>B)>0.5, and as such has only one assumption that the samples are comparable (i.e. ordinal).

74.0.49.2 (talk) 01:51, 8 June 2009 (UTC)


## P value

I believe that one cannot interpret results from this test without understanding the P-value. As I understand it, the smaller the P value, the more different the two populations are. What I would like to know is whether there is a critical value, as there is with a t-test. Thanks ADS


## Inexact explanation of what this test should be used for

A significant MW test does not necessarily imply that the distributions have different medians. This is a common misconception. The test is most powerful for detecting a difference in medians, which is why this is commonly misstated. The MW test tests whether the samples were taken from different distributions.


## calculation of mu in the normal approx

Is this right? I would have thought it should be symmetrical in n1 and n2...

161.130.68.84 21:11, 28 December 2006 (UTC)JWD
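
For comparison, the mean and standard deviation usually quoted for the normal approximation (without tie correction) are written out below. This is a sketch in the standard notation, not a copy of the formula that was in the article at the time of the question.

```latex
m_U = \frac{n_1 n_2}{2}, \qquad \sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}
```

Both expressions are unchanged when n1 and n2 are swapped, so a version that is not symmetric in the two sample sizes would indeed be suspect.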


## Introduction requires clarity

These two statements are not equivalent:

• It requires the two samples to be independent, and the observations to be ordinal or continuous measurements
• i.e. one can at least say, of any two observations, which is the greater.

## Link to rank article needs to be more specific

The 'rank' link currently points to a disambiguation page which doesn't include an article explaining what the rank of a sample is. Tim (talk) 02:31, 3 June 2008 (UTC)


## Distribution of U-statistic and Table of Values

The table of values link (pdf) is broken, I am changing the address to a different document. I have been unable to find on the web some kind of explanation of the distribution of the U-statistic. This article could use at least some explanation of how the statistic is distributed, and optimally a formula or plot, if possible. I'll keep working, but if someone's got it on hand, that would be great. Lovewarcoffee (talk) 19:57, 4 August 2008 (UTC)
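
As a possible starting point for such an explanation, here is a minimal sketch of the exact null distribution of U for small samples without ties. It assumes that, under the null hypothesis, every ordering of the pooled observations is equally likely, and it conditions on whether the largest observation comes from sample 1 or sample 2. The function names are illustrative, not taken from any package.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_arrangements(u, n1, n2):
    """Number of orderings of n1 x's and n2 y's (no ties) in which U equals u,
    where U counts the (x, y) pairs with x > y."""
    if u < 0:
        return 0
    if n1 == 0 or n2 == 0:
        return 1 if u == 0 else 0
    # If the largest observation is an x, it beats all n2 y's.
    # If the largest observation is a y, it adds nothing to U.
    return count_arrangements(u - n2, n1 - 1, n2) + count_arrangements(u, n1, n2 - 1)

def null_pmf(n1, n2):
    """Exact null distribution of U as a list indexed by u = 0 .. n1*n2."""
    total = comb(n1 + n2, n1)
    return [count_arrangements(u, n1, n2) / total for u in range(n1 * n2 + 1)]

def lower_tail(u, n1, n2):
    """P(U <= u) under the null hypothesis."""
    return sum(null_pmf(n1, n2)[: u + 1])

pmf = null_pmf(3, 4)
print(pmf)
print(lower_tail(1, 3, 4))
```

The distribution is symmetric about n1 n2 / 2, and tables of critical values can in principle be generated from tail sums like `lower_tail`. For larger samples the normal approximation is normally used instead.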


## Just what is it that we are talking about?

The article starts:

In statistics, the Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon-Mann-Whitney test) is. . . .

Thereafter it talks of "MWW". "MWW" strikes me as an odd abbreviation for "Mann-Whitney U test." If this article is correctly titled, I suggest that the test should be abbreviated as "MW."

David J. Sheskin devotes pp 513–75 of Handbook of Parametric and Nonparametric Statistical Procedures, 4th ed. (Boca Raton: Chapman & Hall, 2007) to this one test, which he calls the "Mann–Whitney U" test. (If you're a purist, note the dash: it's not one statistician with a double-barreled name, but two separate people, Mann and Whitney.) He writes at the start:

Two versions of the test to be described under the label of the Mann–Whitney U test were independently developed by Mann and Whitney (1947) and Wilcoxon (1949). The version to be described here is commonly identified as the Mann–Whitney U test while the version developed by Wilcoxon (1949) is usually referred to as the Wilcoxon–Mann–Whitney test. Although they employ different equations and different tables, the two versions of the test yield comparable results. (513)

(Unfortunately even Sheskin's 1700+ pages don't include any further coverage of [what he calls] the Wilcoxon–Mann–Whitney test.)

And Sheskin adds in an endnote:

The test to be described in this chapter is also referred to as the Wilcoxon rank-sum test and the Mann–Whitney–Wilcoxon test. . . . (569)

This of course doesn't agree with what's written in this Wikipedia article. To follow Sheskin, it would instead say something like:

In statistics, the Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW) or Wilcoxon rank-sum test) is. . . . (Although very similar, the "Wilcoxon-Mann-Whitney" test is different.) . . .

What's the authority for what the article now says? Tama1988 (talk) 09:49, 17 November 2008 (UTC)

The Mann-Whitney U test and Wilcoxon two sample test were developed independently, but provide an identical test statistic. Both Snedecor and Cochran and Sokal and Rohlf retain the distinction and neither concatenates the names. Regards—G716 <T·C> 20:58, 18 November 2008 (UTC)
In statistics, the Mann-Whitney U test — also called the Mann-Whitney-Wilcoxon (MWW), Wilcoxon two sample test, or Wilcoxon rank-sum test — is. . . . (Although very similar, the "Wilcoxon-Mann-Whitney" test is different.) . . .
? Or are you saying that when Sheskin talks of "comparable results" he means "the same results" and that Wilcoxon's 1949 test (the "Wilcoxon-Mann-Whitney test") is the same as Mann-Whitney? Tama1988 (talk) 09:14, 19 November 2008 (UTC)

I have removed the following from the section on Herrnstein's rho. As it stands, it does not make sense. It may well be true, but if so it needs a lot more explanation.

"ρ is also known as the area under the receiver operating characteristic (ROC) curve."

seglea (talk) 23:14, 12 January 2009 (UTC)

This is well known so I'm unclear on why this was removed. See

```
@article{hanley1982,
  author = {Hanley, J. A. and McNeil, B. J.},
  year = 1982,
  title = {The meaning and use of the area under a receiver operating characteristic ({ROC}) curve},
  journal = {Radiology},
  volume = 143,
  pages = {29--36},
  annote = {diagnosis;testing;ROC;c index}
}
```

Harrelfe (talk) 14:32, 14 February 2009 (UTC)
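
A quick numerical check of the claim, as a sketch only. It assumes ρ is defined as U/(n1 n2), with U counting the pairs in which a score from the first group exceeds a score from the second (ties count one half), and it builds the empirical ROC curve by sweeping a threshold over the observed scores. The scores and function names are hypothetical.

```python
def u_statistic(pos, neg):
    """Pairs (p, q) with p > q count 1; ties count 0.5."""
    return sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)

def roc_auc(pos, neg):
    """Area under the empirical ROC curve by the trapezoidal rule."""
    thresholds = sorted(set(pos) | set(neg), reverse=True)
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    for t in thresholds:
        tpr = sum(p >= t for p in pos) / len(pos)
        fpr = sum(q >= t for q in neg) / len(neg)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical scores; the value 0.5 appears in both groups.
diseased = [0.9, 0.8, 0.5, 0.4]
healthy = [0.7, 0.5, 0.3, 0.2, 0.1]

rho = u_statistic(diseased, healthy) / (len(diseased) * len(healthy))
auc = roc_auc(diseased, healthy)
print(rho, auc)
```

The two printed numbers agree up to floating-point rounding, which is the sense in which ρ "is" the area under the ROC curve. The trapezoidal rule handles the tied score the same way the half-count does.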


## Assumptions and Formalization of Hypotheses

I will shortly change the "Formal statement of object of test" and "Assumptions" sections and put them into one section. There were several errors.

1. Previously the article stated that the MWW test does not test for differences in medians. But if you make the location shift assumptions, then it does in fact strictly test for the differences in medians.

2. Previously the article stated that one proper formulation for the MWW test is to have the null hypothesis be that P(X>Y)=0.5. In fact, this is not true. If you have two normal distributions with the same mean but different variances then the MWW test is no longer valid under that null even though P(X>Y)=.5 (see Pratt, 1964, Journal of the American Statistical Association, 665-680).

3. Previously the article stated "Without making such a strong assumption [about the location shift] (and verifying its validity) it is incorrect to use the MWW test as a test for shift in location." In fact, although we can invalidate the location shift assumption, we cannot verify that assumption with a finite amount of data. (Testing and finding no significant violation of an assumption is not the same as verifying that assumption. You may have failed to find significance because of a small sample size.) In statistics we make assumptions all the time, so saying it is incorrect to use an assumption without verifying it seems contrary to the practice of statistics, although it may be a good idea to check the assumption if you can.

4. The following statement was made: "the Mann-Whitney U test is valid for testing of stochastic dominance under very broad conditions, without making any additional assumptions, including any additional assumptions about variances of the two samples". This is not correct, see Pratt, 1964 referenced above. The paragraph following was mostly redundant, so I deleted it. Mpf3205 (talk) 05:18, 7 March 2010 (UTC)


## Mann-Whitney vs. Wilcoxon

I spent a great deal of time puzzling over the table of critical values in the second external link (www.stat.auckland.ac.nz), wondering why it didn't match up with the first link, and more importantly, why some of the values appeared to be theoretically impossible (e.g. greater than 100 for a 10*10 test). After careful reading, I realized that the test statistic in the link was calculated differently than the one in the article. (i.e. a straight R1 sum of ranks versus the R1 - n1(n1+1)/2).

I have no external experience with this, but the best I can tell from the external link, while the Mann-Whitney and Wilcoxon tests are equivalent, they are not identical, in that the numeric form of the statistic differs. Whereas the Mann-Whitney includes the n1(n1+1)/2 adjustment, the Wilcoxon is a straight sum of ranks. While not changing the application or conclusions of the test, this is crucial to know when looking at critical value tables, as what works for one won't work for the other.

I altered the text for the external link so hopefully others will not be as confused, but could someone who has a better understanding of the history and situation add a clarification about the different functional forms to the article? (If you would add info about why they're equivalent, and why one form might be preferred over the other, so much the better.)

P.S. While you're at it, a discussion on how to treat identical valued items in calculating the test statistic would also be appreciated. The Auckland link discusses it for the straight Wilcoxon sum of ranks, but I'm still not sure how they are accounted for in the Mann-Whitney statistic. -- 140.142.20.229 (talk) 22:30, 8 March 2010 (UTC)

Just thought I'd point to a reference which touches on the difference: Journal of the American Statistical Association, Vol. 59, No. 307 (Sep., 1964), pp. 925-934 [1] -- 140.142.20.229 (talk) 22:39, 8 March 2010 (UTC)
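
The relation between the two forms can be sketched as follows, assuming tied values share the average of the ranks they span (midranks) and the direct count gives half credit for ties. The data and function names are made up for illustration.

```python
def midranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def rank_sum_and_u(sample1, sample2):
    """Return (W, U1): the Wilcoxon rank sum of sample1 and its Mann-Whitney U."""
    n1 = len(sample1)
    ranks = midranks(list(sample1) + list(sample2))
    w = sum(ranks[:n1])
    return w, w - n1 * (n1 + 1) / 2

def u_by_counting(sample1, sample2):
    """Direct count: a sample1 value beating a sample2 value; ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in sample1 for y in sample2)

# Hypothetical data with one tie (the value 5 appears in both samples).
a = [3, 5, 8, 9]
b = [1, 2, 5, 7, 10]
w, u1 = rank_sum_and_u(a, b)
print(w, u1, u_by_counting(a, b))
```

Because W is U1 plus the constant n1(n1+1)/2, the two statistics order every possible outcome identically, but their tables differ by that offset. The largest possible W is n1 n2 + n1(n1+1)/2, which for a 10*10 comparison is 155, so rank-sum table entries above 100 are not impossible.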

## What do you need to assume under the null hypothesis?

Under the section that describes the assumptions, I had previously added that you need to have both distributions be equal under the null. That was deleted. But I assert that you need that assumption. If you state the null hypothesis as only needing that Pr[X>Y]+ .5 Pr[X=Y] = .5, that does not give sufficient conditions for validity. Here is a counter example (see Pratt, 1964, JASA, cited in my previous notes above): if you have two normal distributions with the same mean but different variances then Pr[X>Y]+ .5 Pr[X=Y] = .5, but your type I error can be inflated (i.e., the test can reject the null hypothesis more often than the nominal significance level).

The paragraph that I deleted also had that same mistaken idea. —Preceding unsigned comment added by Mpf3205 (talkcontribs) 05:31, 27 August 2010 (UTC)
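
The counterexample can be explored with a small simulation. This is only a sketch under stated assumptions: it uses the normal approximation without continuity correction rather than the exact test, and the sample sizes (40 and 10), standard deviations (1 and 10), number of replicates, and seed are arbitrary choices made here, not values from Pratt (1964).

```python
import math
import random

def u_statistic(a, b):
    """Count pairs (x, y) with x > y; ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def two_sided_p(a, b):
    """Two-sided p-value from the normal approximation (no continuity correction)."""
    n1, n2 = len(a), len(b)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_statistic(a, b) - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

def rejection_rate(n1, sd1, n2, sd2, reps=2000, alpha=0.05, seed=0):
    """Fraction of simulated data sets (both means 0) rejected at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = [rng.gauss(0.0, sd1) for _ in range(n1)]
        b = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if two_sided_p(a, b) < alpha:
            rejections += 1
    return rejections / reps

same = rejection_rate(40, 1.0, 10, 1.0)        # identical distributions
different = rejection_rate(40, 1.0, 10, 10.0)  # same mean, unequal variances
print(same, different)
```

In this setup Pr[X>Y] = .5 holds in both cases, yet the second rejection rate should come out well above the nominal 5% while the first stays close to it. The inflation is strongest when the smaller sample has the larger variance, as here.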


## Un-clarity in terms : ranks or observations ? lower rank or smaller ?

"Taking each observation in sample 1, count the number of observations in sample 2 that are smaller than it (count a half for any that are equal to it)."

If I understand correctly, the test does not require comparing the observed values, only their ranks, so this sentence should be reworded in terms of ranks.

"Choose the sample for which the ranks seem to be smaller" "count the number of hares it is beaten by (lower rank)"

Is 1st the lowest rank? For me it is the opposite. The word "smaller" seems less ambiguous to me.

"Arrange all the observations into a single ranked series. That is, rank all the observations without regard to which sample they are in."

This has to be done for both small and large samples, so this sentence should come first.

I will make the modifications I propose; that is probably the best way to show what I mean.

Arnaud —Preceding unsigned comment added by 132.183.93.37 (talk) 16:10, 16 November 2010 (UTC)
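
On the ranks-versus-observations question, a short sketch (hypothetical data without ties, rank 1 assigned to the smallest value) suggests the two wordings describe the same count: ranking preserves order, so counting "smaller" on the raw observations and counting "lower rank" on the ranks give the same U.

```python
def u_by_counting(sample1, sample2):
    """For each value in sample1, count the sample2 values that are smaller
    (count a half for any that are equal)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in sample1 for y in sample2)

# Hypothetical observations with no tied values.
s1 = [12.0, 15.5, 9.3, 20.1]
s2 = [11.2, 14.0, 22.7]

# Rank the pooled observations; rank 1 goes to the smallest value.
pooled = sorted(s1 + s2)
rank = {v: i + 1 for i, v in enumerate(pooled)}  # valid only because there are no ties
r1 = [rank[v] for v in s1]
r2 = [rank[v] for v in s2]

u_values = u_by_counting(s1, s2)
u_ranks = u_by_counting(r1, r2)
print(u_values, u_ranks)
```

Under the opposite convention (rank 1 for the largest value) the comparison direction flips, which is presumably the source of the ambiguity raised above.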


## Where is the non-technical summary?

I'm sorry, but Wikipedia is used by non-experts to gain an understanding of something they may have run across in a technical or semi-technical setting. As such, one of the joys of reading many WP articles is that someone has taken the time to explain, in layman's terms, just what exactly is covered in the topic. That is NOT the case here. I think it's great that so many of you can chime in here as "experts", able to contribute because you have the required background.

But when the first sentence of the article uses a phrase "have equally large values" -- it just doesn't make sense. On the face of it, why should it be difficult to determine whether two "samples" have "equally large values"? Doesn't that simply mean looking at the largest value in each sample and seeing if they are the same? Clearly not, which is precisely why a layman's version, at least in the first paragraph, should be offered.

I hope someone is willing to stoop to the level of the non-cognoscente and explain what the heck this is all about. I find that in statistics, almost more than in any other discipline, practitioners are unwilling to translate their statements into simple real-world examples and plain speaking. I often wonder if that's because they're afraid, in some way, that someone will claim the emperor has no clothes? -roricka 1/1/11 — Preceding unsigned comment added by Roricka (talkcontribs) 22:29, 1 January 2011 (UTC)

Hi Roricka. Reading your comment, I truly would like to help out but am not sure how. You wrote that the first sentence doesn't make sense. Well, indeed it doesn't make sense to anyone not familiar with the most basic notions of probability and statistics. However, you cannot put them into the first sentence, since it would require explaining what a random variable is and how that relates to statistical tests and hypothesis testing (notice how proper wikilinks are present in the first sentence).
Also, IMHO, I don't think all wikipedia articles can be formulated as independent modules of knowledge, easily understood without context. This particular article is a good example of that.
If, after reading more, you'd gain an insight as to how to make this article clearer - I'll be most interested to see how.
Yours,
Talgalili (talk) 10:32, 2 January 2011 (UTC)

I read the Spearman coefficient page and understood it immediately. This page I just found incomprehensible in comparison. I have to echo what the OP said. I suggest using the Spearman page as an example of "how to do it right" maybe? Especially the images which were great! — Preceding unsigned comment added by 84.92.230.173 (talk) 14:27, 6 June 2011 (UTC)


## Separate pages for Mann-Whitney U test and Wilcoxon rank-sum test?

Although the Mann-Whitney U test and the Wilcoxon rank-sum test are equivalent, they are two different tests, as pointed out by 140.142.20.229. In the current version, Mann–Whitney U is described, while the Wilcoxon rank-sum test is not. Since they are two different tests, shouldn't we create a new page for the Wilcoxon rank-sum test, containing the description of this method, and then say that the Mann-Whitney U test and the Wilcoxon rank-sum test are equivalent?--Gorif (talk) 23:42, 12 February 2012 (UTC)
