User:Reuqr/A necessary and sufficient condition for Simpson's paradox to occur

This is not a Wikipedia article: It is an individual user's work-in-progress page, and may be incomplete and/or unreliable. For guidance on developing this draft, see Wikipedia:So you made a userspace draft.

This page was last edited by Reuqr (talk | contribs) 10 years ago. (Update timer)


We are considering the simplest case in which Simpson's paradox can occur, namely a 2 × 2 × 2 count table. A specific example might be a test of whether a new drug for a medical condition is more effective than a placebo. One starts with a set of individuals with the medical condition, and divides it into two groups. To all individuals in the first group one gives the drug (say there are $N_{D}$ individuals in this group), and to each individual in the other, the placebo (say there are $N_{P}$ individuals in that group). Then in each group one counts in how many individuals the medical condition improved (say $n_{D}$ and $n_{P}$ , respectively). If the rate of improvement for the drug group, $n_{D}/N_{D}$ , is greater than the rate of improvement in the placebo group, $n_{P}/N_{P}$ , then the experiment suggests that the drug is effective; otherwise we conclude that the drug is no more effective than the placebo. (In a real study, the case of the favorable outcome, $n_{D}/N_{D}>n_{P}/N_{P}$ , is not in and of itself enough to conclude that the drug is effective; one would at least also need to make sure that the probability that this happened by accident (i.e. the p-value) is small enough.)

But it could be that there is a confounding variable, which is something that can in principle influence the probability of improvement as much as what treatment one gets. For example, it may matter whether the medical condition is, say, early-stage or late-stage. So now one decides to stratify the sample, i.e. to consider separately the individuals with the early-stage and the late-stage condition. Suppose that in the drug group, there were $E_{D}$ individuals with the early-stage condition and $L_{D}$ individuals with the late-stage condition ( $E_{D}+L_{D}=N_{D}$ ); the condition improved in $e_{D}$ and $\ell _{D}$ individuals, respectively ( $e_{D}+\ell _{D}=n_{D}$ ). The corresponding numbers for the placebo group are $E_{P}$ , $L_{P}$ , $e_{P}$ , and $\ell _{P}$ . (Note that the data now consist of a total of 8 numbers, which can be thought of as organized into a 2 × 2 × 2 table.)

	placebo	drug
early stage	Group 1 ${\frac {e_{P}}{E_{P}}}$	Group 2 ${\frac {e_{D}}{E_{D}}}$
late stage	Group 3 ${\frac {\ell _{P}}{L_{P}}}$	Group 4 ${\frac {\ell _{D}}{L_{D}}}$
both stages	${\frac {e_{P}+\ell _{P}}{E_{P}+L_{P}}}$	${\frac {e_{D}+\ell _{D}}{E_{D}+L_{D}}}$

And now it can happen that whatever conclusion one reached about the effectiveness of the drug before the groups were stratified, the conclusion when the strata are considered separately could be the opposite: if that happens, one has Simpson's paradox. For example, it could be that if each stratum is considered separately, it appears that the drug is more effective than the placebo,

${\frac {e_{P}}{E_{P}}}<{\frac {e_{D}}{E_{D}}}$ and ${\frac {\ell _{P}}{L_{P}}}<{\frac {\ell _{D}}{L_{D}}},$

(1)

even though the non-stratified data suggested that the drug is no better than the placebo:

${\frac {e_{P}+\ell _{P}}{E_{P}+L_{P}}}\geq {\frac {e_{D}+\ell _{D}}{E_{D}+L_{D}}}.$

(2)
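
As a concrete illustration of Eqs. (1) and (2), here is a minimal Python sketch. The eight counts are made up for illustration (they are assumptions, not data from any study); the sketch only checks that both conditions can hold at once, using exact fractions to avoid rounding issues.

```python
from fractions import Fraction

# Hypothetical counts, chosen only for illustration (not from any study).
E_P, e_P = 80, 64   # placebo, early stage: group size, number improved
L_P, l_P = 20, 4    # placebo, late stage
E_D, e_D = 20, 17   # drug, early stage
L_D, l_D = 80, 24   # drug, late stage

def rate(improved, total):
    """Exact improvement rate as a fraction."""
    return Fraction(improved, total)

# Eq. (1): within each stratum the drug looks better than the placebo.
eq1_holds = rate(e_P, E_P) < rate(e_D, E_D) and rate(l_P, L_P) < rate(l_D, L_D)

# Eq. (2): in the pooled data the drug looks no better than the placebo.
eq2_holds = rate(e_P + l_P, E_P + L_P) >= rate(e_D + l_D, E_D + L_D)

print("Eq. (1) holds:", eq1_holds)
print("Eq. (2) holds:", eq2_holds)
print("Simpson's paradox:", eq1_holds and eq2_holds)
```

In these made-up numbers most placebo patients are early-stage and most drug patients are late-stage, which is what lets the pooled comparison go the other way.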

We are now going to derive a necessary and sufficient condition for Simpson's paradox to occur. The trick is to rewrite each side of Eq. (2) as a weighted average of (equivalently, a linear interpolation between) the quantities appearing in Eq. (1). In both the D and P cases, this is done as follows:

${\frac {e+\ell }{E+L}}={\frac {e}{E+L}}+{\frac {\ell }{E+L}};$

now in the first term we multiply and divide by E, and in the second, by L:

$=\underbrace {\frac {E}{E+L}} _{\mu }\underbrace {\frac {e}{E}} _{f}+\underbrace {\frac {L}{E+L}} _{\nu }\underbrace {\frac {\ell }{L}} _{k}=\mu \,f+\nu k;$

(3)

note that $\mu +\nu =1$. (So one may say that $\mu$ and $\nu$ are normalized weights.) Thus, we rewrite Eq. (2) as


$\mu _{P}\,f_{P}+\nu _{P}\,k_{P}\geq \mu _{D}\,f_{D}+\nu _{D}\,k_{D},$

(4)

while Eq. (1) becomes

$f_{P}<f_{D}$ and $k_{P}<k_{D}.$

(5)
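
The decomposition in Eq. (3) can be checked numerically. The sketch below is only an illustration: the helper name `decompose` and the counts are assumptions, not part of the derivation. It computes the weights and stratum rates for one treatment group and confirms that the weights sum to 1 and that the weighted average reproduces the pooled rate.

```python
from fractions import Fraction

def decompose(e, E, l, L):
    """Return (mu, f, nu, k) as in Eq. (3) for one treatment group.

    e, E: improved / total in the early-stage stratum
    l, L: improved / total in the late-stage stratum
    """
    mu = Fraction(E, E + L)   # weight of the early-stage stratum
    nu = Fraction(L, E + L)   # weight of the late-stage stratum
    f = Fraction(e, E)        # early-stage improvement rate
    k = Fraction(l, L)        # late-stage improvement rate
    return mu, f, nu, k

# Hypothetical counts for a single group (illustration only).
e, E, l, L = 17, 20, 24, 80
mu, f, nu, k = decompose(e, E, l, L)
pooled = Fraction(e + l, E + L)

print("weights sum to 1:", mu + nu == 1)
print("Eq. (3) holds:", mu * f + nu * k == pooled)
```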

How can Eq. (4) hold given that Eq. (5) does?

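To see how this can be possible, one must first realize the following: if $0\leq \mu ,\,\nu \leq 1$ and $\mu +\nu =1$, then the weighted average $\mu \,f+\nu \,k$ always lies between $f$ and $k$ (endpoints included), and by a suitable choice of weights it can be made equal to any number in that range.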

Proof of the preceding statement (Proof 1)
First note that if $0\leq \alpha \leq 1$, then also $0\leq 1-\alpha \leq 1$ (multiply $0\leq \alpha \leq 1$ by $-1$ to get $0\geq -\alpha \geq -1,$ and now add $1$ to everything). Now let $p$ be the lesser of $f,\,k$, and $q$ the larger, so that $p\leq q$. If $p$ is $f$ then let $\alpha =\mu$, whereas if $p$ is $k$ let $\alpha =\nu$; in either case $\mu \,f+\nu \,k=\alpha \,p+(1-\alpha )\,q$. Then:
1. Since both $1-\alpha$ and $q-p$ are positive or zero, so is their product: $0\leq (1-\alpha )\,(q-p)$. Add $p$ to both sides to get $p\leq p+(1-\alpha )\,(q-p)=p-(1-\alpha )\,p+(1-\alpha )\,q=\alpha \,p+(1-\alpha )\,q.$
2. Similarly, since $\alpha$ is positive or zero, we have $0\leq \alpha \,(q-p)$. Multiply by $-1$ to get $0\geq -\alpha \,(q-p)$; add $q$ to both sides to get $q\geq q-\alpha \,(q-p)=\alpha \,p+(1-\alpha )\,q.$
Combining 1. and 2. we get $p\leq \alpha \,p+(1-\alpha )\,q\leq q.$ Now replace $p$, $q$, and $\alpha$ by the original symbols, obtaining $\min\{f,\,k\}\leq \mu \,f+\nu \,k\leq \max\{f,\,k\}$; we are done. ⃞
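
The bound just proved can also be spot-checked by brute force. The following sketch is a sanity check on a small grid of values, not a substitute for the proof; the function name `weighted_average` and the grid itself are arbitrary choices made for illustration.

```python
from fractions import Fraction
from itertools import product

def weighted_average(mu, f, k):
    """Return mu*f + nu*k with nu = 1 - mu."""
    return mu * f + (1 - mu) * k

# Grid of exact values 0, 1/4, 1/2, 3/4, 1, used both for mu and for f, k.
values = [Fraction(i, 4) for i in range(5)]

# Collect every (mu, f, k) for which the average falls outside [min, max].
violations = [
    (mu, f, k)
    for mu, f, k in product(values, repeat=3)
    if not (min(f, k) <= weighted_average(mu, f, k) <= max(f, k))
]

print("violations found:", len(violations))
```

The endpoints are reached at the extreme weights: $\mu =1$ gives $f$ and $\mu =0$ gives $k$.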

From this it should be obvious, to begin with, under what circumstances Eq. (4) cannot hold (no matter what weights— $\mu$ 's and $\nu$ 's—one chooses): it cannot hold if both of $\{f_{D},\,k_{D}\}$ are greater than both of $\{f_{P},\,k_{P}\}.$

Proof of the preceding statement (Proof 2)
Assume that both of $\{f_{D},\,k_{D}\}$ are greater than both of $\{f_{P},\,k_{P}\}.$ This means that $\min\{f_{D},\,k_{D}\}>\max\{f_{P},\,k_{P}\}.$ By what was proved in Proof 1 above, the "P" side of Eq. (4) is $\leq \max\{f_{P},\,k_{P}\},$ while the "D" side is $\geq \min\{f_{D},\,k_{D}\}.$ Combining these inequalities, we have "P" side $\leq \max\{f_{P},\,k_{P}\}<\min\{f_{D},\,k_{D}\}\leq$ "D" side, so that the "P" side is strictly less than the "D" side, contradicting Eq. (4). ⃞


So, for Simpson's paradox to be possible, there must be some overlap between the "P" interval (i.e. the interval between $f_{P}$ and $k_{P}$ ) and the "D" interval (i.e. the interval between $f_{D}$ and $k_{D}$ ); see Fig. 1.

Figure 1. An example of how the "P" and "D" intervals could overlap and still be consistent with Eq. (5). Note that one could also have the $k$'s be the lower ends of the intervals, and the $f$'s the higher.

If there is such an overlap, then we will have Simpson's paradox whenever most of the weight on the "P" side is given to the higher end of the interval, and most of the weight on the "D" side, to the lower end. Below we derive the precise condition for this to happen.

Note that Eq. (5) does allow this to happen: it only says what happens when $f$'s are compared with $f$'s, and $k$'s with $k$'s (namely, that the quantity with subscript $P$ is less than the one with subscript $D$). But it says nothing about what happens when $f$'s are compared with $k$'s: it can happen that one of the $P$ quantities is greater than one of the $D$ quantities. (See Fig. 1. A numerical example: $\underbrace {f_{P}} _{1}<\underbrace {f_{D}} _{2}$ and $\underbrace {k_{P}} _{3}<\underbrace {k_{D}} _{4},$ but $\underbrace {k_{P}} _{3}>\underbrace {f_{D}} _{2}.$)
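
To see the numerical example above produce Eq. (4), one still has to pick weights. The sketch below keeps the values 1, 2, 3, 4 from the example and adds one possible choice of weights; the weights are an assumption made for illustration, with most of the "P" weight on the higher end ($k_{P}$) and most of the "D" weight on the lower end ($f_{D}$).

```python
from fractions import Fraction

# Values from the numerical example in the text.
f_P, f_D, k_P, k_D = 1, 2, 3, 4

# Hypothetical weights (an assumption, chosen only for illustration).
mu_P = Fraction(1, 10)   # weight on f_P; the remaining 9/10 goes to k_P
mu_D = Fraction(9, 10)   # weight on f_D; the remaining 1/10 goes to k_D

# The two sides of Eq. (4), using nu = 1 - mu.
side_P = mu_P * f_P + (1 - mu_P) * k_P
side_D = mu_D * f_D + (1 - mu_D) * k_D

print("Eq. (5) holds:", f_P < f_D and k_P < k_D)
print("P side:", side_P, " D side:", side_D)
print("Eq. (4) holds:", side_P >= side_D)
```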

To proceed, it will be convenient to change notation slightly so as to make manifest the ordering among the $f$'s and $k$'s. To start with, we will make it so that the smallest quantity is always denoted by $p_{P}$. This is done as follows: given Eq. (5), the smallest is either $f_{P}$ or $k_{P}$. If it is $f_{P}$, then we replace all the $f$'s by $p$'s and all the $\mu$'s by $\alpha$'s, and also all the $k$'s by $q$'s and all the $\nu$'s by $\beta$'s. In short, $(f,\,\mu )\to (p,\,\alpha )$ and $(k,\,\nu )\to (q,\,\beta ).$ If $k_{P}$ is the smallest, then we do it the other way around, i.e. in the replacement scheme we switch the places of $(f,\,\mu )$ and $(k,\,\nu ).$
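
The relabeling rule can be written out as a small function. This is only a sketch of the bookkeeping: the name `relabel`, its argument order, and the sample input are assumptions made for illustration.

```python
from fractions import Fraction

def relabel(f_P, k_P, f_D, k_D, mu_P, mu_D):
    """Rename (f, mu) and (k, nu) to (p, alpha) and (q, beta) so that p_P <= q_P.

    Only the mu's are passed in, since nu = 1 - mu; likewise beta = 1 - alpha.
    Returns (p_P, q_P, p_D, q_D, alpha_P, alpha_D).
    """
    if f_P <= k_P:
        # f_P is the smallest: f -> p, k -> q, mu -> alpha
        return f_P, k_P, f_D, k_D, mu_P, mu_D
    # k_P is the smallest: k -> p, f -> q, nu -> alpha
    return k_P, f_P, k_D, f_D, 1 - mu_P, 1 - mu_D

# Hypothetical input satisfying Eq. (5), in which k_P is the smallest quantity.
p_P, q_P, p_D, q_D, alpha_P, alpha_D = relabel(
    3, 1, 4, 2, Fraction(9, 10), Fraction(1, 10)
)
print(p_P, q_P, p_D, q_D, alpha_P, alpha_D)
```

The relabeling changes only the names: each weighted average keeps the same value before and after.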

Equation (5) becomes

$p_{P}<p_{D}$ and $q_{P}<q_{D}$ (with $p_{P}<q_{P}$ ),

(6)

while Eq. (4) becomes

$\alpha _{P}\,p_{P}+\beta _{P}\,q_{P}\geq \alpha _{D}\,p_{D}+\beta _{D}\,q_{D}.$

(7)

To visualize this, think of $\alpha \,p+\beta \,q$ as a point on the line segment between $p$ and $q,$ on a number line where the values increase to the right. First, we have the "P" interval, between $p_{P}$ and $q_{P}$ (namely, the interval $\left[p_{P},\,q_{P}\right]$).

There is also the "D" interval, between $p_{D}$ and $q_{D}$. Equation (6) does not settle which of these two endpoints is to the right and which to the left. All it says about $q_{D}$ is that it lies strictly to the right of the "P" interval. But $p_{D}$ may, as far as Eq. (6) is concerned, lie anywhere to the right of the start of the "P" interval (i.e. to the right of $p_{P}$), including to the right of $q_{D}$. However, in order for Simpson's paradox to be possible, the "P" and "D" intervals must overlap, i.e. $p_{D}$ must lie inside the "P" interval (i.e. the ordering must be $p_{P}<p_{D}\leq q_{P}<q_{D}$).

Proof that Simpson's paradox cannot occur if $p_{D}>q_{P}.$ (Proof 3)
This really follows directly from Proof 2 above, but here is a complete proof anyway. Recall that $\min(p,\,q)\leq \alpha \,p+\beta \,q\leq \max(p,\,q).$ Thus $\alpha _{P}\,p_{P}+\beta _{P}\,q_{P}\leq q_{P}$ and $\alpha _{D}\,p_{D}+\beta _{D}\,q_{D}\geq \min\{p_{D},\,q_{D}\}.$ But assuming $p_{D}>q_{P},$ we have that both $p_{D}$ and $q_{D}$ are greater than $q_{P},$ and so $\alpha _{P}\,p_{P}+\beta _{P}\,q_{P}\leq q_{P}<\min\{p_{D},\,q_{D}\}\leq \alpha _{D}\,p_{D}+\beta _{D}\,q_{D}.$ In other words, $\alpha _{P}\,p_{P}+\beta _{P}\,q_{P}<\alpha _{D}\,p_{D}+\beta _{D}\,q_{D},$ contradicting Eq. (7). ⃞
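
The role of the overlap can be illustrated by a brute-force search over weights. The sketch below is not a proof: the function name `paradox_weights`, the grid resolution, and the sample endpoints are all assumptions made for illustration. It lists the pairs $(\alpha _{P},\,\alpha _{D})$ on a grid for which Eq. (7) holds, once for non-overlapping intervals and once for overlapping ones.

```python
from fractions import Fraction
from itertools import product

def paradox_weights(p_P, q_P, p_D, q_D, steps=10):
    """List the grid pairs (alpha_P, alpha_D) for which Eq. (7) holds.

    beta is taken to be 1 - alpha on both sides.
    """
    grid = [Fraction(i, steps) for i in range(steps + 1)]
    return [
        (a_P, a_D)
        for a_P, a_D in product(grid, repeat=2)
        if a_P * p_P + (1 - a_P) * q_P >= a_D * p_D + (1 - a_D) * q_D
    ]

# Hypothetical endpoints, both consistent with Eq. (6).
no_overlap = paradox_weights(1, 2, 3, 4)   # p_D > q_P: intervals are disjoint
overlap = paradox_weights(1, 3, 2, 4)      # p_P < p_D < q_P < q_D

print("disjoint intervals, pairs satisfying Eq. (7):", len(no_overlap))
print("overlapping intervals, some pair satisfies Eq. (7):", len(overlap) > 0)
```

In the overlapping case the pairs that work put little weight on $p_{P}$ and a lot of weight on $p_{D}$, in line with the description given earlier.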


Similarly, we have the "D" interval. Equation (5) says that the beginning of the "P" interval (i.e. its left end) must be to the left of the beginning of the "D" interval, and that the end of the "P" interval (i.e. its right end) must be to the left of the end of the "D" interval. But there is still the possibility that the intervals overlap, i.e. that the "D" interval begins before the "P" interval ends.

To proceed, we first need to figure out the ordering among the $f$ 's and $k$ 's. Given Eq. (5), the smallest is either $f_{P}$ or $k_{P}$ .

Let us relabel, in Eqs. (4) and (5), the $f$ 's and $k$ 's by $p$ 's and $q$ 's (and the $\mu$ 's and $\nu$ 's by $\alpha$ 's and $\beta$ 's), but not necessarily respectively; we do it so that $p_{P}$ is the smallest. In other words, if $f_{P}$ is the smallest, then we replace all the $f$ 's by $p$ 's and all the $k$ 's by $q$ 's (also, all the $\mu$ 's by $\alpha$ 's and all the $\nu$ 's by $\beta$ 's); if $k_{P}$ is the smallest, then we do it the other way around. Equation (5) becomes

$p_{P}<p_{D}$ and $q_{P}<q_{D}$ (with $p_{P}<q_{P}$),

(6)

and we write Eq. (4) as

$\alpha _{P}\,p_{P}+(1-\alpha _{P})\,q_{P}\geq (1-\beta _{D})\,p_{D}+\beta _{D}\,q_{D}.$

(7)

Why we chose to write $\beta$ in terms of $\alpha$ on the left (i.e. $\beta _{P}=1-\alpha _{P}$), but the other way around on the right (i.e. $\alpha _{D}=1-\beta _{D}$), will become apparent in a moment. Note that Eq. (6) says that we have $p_{P}<q_{P}<q_{D};$ one key question is where $p_{D}$ fits in this ordering. We first show that one cannot have Simpson's paradox unless $p_{D}\leq q_{P}$. First note that a linear interpolation $\alpha \,p+(1-\alpha )\,q$, with $0\leq \alpha \leq 1$, must lie at or between $p$ and $q$, i.e.

$\min(p,\,q)\leq \alpha \,p+(1-\alpha )\,q\leq \max(p,\,q).$

(8)

Now if $p_{D}>q_{P}$ then, since $p_{P}<q_{P}<q_{D}$ , we have that

$\max(p_{P},\,q_{P})<\min(p_{D},\,q_{D});$

(9)

but, applying Eq. (8) to $\alpha _{P}\,p_{P}+(1-\alpha _{P})\,q_{P}$ and $(1-\beta _{D})\,p_{D}+\beta _{D}\,q_{D}$ and then using Eq. (9), we get $\alpha _{P}\,p_{P}+(1-\alpha _{P})\,q_{P}\leq \max(p_{P},\,q_{P})<\min(p_{D},\,q_{D})\leq (1-\beta _{D})\,p_{D}+\beta _{D}\,q_{D}$, i.e. the opposite of Eq. (7).

Let us therefore consider the case where $p_{D}\leq q_{P}$. In that case we have $p_{P}<p_{D}\leq q_{P}<q_{D}$.