Statement
If X is a nonnegative random variable and a > 0, then the probability that X is at least a is at most the expectation of X divided by a:[1]
$$\operatorname{P}(X\geq a)\leq \frac{\operatorname{E}(X)}{a}.$$
Let $a=\tilde{a}\cdot \operatorname{E}(X)$ (where $\tilde{a}>0$); then we can rewrite the previous inequality as
$$\operatorname{P}(X\geq \tilde{a}\cdot \operatorname{E}(X))\leq \frac{1}{\tilde{a}}.$$
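As a quick numerical sanity check (not part of the original statement), the bound can be verified by Monte Carlo simulation. The sketch below assumes, purely for illustration, an exponential random variable with mean 1 and uses NumPy.

```python
# Minimal Monte Carlo sketch of Markov's inequality.
# Illustrative assumption: X ~ Exponential(1), so E(X) = 1.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)
mean = x.mean()

for a in (0.5, 1.0, 2.0, 5.0):
    empirical = (x >= a).mean()   # estimate of P(X >= a)
    bound = mean / a              # Markov bound E(X)/a
    print(f"a = {a}: P(X >= a) ~= {empirical:.4f} <= {bound:.4f}")
```

For small a the bound exceeds 1 and is vacuous; it only becomes informative once a is larger than E(X).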
In the language of measure theory, Markov's inequality states that if (X, Σ, μ) is a measure space, $f$ is a measurable extended real-valued function, and ε > 0, then
$$\mu(\{x\in X : |f(x)|\geq \varepsilon\})\leq \frac{1}{\varepsilon}\int_{X}|f|\,d\mu.$$
This measure-theoretic definition is sometimes referred to as Chebyshev's inequality.[2]
Extended version for nondecreasing functions
If φ is a nondecreasing nonnegative function, X is a (not necessarily nonnegative) random variable, and φ(a) > 0, then[3]
$$\operatorname{P}(X\geq a)\leq \frac{\operatorname{E}(\varphi(X))}{\varphi(a)}.$$
An immediate corollary, using higher moments of X supported on values larger than 0, is
$$\operatorname{P}(|X|\geq a)\leq \frac{\operatorname{E}(|X|^{n})}{a^{n}}.$$
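To see why higher moments can give sharper bounds, the following hedged sketch compares the n = 1 and n = 2 bounds for an exponential variable with mean 1; the distribution and the threshold a = 5 are illustrative assumptions, not taken from the text.

```python
# Compare P(|X| >= a) with the bounds E(|X|)/a (n = 1) and E(|X|^2)/a^2 (n = 2).
# Assumes X ~ Exponential(1): E(X) = 1, E(X^2) = 2, true tail P(X >= a) = exp(-a).
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)

a = 5.0
empirical = (x >= a).mean()                      # true tail is exp(-5) ~= 0.0067
first_moment_bound = np.abs(x).mean() / a        # Markov with n = 1
second_moment_bound = (x ** 2).mean() / a ** 2   # n = 2
print(empirical, first_moment_bound, second_moment_bound)
# Expect roughly 0.0067, 0.2 and 0.08: the n = 2 bound is the tighter of the two here.
```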
We separate the case in which the measure space is a probability space from the more general case because the probability case is more accessible for the general reader.
Intuition
Markov's inequality can be motivated by conditioning on whether $X<a$ or $X\geq a$: for a nonnegative random variable $X$ and $a>0$,
$$\operatorname{E}(X)=\operatorname{P}(X<a)\cdot \operatorname{E}(X\mid X<a)+\operatorname{P}(X\geq a)\cdot \operatorname{E}(X\mid X\geq a),$$
where $\operatorname{E}(X\mid X<a)$ is larger than or equal to 0 because the random variable $X$ is non-negative, and $\operatorname{E}(X\mid X\geq a)$ is larger than or equal to $a$ because the conditional expectation only takes into account the values larger than or equal to $a$ that $X$ can take.
Hence, intuitively,
$$\operatorname{E}(X)\geq \operatorname{P}(X\geq a)\cdot \operatorname{E}(X\mid X\geq a)\geq a\cdot \operatorname{P}(X\geq a),$$
which directly leads to
$$\operatorname{P}(X\geq a)\leq \frac{\operatorname{E}(X)}{a}.$$
Probability-theoretic proof
Method 1:
From the definition of expectation, writing $f$ for the probability density function of $X$:
$$\operatorname{E}(X)=\int_{-\infty}^{\infty}xf(x)\,dx$$
However, since $X$ is a non-negative random variable,
$$\operatorname{E}(X)=\int_{-\infty}^{\infty}xf(x)\,dx=\int_{0}^{\infty}xf(x)\,dx.$$
From this we can derive,
$$\operatorname{E}(X)=\int_{0}^{a}xf(x)\,dx+\int_{a}^{\infty}xf(x)\,dx\geq \int_{a}^{\infty}xf(x)\,dx\geq \int_{a}^{\infty}af(x)\,dx=a\int_{a}^{\infty}f(x)\,dx=a\operatorname{Pr}(X\geq a)$$
From here, dividing through by $a$ allows us to see that
$$\Pr(X\geq a)\leq \operatorname{E}(X)/a.$$
Method 2:
For any event $E$, let $I_{E}$ be the indicator random variable of $E$, that is, $I_{E}=1$ if $E$ occurs and $I_{E}=0$ otherwise.
Using this notation, we have $I_{(X\geq a)}=1$ if the event $X\geq a$ occurs, and $I_{(X\geq a)}=0$ if $X<a$. Then, given $a>0$,
$$aI_{(X\geq a)}\leq X,$$
which is clear if we consider the two possible cases of $X\geq a$. If $X<a$, then $I_{(X\geq a)}=0$, and so $aI_{(X\geq a)}=0\leq X$. Otherwise, we have $X\geq a$, for which $I_{(X\geq a)}=1$ and so $aI_{(X\geq a)}=a\leq X$.
Since $\operatorname{E}$ is monotone, taking the expectation of both sides of an inequality cannot reverse it. Therefore,
$$\operatorname{E}(aI_{(X\geq a)})\leq \operatorname{E}(X).$$
Now, using linearity of expectations, the left side of this inequality is the same as
$$a\operatorname{E}(I_{(X\geq a)})=a(1\cdot \operatorname{P}(X\geq a)+0\cdot \operatorname{P}(X<a))=a\operatorname{P}(X\geq a).$$
Thus we have
$$a\operatorname{P}(X\geq a)\leq \operatorname{E}(X),$$
and since a > 0, we can divide both sides by a.
Measure-theoretic proof
We may assume that the function $f$ is non-negative, since only its absolute value enters into the inequality. Now, consider the real-valued function $s$ on $X$ given by
$$s(x)={\begin{cases}\varepsilon ,&{\text{if }}f(x)\geq \varepsilon \\0,&{\text{if }}f(x)<\varepsilon \end{cases}}$$
Then $0\leq s(x)\leq f(x)$. By the definition of the Lebesgue integral,
$$\int_{X}f(x)\,d\mu \geq \int_{X}s(x)\,d\mu =\varepsilon\,\mu(\{x\in X : f(x)\geq \varepsilon\}),$$
and since $\varepsilon >0$, both sides can be divided by $\varepsilon$, obtaining
$$\mu(\{x\in X : f(x)\geq \varepsilon\})\leq \frac{1}{\varepsilon}\int_{X}f\,d\mu.$$
Discrete case
We now provide a proof for the special case when $X$ is a discrete random variable which only takes on non-negative integer values.
Let $a$ be a positive integer. By definition,
$$\begin{aligned}
a\operatorname{Pr}(X>a)&=a\operatorname{Pr}(X=a+1)+a\operatorname{Pr}(X=a+2)+a\operatorname{Pr}(X=a+3)+\cdots \\
&\leq a\operatorname{Pr}(X=a)+(a+1)\operatorname{Pr}(X=a+1)+(a+2)\operatorname{Pr}(X=a+2)+\cdots \\
&\leq \operatorname{Pr}(X=1)+2\operatorname{Pr}(X=2)+3\operatorname{Pr}(X=3)+\cdots \\
&\qquad +a\operatorname{Pr}(X=a)+(a+1)\operatorname{Pr}(X=a+1)+(a+2)\operatorname{Pr}(X=a+2)+\cdots \\
&=\operatorname{E}(X).
\end{aligned}$$
Dividing by $a$ yields the desired result.
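As an illustration of the discrete bound (an assumed example, not from the article), the sketch below checks a · Pr(X > a) ≤ E(X) for a Poisson random variable with mean 4, summing the probability mass function directly.

```python
# Check a * Pr(X > a) <= E(X) for X ~ Poisson(lam); lam and a are illustrative choices.
from math import exp, factorial

lam = 4.0                          # E(X) = lam for X ~ Poisson(lam)

def pmf(k: int) -> float:
    """Poisson(lam) probability mass function."""
    return exp(-lam) * lam ** k / factorial(k)

a = 10
tail = sum(pmf(k) for k in range(a + 1, 100))   # Pr(X > a), tail truncated at k = 99
print(a * tail, "<=", lam)                      # roughly 0.028 <= 4.0
```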
Corollaries
Chebyshev's inequality
Chebyshev's inequality uses the variance to bound the probability that a random variable deviates far from the mean. Specifically,
$$\operatorname{P}(|X-\operatorname{E}(X)|\geq a)\leq \frac{\operatorname{Var}(X)}{a^{2}},$$
for any a > 0.[3] Here Var(X) is the variance of X, defined as:
$$\operatorname{Var}(X)=\operatorname{E}[(X-\operatorname{E}(X))^{2}].$$
Chebyshev's inequality follows from Markov's inequality by considering the random variable $(X-\operatorname{E}(X))^{2}$ and the constant $a^{2}$, for which Markov's inequality reads
$$\operatorname{P}((X-\operatorname{E}(X))^{2}\geq a^{2})\leq \frac{\operatorname{Var}(X)}{a^{2}}.$$
This argument can be summarized (where "MI" indicates use of Markov's inequality):
$$\operatorname{P}(|X-\operatorname{E}(X)|\geq a)=\operatorname{P}\left((X-\operatorname{E}(X))^{2}\geq a^{2}\right)\,\overset{\mathrm{MI}}{\leq}\,\frac{\operatorname{E}\left((X-\operatorname{E}(X))^{2}\right)}{a^{2}}=\frac{\operatorname{Var}(X)}{a^{2}}.$$
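The corollary can also be checked numerically; the sketch below is a minimal illustration assuming a uniform(0, 1) variable, for which Var(X) = 1/12.

```python
# Numeric check of Chebyshev's inequality for X ~ Uniform(0, 1) (illustrative assumption).
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=1_000_000)
mu, var = x.mean(), x.var()          # roughly 0.5 and 1/12

for a in (0.25, 0.4, 0.49):
    empirical = (np.abs(x - mu) >= a).mean()   # P(|X - E(X)| >= a)
    bound = var / a ** 2                       # Chebyshev bound Var(X)/a^2
    print(f"a = {a}: {empirical:.4f} <= {bound:.4f}")
```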
Other corollaries
The "monotonic" result can be demonstrated by:
$$\operatorname{P}(|X|\geq a)=\operatorname{P}{\big (}\varphi(|X|)\geq \varphi(a){\big )}\,\overset{\mathrm{MI}}{\leq}\,\frac{\operatorname{E}(\varphi(|X|))}{\varphi(a)}.$$
The result that, for a nonnegative random variable X, the quantile function of X satisfies:
$$Q_{X}(1-p)\leq \frac{\operatorname{E}(X)}{p},$$
the proof of which uses
$$p\leq \operatorname{P}(X\geq Q_{X}(1-p))\,\overset{\mathrm{MI}}{\leq}\,\frac{\operatorname{E}(X)}{Q_{X}(1-p)}.$$
Let $M\succeq 0$ be a self-adjoint matrix-valued random variable and a > 0. Then
$$\operatorname{P}(M\npreceq a\cdot I)\leq \frac{\operatorname{tr}\left(\operatorname{E}(M)\right)}{na}$$
can be shown in a similar manner.
Examples
Assuming no income is negative, Markov's inequality shows that no more than 10% (1/10) of the population can have more than 10 times the average income.[4]
Another simple example is as follows: Andrew makes 4 mistakes on average on a randomly chosen test in his Statistics course. The best upper bound on the probability that Andrew will make at least 10 mistakes is 0.4, since
$$\operatorname{P}(X\geq 10)\leq \frac{\operatorname{E}(X)}{10}=\frac{4}{10}.$$
Note that Andrew might make exactly 10 mistakes with probability 0.4 and no mistakes with probability 0.6; the expectation is then exactly 4 mistakes, so this bound cannot be improved.
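The tightness claim can be confirmed by a direct computation on the two-point distribution just described (a trivial sketch, not from the source):

```python
# Two-point distribution from the example: P(X = 10) = 0.4, P(X = 0) = 0.6.
mean = 0.4 * 10 + 0.6 * 0   # expectation is exactly 4
bound = mean / 10           # Markov bound on P(X >= 10)
print(mean, bound)          # 4.0 0.4 -- the bound is attained with equality
```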
References
1. Huber, Mark (2019-11-26). "Halving the Bounds for the Markov, Chebyshev, and Chernoff Inequalities Using Smoothing". The American Mathematical Monthly. 126 (10): 915–927. arXiv:1803.06361. doi:10.1080/00029890.2019.1656484. ISSN 0002-9890.
2. Stein, E. M.; Shakarchi, R. (2005). Real Analysis. Princeton Lectures in Analysis. Vol. 3 (1st ed.). p. 91.
3. Lin, Zhengyan (2010). Probability Inequalities. Springer. p. 52.
4. Ross, Kevin. "5.4 Probability inequalities". An Introduction to Probability and Simulation.