Let $\mathcal{X}$ and $\mathcal{Y}$ be sets. A channel, with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, is a sequence of conditional probability distributions $\{p^{(n)}(y^n\mid x^n)\}_{n=1}^{\infty}$ where
$$p^{(n)}(y^n\mid x^n):\mathcal{Y}^{n}\times\mathcal{X}^{n}\rightarrow\mathbb{R}_{\geq 0}.$$
The channel is said to be discrete if both $\mathcal{X}$ and $\mathcal{Y}$ are finite sets. A channel is called memoryless if for every positive integer $n$ we have
$$p^{(n)}(y^n\mid x^n)=\prod_{i=1}^{n}p_i(y_i\mid x_i).$$
We say a channel is stationary if every stationary input results in a stationary output. A memoryless channel is stationary if the functions $p_i(\cdot\mid\cdot)$ are equal for all $i$. Therefore a stationary memoryless channel can simply be represented as the triple
$$(\mathcal{X},\,p(y\mid x),\,\mathcal{Y}).$$
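As a concrete illustration (not part of the original text), a stationary memoryless channel over finite alphabets can be stored as a single row-stochastic matrix whose rows are the distributions $p(\cdot\mid x)$, with the $n$-letter law obtained as a product of per-letter factors. The Python sketch below invents a small `Channel` class and uses a binary symmetric channel as the example:

```python
import numpy as np

class Channel:
    """A stationary memoryless channel (X, p(y|x), Y) over finite alphabets."""

    def __init__(self, P):
        # P[x, y] = p(y|x); every row must be a probability distribution.
        self.P = np.asarray(P, dtype=float)
        assert np.allclose(self.P.sum(axis=1), 1.0), "rows of p(y|x) must sum to 1"

    def p_block(self, x_seq, y_seq):
        """p^{(n)}(y^n | x^n) = prod_i p(y_i | x_i), by memorylessness."""
        return float(np.prod([self.P[x, y] for x, y in zip(x_seq, y_seq)]))

# Example: binary symmetric channel with crossover probability 0.1.
bsc = Channel([[0.9, 0.1],
               [0.1, 0.9]])
print(bsc.p_block([0, 1, 1], [0, 1, 0]))  # 0.9 * 0.9 * 0.1 = 0.081
```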
A $(2^{nR},n)$ code consists of a message set
$$\mathcal{W}=\{1,\dots,\lceil 2^{nR}\rceil\},$$
an encoder
$$f_{n}:\mathcal{W}\rightarrow\mathcal{X}^{n},$$
and a decoder
$$g_{n}:\mathcal{Y}^{n}\rightarrow\mathcal{W}.$$
The average probability of error of the code is given by
$$P_{e}^{(n)}=\frac{1}{M}\sum_{(w,y^n):\,g_{n}(y^n)\neq w}p(y^n\mid f_{n}(w)),$$
where $M=\lceil 2^{nR}\rceil$ is the number of messages. The value of $n$ is known as the blocklength of the code.
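To illustrate the definition (the length-3 repetition code and majority decoder below are hypothetical examples, not taken from the text), the average error probability of a small code can be computed by summing $p(y^n\mid f_n(w))$ over all pairs $(w,y^n)$ that decode incorrectly:

```python
from itertools import product
import numpy as np

P = np.array([[0.9, 0.1],                 # p(y|x) for a BSC(0.1)
              [0.1, 0.9]])

n = 3
messages = [0, 1]                         # message set, indexed from 0; M = 2
encoder = {0: (0, 0, 0), 1: (1, 1, 1)}    # f_n: repetition code

def decoder(y):                           # g_n: majority vote
    return int(sum(y) >= 2)

def p_block(xs, ys):
    """p(y^n | x^n) for the memoryless channel."""
    return float(np.prod([P[x, y] for x, y in zip(xs, ys)]))

M = len(messages)
# P_e^{(n)} = (1/M) * sum over (w, y^n) with g_n(y^n) != w of p(y^n | f_n(w))
P_e = sum(p_block(encoder[w], y)
          for w in messages
          for y in product([0, 1], repeat=n)
          if decoder(y) != w) / M
print(P_e)   # 3 * 0.1^2 * 0.9 + 0.1^3 = 0.028
```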
A rate $R$ (which is a nonnegative number) is said to be achievable if there exists a sequence of $(2^{nR},n)$ codes with $P_{e}^{(n)}$ going to zero as $n$ goes to infinity. The noisy-channel coding theorem states that a rate $R$ is achievable if and only if $R$ is smaller than the capacity $C$ of the channel, where
$$C=\max_{p(x)}I(X;Y).$$
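The maximization over input distributions can be carried out numerically for small discrete memoryless channels. The sketch below is illustrative only (the BSC(0.1) example and the brute-force grid search are not from the text); it computes $C$ in nats and compares it with the known closed form $\log 2-H_{b}(0.1)$ for the binary symmetric channel:

```python
import numpy as np

P = np.array([[0.9, 0.1],    # p(y|x) for a BSC(0.1); all entries positive
              [0.1, 0.9]])

def mutual_information(px, P):
    """I(X;Y) in nats for input distribution px and channel matrix P."""
    py = px @ P                                      # p(y) = sum_x p(x) p(y|x)
    return float(np.sum(px[:, None] * P * np.log(P / py)))

# C = max_{p(x)} I(X;Y): brute-force search over binary input distributions.
grid = np.linspace(0.0, 1.0, 10001)
C = max(mutual_information(np.array([q, 1 - q]), P) for q in grid)

Hb = -(0.1 * np.log(0.1) + 0.9 * np.log(0.9))        # binary entropy of 0.1, in nats
print(C, np.log(2) - Hb)                              # both ~0.368 nats (~0.531 bits)
```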
Wolfowitz's theorem states that for any discrete memoryless channel with capacity $C$ and any $(2^{nR},n)$ code with rate $R>C$,
$$P_{e}^{(n)}\geq 1-\frac{4A}{n(R-C)^{2}}-e^{-\frac{n(R-C)}{2}}$$
for some positive constant $A$ that depends on the channel but not on $n$ or $M$. The proof which follows is based on Gallager's book.[1]
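To see how the bound behaves, one can simply evaluate its right-hand side; the values of $A$, $R$ and $C$ below are purely illustrative placeholders, not quantities taken from the text. The bound is vacuous (negative) for small $n$ and forces $P_{e}^{(n)}$ toward 1 as $n$ grows:

```python
import numpy as np

def wolfowitz_lower_bound(n, R, C, A):
    """Lower bound on P_e^{(n)} for a rate-R code on a channel with capacity C."""
    return 1 - 4 * A / (n * (R - C) ** 2) - np.exp(-n * (R - C) / 2)

# Hypothetical numbers: capacity 0.37 nats, rate 0.5 nats, variance constant A = 2.
for n in [10, 100, 1000, 10000]:
    print(n, wolfowitz_lower_bound(n, R=0.5, C=0.37, A=2.0))
# Output goes from a useless negative bound at n = 10 to roughly 0.95 at n = 10000.
```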
For the proof we first require a lemma. This lemma is essentially a special case of the method of Lagrange multipliers for a concave function defined on the standard simplex $\Delta^{n}$. It is then followed by a corollary which simply applies the lemma to the mutual information.
Let $f:\Delta^{n}\rightarrow\mathbb{R}$ be a concave function. Suppose $f$ has continuous partial derivatives on its domain. Then $\alpha^{*}=(\alpha_{k}^{*})_{k=1}^{n}\in\Delta^{n}$ maximizes $f$ if and only if there exists some real $\lambda$ such that for every $k\in\mathrm{supp}(\alpha^{*})$,
$$\frac{\partial f}{\partial\alpha_{k}}\bigg|_{\alpha=\alpha^{*}}=\lambda,$$
and for every $k\notin\mathrm{supp}(\alpha^{*})$,
$$\frac{\partial f}{\partial\alpha_{k}}\bigg|_{\alpha=\alpha^{*}}\leq\lambda.$$
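A quick numerical illustration of the lemma (not part of the original argument): the entropy $f(\alpha)=-\sum_{k}\alpha_{k}\ln\alpha_{k}$ is concave on $\Delta^{n}$ with continuous partials on the interior, and at its maximizer, the uniform distribution, every partial derivative $-\ln\alpha_{k}-1$ takes the common value $\lambda=\ln n-1$:

```python
import numpy as np

def grad_entropy(alpha):
    """Partial derivatives of f(alpha) = -sum_k alpha_k ln(alpha_k)."""
    return -np.log(alpha) - 1.0

n = 4
alpha_star = np.full(n, 1.0 / n)          # uniform point of the simplex
grad = grad_entropy(alpha_star)
lam = grad[0]

# Every coordinate of supp(alpha*) shares the same partial derivative lambda,
# which is the lemma's condition for a maximizer (here supp is everything).
print(grad, np.allclose(grad, lam))       # [0.386...]*4, True

# At a non-uniform point the partials differ, so the condition fails there.
print(grad_entropy(np.array([0.7, 0.1, 0.1, 0.1])))
```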
Proof of Lemma
Suppose $\alpha^{*}\in\Delta^{n}$ satisfies the above conditions. We will show that $f$ achieves its maximum at $\alpha^{*}$. Let $\alpha$ be any element of $\Delta^{n}$. By the concavity of $f$, for any $\theta\in[0,1]$ we have
$$\theta f(\alpha)+(1-\theta)f(\alpha^{*})\leq f(\theta\alpha+(1-\theta)\alpha^{*}),$$
thus
$$f(\alpha)-f(\alpha^{*})\leq\frac{f(\theta\alpha+(1-\theta)\alpha^{*})-f(\alpha^{*})}{\theta}.$$
Letting $\theta\rightarrow 0^{+}$ and making use of the continuity of the partial derivatives results in
$$\begin{aligned}f(\alpha)-f(\alpha^{*})&\leq\frac{df(\theta\alpha+(1-\theta)\alpha^{*})}{d\theta}\Big|_{\theta=0}\\&=\sum_{k=1}^{n}\frac{\partial f}{\partial\alpha_{k}}\Big|_{\alpha=\alpha^{*}}(\alpha_{k}-\alpha_{k}^{*})\\&\leq\lambda\sum_{k=1}^{n}(\alpha_{k}-\alpha_{k}^{*})=0.\end{aligned}$$
For the other direction, suppose $\alpha^{*}$ maximizes $f$. Then for every $\alpha\in\Delta^{n}$ and every $\theta\in[0,1]$,
$$f(\theta\alpha+(1-\theta)\alpha^{*})-f(\alpha^{*})\leq 0.$$
This implies
$$\frac{df(\theta\alpha+(1-\theta)\alpha^{*})}{d\theta}\Big|_{\theta=0^{+}}\leq 0,$$
and by the continuity of the partial derivatives,
$$\sum_{k=1}^{n}\frac{\partial f}{\partial\alpha_{k}}\Big|_{\alpha^{*}}(\alpha_{k}-\alpha_{k}^{*})\leq 0.\qquad (1)$$
Since $\alpha^{*}\in\Delta^{n}$, at least one of its components, say $\alpha_{1}^{*}$, is strictly positive. Now let $j$ be an arbitrary element of $\{2,\dots,n\}$. Furthermore, for every $k\in\{1,\dots,n\}$, let $e_{k}$ denote the element of $\Delta^{n}$ consisting of all zeros except for a one in the $k$-th position. Define
$$\alpha=\alpha^{*}+\alpha_{1}^{*}(e_{j}-e_{1}).$$
Then inequality (1) simplifies to
$$\frac{\partial f}{\partial\alpha_{j}}\Big|_{\alpha^{*}}\leq\frac{\partial f}{\partial\alpha_{1}}\Big|_{\alpha^{*}}.$$
In addition, if $\alpha_{j}^{*}>0$ and we define
$$\alpha=\alpha^{*}+\alpha_{j}^{*}(e_{1}-e_{j}),$$
then (1) results in
$$\frac{\partial f}{\partial\alpha_{1}}\Big|_{\alpha^{*}}\leq\frac{\partial f}{\partial\alpha_{j}}\Big|_{\alpha^{*}}.$$
Thus if we define $\lambda$ as
$$\lambda=\frac{\partial f}{\partial\alpha_{1}}\Big|_{\alpha^{*}},$$
the result follows.
Corollary
For any discrete memoryless channel the distribution $p^{*}(x)$ achieves capacity if and only if there exists a real number $C$ such that $I^{*}(x;Y)=C$ for $x\in\mathrm{supp}(p^{*})$, and $I^{*}(x;Y)\leq C$ for $x\notin\mathrm{supp}(p^{*})$, where
$$I^{*}(x;Y)=\sum_{y}p(y\mid x)\log\frac{p(y\mid x)}{\sum_{x'}p^{*}(x')p(y\mid x')}.$$
Furthermore, $C$ is the capacity of the channel.
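As an illustrative check of the corollary (the BSC(0.1) and the uniform candidate input distribution are assumptions, not from the text), one can compute $I^{*}(x;Y)$ for each input symbol and confirm that the two values coincide, giving the capacity $C$:

```python
import numpy as np

P = np.array([[0.9, 0.1],      # p(y|x) for a BSC(0.1)
              [0.1, 0.9]])
p_star = np.array([0.5, 0.5])  # candidate capacity-achieving input distribution

py = p_star @ P                # p*(y) = sum_{x'} p*(x') p(y|x')

def I_star(x):
    """I*(x;Y) = sum_y p(y|x) log( p(y|x) / p*(y) ), in nats."""
    return float(np.sum(P[x] * np.log(P[x] / py)))

values = [I_star(x) for x in range(2)]
print(values)                              # equal for both symbols in supp(p*)
print(np.isclose(values[0], values[1]))    # True -> p* achieves capacity, C = values[0]
```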
Proof of Corollary
The proof of the corollary is straightforward and follows directly from the lemma, applied to the mutual information viewed as a concave function of the input distribution on the simplex. To see this, note that for any $x,x'\in\mathcal{X}$ (taking the logarithm to base $e$),
$$\frac{\partial I(x';Y)}{\partial p(x)}=-\sum_{y}\frac{p(y\mid x')\,p(y\mid x)}{p(y)}.$$
Since
$$I(X;Y)=\sum_{x'}p(x')I(x';Y),$$
this implies
$$\begin{aligned}\frac{\partial I(X;Y)}{\partial p(x)}&=I(x;Y)+\sum_{x'}p(x')\frac{\partial I(x';Y)}{\partial p(x)}\\&=I(x;Y)-\sum_{y}p(y\mid x)\,\frac{\sum_{x'}p(x')p(y\mid x')}{p(y)}=I(x;Y)-1.\end{aligned}$$
Now using the lemma, the claim follows: the partial derivatives equal some $\lambda$ on the support of the maximizing distribution and are at most $\lambda$ elsewhere, so $I(x;Y)$ equals the constant $C=\lambda+1$ on the support and is at most $C$ elsewhere.
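Because this derivative computation is easy to get wrong by a constant, here is a finite-difference sanity check (an illustrative sketch: the channel matrix is arbitrary, logarithms are natural, and $p(x)$ is perturbed as a free coordinate without renormalizing) confirming $\partial I(X;Y)/\partial p(x)\approx I(x;Y)-1$:

```python
import numpy as np

P = np.array([[0.8, 0.1, 0.1],   # an arbitrary 2-input, 3-output channel p(y|x)
              [0.1, 0.2, 0.7]])

def I_joint(px):
    """I(X;Y) in nats, with p(y) = sum_x p(x) p(y|x); px treated as free coordinates."""
    py = px @ P
    return float(np.sum(px[:, None] * P * np.log(P / py)))

def I_single(x, px):
    """I(x;Y) = sum_y p(y|x) log( p(y|x) / p(y) )."""
    py = px @ P
    return float(np.sum(P[x] * np.log(P[x] / py)))

px = np.array([0.3, 0.7])
delta = 1e-6
for x in range(2):
    bumped = px.copy()
    bumped[x] += delta                       # perturb p(x) without renormalizing
    numerical = (I_joint(bumped) - I_joint(px)) / delta
    print(numerical, I_single(x, px) - 1.0)  # the two columns agree
```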
Proof of the Strong Converse
For any two random variables $X$ and $Y$, define the information density as
$$i(X,Y)=\log\frac{p(Y\mid X)}{p(Y)}.$$
Note that
$$I(x;Y)=\mathbb{E}[i(X,Y)\mid X=x],\qquad\text{and}\qquad I(X;Y)=\mathbb{E}[i(X,Y)].$$
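Both identities are easy to confirm numerically; the sketch below is illustrative only (natural logarithms, BSC(0.1) with uniform input, none of which are stated in the text) and averages the information density over $p(y\mid x)$ and over the joint law:

```python
import numpy as np

P = np.array([[0.9, 0.1],      # p(y|x) for a BSC(0.1)
              [0.1, 0.9]])
px = np.array([0.5, 0.5])
py = px @ P

i_density = np.log(P / py)                           # i(x, y) = log p(y|x)/p(y)

I_cond = (P * i_density).sum(axis=1)                 # E[i(X,Y) | X = x] for each x
I_full = float((px[:, None] * P * i_density).sum())  # E[i(X,Y)]

print(I_cond)                    # equals I(x;Y) for each input symbol
print(I_full, px @ I_cond)       # both equal I(X;Y)
```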
Let $p^{*}(y)$ be the capacity-achieving output distribution. For any positive integer $n$, define
$$p^{*}(y^n)=\prod_{i=1}^{n}p^{*}(y_{i}).$$
For any $(x^n,y^n)\in\mathcal{X}^{n}\times\mathcal{Y}^{n}$, define
$$i(x^n,y^n)=\log\frac{p(y^n\mid x^n)}{p^{*}(y^n)}=\sum_{i=1}^{n}i(x_{i},y_{i}),$$
where
$$i(x_{i},y_{i})=\log\frac{p(y_{i}\mid x_{i})}{p^{*}(y_{i})}.$$
Consider a $(2^{nR},n)$ code with codewords $\{x_{m}^{n}\}_{m=1}^{M}$ and decoding regions $\{D_{m}\}_{m=1}^{M}$. Then the probability that a codeword is decoded correctly is given by
$$P_{c}=\frac{1}{M}\sum_{m=1}^{M}\sum_{y^n\in D_{m}}p(y^n\mid x_{m}^{n}).$$
Fix a positive $\epsilon$. For every $m$, define the set
$$B_{m}=\{y^n:i(x_{m}^{n},y^n)>n(C+\epsilon)\}.$$
Then
$$P_{c}=\frac{1}{M}\sum_{m=1}^{M}\sum_{y^n\in D_{m}\cap B_{m}}p(y^n\mid x_{m}^{n})+\frac{1}{M}\sum_{m=1}^{M}\sum_{y^n\in D_{m}\cap B_{m}^{c}}p(y^n\mid x_{m}^{n}).$$
Based on the definition of $B_{m}$, on $B_{m}^{c}$ we have $p(y^n\mid x_{m}^{n})\leq e^{n(C+\epsilon)}p^{*}(y^n)$, so the second sum can be upper bounded as
$$\begin{aligned}\frac{1}{M}\sum_{m=1}^{M}\sum_{y^n\in D_{m}\cap B_{m}^{c}}p(y^n\mid x_{m}^{n})&\leq\frac{1}{M}\sum_{m=1}^{M}\sum_{y^n\in D_{m}\cap B_{m}^{c}}e^{n(C+\epsilon)}p^{*}(y^n)\\&=\frac{e^{n(C+\epsilon)}}{M}\sum_{m=1}^{M}\sum_{y^n\in D_{m}\cap B_{m}^{c}}p^{*}(y^n)\leq\frac{1}{M}e^{n(C+\epsilon)},\end{aligned}$$
where the final inequality holds because the decoding regions $D_{m}$ are disjoint, so the double sum of $p^{*}(y^n)$ is at most $1$.
Using Chebyshev's inequality we can find an upper bound on the first sum. Since the channel is memoryless, the $i(X_{i},Y_{i})$ are conditionally independent given $X^{n}=x_{m}^{n}$, and $\mathbb{E}[i(X_{i},Y_{i})\mid X_{i}=x_{mi}]=I(x_{mi};Y_{i})\leq C$. Therefore
$$\begin{aligned}\sum_{y^n\in D_{m}\cap B_{m}}p(y^n\mid x_{m}^{n})&\leq\sum_{y^n\in B_{m}}p(y^n\mid x_{m}^{n})\\&=\mathrm{Pr}\big[i(X^{n},Y^{n})>n(C+\epsilon)\mid X^{n}=x_{m}^{n}\big]\\&=\mathrm{Pr}\Big[\sum_{i=1}^{n}i(X_{i},Y_{i})>n(C+\epsilon)\mid X^{n}=x_{m}^{n}\Big]\\&\leq\frac{\sum_{i=1}^{n}\mathrm{Var}\big(i(X_{i},Y_{i})\mid X_{i}=x_{mi}\big)}{\Big(n(C+\epsilon)-\sum_{i=1}^{n}\mathbb{E}[i(X_{i},Y_{i})\mid X_{i}=x_{mi}]\Big)^{2}}\\&=\frac{\sum_{i=1}^{n}\mathrm{Var}\big(i(X_{i},Y_{i})\mid X_{i}=x_{mi}\big)}{\Big(n(C+\epsilon)-\sum_{i=1}^{n}I(x_{mi};Y_{i})\Big)^{2}}\\&\leq\frac{1}{n^{2}\epsilon^{2}}\sum_{i=1}^{n}\mathrm{Var}\big(i(X_{i},Y_{i})\mid X_{i}=x_{mi}\big),\end{aligned}$$
where the last step uses $\sum_{i=1}^{n}I(x_{mi};Y_{i})\leq nC$, so that the denominator is at least $(n\epsilon)^{2}$.
Here
$$\mathrm{Var}\big(i(X_{i},Y_{i})\mid X_{i}=x_{mi}\big)=\sum_{y}p(y\mid x_{mi})\Big(\log\frac{p(y\mid x_{mi})}{p^{*}(y)}\Big)^{2}-\Big(\sum_{y}p(y\mid x_{mi})\log\frac{p(y\mid x_{mi})}{p^{*}(y)}\Big)^{2}.$$
If we define
$$A=\max_{x\in\mathcal{X}}\mathrm{Var}\big(i(X,Y)\mid X=x\big)$$
(note that $A$ depends only on the channel and is independent of $n$ and $M$), then
$$\sum_{y^n\in B_{m}}p(y^n\mid x_{m}^{n})\leq\frac{A}{n\epsilon^{2}}.$$
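For a concrete channel, $A$ can be computed directly from the variance formula above. The sketch below is illustrative (BSC(0.1) with its uniform capacity-achieving output distribution, natural logarithms; none of these numbers come from the text) and evaluates $A$ and the resulting bound $A/(n\epsilon^{2})$:

```python
import numpy as np

P = np.array([[0.9, 0.1],         # p(y|x) for a BSC(0.1)
              [0.1, 0.9]])
p_star_y = np.array([0.5, 0.5])   # capacity-achieving output distribution

def var_info_density(x):
    """Var( i(X,Y) | X = x ) = E[i^2 | x] - (E[i | x])^2, in nats^2."""
    i_vals = np.log(P[x] / p_star_y)
    mean = float(np.sum(P[x] * i_vals))
    second = float(np.sum(P[x] * i_vals ** 2))
    return second - mean ** 2

A = max(var_info_density(x) for x in range(2))   # A = max_x Var(i(X,Y) | X = x)
print(A)                                          # ~0.43 nats^2

n, eps = 1000, 0.05
print(A / (n * eps ** 2))   # bound on the first sum; vanishes as n grows
```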
Therefore
$$P_{c}\leq\frac{A}{n\epsilon^{2}}+\frac{1}{M}e^{n(C+\epsilon)}.$$
Since $M=\lceil 2^{nR}\rceil$, we get
$$\begin{aligned}P_{e}^{(n)}&\geq 1-\frac{A}{n\epsilon^{2}}-\frac{1}{M}e^{n(C+\epsilon)}\\&\geq 1-\frac{A}{n\epsilon^{2}}-e^{n(C-R+\epsilon)}.\end{aligned}$$
If $R>C$, setting $\epsilon=\frac{R-C}{2}$ results in
$$P_{e}^{(n)}\geq 1-\frac{4A}{n(R-C)^{2}}-e^{-\frac{n}{2}(R-C)}.$$