# Fano's inequality

In information theory, Fano's inequality (also known as the Fano converse and the Fano lemma) relates the average information lost in a noisy channel to the probability of a categorization error. It was derived by Robert Fano in the early 1950s while teaching a Ph.D. seminar in information theory at MIT, and later recorded in his 1961 textbook.

It is used to find a lower bound on the error probability of any decoder, as well as lower bounds for minimax risks in density estimation.

Let the random variables $X$ and $Y$ represent input and output messages with a joint probability $P(x,y)$. Let $e$ represent an occurrence of error; i.e., that $X\neq {\tilde {X}}$, with ${\tilde {X}}=f(Y)$ being an approximate version of $X$. Fano's inequality is

$H(X|Y)\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1),$ where ${\mathcal {X}}$ denotes the support of $X$,

$H\left(X|Y\right)=-\sum _{i,j}P(x_{i},y_{j})\log P\left(x_{i}|y_{j}\right)$ is the conditional entropy,

$P(e)=P(X\neq {\tilde {X}})$ is the probability of the communication error, and

$H_{b}(e)=-P(e)\log P(e)-(1-P(e))\log(1-P(e))$ is the corresponding binary entropy.
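As a concrete check, the quantities above can be computed numerically for a small channel. The following sketch (the channel, its parameters, and the decoder are illustrative choices, not taken from the text) evaluates both sides of the inequality for a symmetric three-symbol channel decoded by ${\tilde {X}}=f(Y)=Y$:

```python
import math

# Illustrative toy channel: X uniform on {0, 1, 2}; with probability 0.9 the
# channel transmits X correctly, otherwise the output is one of the two wrong
# symbols, each with probability 0.05. The decoder guesses X_hat = Y.
n = 3
p_correct = 0.9
p_wrong = 0.05
# Joint distribution P(x, y); each input symbol has probability 1/n.
P = [[(p_correct if x == y else p_wrong) / n for y in range(n)] for x in range(n)]

def H_cond(P):
    """Conditional entropy H(X|Y) in bits."""
    m = len(P)
    h = 0.0
    for y in range(m):
        p_y = sum(P[x][y] for x in range(m))
        for x in range(m):
            if P[x][y] > 0:
                h -= P[x][y] * math.log2(P[x][y] / p_y)
    return h

def H_b(p):
    """Binary entropy of p, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Error probability of the decoder X_hat = Y.
P_e = sum(P[x][y] for x in range(n) for y in range(n) if x != y)

lhs = H_cond(P)                            # H(X|Y)
rhs = H_b(P_e) + P_e * math.log2(n - 1)    # Fano bound
print(P_e, lhs, rhs)
```

For this channel the error, when it occurs, is uniform over the two wrong symbols, so Fano's inequality in fact holds with equality here.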

## Proof

Define an indicator random variable $E$ that indicates the event that our estimate ${\tilde {X}}=f(Y)$ is in error,

$E:={\begin{cases}1~&{\text{ if }}~{\tilde {X}}\neq X~,\\0~&{\text{ if }}~{\tilde {X}}=X~.\end{cases}}$

Consider $H(E,X|{\tilde {X}})$. We can use the chain rule for entropies to expand this in two different ways:

${\begin{aligned}H(E,X|{\tilde {X}})&=H(X|{\tilde {X}})+\underbrace {H(E|X,{\tilde {X}})} _{=0}\\&=H(E|{\tilde {X}})+H(X|E,{\tilde {X}})\end{aligned}}$

Equating the two

$H(X|{\tilde {X}})=H(E|{\tilde {X}})+H(X|E,{\tilde {X}})$

Expanding the rightmost term, $H(X|E,{\tilde {X}})$,

${\begin{aligned}H(X|E,{\tilde {X}})&=\underbrace {H(X|E=0,{\tilde {X}})} _{=0}\cdot P(E=0)+H(X|E=1,{\tilde {X}})\cdot \underbrace {P(E=1)} _{=P(e)}\\&=H(X|E=1,{\tilde {X}})\cdot P(e)\end{aligned}}$

When $E=0$, we have $X={\tilde {X}}$: being given the value of ${\tilde {X}}$ allows us to know the value of $X$ with certainty, which makes the term $H(X|E=0,{\tilde {X}})=0$. On the other hand, $E=1$ means that ${\tilde {X}}\neq X$, so given the value of ${\tilde {X}}$ we can narrow $X$ down to one of the $|{\mathcal {X}}|-1$ remaining values, allowing us to upper bound the conditional entropy: $H(X|E=1,{\tilde {X}})\leq \log(|{\mathcal {X}}|-1)$. Hence

$H(X|E,{\tilde {X}})\leq \log(|{\mathcal {X}}|-1)\cdot P(e)$

For the other term, $H(E|{\tilde {X}})\leq H(E)$, because conditioning does not increase entropy. By the definition of $E$, $H(E)=H_{b}(e)$, so $H(E|{\tilde {X}})\leq H_{b}(e)$. Putting it all together,

$H(X|{\tilde {X}})\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1)$

Because $X\rightarrow Y\rightarrow {\tilde {X}}$  is a Markov chain, we have $I(X;{\tilde {X}})\leq I(X;Y)$  by the data processing inequality, and hence $H(X|{\tilde {X}})\geq H(X|Y)$ , giving us

$H(X|Y)\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1)$
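Since $H_{b}(e)\leq 1$ when entropies are measured in bits, and $\log(|{\mathcal {X}}|-1)\leq \log |{\mathcal {X}}|$, the inequality is often quoted in the weaker but simpler form

$P(e)\geq {\frac {H(X|Y)-1}{\log |{\mathcal {X}}|}},$

which gives an explicit lower bound on the error probability of any decoder in terms of the equivocation $H(X|Y)$.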

## Alternative formulation

Let $X$ be a random variable whose density is one of $r+1$ possible densities $f_{1},\ldots ,f_{r+1}$. Suppose furthermore that the Kullback–Leibler divergence between any pair of densities is bounded,

$D_{KL}(f_{i}\|f_{j})\leq \beta$ for all $i\neq j.$

Let $\psi (X)\in \{1,\ldots ,r+1\}$  be an estimate of the index. Then

$\sup _{i}P_{i}(\psi (X)\not =i)\geq 1-{\frac {\beta +\log 2}{\log r}}$

where $P_{i}$ is the probability measure induced by $f_{i}$.
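Once $\beta$ and $r$ are fixed, the bound is a simple arithmetic expression. A minimal sketch, with illustrative values of $\beta$ and $r$ (not from the text) and natural logarithms, matching the use of the Kullback–Leibler divergence in nats:

```python
import math

# Illustrative setting: r + 1 = 9 candidate densities whose pairwise KL
# divergences are assumed bounded by beta = 0.5 nats.
r = 8
beta = 0.5

# Fano lower bound on the worst-case misidentification probability:
#   sup_i P_i(psi(X) != i) >= 1 - (beta + log 2) / log r
bound = 1 - (beta + math.log(2)) / math.log(r)
print(bound)
```

For these values the bound is about 0.43, so no estimator $\psi$ can identify the true index with worst-case error probability below that level.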

## Generalization

The following generalization is due to Ibragimov and Khasminskii (1979) and Assouad and Birgé (1983).

Let $\mathbf {F}$ be a class of densities containing a subclass of $r+1$ densities $f_{\theta }$ such that for any $\theta \neq \theta '$,

$\|f_{\theta }-f_{\theta '}\|_{L_{1}}\geq \alpha ,$
$D_{KL}(f_{\theta }\|f_{\theta '})\leq \beta .$

Then, in the worst case, the expected error of estimation is bounded from below:

$\sup _{f\in \mathbf {F} }E\|f_{n}-f\|_{L_{1}}\geq {\frac {\alpha }{2}}\left(1-{\frac {n\beta +\log 2}{\log r}}\right)$

where $f_{n}$ is any density estimator based on a sample of size $n$.
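Plugging numbers into the bound shows how it behaves; note that it is informative only while $n\beta +\log 2<\log r$. A minimal sketch with assumed values of $\alpha$, $\beta$, $r$, and $n$ (all illustrative, not from the text):

```python
import math

# Illustrative plug-in: an L1 separation alpha = 0.2, pairwise KL bound
# beta = 0.01 nats, r = 16, and sample size n = 50.
alpha, beta, r, n_samples = 0.2, 0.01, 16, 50

# Minimax lower bound: sup_f E||f_n - f||_L1 >= (alpha/2)(1 - (n*beta + log 2)/log r)
lower = (alpha / 2) * (1 - (n_samples * beta + math.log(2)) / math.log(r))
print(lower)
```

Here the worst-case expected $L_{1}$ error of any density estimator is at least about 0.057; in applications the subclass (and hence $\alpha$, $\beta$, $r$) is chosen to depend on $n$ so that the bound stays informative as $n$ grows.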