# Fano's inequality

In information theory, Fano's inequality (also known as the Fano converse and the Fano lemma) relates the average information lost in a noisy channel to the probability of a categorization error. It was derived by Robert Fano in the early 1950s while teaching a Ph.D. seminar in information theory at MIT, and later recorded in his 1961 textbook.

It is used to find a lower bound on the error probability of any decoder, as well as lower bounds on minimax risks in density estimation.

Let the random variables ${\displaystyle X}$ and ${\displaystyle Y}$ represent input and output messages with a joint probability ${\displaystyle P(x,y)}$. Let ${\displaystyle e}$ represent an occurrence of error; i.e., the event that ${\displaystyle X\neq {\tilde {X}}}$, where ${\displaystyle {\tilde {X}}=f(Y)}$ is an estimate of ${\displaystyle X}$ computed from the output ${\displaystyle Y}$. Fano's inequality is

${\displaystyle H(X|Y)\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1),}$

where ${\displaystyle {\mathcal {X}}}$ denotes the support of ${\displaystyle X}$,

${\displaystyle H\left(X|Y\right)=-\sum _{i,j}P(x_{i},y_{j})\log P\left(x_{i}|y_{j}\right)}$

is the conditional entropy,

${\displaystyle P(e)=P(X\neq {\tilde {X}})}$

is the probability of the communication error, and

${\displaystyle H_{b}(e)=-P(e)\log P(e)-(1-P(e))\log(1-P(e))}$

is the corresponding binary entropy.
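
As a concrete check, the following sketch (a hypothetical example assuming NumPy and base-2 logarithms) evaluates both sides of the inequality for a small symmetric channel over a three-symbol alphabet with the naive decoder ${\displaystyle {\tilde {X}}=Y}$; for this symmetric setup the two sides coincide, so the inequality is tight.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (with 0 log 0 := 0)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

# Hypothetical example: X uniform on a 3-symbol alphabet, a symmetric channel
# that delivers the symbol intact with probability 0.8 and flips it to each of
# the other two symbols with probability 0.1, and the decoder X~ = f(Y) = Y.
n = 3
p_flip = 0.2
joint = np.full((n, n), (1.0 / n) * (p_flip / (n - 1)))   # P(x, y): x = row, y = column
np.fill_diagonal(joint, (1.0 / n) * (1.0 - p_flip))

# Conditional entropy H(X|Y) = H(X, Y) - H(Y)
H_X_given_Y = entropy(joint) - entropy(joint.sum(axis=0))

# Error probability P(e) of the decoder X~ = Y, and the binary entropy H_b(e)
P_e = 1.0 - np.trace(joint)
H_b = entropy([P_e, 1.0 - P_e])

print(f"H(X|Y)     = {H_X_given_Y:.4f} bits")
print(f"Fano bound = {H_b + P_e * np.log2(n - 1):.4f} bits")   # equal for this channel
```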

## Proof

Define an indicator random variable ${\displaystyle E}$ that indicates the event that our estimate ${\displaystyle {\tilde {X}}=f(Y)}$ is in error:

${\displaystyle E:={\begin{cases}1~&{\text{ if }}~{\tilde {X}}\neq X~,\\0~&{\text{ if }}~{\tilde {X}}=X~.\end{cases}}}$

Consider ${\displaystyle H(E,X|{\tilde {X}})}$. We can use the chain rule for entropies to expand this in two different ways:

${\displaystyle {\begin{aligned}H(E,X|{\tilde {X}})&=H(X|{\tilde {X}})+\underbrace {H(E|X,{\tilde {X}})} _{=0}\\&=H(E|{\tilde {X}})+H(X|E,{\tilde {X}})\end{aligned}}}$

The term ${\displaystyle H(E|X,{\tilde {X}})}$ vanishes because ${\displaystyle E}$ is a deterministic function of ${\displaystyle X}$ and ${\displaystyle {\tilde {X}}}$. Equating the two expansions,

${\displaystyle H(X|{\tilde {X}})=H(E|{\tilde {X}})+H(X|E,{\tilde {X}})}$

Expanding the rightmost term, ${\displaystyle H(X|E,{\tilde {X}})}$,

${\displaystyle {\begin{aligned}H(X|E,{\tilde {X}})&=\underbrace {H(X|E=0,{\tilde {X}})} _{=0}\cdot P(E=0)+H(X|E=1,{\tilde {X}})\cdot \underbrace {P(E=1)} _{=P(e)}\\&=H(X|E=1,{\tilde {X}})\cdot P(e)\end{aligned}}}$

Since ${\displaystyle E=0}$ means ${\displaystyle X={\tilde {X}}}$, being given the value of ${\displaystyle {\tilde {X}}}$ allows us to know the value of ${\displaystyle X}$ with certainty, which makes the term ${\displaystyle H(X|E=0,{\tilde {X}})=0}$. On the other hand, ${\displaystyle E=1}$ means that ${\displaystyle {\tilde {X}}\neq X}$, so given the value of ${\displaystyle {\tilde {X}}}$ we can narrow ${\displaystyle X}$ down to one of the remaining ${\displaystyle |{\mathcal {X}}|-1}$ values; since the entropy of a variable taking at most ${\displaystyle |{\mathcal {X}}|-1}$ values is at most ${\displaystyle \log(|{\mathcal {X}}|-1)}$, this gives the upper bound ${\displaystyle H(X|E=1,{\tilde {X}})\leq \log(|{\mathcal {X}}|-1)}$. Hence

${\displaystyle H(X|E,{\tilde {X}})\leq \log(|{\mathcal {X}}|-1)\cdot P(e)}$

For the other term, ${\displaystyle H(E|{\tilde {X}})\leq H(E)}$, because conditioning cannot increase entropy. Because ${\displaystyle E}$ is a Bernoulli random variable with parameter ${\displaystyle P(e)}$, its entropy is ${\displaystyle H(E)=H_{b}(e)}$, so ${\displaystyle H(E|{\tilde {X}})\leq H_{b}(e)}$. Putting it all together,

${\displaystyle H(X|{\tilde {X}})\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1)}$

Because ${\displaystyle X\rightarrow Y\rightarrow {\tilde {X}}}$ is a Markov chain, we have ${\displaystyle I(X;{\tilde {X}})\leq I(X;Y)}$ by the data processing inequality, and hence ${\displaystyle H(X|{\tilde {X}})\geq H(X|Y)}$, giving us

${\displaystyle H(X|Y)\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1)}$
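
The steps above can also be checked numerically. The following sketch (a hypothetical example assuming NumPy and base-2 logarithms, with an arbitrary small joint distribution and a maximum-a-posteriori-style decoder ${\displaystyle f}$ chosen purely for illustration) verifies the chain-rule identity and the two bounds used in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) probability array."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

# An arbitrary joint distribution P(x, y) on a 4x4 alphabet and a decoder x~ = f(y)
# that picks the most likely input for each output (a hypothetical choice).
n = 4
P = rng.random((n, n))
P /= P.sum()
f = P.argmax(axis=0)

# Joint distribution of (X, E, X~), where E = 1 exactly when X != X~.
Q = np.zeros((n, 2, n))                       # axes: x, e, x~
for x in range(n):
    for y in range(n):
        x_hat = f[y]
        Q[x, int(x != x_hat), x_hat] += P[x, y]

H_X_given_Xt  = entropy(Q.sum(axis=1)) - entropy(Q.sum(axis=(0, 1)))   # H(X | X~)
H_E_given_Xt  = entropy(Q.sum(axis=0)) - entropy(Q.sum(axis=(0, 1)))   # H(E | X~)
H_X_given_EXt = entropy(Q) - entropy(Q.sum(axis=0))                    # H(X | E, X~)
P_e = Q[:, 1, :].sum()

# Chain-rule identity H(X|X~) = H(E|X~) + H(X|E,X~) used to equate the two expansions.
assert np.isclose(H_X_given_Xt, H_E_given_Xt + H_X_given_EXt)
# The two bounds derived in the proof.
assert H_X_given_EXt <= P_e * np.log2(n - 1) + 1e-12
assert H_E_given_Xt <= entropy([P_e, 1 - P_e]) + 1e-12
print("chain rule and both bounds verified numerically")
```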

## Alternative formulation

Let ${\displaystyle X}$ be a random variable with density equal to one of ${\displaystyle r+1}$ possible densities ${\displaystyle f_{1},\ldots ,f_{r+1}}$. Furthermore, suppose that the Kullback–Leibler divergence between any pair of densities is uniformly bounded,

${\displaystyle D_{KL}(f_{i}\|f_{j})\leq \beta }$  for all ${\displaystyle i\not =j.}$

Let ${\displaystyle \psi (X)\in \{1,\ldots ,r+1\}}$  be an estimate of the index. Then

${\displaystyle \sup _{i}P_{i}(\psi (X)\not =i)\geq 1-{\frac {\beta +\log 2}{\log r}}}$

where ${\displaystyle P_{i}}$ is the probability measure induced by ${\displaystyle f_{i}}$.
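
To illustrate the bound, the following sketch (a hypothetical example assuming NumPy and natural logarithms) considers ${\displaystyle r+1=5}$ unit-variance Gaussian densities with closely spaced means, takes ${\displaystyle \beta }$ to be the largest pairwise Kullback–Leibler divergence, and evaluates the resulting lower bound on the worst-case misidentification probability of any estimator ${\displaystyle \psi }$.

```python
import numpy as np

# Hypothetical setup: r + 1 = 5 unit-variance Gaussian densities f_i = N(mu_i, 1).
# For such densities, D_KL(f_i || f_j) = (mu_i - mu_j)**2 / 2 (in nats).
mus = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
r = len(mus) - 1
beta = max((a - b) ** 2 / 2 for a in mus for b in mus)   # largest pairwise KL divergence

# Fano-type lower bound on the worst-case error of ANY estimator psi of the index i.
lower_bound = 1 - (beta + np.log(2)) / np.log(r)
print(f"beta = {beta:.3f} nats, sup_i P_i(psi(X) != i) >= {lower_bound:.3f}")
```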

## Generalization

The following generalization is due to Ibragimov and Khasminskii (1979) and Assouad and Birgé (1983).

Let ${\displaystyle \mathbf {F} }$ be a class of densities with a subclass of ${\displaystyle r+1}$ densities ${\displaystyle f_{\theta }}$ such that for any ${\displaystyle \theta \neq \theta '}$,

${\displaystyle \|f_{\theta }-f_{\theta '}\|_{L_{1}}\geq \alpha ,}$
${\displaystyle D_{KL}(f_{\theta }\|f_{\theta '})\leq \beta .}$

Then, in the worst case, the expected error of estimation is bounded from below,

${\displaystyle \sup _{f\in \mathbf {F} }E\|f_{n}-f\|_{L_{1}}\geq {\frac {\alpha }{2}}\left(1-{\frac {n\beta +\log 2}{\log r}}\right)}$

where ${\displaystyle f_{n}}$ is any density estimator based on a sample of size ${\displaystyle n}$.
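
As a plug-in illustration (hypothetical numbers only), the sketch below evaluates this minimax lower bound for a fixed separation ${\displaystyle \alpha }$, divergence bound ${\displaystyle \beta }$, and ${\displaystyle r}$, showing how the guaranteed ${\displaystyle L_{1}}$ risk shrinks as the sample size ${\displaystyle n}$ grows.

```python
import numpy as np

# Hypothetical constants: a subclass of r + 1 = 17 densities with pairwise
# L1 separation at least alpha and pairwise KL divergence at most beta.
alpha, beta, r = 0.5, 0.01, 16

for n in (10, 50, 100, 200):
    bound = (alpha / 2) * (1 - (n * beta + np.log(2)) / np.log(r))
    # Once n*beta + log 2 exceeds log r the bound is vacuous, hence the clamp at 0.
    print(f"n = {n:3d}:  sup_f E ||f_n - f||_1 >= {max(bound, 0.0):.4f}")
```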

## References

• P. Assouad, "Deux remarques sur l'estimation", Comptes Rendus de l'Académie des Sciences de Paris, Vol. 296, pp. 1021–1024, 1983.
• L. Birgé, "Estimating a density under order restrictions: nonasymptotic minimax risk", Technical report, UER de Sciences Économiques, Université Paris X, Nanterre, France, 1983.
• T. Cover, J. Thomas (1991). Elements of Information Theory. pp. 38–42. ISBN 978-0-471-06259-2.
• L. Devroye, A Course in Density Estimation. Progress in Probability and Statistics, Vol. 14. Boston, Birkhäuser, 1987. ISBN 0-8176-3365-0, ISBN 3-7643-3365-0.
• Fano, Robert (1968). Transmission of information: a statistical theory of communications. Cambridge, Mass: MIT Press. ISBN 978-0-262-56169-3. OCLC 804123877.
• R. Fano, "Fano inequality", Scholarpedia, 2008.
• I. A. Ibragimov, R. Z. Has′minskii, Statistical estimation, asymptotic theory. Applications of Mathematics, vol. 16, Springer-Verlag, New York, 1981. ISBN 0-387-90523-5