Outline
- Lecture 1: Introduction
- Lecture 2: Entropy
- Lecture 3: Inequality
Lecture 2: Entropy
Desired properties of a measure of information:
- Monotonic: less probable $\rightarrow$ more information
- Additive: total info of independent events $=$ $\sum$ of the individual infos
For a discrete random variable $X$:
- Realizations: $\mathcal X=\{x_1,\dots,x_q\}$
- Probabilities: $\{p_1,\dots,p_q\}$
- Self-info of $X=x_i$: $-\log p_i$
- Avg. information: $I(X)=-\sum\limits_{i=1}^q p_i \log p_i$
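For concreteness (taking $\log$ base 2, so information is in bits), consider $\mathcal X=\{x_1,x_2,x_3,x_4\}$ with probabilities $(\tfrac12,\tfrac14,\tfrac18,\tfrac18)$; the self-informations are $1,2,3,3$ bits and
$$I(X)=\tfrac12\cdot 1+\tfrac14\cdot 2+\tfrac18\cdot 3+\tfrac18\cdot 3=1.75\ \text{bits}.$$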
Entropy:
A measure of the amount of information required to describe a r.v.; the average number of bits needed to describe the random variable (for $\log$ base 2):
$$H(X)=-\sum_{i=1}^q p_i\log p_i$$
Lemma 1. $H(X)\ge 0$
Lemma 2. $H_b(X)=(\log_ba)H_a(X)$
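The change of base behind Lemma 2 is a one-line check, using $\log_b p = (\log_b a)(\log_a p)$:
$$H_b(X)=-\sum_{i=1}^q p_i\log_b p_i=-\sum_{i=1}^q p_i(\log_b a)(\log_a p_i)=(\log_b a)\,H_a(X).$$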
Denote the binary entropy function $H(p) = -p\log p - (1-p)\log(1-p)$
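A quick numerical check (base-2 logs): $H(p)$ is symmetric about $p=\tfrac12$ and maximal there, e.g.
$$H(\tfrac12)=1\ \text{bit},\qquad H(\tfrac14)=\tfrac14\log 4+\tfrac34\log\tfrac43\approx 0.811\ \text{bits}.$$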
Joint Entropy:
If $X$ and $Y$ are independent, then $H(X,Y) = H(X) + H(Y)$
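The definition itself is not written above; presumably the standard one is meant, from which the additivity claim follows directly:
$$H(X,Y)=-\sum_x\sum_y p(x,y)\log p(x,y).$$
If $p(x,y)=p(x)p(y)$, then
$$H(X,Y)=-\sum_{x,y}p(x)p(y)\bigl[\log p(x)+\log p(y)\bigr]=H(X)+H(Y).$$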
Conditional Entropy:
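Presumably the standard definition is intended here:
$$H(Y|X)=\sum_x p(x)\,H(Y|X=x)=-\sum_x\sum_y p(x,y)\log p(y|x)=E\bigl[-\log p(Y|X)\bigr].$$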
Chain Rule of Entropy:
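The rule being proved, in its usual form:
$$H(X_1,\dots,X_n)=\sum_{i=1}^n H(X_i\mid X_{i-1},\dots,X_1),$$
with two-variable case $H(X,Y)=H(X)+H(Y|X)$.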
Proof by induction.
Relative Entropy (KL Divergence):
A measure of the distance between two distributions (not symmetric, hence not a true metric); the expected number of extra bits needed when describing the r.v. with a code built for $q$ while the true distribution is $p$.
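Written out in the usual form (with the conventions $0\log\tfrac{0}{q}=0$ and $p\log\tfrac{p}{0}=\infty$):
$$D(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)}=E_p\left[\log\frac{p(X)}{q(X)}\right].$$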
Mutual Information:
The relative entropy between the joint distribution and the product of the marginal distributions.
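In formulas (standard form), together with its expression in terms of entropies:
$$I(X;Y)=D\bigl(p(x,y)\,\|\,p(x)p(y)\bigr)=\sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}=H(X)-H(X|Y)=H(X)+H(Y)-H(X,Y).$$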
Conditional Mutual Information:
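The usual definition, stated for completeness:
$$I(X;Y|Z)=H(X|Z)-H(X|Y,Z)=E_{p(x,y,z)}\left[\log\frac{p(X,Y|Z)}{p(X|Z)\,p(Y|Z)}\right].$$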
Chain Rule for Mutual Information:
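In its usual form:
$$I(X_1,\dots,X_n;Y)=\sum_{i=1}^n I(X_i;Y\mid X_{i-1},\dots,X_1).$$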
Conditional Relative Entropy:
Given joint probability mass functions $p(x, y)$ and $q(x, y)$, the conditional relative entropy is:
$$D\bigl(p(y|x)\,\|\,q(y|x)\bigr)=\sum_x p(x)\sum_y p(y|x)\log\frac{p(y|x)}{q(y|x)}=E_{p(x,y)}\left[\log\frac{p(Y|X)}{q(Y|X)}\right].$$
Chain Rule for Relative Entropy
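Stated in the usual form:
$$D\bigl(p(x,y)\,\|\,q(x,y)\bigr)=D\bigl(p(x)\,\|\,q(x)\bigr)+D\bigl(p(y|x)\,\|\,q(y|x)\bigr).$$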
Another way to explain mutual information
For r.v. $X,Y \sim p(x,y)$ and $\tilde X, \tilde Y \sim q(x,y)$:
If $q(x,y) = p(x)\cdot p(y)$, then the marginals satisfy $q(x) = p(x)$ and $q(y) = p(y)$.
Therefore $q(x,y) = q(x)\cdot q(y)$, i.e. $\tilde X \bot \tilde Y$.
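Spelling the remark out: with this choice of $q$,
$$I(X;Y)=D\bigl(p(x,y)\,\|\,q(x,y)\bigr),\qquad q(x,y)=p(x)\,p(y),$$
so $I(X;Y)$ measures how far $(X,Y)$ is from an independent pair $(\tilde X,\tilde Y)$ with the same marginals; it is $0$ IFF $X\bot Y$.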
Lecture 3: Inequality
$f(x)$ is convex over $(a,b)$ IFF $\forall x_1,x_2 \in (a,b)$ and $0\le \lambda \le 1$:
$$f(\lambda x_1+(1-\lambda)x_2)\le \lambda f(x_1)+(1-\lambda)f(x_2)$$
$f(x)$ is strictly convex if, for $x_1\ne x_2$, equality holds only when $\lambda = 0$ or $\lambda = 1$.
$f(x)$ is concave if $-f(x)$ is convex.
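Two examples relevant to the proofs below, checked via the second derivative (natural logs; the base-$2$ versions differ only by the positive factor $1/\ln 2$):
$$\frac{d^2}{dx^2}(-\ln x)=\frac{1}{x^2}>0,\qquad \frac{d^2}{dx^2}(x\ln x)=\frac{1}{x}>0\quad(x>0),$$
so $-\log x$ and $x\log x$ are strictly convex on $(0,\infty)$, and $\log x$ is concave.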
Theorem: Jensen’s Inequality
For a convex function $f(x)$ and r.v. $X$, $E[f(X)] \ge f(E[X])$. If $f$ is strictly convex, equality holds only if $X$ is constant.
Proof by induction on the number of mass points of $X$.
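A familiar special case, added for illustration: take the strictly convex $f(x)=x^2$; then
$$E[X^2]\ge (E[X])^2,\quad\text{i.e. } \operatorname{Var}(X)\ge 0,$$
with equality IFF $X$ is constant.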
Theorem: Information Inequality
For any probability mass functions $p(x)$ and $q(x)$, $D(p(x)\|q(x)) \ge 0$.
Equality IFF $\forall x, p(x)=q(x)$.
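A proof sketch via Jensen’s inequality (presumably the argument intended in lecture): with $A=\{x: p(x)>0\}$,
$$-D(p\|q)=\sum_{x\in A}p(x)\log\frac{q(x)}{p(x)}\le\log\sum_{x\in A}p(x)\frac{q(x)}{p(x)}=\log\sum_{x\in A}q(x)\le\log 1=0,$$
using concavity of $\log$; equality forces $q(x)/p(x)$ to be constant on $A$ and $\sum_{x\in A}q(x)=1$, i.e. $p=q$.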
Theorem: Nonnegativity of mutual information
For any random variables $X$, $Y$, $I(X;Y)=D(p(x,y)\|p(x)p(y))\ge 0$.
Equality IFF $X \bot Y$.
Corollaries
$D(p(y|x)\|q(y|x))\ge0$, with equality IFF $p(y|x)=q(y|x)$ for all $y$ and all $x$ with $p(x)>0$.
$I(X;Y|Z)\ge0$, with equality IFF $X$ and $Y$ are conditionally independent given $Z$.
Theorem: The maximum entropy distribution
For a discrete r.v. $X$, $H(X)\le \log |\mathcal X|$, with equality IFF $X$ has a uniform distribution over $\mathcal X$.
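Proof sketch (standard argument): with $u(x)=1/|\mathcal X|$ the uniform p.m.f.,
$$D(p\|u)=\sum_x p(x)\log\bigl(p(x)\,|\mathcal X|\bigr)=\log|\mathcal X|-H(X)\ge 0.$$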
Theorem: Conditioning reduces entropy
$H(X|Y)\le H(X)$ with equality IFF $X\bot Y$.
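This follows from the nonnegativity of mutual information:
$$H(X)-H(X|Y)=I(X;Y)\ge 0,$$
with equality IFF $I(X;Y)=0$, i.e. $X\bot Y$.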
Theorem: Independence bound on entropy
$H(X_1,\dots,X_n)\le\sum\limits_{i=1}^n H(X_i)$ with equality IFF $X_1,\dots,X_n$ are mutually independent (pairwise independence is not enough).
Markov Chain
Random variables $X_1,\dots,X_n$ form a Markov chain $X_1\rightarrow\cdots\rightarrow X_n$ if the joint p.m.f. satisfies
$$p(x_1,\dots,x_n)=p(x_1)\,p(x_2|x_1)\cdots p(x_n|x_{n-1}).$$
Properties
- $X\rightarrow Y\rightarrow Z\implies p(x,z|y)=p(x|y)p(z|y)$
- $X\rightarrow Y\rightarrow Z\implies Z\rightarrow Y\rightarrow X$
- If $Z=f(Y)$, then $X\rightarrow Y\rightarrow Z$.
Theorem: Data-processing inequality
If $X\rightarrow Y \rightarrow Z$, then $I(X;Y)\ge I(X;Z)$.
Proof: by the chain rule, $I(X;Y,Z)=I(X;Z)+I(X;Y|Z)=I(X;Y)+I(X;Z|Y)$. Since $X\rightarrow Y \rightarrow Z$, we have $I(X;Z|Y)=0$; since $I(X;Y|Z)\ge 0$, it follows that $I(X;Y)\ge I(X;Z)$.