Outline
- Lecture 1: Introduction
- Lecture 2: Entropy
- Lecture 3: Inequality
Lecture 2: Entropy
Desired properties of a measure of information:
- Monotonic: less probable $\rightarrow$ more information
- Additive: total info of independent events $=$ $\sum$ of the individual infos
For a discrete random variable $X$:
- Realizations: $\mathcal X=\{x_1,\dots,x_q\}$
- Probabilities: $\{p_1,\dots,p_q\}$
- Self-info of $X=x_i$: $-\log p_i$
- Avg. information: $I(X)=-\sum\limits_{i=1}^q p_i \log p_i$
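For concreteness (taking $\log$ base 2, so information is in bits), consider $\mathcal X=\{x_1,x_2,x_3,x_4\}$ with probabilities $(\tfrac12,\tfrac14,\tfrac18,\tfrac18)$; the self-informations are $1,2,3,3$ bits and
$$I(X)=\tfrac12\cdot 1+\tfrac14\cdot 2+\tfrac18\cdot 3+\tfrac18\cdot 3=1.75\ \text{bits}.$$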
Entropy:
A measure of the amount of information required to describe a r.v.; the average number of bits needed to describe the random variable (for $\log$ base 2):
$$H(X)=-\sum_{i=1}^q p_i\log p_i$$
Lemma 1. $H(X)\ge 0$
Lemma 2. $H_b(X)=(\log_ba)H_a(X)$
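The change of base behind Lemma 2 is a one-line check, using $\log_b p = (\log_b a)(\log_a p)$:
$$H_b(X)=-\sum_{i=1}^q p_i\log_b p_i=-\sum_{i=1}^q p_i(\log_b a)(\log_a p_i)=(\log_b a)\,H_a(X).$$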
Denote the binary entropy function $H(p) = -p\log p - (1-p)\log(1-p)$
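A quick numerical check (base-2 logs): $H(p)$ is symmetric about $p=\tfrac12$ and maximal there, e.g.
$$H(\tfrac12)=1\ \text{bit},\qquad H(\tfrac14)=\tfrac14\log 4+\tfrac34\log\tfrac43\approx 0.811\ \text{bits}.$$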
Joint Entropy:
If $X$ and $Y$ are independent, then $H(X,Y) = H(X) + H(Y)$
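The definition itself is not written above; presumably the standard one is meant, from which the additivity claim follows directly:
$$H(X,Y)=-\sum_x\sum_y p(x,y)\log p(x,y).$$
If $p(x,y)=p(x)p(y)$, then
$$H(X,Y)=-\sum_{x,y}p(x)p(y)\bigl[\log p(x)+\log p(y)\bigr]=H(X)+H(Y).$$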
Conditional Entropy:
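Presumably the standard definition is intended here:
$$H(Y|X)=\sum_x p(x)\,H(Y|X=x)=-\sum_x\sum_y p(x,y)\log p(y|x)=E\bigl[-\log p(Y|X)\bigr].$$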
Chain Rule of Entropy:
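The rule being proved, in its usual form:
$$H(X_1,\dots,X_n)=\sum_{i=1}^n H(X_i\mid X_{i-1},\dots,X_1),$$
with two-variable case $H(X,Y)=H(X)+H(Y|X)$.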
Proof by induction.
Relative Entropy (KL Divergence):
A measure of the distance between two distributions (not symmetric, hence not a true metric); the expected number of extra bits needed when describing the r.v. with a code built for $q$ while the true distribution is $p$.
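Written out in the usual form (with the conventions $0\log\tfrac{0}{q}=0$ and $p\log\tfrac{p}{0}=\infty$):
$$D(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)}=E_p\left[\log\frac{p(X)}{q(X)}\right].$$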
Mutual Information:
The relative entropy between the joint distribution and the product of the marginal distributions.
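In formulas (standard form), together with its expression in terms of entropies:
$$I(X;Y)=D\bigl(p(x,y)\,\|\,p(x)p(y)\bigr)=\sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}=H(X)-H(X|Y)=H(X)+H(Y)-H(X,Y).$$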
Conditional Mutual Information:
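The usual definition, stated for completeness:
$$I(X;Y|Z)=H(X|Z)-H(X|Y,Z)=E_{p(x,y,z)}\left[\log\frac{p(X,Y|Z)}{p(X|Z)\,p(Y|Z)}\right].$$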
Chain Rule for Mutual Information:
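In its usual form:
$$I(X_1,\dots,X_n;Y)=\sum_{i=1}^n I(X_i;Y\mid X_{i-1},\dots,X_1).$$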
Conditional Relative Entropy:
Given joint probability mass functions $p(x, y)$ and $q(x, y)$, the conditional relative entropy is:
$$D\bigl(p(y|x)\,\|\,q(y|x)\bigr)=\sum_x p(x)\sum_y p(y|x)\log\frac{p(y|x)}{q(y|x)}=E_{p(x,y)}\left[\log\frac{p(Y|X)}{q(Y|X)}\right].$$
Chain Rule for Relative Entropy
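Stated in the usual form:
$$D\bigl(p(x,y)\,\|\,q(x,y)\bigr)=D\bigl(p(x)\,\|\,q(x)\bigr)+D\bigl(p(y|x)\,\|\,q(y|x)\bigr).$$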
Another way to explain mutual information
For r.v. $X,Y \sim p(x,y)$ and $\tilde X, \tilde Y \sim q(x,y)$:
If $q(x,y) = p(x)\cdot p(y)$, then the marginals satisfy $q(x) = p(x)$ and $q(y) = p(y)$.
Therefore $q(x,y) = q(x)\cdot q(y)$, i.e. $\tilde X \bot \tilde Y$.
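Spelling the remark out: with this choice of $q$,
$$I(X;Y)=D\bigl(p(x,y)\,\|\,q(x,y)\bigr),\qquad q(x,y)=p(x)\,p(y),$$
so $I(X;Y)$ measures how far $(X,Y)$ is from an independent pair $(\tilde X,\tilde Y)$ with the same marginals; it is $0$ IFF $X\bot Y$.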
Lecture 3: Inequality
$f(x)$ is convex over $(a,b)$ IFF $\forall x_1,x_2 \in (a,b)$ and $0\le \lambda \le 1$:
$$f(\lambda x_1+(1-\lambda)x_2)\le \lambda f(x_1)+(1-\lambda)f(x_2)$$
$f(x)$ is strictly convex if, for $x_1\ne x_2$, equality holds only when $\lambda = 0$ or $\lambda = 1$.
$f(x)$ is concave if $-f(x)$ is convex.
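Two examples relevant to the proofs below, checked via the second derivative (natural logs; the base-$2$ versions differ only by the positive factor $1/\ln 2$):
$$\frac{d^2}{dx^2}(-\ln x)=\frac{1}{x^2}>0,\qquad \frac{d^2}{dx^2}(x\ln x)=\frac{1}{x}>0\quad(x>0),$$
so $-\log x$ and $x\log x$ are strictly convex on $(0,\infty)$, and $\log x$ is concave.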
Theorem: Jensen’s Inequality
For a convex function $f(x)$ and r.v. $X$, $E[f(X)] \ge f(E[X])$. If $f$ is strictly convex, equality holds only if $X$ is constant.
Proof by induction on the number of mass points of $X$.
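A familiar special case, added for illustration: take the strictly convex $f(x)=x^2$; then
$$E[X^2]\ge (E[X])^2,\quad\text{i.e. } \operatorname{Var}(X)\ge 0,$$
with equality IFF $X$ is constant.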
Theorem: Information Inequality
For any probability mass functions $p(x)$ and $q(x)$, $D(p(x)\|q(x)) \ge 0$.
Equality IFF $\forall x, p(x)=q(x)$.
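A proof sketch via Jensen’s inequality (presumably the argument intended in lecture): with $A=\{x: p(x)>0\}$,
$$-D(p\|q)=\sum_{x\in A}p(x)\log\frac{q(x)}{p(x)}\le\log\sum_{x\in A}p(x)\frac{q(x)}{p(x)}=\log\sum_{x\in A}q(x)\le\log 1=0,$$
using concavity of $\log$; equality forces $q(x)/p(x)$ to be constant on $A$ and $\sum_{x\in A}q(x)=1$, i.e. $p=q$.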
Theorem: Nonnegativity of mutual information
For any random variables $X$, $Y$, $I(X;Y)=D(p(x,y)\|p(x)p(y))\ge 0$.
Equality IFF $X \bot Y$.
Corollaries
$D(p(y|x)\|q(y|x))\ge0$, with equality IFF $p(y|x)=q(y|x)$ for all $y$ and all $x$ with $p(x)>0$.
$I(X;Y|Z)\ge0$, with equality IFF $X$ and $Y$ are conditionally independent given $Z$.
Theorem: The maximum entropy distribution
For a discrete r.v. $X$, $H(X)\le \log |\mathcal X|$, with equality IFF $X$ has a uniform distribution over $\mathcal X$.
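Proof sketch (standard argument): with $u(x)=1/|\mathcal X|$ the uniform p.m.f.,
$$D(p\|u)=\sum_x p(x)\log\bigl(p(x)\,|\mathcal X|\bigr)=\log|\mathcal X|-H(X)\ge 0.$$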
Theorem: Conditioning reduces entropy
$H(X|Y)\le H(X)$ with equality IFF $X\bot Y$.
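This follows from the nonnegativity of mutual information:
$$H(X)-H(X|Y)=I(X;Y)\ge 0,$$
with equality IFF $I(X;Y)=0$, i.e. $X\bot Y$.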
Theorem: Independence bound on entropy
$H(X_1,\dots,X_n)\le\sum\limits_{i=1}^n H(X_i)$ with equality IFF $X_1,\dots,X_n$ are mutually independent (pairwise independence is not enough).
Markov Chain
Random variables $X_1,\dots,X_n$ form a Markov chain $X_1\rightarrow\cdots\rightarrow X_n$ if the joint p.m.f. satisfies
$$p(x_1,\dots,x_n)=p(x_1)\,p(x_2|x_1)\cdots p(x_n|x_{n-1}).$$
Properties
- $X\rightarrow Y\rightarrow Z\implies p(x,z|y)=p(x|y)p(z|y)$
- $X\rightarrow Y\rightarrow Z\implies Z\rightarrow Y\rightarrow X$
- If $Z=f(Y)$, then $X\rightarrow Y\rightarrow Z$.
Theorem: Data-processing inequality
If $X\rightarrow Y \rightarrow Z$, then $I(X;Y)\ge I(X;Z)$.
Proof: by the chain rule, $I(X;Y,Z)=I(X;Z)+I(X;Y|Z)=I(X;Y)+I(X;Z|Y)$. Since $X\rightarrow Y \rightarrow Z$, we have $I(X;Z|Y)=0$; since $I(X;Y|Z)\ge 0$, it follows that $I(X;Y)\ge I(X;Z)$.