Outline
- Introduction
- Preliminary
- Distributions
- Linear Models for Regression
- Linear Models for Classification
- Neural Networks
- Sparse Kernel Machines
- Mixture Models and EM Learning
- Sequential Data
- Markov Decision Process
Chapter 1 - Introduction
Concepts
- Machine Learning—minimizing a loss function so that models generalize from data sets
- Datasets—annotated, indexed, organized
- Models—tree, distance, probabilistic, graph, bio-inspired
- Optimization—algorithms that minimize the loss
Linear Optimization Model
Find $X^\star$ to minimize the loss function:
Euclidean Distance Optimization
Point $x_0$, model $\mathbf b^\text T\mathbf x=d$.
We have:
Lagrange optimization:
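A hedged reconstruction of the elided derivation: minimize $\|\mathbf x-\mathbf x_0\|^2$ subject to $\mathbf b^\text T\mathbf x=d$, with Lagrangian $L(\mathbf x,\lambda)=\|\mathbf x-\mathbf x_0\|^2+2\lambda(\mathbf b^\text T\mathbf x-d)$, which gives

$$\mathbf x^\star=\mathbf x_0-\frac{\mathbf b^\text T\mathbf x_0-d}{\|\mathbf b\|^2}\,\mathbf b,\qquad \text{distance}=\frac{\vert\mathbf b^\text T\mathbf x_0-d\vert}{\|\mathbf b\|}$$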
Convex Optimization
Unconstrained optimization
Gradient descent
Gauss-Newton’s Method
Use a second-order approximation to the function:
Choose $\Delta x$ to minimize above:
Descent direction:
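The elided steps presumably follow the standard second-order scheme, with $\mathbf H$ the Hessian (assumed positive definite):

$$f(\mathbf x+\Delta\mathbf x)\approx f(\mathbf x)+\nabla f(\mathbf x)^\text T\Delta\mathbf x+\tfrac12\Delta\mathbf x^\text T\mathbf H\,\Delta\mathbf x,\qquad \Delta\mathbf x=-\mathbf H^{-1}\nabla f(\mathbf x)$$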
Batch Gradient Descent
Minimize empirical loss, assuming it’s convex and unconstrained.
- Gradient descent on the empirical loss
- The gradient is the average of the per-sample gradients over all samples
- Very slow when $n$ is very large
Stochastic Gradient Descent
Compute the gradient from just one or a few samples; see the sketch below.
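A minimal sketch contrasting the two update rules on a least-squares loss (NumPy; the data and all names here are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # toy design matrix
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0])    # noiseless linear targets

def batch_gd(X, y, lr=0.1, n_epochs=100):
    """One step per epoch, using the average gradient over ALL samples."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        grad = X.T @ (X @ w - y) / n            # O(n) per step: slow when n is huge
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, n_epochs=10):
    """One step per sample, using a single-sample gradient estimate."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):            # shuffle, then visit each sample
            grad = (X[i] @ w - y[i]) * X[i]     # noisy but cheap gradient
            w -= lr * grad
    return w

print(batch_gd(X, y))   # both should approach [1, -2, 0.5, 3, 0]
print(sgd(X, y))
```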
Constrained optimization
- Lagrange methods
- Bayesian methods
Non-convex Optimization
- Heuristic algorithms
- Random search
Algorithms
- Bayes
- KNN
- K-means
- Decision Tree
- SVM
- Boosting
- Ensemble Learning
- Linear Statistical Learning
- Nonlinear Statistical Learning
- Deep Neural Networks
- Generative Adversarial Networks
- Bayesian Networks
- Reinforcement Learning
- Federated Learning
Chapter 2 - Preliminary
Curve Fitting and Regularization
Polynomial Curve Fitting
Sum-of-Squares Error Function
Root-Mean-Square (RMS) Error
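The elided definition is likely PRML's (1.3):

$$E_{\text{RMS}}=\sqrt{2E(\mathbf w^\star)/N}$$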
where $\mathbf w^\star=\mathop{\text{argmin}}\limits_{\bf w}\ E(\bf w)$.
Regularization
Discourage the coefficients from reaching large values.
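The elided error function is likely PRML's (1.4):

$$\tilde E(\mathbf w)=\frac12\sum_{n=1}^N\{y(x_n,\mathbf w)-t_n\}^2+\frac\lambda2\|\mathbf w\|^2$$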
where $\|{\bf w}\|^2={\bf w}^\text T{\bf w}=w_0^2+w_1^2+\cdots+w_M^2$.
Note that the coefficient $w_0$ is often omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001); alternatively, it may be included but with its own regularization coefficient.
Probability Theory
Basic Concepts
Marginal/Joint/Conditional Probability
Sum/Product Rule
Probability Densities
Bayes’ Theorem
Prior/Posterior Probability
Transformed Densities
Expectations
Variances and Covariances
Gaussian Distribution
The Gaussian Distribution
Gaussian Mean and Variance
Likelihood Function
For a data set of independent observations $\mathbf x = (x_1,\cdots,x_N)^\text T$ drawn from a Gaussian distribution, the likelihood function is
Log Likelihood Function:
By maximizing the log likelihood function, we have
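The elided maximum likelihood solutions are likely the standard ones:

$$\mu_{ML}=\frac1N\sum_{n=1}^Nx_n,\qquad \sigma^2_{ML}=\frac1N\sum_{n=1}^N(x_n-\mu_{ML})^2$$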
corresponding to the sample mean and the sample variance.
Properties of $\mu_{ML}$ and $\sigma^2_{ML}$
The MLE will obtain the correct mean but will underestimate the true variance by a factor $\frac{N-1}{N}$ (bias).
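In equations:

$$\mathbb E[\mu_{ML}]=\mu,\qquad \mathbb E[\sigma^2_{ML}]=\frac{N-1}N\sigma^2$$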
Curve fitting re-visited
Given $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the value $y(x, \mathbf w)$.
where $\beta$ is the precision parameter corresponding to the inverse variance of the distribution: $\beta^{-1}=\sigma^2$.
Training data:
- Input values $\mathbf x=(x_1,\cdots,x_N)^\text T$
- Target values $\mathbf t=(t_1,\cdots,t_N)^\text T$
Log likelihood function:
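The elided log likelihood is likely:

$$\ln p(\mathbf t\vert\mathbf x,\mathbf w,\beta)=-\frac\beta2\sum_{n=1}^N\{y(x_n,\mathbf w)-t_n\}^2+\frac N2\ln\beta-\frac N2\ln(2\pi)$$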
- The last two terms do not depend on $\bf w$ and can be omitted.
- Dividing by the positive constant $\beta$ does not alter $\mathbf w_{ML}$, so replace $\frac \beta 2$ with $\frac 1 2$.
So maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function $\sum\limits_{n=1}^N\{y(x_n,\mathbf w)-t_n\}^2$.
After obtaining $\mathbf w_{ML}$, we can further maximize the likelihood function w.r.t. $\beta$:
Substitute $\mathbf w_{ML}$ and $\beta_{ML}$ back to get predictive distribution:
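For reference, the standard results:

$$\frac1{\beta_{ML}}=\frac1N\sum_{n=1}^N\{y(x_n,\mathbf w_{ML})-t_n\}^2,\qquad p(t\vert x,\mathbf w_{ML},\beta_{ML})=\mathcal N\!\left(t\,\vert\,y(x,\mathbf w_{ML}),\beta_{ML}^{-1}\right)$$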
MAP: Maximum A Posteriori
Review: the Multivariate Gaussian:
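The elided prior is likely PRML's (1.65):

$$p(\mathbf w\vert\alpha)=\mathcal N\!\left(\mathbf w\,\vert\,\mathbf 0,\alpha^{-1}\mathbf I\right)=\left(\frac\alpha{2\pi}\right)^{(M+1)/2}\exp\left\{-\frac\alpha2\mathbf w^\text T\mathbf w\right\}$$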
where $\alpha$ is the precision of the distribution (a hyperparameter), and $M+1$ is the total number of elements in the vector $\bf w$ for an $M^\text{th}$-order polynomial.
Bayes’ Theorem: the posterior distribution for $\bf w$ is proportional to the product of the prior distribution and the likelihood function.
Taking the logarithm of the rhs, with the likelihood function and prior distribution given above, maximizing the posterior is equivalent to minimizing the following:
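The elided objective is likely:

$$\frac\beta2\sum_{n=1}^N\{y(x_n,\mathbf w)-t_n\}^2+\frac\alpha2\mathbf w^\text T\mathbf w$$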
Maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function, with regularization parameter $\lambda = \frac\alpha\beta$.
Bayesian Curve Fitting
Assume that the parameters $\alpha$ and $\beta$ are fixed and known in advance.
In Bayesian treatment, the predictive distribution can be written as:
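For reference, the standard decomposition (PRML 1.68):

$$p(t\vert x,\mathbf x,\mathbf t)=\int p(t\vert x,\mathbf w)\,p(\mathbf w\vert\mathbf x,\mathbf t)\,\text d\mathbf w$$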
$p(t\vert x,\mathbf w)$ is the likelihood function, with the dependence on $\alpha$ and $\beta$ omitted to simplify the notation.
$p(\mathbf w\vert\mathbf x,\mathbf t)$ is the posterior distribution which can be found by normalizing the rhs.
For problems such as curve-fitting, this posterior distribution is a Gaussian and can be evaluated analytically. Similarly, the integration can also be performed analytically.
The predictive distribution is given by a Gaussian of the form
where the mean and variance are given by
Here the matrix $\textbf S$ is given by
where $\textbf I$ is the unit matrix, and the vector $\phi(x)$ has elements $\phi_i(x)=x^i$ for $i=0,\cdots,M$.
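For reference, the elided equations are likely PRML's (1.69)–(1.72):

$$p(t\vert x,\mathbf x,\mathbf t)=\mathcal N\!\left(t\,\vert\,m(x),s^2(x)\right)$$

$$m(x)=\beta\,\phi(x)^\text T\mathbf S\sum_{n=1}^N\phi(x_n)t_n,\qquad s^2(x)=\beta^{-1}+\phi(x)^\text T\mathbf S\,\phi(x)$$

$$\mathbf S^{-1}=\alpha\mathbf I+\beta\sum_{n=1}^N\phi(x_n)\phi(x_n)^\text T$$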
- The variance and the mean depend on $x$.
- $\beta^{-1}$ represents the uncertainty in the predicted value of $t$, expressed in the maximum likelihood predictive distribution through $\beta^{-1}_{ML}$.
- $\phi(x)^\text T\textbf S\phi(x)$ arises from the uncertainty in $\mathbf w$ and is a consequence of the Bayesian treatment.
Model Selection
S-fold Cross Validation
leave-one-out: $S=N$
Curse of Dimensionality
For a polynomial of order $M$, the number of coefficients grows like $D^M$.
In spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface; similarly, most of the probability mass of a Gaussian is located within a thin shell at a specific radius.
Decision Theory
Confusion Matrix
| Truth \ Decision | Positive | Negative |
| --- | --- | --- |
| Positive | TP | FN |
| Negative | FP | TN |
Receiver Operating Characteristic Curve
Minimizing the misclassification rate
Minimizing the expected loss
Each $\mathbf x$ can be assigned independently to one of the decision regions $\mathcal R_j$.
The goal is to choose $\mathcal R_j$ to minimize $\mathbb E[L]$; that is, we should minimize $\sum\limits_{k} L_{kj}p(\mathbf x,\mathcal C_k)$ for each $\mathbf x$.
Common factor $p(\mathbf x)$ can be eliminated from $p(\mathbf x,\mathcal C_k)=p(\mathcal C_k\vert\mathbf x)p(\mathbf x)$.
Thus the decision rule that minimizes the expected loss is the one that assigns each new $\mathbf x$ to the class $j$ for which the quantity $\sum\limits_{k} L_{kj}p(\mathcal C_k\vert\mathbf x)$ is a minimum.
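For reference, the quantity being minimized:

$$\mathbb E[L]=\sum_k\sum_j\int_{\mathcal R_j}L_{kj}\,p(\mathbf x,\mathcal C_k)\,\text d\mathbf x$$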
Reject Option
Introducing a threshold $\theta$ and rejecting those inputs $x$ for which the largest of the posterior probabilities $p(\mathcal C_k\vert \mathbf x)$ is less than or equal to $\theta$.
- $\theta=1$: all examples are rejected.
- $\theta<\frac 1 K$: ($K$ classes) no examples are rejected.
Inference and Decision
Generative model: model the distribution of inputs as well as outputs; by sampling from it, synthetic data points can be generated in the input space.
Discriminative model: model the posterior probabilities directly.
Discriminant function: learn a function that maps inputs $\mathbf x$ directly into decisions.
3 Ways to Solve Decision Problems
(a) By inference, determine $p(\mathbf x\vert \mathcal C_k)$ and $p(\mathcal C_k)$, then uses Bayes’ Theorem
Equivalently, we can model the joint distribution $p(\mathbf x, \mathcal C_k)$ directly and then normalize to obtain the posterior probabilities. (generative model)
(b) By inference, determine the posterior class probabilities $p(\mathcal C_k\vert\mathbf x)$ directly. (discriminative model)
(c) Find a function $f(\mathbf x)$, called a discriminant function, which maps each input $\mathbf x$ directly onto a class label. Probabilities play no role.
Pros/Cons of the three approaches (extract from PRML)
Let us consider the relative merits of these three alternatives. Approach (a) is the most demanding because it involves finding the joint distribution over both $\mathbf x$ and $\mathcal C_k$.
However, if we only wish to make classification decisions, then it can be wasteful of computational resources, and excessively demanding of data, to find the joint distribution when in fact we only really need the posterior probabilities, which can be obtained directly through approach (b).
An even simpler approach is (c) in which we use the training data to find a discriminant function $f(\mathbf x)$ that maps each input directly onto a class label, combining the inference and decision stages into a single learning problem.
With option (c), however, we no longer have access to the posterior probabilities $p(\mathcal C_k\vert\mathbf x)$, which are needed for:
- Minimizing risk.
- Reject option.
- Compensating for class priors.
- Combining models.
Loss functions for regression
For each input $\bf x$, choose a specific estimate $y(\bf x)$ of the value $t$.
Take the squared loss $L(t,y({\bf x}))=\{y({\bf x})-t\}^2$ as an example.
Method 1. Calculus of Variations
Using the calculus of variations, choose $y({\bf x})$ so as to minimize $\mathbb E[L]$:
Solving for $y({\bf x})$, and using the sum and product rules of probability, we obtain the regression function:
which is the conditional average of $t$ conditioned on $\bf x$.
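The elided steps are likely the standard ones: setting the functional derivative to zero,

$$\frac{\delta\mathbb E[L]}{\delta y(\mathbf x)}=2\int\{y(\mathbf x)-t\}\,p(\mathbf x,t)\,\text dt=0\ \Longrightarrow\ y(\mathbf x)=\frac{\int t\,p(\mathbf x,t)\,\text dt}{p(\mathbf x)}=\int t\,p(t\vert\mathbf x)\,\text dt=\mathbb E_t[t\vert\mathbf x]$$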
Method 2. Expanding the square term
Substitute into the loss function and perform the integral on $t$:
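The elided decomposition is likely:

$$\mathbb E[L]=\int\{y(\mathbf x)-\mathbb E[t\vert\mathbf x]\}^2\,p(\mathbf x)\,\text d\mathbf x+\int\{\mathbb E[t\vert\mathbf x]-t\}^2\,p(\mathbf x,t)\,\text d\mathbf x\,\text dt$$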
When $y({\bf x})=\mathbb E_t[t\vert \mathbf x]$, loss function $\mathbb E[L]$ is minimized.
The second term represents the intrinsic variability of the target data and can be regarded as noise; it is the minimum achievable value of the loss function.
3 Ways to Solve Regression Problems (in order of decreasing complexity)
- (a) Inference to determine the joint density $p({\bf x},t)$, then normalize to find the conditional density $p(t\vert \mathbf x)$, finally marginalize to find the conditional mean by the regression function.
- (b) Inference to determine the conditional density $p(t\vert \mathbf x)$, then marginalize to find the conditional mean by the regression function.
- (c) Find a regression function $y(\mathbf x)$ directly from the training data.
The relative merits of these three approaches follow the same lines as for classification problems.
Minkowski loss
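For reference, the Minkowski loss:

$$\mathbb E[L_q]=\iint\vert y(\mathbf x)-t\vert^q\,p(\mathbf x,t)\,\text d\mathbf x\,\text dt$$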
The minimum of $\mathbb E[L_q]$ is given by:
- $q\rightarrow 0$: conditional mode
- $q=1$: conditional median
- $q=2$: conditional mean
Information Theory
Entropy (statistical)
Discrete
Allocate $N$ identical objects among $M$ bins.
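The elided derivation is likely the multiplicity argument: with $n_i$ objects in bin $i$,

$$\text W=\frac{N!}{\prod_i n_i!},\qquad \text H=\frac1N\ln\text W\ \xrightarrow{\ N\to\infty\ }\ -\sum_i p_i\ln p_i,\quad p_i=\frac{n_i}N$$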
Entropy maximized when $\forall i,\ p_i=\frac 1 M$.
Continuous
Put bins of width $\Delta$ along the real line.
For fixed $\sigma^2$, when $p(x)=\mathcal N(x\vert\mu,\sigma^2)$, differential entropy is maximized
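For reference, the maximum value is $\text H[x]=\frac12\left\{1+\ln(2\pi\sigma^2)\right\}$.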
Entropy
Conditional Entropy
Relative Entropy (The Kullback-Leibler Divergence)
KL divergence describes a distance between distributions $p$ and $q$, though it is not a true metric.
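For reference, the definition:

$$\text{KL}(p\|q)=-\int p(\mathbf x)\ln\left\{\frac{q(\mathbf x)}{p(\mathbf x)}\right\}\text d\mathbf x$$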
- It is not a symmetrical quantity: $\text{KL}(p\|q)\not\equiv \text{KL}(q\|p)$.
- $\text{KL}(p\|q)\ge 0$, with equality if and only if $p(\mathbf x)=q(\mathbf x)$.
Approximate distribution $p(\mathbf x)$ using distribution $q(\mathbf x\vert\boldsymbol\theta)$ governed by a set of adjustable parameters $\boldsymbol\theta$. One way to determine $\boldsymbol\theta$ is to minimize the KL divergence between the two distributions.
The 1st term is the negative log likelihood function for $\boldsymbol\theta$. The 2nd term is independent of $\boldsymbol\theta$. So minimizing this KL divergence is equivalent to maximizing the likelihood function.
Cross Entropy for Machine Learning
- Goal of ML: $p(\text{real data})\approx p(\text{model}\vert\theta)$
- We assume: $p(\text{training data})\approx p(\text{real data})$
- Operation of ML: $p(\text{training data})\approx p(\text{model}\vert\theta)$
Minimizing $\text{KL}(p(\text{training data})\|p(\text{model}\vert\theta))$ is equivalent to minimizing $\text C(p(\text{training data})\|p(\text{model}\vert\theta))$, as $\text H(p(\text{training data}))$ is fixed.
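This rests on the standard identity (with $\text C$ the cross entropy, as in the notation above):

$$\text{KL}(p\|q)=\text C(p\|q)-\text H(p),\qquad \text C(p\|q)=-\sum_xp(x)\ln q(x)$$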
Example. Bernoulli model
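The elided expression is likely the cross entropy (up to a constant factor):

$$\text C=-\sum_{n=1}^N\left\{t_n\ln\rho+(1-t_n)\ln(1-\rho)\right\}$$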
where $t_n$ is the training data and $\rho$ is the model parameter.
Example. Gaussian model
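Likewise, up to scaling and additive constants, minimizing the cross entropy here reduces to least squares:

$$\text C=\frac1{2\sigma^2}\sum_{n=1}^N(t_n-\mu)^2+\text{const}$$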
where $t_n$ is the training data and $\mu$ is the model parameter.
Mutual Information
Mutual information describes the degree of dependence between $\mathbf x$ and $\mathbf y$.
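For reference:

$$\text I[\mathbf x,\mathbf y]=\text{KL}\left(p(\mathbf x,\mathbf y)\,\|\,p(\mathbf x)p(\mathbf y)\right)=\text H[\mathbf x]-\text H[\mathbf x\vert\mathbf y]=\text H[\mathbf y]-\text H[\mathbf y\vert\mathbf x]$$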
Information Gain
Given $N$ balls and a balance, one of these balls is lighter.
$x$: one ball is lighter
$y$: weighing once
$\text H[x]$: uncertainty about the balls
$\text H[x\vert y]$: uncertainty about the balls after weighing once
Sum the equation above with $t=0,1,\cdots,T$,
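A hedged reconstruction of the argument (the original equations are missing): each weighing has three outcomes, so $\text H[y]\le\log_23$ bits and the information gain per weighing is $\text H[x]-\text H[x\vert y]\le\log_23$. Starting from $\text H[x]=\log_2N$, after $T$ weighings the residual uncertainty is at least $\log_2N-T\log_23$, so identifying the light ball requires $T\ge\log_3N$ weighings.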
Chapter 3 - Distributions
Binary Distributions
Bernoulli Distribution
Given $\mathcal D=\{x_1,\cdots,x_N\}$, $m$ heads, $N-m$ tails.
Binomial Distribution
Beta Distribution
The Beta distribution provides the conjugate prior for the Bernoulli distribution.
Given $\mathcal D=\{x_1,\cdots,x_N\}$, $m$ heads, $N-m$ tails.
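The elided posterior is likely:

$$p(\mu\vert m,N,a,b)\propto\mu^{m+a-1}(1-\mu)^{N-m+b-1},\qquad \mathbb E[\mu]=\frac{m+a}{N+a+b}$$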
As the size of the dataset $N$ increases, the posterior distribution becomes more sharply peaked, and its mean approaches the maximum likelihood estimate.
Multinomial Distributions
1-of-K coding scheme
1-of-K coding scheme: $\mathbf x=(0,0,1,0,0,0)^\text T,\ \sum\limits_{k=1}^Kx_k=1$.
where $\boldsymbol\mu=(\mu_1,\cdots,\mu_K)^\text T$, $\sum\limits_k\mu_k=1$.
Given $\mathcal D=\{\bf x_1,\cdots,x_N\}$, the likelihood
where $ m_k=\sum\limits_{n=1}^Nx_{nk}$ denotes the number of observations of $x_k = 1$.
Using the Lagrange multiplier method, we obtain
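presumably the standard result:

$$\mu_k^{\text{ML}}=\frac{m_k}N$$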
Multinomial Distribution
Dirichlet Distribution
Conjugate prior for the multinomial distribution.
Given $\mathcal D=\{m_1,\cdots,m_K\}$, the likelihood
where $\mathbf m=(m_1,\cdots,m_K)^\text T$.
We can interpret the parameters $\alpha_k$ of the Dirichlet prior as an effective number of observations of $x_k = 1$.
Gaussian Distributions
The Gaussian Distribution
The Multivariate Gaussian
Central Limit Theorem
The distribution of the sum of $N$ i.i.d. random variables becomes increasingly Gaussian as $N$ grows.
Moments of the Multivariate Gaussian
Properties of Gaussians
- Linear transformation
Product of Gaussian r.v.
For multivariate Gaussians, use $\Sigma$ to replace $\sigma^2$.
Bayes’ Theorem for Gaussian Variables (Slides)
Given $y=Ax+v$, $p(x)=\mathcal N(x\vert\mu,\Sigma)$, $p(v)=\mathcal N(v\vert 0,Q)$.
Hence we have:
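A hedged reconstruction of the elided results (standard linear-Gaussian identities; the gain matrix $K$ is an added notation, not from the slides):

$$p(y)=\mathcal N\!\left(y\,\vert\,A\mu,\ Q+A\Sigma A^\text T\right),\qquad p(x\vert y)=\mathcal N\!\left(x\,\vert\,\mu+K(y-A\mu),\ (I-KA)\Sigma\right)$$

where $K=\Sigma A^\text T\left(A\Sigma A^\text T+Q\right)^{-1}$.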
Conditional Gaussian distributions
If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.
Given a $D$-dimensional vector $\mathbf x$ with Gaussian distribution $\mathcal N(\mathbf x\vert\boldsymbol\mu,\boldsymbol\Sigma)$, partitioned into two disjoint subsets $\mathbf x_a$ and $\mathbf x_b$.
For the conditional distribution $p(\mathbf x_a\vert\mathbf x_b)$:
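Likely the standard formulas (PRML 2.81–2.82):

$$\boldsymbol\mu_{a\vert b}=\boldsymbol\mu_a+\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}(\mathbf x_b-\boldsymbol\mu_b),\qquad \boldsymbol\Sigma_{a\vert b}=\boldsymbol\Sigma_{aa}-\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}\boldsymbol\Sigma_{ba}$$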
And marginal distribution $p(\mathbf x_a)=\mathcal N(\mathbf x_a\vert\boldsymbol\mu_a,\boldsymbol\Sigma_{aa})$.
Bayes’ Theorem for Gaussian Variables (Textbook)
Given $p(\mathbf x)=\mathcal N(\mathbf x\vert\boldsymbol\mu,\boldsymbol\Lambda^{-1})$ and $p(\mathbf y\vert\mathbf x)=\mathcal N(\mathbf y\vert\mathbf A\mathbf x+\mathbf b,\mathbf L^{-1})$, obtain:
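A reconstruction of the standard results (PRML 2.113–2.117):

$$p(\mathbf y)=\mathcal N\!\left(\mathbf y\,\vert\,\mathbf A\boldsymbol\mu+\mathbf b,\ \mathbf L^{-1}+\mathbf A\boldsymbol\Lambda^{-1}\mathbf A^\text T\right)$$

$$p(\mathbf x\vert\mathbf y)=\mathcal N\!\left(\mathbf x\,\vert\,\boldsymbol\Sigma\{\mathbf A^\text T\mathbf L(\mathbf y-\mathbf b)+\boldsymbol\Lambda\boldsymbol\mu\},\ \boldsymbol\Sigma\right),\qquad \boldsymbol\Sigma=(\boldsymbol\Lambda+\mathbf A^\text T\mathbf L\mathbf A)^{-1}$$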
Maximum Likelihood for the Gaussian
Sufficient statistics: $\sum\limits_{n=1}^N\mathbf x_n$ and $\sum\limits_{n=1}^N\mathbf x_n\mathbf x_n^\text T$
Solve the maximum likelihood
Under the true distribution
Hence define
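The elided results are likely:

$$\boldsymbol\mu_{ML}=\frac1N\sum_{n=1}^N\mathbf x_n,\qquad \boldsymbol\Sigma_{ML}=\frac1N\sum_{n=1}^N(\mathbf x_n-\boldsymbol\mu_{ML})(\mathbf x_n-\boldsymbol\mu_{ML})^\text T$$

$$\mathbb E[\boldsymbol\Sigma_{ML}]=\frac{N-1}N\boldsymbol\Sigma\ \Longrightarrow\ \tilde{\boldsymbol\Sigma}=\frac1{N-1}\sum_{n=1}^N(\mathbf x_n-\boldsymbol\mu_{ML})(\mathbf x_n-\boldsymbol\mu_{ML})^\text T$$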
Sequential estimation
Bayesian Inference for the Gaussian
Student’s t-distribution
Periodic variables
Mixtures of Gaussians
$\pi_k$: mixing coefficients
$\sum\limits_{k=1}^K\pi_k=1,0\le\pi_k\le 1$
$\mathcal N(\mathbf x\vert\boldsymbol\mu_k,\boldsymbol\Sigma_k)$: component
Log likelihood of mixtures
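For reference:

$$\ln p(\mathbf X\vert\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\Sigma)=\sum_{n=1}^N\ln\left\{\sum_{k=1}^K\pi_k\,\mathcal N(\mathbf x_n\vert\boldsymbol\mu_k,\boldsymbol\Sigma_k)\right\}$$

The sum over $k$ inside the logarithm is what blocks a closed-form solution.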
The maximum likelihood solution for the parameters no longer has a closed-form analytical solution.
Alternatives: iterative numerical optimization, expectation maximization.
Exponential Families
The exponential family of distributions over $\mathbf x$, given parameters $\boldsymbol \eta$
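The elided definition is likely PRML's (2.194):

$$p(\mathbf x\vert\boldsymbol\eta)=h(\mathbf x)\,g(\boldsymbol\eta)\exp\left\{\boldsymbol\eta^\text T\mathbf u(\mathbf x)\right\}$$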
- $\bf x$: scalar or vector, discrete or continuous.
- $\boldsymbol\eta$: natural parameters
- $g(\boldsymbol\eta)$: normalization coefficient, satisfying the condition below
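$$g(\boldsymbol\eta)\int h(\mathbf x)\exp\left\{\boldsymbol\eta^\text T\mathbf u(\mathbf x)\right\}\text d\mathbf x=1$$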
The Bernoulli Distribution:
So
The Multinomial Distribution:
Let $\mu_M=1-\sum\limits_{k=1}^{M-1}\mu_k$,
The Gaussian Distribution:
where
Maximum likelihood and sufficient statistics
Likelihood function
Condition of maximum likelihood estimator $\boldsymbol\eta_\text{ML}$
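Likely the standard forms:

$$p(\mathbf X\vert\boldsymbol\eta)=\left(\prod_{n=1}^Nh(\mathbf x_n)\right)g(\boldsymbol\eta)^N\exp\left\{\boldsymbol\eta^\text T\sum_{n=1}^N\mathbf u(\mathbf x_n)\right\},\qquad -\nabla\ln g(\boldsymbol\eta_{\text{ML}})=\frac1N\sum_{n=1}^N\mathbf u(\mathbf x_n)$$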
The solution for the ML estimator only depends on $\sum\limits_{n=1}^N\mathbf u(\mathbf x_n)$, which is called the sufficient statistic of the distribution.
For example, the Bernoulli distribution has $\mathbf u(x) = x$, so we only keep the sum of the data points; the Gaussian distribution has $\mathbf u(x) = \left(\begin{matrix}x\\x^2\end{matrix}\right)$, so we keep both the sum of the data points and the sum of their squares.
Conjugate Priors
Seek a prior that is conjugate to the likelihood function: the posterior distribution has the same functional form as the prior.
For any member of the exponential family, there exists a conjugate prior
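Likely PRML's (2.229):

$$p(\boldsymbol\eta\vert\boldsymbol\chi,\nu)=f(\boldsymbol\chi,\nu)\,g(\boldsymbol\eta)^\nu\exp\left\{\nu\,\boldsymbol\eta^\text T\boldsymbol\chi\right\}$$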
where
- $f(\chi,\nu)$ is a normalization coefficient
- $g(\boldsymbol\eta)$ is the normalization coefficient of the exponential family.
Recall that
Multiplying the prior and the likelihood, the posterior is
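presumably:

$$p(\boldsymbol\eta\vert\mathbf X,\boldsymbol\chi,\nu)\propto g(\boldsymbol\eta)^{\nu+N}\exp\left\{\boldsymbol\eta^\text T\left(\sum_{n=1}^N\mathbf u(\mathbf x_n)+\nu\boldsymbol\chi\right)\right\}$$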
which is in the same functional form as the prior.
- $\nu$: effective number of pseudo-observations in the prior
- $\chi$: value of the sufficient statistic for the pseudo-observations
Non-informative Priors
Used when little is known about the prior distribution.
Have as little influence on the posterior distribution as possible.
Cons of choosing a constant as the prior
- If the domain of the parameter $\lambda$ is unbounded, the integral over $\lambda$ diverges, hence the prior cannot be normalized and is improper.
- By changing the variable (e.g., $\lambda=\eta^2$), the density over the new variable may not be constant.
Example: $\text{Gam}(\lambda\vert a_0,b_0)$ where $a_0=b_0=0$.
Non-parametric Methods
Limitation of parametric methods:
The chosen density might be a poor model of the distribution that generates the data; for example, multimodal data can never be captured by a unimodal Gaussian.
Histogram Methods
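The elided estimate is presumably the standard histogram density, where bin $i$ holds $n_i$ of the $N$ observations and has width $\Delta_i$:

$$p_i=\frac{n_i}{N\Delta_i}$$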
where $\Delta$ acts as a smoothing parameter.
Cons: Curse of Dimensionality
K-Nearest-Neighbors
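Presumably the standard form: grow a sphere around $\mathbf x$ until it contains $K$ points, so that

$$p(\mathbf x)\simeq\frac K{NV}$$

where $V$ is the volume of that sphere.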
Chapter 4 - Linear Models for Regression
Chapter 5 - Linear Models for Classification
Three Approaches to Classification
- Use discriminant functions directly
- Infer the posterior probabilities with generative models
- Directly construct posterior conditional class probabilities
Discriminant Functions
Discriminant Functions for $N$ classes
- one-vs-rest: use $N$ two-way discriminant functions
- one-vs-one: use $\frac{N(N-1)}{2}$ two-way discriminant functions
Least Square Classification
- Reduce classification to least squares regression.
- Treat each class as a separate problem, pick the max.
Least-squares regression is sensitive to outliers; logistic regression addresses this problem:
Fisher’s Linear Discriminants(LDA)
See the course material for Lab06: LDA
Perceptron
Training
Criterion: $E_P(\mathbf w)=-\sum\limits_{n\in\mathcal M}\mathbf w^\text T\boldsymbol\phi_nt_n$,
where $t_n\in\{1,-1\}$ is the label and $\mathcal M$ denotes the set of misclassified samples.
Update: $\mathbf w^{(\tau+1)}=\mathbf w^{(\tau)}+\eta\boldsymbol\phi_nt_n$,
where $\eta$ is the learning rate.
Each update reduces the error contribution of the misclassified pattern, and the perceptron convergence theorem guarantees convergence when the training set is linearly separable.
Simplified Training
- guaranteed to find a set of weights that gets the right answer on the whole training set, if any such set exists.
- no need to choose a learning rate.
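A minimal sketch of the training loop above (NumPy; the toy data, names, and the $\eta=1$ choice are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = np.where(X @ np.array([2.0, -1.0]) + 0.5 > 0, 1, -1)  # linearly separable labels

Phi = np.hstack([np.ones((len(X), 1)), X])  # absorb the bias into phi(x)
w = np.zeros(Phi.shape[1])

for _ in range(100):                        # epochs
    misclassified = 0
    for phi_n, t_n in zip(Phi, t):
        if t_n * (w @ phi_n) <= 0:          # wrong side of the boundary
            w += t_n * phi_n                # perceptron update with eta = 1
            misclassified += 1
    if misclassified == 0:                  # converged: every sample is correct
        break

print(w)
```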
Probabilistic Generative Models
where $z$ is called the logit.
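For reference, the standard two-class form:

$$p(\mathcal C_1\vert\mathbf x)=\sigma(z)=\frac1{1+\exp(-z)},\qquad z=\ln\frac{p(\mathbf x\vert\mathcal C_1)p(\mathcal C_1)}{p(\mathbf x\vert\mathcal C_2)p(\mathcal C_2)}$$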
K-Case Classification
Generative: ML Gaussian Mixtures
Likelihood $p\left(\boldsymbol{t}, \boldsymbol{X} \vert \pi, \mu_{1}, \mu_{2}, \Sigma\right)=\prod_{n=1}^{N}\left[\pi N\left(x_{n} \vert \mu_{1}, \Sigma\right)\right]^{t_{n}}\left[(1-\pi) N\left(x_{n} \vert \mu_{2}, \Sigma\right)\right]^{1-t_{n}}$
Generative: MAP Gaussian Mixtures
Probabilistic Discriminative Models
Bayesian Information Criterion
Chapter 6 - Neural Networks
Chapter 7 - Sparse Kernel Machines
Support Vector Machines
Model: $\quad y(\mathbf x) = \mathbf w^\text T \phi(\mathbf x) + b$
Distance of a point to the decision surface
Rescaling $\mathbf w$ and $b$, the point closest to the surface (active constraint) satisfies:
Then the classification problem becomes:
Hard Margin Classifier
The classification problem:
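The elided formulas presumably follow the standard derivation: the distance of a point is $\frac{t_ny(\mathbf x_n)}{\|\mathbf w\|}$, the closest point is rescaled to satisfy $t_n\left(\mathbf w^\text T\phi(\mathbf x_n)+b\right)=1$, and maximizing the margin becomes

$$\min_{\mathbf w,b}\ \frac12\|\mathbf w\|^2\quad\text{s.t.}\quad t_n\left(\mathbf w^\text T\phi(\mathbf x_n)+b\right)\ge1,\quad n=1,\dots,N$$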
Soft Margin Classifier
Introduce slack variables $\xi_n\ge 0, n=1,\dots,N$ (sometimes $\xi_n<0$ is allowed)
The classification problem:
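presumably the standard soft-margin program:

$$\min_{\mathbf w,b,\boldsymbol\xi}\ C\sum_{n=1}^N\xi_n+\frac12\|\mathbf w\|^2\quad\text{s.t.}\quad t_ny(\mathbf x_n)\ge1-\xi_n,\ \ \xi_n\ge0$$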
where $C>0$ is the trade-off between minimizing training errors and controlling model complexity.
SVMs and Logistic Regression
Multiclass SVMs
- One vs. Rest: $K$ separate SVMs
  - Can lead to inconsistent results
  - Imbalanced training sets
- One vs. One: $\frac{K(K-1)}{2}$ SVMs
  - dead zone
SVMs for Regression
(Covered very quickly; it took only a minute and a half.)
※Relevance Vector Machines
Given an input vector $\mathbf x$, the conditional distribution for a real-valued target variable $t$:
where $\beta=\sigma^{-2}$ is the noise precision.
The mean is given by a linear model, which can be written in an SVM-like form:
where $k(\mathbf x,\mathbf x_n)=\phi(\mathbf x)^\text T \phi(\mathbf x_n)$ is the kernel, and $b$ is a bias parameter.
Chapter 8 - Mixture Models and EM Learning
Chapter 9 - Sequential Data
Conditional data: independent of the previous states
Sequential data: dependent on the previous states
Markov models
first-order:
second-order:
Hidden Markov Models
For each observation $\mathbf x_n$, introduce latent variable $\mathbf z_n$, such that $\mathbf z_{n+1}\perp\!\!\!\perp \mathbf z_{n-1} \vert \mathbf z_n$
Using a 1-of-$K$ encoding, the latent variable is a $K$-dimensional binary vector. The transition probability matrix is $A_{jk}\equiv p(z_{n,k}=1\mid z_{n-1,j}=1)$. This matrix satisfies $\sum\limits_{k} A_{jk}=1$ and therefore has $K(K-1)$ independent parameters.
Chapter 10 - Markov Decision Process
Markov Decision Process
Given states $x$, actions $u$, transition probabilities $p(x'\mid u,x)$, reward function $r(x,u)$.
Expected cumulative reward:
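A hedged reconstruction (the discount factor $\gamma$ is an assumption, not shown in these notes):

$$R_T=\mathbb E\left[\sum_{\tau=1}^T\gamma^\tau r_\tau\right]$$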
- $T=1$: greedy
- $T>1$: finite horizon
- $T=\infty$: infinite horizon
Goal: optimal policy
T-Step Policy and Value Function