
CS329 Machine Learning Notes

Outline

  1. Introduction
  2. Preliminary
  3. Distributions
  4. Linear Models for Regression
  5. Linear Models for Classification
  6. Neural Networks
  7. Sparse Kernel Machines
  8. Mixture Models and EM Learning
  9. Sequential Data
  10. Markov Decision Process

Chapter 1 - Introduction

Concepts

  • Machine Learning—minimizing some loss function so that models generalize from data sets
  • Datasets—annotated, indexed, organized
  • Models—tree, distance, probabilistic, graph, bio-inspired
  • Optimization—algorithms that minimize the loss

Linear Optimization Model

Find $X^\star$ to minimize the loss function:

Euclidean Distance Optimization

Point $x_0$, model $\mathbf b^\text T\mathbf x=d$.

We have:

Lagrange optimization:

Convex Optimization

Unconstrained optimization

Gradient descent
Gauss-Newton Method

Use a second-order approximation to the function:

Choose $\Delta x$ to minimize above:

Descent direction:

Batch Gradient Descent

Minimize empirical loss, assuming it’s convex and unconstrained.

  • Gradient descent on the empirical loss
  • Gradient is the average of the gradient for all samples
  • Very slow when $n$ is very large
Stochastic Gradient Descent

Compute gradient from just one or a few samples
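A minimal NumPy sketch (illustrative, not from the slides) contrasting the batch gradient, which averages over all $n$ samples, with a stochastic step that uses a single sample:

```python
import numpy as np

# Least-squares loss L(w) = (1/2n) * ||X w - t||^2.
def batch_gradient(X, t, w):
    # Average gradient over all n samples: accurate but O(n) per step.
    return X.T @ (X @ w - t) / len(t)

def sgd_step(X, t, w, eta, rng):
    # Gradient from one random sample: noisy but cheap.
    i = rng.integers(len(t))
    return w - eta * X[i] * (X[i] @ w - t[i])

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
for _ in range(5000):
    w = sgd_step(X, t, w, eta=0.01, rng=rng)
print(w)  # approaches [1.0, -2.0, 0.5]
```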

Constrained optimization

  • Lagrange methods
  • Bayesian methods

Non-convex Optimization

  • Heuristic algorithms
  • Random search

Algorithms

  • Bayes
  • KNN
  • K-means
  • Decision Tree
  • SVM
  • Boosting
  • Ensemble Learning
  • Linear Statistical Learning
  • Nonlinear Statistical Learning
  • Deep Neural Networks
  • Generative Adversarial Networks
  • Bayesian Networks
  • Reinforcement Learning
  • Federated Learning

Chapter 2 - Preliminary

Curve Fitting and Regularization

Polynomial Curve Fitting

Sum-of-Squares Error Function

$$E(\mathbf w)=\frac 1 2\sum_{n=1}^N\{y(x_n,\mathbf w)-t_n\}^2$$

Root-Mean-Square (RMS) Error

$$E_\text{RMS}=\sqrt{2E(\mathbf w^\star)/N}$$

where $\mathbf w^\star=\mathop{\text{argmin}}\limits_{\bf w}\ E(\bf w)$.

Regularization

Discourage the coefficients from reaching large values by adding a penalty term:

$$\widetilde E(\mathbf w)=\frac 1 2\sum_{n=1}^N\{y(x_n,\mathbf w)-t_n\}^2+\frac\lambda 2\|\mathbf w\|^2$$

where $\|{\bf w}\|^2={\bf w}^\text T{\bf w}=w_0^2+w_1^2+\cdots+w_M^2$.

Note that the coefficient $w_0$ is often omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001); alternatively, it may be included but with its own regularization coefficient.

Probability Theory

Basic Concepts

  • Marginal/Joint/Conditional Probability

  • Sum/Product Rule

  • Probability Densities

  • Bayes’ Theorem

  • Prior/Posterior Probability

  • Transformed Densities

  • Expectations

  • Variances and Covariances

Gaussian Distribution

  • The Gaussian Distribution

  • Gaussian Mean and Variance

  • The Multivariate Gaussian

Likelihood Function

For a data set of independent observations $\mathbf x = (x_1,\cdots,x_N)^\text T$ from a Gaussian distribution, the likelihood function is

$$p(\mathbf x\vert\mu,\sigma^2)=\prod_{n=1}^N\mathcal N(x_n\vert\mu,\sigma^2)$$

Log Likelihood Function:

$$\ln p(\mathbf x\vert\mu,\sigma^2)=-\frac 1{2\sigma^2}\sum_{n=1}^N(x_n-\mu)^2-\frac N 2\ln\sigma^2-\frac N 2\ln(2\pi)$$

By maximizing the log likelihood function, we have

$$\mu_{ML}=\frac 1 N\sum_{n=1}^N x_n,\qquad \sigma^2_{ML}=\frac 1 N\sum_{n=1}^N(x_n-\mu_{ML})^2$$

corresponding to the sample mean and the sample variance.

Properties of $\mu_{ML}$ and $\sigma^2_{ML}$

The MLE obtains the correct mean in expectation but underestimates the true variance by a factor of $\frac{N-1}{N}$ (bias).
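Concretely, in PRML's notation:

$$\mathbb E[\mu_{ML}]=\mu,\qquad \mathbb E[\sigma^2_{ML}]=\left(\frac{N-1}{N}\right)\sigma^2$$

so the unbiased estimator rescales the ML variance: $\widetilde\sigma^2=\frac{N}{N-1}\sigma^2_{ML}=\frac 1{N-1}\sum\limits_{n=1}^N(x_n-\mu_{ML})^2$.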

Curve fitting re-visited

Given $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the value $y(x, \mathbf w)$.

where $\beta$ is the precision parameter corresponding to the inverse variance of the distribution: $\beta^{-1}=\sigma^2$.

Training data:

  • Input values $\mathbf x=(x_1,\cdots,x_N)^\text T$
  • Target values $\mathbf t=(t_1,\cdots,t_N)^\text T$

Likelihood function:

Log likelihood function:

  • The last two terms do not depend on $\bf w$ and can be omitted.
  • Dividing by the positive constant $\beta$ does not alter $\mathbf w_{ML}$, so replace $\frac \beta 2$ with $\frac 1 2$.

So maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function $\sum\limits_{n=1}^N\{y(x_n,\mathbf w)-t_n\}^2$.

After obtaining $\mathbf w_{ML}$, we can further maximize the likelihood function w.r.t. $\beta$:

Substitute $\mathbf w_{ML}$ and $\beta_{ML}$ back to get predictive distribution:
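In PRML's notation, the noise precision estimate and the resulting predictive distribution are:

$$\frac 1{\beta_{ML}}=\frac 1 N\sum_{n=1}^N\{y(x_n,\mathbf w_{ML})-t_n\}^2$$

$$p(t\vert x,\mathbf w_{ML},\beta_{ML})=\mathcal N(t\vert y(x,\mathbf w_{ML}),\beta_{ML}^{-1})$$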

MAP: Maximum A Posteriori

The prior distribution:

Review: the Multivariate Gaussian:

where $\alpha$ is the precision of the distribution (a hyperparameter), and $M+1$ is the total number of elements in the vector $\bf w$ for an $M^\text{th}$-order polynomial.

Bayes’ Theorem: the posterior distribution for $\bf w$ is proportional to the product of the prior distribution and the likelihood function.

The posterior distribution:

Taking the logarithm of the RHS and substituting the likelihood function and prior distribution given above, maximizing the posterior is equivalent to minimizing the following:

Maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function, with regularization parameter $\lambda = \frac\alpha\beta$.
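In PRML's notation, the prior, the posterior, and the resulting objective are:

$$p(\mathbf w\vert\alpha)=\mathcal N(\mathbf w\vert\mathbf 0,\alpha^{-1}\mathbf I)=\left(\frac\alpha{2\pi}\right)^{(M+1)/2}\exp\left\{-\frac\alpha 2\mathbf w^\text T\mathbf w\right\}$$

$$p(\mathbf w\vert\mathbf x,\mathbf t,\alpha,\beta)\propto p(\mathbf t\vert\mathbf x,\mathbf w,\beta)\,p(\mathbf w\vert\alpha)$$

$$\frac\beta 2\sum_{n=1}^N\{y(x_n,\mathbf w)-t_n\}^2+\frac\alpha 2\mathbf w^\text T\mathbf w$$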

Bayesian Curve Fitting

Assume that the parameters $\alpha$ and $\beta$ are fixed and known in advance.

In the Bayesian treatment, the predictive distribution can be written as:

$p(t\vert x,\mathbf w)$ is the likelihood function, with the dependence on $\alpha$ and $\beta$ omitted for brevity.

$p(\mathbf w\vert\mathbf x,\mathbf t)$ is the posterior distribution, which can be found by normalizing the RHS.

For problems such as curve-fitting, this posterior distribution is a Gaussian and can be evaluated analytically. Similarly, the integration can also be performed analytically.

The predictive distribution is given by a Gaussian of the form

$$p(t\vert x,\mathbf x,\mathbf t)=\mathcal N(t\vert m(x),s^2(x))$$

where the mean and variance are given by

$$m(x)=\beta\phi(x)^\text T\mathbf S\sum_{n=1}^N\phi(x_n)t_n,\qquad s^2(x)=\beta^{-1}+\phi(x)^\text T\mathbf S\phi(x)$$

Here the matrix $\textbf S$ is given by

$$\mathbf S^{-1}=\alpha\mathbf I+\beta\sum_{n=1}^N\phi(x_n)\phi(x_n)^\text T$$

where $\textbf I$ is the identity matrix, and the vector $\phi(x)$ has elements $\phi_i(x)=x^i$ for $i=0,\cdots,M$.

  • The variance and the mean depend on $x$.
  • $\beta^{-1}$ represents the uncertainty in the predicted value of $t$, expressed in the maximum likelihood predictive distribution through $\beta^{-1}_{ML}$.
  • $\phi(x)^\text T\textbf S\phi(x)$ arises from the uncertainty in $\mathbf w$ and is a consequence of the Bayesian treatment.
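A minimal NumPy sketch of these formulas (assuming $\phi_i(x)=x^i$ and fixed $\alpha$, $\beta$; values are illustrative):

```python
import numpy as np

def phi(x, M):
    # Polynomial features phi_i(x) = x^i for i = 0..M.
    return np.array([x**i for i in range(M + 1)])

def bayesian_fit(xs, ts, M, alpha, beta):
    Phi = np.stack([phi(x, M) for x in xs])            # (N, M+1) design matrix
    S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
    def predict(x):
        px = phi(x, M)
        mean = beta * px @ S @ Phi.T @ ts              # m(x)
        var = 1.0 / beta + px @ S @ px                 # s^2(x)
        return mean, var
    return predict

xs = np.linspace(0, 1, 10)
ts = np.sin(2 * np.pi * xs) + 0.1 * np.random.default_rng(0).normal(size=10)
predict = bayesian_fit(xs, ts, M=5, alpha=5e-3, beta=11.1)
print(predict(0.5))  # predictive mean and variance at x = 0.5
```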

Model Selection

S-fold Cross Validation


leave-one-out: $S=N$

Curse of Dimensionality

For a polynomial of order $M$, the growth in the number of coefficients is $D^M$.

In spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface, and most of the probability mass of a Gaussian lies within a thin shell at a specific radius.


Decision Theory

Confusion Matrix

  Truth \ Decision   Positive   Negative
  Positive           TP         FN
  Negative           FP         TN


Receiver Operating Characteristic Curve


Minimizing the misclassification rate


Minimizing the expected loss

Each $\mathbf x$ can be assigned independently to one of the decision regions $\mathcal R_j$.

The goal is to choose $\mathcal R_j$ to minimize $\mathbb E[L]$; for each $\mathbf x$ we should minimize $\sum\limits_{k} L_{kj}p(\mathbf x,\mathcal C_k)$.

Common factor $p(\mathbf x)$ can be eliminated from $p(\mathbf x,\mathcal C_k)=p(\mathcal C_k\vert\mathbf x)p(\mathbf x)$.

Thus the decision rule that minimizes the expected loss is the one that assigns each new $\mathbf x$ to the class $j$ for which the quantity $\sum\limits_{k} L_{kj}p(\mathcal C_k\vert\mathbf x)$ is a minimum.
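A toy numeric illustration (numbers invented): even when $p(\mathcal C_1\vert\mathbf x)$ is largest, an asymmetric loss matrix can make deciding $\mathcal C_2$ optimal.

```python
import numpy as np

L = np.array([[0.0, 1.0],   # L[k, j]: loss of deciding j when the truth is k
              [5.0, 0.0]])  # missing class 2 costs 5x a false alarm
posterior = np.array([0.8, 0.2])  # p(C_k | x)
risks = posterior @ L             # sum_k L[k, j] p(C_k | x) for each decision j
print(risks, "-> decide class", risks.argmin() + 1)  # [1.0, 0.8] -> class 2
```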

Reject Option

Introduce a threshold $\theta$ and reject those inputs $\mathbf x$ for which the largest of the posterior probabilities $p(\mathcal C_k\vert \mathbf x)$ is less than or equal to $\theta$.

  • $\theta=1$: all examples are rejected.
  • $\theta<\frac 1 K$: ($K$ classes) no examples are rejected.


Inference and Decision

Generative model: model the distribution of inputs as well as outputs; by sampling from it, it is possible to generate synthetic data points in the input space.

Discriminative model: model the posterior probabilities directly.

Discriminant function: Learn a function that maps inputs x directly into decisions.

3 Ways to Solve Decision Problems

  • (a) By inference, determine $p(\mathbf x\vert \mathcal C_k)$ and $p(\mathcal C_k)$, then use Bayes’ Theorem

    Equivalently, we can model the joint distribution $p(\mathbf x, \mathcal C_k)$ directly and then normalize to obtain the posterior probabilities. (generative model)

  • (b) By inference, determine the posterior class probabilities $p(\mathcal C_k\vert\mathbf x)$ directly. (discriminative model)

  • (c) Find a function $f(x)$, called a discriminant function, which maps each input $\mathbf x$ directly onto a class label. Probabilities play no role.

Pros/Cons of the three approaches (extract from PRML)

Let us consider the relative merits of these three alternatives. Approach (a) is the most demanding because it involves finding the joint distribution over both x and Ck. For many applications, x will have high dimensionality, and consequently we may need a large training set in order to be able to determine the class-conditional densities to reasonable accuracy. Note that the class priors p(Ck) can often be estimated simply from the fractions of the training set data points in each of the classes. One advantage of approach (a), however, is that it also allows the marginal density of data p(x) to be determined from $p(\mathbf x)=\sum_k p(\mathbf x\vert\mathcal C_k)p(\mathcal C_k)$. This can be useful for detecting new data points that have low probability under the model and for which the predictions may be of low accuracy, which is known as outlier detection or novelty detection.

However, if we only wish to make classification decisions, then it can be wasteful of computational resources, and excessively demanding of data, to find the joint distribution p(x,Ck) when in fact we only really need the posterior probabilities p(Ck|x), which can be obtained directly through approach (b). Indeed, the class-conditional densities may contain a lot of structure that has little effect on the posterior probabilities, as illustrated in Figure 1.27. There has been much interest in exploring the relative merits of generative and discriminative approaches to machine learning, and in finding ways to combine them.

[Figure 1.27 of PRML: class-conditional densities and the corresponding posterior probabilities]

An even simpler approach is (c) in which we use the training data to find a discriminant function f(x) that maps each x directly onto a class label, thereby combining the inference and decision stages into a single learning problem. In the example of Figure 1.27, this would correspond to finding the value of x shown by the vertical green line, because this is the decision boundary giving the minimum probability of misclassification.

With option (c), however, we no longer have access to the posterior probabilities p(Ck|x). There are many powerful reasons for wanting to compute the posterior probabilities, even if we subsequently use them to make decisions. These include:

  • Minimizing risk.
  • Reject option.
  • Compensating for class priors.
  • Combining models.

Loss functions for regression

For each input $\bf x$, choose a specific estimation $y(\bf x)$ of the value $t$.

Take the squared loss $L(t,y({\bf x}))=\{y({\bf x})-t\}^2$ as an example.

Method 1. Calculus of Variations

Use the calculus of variations to choose $y({\bf x})$ so as to minimize $\mathbb E[L]$:

Solving for $y({\bf x})$, and using the sum and product rules of probability, we obtain the regression function:

which is the conditional average of $t$ conditioned on $\bf x$.

Method 2. Expanding the square term

Substitute into the loss function and perform the integral on $t$:

When $y({\bf x})=\mathbb E_t[t\vert \mathbf x]$, loss function $\mathbb E[L]$ is minimized.

The second term represents the intrinsic variability of the target data and can be regarded as noise; it is the minimum achievable value of the loss function.
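Written out, PRML's decomposition of the expected squared loss is:

$$\mathbb E[L]=\int\{y(\mathbf x)-\mathbb E[t\vert\mathbf x]\}^2p(\mathbf x)\,\text d\mathbf x+\int\{\mathbb E[t\vert\mathbf x]-t\}^2p(\mathbf x,t)\,\text d\mathbf x\,\text dt$$

The first term vanishes when $y(\mathbf x)=\mathbb E[t\vert\mathbf x]$; the second does not depend on $y$.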

3 Ways to Solve Regression Problems (in order of decreasing complexity)

  • (a) Inference to determine the joint density $p({\bf x},t)$, then normalize to find the conditional density $p(t\vert \mathbf x)$, finally marginalize to find the conditional mean by the regression function.
  • (b) Inference to determine the conditional density $p(t\vert \mathbf x)$, then marginalize to find the conditional mean by the regression function.
  • (c) Find a regression function $y(\mathbf x)$ directly from the training data.

The relative merits of these three approaches follow the same lines as for classification problems.

Minkowski loss


The minimum of $\mathbb E[L_q]$ is given by:

  • $q\rightarrow 0$: conditional mode
  • $q=1$: conditional median
  • $q=2$: conditional mean

Information Theory

Entropy(statistical)

Discrete

Allocate $N$ identical objects into $M$ bins.

Entropy maximized when $\forall i,\ p_i=\frac 1 M$.

Continuous

Put bins of width $\Delta$ along the real line.

For fixed $\sigma^2$, the differential entropy is maximized when $p(x)=\mathcal N(x\vert\mu,\sigma^2)$.

Entropy

Conditional Entropy

Relative Entropy (The Kullback-Leibler Divergence)

KL divergence measures the dissimilarity between distributions $p$ and $q$.

  1. It is not a symmetrical quantity: $\text{KL}(p\|q)\not\equiv \text{KL}(q\|p)$.
  2. $\text {KL}(p\|q)\ge 0$, with equality iff $p(\mathbf x)=q(\mathbf x)$.

Approximate distribution $p(\mathbf x)$ using distribution $q(\mathbf x\vert\boldsymbol\theta)$ governed by a set of adjustable parameters $\boldsymbol\theta$. One way to determine $\boldsymbol\theta$ is to minimize the KL divergence between the two distributions.
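In PRML's notation, the divergence and its sample-based approximation (for draws $\mathbf x_n\sim p$) are:

$$\text{KL}(p\|q)=-\int p(\mathbf x)\ln\left\{\frac{q(\mathbf x\vert\boldsymbol\theta)}{p(\mathbf x)}\right\}\text d\mathbf x\simeq\frac 1 N\sum_{n=1}^N\{-\ln q(\mathbf x_n\vert\boldsymbol\theta)+\ln p(\mathbf x_n)\}$$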

The 1st term is the negative log likelihood function for $\boldsymbol\theta$. The 2nd term is independent of $\boldsymbol\theta$. So minimizing this KL divergence is equivalent to maximizing the likelihood function.

Cross Entropy for Machine Learning

  • Goal of ML: $p(\text{real data})\approx p(\text{model}\vert\theta)$
  • We assume: $p(\text{training data})\approx p(\text{real data})$
  • Operation of ML: $p(\text{training data})\approx p(\text{model}\vert\theta)$

Minimizing $\text{KL}(p(\text{training data})\|p(\text{model}\vert\theta))$ is equivalent to minimizing the cross entropy $\text C(p(\text{training data})\|p(\text{model}\vert\theta))$, as $\text H(p(\text{training data}))$ is fixed.

Example. Bernoulli model

where $t_n$ is the training data and $\rho$ is the model parameter.

Example. Gaussian model

where $t_n$ is the training data and $\mu$ is the model parameter.

Mutual Information

Mutual information describes the degree of dependence between $\mathbf x$ and $\mathbf y$.

Information Gain

Given $N$ balls and a balance, one of these balls is lighter.

$x$: which ball is lighter

$y$: the outcome of one weighing

$\text H[x]$: uncertainty about the balls

$\text H[x\vert y]$: uncertainty about the balls after weighing once

Sum the equation above over $t=0,1,\cdots,T$.

Chapter 3 - Distributions

Binary Distributions

Bernoulli Distribution

Given $\mathcal D=\{x_1,\cdots,x_N\}$, $m$ heads, $N-m$ tails.

Binomial Distribution

Beta Distribution

The Beta distribution provides the conjugate prior for the Bernoulli distribution.

Given $\mathcal D=\{x_1,\cdots,x_N\}$, $m$ heads, $N-m$ tails.

As the size of the dataset $N$ increases, the posterior mean approaches the maximum likelihood estimate $m/N$, and the posterior variance shrinks.
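In PRML's notation, with $l=N-m$, the Beta prior, the posterior, and the predictive probability are:

$$\text{Beta}(\mu\vert a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$$

$$p(\mu\vert m,l,a,b)=\frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1}$$

$$p(x=1\vert\mathcal D)=\frac{m+a}{m+a+l+b}$$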

Multinomial Distributions

1-of-K coding scheme

1-of-K coding scheme: $\mathbf x=(0,0,1,0,0,0)^\text T,\ \sum\limits_{k=1}^Kx_k=1$.

where $\boldsymbol\mu=(\mu_1,\cdots,\mu_K)^\text T$, $\sum\limits_k\mu_k=1$.

Given $\mathcal D=\{\bf x_1,\cdots,x_N\}$, the likelihood

where $m_k=\sum\limits_{n=1}^Nx_{nk}$ denotes the number of observations of $x_k = 1$.

Using the Lagrange method, we obtain

Multinomial Distribution

Dirichlet Distribution

Conjugate prior for the multinomial distribution.

Given $\mathcal D=\{m_1,\cdots,m_K\}$, the likelihood

where $\mathbf m=(m_1,\cdots,m_K)^\text T$.

We can interpret the parameters $\alpha_k$ of the Dirichlet prior as an effective number of observations of $x_k = 1$.
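In PRML's notation, with $\alpha_0=\sum_k\alpha_k$, the Dirichlet prior and the posterior are:

$$\text{Dir}(\boldsymbol\mu\vert\boldsymbol\alpha)=\frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^K\mu_k^{\alpha_k-1},\qquad p(\boldsymbol\mu\vert\mathcal D,\boldsymbol\alpha)=\text{Dir}(\boldsymbol\mu\vert\boldsymbol\alpha+\mathbf m)$$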

Gaussian Distributions

  • The Gaussian Distribution

  • The Multivariate Gaussian

Central Limit Theorem

The distribution of the sum of $N$ i.i.d. random variables becomes increasingly Gaussian as $N$ grows.

Moments of the Multivariate Gaussian

Properties of Gaussians

  1. Linear transformation
  2. Product of Gaussian r.v.

    For multivariate Gaussians, replace $\sigma^2$ with $\boldsymbol\Sigma$.

Bayes’ Theorem for Gaussian Variables (Slides)

Given $y=Ax+v$, $p(x)=\mathcal N(x\vert\mu,\Sigma)$, $p(v)=\mathcal N(v\vert 0,Q)$.

Hence we have

Therefore

where

Conditional Gaussian distributions

If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.

Given a $D$-dimensional vector $\mathbf x$ with Gaussian distribution $\mathcal N(\mathbf x\vert\boldsymbol\mu,\boldsymbol\Sigma)$, partitioned into two disjoint subsets $\mathbf x_a$ and $\mathbf x_b$.

For the conditional distribution,

$$p(\mathbf x_a\vert\mathbf x_b)=\mathcal N(\mathbf x_a\vert\boldsymbol\mu_{a\vert b},\boldsymbol\Sigma_{a\vert b})$$

$$\boldsymbol\mu_{a\vert b}=\boldsymbol\mu_a+\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}(\mathbf x_b-\boldsymbol\mu_b),\qquad \boldsymbol\Sigma_{a\vert b}=\boldsymbol\Sigma_{aa}-\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}\boldsymbol\Sigma_{ba}$$

And marginal distribution $p(\mathbf x_a)=\mathcal N(\mathbf x_a\vert\boldsymbol\mu_a,\boldsymbol\Sigma_{aa})$.

Bayes’ Theorem for Gaussian Variables (Textbook)

Given

$$p(\mathbf x)=\mathcal N(\mathbf x\vert\boldsymbol\mu,\boldsymbol\Lambda^{-1}),\qquad p(\mathbf y\vert\mathbf x)=\mathcal N(\mathbf y\vert\mathbf A\mathbf x+\mathbf b,\mathbf L^{-1})$$

Obtain

$$p(\mathbf y)=\mathcal N(\mathbf y\vert\mathbf A\boldsymbol\mu+\mathbf b,\ \mathbf L^{-1}+\mathbf A\boldsymbol\Lambda^{-1}\mathbf A^\text T)$$

$$p(\mathbf x\vert\mathbf y)=\mathcal N(\mathbf x\vert\boldsymbol\Sigma\{\mathbf A^\text T\mathbf L(\mathbf y-\mathbf b)+\boldsymbol\Lambda\boldsymbol\mu\},\boldsymbol\Sigma)$$

where

$$\boldsymbol\Sigma=(\boldsymbol\Lambda+\mathbf A^\text T\mathbf L\mathbf A)^{-1}$$

Maximum Likelihood for the Gaussian

Sufficient statistics: $\sum\limits_{n=1}^N\mathbf x_n$ and $\sum\limits_{n=1}^N\mathbf x_n\mathbf x_n^\text T$

Solve the maximum likelihood

Under the true distribution

Hence define
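In PRML's notation, the ML estimators, their expectations under the true distribution, and the unbiased covariance estimate are:

$$\boldsymbol\mu_{ML}=\frac 1 N\sum_{n=1}^N\mathbf x_n,\qquad \boldsymbol\Sigma_{ML}=\frac 1 N\sum_{n=1}^N(\mathbf x_n-\boldsymbol\mu_{ML})(\mathbf x_n-\boldsymbol\mu_{ML})^\text T$$

$$\mathbb E[\boldsymbol\mu_{ML}]=\boldsymbol\mu,\qquad \mathbb E[\boldsymbol\Sigma_{ML}]=\frac{N-1}{N}\boldsymbol\Sigma$$

$$\widetilde{\boldsymbol\Sigma}=\frac 1{N-1}\sum_{n=1}^N(\mathbf x_n-\boldsymbol\mu_{ML})(\mathbf x_n-\boldsymbol\mu_{ML})^\text T$$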

Sequential estimation

The Robbins-Monro Algorithm

Bayesian Inference for the Gaussian

Student’s t-distribution

Periodic variables

Mixtures of Gaussians

  • $\pi_k$: mixing coefficients

    $\sum\limits_{k=1}^K\pi_k=1,0\le\pi_k\le 1$

  • $\mathcal N(\mathbf x\vert\boldsymbol\mu_k,\boldsymbol\Sigma_k)$: component

Log likelihood of mixtures

The maximum likelihood solution for the parameters no longer has a closed-form analytical solution.

Alternatives: iterative numerical optimization, expectation maximization.

Exponential Families

The exponential family of distributions over $\mathbf x$, given parameters $\boldsymbol \eta$

  • $\bf x$: scalar or vector, discrete or continuous.
  • $\boldsymbol\eta$: natural parameters
  • $g(\boldsymbol\eta)$: normalization coefficient, satisfying the condition below
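In PRML's notation, the family and the normalization condition are:

$$p(\mathbf x\vert\boldsymbol\eta)=h(\mathbf x)g(\boldsymbol\eta)\exp\{\boldsymbol\eta^\text T\mathbf u(\mathbf x)\},\qquad g(\boldsymbol\eta)\int h(\mathbf x)\exp\{\boldsymbol\eta^\text T\mathbf u(\mathbf x)\}\,\text d\mathbf x=1$$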

The Bernoulli Distribution:

So

The Multinomial Distribution:

Let $\mu_M=1-\sum\limits_{k=1}^{M-1}\mu_k$,

The Gaussian Distribution:

where

Maximum likelihood and sufficient statistics

Likelihood function

Condition of maximum likelihood estimator $\boldsymbol\eta_\text{ML}$

The solution for the ML estimator only depends on $\sum\limits_{n=1}^N\mathbf u(\mathbf x_n)$, which is called the sufficient statistic of the distribution.

For example,

For the Bernoulli distribution, $\mathbf u(x) = x$, so we only need to keep the sum of the data points; for the Gaussian distribution, $\mathbf u(x) = \left(\begin{matrix}x\\x^2\end{matrix}\right)$, so we need to keep both the sum of the data points and the sum of their squares.

Conjugate Priors

Seek a prior that is conjugate to the likelihood function: the posterior distribution has the same functional form as the prior.

For any member of the exponential family, there exists a conjugate prior

where

  • $f(\chi,\nu)$ is a normalization coefficient
  • $g(\boldsymbol\eta)$ is the normalization coefficient of the exponential family.

Recall that

Multiplying the prior by the likelihood, the posterior is

which is in the same functional form as the prior.

  • $\nu$: effective number of pseudo-observations in the prior
  • $\chi$: value of the sufficient statistic for the pseudo-observations

Non-informative Priors

Used when little is known about the prior distribution.

Have as little influence on the posterior distribution as possible.

Cons of choosing a constant as the prior

  • If the domain of the parameter $\lambda$ is unbounded, the integral over $\lambda$ diverges, hence the prior cannot be normalized and is improper.
  • By changing the variable (e.g., $\lambda=\eta^2$), the density over the new variable may not be constant.

Example: $\text{Gam}(\lambda\vert a_0,b_0)$ where $a_0=b_0=0$.

Non-parametric Methods

Limitation of parametric methods:

The chosen density might be a poor model of the distribution that generates the data. Multimodal data can never be captured by a unimodal Gaussian.

Histogram Methods

where $\Delta$ acts as a smoothing parameter.

Cons: Curse of Dimensionality

K-Nearest-Neighbors

Chapter 4 - Linear Models for Regression

Chapter 5 - Linear Models for Classification

Three Approaches to Classification

  • Use discriminant functions directly
  • Infer the posterior probabilities with generative models
  • Directly construct posterior conditional class probabilities

Discriminant Functions

Discriminant Functions for $N$ classes

  • use $N$ two-way discriminant functions
  • use $\frac{N(N-1)}{2}$ two-way discriminant functions

Least Square Classification

  • Reduce classification to least squares regression.
  • Treat each class as a separate problem, pick the max.

Least-squares regression is sensitive to outliers; logistic regression addresses this problem.


Fisher’s Linear Discriminants (LDA)

See the course material for Lab06: LDA

Perceptron

Training

Criterion:

$$E_P(\mathbf w)=-\sum_{n\in\mathcal M}\mathbf w^\text T\phi(\mathbf x_n)t_n$$

where $t_n\in\{1,-1\}$ is the label and $\mathcal M$ is the set of misclassified samples. Stochastic gradient descent gives the update

$$\mathbf w^{(\tau+1)}=\mathbf w^{(\tau)}+\eta\,\phi(\mathbf x_n)t_n$$

where $\eta$ is the learning rate.

indicating the convergence of perceptron training.
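A minimal NumPy sketch of this training loop (illustrative; uses raw inputs with a bias feature instead of $\phi(\mathbf x)$):

```python
import numpy as np

def perceptron(X, t, eta=1.0, epochs=100):
    # Update w by eta * t_n * x_n whenever sample n is misclassified,
    # i.e. whenever t_n * (w @ x_n) <= 0.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:
                w += eta * t_n * x_n
                errors += 1
        if errors == 0:        # converged: every sample classified correctly
            break
    return w

# Toy linearly separable data; the last column is a constant bias feature.
X = np.array([[1, 2, 1], [2, 3, 1], [-1, -2, 1], [-2, -1, 1]], dtype=float)
t = np.array([1, 1, -1, -1])
print(perceptron(X, t))
```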

Simplified Training

  • guaranteed to find a set of weights that gets the right answer on the whole training set, if any such set exists.
  • no need to choose a learning rate.

Probabilistic Generative Models

where $z$ is called the logit.

K-Case Classification

Generative: ML Gaussian Mixtures

Likelihood $p\left(\boldsymbol{t}, \boldsymbol{X} \vert \pi, \mu_{1}, \mu_{2}, \Sigma\right)=\prod_{n=1}^{N}\left[\pi N\left(x_{n} \vert \mu_{1}, \Sigma\right)\right]^{t_{n}}\left[(1-\pi) N\left(x_{n} \vert \mu_{2}, \Sigma\right)\right]^{1-t_{n}}$

Generative: MAP Gaussian Mixtures

Probabilistic Discriminative Models

Bayesian Information Criterion

Chapter 6 - Neural Networks

Chapter 7 - Sparse Kernel Machines

Support Vector Machines

Model: $\quad y(\mathbf x) = \mathbf w^\text T \phi(\mathbf x) + b$

Distance of a point to the decision surface: $\frac{t_ny(\mathbf x_n)}{\Vert\mathbf w\Vert}$

Rescaling $\mathbf w$ and $b$, the point closest to the surface (active constraint) satisfies $t_n(\mathbf w^\text T\phi(\mathbf x_n)+b)=1$.

Then the classification problem becomes:

Hard Margin Classifier

The classification problem:

$$\min_{\mathbf w,b}\ \frac 1 2\Vert\mathbf w\Vert^2\qquad\text{s.t.}\quad t_n(\mathbf w^\text T\phi(\mathbf x_n)+b)\ge 1,\quad n=1,\dots,N$$

Soft Margin Classifier

Introduce slack variables $\xi_n\ge 0, n=1,\dots,N$ (sometimes $\xi_n<0$ is allowed)

The classification problem:

$$\min_{\mathbf w,b,\boldsymbol\xi}\ \frac 1 2\Vert\mathbf w\Vert^2+C\sum_{n=1}^N\xi_n\qquad\text{s.t.}\quad t_n(\mathbf w^\text T\phi(\mathbf x_n)+b)\ge 1-\xi_n,\quad \xi_n\ge 0$$

where $C>0$ is the trade-off between minimizing training errors and controlling model complexity.
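An illustrative soft-margin fit with scikit-learn (assuming sklearn is available; data invented):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
t = np.array([-1] * 50 + [1] * 50)

clf = SVC(C=1.0, kernel="rbf")  # C: trade-off between training errors and complexity
clf.fit(X, t)
print(clf.n_support_)           # number of support vectors per class
```

Smaller $C$ tolerates more margin violations (larger $\xi_n$); larger $C$ approaches the hard-margin classifier.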

SVMs and Logistic Regression

Multiclass SVMs

  1. One vs. Rest: $K$ separate SVMs
    • Can lead to inconsistent results
    • Imbalanced training sets
  2. One vs. One: $\frac{K(K-1)}{2}$ SVMs
    • dead zone

SVMs for Regression

(This part was covered very quickly, in only about a minute and a half.)

※Relevance Vector Machines

Given an input vector $\mathbf x$, the conditional distribution for a real-valued target variable $t$:

where $\beta=\sigma^{-2}$ is the noise precision.

The mean is given by a linear model, which can be written in an SVM-like form:

where $k(\mathbf x,\mathbf x_n)=\phi(\mathbf x)^\text T \phi(\mathbf x_n)$ is the kernel, and $b$ is a bias parameter.

Chapter 8 - Mixture Models and EM Learning

Chapter 9 - Sequential Data

Conditional data: independent of the previous states

Sequential data: dependent on the previous states

Markov models

  • first-order:

  • second-order:

Hidden Markov Models

For each observation $\mathbf x_n$, introduce latent variable $\mathbf z_n$, such that $\mathbf z_{n+1}\perp\!\!\!\perp \mathbf z_{n-1} \vert \mathbf z_n$

With the 1-of-K encoding, the latent variable is a $K$-dimensional binary vector. The transition probability matrix is $A_{jk}\equiv p(z_{n,k}=1\mid z_{n-1,j}=1)$. This matrix satisfies $\sum\limits_{k} A_{jk}=1$ and therefore has $K(K-1)$ independent parameters.

Chapter 10 - Markov Decision Process

Markov Decision Process

Given states $x$, actions $u$, transition probabilities $p(x'\mid u,x)$, and reward function $r(x,u)$.

Expected cumulative reward:

  • $T=1$: greedy
  • $T>1$: finite horizon
  • $T=\infty$: infinite horizon
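Assuming the standard discounted form with discount factor $\gamma$ (as in Probabilistic Robotics, whose notation these notes appear to follow):

$$R_T=\mathbb E\left[\sum_{\tau=1}^T\gamma^\tau r_{t+\tau}\right]$$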

Goal: optimal policy

T-Step Policy and Value Function

Value Iteration and Policy Iteration