Linear Regression

Introduction

Linear regression is the “workhorse” of statistics and (supervised) machine learning. When augmented with kernels or other forms of basis function expansion, it can also model nonlinear relationships. And when the Gaussian output is replaced with a Bernoulli or multinoulli distribution, it can be used for classification, as we will see below. So it pays to study this model in detail.

In the simplest approach, we can directly construct an appropriate function \(y(\vec{x})\) whose values for new inputs \(\vec{x}\) constitute the predictions for the corresponding values of \(t\). More generally, from a probabilistic perspective, we aim to model the predictive distribution \(p(t|\vec{x})\), because this expresses our uncertainty about the value of \(t\) for each value of \(\vec{x}\).

Representation

The simplest linear model for regression is one that involves a linear combination of the input variables \[y(\vec{x},\vec{w}) = w_0+w_1x_1+...+w_Dx_D\] where \(\vec{x}=(x_1,...,x_D)^T\).

Linear regression can be made to model nonlinear relationships by replacing \(\vec{x}\) with some nonlinear function of the inputs, \(\phi(\vec{x})\). Consider linear combinations of fixed nonlinear functions of the input variables, of the form \[y(\vec{x},\vec{w}) = w_0+\sum_{j=1}^{M-1}w_j\phi_j(\vec{x}) =\sum_{j=0}^{M-1}w_j\phi_j(\vec{x}) = \vec{w}^T\vec{\phi}(\vec{x})\] where the \(\phi_j(\vec{x})\) are known as basis functions, \(\vec{w}=(w_0,...,w_{M-1})^T\) and \(\vec{\phi}=(\phi_0,...,\phi_{M-1})^T\), with \(\phi_0(\vec{x})=1\) so that \(w_0\) plays the role of a bias. This is known as basis function expansion. (Note that the model is still linear in the parameters \(\vec{w}\), so it is still called linear regression; the importance of this will become clear below.) A simple example is given by polynomial basis functions, where the model has the form \[\vec{\phi}(x)=(1, x, \cdots, x^d)^T\]
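As a concrete illustration, the following minimal NumPy sketch (not part of the original text; the helper name poly_design_matrix is ours) builds a polynomial design matrix \(\vec{\Phi}\) from scalar inputs:

\begin{verbatim}
import numpy as np

def poly_design_matrix(x, degree):
    """Return the N x (degree+1) matrix with columns 1, x, x^2, ..., x^degree."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=degree + 1, increasing=True)

x = np.linspace(0.0, 1.0, 5)              # five scalar inputs
Phi = poly_design_matrix(x, degree=3)     # columns phi_0(x)=1, phi_1(x)=x, ...
\end{verbatim}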

As before, we assume that the target variable is given by a deterministic function \(y(\vec{x},\vec{w})\) with additive Gaussian noise, so that \[t=y(\vec{x},\vec{w})+\epsilon\] where \(\epsilon\) is a zero-mean Gaussian random variable with precision (inverse variance) \(\beta\). Thus \[\label{eqn:linear regression representation} p(t|\vec{x},\vec{w},\beta)=\mathcal{N}(t|y(\vec{x},\vec{w}),\beta^{-1})\] or, equivalently, in the notation with variance \(\sigma^2=\beta^{-1}\), \[\begin{aligned} & y(\vec{x}) = \vec{w}^T\vec{x}+\epsilon \\ & p(y|\vec{x},\vec{\theta})=\mathcal{N}(y|\vec{w}^T\vec{x}, \sigma^2) \\\end{aligned}\] where \(\vec{w}\) (the weight vector) and \(\vec{x}\) are extended vectors, \(\vec{x}=(1,x)\), \(\vec{w}=(b,w)\), and \(\epsilon\) has a Gaussian (normal) distribution. Here \(w_0=b\) is the intercept or bias term.

Maximum likelihood estimation (least squares)

Assuming the training examples are independent and identically distributed (IID), we obtain the likelihood function, which is a function of the adjustable parameters \(\vec{w}\) and \(\beta\), in the form \[\begin{aligned} \label{eqn:linear regression likelihood} p(\vec{t}|\vec{X},\vec{w},\beta) = \prod_{n=1}^{N}\mathcal{N}(t_n|\vec{w}^T\phi(\vec{x_n}),\beta^{-1})\end{aligned}\] We can write the log-likelihood (logarithm of the likelihood) function as follows: \[\ell(\vec{\theta}) \triangleq \log p(\mathcal{D}|\vec{\theta})\] A common way to estimate the parameters of a statistical model is to compute the MLE (maximum likelihood estimate), which is defined as \[\vec{\hat{\theta}}=\arg\max_\theta{\log p(\mathcal{D}|\vec{\theta})}\]

Instead of maximizing the log-likelihood, we can equivalently minimize the negative log likelihood or NLL: \[\text{NLL}(\vec{\theta}) \triangleq -\ell(\vec{\theta})=-\log p(\mathcal{D}|\vec{\theta})\]

The NLL formulation is sometimes more convenient, since many optimization software packages are designed to find the minima of functions, rather than maxima.

Now let us apply the method of MLE to the linear regression setting. Inserting the definition of the Gaussian into the above, we find that the log likelihood is given by \[\begin{aligned} \ell(\vec{\theta}) &= \log p(\vec{t}|\vec{w},\beta) \\ &=\sum_{n=1}^{N}\log \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x_n}),\beta^{-1}) \\ &=\sum\limits_{i=1}^N \log\left[\dfrac{1}{\sqrt{2\pi}\sigma}\exp\left(-\dfrac{1}{2\sigma^2}(t_i-\vec{w}^T\vec{x}_i)^2\right)\right] \\ &=\dfrac{N}{2}\log \beta-\dfrac{N}{2}\log (2\pi)-\beta E_D(\vec{w}) \\ &=-\dfrac{1}{2\sigma^2}\text{RSS}(\vec{w})-\dfrac{N}{2}\log(2\pi\sigma^2)\end{aligned}\] where the sum-of-squares error function is defined by \[E_D(\vec{w}) = \dfrac{1}{2}\sum_{n=1}^{N}\{t_n-\vec{w}^T\phi(\vec{x_n}) \}^2\] and RSS stands for the residual sum of squares, defined by \[\text{RSS}(\vec{w}) \triangleq \sum\limits_{i=1}^N (t_i-\vec{w}^T\vec{x}_i)^2\] The RSS can also be written as the square of the \(\ell_2\) norm of the vector of residual errors: \[\text{RSS}(\vec{w}) = \lVert\vec{\epsilon}\rVert_2^2 = \sum_{i=1}^{N}\epsilon_i^2\] where \(\epsilon_i = t_i - \vec{w}^T\vec{x_i}\).

There are two ways to compute the optimal solution of the likelihood function.

Derivation of the MLE

We see that maximizing the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing the RSS (or \(E_D\)) with respect to \(\vec{w}\), so this method is known as least squares. The gradient of the log likelihood function takes the form \[\nabla\log p(\vec{t}|\vec{w},\beta) =\sum_{n=1}^{N}\{t_n-\vec{w}^T\phi(\vec{x_n}) \}\phi(\vec{x_n})^T\] Setting the gradient with respect to \(\vec{w}\) to zero gives \[\begin{aligned} 0=\sum_{n=1}^{N}t_n\phi(\vec{x}_n)^T-\vec{w}^T\left(\sum_{n=1}^{N}\vec{\phi}(\vec{x}_n)\vec{\phi}(\vec{x}_n)^T\right)\end{aligned}\] Solving for \(\vec{w}\) we obtain \[\begin{aligned} \vec{w}_{ML}=(\vec{\Phi}^T\vec{\Phi})^{-1}\vec{\Phi}^T\vec{t}\end{aligned}\] which is known as the normal equations.
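The normal equations can be solved numerically in a few lines; here is a hedged sketch (the helper fit_mle is hypothetical, and Phi, t are assumed to be a design matrix and target vector as defined above) that solves the linear system rather than forming the inverse explicitly:

\begin{verbatim}
import numpy as np

def fit_mle(Phi, t):
    """Maximum-likelihood (least-squares) weights via the normal equations."""
    # Solve (Phi^T Phi) w = Phi^T t instead of explicitly inverting Phi^T Phi.
    w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
    # ML estimate of the noise precision: 1/beta = mean squared residual.
    residuals = t - Phi @ w_ml
    beta_ml = len(t) / np.sum(residuals ** 2)
    return w_ml, beta_ml
\end{verbatim}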

Making the bias parameter explicit, the error function becomes \[\begin{aligned} E_D(\vec{w})=\dfrac{1}{2}\sum_{n=1}^{N}\{t_n-w_0-\sum_{j=1}^{M-1}w_j\phi_j(\vec{x}_n)\}^2\end{aligned}\] Setting the derivative with respect to \(w_0\) equal to zero, we obtain \[\begin{aligned} w_0 = \bar{t}-\sum_{j=1}^{M-1}w_j\bar{\phi_j}\end{aligned}\] where \(\bar{t}=\frac{1}{N}\sum_{n=1}^{N}t_n\) and \(\bar{\phi}_j=\frac{1}{N}\sum_{n=1}^{N}\phi_j(\vec{x}_n)\). Then we have \[\begin{aligned} y&=\vec{w}^T\vec{\phi} \\ y&=\sum_{j=1}^{M-1}w_j \phi_j+w_0 \\ y&=\sum_{j=1}^{M-1}w_j \phi_j+\bar{t}-\sum_{j=1}^{M-1}w_j\bar{\phi_j} \\ y-\bar{t}&=\sum_{j=1}^{M-1}w_j(\phi_j-\bar{\phi}_j)\end{aligned}\] which indicates that we can centre the features by subtracting the mean of each column of \(\vec{\Phi}\) (and the targets by \(\bar{t}\)) during training; this simplifies the computation yet yields the same weight parameters.

Note that \(\vec{t}=(t_1,t_2,\cdots,t_N)^T\), \(\vec{\Phi}=\left(\begin{array}{c}\vec{\phi}_1^T \\ \vec{\phi}_2^T \\ \vdots \\ \vec{\phi}_N^T\end{array}\right)\)

\(\vec{\Phi}\) is an \(N\times M\) matrix, called the design matrix, whose elements are given by \(\Phi_{nj}=\phi_j(\vec{x_n})\), so that \[\vec{\Phi}= \begin{bmatrix} \phi_0(\vec{x_1}) & \phi_1(\vec{x_1}) &...&\phi_{M-1}(\vec{x_1}) \\ \phi_0(\vec{x_2}) & \phi_1(\vec{x_2}) &...&\phi_{M-1}(\vec{x_2}) \\ \vdots &\vdots &\ddots&\vdots \\ \phi_0(\vec{x_N})& \phi_1(\vec{x_{N}})& \cdots &\phi_{M-1}(\vec{x_N})\\ \end{bmatrix}\]

\[\begin{aligned} \because \sum_{n=1}^{N}t_n\phi(\vec{x}_n)^T &=\vec{t}^T\vec{\Phi}\end{aligned}\]

\[\begin{aligned} \begin{cases} \vec{\Phi} = \begin{bmatrix} \vec{\phi}_1^T \\ \vec{\phi}_2^T \\ \vdots \\ \vec{\phi}_N^T \end{bmatrix} \\ \vec{\Phi}^T\vec{\Phi} = \begin{bmatrix} \vec{\phi}_1 & \vec{\phi}_2 & \cdots &\vec{\phi}_N \end{bmatrix} \cdot \begin{bmatrix} \vec{\phi}_1^T \\ \vec{\phi}_2^T \\ \vdots \\ \vec{\phi}_N^T \end{bmatrix} = \sum\limits_{i=1}^{N}\vec{\phi_i}\vec{\phi_i}^T = \textbf{sum of squares matrix} \\ \end{cases}\end{aligned}\]

\[\begin{aligned} \therefore \vec{w}^T = \vec{t}^T\vec{\Phi}(\vec{\Phi}^T\vec{\Phi})^{-1} \therefore \vec{w}_{ML}=(\vec{\Phi}^T\vec{\Phi})^{-1}\vec{\Phi}^T\vec{t}\end{aligned}\]

Dropping constants with respect to \(\vec{w}\), the NLL can be written as \[\text{NLL}(\vec{w}) = \dfrac{1}{2}\sum\limits_{i=1}^N (t_i-\vec{w}^T\vec{\phi}_i)^2\]

Then we can rewrite the objective in a form that is more amenable to differentiation: \[\begin{aligned} \text{NLL}(\vec{w}) &= \dfrac{1}{2}(\vec{t}-\vec{\Phi}\vec{w})^T(\vec{t}-\vec{\Phi}\vec{w})\\ &= \dfrac{1}{2} [\vec{w}^T (\vec{\Phi}^T\vec{\Phi})\vec{w} - \vec{t}^T\vec{\Phi}\vec{w}- \vec{w}^T\vec{\Phi}^T\vec{t} +\vec{t}^T\vec{t}] \\ \Rightarrow \dfrac{\partial}{\partial \vec{w}}\text{NLL} &= \vec{\Phi}^T\vec{\Phi}\vec{w} -\vec{\Phi}^T\vec{t} \end{aligned}\]


When \(\mathcal{D}\) is small (for example, \(N < 1000\)), we can use the following equation to compute \(\vec{w}\) directly \[\vec{w}_{ML}=\hat{\vec{w}}_{\mathrm{OLS}}=(\vec{\Phi}^T\vec{\Phi})^{-1}\vec{\Phi}^T\vec{t}\]

The corresponding solution \(\hat{\vec{w}}_{\mathrm{OLS}}\) to this linear system of equations is called the ordinary least squares or OLS or normal equation solution.
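In practice \((\vec{\Phi}^T\vec{\Phi})^{-1}\) is rarely formed explicitly. A small self-contained sketch (synthetic data of our own choosing) using NumPy's SVD-based least-squares routine, which returns the same solution when \(\vec{\Phi}^T\vec{\Phi}\) is invertible but behaves better when \(\vec{\Phi}\) is ill-conditioned:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy targets
Phi = np.vander(x, N=4, increasing=True)                       # cubic basis

# Minimizes ||t - Phi w||_2^2 via an SVD of Phi (numpy.linalg.lstsq).
w_ols, rss, rank, sv = np.linalg.lstsq(Phi, t, rcond=None)
\end{verbatim}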

We now state without proof some facts of matrix derivatives (we won't need all of these in this section). \[\begin{aligned} \mathrm{tr}A &\triangleq& \sum\limits_{i=1}^n A_{ii} \nonumber \\ \frac{\partial}{\partial A}\mathrm{tr}(AB) &=& B^T \\ \frac{\partial}{\partial A^T}f(A) &=& \left[\frac{\partial}{\partial A}f(A)\right]^T \label{eqn:matrix-1} \\ \frac{\partial}{\partial A}\mathrm{tr}(ABA^TC) &=& CAB+C^TAB^T \label{eqn:matrix-2} \\ \frac{\partial}{\partial A}|A| &=& |A|(A^{-1})^T\end{aligned}\]

Then, writing the design matrix as \(\vec{X}\) and the target vector as \(\vec{y}\) (the usual least-squares notation), and using the facts that \(\vec{y}^T\vec{y}\) is constant in \(\vec{w}\), that the scalar \(\vec{w}^T\vec{X}^T\vec{y}\) equals \(\vec{y}^T\vec{X}\vec{w}\), and that a scalar equals its own trace, \[\begin{aligned} \text{NLL}(\vec{w}) &=& \frac{1}{2}(\vec{X}\vec{w}-\vec{y})^T(\vec{X}\vec{w}-\vec{y}) \\ \frac{\partial \text{NLL}}{\partial\vec{w}} &=& \frac{1}{2} \frac{\partial}{\partial\vec{w}} (\vec{w}^T\vec{X}^T\vec{X}\vec{w}-\vec{w}^T\vec{X}^T\vec{y}-\vec{y}^T\vec{X}\vec{w}+\vec{y}^T\vec{y}) \\ &=& \frac{1}{2} \frac{\partial}{\partial\vec{w}} (\vec{w}^T\vec{X}^T\vec{X}\vec{w}-2\vec{y}^T\vec{X}\vec{w}) \\ &=& \frac{1}{2} \frac{\partial}{\partial\vec{w}} (\mathrm{tr}\,\vec{w}^T\vec{X}^T\vec{X}\vec{w}-2\,\mathrm{tr}\,\vec{y}^T\vec{X}\vec{w})\end{aligned}\]

Combining Equations [eqn:matrix-1] and [eqn:matrix-2], we find that \[\frac{\partial}{\partial A^T}\mathrm{tr}(ABA^TC) = B^TA^TC^T+BA^TC\]

Let \(A^T=\vec{w}\), \(B=B^T=\vec{X}^T\vec{X}\), and \(C=I\). Hence, \[\begin{aligned} \frac{\partial \text{NLL}}{\partial\vec{w}} &=& \frac{1}{2} (\vec{X}^T\vec{X}\vec{w}+\vec{X}^T\vec{X}\vec{w} -2\vec{X}^T\vec{y}) \nonumber \\ &=& \vec{X}^T\vec{X}\vec{w} - \vec{X}^T\vec{y} \nonumber \\ \frac{\partial \text{NLL}}{\partial\vec{w}} &=& 0 \Rightarrow \vec{X}^T\vec{X}\vec{w} - \vec{X}^T\vec{y} =0 \nonumber \\ \vec{X}^T\vec{X}\vec{w} &=& \vec{X}^T\vec{y} \label{eqn:normal-equation} \\ \hat{\vec{w}}_{\mathrm{OLS}} &=& (\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{y} \nonumber\end{aligned}\]

Equation [eqn:normal-equation] is known as the normal equation.

Alternatively, consider vector-by-vector differentiation: if \(\mathbf{A}\) is not a function of \(\vec{x}\), then \[\dfrac{\partial\vec{x}^T\mathbf{A}\vec{x}}{\partial\vec{x}} = (\mathbf{A}+\mathbf{A}^T)\vec{x}\]

Then we can compute \(\dfrac{\partial \text{NLL}}{\partial \vec{w}}\) directly: \[\begin{aligned} g(\vec{w}) &= \dfrac{\partial \text{NLL}}{\partial\vec{w}} \\ &=\vec{\Phi}^T\vec{\Phi}\vec{w} - \vec{\Phi}^T\vec{t} \end{aligned}\] Setting the derivative to zero, we again get \[\vec{w} = (\vec{\Phi}^T\vec{\Phi})^{-1}\vec{\Phi}^T\vec{t}\]

Geometric interpretation

See Figure [fig:graphical-interpretation-of-OLS].


Graphical interpretation of least squares for \(N=3\) examples and \(D=2\) features. \(\tilde{\vec{x}}_1\) and \(\tilde{\vec{x}}_2\) are vectors in \(\mathbb{R}^3\); together they define a 2D plane. \(\vec{y}\) is also a vector in \(\mathbb{R}^3\) but does not lie on this 2D plane. The orthogonal projection of \(\vec{y}\) onto this plane is denoted \(\hat{\vec{y}}\). The red line from \(\vec{y}\) to \(\hat{\vec{y}}\) is the residual, whose norm we want to minimize. For visual clarity, all vectors have been converted to unit norm.

Given that \[\begin{aligned} \vec{X} = \begin{bmatrix} \vec{x}_1^T \\ \vec{x}_2^T \\ \vdots \\ \vec{x}_N^T \end{bmatrix} = \begin{bmatrix} \tilde{\vec{x}}_1 & \tilde{\vec{x}}_2 & \cdots &\tilde{\vec{x}}_D \end{bmatrix}, \qquad \vec{t} = \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{bmatrix} \end{aligned}\] We seek a vector \(\hat{\vec{t}} \in \mathbb{R}^N\) that lies in the column space of \(\vec{X}\) and is as close as possible to \(\vec{t}\), i.e., we want to find \[\begin{aligned} \hat{\vec{t}} &\in \text{span}(\{\tilde{\vec{x}}_1,\ldots,\tilde{\vec{x}}_D\}) \\ \Rightarrow \hat{\vec{t}} &= \vec{X}\vec{w} = w_1\tilde{\vec{x}}_1+\cdots+w_D\tilde{\vec{x}}_D \\ \hat{\vec{t}}&=\arg\min\limits_{\hat{\vec{t}} \in \text{span} (\{\tilde{\vec{x}}_1,\ldots,\tilde{\vec{x}}_D\})}\lVert\vec{t}-\hat{\vec{t}}\rVert_2\end{aligned}\]

To minimize the norm of the residual, \(\vec{t}-\hat{\vec{t}}\), we want the residual vector to be orthogonal to every column of \(\vec{X}\), so \(\tilde{\vec{x}}_j^T(\vec{t}-\hat{\vec{t}})=0\) for \(j=1:D\). Hence \[\begin{split} \tilde{\vec{x}}_j^T(\vec{t}-\hat{\vec{t}})=0 & \Rightarrow \vec{X}^T(\vec{t}-\vec{X}\vec{w})=0 \\ & \Rightarrow \vec{w}=(\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{t} \end{split}\]
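This orthogonality condition is easy to check numerically. The following sketch (random synthetic data, our own variable names) fits OLS and verifies that the residual is orthogonal to every column of \(\vec{X}\):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 2))          # N=3 examples, D=2 features
t = rng.standard_normal(3)

w = np.linalg.solve(X.T @ X, X.T @ t)    # OLS weights
t_hat = X @ w                            # projection of t onto span of X's columns
residual = t - t_hat
print(X.T @ residual)                    # numerically ~ [0, 0]
\end{verbatim}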

Sequential learning

Batch techniques, such as the maximum likelihood solution, which involve processing the entire training set in one pass, can be computationally costly for large data sets. If the data set is sufficiently large, it may be worthwhile to use sequential algorithms, also known as on-line algorithms. When \(\mathcal{D}\) is large, use stochastic gradient descent (SGD), also known as sequential gradient descent. \[\begin{aligned} E &=\sum_{n}E_n \\ \vec{w}^{\tau+1} &= \vec{w}^{\tau}-\eta\nabla E_n\end{aligned}\] where \(\tau\) denotes the iteration number and \(\eta\) is a learning rate parameter. For the case of the sum-of-squares error function, this gives \[\vec{w}^{\tau+1} = \vec{w}^{\tau}+\eta(t_n-\vec{w}^{(\tau)T}\phi_n)\phi_n\] where \(\phi_n=\phi(\vec{x_n})\). This is known as least-mean-squares or the LMS algorithm.
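A minimal sketch of the LMS updates follows (the helper lms_fit is ours, not a library routine; Phi and t are assumed to be the design matrix and targets, and the learning rate eta is hand-picked):

\begin{verbatim}
import numpy as np

def lms_fit(Phi, t, eta=0.05, n_epochs=100, seed=0):
    """Least-mean-squares: sequential gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(t)):        # visit one example at a time
            error = t[n] - w @ Phi[n]
            w = w + eta * error * Phi[n]         # w <- w + eta (t_n - w^T phi_n) phi_n
    return w
\end{verbatim}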

\[\begin{aligned} \because \dfrac{\partial}{\partial w_j}\text{NLL}(\vec{w})=& \sum\limits_{i=1}^N (\vec{w}^T\vec{x}_i-t_i)x_{ij} \\ \therefore w_j=& w_j - \alpha\dfrac{\partial}{\partial w_j}\text{NLL}(\vec{w}) \nonumber \\ =& w_j - \sum\limits_{i=1}^N \alpha(\vec{w}^T\vec{x}_i-t_i)x_{ij} \\ \therefore \vec{w}=& \vec{w}-\alpha(\vec{w}^T\vec{x}_i-t_i)\vec{x}_i\end{aligned}\] where the last line is the stochastic update based on a single example \(i\).

Ridge regression (MAP)

One problem with ML estimation is that it can result in over-fitting. In this section, we discuss a way to ameliorate this problem by using MAP estimation with a Gaussian prior.

Basic idea

We can encourage the parameters to be small, thus resulting in a smoother curve, by using a zero-mean Gaussian prior: \[p(\vec{w})=\prod\limits_j \mathcal{N}(w_j|0,\tau^2)\] where \(1/\tau^2\) controls the strength of the prior. The corresponding MAP estimation problem becomes \[\arg\max_{\vec{w}} \sum\limits_{i=1}^N \log{\mathcal{N}(t_i|w_0+\vec{w}^T\vec{\phi}_i,\sigma^2)}+\sum\limits_{j=1}^D \log{\mathcal{N}(w_j|0,\tau^2)}\]

This is equivalent to minimizing the following \[\begin{aligned} \label{eqn:Ridge-regression-J} J(\vec{w})&=E_D(\vec{w})+\lambda E_W(\vec{w}) \\ &=\dfrac{1}{N}\sum\limits_{i=1}^N (t_i-(w_0+\vec{w}^T\vec{\phi}_i))^2+\lambda\lVert\vec{w}\rVert^2 , \quad \lambda \triangleq \dfrac{\sigma^2}{\tau^2}\end{aligned}\] where the first term is the MSE/NLL as usual, and the second term is a complexity penalty whose strength is controlled by the regularization coefficient \(\lambda \geq 0\). Here \(E_W(\vec{w})\) is one of the simplest forms of regularizer, given by the sum of squares of the weight vector elements, \[E_W(\vec{w})=\dfrac{1}{2}\vec{w}^T\vec{w}\] This particular choice of regularizer is known as weight decay.

The corresponding solution is given by \[\label{eqn:Ridge-regression-solution} \hat{\vec{w}}_{\mathrm{ridge}}=(\lambda\vec{I}_D+\vec{\Phi}^T\vec{\Phi})^{-1}\vec{\Phi}^T\vec{t}\]

This technique is known as ridge regression, or penalized least squares. In general, adding a Gaussian prior to the parameters of a model to encourage them to be small is called \(\ell_2\) regularization or weight decay. Note that the offset term \(w_0\) is not regularized, since this just affects the height of the function, not its complexity.
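A minimal sketch of the ridge solution [eqn:Ridge-regression-solution] (the helper fit_ridge is hypothetical; the treatment of the offset is noted in the comment):

\begin{verbatim}
import numpy as np

def fit_ridge(Phi, t, lam):
    """Ridge/MAP weights: (lambda*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

# Note: to leave the offset w_0 unregularized (assuming the first column of Phi
# is the constant feature), replace lam*np.eye(M) with a copy whose (0, 0)
# entry is zero, or centre t and the non-constant columns of Phi beforehand.
\end{verbatim}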

We will consider a variety of different priors in this book. Each of these corresponds to a different form of regularization. This technique is very widely used to prevent overfitting.

Multiple outputs

Now consider the case where we wish to predict \(K>1\) target variables, which we denote collectively by the target vector \(\vec{t}\). We use the same set of basis functions to model all the components of the target vector, so that \[\vec{y}(\vec{x},\vec{w}) = \vec{W}^T\vec{\phi}(\vec{x})\] where \(\vec{y}\) is a \(K\)-dimensional column vector and \(\vec{W}\) is an \(M\times K\) matrix of parameters.

The conditional distribution of the target vector is an isotropic Gaussian \[p(\vec{t}|\vec{x},\vec{W},\beta) = \mathcal{N}(\vec{t}|\vec{W}^T\vec{\phi}(\vec{x}),\beta^{-1}\vec{I})\]

The log likelihood function is then given by \[\begin{aligned} \ln p(\vec{T}|\vec{X},\vec{W},\beta) &=\sum_{n=1}^{N}\ln \mathcal{N}(\vec{t_n}|\vec{W}^T\vec{\phi}(\vec{x}_n),\beta^{-1}\vec{I}) \\ &=\dfrac{NK}{2}\ln(\dfrac{\beta}{2\pi}) -\dfrac{\beta}{2}\sum_{n=1}^{N}\parallel \vec{t_n}-\vec{W}^T\vec{\phi}(\vec{x}_n)\parallel^2\end{aligned}\] Maximizing this function with respect to \(\vec{W}\) gives \[\vec{W}_{ML} = (\vec{\Phi}^T\vec{\Phi})^{-1}\vec{\Phi}^T\vec{T}\] where \(\vec{T}\) is the \(N\times K\) target matrix whose \(n\)th row is \(\vec{t}_n^T\).
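The multi-output solution has exactly the same form as the single-output case; a short sketch on synthetic data (shapes chosen arbitrarily for illustration):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 50, 4, 3
Phi = rng.standard_normal((N, M))     # design matrix
T = rng.standard_normal((N, K))       # N targets, each K-dimensional

# W_ML = (Phi^T Phi)^{-1} Phi^T T, solved for all K columns in one call.
W_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)   # shape (M, K)
\end{verbatim}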

Numerically stable computation *

\[\label{eqn:Ridge-regression-SVD} \hat{\vec{w}}_{\mathrm{ridge}}=\vec{V}(\vec{Z}^T\vec{Z}+\lambda\vec{I}_N)^{-1}\vec{Z}^T\vec{y}\] where \(\vec{X}=\vec{U}\vec{D}\vec{V}^T\) is the SVD of the design matrix and \(\vec{Z}=\vec{U}\vec{D}\).

Connection with PCA *

Regularization effects of big data

Regularization is the most common way to avoid overfitting. However, another effective approach — which is not always available — is to use lots of data. It should be intuitively obvious that the more training data we have, the better we will be able to learn.

In domains with lots of data, simple methods can work surprisingly well (Halevy et al. 2009). However, there are still reasons to study more sophisticated learning methods, because there will always be problems for which we have little data. For example, even in such a data-rich domain as web search, as soon as we want to start personalizing the results, the amount of data available for any given user starts to look small again (relative to the complexity of the problem).

The Bias-Variance Decomposition

Representation

Data representation: \(N\) observations of \(x\), written \(\vec{X} \equiv (x_1,x_2,...,x_N)^T\), together with corresponding observations of the values of \(t\), denoted \(\vec{t} \equiv (t_1,t_2,...,t_N)^T\). \[t_i = f(\vec{x_i}) + \epsilon\] where the noise \( \epsilon \) has zero mean and variance \( \sigma^2 \). We want to find a function \(\hat{f}(x)\) that approximates the true function \(t = f(\vec{x})\) as well as possible. We make “as well as possible” precise by measuring the mean squared error between \(t\) and \( \hat{f}(x) \): we want \( (t - \hat{f}(x))^2 \) to be small on average. Hypothesis function (model representation): \[y(x,\vec{w})= w_0+w_1x+w_2x^2+...+w_Mx^M = \sum_{j=0}^{M}w_jx^j\] where \(M\) is the order of the polynomial and \(x^j\) denotes \(x\) raised to the power of \(j\). The polynomial coefficients \(w_0,...,w_M\) are collectively denoted by the vector \(\vec{w}\).

Error function: the sum of squares of the errors between the predictions \(y(x_n,\vec{w})\) for each data point \(x_n\) and the corresponding target values \(t_n\), which we minimize: \[E(\vec{w})=\frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\vec{w})-t_n \}^2\] The root-mean-square (RMS) error is defined by \[E_{RMS} = \sqrt{2E(\vec{w}^*)/N}\] The penalized (regularized) error function is \[\widetilde{E}(\vec{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\vec{w})-t_n\}^2 + \frac{\lambda}{2} \lVert \vec{w} \rVert^2\] where \(\lVert \vec{w} \rVert^2 \equiv \vec{w}^T\vec{w}=w_0^2+w_1^2+...+w_M^2\) and \(\lambda\) governs the relative importance of the regularization term.
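For concreteness, a small sketch of computing \(E_{RMS}\) for a polynomial fit (the helper rms_error is ours; the weights are assumed to be stored in increasing-power order \(w_0,\dots,w_M\) as in the model above):

\begin{verbatim}
import numpy as np

def rms_error(w, x, t):
    """Root-mean-square error E_RMS = sqrt(2 E(w) / N)."""
    y = np.polyval(w[::-1], x)        # np.polyval expects highest power first
    E = 0.5 * np.sum((y - t) ** 2)    # sum-of-squares error E(w)
    return np.sqrt(2.0 * E / len(t))
\end{verbatim}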

Loss function for regression

The average, or expected, loss is given by \[\mathbb{E}[\mathit{L}] = \iint\mathit{L}(t,y(\textbf{x}))p(\textbf{x},t)\,d\textbf{x}\,dt\]

A common choice of loss function is the squared loss, given by \[\begin{aligned} \mathit{L}(t,y(\textbf{x})) = \{y(\textbf{x}) - t\}^2 \\ \mathbb{E}[\mathit{L}] = \iint\{y(\textbf{x}) - t\}^2 p(\textbf{x},t)\,d\textbf{x}\,dt.\end{aligned}\] We minimize \( \mathbb{E}[\mathit{L}] \) by using the calculus of variations to give \[\frac{\delta\mathbb{E}[\mathit{L}]}{\delta y(\mathbf{x})} = 2 \int\{ y(\mathbf{x}) -t \}p(\mathbf{x},t)\,dt = 0\] Solving for \( y(\textbf{x}) \) and using the sum and product rules of probability, we obtain \[y(\textbf{x}) = \frac{\int tp(\textbf{x},t)\,dt}{p(\textbf{x})} = \int tp(t|\textbf{x})\,dt = \mathbb{E}[t|\textbf{x}]\]

Let’s derive this result in a slightly different way. Armed with the knowledge that the optimal solution is the conditional expectation, we can expand the squared term as follows \[\{y(\textbf{x}) -t\}^2 = \{y(\textbf{x}) - \mathbb{E}[t|\textbf{x}] + \mathbb{E}[t|\textbf{x}] - t \}^2 = \{ y(\textbf{x}) - \mathbb{E}[t|\textbf{x}] \}^2 + 2\{ y(\textbf{x}) - \mathbb{E}[t|\textbf{x}] \}\{ \mathbb{E}[t|\textbf{x}]-t \} + \{ \mathbb{E}[t|\textbf{x}] -t \}^2\] where \( \mathbb{E}[t|\textbf{x}] \) denotes \( \mathbb{E}_{t}[t|\textbf{x}] \). Substituting into the loss function and performing the integral over \(t\), we see that the cross-term vanishes and, writing \(h(\textbf{x})=\mathbb{E}[t|\textbf{x}]\), \[\begin{aligned} \label{eqn:squared loss function} \mathbb{E}[\mathit{L}] &= \iint\{y(\textbf{x}) - t\}^2 p(\textbf{x},t)\,d\textbf{x}\,dt \\ &= \int \{ y(\textbf{x}) -\mathbb{E}[t|\textbf{x}] \}^2 p(\textbf{x})\,d\textbf{x} + \iint\{ \mathbb{E}[t|\textbf{x}] - t \}^2 p(\textbf{x},t)\,d\textbf{x}\,dt \\ &= \int \{ y(\textbf{x}) -h(\textbf{x}) \}^2 p(\textbf{x})\,d\textbf{x} + \iint\{ h(\textbf{x}) - t \}^2 p(\textbf{x},t)\,d\textbf{x}\,dt \\\end{aligned}\]

Decomposition

As a popular choice, we use the squared loss function, for which the optimal prediction is given by the conditional expectation, which we denote by \(h(\textbf{x})\) and which is given by \[h(\textbf{x}) = \mathbb{E}[t|\textbf{x}] = \int tp(t|\textbf{x})\,dt\]

Consider the integrand of the first term of [eqn:squared loss function], which for a particular data set \(D\) takes the form \[\{ y(\textbf{x};D) - h(\textbf{x}) \} ^2\] This quantity depends on the particular data set \(D\), so we take its average over the ensemble of data sets. If we add and subtract the quantity \( \mathbb{E_D}[y(\textbf{x};D)] \) inside the braces, expand, and then take the expectation over \(D\) (under which the cross-term vanishes), we obtain \[\begin{aligned} &\mathbb{E}_D[\{ y(\textbf{x};D) - h(\textbf{x}) \} ^2] \\ =&\mathbb{E}_D[\{y(\textbf{x};D) - \mathbb{E_D}[y(\textbf{x};D)] + \mathbb{E_D}[y(\textbf{x};D)] -h(\textbf{x}) \}^2] \\ =& \underbrace{\mathbb{E}_D[\{ y(\textbf{x};D) -\mathbb{E}_{D}[y(\textbf{x};D)] \}^2]}_\text{\color{red}{variance}} + \underbrace{\{ \mathbb{E_D}[y(\textbf{x};D)] - h(\textbf{x})\}^2}_\text{\color{blue}$(bias)^2$}\\\end{aligned}\] The decomposition of the expected squared loss is then \[\text{expected loss} = (bias)^2 + variance + noise\] where \[\begin{aligned} (bias)^2 &= \int \{ \mathbb{E_D}[y(\textbf{x};D)] - h(\textbf{x})\}^2 p(\textbf{x})\,d\textbf{x} \\ variance &= \int \mathbb{E}_D[\{ y(\textbf{x};D) -\mathbb{E}_{D}[y(\textbf{x};D)] \}^2] p(\textbf{x})\,d\textbf{x} \\ noise &= \iint\{ h(\textbf{x}) - t \}^2 p(\textbf{x},t)\,d\textbf{x}\,dt\end{aligned}\]
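The decomposition can be illustrated by simulation. The sketch below (our own choice of true function, noise level, and estimator) refits a polynomial on many simulated data sets and estimates the three terms at a single test point:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
h = np.sin                                   # true regression function h(x)
x0, sigma = 1.0, 0.3                         # test point and noise std. dev.
n_datasets, n_points, degree = 500, 25, 3

preds = np.empty(n_datasets)
for d in range(n_datasets):
    x = rng.uniform(0.0, 2.0 * np.pi, n_points)
    t = h(x) + sigma * rng.standard_normal(n_points)
    coef = np.polyfit(x, t, degree)          # fit y(x; D) on this data set
    preds[d] = np.polyval(coef, x0)          # prediction at the test point

bias_sq = (preds.mean() - h(x0)) ** 2
variance = preds.var()
noise = sigma ** 2
# Expected squared loss at x0 is approximately bias_sq + variance + noise.
\end{verbatim}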

The function \( y(\textbf{x}) \) we seek to determine enters only the first term, which will be minimized when \( y(\textbf{x}) \) is equal to \( \mathbb{E}[t|\textbf{x}] \), in which case this term will vanish. The second term is the variance of the distribution of \(t\), averaged over \( \textbf{x} \); it represents the intrinsic variability of the target data and can be regarded as noise. It is the irreducible minimum value of the loss function.

A more general loss function is the Minkowski loss \[\mathbb{E}[\mathit{L_q}] = \iint| y(\textbf{x}) - t |^q p(\textbf{x},t)\,d\textbf{x}\,dt\]

Bayesian linear regression

Hold-out data can be used to determine model complexity, but this is computationally expensive and wasteful of valuable data. We therefore turn to a Bayesian treatment of linear regression, which avoids the over-fitting problem of maximum likelihood and also leads to automatic methods of determining model complexity using the training data alone.

Parameter distribution

We first introduce a prior probability distribution over the model parameters. The likelihood function \(p(\vec{t}|\vec{w})\) defined by [eqn:linear regression likelihood] is the exponential of a quadratic function of \(\vec{w}\), so the corresponding conjugate prior is Gaussian \[p(\vec{w}) = \mathcal{N}(\vec{w}|\vec{m_0},\vec{S}_0)\]

The posterior distribution is proportional to the product of the likelihood and the prior, and it is also Gaussian because of the choice of a conjugate Gaussian prior (derived in [sec:Gaussian distribution]). \[\label{eqn:Bayes linear regression posterior} p(\vec{w}|\vec{t}) = \mathcal{N}(\vec{w}|\vec{m}_N,\vec{S}_N)\] where \[\begin{aligned} \vec{m}_N &=\vec{S}_N(\vec{S}_0^{-1}\vec{m}_0+\beta\vec{\Phi}^T\vec{t})\\ \vec{S}_N^{-1} &= \vec{S}_0^{-1}+\beta\vec{\Phi}^T\vec{\Phi}\end{aligned}\] Thus the maximum posterior weight vector is simply given by \(\vec{w}_{MAP}=\vec{m}_{N}\). If \(N=0\), the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the next data point, and the new posterior is again given by [eqn:Bayes linear regression posterior].

For simplicity, consider a zero-mean isotropic Gaussian prior governed by a single precision parameter \(\alpha\), so that \[p(\vec{w}|\alpha) = \mathcal{N}(\vec{w}|\vec{0},\alpha^{-1}\vec{I})\] for which the posterior is given by \[\begin{aligned} \vec{m}_N &=\beta\vec{S}_N\vec{\Phi}^T\vec{t} \\ \vec{S}_N^{-1}&=\alpha\vec{I}+\beta\vec{\Phi}^T\vec{\Phi}\end{aligned}\]

The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of \(\vec{w}\), takes the form \[\log p(\vec{w}|\vec{t}) = -\dfrac{\beta}{2}\sum_{n=1}^{N}\{t_n-\vec{w}^T\phi(\vec{x}_n) \}^2-\dfrac{\alpha}{2}\vec{w}^T\vec{w}+const\] Maximization of this posterior with respect to \(\vec{w}\) is therefore equivalent to minimization of the sum-of-squares error function with the addition of a quadratic regularization term, with \(\lambda=\alpha/\beta\).
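A minimal sketch of computing \(\vec{m}_N\) and \(\vec{S}_N\) for the zero-mean isotropic prior (the helper name posterior is ours; \(\alpha\) and \(\beta\) are assumed known):

\begin{verbatim}
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the zero-mean isotropic Gaussian prior."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
\end{verbatim}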

Predictive distribution

We evaluate the predictive distribution defined by \[p(t|\vec{t},\alpha,\beta) = \int p(t|\vec{w},\beta)p(\vec{w}|\vec{t},\alpha,\beta)\,d\vec{w}\] The right-hand side can be interpreted as marginalizing the joint distribution \(p(t,\vec{w}|\vec{t},\alpha,\beta)\) over \(\vec{w}\), and the joint distribution can be handled using the results for linear-Gaussian models. The conditional distribution \(p(t|\vec{x},\vec{w},\beta)\) of the target variable is given by [eqn:linear regression representation], and the posterior weight distribution is given by [eqn:Bayes linear regression posterior]. This involves the convolution of two Gaussian distributions. The predictive distribution takes the form \[p(t|\vec{x},\vec{t},\alpha,\beta) = \mathcal{N}(t|\vec{m}_N^T\phi(\vec{x}),\sigma_N^2(\vec{x}))\] where the variance \(\sigma_N^2(\vec{x})\) is given by \[\sigma_N^2(\vec{x})=\dfrac{1}{\beta}+\phi(\vec{x})^T\vec{S}_N\phi(\vec{x})\] The first term represents the noise on the data, whereas the second term reflects the uncertainty associated with the parameters \(\vec{w}\). One can show that \(\sigma_{N+1}^2(\vec{x})\leq \sigma_N^2(\vec{x})\), so the parameter uncertainty shrinks as more data points are observed; in the limit \(N\rightarrow \infty\), the second term goes to zero.
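Given \(\vec{m}_N\) and \(\vec{S}_N\) from the sketch above, the predictive mean and variance at a new input follow directly (again a hypothetical helper):

\begin{verbatim}
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance at a single feature vector phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
\end{verbatim}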

Note that if both \(\vec{w}\) and \(\beta\) are treated as unknown, then we can introduce a conjugate prior distribution \(p(\vec{w},\beta)\) given by a Gaussian-gamma distribution, leading to a Student's t predictive distribution.

Equivalent kernel

The posterior mean solution \(\vec{m}_N=\beta\vec{S}_N\vec{\Phi}^T\vec{t}\) has an interpretation that sets the stage for kernel methods, including Gaussian processes. The predictive mean is \[y(\vec{x},\vec{m}_N)=\vec{m}_N^T\vec{\phi}(\vec{x}) =\beta\vec{\phi}(\vec{x})^T\vec{S}_N\vec{\Phi}^T\vec{t} =\sum_{n=1}^{N}\beta\vec{\phi}(\vec{x})^T\vec{S}_N\vec{\phi}(\vec{x}_n)t_n =\sum_{n=1}^{N}\mathit{k}(\vec{x},\vec{x}_n)t_n\] where \[k(\vec{x},\vec{x'}) =\beta\vec{\phi}(\vec{x})^T\vec{S}_N\vec{\phi}(\vec{x'})\] is known as the smoother matrix or equivalent kernel. Regression functions such as this, which make predictions by taking linear combinations of the training set target values, are known as linear smoothers. The kernel function is localized around \(\vec{x}\), so local evidence is weighted more heavily than distant evidence.
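A short sketch of evaluating the equivalent kernel against every training input, reusing Phi, S_N, and beta from the Bayesian fit sketched earlier (the helper name is ours):

\begin{verbatim}
import numpy as np

def equivalent_kernel(phi_x, Phi, S_N, beta):
    """k(x, x_n) = beta * phi(x)^T S_N phi(x_n) for all training inputs x_n."""
    return beta * Phi @ (S_N @ phi_x)        # length-N vector of weights

# The predictive mean is then the kernel-weighted sum of training targets:
#   y(x) = sum_n k(x, x_n) t_n
\end{verbatim}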

The covariance between \(y(\vec{x})\) and \(y(\vec{x'})\) is given by \[\begin{aligned} cov[y(\vec{x}),y(\vec{x'})] &=cov[\vec{\phi}(\vec{x})^T\vec{w},\vec{w}^T\vec{\phi}(\vec{x'})] \\ &=\phi(\vec{x})^T\vec{S}_N\phi(\vec{x'})=\beta^{-1}\mathit{k}(\vec{x},\vec{x'})\end{aligned}\] We see that the predictive mean at nearby points will be highly correlated.

The formulation of linear regression in terms of kernel functions suggests an alternative approach, called a Gaussian process: instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can define a localized kernel directly and use it to make predictions for new input vectors \(\vec{x}\), given the observed training set.

The equivalent kernel defines the weights by which the training set target values are combined in order to make a prediction at a new value of \(\vec{x}\), and it can be shown that these weights sum to one, \[\begin{aligned} \sum\limits_{n=1}^{N}\mathit{k}(\vec{x},\vec{x}_n) = 1\end{aligned}\] for all values of \(\vec{x}\).

The equivalent kernel shares an important property of kernel functions in general, namely that it can be expressed in the form of an inner product with respect to a vector \(\varPsi(\vec{x})\) of nonlinear functions, so that \[\begin{aligned} \mathit{k}(\vec{x},\vec{z}) = \varPsi(\vec{x})^T\varPsi(\vec{z})\end{aligned}\] where \(\varPsi(\vec{x})=\beta^{1/2}\vec{S}_N^{1/2}\vec{\phi}(\vec{x})\).

Bayesian Model Comparison

The over-fitting associated with maximum likelihood can be avoided by marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values; this removes the need for a separate validation set.

Suppose we wish to compare a set of \(L\) models \(\{\mathcal{M}_i\},i=1,...,L\), given observed data \(\mathcal{D}\). We evaluate the posterior distribution \[\begin{aligned} p(\mathcal{M}_i|\mathcal{D})\propto p(\mathcal{M}_i)p(\mathcal{D}|\mathcal{M}_i)\end{aligned}\] The prior allows us to express a preference for different models. \(p(\mathcal{D}|\mathcal{M}_i)\) is the model evidence, which expresses the preference shown by the data for different models; it is also called the marginal likelihood. The ratio of model evidences \(p(\mathcal{D}|\mathcal{M}_i)/p(\mathcal{D}|\mathcal{M}_j)\) for two models is called the Bayes factor. The predictive distribution follows from the sum and product rules: \[\begin{aligned} p(t,\mathcal{M}_i|\vec{x},\mathcal{D})&=p(t|\vec{x},\mathcal{M}_i,\mathcal{D})p(\mathcal{M}_i|\mathcal{D}) \\ \Rightarrow p(t|\vec{x},\mathcal{D})&=\sum\limits_{i=1}^{L}p(t|\vec{x},\mathcal{M}_i,\mathcal{D})p(\mathcal{M}_i|\mathcal{D}) \\\end{aligned}\] This is an example of a mixture distribution in which the overall predictive distribution is obtained by averaging the predictive distributions of the individual models, weighted by the posterior probabilities \(p(\mathcal{M}_i|\mathcal{D})\) of those models.

Using the single most probable model alone to make predictions is known as model selection. The model evidence is \[\begin{aligned} p(\mathcal{D}|\mathcal{M}_i)=\int p(\mathcal{D}|\vec{w},\mathcal{M}_i)p(\vec{w}|\mathcal{M}_i)\,d\vec{w}\end{aligned}\] which is the normalizing constant in Bayes' theorem for the parameters: \[\begin{aligned} p(\vec{w}|\mathcal{D},\mathcal{M}_i)=\dfrac{p(\mathcal{D}|\vec{w},\mathcal{M}_i)p(\vec{w}|\mathcal{M}_i)}{p(\mathcal{D}|\mathcal{M}_i)}\end{aligned}\]

Assume the posterior distribution over parameters is sharply peaked around the most probable value \(w_{MAP}\), with width \(\Delta w_{posterior}\); then we can approximate the integral by the value of the integrand at its maximum times the width of the peak. If we further assume that the prior is flat with width \(\Delta w_{prior}\), so that \(p(w)=1/\Delta w_{prior}\), then we have \[\begin{aligned} p(\mathcal{D})=\int p(\mathcal{D}|w)p(w)dw \simeq p(\mathcal{D}|w_{MAP})p(w_{MAP})\Delta w_{posterior} \simeq p(\mathcal{D}|w_{MAP})\dfrac{\Delta w_{posterior}}{\Delta w_{prior}}\end{aligned}\]

Taking logs, \[\begin{aligned} \ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|w_{MAP}) +\ln(\dfrac{\Delta w_{posterior}}{\Delta w_{prior}})\end{aligned}\] If we have a set of \(M\) parameters, assuming that all parameters have the same ratio of \(\Delta w_{posterior}/\Delta w_{prior}\), we obtain \[\begin{aligned} \ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|w_{MAP}) +M\ln(\dfrac{\Delta w_{posterior}}{\Delta w_{prior}})\end{aligned}\] As we increase the complexity of the model, the first term, representing the fit to the data, typically increases, because a more complex model can fit the data better; the second term decreases, because \(\Delta w_{posterior}<\Delta w_{prior}\) makes the log ratio negative and it is multiplied by \(M\). The optimal model complexity is given by a trade-off between these two competing terms.

Averaging the log Bayes factor over the distribution of data sets generated by \(\mathcal{M}_1\) gives \[\begin{aligned} \int p(\mathcal{D}|\mathcal{M}_1)\ln\dfrac{p(\mathcal{D}|\mathcal{M}_1)}{p(\mathcal{D}|\mathcal{M}_2)}\,d\mathcal{D}\end{aligned}\] This quantity is a Kullback-Leibler divergence, which is always greater than or equal to \(0\); hence, on average, the Bayes factor will favour the correct model \(\mathcal{M}_1\).

Model Evidence