01/04/2018

Tags: machine learning linear regression probability

Today, we look at the regression under a probabilistic modeling context that help us understand the reasons behind the least squares loss function.

Recall that in linear regression, we assume that our data is a linear combination of weights and our feature vector with an prepended bias term:

\[\hat{y} = h(\mathbf{x}; \mathbf{w}) = \sum_{i=1}^{m+1} x_i w_i = \mathbf{w^\top x}\] \[\mathbf{x}, \mathbf{w} \in \mathbb{R}^{m+1}, \mathbf{x} = \begin{bmatrix}1 \\ \mathbf{x^1}\end{bmatrix}, \mathbf{w} = \begin{bmatrix}w_0 \\ \mathbf{w^1} \end{bmatrix}\]

We posited that the residual sum-of-squares was a good loss function, but its use seemed rather ad hoc. Let’s show some of the theoretical underpinnings behind this loss function.

To help understand the reasons for the least squares loss, we’ll need to review some fundamentals of probability theory. This is by no means a comprehensive overview of these concepts, but it hopefully brings back some of the most important ideas from your probability theory class. Consider a single data point. We can write the conditional probability of seeing its output $y$ given its feature vector $\mathbf{x}$ as:

\[p(y|\mathbf{x})\]

If we assume that each data point is independent from one another, we can express the joint probability, the probability of seeing all of the data’s $y$ given each of the $\mathbf{x}$’s, as:

\[p(y_1, y_2, \ldots, y_n | \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) = p(y_1|\mathbf{x}_1)p(y_2|\mathbf{x}_2)\ldots p(y_n|\mathbf{x}_n) = \prod_{i=1}^n p(y_i|\mathbf{x}_i)\]

If the probabilities depend on parameters, we can also define this as the likelihood $L(\mathbf{w})$ which is a joint probability based on parameters $\mathbf{w}$.

The value of a probability is defined by its distribution. We can think of a distribution as being the probability of occurrence for all possible values in its support, the range of possible outputs. For instance, a biased coin with probabilty 0.6 of being heads would have a probabilty distribution of:

\[\begin{cases} 0.4 & \text{tails} \\ 0.6 & \text{heads} \end{cases}\]

The distribution we will be discussing is the Normal/Gaussian distribution. The Normal distribution has the following probability density function:

\[\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)\]

The multivariate case can be written as:

\[\mathcal{N}(\mathbf{x}|\mathbf{\mu}, \mathbf{\Sigma}) = \frac{1}{\sqrt{|2\pi \mathbf{\Sigma}|}} \exp\left( -\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^\top\mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu}) \right)\]

Note that the probabilities of a distribution must add up to 1.

Now to tie it all together. In the linear regression, we think of the data as coming from a generative model of normally distributed data. A generative model is a probabilistic statement that describes the joint distribution over feature vectors and their output values. Here, the outputs $y$ are modeled to be generated by inputs $\mathbf{x}$. In the general regression form, we can say that our data follows a true function of the data $h$ with noise $\epsilon$:

\[y = h(\mathbf{x}, \mathbf{w}) + \epsilon, \epsilon \sim \mathcal{N}(0, \beta^{-1})\]

Another way to view this is that $y$ follows a Normal distribution with mean $h$ and variance $\beta^{-1}$.

\[p(y|\mathbf{x}, \mathbf{w}) = \mathcal{N}(y|h(\mathbf{x}, \mathbf{w}), \beta^{-1}) \longleftrightarrow y \sim \mathcal{N}(h(\mathbf{x}, \mathbf{w}), \beta^{-1})\]

Plugging in our assumption of linearity $(h(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x})$, we get:

\[p(y|\mathbf{x}, \mathbf{w}) = \mathcal{N}(y|\mathbf{w}^\top \mathbf{x}, \beta^{-1}) \longleftrightarrow y \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}, \beta^{-1})\]

The moral of the story is that we are currently making the following assumptions:

The data truly comes from a linear model that has been disturbed with random noise.
The randomness can be modeled as a Gaussian random variable.
Each data point is independent.

In this generative model, we would like to find a $\mathbf{w}^\star$ that maximizes the likelihood of seeing our data. Not surprisingly, we can estimate the optimal $\mathbf{w}^\star$ with the maximum likelihood estimator (MLE) $\hat{\mathbf{w}}_{MLE}$:

\[\hat{\mathbf{w}}_{MLE} = \text{arg}_\mathbf{w}\text{max }p(y_1, \ldots y_n | \mathbf{x}_1, \ldots, \mathbf{x}_n) = \text{arg}_\mathbf{w}\text{max } L(\mathbf{w})\]

Using our model for the linear regression and converting it to its joint Gaussian form, we get:

\[\hat{\mathbf{w}}_{MLE} = \text{arg}_\mathbf{w}\text{max } \prod_{i=1}^n \mathcal{N}(y_i|\mathbf{w}^\top \mathbf{x}_i, \beta^{-1}) = \text{arg}_\mathbf{w}\text{max } \mathcal{N}(\mathbf{y}|\mathbf{Xw}, \mathbf{\beta}^{-1}I_{n})\]

Notice the likelihood is a product of probabilities. Since probabilities are between 0 and 1, with enough data points, floating point imprecision will cause the joint probability to approach 0. To avoid these numerical issues, we often work in log-space. Furthermore, it’s often convention to work on minimization problems, so instead of maximizing the log-likelihood, we minimize the negative log-likelihood:

\[\ell(\mathbf{w}) = - \log L(\mathbf{w})\]

So the new equation is:

\[\begin{align*}\hat{\mathbf{w}}_{MLE} &= \text{arg}_\mathbf{w}\text{min } \ell(\mathbf{w})\\ &= \text{arg}_\mathbf{w}\text{min } - \log \mathcal{N}(\mathbf{y}|\mathbf{Xw}, \mathbf{\beta}^{-1}I_{n}) \\ &= \text{arg}_\mathbf{w}\text{min } -\frac{1}{2} \log|2\pi\mathbf{\beta}^{-1}| - \frac{1}{2}(\mathbf{y} - \mathbf{Xw})^\top\beta(\mathbf{y} - \mathbf{Xw}) \end{align*}\]

To find the argmin, we take the derivative of the the negative log-likelihood and solve for $\mathbf{w}$.

\[\begin{align*} \frac{\partial \ell(\mathbf{w})}{\partial \mathbf{w}} &= \frac{\beta}{2}(-2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{Xw}) \\ \hat{\mathbf{w}}_{MLE}&= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \end{align*}\]

The maximum likelihood estimator is precisely the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$ which we found last week! In conclusion, using the least squares loss is equivalent to the probabilistic linear regression model where data is assumed to independently come from a Normal distribution.

Linear Regression: A Probabilistic Approach

Categories

Recent Posts