Linear regression, considered to be the most basic of regression models, assumes a linear relationship between the independent variables and the dependent variable. It literally just fits a line to the data, minimizing squared error. Formally, linear regression predicts the scalar value as

$$\hat{y} = \theta^T x$$

where

  • $x$ is an $n$-dimensional vector of features (with an additional 1 appended for the intercept term), and
  • $\theta$ is an $(n+1)$-dimensional vector of model parameters (weights).

We can extend this to a design matrix $X \in \mathbb{R}^{m \times (n+1)}$ of $m$ examples as

$$\hat{y} = X\theta$$
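
As a quick illustration, here is a minimal sketch of this prediction step in NumPy; the feature values and parameter vector below are made up for the example:

import numpy as np

# three examples with two features each (made-up values)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
theta = np.array([0.5, 2.0, -1.0])  # [intercept, weight_1, weight_2]

# build the design matrix by prepending the column of 1s for the intercept
X_b = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
y_hat = X_b @ theta  # one predicted scalar per example, shape (3,)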

Optimizing linear regression

There are two approaches to optimizing linear regression: analytical and numerical.

Analytical optimization of linear regression

As noted, the loss function for linear regression is mean squared error. Recall that mean squared error loss is defined as

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2$$

We can rewrite this in a vectorized form to eliminate the sum as

$$J(\theta) = \frac{1}{m} \lVert X\theta - y \rVert^2$$

where $\lVert \cdot \rVert$ is the vector magnitude. Now, the vector magnitude is defined as

$$\lVert v \rVert = \sqrt{\sum_i v_i^2} = \sqrt{v^T v}$$

So the above can be expanded as

$$J(\theta) = \frac{1}{m} (X\theta - y)^T (X\theta - y) = \frac{1}{m} \left( \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y \right)$$

But $\theta^T X^T y = y^T X \theta$ (a scalar is equal to its own transpose), so

$$J(\theta) = \frac{1}{m} \left( \theta^T X^T X \theta - 2 y^T X \theta + y^T y \right)$$
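
As a quick sanity check on the algebra, here is a small numeric comparison (with made-up data; the bias column is assumed to already be part of X for simplicity) showing that the element-wise mean squared error, the vectorized norm form, and the expanded quadratic form all agree:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 examples, 3 features
y = rng.normal(size=5)
theta = rng.normal(size=3)
m = X.shape[0]

residual = X @ theta - y
mse_sum = np.mean(residual ** 2)                 # (1/m) * sum of squared errors
mse_norm = np.linalg.norm(residual) ** 2 / m     # (1/m) * ||X theta - y||^2
mse_quad = (theta @ X.T @ X @ theta - 2 * y @ X @ theta + y @ y) / m  # expanded form

assert np.allclose(mse_sum, mse_norm) and np.allclose(mse_sum, mse_quad)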

To optimize $J(\theta)$, we take its derivative, set it to zero, and solve for $\theta$. I am told this works out to the following (but you’d have to ask high-school-me):

$$\nabla_\theta J(\theta) = \frac{2}{m} X^T (X\theta - y) = 0$$

Since $\frac{2}{m}$ is a constant factor and we have set the expression to 0, we can cancel the factor with no loss of generality:

$$X^T (X\theta - y) = 0$$

Multiplying the $X^T$ factor through the parenthetical expression, we obtain

$$X^T X \theta - X^T y = 0$$

Therefore

$$X^T X \theta = X^T y$$

As long as $X^T X$ is invertible, we can solve for $\theta$ as

$$\theta = (X^T X)^{-1} X^T y$$

This is the closed-form optimization for linear regression. If $X^T X$ is not invertible, then there is no unique optimal solution. This typically happens when the features are linearly dependent (e.g., one feature is a copy or a linear combination of others), i.e., they are collinear.
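
To see the non-invertible case concretely, here is a small sketch with made-up data where one feature is an exact copy of another, so $X^T X$ is singular; least squares (np.linalg.lstsq) still returns one of the many equally optimal solutions:

import numpy as np

X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])  # bias column plus two identical (collinear) features
y = np.array([1.0, 2.0, 3.0])

print(np.linalg.matrix_rank(X.T @ X))  # 2 < 3, so X^T X is singular
# np.linalg.inv(X.T @ X) would raise LinAlgError here;
# least squares still finds one of the (infinitely many) optimal thetas:
theta, *_ = np.linalg.lstsq(X, y, rcond=None)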

We can implement this in NumPy as

import numpy as np

def fit_lr_exact(X: np.ndarray, y: np.ndarray) -> np.ndarray:
	ones_column: np.ndarray = np.ones((X.shape[0], 1))
	X_b: np.ndarray = np.concatenate((ones_column, X), axis=1)  # bias column
	# closed-form solution: theta = (X^T X)^{-1} X^T y
	theta: np.ndarray = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
	return theta
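
A quick way to check the implementation is to fit synthetic data generated from known parameters and confirm they are recovered; the numbers below are arbitrary:

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
true_theta = np.array([3.0, 1.5, -2.0])  # [intercept, w1, w2]
y = true_theta[0] + X @ true_theta[1:] + 0.01 * rng.normal(size=100)

theta_hat = fit_lr_exact(X, y)
print(theta_hat)  # close to [3.0, 1.5, -2.0]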

Numerical optimization of linear regression

We can also approximate the optimal $\theta$ numerically, typically using gradient descent. To do this, we still need the partial derivative of the loss with respect to the parameters, $\nabla_\theta J(\theta)$. Recall from above that this turns out to be:

$$\nabla_\theta J(\theta) = \frac{2}{m} X^T (X\theta - y)$$

With this information, implementing gradient descent is literally as simple as evaluating the above and stepping by some learning rate in the opposite (negative gradient) direction:

def fit_lr_approx(X: np.ndarray, y: np.ndarray, lr: float, epochs: int) -> np.ndarray:
	m = X.shape[0]
	ones_column: np.ndarray = np.ones((m, 1))
	X_b: np.ndarray = np.concatenate((ones_column, X), axis=1)  # bias column
	y = y.reshape(-1, 1)  # column vector, so X_b @ theta - y broadcasts correctly
	theta: np.ndarray = np.random.randn(X_b.shape[1], 1)
	for _ in range(epochs):
		gradient: np.ndarray = 2 / m * X_b.T @ (X_b @ theta - y)  # (2/m) X^T (X theta - y)
		theta -= lr * gradient
	return theta
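
On the same kind of synthetic data, the numerical fit should approach the closed-form solution; a rough sketch, with an arbitrary learning rate and epoch count:

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.normal(size=100)

theta_gd = fit_lr_approx(X, y, lr=0.1, epochs=2000)
print(theta_gd.ravel())    # approaches the closed-form solution
print(fit_lr_exact(X, y))  # ~[3.0, 1.5, -2.0]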

Implementation in PyTorch

If we’re allowed to use a library like PyTorch, we can skip both explicit differentiation and explicit SGD:

import torch
import torch.nn as nn

class LinearRegression(nn.Module):
	def __init__(self, dim: int) -> None:
		super().__init__()
		# dim feature weights plus one extra entry for the bias term
		self.weights: nn.Parameter = nn.Parameter(torch.randn(dim + 1, 1))

	def forward(self, x: torch.Tensor) -> torch.Tensor:
		# append a column of 1s so the bias is learned as the last entry of self.weights
		ones_column: torch.Tensor = torch.ones(x.shape[0], 1, device=x.device)
		x_b: torch.Tensor = torch.cat((x, ones_column), dim=1)
		return x_b @ self.weights
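
A minimal training-loop sketch for this module, relying on autograd for the gradients and torch.optim.SGD for the update step; the synthetic data, learning rate, and epoch count below are arbitrary choices for illustration:

# synthetic data: y = 3 + 1.5*x1 - 2*x2 + noise (arbitrary choice)
X = torch.randn(100, 2)
y = (3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * torch.randn(100)).unsqueeze(1)

model = LinearRegression(dim=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for _ in range(1000):
	optimizer.zero_grad()
	loss = loss_fn(model(X), y)
	loss.backward()   # autograd computes dloss/dweights for us
	optimizer.step()  # SGD applies the update: weights -= lr * grad

print(model.weights.detach().squeeze())  # roughly [1.5, -2.0, 3.0] (feature weights, then bias)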