Linear regression, considered to be the most basic of regression models, assumes a linear relationship between the independent variables and the dependent variable. It literally just fits a line to the data, minimizing squared error. Formally, linear regression predicts the scalar value
$$\hat{y} = \theta^\top x,$$
where $x$ is an $n$-dimensional vector of features (with an additional 1 appended for the intercept term), and $\theta$ is an $(n+1)$-dimensional vector of model parameters (weights).
We can extend this to a design matrix $X \in \mathbb{R}^{m \times (n+1)}$ whose rows are the $m$ training examples (each with the extra 1 for the intercept), so that all predictions are produced at once:
$$\hat{y} = X\theta.$$
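To make the notation concrete, here is a minimal NumPy sketch of the prediction step in design-matrix form (the function name predict_lr is my own, not from the original):

import numpy as np

def predict_lr(X: np.array, theta: np.array) -> np.array:
    # Add the intercept column of ones, then apply y_hat = X_b @ theta.
    ones_column: np.array = np.ones((X.shape[0], 1))
    X_b: np.array = np.concatenate((ones_column, X), axis=1)
    return X_b @ theta  # one prediction per row of X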
Optimizing linear regression
There are two approaches to optimizing linear regression: numerically or analytically.
Analytical optimization of linear regression
As noted, the loss function for linear regression is mean squared error. Recall that mean squared error loss is
$$L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2.$$
We can rewrite this in a vectorized form to eliminate the sum as
$$L(\theta) = \frac{1}{m} (X\theta - y)^\top (X\theta - y),$$
where $X$ is the $m \times (n+1)$ design matrix and $y$ is the vector of targets. So the above can be expanded as
$$L(\theta) = \frac{1}{m} \left( \theta^\top X^\top X \theta - \theta^\top X^\top y - y^\top X \theta + y^\top y \right).$$
But $\theta^\top X^\top y$ is a scalar, so it equals its own transpose $y^\top X \theta$ and the two middle terms combine:
$$L(\theta) = \frac{1}{m} \left( \theta^\top X^\top X \theta - 2 y^\top X \theta + y^\top y \right).$$
To optimize $L(\theta)$, we take its gradient with respect to $\theta$:
$$\nabla_\theta L(\theta) = \frac{1}{m} \left( 2 X^\top X \theta - 2 X^\top y \right).$$
Since the loss is convex, setting this gradient to zero gives the global minimum, which yields the normal equations
$$X^\top X \theta = X^\top y.$$
Multiplying the normal equations on the left by $(X^\top X)^{-1}$ gives
$$(X^\top X)^{-1} X^\top X \theta = (X^\top X)^{-1} X^\top y.$$
Therefore
$$\theta = (X^\top X)^{-1} X^\top y.$$
As long as $X^\top X$ is invertible, this is the unique minimizer of the loss.
This is the closed-form optimization for linear regression. If $X^\top X$ is singular (for example, because some features are linearly dependent), the inverse can be replaced with the Moore–Penrose pseudoinverse.
We can implement this in NumPy as
import numpy as np

def fit_lr_exact(X: np.array, y: np.array) -> np.array:
    # Prepend a column of ones so the first parameter is the intercept.
    ones_column: np.array = np.ones((X.shape[0], 1))
    X_b: np.array = np.concatenate((ones_column, X), axis=1)  # bias column
    # Closed-form solution: theta = (X^T X)^{-1} X^T y
    theta: np.array = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    return theta
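Note that explicitly inverting $X^\top X$ can be numerically fragile when it is ill-conditioned or singular. A sketch of a more robust variant (the name fit_lr_lstsq is mine, not from the original) solves the least-squares problem directly with np.linalg.lstsq, which handles rank-deficient design matrices by returning the minimum-norm solution:

def fit_lr_lstsq(X: np.array, y: np.array) -> np.array:
    # Same bias-column construction as fit_lr_exact.
    ones_column: np.array = np.ones((X.shape[0], 1))
    X_b: np.array = np.concatenate((ones_column, X), axis=1)
    # lstsq returns (solution, residuals, rank, singular values); we only need the solution.
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return theta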
Numerical optimization of linear regression
We can also approximate the optimal $\theta$ numerically, typically using gradient descent. To do this, we still need the gradient of the loss with respect to the parameters,
$$\nabla_\theta L(\theta) = \frac{2}{m} X^\top (X\theta - y).$$
With this information, implementing gradient descent is literally as simple as evaluating the above and stepping by some learning rate in the opposite direction:
def fit_lr_approx(X: np.array, y: np.array, lr: float, epochs: int) -> np.array:
    m = X.shape[0]
    ones_column: np.array = np.ones((X.shape[0], 1))
    X_b: np.array = np.concatenate((ones_column, X), axis=1)  # bias column
    theta: np.array = np.random.randn(X_b.shape[1], 1)  # random initialization
    for _ in range(epochs):
        # Gradient of the MSE loss: (2/m) X^T (X theta - y)
        gradient: np.array = 2 / m * X_b.T @ (X_b @ theta - y)
        theta -= lr * gradient  # step against the gradient
    return theta
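As a quick sanity check (an illustrative sketch with arbitrarily chosen data and hyperparameters, not from the original), the numerical fit should land close to the closed-form solution:

# Synthetic data: two features, intercept 3.0, a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + X @ np.array([[2.0], [-1.0]]) + 0.01 * rng.normal(size=(200, 1))

theta_exact = fit_lr_exact(X, y)
theta_gd = fit_lr_approx(X, y, lr=0.1, epochs=5000)
print(np.max(np.abs(theta_exact - theta_gd)))  # should be close to zero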
Implementation in PyTorch
If we're allowed to use a library like PyTorch, we can skip both explicit differentiation and the hand-written gradient-descent update:
import torch
import torch.nn as nn

class LinearRegression(nn.Module):
    def __init__(self, dim: int) -> None:
        super(LinearRegression, self).__init__()
        # One weight per feature, plus one for the intercept.
        self.weights: nn.Parameter = nn.Parameter(torch.randn(dim + 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Append the bias column of ones, then apply the linear map.
        ones_column: torch.Tensor = torch.ones(x.shape[0], 1)
        x_b: torch.Tensor = torch.cat((x, ones_column), dim=1)
        return x_b @ self.weights
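The module above only defines the forward pass. A training loop that leans on autograd for the gradients and torch.optim.SGD for the updates might look like the following sketch (the data, loss, learning rate, and epoch count are my own illustrative choices):

# Illustrative training loop; data and hyperparameters are arbitrary.
model = LinearRegression(dim=3)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

X = torch.randn(100, 3)                              # synthetic features
y = X @ torch.tensor([[2.0], [-1.0], [0.5]]) + 3.0   # synthetic targets, intercept 3

for _ in range(2000):
    optimizer.zero_grad()
    loss = criterion(model(X), y)  # MSE between predictions and targets
    loss.backward()                # autograd computes the gradients
    optimizer.step()               # SGD applies the parameter update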