@yfnaji commented Oct 12, 2025

This PR implements the Ridge regression model as part of the RustQuant_ml crate.

Ridge regression extends linear regression by adding an L2 regularisation term to the loss function, penalising large coefficient values to reduce overfitting.

This implementation is designed to closely align with Scikit-Learn's linear_model.Ridge. A Scikit-Learn script that fits Ridge on the same data used by this PR's unit tests is available here.

Take a feature matrix $X \in \mathbb{R}^{n \times d}$, a response vector $\mathbf{y} \in \mathbb{R}^n$ and a regularisation parameter $\lambda > 0$.

The loss function for a Ridge regression model is:

$$ C:=\lVert \mathbf{y} - X\beta \rVert^2_2 + \lambda\lVert\beta\rVert^2_2 $$
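
For concreteness, here is a minimal sketch of this loss in Rust, assuming the nalgebra crate; the helper name is hypothetical and the PR's actual implementation may differ:

```rust
use nalgebra::{DMatrix, DVector};

/// Ridge loss C = ‖y − Xβ‖²₂ + λ‖β‖²₂.
/// Illustrative helper, not part of this PR's API.
fn ridge_loss(x: &DMatrix<f64>, y: &DVector<f64>, beta: &DVector<f64>, lambda: f64) -> f64 {
    // Residual vector y − Xβ.
    let residual = y - x * beta;
    // Squared L2 norm of the residual plus the L2 penalty on the coefficients.
    residual.norm_squared() + lambda * beta.norm_squared()
}
```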

The optimal values for $\beta$ have a closed-form solution. The loss function above can be written as

$$ \left(\mathbf{y}-X\beta\right)^T\left(\mathbf{y}-X\beta\right) + \lambda\beta^T\beta $$

Expanding gives

$$ \mathbf{y}^T\mathbf{y} -\beta^TX^T\mathbf{y} - \underbrace{\mathbf{y}^TX\beta}_{*}+\underbrace{\beta^TX^TX\beta+\lambda\beta^T I_\text{d}\,\beta}_{**} $$

where $I_{\text{d}}$ is the identity matrix.

Note that * is a scalar value, so it is equal to its own transpose:

$$ \mathbf{y}^TX\beta = \left(\mathbf{y}^TX\beta\right)^T = \beta^TX^T\mathbf{y} $$
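
This can be verified with a quick dimension check:

$$ \underbrace{\mathbf{y}^T}_{1\times n}\,\underbrace{X}_{n\times d}\,\underbrace{\beta}_{d\times 1} \in \mathbb{R}^{1\times 1} $$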

We can also combine the terms in ** to give:

$$ \beta^TX^TX\beta+\lambda\beta^T I_\text{d}\,\beta = \beta^T\left(X^TX + \lambda \cdot I_\text{d} \right)\beta $$

Now we can further simplify the loss function:

$$ \mathbf{y}^T\mathbf{y} -\beta^TX^T\mathbf{y} - \beta^TX^T\mathbf{y}+\beta^T\left(X^TX + \lambda \cdot I_\text{d} \right)\beta $$

$$ \Rightarrow \mathbf{y}^T\mathbf{y} -2\beta^TX^T\mathbf{y} + \beta^T\left(X^TX + \lambda \cdot I_\text{d} \right)\beta $$

Now differentiate with respect to $\beta$ and set the derivative to $0$ to find the optimal coefficients $\hat{\beta}$:

$$ \left.\frac{\partial C}{\partial \beta} \right\vert_{\beta=\hat{\beta}}= -2 X^T \mathbf{y} + \underbrace{2\left(X^TX + \lambda I_{\text{d}}\right)\hat{\beta}}_{***} = 0 $$

Note that *** was derived using the fact that

$$ \frac{\partial}{\partial \mathbf{x}}\left[\mathbf{x}^TA\mathbf{x}\right]=\left(A + A^T\right)\mathbf{x} $$

and if $A$ is symmetric, the above simplifies to $2A\mathbf{x}$.
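
This applies here because the quadratic-form matrix is symmetric:

$$ \left(X^TX + \lambda I_\text{d}\right)^T = \left(X^TX\right)^T + \lambda I_\text{d}^T = X^TX + \lambda I_\text{d} $$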

Solving for $\hat{\beta}$:

$$ \left(X^TX + \lambda I_{\text{d}}\right)\hat{\beta} = X^T \mathbf{y} $$

$$ \Rightarrow \hat{\beta} = \left(X^TX + \lambda I_{\text{d}}\right)^{-1}X^T \mathbf{y} $$
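
For $\lambda > 0$, $X^TX + \lambda I_\text{d}$ is symmetric positive definite, so it is always invertible and the system can be solved with a Cholesky factorisation rather than by forming the inverse explicitly. A minimal sketch in Rust, again assuming nalgebra; the function name and signature are illustrative rather than the crate's actual API:

```rust
use nalgebra::{DMatrix, DVector};

/// Closed-form Ridge fit: solves (XᵀX + λI) β̂ = Xᵀy.
/// Illustrative helper, not the API introduced by this PR.
fn ridge_fit(x: &DMatrix<f64>, y: &DVector<f64>, lambda: f64) -> Option<DVector<f64>> {
    let d = x.ncols();
    // XᵀX + λI is symmetric positive definite for λ > 0, so Cholesky applies.
    let gram = x.transpose() * x + DMatrix::identity(d, d) * lambda;
    let rhs = x.transpose() * y;
    gram.cholesky().map(|chol| chol.solve(&rhs))
}
```

Solving the linear system this way is cheaper and numerically more stable than computing $\left(X^TX + \lambda I_\text{d}\right)^{-1}$ and multiplying it out.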
