Explain L1 and L2 Regularization

Question

Explain L1 (Lasso) and L2 (Ridge) regularization in the context of linear models. Discuss their mathematical formulations, the differences in their effects on model parameters, and scenarios where one might be preferred over the other.

Answer

L1 regularization, or Lasso, adds a penalty proportional to the sum of the absolute values of the coefficients to the loss function. This can drive some coefficients to exactly zero, so Lasso effectively performs feature selection. L2 regularization, or Ridge, adds a penalty proportional to the sum of the squared coefficients. This shrinks all coefficients toward zero but rarely makes them exactly zero, which helps handle multicollinearity and prevent overfitting.

Differences:

  • L1 regularization can lead to sparse models with few active features, making it useful when feature selection is needed.
  • L2 regularization is better for models where all features are likely to contribute and you want to prevent overfitting without eliminating features.

Use Cases:

  • Use L1 regularization when you suspect that only a few features carry significant predictive power.
  • Use L2 regularization when you have many correlated features and you want to retain all features but reduce model complexity.

Explanation

Theoretical Background: Regularization techniques are used to prevent overfitting by adding a penalty term to the loss function. For linear regression, the loss function is typically the mean squared error (MSE): $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$, where $h_\theta(x^{(i)})$ is the hypothesis (predicted value) for the $i$-th example, $y^{(i)}$ is the true value, and $m$ is the number of training examples.
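
To make the notation concrete, here is a minimal NumPy sketch of this unregularized loss for a linear hypothesis $h_\theta(x) = \theta^\top x$ (the function name and toy data are illustrative assumptions, not part of the question):

import numpy as np

def mse_loss(theta, X, y):
    # J(theta) = (1/2m) * sum((X @ theta - y)**2)
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

# Toy example: 3 samples, 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
print(mse_loss(theta, X, y))  # 14 / 6 = 2.333...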

L1 Regularization (Lasso): Adds the sum of the absolute values of the coefficients to the loss function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_j |\theta_j|$. This can result in a sparse model with some coefficients exactly zero, effectively selecting a subset of features.
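
The reason the L1 penalty produces exact zeros is easiest to see in one dimension: minimizing $\frac{1}{2}(\theta - z)^2 + \lambda|\theta|$ gives the soft-thresholding solution $\theta^* = \operatorname{sign}(z)\max(|z| - \lambda, 0)$, which is exactly zero whenever $|z| \le \lambda$. A minimal sketch (values chosen only for illustration):

import numpy as np

def soft_threshold(z, lam):
    # Minimizer of 0.5 * (theta - z)**2 + lam * abs(theta)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.1, 1.5]), lam=0.5))
# [-1.5 -0.  0.  1. ]  -- entries smaller in magnitude than lam become exactly zero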

L2 Regularization (Ridge): Adds the sum of the squares of the coefficients to the loss function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_j \theta_j^2$. This typically results in a model where all coefficients are small but non-zero, which helps manage multicollinearity and stabilizes the solution.
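
Ridge, unlike Lasso, has a closed-form solution. With the penalty written as $\frac{1}{2}\|X\theta - y\|^2 + \frac{\lambda}{2}\|\theta\|^2$ (the exact scaling of $\lambda$ depends on the loss convention used above), the minimizer is $\theta = (X^\top X + \lambda I)^{-1} X^\top y$. Adding $\lambda I$ makes $X^\top X$ invertible and better conditioned, which is why Ridge copes well with correlated features. A minimal NumPy sketch with assumed toy data:

import numpy as np

def ridge_closed_form(X, y, lam):
    # theta = (X^T X + lam * I)^{-1} X^T y
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Two nearly collinear columns: with lam = 0 the system would be ill-conditioned
X = np.array([[1.0, 1.0], [2.0, 2.001], [3.0, 2.999]])
y = np.array([1.0, 2.0, 3.0])
print(ridge_closed_form(X, y, lam=1.0))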

Practical Applications:

  • L1 Regularization is useful in high-dimensional datasets where feature selection is beneficial. It's commonly used in scenarios like text classification with many features (e.g., word counts).
  • L2 Regularization is preferred when many features may be correlated and all are expected to carry some signal, for example regression on financial data with highly correlated predictors.

Python Code Example:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (for illustration): 20 features, only 5 of them informative
X_train, y_train = make_regression(n_samples=100, n_features=20,
                                   n_informative=5, random_state=0)

# Lasso example: the L1 penalty drives many coefficients to exactly zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print("Non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))

# Ridge example: coefficients are shrunk but all stay non-zero
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
print("Non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))
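
In scikit-learn, alpha plays the role of $\lambda$ in the formulas above, and it is usually tuned by cross-validation rather than fixed by hand. A minimal sketch using the built-in LassoCV and RidgeCV estimators (the cv setting and alpha grid are arbitrary choices for illustration):

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

lasso_cv = LassoCV(cv=5).fit(X_train, y_train)
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train, y_train)
print("Selected Lasso alpha:", lasso_cv.alpha_)
print("Selected Ridge alpha:", ridge_cv.alpha_)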

Mermaid Diagram:

graph LR
    A[Linear Model] --> B[L1 Regularization]
    A --> C[L2 Regularization]
    B --> D{Effects}
    C --> E{Effects}
    D --> |Feature Selection| F(Sparse Model)
    E --> |Stability| G(Reduced Overfitting)
