Linear Regression & Gradient Descent¶
1. Why this matters¶
Linear regression is the first model you should try for any regression problem. If it works, great — you have a fast, interpretable model. If it doesn't, the reasons it didn't (residual patterns, high-variance coefficients) tell you what to fix next.
Gradient descent isn't just for linear regression — it's the optimization engine behind logistic regression, neural networks, and most modern ML. Understanding it here pays off for everything later.
2. Mental model¶
Find the line/hyperplane that minimizes the total squared distance to all points:
flowchart LR
A[Training data X, y] --> M[Find β that minimizes<br/>sum_y_predicted_y_squared]
M --> P[Trained β coefficients]
P --> N[New X] --> Q[Predicted y = β · X]
The model is y = β₀ + β₁x for simple regression, y = β₀ + Σ βᵢxᵢ for multiple. Closed-form solution exists; otherwise gradient descent.
3. Simple Linear Regression¶
One feature, one target:
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]]) # must be 2D
y = np.array([2.1, 4.0, 6.1, 8.2, 10.0])
model = LinearRegression().fit(X, y)
print(f"slope (β₁) = {model.coef_[0]:.3f}")
print(f"intercept (β₀) = {model.intercept_:.3f}")
print(model.predict([[6]])) # → ~12
The math:
Predictions: ŷ = β₀ + β₁x.
4. Multiple Linear Regression¶
Multiple features, same model:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = Pipeline([
("scale", StandardScaler()),
("lr", LinearRegression()),
]).fit(X_tr, y_tr)
print("R² on test:", pipe.score(X_te, y_te))
print("Coefficients:")
for name, coef in zip(X.columns, pipe.named_steps["lr"].coef_):
print(f" {name:20s} {coef:+.3f}")
print("Intercept:", pipe.named_steps["lr"].intercept_)
Why scale before linear regression: scaling doesn't change the predictions but makes coefficients comparable in magnitude (1 unit change in a scaled feature has a comparable interpretation).
5. Closed-Form (Normal Equation)¶
The optimal coefficients have a closed-form:
LinearRegression() uses this by default. Pros: no hyperparameters, exact. Cons: O(p³) for matrix inverse — slow when p > ~10,000.
6. Gradient Descent¶
When closed-form is too expensive or you want online learning, iteratively update coefficients to descend the loss surface.
The recipe:
1. Initialize β randomly (or zeros)
2. Compute predictions: ŷ = X·β
3. Compute loss: L = mean((ŷ - y)²)
4. Compute gradient: g = (2/n) Xᵀ (ŷ - y)
5. Update: β ← β − η · g (η = learning rate)
6. Repeat 2–5 until convergence
flowchart TD
INIT[Init β] --> PRED[Predict y_hat]
PRED --> LOSS[Compute loss MSE]
LOSS --> GRAD[Compute gradient]
GRAD --> UPD[β -= η * grad]
UPD --> CHK{Converged or<br/>max_iter?}
CHK -->|no| PRED
CHK -->|yes| OUT[Final β]
import numpy as np
def gd_linear_regression(X, y, lr=0.01, n_iter=1000):
n, p = X.shape
X1 = np.c_[np.ones(n), X] # add bias column
beta = np.zeros(p + 1)
for _ in range(n_iter):
grad = (2/n) * X1.T @ (X1 @ beta - y)
beta -= lr * grad
return beta[0], beta[1:] # intercept, slopes
intercept, slopes = gd_linear_regression(X_scaled, y)
7. Types of Gradient Descent¶
| Variant | Batch size per step | Pros | Cons |
|---|---|---|---|
| Batch (full) | All N samples | Stable, smooth path | Slow on big data, can't stream |
| Stochastic (SGD) | 1 sample | Fast per step, online-able, noisy escapes saddle points | Noisy convergence, needs LR schedule |
| Mini-batch | 32–512 typical | Best of both — production default | Tune batch size + LR |
sklearn's SGDRegressor implements stochastic / mini-batch GD:
from sklearn.linear_model import SGDRegressor
# Scaling is REQUIRED for SGD
sgd = SGDRegressor(
loss="squared_error",
learning_rate="invscaling", # eta = eta0 / pow(t, power_t)
eta0=0.01,
max_iter=1000,
tol=1e-4,
random_state=42,
)
sgd.fit(X_scaled, y)
8. Regression Metrics¶
| Metric | Formula | Best for |
|---|---|---|
| MAE (Mean Absolute Error) | mean( | ŷ - y |
| MSE (Mean Squared Error) | mean((ŷ - y)²) | Penalizes large errors more; differentiable for GD |
| RMSE (Root MSE) | √MSE | Same units as y; default for benchmarking |
| R² (Coefficient of Determination) | 1 − SS_res / SS_tot | Proportion of variance explained; 1 = perfect, 0 = no better than mean, negative = worse |
| MAPE (Mean Absolute % Error) | mean( | ŷ - y |
from sklearn.metrics import (
mean_absolute_error, mean_squared_error, root_mean_squared_error, r2_score,
)
y_pred = model.predict(X_test)
print("MAE :", mean_absolute_error(y_test, y_pred))
print("MSE :", mean_squared_error(y_test, y_pred))
print("RMSE:", root_mean_squared_error(y_test, y_pred)) # sklearn ≥ 1.4
# or: rmse = mean_squared_error(y_test, y_pred, squared=False)
print("R² :", r2_score(y_test, y_pred))
How to pick: - Picking ONE for benchmarking → RMSE (industry standard). - Outliers dominate / business cares about typical error → MAE. - Need a 0-1 score that's comparable across datasets → R².
9. Common pitfalls¶
- ❗ Forgetting to scale before SGD. SGD diverges or crawls.
- ❗ Trusting R² blindly. A high R² on training tells you nothing about generalization. Always use a holdout test or CV.
- ❗ Skipping the residual plot. Plot
(y_pred, y_pred - y_true). Patterns (curves, fans) signal model mis-specification — add polynomial features or switch model class. - ❗ Multicollinearity wrecking coefficients. Two highly correlated features → unstable coefficients with huge variance. Drop one, regularize (Ridge), or use PCA.
- ❗ Predicting outside training range (extrapolation). Linear regression cheerfully extrapolates to wrong answers. Always check
min(X_test) >= min(X_train). - ❗ Using
LinearRegression()on > 10k features. O(p³) matrix inverse — slow. Switch toSGDRegressororRidge(solver="lsqr"). - ❗ MAPE on data with zeros in
y. Division by zero. Use MAE or sMAPE instead.
10. When to use vs not use¶
| Use Linear Regression when | Use something else when |
|---|---|
| Relationship looks roughly linear | Relationships are clearly non-linear (decision boundaries, interactions you can't easily encode) |
| You need interpretability (per-feature impact) | Black-box accuracy is OK |
| Few features, small dataset | Massive feature counts → reach for boosting or NN |
| As a baseline before any complex model | After you've tried simpler models first ← always |
| Stable, low-variance prediction needed | Sparse/skewed target → tree-based |
11. Cheatsheet¶
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import (
mean_absolute_error, mean_squared_error, r2_score, root_mean_squared_error,
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Closed-form (default for small/mid data)
model = LinearRegression().fit(X_train, y_train)
model.coef_, model.intercept_
# Inside a pipeline (always scale)
pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
# Cross-validate
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(-scores.mean()) # RMSE
# Stochastic GD — for huge data or online learning
sgd = SGDRegressor(loss="squared_error", learning_rate="invscaling",
eta0=0.01, max_iter=1000, tol=1e-4).fit(X_scaled, y)
# Residual plot — DO this after every linear model
import matplotlib.pyplot as plt
resid = y_test - model.predict(X_test)
plt.scatter(model.predict(X_test), resid, alpha=0.5)
plt.axhline(0, color="red"); plt.xlabel("predicted"); plt.ylabel("residual")
12. Q&A — recall test¶
-
Q: Closed-form vs gradient descent for linear regression? A: Closed-form (
LinearRegression()) gives an exact solution but is O(p³). GD (SGDRegressor) is iterative, scales to millions of features/samples, requires scaling and tuning. -
Q: Why does SGD need scaled features? A: Gradient magnitudes scale with feature ranges. Big-range features dominate the update direction; small ones become invisible. Scaling equalizes them.
-
Q: MAE vs RMSE — when to prefer each? A: RMSE penalizes large errors more (squared term); use as default benchmark. MAE is robust to outliers and reports average error in original units; use when outliers are real but you don't want them to dominate the score.
-
Q: What does R² = 0 mean? R² = negative? A: R²=0 → your model is no better than always predicting the mean. R²<0 → your model is WORSE than the mean baseline (rare but happens on tiny test sets or wildly wrong models).
-
Q: First plot to make after fitting a linear regression? A: Residual plot:
(predicted, predicted - true). Patterns = mis-specification, fans = heteroscedasticity, outliers = obvious points to investigate. -
Q: Why scale before linear regression when sklearn's
LinearRegressionalready finds the exact optimum? A: Scaling doesn't change the predictions but makes coefficient magnitudes interpretable on a comparable axis ("1 std change in feature i → β_i change in y"). Also required for regularized variants (Ridge, Lasso) where scaling affects the penalty.
Practice¶
What does this print?
Expected: True
Use np.allclose to compare floats (== fails on float precision)
Expected: True
Quiz — Quick check¶
What you remember
Q1. What does linear regression assume about the relationship between features and target?
- Linear (a straight-line or hyperplane relationship)
- Non-linear
- Quadratic
- No assumption
Why: Linear regression models
y = β₀ + β₁x₁ + β₂x₂ + .... If the actual relationship is non-linear, you need polynomial features, transformations (log, sqrt), or a non-linear model.
Q2. What does R² measure?
- Number of features
- Fraction of variance in y explained by the model (0 = no fit, 1 = perfect)
- Mean squared error
- Number of parameters
Why: R² = 1 means the model explains all variance; R² = 0 means it's no better than predicting the mean of y. R² can be negative on test data when the model is worse than the mean baseline.
Q3. When does the closed-form solution β = (XᵀX)⁻¹ Xᵀy work?
- Always
- When XᵀX is invertible (no perfect multicollinearity) and the dataset fits in memory
- Only for 2D data
- Only for classification
Why: For huge datasets or near-singular XᵀX, the closed form fails or is unstable. That's when gradient descent or
np.linalg.lstsq(more numerically stable) comes in.
Common doubts¶
Why do my coefficients change dramatically when I add a feature?
Multicollinearity. When two features are highly correlated, their individual coefficients become unstable — they "trade off" between each other. Solutions: drop one, use Ridge (which stabilizes), or check VIF (variance inflation factor) for problematic features.
Is linear regression still useful in 2026?
Yes — for interpretability (you can explain every coefficient to a stakeholder), for baselines (always beat the linear baseline first), and for problems where the relationship really is approximately linear. Often the best model for small datasets too.
What's the difference between MSE and RMSE?
RMSE = sqrt(MSE). RMSE is in the same units as y (e.g., dollars), so it's more interpretable. MSE is what's actually minimized during training (mathematically easier). Same ranking of models — both pick the same best model.