Skip to content

Linear Regression & Gradient Descent

1. Why this matters

Linear regression is the first model you should try for any regression problem. If it works, great — you have a fast, interpretable model. If it doesn't, the reasons it didn't (residual patterns, high-variance coefficients) tell you what to fix next.

Gradient descent isn't just for linear regression — it's the optimization engine behind logistic regression, neural networks, and most modern ML. Understanding it here pays off for everything later.

2. Mental model

Find the line/hyperplane that minimizes the total squared distance to all points:

flowchart LR
    A[Training data X, y] --> M[Find β that minimizes<br/>sum_y_predicted_y_squared]
    M --> P[Trained β coefficients]
    P --> N[New X] --> Q[Predicted y = β · X]
   y |          •    
     |       •  •  /
     |   •  •  /•
     |  •  /•           ← fitted line
     | / •
     |/
     +--------------- x

The model is y = β₀ + β₁x for simple regression, y = β₀ + Σ βᵢxᵢ for multiple. Closed-form solution exists; otherwise gradient descent.

3. Simple Linear Regression

One feature, one target:

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])       # must be 2D
y = np.array([2.1, 4.0, 6.1, 8.2, 10.0])

model = LinearRegression().fit(X, y)
print(f"slope (β₁) = {model.coef_[0]:.3f}")
print(f"intercept (β₀) = {model.intercept_:.3f}")
print(model.predict([[6]]))                    # → ~12

The math:

β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
β₀ = ȳ − β₁ x̄

Predictions: ŷ = β₀ + β₁x.

4. Multiple Linear Regression

Multiple features, same model:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr",    LinearRegression()),
]).fit(X_tr, y_tr)

print("R² on test:", pipe.score(X_te, y_te))
print("Coefficients:")
for name, coef in zip(X.columns, pipe.named_steps["lr"].coef_):
    print(f"  {name:20s} {coef:+.3f}")
print("Intercept:", pipe.named_steps["lr"].intercept_)

Why scale before linear regression: scaling doesn't change the predictions but makes coefficients comparable in magnitude (1 unit change in a scaled feature has a comparable interpretation).

5. Closed-Form (Normal Equation)

The optimal coefficients have a closed-form:

β = (XᵀX)⁻¹ Xᵀy

LinearRegression() uses this by default. Pros: no hyperparameters, exact. Cons: O(p³) for matrix inverse — slow when p > ~10,000.

6. Gradient Descent

When closed-form is too expensive or you want online learning, iteratively update coefficients to descend the loss surface.

The recipe:

1. Initialize β randomly (or zeros)
2. Compute predictions:  ŷ = X·β
3. Compute loss:         L = mean((ŷ - y)²)
4. Compute gradient:     g = (2/n) Xᵀ (ŷ - y)
5. Update:               β ← β − η · g    (η = learning rate)
6. Repeat 2–5 until convergence
flowchart TD
    INIT[Init β] --> PRED[Predict y_hat]
    PRED --> LOSS[Compute loss MSE]
    LOSS --> GRAD[Compute gradient]
    GRAD --> UPD[β -= η * grad]
    UPD --> CHK{Converged or<br/>max_iter?}
    CHK -->|no| PRED
    CHK -->|yes| OUT[Final β]
import numpy as np

def gd_linear_regression(X, y, lr=0.01, n_iter=1000):
    n, p = X.shape
    X1 = np.c_[np.ones(n), X]              # add bias column
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        grad = (2/n) * X1.T @ (X1 @ beta - y)
        beta -= lr * grad
    return beta[0], beta[1:]               # intercept, slopes

intercept, slopes = gd_linear_regression(X_scaled, y)

7. Types of Gradient Descent

Variant Batch size per step Pros Cons
Batch (full) All N samples Stable, smooth path Slow on big data, can't stream
Stochastic (SGD) 1 sample Fast per step, online-able, noisy escapes saddle points Noisy convergence, needs LR schedule
Mini-batch 32–512 typical Best of both — production default Tune batch size + LR

sklearn's SGDRegressor implements stochastic / mini-batch GD:

from sklearn.linear_model import SGDRegressor

# Scaling is REQUIRED for SGD
sgd = SGDRegressor(
    loss="squared_error",
    learning_rate="invscaling",          # eta = eta0 / pow(t, power_t)
    eta0=0.01,
    max_iter=1000,
    tol=1e-4,
    random_state=42,
)
sgd.fit(X_scaled, y)

8. Regression Metrics

Metric Formula Best for
MAE (Mean Absolute Error) mean( ŷ - y
MSE (Mean Squared Error) mean((ŷ - y)²) Penalizes large errors more; differentiable for GD
RMSE (Root MSE) √MSE Same units as y; default for benchmarking
(Coefficient of Determination) 1 − SS_res / SS_tot Proportion of variance explained; 1 = perfect, 0 = no better than mean, negative = worse
MAPE (Mean Absolute % Error) mean( ŷ - y
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, root_mean_squared_error, r2_score,
)

y_pred = model.predict(X_test)
print("MAE :", mean_absolute_error(y_test, y_pred))
print("MSE :", mean_squared_error(y_test, y_pred))
print("RMSE:", root_mean_squared_error(y_test, y_pred))      # sklearn ≥ 1.4
# or:    rmse = mean_squared_error(y_test, y_pred, squared=False)
print("R²  :", r2_score(y_test, y_pred))

How to pick: - Picking ONE for benchmarking → RMSE (industry standard). - Outliers dominate / business cares about typical error → MAE. - Need a 0-1 score that's comparable across datasets → .

9. Common pitfalls

  • Forgetting to scale before SGD. SGD diverges or crawls.
  • Trusting R² blindly. A high R² on training tells you nothing about generalization. Always use a holdout test or CV.
  • Skipping the residual plot. Plot (y_pred, y_pred - y_true). Patterns (curves, fans) signal model mis-specification — add polynomial features or switch model class.
  • Multicollinearity wrecking coefficients. Two highly correlated features → unstable coefficients with huge variance. Drop one, regularize (Ridge), or use PCA.
  • Predicting outside training range (extrapolation). Linear regression cheerfully extrapolates to wrong answers. Always check min(X_test) >= min(X_train).
  • Using LinearRegression() on > 10k features. O(p³) matrix inverse — slow. Switch to SGDRegressor or Ridge(solver="lsqr").
  • MAPE on data with zeros in y. Division by zero. Use MAE or sMAPE instead.

10. When to use vs not use

Use Linear Regression when Use something else when
Relationship looks roughly linear Relationships are clearly non-linear (decision boundaries, interactions you can't easily encode)
You need interpretability (per-feature impact) Black-box accuracy is OK
Few features, small dataset Massive feature counts → reach for boosting or NN
As a baseline before any complex model After you've tried simpler models first ← always
Stable, low-variance prediction needed Sparse/skewed target → tree-based

11. Cheatsheet

from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score, root_mean_squared_error,
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Closed-form (default for small/mid data)
model = LinearRegression().fit(X_train, y_train)
model.coef_, model.intercept_

# Inside a pipeline (always scale)
pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])

# Cross-validate
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(-scores.mean())   # RMSE

# Stochastic GD — for huge data or online learning
sgd = SGDRegressor(loss="squared_error", learning_rate="invscaling",
                   eta0=0.01, max_iter=1000, tol=1e-4).fit(X_scaled, y)

# Residual plot — DO this after every linear model
import matplotlib.pyplot as plt
resid = y_test - model.predict(X_test)
plt.scatter(model.predict(X_test), resid, alpha=0.5)
plt.axhline(0, color="red"); plt.xlabel("predicted"); plt.ylabel("residual")

12. Q&A — recall test

  • Q: Closed-form vs gradient descent for linear regression? A: Closed-form (LinearRegression()) gives an exact solution but is O(p³). GD (SGDRegressor) is iterative, scales to millions of features/samples, requires scaling and tuning.

  • Q: Why does SGD need scaled features? A: Gradient magnitudes scale with feature ranges. Big-range features dominate the update direction; small ones become invisible. Scaling equalizes them.

  • Q: MAE vs RMSE — when to prefer each? A: RMSE penalizes large errors more (squared term); use as default benchmark. MAE is robust to outliers and reports average error in original units; use when outliers are real but you don't want them to dominate the score.

  • Q: What does R² = 0 mean? R² = negative? A: R²=0 → your model is no better than always predicting the mean. R²<0 → your model is WORSE than the mean baseline (rare but happens on tiny test sets or wildly wrong models).

  • Q: First plot to make after fitting a linear regression? A: Residual plot: (predicted, predicted - true). Patterns = mis-specification, fans = heteroscedasticity, outliers = obvious points to investigate.

  • Q: Why scale before linear regression when sklearn's LinearRegression already finds the exact optimum? A: Scaling doesn't change the predictions but makes coefficient magnitudes interpretable on a comparable axis ("1 std change in feature i → β_i change in y"). Also required for regularized variants (Ridge, Lasso) where scaling affects the penalty.

Practice

What does this print?

Expected: True

import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])      # y = 2x + 1
model = LinearRegression().fit(X, y)
print(round(model.coef_[0]) == 2 and round(model.intercept_) == 1)

Use np.allclose to compare floats (== fails on float precision)

Expected: True

import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])
model = LinearRegression().fit(X, y)
print(model.coef_[0] == 2.0)        # bug: float precision — use np.isclose

Quiz — Quick check

What you remember

Q1. What does linear regression assume about the relationship between features and target?

  • Linear (a straight-line or hyperplane relationship)
  • Non-linear
  • Quadratic
  • No assumption

Why: Linear regression models y = β₀ + β₁x₁ + β₂x₂ + .... If the actual relationship is non-linear, you need polynomial features, transformations (log, sqrt), or a non-linear model.

Q2. What does R² measure?

  • Number of features
  • Fraction of variance in y explained by the model (0 = no fit, 1 = perfect)
  • Mean squared error
  • Number of parameters

Why: R² = 1 means the model explains all variance; R² = 0 means it's no better than predicting the mean of y. R² can be negative on test data when the model is worse than the mean baseline.

Q3. When does the closed-form solution β = (XᵀX)⁻¹ Xᵀy work?

  • Always
  • When XᵀX is invertible (no perfect multicollinearity) and the dataset fits in memory
  • Only for 2D data
  • Only for classification

Why: For huge datasets or near-singular XᵀX, the closed form fails or is unstable. That's when gradient descent or np.linalg.lstsq (more numerically stable) comes in.

Common doubts

Why do my coefficients change dramatically when I add a feature?

Multicollinearity. When two features are highly correlated, their individual coefficients become unstable — they "trade off" between each other. Solutions: drop one, use Ridge (which stabilizes), or check VIF (variance inflation factor) for problematic features.

Is linear regression still useful in 2026?

Yes — for interpretability (you can explain every coefficient to a stakeholder), for baselines (always beat the linear baseline first), and for problems where the relationship really is approximately linear. Often the best model for small datasets too.

What's the difference between MSE and RMSE?

RMSE = sqrt(MSE). RMSE is in the same units as y (e.g., dollars), so it's more interpretable. MSE is what's actually minimized during training (mathematically easier). Same ranking of models — both pick the same best model.