Feature Scaling

1. Why this matters

If feature A ranges 0–1 and feature B ranges 0–1,000,000, a distance metric (KNN, K-means) is dominated almost entirely by B, and a gradient-based optimizer (logistic regression, neural nets) faces a badly conditioned loss surface that slows or destabilizes convergence. Scaling puts the features on a comparable footing.
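To make the distance claim concrete, here is a small sketch with made-up points (the data and the `dist` helper are illustrative assumptions, not from a real dataset). Before scaling, the Euclidean distance is decided by feature B alone; after standardizing, a large difference in feature A counts again.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 lives in [0, 1], column 1 in [0, 1_000_000].
X = np.array([
    [0.0, 500_000.0],
    [1.0, 500_100.0],    # opposite end of feature A, almost identical B
    [0.0, 510_000.0],    # identical A, B differs by 1% of its range
    [0.5, 0.0],
    [0.5, 1_000_000.0],
])

def dist(a, b):
    """Plain Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw distances: the A difference (1.0) is invisible next to B.
raw_near = dist(X[0], X[1])   # about 100
raw_far = dist(X[0], X[2])    # 10000

# Standardized distances: both columns now contribute on the same scale.
X_s = StandardScaler().fit_transform(X)
scaled_near = dist(X_s[0], X_s[1])
scaled_far = dist(X_s[0], X_s[2])

print(raw_near < raw_far)        # True: raw distance only "sees" B
print(scaled_near > scaled_far)  # True: the full-range change in A now dominates
```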

Algorithm	Needs scaling?
KNN, K-means	YES: distance-based
Logistic / Linear regression	YES: faster convergence, even-handed regularization, comparable coefficients
SVM	YES: margins and kernels depend on distances between points
PCA	YES: variance-based
Neural networks	YES: convergence depends on it
Decision tree	NO: splits are scale-invariant
Random Forest / Gradient Boosting / XGBoost	NO: splits are scale-invariant
Naive Bayes	NO: each feature is modeled independently
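The "NO" rows for trees can be checked empirically. The sketch below uses synthetic data (an assumption for illustration) and fits the same depth-limited tree on raw and standardized features; because a tree only compares each feature against a threshold, the two trees should partition the samples identically.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: three features, one of them on a much larger scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] *= 1_000
y = (X[:, 0] + X[:, 1] / 1_000 > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed, different feature scales.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Compare predictions sample by sample.
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same)
```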

2. Mental model

Scaling is a per-column transformation — for each column independently, apply the same formula to every value:

flowchart LR
    A[Raw column<br/>x: 100, 200, 5000] --> S[Scaler.fit on train]
    S -->|"learns mean/std (or min/max)"| P[Params]
    P --> T[Transform train & test]
    T --> O["Scaled column<br/>e.g. z-scores -0.73, -0.69, 1.41"]

The cardinal rule: fit on train, transform on both train and test. Never fit on test.

3. Methods

StandardScaler (Z-score normalization): most common default

z = (x - mean) / std

After transform: the training column has mean 0 and std 1. Test data transformed with the same parameters will only be approximately so.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)        # use train's mean/std

Use when: features are roughly normally distributed; want symmetric range; default choice for linear models, SVM, NN.
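As a sanity check, the z-score formula can be reproduced by hand. This is a minimal sketch on a toy column (the values are illustrative); note that StandardScaler uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[100.0], [200.0], [5000.0]])
X_test = np.array([[300.0]])

scaler = StandardScaler().fit(X_train)

# Same computation by hand, using train statistics only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)              # ddof=0, matching StandardScaler
manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std    # test reuses the train mean/std

print(np.allclose(manual_train, scaler.transform(X_train)))  # True
print(np.allclose(manual_test, scaler.transform(X_test)))    # True
print(manual_train.ravel().round(2))   # roughly -0.73, -0.69, 1.41
```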

MinMaxScaler: scale to [0, 1]

x' = (x - min) / (max - min)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))      # default

Use when: you need bounded output (image pixels, neural net inputs) and the feature has stable, known limits. Min-max scaling is a linear map, so it keeps the shape of the distribution and only changes its range.

Warning: sensitive to outliers — one big value compresses the rest into a tiny range.
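The outlier warning is easy to see on a toy column. In this sketch (illustrative values) the formula is also reproduced by hand, and a new value above the training maximum shows that the [0, 1] bound only holds for data inside the range seen during fit.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy income column with one extreme value.
income = np.array([[40_000.0], [50_000.0], [60_000.0], [5_000_000.0]])

scaler = MinMaxScaler().fit(income)
scaled = scaler.transform(income)

# Same formula by hand: (x - min) / (max - min).
col_min, col_max = income.min(axis=0), income.max(axis=0)
manual = (income - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))   # True
print(scaled.ravel().round(3))       # roughly 0, 0.002, 0.004, 1

# A value larger than the training max lands outside [0, 1].
above = scaler.transform(np.array([[10_000_000.0]]))
print(above[0, 0] > 1)               # True
```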

MaxAbsScaler: scale by max absolute value

x' = x / max(|x|)

Output range: [-1, 1]. Use when: you have sparse data (don't want to shift the zero). Common with NLP feature vectors.
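A short sketch of the sparse case, assuming SciPy is available and using a tiny hand-made matrix: because MaxAbsScaler only divides and never subtracts, zeros stay zeros and the result is still sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Tiny sparse matrix standing in for something like tf-idf counts.
X = sparse.csr_matrix(np.array([
    [0.0,  4.0, 0.0],
    [2.0,  0.0, 0.0],
    [0.0, -8.0, 5.0],
]))

scaler = MaxAbsScaler().fit(X)
X_scaled = scaler.transform(X)

print(sparse.issparse(X_scaled))   # True: output stays sparse
print(X_scaled.nnz == X.nnz)       # True: no zero was turned into a non-zero
print(scaler.max_abs_)             # per-column max |x|: 2, 8, 5
print(X_scaled.toarray())          # every column now within [-1, 1]
```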

RobustScaler: outlier-resistant

x' = (x - median) / IQR

Uses median + interquartile range instead of mean + std. Use when: you have outliers you don't want to remove but don't want to dominate scaling.

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
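To connect the formula to the class, this sketch recomputes the median and IQR with NumPy on a toy income column (illustrative values) and compares against RobustScaler with its default 25th to 75th percentile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

income = np.array([[40_000.0], [50_000.0], [60_000.0],
                   [80_000.0], [120_000.0], [5_000_000.0]])

scaler = RobustScaler().fit(income)   # defaults: center on median, scale by IQR
scaled = scaler.transform(income)

# Same computation by hand.
median = np.median(income, axis=0)
q25, q75 = np.percentile(income, [25, 75], axis=0)
manual = (income - median) / (q75 - q25)

print(np.allclose(manual, scaled))    # True
print(scaler.center_, scaler.scale_)  # median 70000, IQR 57500
print(scaled.ravel().round(2))        # typical incomes stay spread out; the outlier is far away
```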

Normalizer: per-ROW (not per-column)

Scales each sample to unit norm. Used in NLP / text similarity, not typical tabular ML.

from sklearn.preprocessing import Normalizer
Normalizer(norm="l2")        # each row's L2 norm = 1
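A quick sketch of what "per-row" means, on a made-up matrix: every row ends up with L2 norm 1, and the dot product of two normalized rows equals the cosine similarity of the original rows.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [10.0, 10.0],
])

X_norm = Normalizer(norm="l2").fit_transform(X)
row_norms = np.linalg.norm(X_norm, axis=1)

print(X_norm[0])    # [0.6 0.8]
print(row_norms)    # all 1.0

# Cosine similarity of the original rows 0 and 2, two ways.
cos_manual = X[0] @ X[2] / (np.linalg.norm(X[0]) * np.linalg.norm(X[2]))
cos_from_norm = X_norm[0] @ X_norm[2]
print(np.isclose(cos_manual, cos_from_norm))   # True
```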

4. Architecture / Flow

flowchart TD
    A[X_train] --> S[scaler.fit_transform X_train]
    S --> AX[scaled X_train]
    AX --> M[Model.fit]
    B[X_test] --> T[scaler.transform X_test]
    T --> BX[scaled X_test]
    BX --> P[Model.predict]
    style S fill:#e8f5e9
    style T fill:#fff4e5

Inside a Pipeline this happens automatically with no leakage.

5. Code: minimal working example

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import pandas as pd

X = pd.DataFrame({
    "age":     [25, 30, 35, 42, 50, 80],
    "income":  [40_000, 50_000, 60_000, 80_000, 120_000, 5_000_000],   # outlier!
})

for name, scaler in [("Std", StandardScaler()),
                     ("MinMax", MinMaxScaler()),
                     ("Robust", RobustScaler())]:
    scaled = scaler.fit_transform(X)
    print(name)
    print(pd.DataFrame(scaled, columns=X.columns).round(2))
    print()

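You'll see MinMaxScaler squash every income except the outlier into [0.00, 0.02] because of the 5M value. StandardScaler suffers too: the outlier inflates the mean and std, so the typical incomes all land within a few hundredths of each other. RobustScaler keeps the typical incomes spread out (roughly -0.5 to 0.9) and pushes only the outlier far away.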

6. Code: real-world pattern with Pipeline

The safest pattern is to put the scaler inside a Pipeline rather than fitting it by hand:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression

num_features_normal = ["age", "tenure_months"]
num_features_skewed = ["income", "transaction_amount"]   # have outliers

preprocess = ColumnTransformer([
    ("std",    StandardScaler(), num_features_normal),
    ("robust", RobustScaler(),   num_features_skewed),
], remainder="passthrough")

pipe = Pipeline([
    ("prep",  preprocess),
    ("clf",   LogisticRegression(max_iter=500)),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

This is leak-proof: when cross_val_score(pipe, X, y, cv=5) runs, the scaler refits on each fold's train, never seeing the fold's validation.
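Here is a minimal runnable sketch of that cross-validation call. The dataset comes from `make_classification` and the inflated first column is an assumption added for illustration; the point is only that the whole pipeline, scaler included, is what gets passed to `cross_val_score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature on a much larger scale.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 10_000

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500)),
])

# For each of the 5 folds, the pipeline is cloned and refit,
# so the scaler's mean/std come from that fold's training part only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3))
print(scores.mean().round(3))
```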

7. Common pitfalls

❗ Fitting the scaler on X (train + test) before splitting. Test statistics leak into the scaler — inflated metrics, broken production.
❗ Scaling categorical / one-hot columns. Pointless and slightly distorts them. Scale only numeric continuous columns. Use ColumnTransformer to limit scope.
❗ Using MinMaxScaler on outlier-heavy data. A single 5M income compresses 99% of users into [0.00, 0.02]. Use RobustScaler or remove outliers first.
❗ Forgetting to save the scaler. Production needs the same scaler used at training. Save the entire Pipeline via joblib.dump(pipe, ...).
❗ Scaling tree-based models for no reason. It's a no-op (won't hurt accuracy) but adds complexity and inference cost.
❗ Scaling target y for regression and forgetting to inverse-transform predictions. Possible but tricky — usually only worth it if y is hugely skewed.
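For the "forgetting to save the scaler" pitfall, a minimal sketch of persisting the whole pipeline with joblib. The data, the file name `model.joblib`, and the temporary directory are placeholders for illustration; in a real project the path would point somewhere durable.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_train = (X_train[:, 0] > 50.0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500)),
]).fit(X_train, y_train)

# Save and reload the fitted pipeline: scaler parameters travel with the model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")   # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# Raw, unscaled inputs go straight into the restored pipeline.
X_new = np.array([[55.0, 40.0], [45.0, 60.0]])
same = np.array_equal(pipe.predict(X_new), restored.predict(X_new))
print(same)   # True
```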

8. When to use vs not use

Method	When
`StandardScaler`	Default for most numeric features.
`MinMaxScaler`	Bounded outputs needed (e.g., image pixels 0-255 → 0-1); when min/max are stable and outliers are rare.
`RobustScaler`	Heavy outliers you can't or don't want to remove.
`MaxAbsScaler`	Sparse data (NLP tf-idf, one-hot stacks) — preserves sparsity.
`Normalizer`	Per-sample (row) normalization — text similarity, cosine distance.
No scaling	Tree-based models (RandomForest, GBM, XGBoost, LightGBM), Naive Bayes.
Log / Box-Cox / Yeo-Johnson	Highly skewed features — see Pipelines.

9. Cheatsheet

from sklearn.preprocessing import (
    StandardScaler,   # mean=0, std=1
    MinMaxScaler,     # [0,1] (or custom range)
    MaxAbsScaler,     # [-1,1], preserves sparsity
    RobustScaler,     # uses median + IQR — outlier-resistant
    Normalizer,       # per-ROW, not per-column
)

# Standard pattern
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)

# In a pipeline (preferred)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# Different scalers per column
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([
    ("z",  StandardScaler(), ["age", "tenure"]),
    ("r",  RobustScaler(),   ["income"]),
], remainder="passthrough")

# Inspect learned params (each pair exists only on a fitted scaler of the named type)
scaler.mean_, scaler.scale_                 # StandardScaler
scaler.data_min_, scaler.data_max_          # MinMaxScaler
scaler.center_, scaler.scale_               # RobustScaler

# Inverse (e.g., for unscaling predictions if you scaled y)
scaler.inverse_transform(X_scaled)

10. Q&A: recall test

Q: Default scaler for linear models? A: StandardScaler. Gives mean 0, std 1, plays well with regularization and gradient descent.
Q: Should you scale data for Random Forest? A: No. Trees split on threshold per feature — scale-invariant. It won't break anything but adds zero value.
Q: Why is MinMaxScaler dangerous with outliers? A: x' = (x - min) / (max - min). One huge max makes everyone else tiny. Compresses information.
Q: Why must scaling fit on train only? A: Otherwise test statistics (mean, std, min, max) leak into the model's view of the world — inflated metrics, real-world surprise.
Q: How do you make scaling leak-proof automatically? A: Put the scaler in a Pipeline and only call .fit() on the pipeline. Each cross-validation fold refits the scaler.
Q: Difference between Normalizer and the column scalers? A: Normalizer rescales each ROW to unit norm — used in text/embedding similarity. Column scalers rescale each COLUMN independently.

Practice

What does this print?

Expected: 0.0

import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scaled = StandardScaler().fit_transform(X)
print(round(abs(scaled.mean()), 2))   # mean of standardized data is ~0; abs() avoids printing -0.0 from float error

Fix the leakage: fit the scaler on TRAIN only, then transform test (do not fit on both)

Expected after the fix: True

import numpy as np
from sklearn.preprocessing import StandardScaler
X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[10.0], [11.0], [12.0]])
scaler = StandardScaler().fit(np.vstack([X_tr, X_te]))    # bug to fix: leakage, the scaler saw test data
X_tr_s = scaler.transform(X_tr)
# Fit on train only and the scaled train mean is ~0 (prints True); with the leak it is about -0.98 (prints False)
print(abs(X_tr_s.mean()) < 1e-9)

Quiz: Quick check

What you remember

Q1. Which models REQUIRE feature scaling?

Distance-based (k-NN, SVM with RBF), gradient descent (logistic regression, neural networks)
Random Forest
Decision Tree
None

Why: Tree-based models split on individual features — scale doesn't matter. Distance and gradient-descent models compute differences/gradients across features, where a large-scale feature dominates without scaling.

Q2. What's the difference between StandardScaler and MinMaxScaler?

StandardScaler maps to mean=0, std=1; MinMaxScaler maps to [0, 1] range
No difference
StandardScaler is for classification
MinMaxScaler is deprecated

Why: StandardScaler is the default for most algorithms. MinMaxScaler is useful when you need bounded values (e.g., for neural network inputs or image data).

Q3. Why call .fit() on train data but only .transform() on test data?

To save time
To prevent data leakage — the scaler's parameters (mean, std) must be derived from training data only
Required by sklearn
Test data doesn't change

Why: If you fit on the combined train+test, your scaler "saw" test statistics. The model evaluation no longer represents how it'd perform on truly unseen data.

Common doubts

Should I scale the target variable y for regression?

Usually no for tree-based regressors. For linear regression and neural networks, scaling y can help with numerical stability and convergence. Use TransformedTargetRegressor to scale and inverse-transform cleanly.
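A minimal sketch of the TransformedTargetRegressor route, on synthetic data with a large-valued target (the data and coefficients are assumptions for illustration). The wrapper scales y before fitting and applies the inverse transform inside `predict`, so predictions come back in the original units.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with a target around one million.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1_000_000 + 50_000 * X[:, 0] - 20_000 * X[:, 1] + rng.normal(scale=100, size=100)

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),   # fit on y during model.fit, inverted in predict
)
model.fit(X, y)

pred = model.predict(X[:3])
print(pred.round(0))   # already in the original units of y
print(y[:3].round(0))
```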

What about scaling after one-hot encoding?

Generally don't scale 0/1 columns from one-hot encoding — they're already bounded. You can use ColumnTransformer to scale numeric columns and pass through the one-hot columns unchanged.
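One way to set this up, sketched on a tiny made-up DataFrame (the column names are hypothetical): encode and scale inside the same ColumnTransformer, so the scaler only ever touches the numeric columns and the one-hot output stays 0/1.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame: two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [25, 30, 35, 42],
    "income": [40_000, 50_000, 60_000, 80_000],
    "plan": ["free", "pro", "free", "team"],
})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),               # scaled
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),  # encoded, not scaled
])

out = ct.fit_transform(df)
if sparse.issparse(out):   # ColumnTransformer may return a sparse matrix
    out = out.toarray()

numeric_part = out[:, :2]  # outputs follow the transformer order: numeric first
onehot_part = out[:, 2:]
print(numeric_part.mean(axis=0).round(6))   # ~0 for both numeric columns
print(np.unique(onehot_part))               # only 0 and 1
```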

When does RobustScaler beat StandardScaler?

When you have outliers. StandardScaler uses mean and std, both of which are dragged by outliers. RobustScaler uses median and IQR — outliers barely affect them.
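The difference shows up directly in the statistics each scaler relies on. This sketch (toy incomes, with a hypothetical `iqr` helper) compares mean and std against median and IQR before and after adding a single 5M value.

```python
import numpy as np

clean = np.array([40_000.0, 50_000.0, 60_000.0, 80_000.0, 120_000.0])
with_outlier = np.append(clean, 5_000_000.0)

def iqr(a):
    """Interquartile range: 75th minus 25th percentile."""
    q25, q75 = np.percentile(a, [25, 75])
    return q75 - q25

# How much does each statistic grow when the outlier is added?
mean_ratio = with_outlier.mean() / clean.mean()
std_ratio = with_outlier.std() / clean.std()
median_ratio = np.median(with_outlier) / np.median(clean)
iqr_ratio = iqr(with_outlier) / iqr(clean)

print(f"mean   x{mean_ratio:.1f}")     # grows more than tenfold
print(f"std    x{std_ratio:.1f}")      # grows even more
print(f"median x{median_ratio:.2f}")   # moves only slightly
print(f"IQR    x{iqr_ratio:.2f}")      # less than doubles
```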