Feature Scaling¶
1. Why this matters¶
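If feature A ranges 0–1 and feature B ranges 0–1,000,000, B dominates any computation that mixes raw feature values. Distance-based methods (KNN, K-means) end up measuring distance almost entirely along B, and gradient-based optimizers (logistic regression, neural nets) face a badly conditioned loss surface and converge slowly. Scaling puts the features on comparable ranges and fixes that.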
| Algorithm | Needs scaling? |
|---|---|
| KNN, K-means | YES: distance-based |
| Logistic / Linear regression | YES: speeds up gradient-based solvers, keeps regularization even across features, makes coefficients comparable |
| SVM | YES: margins and kernels depend on feature distances |
| PCA | YES: variance-based |
| Neural networks | YES: convergence depends on it |
| Decision tree | NO |
| Random Forest / Gradient Boosting / XGBoost | NO: splits are scale-invariant |
| Naive Bayes | NO |
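To make the opening claim concrete, here is a small sketch with made-up points (the values and the per-feature ranges are assumptions chosen for illustration). On raw values the nearest neighbour of `a` is decided by feature B alone; after dividing each feature by its range, the large gap in feature A becomes visible and the nearest neighbour flips.

```python
import numpy as np

# Feature A lives in [0, 1], feature B in [0, 1_000_000] (illustrative values).
a = np.array([0.1, 500_000.0])
b = np.array([0.9, 500_100.0])   # very different in A, almost identical in B
c = np.array([0.1, 510_000.0])   # identical in A, 1% of B's range away


def euclidean(p, q):
    """Plain Euclidean distance between two points."""
    return float(np.linalg.norm(p - q))


raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Divide each feature by its (assumed known) range so both span [0, 1].
ranges = np.array([1.0, 1_000_000.0])
scaled_ab = euclidean(a / ranges, b / ranges)
scaled_ac = euclidean(a / ranges, c / ranges)

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.4f}  d(a,c)={scaled_ac:.4f}")
```

On raw values `b` looks closest to `a`; after scaling `c` does, because the 0.8 gap in feature A is no longer drowned out.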
2. Mental model¶
Scaling is a per-column transformation — for each column independently, apply the same formula to every value:
flowchart LR
A[Raw column<br/>x: 100, 200, 5000] --> S[Scaler.fit on train]
S -->|"learns mean/std (or min/max)"| P[Params]
P --> T[Transform train & test]
T --> O["Scaled column<br/>e.g. -0.73, -0.69, 1.41"]
The cardinal rule: fit on train, transform on both train and test. Never fit on test.
3. Methods¶
StandardScaler (Z-score normalization) — most common default¶
Computes z = (x - mean) / std for each column, so after the transform each training column has mean 0 and std 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # use train's mean/std
Use when: features are roughly bell-shaped and you want them centered at 0 with comparable spread. It is the default choice for linear models, SVMs, and neural nets.
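As a sanity check on what StandardScaler does, the sketch below reproduces it by hand with NumPy. The arrays are made up for illustration; the point is that the train mean and the population standard deviation (ddof=0) are reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[100.0], [200.0], [5000.0]])
X_test = np.array([[150.0], [6000.0]])

scaler = StandardScaler().fit(X_train)

# Same computation by hand: per-column mean and population std (ddof=0).
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)              # NumPy's default is also ddof=0
manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std    # train statistics reused on test

print(np.allclose(manual_train, scaler.transform(X_train)))
print(np.allclose(manual_test, scaler.transform(X_test)))
print(manual_train.ravel().round(2))
```

Note that only the training column ends up with mean 0 and std 1; the test column is shifted and scaled by the same numbers, so its own mean and std will generally differ.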
MinMaxScaler — scale to [0, 1]¶
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1)) # default
Use when: you need bounded output (image pixels, neural net inputs) and the feature has a stable, known min and max. The transform is linear, so the shape of the distribution is preserved; only its range changes.
Warning: sensitive to outliers — one big value compresses the rest into a tiny range.
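A related point worth seeing once: the [0, 1] bound only holds for values inside the range seen during fit. The sketch below (made-up numbers) fits on train and transforms a test set containing values outside the train min and max.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [25.0], [40.0]])   # 5 and 40 fall outside the train range

scaler = MinMaxScaler().fit(X_train)

# x' = (x - min) / (max - min), using the TRAIN min and max.
x_min, x_max = scaler.data_min_, scaler.data_max_
manual_test = (X_test - x_min) / (x_max - x_min)

scaled_test = scaler.transform(X_test)
print(scaled_test.ravel())   # contains a value below 0 and a value above 1
```

If strict bounds matter downstream, recent scikit-learn versions offer a `clip=True` option on MinMaxScaler; whether clipping is appropriate depends on the model.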
MaxAbsScaler — scale by max absolute value¶
Output range: [-1, 1]. Use when: you have sparse data (don't want to shift the zero). Common with NLP feature vectors.
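A minimal sketch of the sparsity point, using a tiny SciPy sparse matrix as a stand-in for text features (the matrix is made up). Because MaxAbsScaler only divides each column by its maximum absolute value, zeros stay zeros and the result stays sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Mostly-zero matrix standing in for sparse text features.
X = sparse.csr_matrix(np.array([
    [0.0, 4.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, -8.0, 5.0],
]))

scaled = MaxAbsScaler().fit_transform(X)

print(sparse.issparse(scaled))   # output is still a sparse matrix
print(scaled.nnz == X.nnz)       # no zero entry became non-zero
print(scaled.toarray())
```

A scaler that subtracts a mean would turn every zero into a non-zero value, which is why centering is avoided for sparse inputs.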
RobustScaler — outlier-resistant¶
Uses median + interquartile range instead of mean + std. Use when: you have outliers that you want to keep in the data but that should not dominate the scaling.
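The sketch below reproduces RobustScaler's default behaviour by hand, assuming the default settings (center on the median, scale by the 25th to 75th percentile range). The income values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

income = np.array([[40_000.0], [50_000.0], [60_000.0],
                   [80_000.0], [120_000.0], [5_000_000.0]])   # last value is an outlier

scaler = RobustScaler().fit(income)   # defaults: median centering, IQR scaling

median = np.median(income, axis=0)
q25, q75 = np.percentile(income, [25, 75], axis=0)
manual = (income - median) / (q75 - q25)

print(np.allclose(manual, scaler.transform(income)))
print(manual.ravel().round(2))
```

The five ordinary incomes stay within roughly one unit of zero, while the outlier lands far away instead of compressing everyone else.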
Normalizer — per-ROW (not per-column)¶
Scales each sample to unit norm. Used in NLP / text similarity, not typical tabular ML.
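To show the row-wise behaviour, here is a small sketch with made-up vectors. Rows 0 and 1 point in the same direction but differ in magnitude; after L2 normalization they become identical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([
    [3.0, 4.0],
    [30.0, 40.0],   # same direction as row 0, ten times the magnitude
    [1.0, 0.0],
])

X_unit = Normalizer(norm="l2").fit_transform(X)

print(X_unit)
print(np.linalg.norm(X_unit, axis=1))   # every row now has length 1
```

Once rows have unit length, the dot product of two rows equals their cosine similarity, which is why this shows up in text similarity work.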
4. Architecture / Flow¶
flowchart TD
A[X_train] --> S[scaler.fit_transform X_train]
S --> AX[scaled X_train]
AX --> M[Model.fit]
B[X_test] --> T[scaler.transform X_test]
T --> BX[scaled X_test]
BX --> P[Model.predict]
style S fill:#e8f5e9
style T fill:#fff4e5
Inside a Pipeline this happens automatically with no leakage.
5. Code — minimal working example¶
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import pandas as pd
X = pd.DataFrame({
"age": [25, 30, 35, 42, 50, 80],
"income": [40_000, 50_000, 60_000, 80_000, 120_000, 5_000_000], # outlier!
})
for name, scaler in [("Std", StandardScaler()),
("MinMax", MinMaxScaler()),
("Robust", RobustScaler())]:
scaled = scaler.fit_transform(X)
print(name)
print(pd.DataFrame(scaled, columns=X.columns).round(2))
print()
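You'll see MinMaxScaler squashes almost everyone into a tiny range because of the 5M outlier. StandardScaler is distorted too, since the outlier inflates both the mean and the std. RobustScaler keeps the ordinary rows spread out around 0 and leaves only the outlier far from the rest.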
6. Code — real-world pattern with Pipeline¶
The safer default: put the scaler inside a Pipeline so it is never fit on data the model should not see:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
num_features_normal = ["age", "tenure_months"]
num_features_skewed = ["income", "transaction_amount"] # have outliers
preprocess = ColumnTransformer([
("std", StandardScaler(), num_features_normal),
("robust", RobustScaler(), num_features_skewed),
], remainder="passthrough")
pipe = Pipeline([
("prep", preprocess),
("clf", LogisticRegression(max_iter=500)),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
This is leak-proof: when cross_val_score(pipe, X, y, cv=5) runs, the scaler refits on each fold's train, never seeing the fold's validation.
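A runnable sketch of that cross-validation call, using synthetic data from `make_classification` and a simplified single-scaler pipeline (both are assumptions for illustration, not the dataset or pipeline above).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500)),
])

# For each of the 5 folds, a fresh clone of the pipeline is fit on that
# fold's training part only, then scored on the held-out part.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```

Scaling `X` once up front and then cross-validating only the classifier would look similar in code but would let every fold's validation rows influence the scaler.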
7. Common pitfalls¶
- ❗ Fitting the scaler on X (train + test) before splitting. Test statistics leak into the scaler: inflated metrics, broken production.
- ❗ Scaling categorical / one-hot columns. Pointless and slightly distorts them. Scale only numeric continuous columns. Use ColumnTransformer to limit scope.
- ❗ Using MinMaxScaler on outlier-heavy data. A single 5M income compresses 99% of users into [0.00, 0.02]. Use RobustScaler or remove outliers first.
- ❗ Forgetting to save the scaler. Production needs the same scaler used at training. Save the entire Pipeline via joblib.dump(pipe, ...).
- ❗ Scaling tree-based models for no reason. It's a no-op (won't hurt accuracy) but adds complexity and inference cost.
- ❗ Scaling target y for regression and forgetting to inverse-transform predictions. Possible but tricky; usually only worth it if y is hugely skewed.
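For the "save the scaler" pitfall, one possible pattern is sketched below: persist the whole fitted pipeline with joblib and reload it at inference time. The training data and the file name are hypothetical, and a temporary directory is used so the example cleans up after itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set, for illustration only.
X_train = np.array([[25, 40_000], [30, 50_000], [42, 80_000], [50, 120_000]], dtype=float)
y_train = np.array([0, 0, 1, 1])

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaling_pipeline.joblib")   # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline carries the fitted scaler, so raw inputs can be passed directly.
X_new = np.array([[35, 60_000]], dtype=float)
same = np.array_equal(pipe.predict(X_new), restored.predict(X_new))
print(same)
```

Saving the pipeline rather than the model alone means the train-time mean and std travel with it, so nobody has to re-derive them in production.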
8. When to use vs not use¶
| Method | When |
|---|---|
| StandardScaler | Default for most numeric features. |
| MinMaxScaler | Bounded outputs needed (e.g., image pixels 0-255 → 0-1); when min/max are stable and outliers are rare. |
| RobustScaler | Heavy outliers you can't or don't want to remove. |
| MaxAbsScaler | Sparse data (NLP tf-idf, one-hot stacks); preserves sparsity. |
| Normalizer | Per-sample (row) normalization: text similarity, cosine distance. |
| No scaling | Tree-based models (RandomForest, GBM, XGBoost, LightGBM), Naive Bayes. |
| Log / Box-Cox / Yeo-Johnson | Highly skewed features — see Pipelines. |
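For the last row of the table, a rough sketch of what a skew-reducing transform does, on a synthetic right-skewed feature. The data, the `skewness` helper, and the choice of `np.log1p` and `PowerTransformer(method="yeo-johnson")` are illustrative assumptions, not a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))   # synthetic, right-skewed


def skewness(col):
    """Sample skewness: mean of the cubed z-scores."""
    z = (col - col.mean()) / col.std()
    return float((z ** 3).mean())


logged = np.log1p(x)                                              # simple log transform
yeo = PowerTransformer(method="yeo-johnson").fit_transform(x)     # learned power transform

print(f"raw skew:         {skewness(x):.2f}")
print(f"log1p skew:       {skewness(logged):.2f}")
print(f"yeo-johnson skew: {skewness(yeo):.2f}")
```

Unlike the linear scalers above, these transforms change the shape of the distribution, which is the reason to reach for them when a feature has a long tail.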
9. Cheatsheet¶
from sklearn.preprocessing import (
StandardScaler, # mean=0, std=1
MinMaxScaler, # [0,1] (or custom range)
MaxAbsScaler, # [-1,1], preserves sparsity
RobustScaler, # uses median + IQR — outlier-resistant
Normalizer, # per-ROW, not per-column
)
# Standard pattern
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
# In a pipeline (preferred)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)
# Different scalers per column
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([
("z", StandardScaler(), ["age", "tenure"]),
("r", RobustScaler(), ["income"]),
], remainder="passthrough")
# Inspect learned params
scaler.mean_, scaler.scale_ # StandardScaler
scaler.data_min_, scaler.data_max_ # MinMaxScaler
scaler.center_, scaler.scale_ # RobustScaler
# Inverse (e.g., for unscaling predictions if you scaled y)
scaler.inverse_transform(X_scaled)
10. Q&A — recall test¶
- Q: Default scaler for linear models? A: StandardScaler. Gives mean 0, std 1, plays well with regularization and gradient descent.
- Q: Should you scale data for Random Forest? A: No. Trees split on a threshold per feature, which is scale-invariant. It won't break anything but adds zero value.
- Q: Why is MinMaxScaler dangerous with outliers? A: x' = (x - min) / (max - min). One huge max makes everyone else tiny. Compresses information.
- Q: Why must scaling fit on train only? A: Otherwise test statistics (mean, std, min, max) leak into the model's view of the world: inflated metrics, real-world surprise.
- Q: How do you make scaling leak-proof automatically? A: Put the scaler in a Pipeline and only call .fit() on the pipeline. Each cross-validation fold refits the scaler.
- Q: Difference between Normalizer and the column scalers? A: Normalizer rescales each ROW to unit norm, which is used in text/embedding similarity. Column scalers rescale each COLUMN independently.
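The Random Forest answer can be checked empirically on a single decision tree. The sketch below uses synthetic data with one feature deliberately blown up by a factor of a million (an assumption for illustration), then trains the same tree on raw and on standardized inputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X[:, 0] *= 1_000_000   # put one feature on a wildly different scale
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))
agreement = float(np.mean(pred_raw == pred_scaled))
print(f"agreement between raw and scaled trees: {agreement:.3f}")
```

The two trees should agree on essentially every test point, because standardization preserves the ordering of values within each feature and a split only depends on that ordering. Floating-point rounding right at a split threshold is the only way a prediction could differ.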
Practice¶
What does this print?
Expected: 0.0
Fit the scaler on TRAIN only, then transform test (not fit on both)
Expected: True
import numpy as np
from sklearn.preprocessing import StandardScaler
X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[10.0], [11.0], [12.0]])
scaler = StandardScaler().fit(np.vstack([X_tr, X_te])) # bug: leakage — scaler saw test data
X_tr_s = scaler.transform(X_tr)
print(X_tr_s.mean() < 0) # if no leakage, train mean should be ~0; with leakage it's offset
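One way to repair the snippet above is sketched here. It assumes the intended check is that the scaled training mean is approximately zero, which is a guess at the exercise's goal rather than its official solution.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[10.0], [11.0], [12.0]])

scaler = StandardScaler().fit(X_tr)     # fit on train only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)         # reuse the train mean/std

print(np.isclose(X_tr_s.mean(), 0.0))   # train is centered at 0
print(X_te_s.mean() > 0)                # test sits well above the train mean
```

Both lines print True. The test values land far from zero, and that is expected: the scaler describes the training distribution, and the test set here simply comes from a different range.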
Quiz — Quick check¶
What you remember
Q1. Which models REQUIRE feature scaling?
- Distance-based (k-NN, SVM with RBF), gradient descent (logistic regression, neural networks)
- Random Forest
- Decision Tree
- None
Why: Tree-based models split on individual features — scale doesn't matter. Distance and gradient-descent models compute differences/gradients across features, where a large-scale feature dominates without scaling.
Q2. What's the difference between StandardScaler and MinMaxScaler?
- StandardScaler maps to mean=0, std=1; MinMaxScaler maps to [0, 1] range
- No difference
- StandardScaler is for classification
- MinMaxScaler is deprecated
Why: StandardScaler is the default for most algorithms. MinMaxScaler is useful when you need bounded values (e.g., for neural network inputs or image data).
Q3. Why call .fit() on train data but only .transform() on test data?
- To save time
- To prevent data leakage — the scaler's parameters (mean, std) must be derived from training data only
- Required by sklearn
- Test data doesn't change
Why: If you fit on the combined train+test, your scaler "saw" test statistics. The model evaluation no longer represents how it'd perform on truly unseen data.
Common doubts¶
Should I scale the target variable y for regression?
Usually no for tree-based regressors. For linear regression and neural networks, scaling y can help with numerical stability and convergence. Use TransformedTargetRegressor to scale and inverse-transform cleanly.
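A minimal sketch of the TransformedTargetRegressor pattern, with made-up data where y is an exact linear function of x on a large scale. The choice of LinearRegression and StandardScaler here is an assumption for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up data: y = 1000 * x + 50000.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 1_000.0 * X.ravel() + 50_000.0

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),   # fit on y during .fit(), inverted during .predict()
)
model.fit(X, y)

pred = model.predict(X)             # already back on the original scale of y
print(pred[:3].round(2))
```

The regressor only ever sees the standardized target, and predictions come back in the original units without a manual inverse_transform call.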
What about scaling after one-hot encoding?
Generally don't scale 0/1 columns from one-hot encoding — they're already bounded. You can use ColumnTransformer to scale numeric columns and pass through the one-hot columns unchanged.
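The passthrough approach might look like the sketch below. The DataFrame and its column names (`city_A`, `city_B`) are hypothetical and stand in for columns that were one-hot encoded earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up frame: two numeric columns plus two already one-hot-encoded columns.
df = pd.DataFrame({
    "age": [25, 30, 35, 42],
    "income": [40_000, 50_000, 60_000, 80_000],
    "city_A": [1, 0, 0, 1],
    "city_B": [0, 1, 1, 0],
})

ct = ColumnTransformer(
    [("num", StandardScaler(), ["age", "income"])],
    remainder="passthrough",   # one-hot columns are left untouched
)
out = ct.fit_transform(df)

# Output order: transformed columns first, then the passthrough columns.
print(out.round(2))
```

The first two output columns are standardized, and the last two are still exactly 0 or 1.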
When does RobustScaler beat StandardScaler?
When you have outliers. StandardScaler uses mean and std, both of which are dragged by outliers. RobustScaler uses median and IQR — outliers barely affect them.