Feature Engineering & Dimensionality Reduction¶
1. Why this matters¶
The single biggest accuracy gain in ML is usually a smart engineered feature — not a fancier algorithm. Examples:
- House prices: raw (price, sqft) → add price_per_sqft, and a linear regression can fit noticeably better.
- Web traffic: raw timestamp → add weekday, hour, is_weekend, is_holiday, and daily patterns emerge.
- Customer churn: raw (signup_date, last_login) → add days_since_login, account_age_days, which are far more predictive than the raw dates.
PCA is the opposite direction — reduce features when there are too many for the model to handle efficiently.
2. Mental model¶
Two opposite-direction operations:
flowchart LR
subgraph Construct [Construction — add signal]
A1[raw cols] --> A2[A op B = new feature]
A2 --> A3[Model sees richer features]
end
subgraph Reduce [Reduction — remove noise]
B1[many cols] --> B2[PCA / selection / projection]
B2 --> B3[Fewer cols, ~same signal]
end
You usually construct first (add features), then optionally reduce (if you've created too many).
3. Feature construction¶
Combine existing features arithmetically or via domain rules.
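import numpy as np
import pandas as pd
# Ratios: often more meaningful than raw values.
# A zero denominator becomes NaN (to impute or drop later) instead of being silently treated as 1.
df["price_per_sqft"] = df["price"] / df["sqft"].replace(0, np.nan)
df["debt_to_income"] = df["debt"] / df["income"].replace(0, np.nan)
df["click_thru_rate"] = df["clicks"] / df["impressions"].replace(0, np.nan)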
# Aggregations
df["total_spend"] = df[["q1", "q2", "q3", "q4"]].sum(axis=1)
df["max_quarter"] = df[["q1", "q2", "q3", "q4"]].max(axis=1)
# Interactions
df["age_x_income"] = df["age"] * df["income"]
df["bmi_x_smoker"] = df["bmi"] * df["is_smoker"]
# Conditional flags
df["is_premium"] = (df["plan"].isin(["gold", "platinum"])).astype(int)
df["is_active"] = (df["last_login_days_ago"] < 30).astype(int)
# Polynomial features (built-in)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X[["age", "bmi"]])
# Adds: age², bmi², age*bmi alongside the originals
The hardest part: thinking of which combinations matter. Domain knowledge wins here.
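To make the payoff of a well-chosen combination concrete, here is a minimal sketch on synthetic data. It assumes a target that is driven by the product of two columns; the data, the seed, and the helper name `r_squared` are illustrative choices, not part of the original material. It fits ordinary least squares with NumPy twice, once on the raw columns and once with the interaction added.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 500
age = rng.normal(size=n)      # synthetic, standardized stand-ins
income = rng.normal(size=n)
y = age * income + 0.1 * rng.normal(size=n)  # target driven by the interaction


def r_squared(features, target):
    """Fit ordinary least squares with an intercept and return in-sample R^2."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return 1.0 - residuals.var() / target.var()


r2_raw = r_squared(np.column_stack([age, income]), y)
r2_engineered = r_squared(np.column_stack([age, income, age * income]), y)

print(f"R^2 with raw columns only:  {r2_raw:.3f}")
print(f"R^2 with interaction added: {r2_engineered:.3f}")
```

On this deliberately favourable setup the raw columns explain almost nothing and the interaction explains almost everything. Real data is rarely this clean, so treat it as an illustration of the mechanism rather than a typical gain.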
4. Feature splitting¶
Decompose a complex feature into simpler ones.
Strings:
# Full name → first / last
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)
# Email → domain
df["email_domain"] = df["email"].str.split("@").str[1]
# Phone → country code
df["country_code"] = df["phone"].str[:3]
# Address parsing — usually use a dedicated parser like libpostal for production
Dates / datetimes — almost always splittable into multiple signals:
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.weekday # 0=Mon
df["weekofyear"]= df["timestamp"].dt.isocalendar().week
df["quarter"] = df["timestamp"].dt.quarter
df["is_weekend"]= (df["weekday"] >= 5).astype(int)
df["is_month_start"] = df["timestamp"].dt.is_month_start.astype(int)
# Cyclical encoding — hour 23 and hour 0 are adjacent, not far apart
import numpy as np
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
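The effect of the cyclical encoding above can be checked with a few lines of standard-library Python. This is a small sketch; the helper names `encode_hour` and `distance` are made up for illustration. It compares how far apart two hours are on the raw scale versus on the sin/cos circle.

```python
import math


def encode_hour(hour):
    """Map an hour in [0, 24) to a point (sin, cos) on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))


def distance(p, q):
    """Euclidean distance between two encoded points."""
    return math.dist(p, q)


raw_gap = abs(23 - 0)                                      # gap on the raw scale
near_midnight = distance(encode_hour(23), encode_hour(0))  # one hour apart
one_hour = distance(encode_hour(0), encode_hour(1))        # also one hour apart
opposite = distance(encode_hour(0), encode_hour(12))       # half a day apart

print(f"raw |23 - 0|      = {raw_gap}")
print(f"encoded 23h vs 0h = {near_midnight:.3f}")
print(f"encoded 0h vs 1h  = {one_hour:.3f}")
print(f"encoded 0h vs 12h = {opposite:.3f}")
```

After encoding, 23h to 0h is exactly as far as 0h to 1h, and the largest possible distance is between hours half a day apart, which is the behaviour a model needs to learn daily patterns.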
Numeric — binning (turn continuous into categorical):
from sklearn.preprocessing import KBinsDiscretizer
# Equal-frequency bins (each bin has ~same count)
kbd = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
df["age_bucket"] = kbd.fit_transform(df[["age"]]).astype(int)
# Custom bins
df["age_group"] = pd.cut(df["age"],
bins=[0, 18, 30, 50, 65, 120],
labels=["minor", "young", "mid", "senior", "elderly"])
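The difference between equal-frequency and equal-width binning is easiest to see on skewed data. The sketch below uses only NumPy and a synthetic log-normal column (an assumption made for illustration); it is not how KBinsDiscretizer is implemented, just the same idea worked by hand.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic right-skewed "income" values; illustrative only.
income = rng.lognormal(mean=10, sigma=1, size=1000)

n_bins = 5

# Equal-width: edges evenly spaced between min and max.
width_edges = np.linspace(income.min(), income.max(), n_bins + 1)
# Equal-frequency: edges at the 20th, 40th, 60th and 80th percentiles.
quantile_edges = np.quantile(income, np.linspace(0, 1, n_bins + 1))

# np.digitize with only the interior edges assigns bin ids 0..n_bins-1.
width_ids = np.digitize(income, width_edges[1:-1])
quantile_ids = np.digitize(income, quantile_edges[1:-1])

width_counts = np.bincount(width_ids, minlength=n_bins)
quantile_counts = np.bincount(quantile_ids, minlength=n_bins)

print("equal-width counts:    ", width_counts)
print("equal-frequency counts:", quantile_counts)
```

With a long right tail, equal-width bins put most rows into the first bucket, while quantile bins keep the buckets balanced. That is usually the reason to prefer strategy="quantile" for skewed columns.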
5. Dimensionality reduction — PCA¶
Principal Component Analysis transforms features into orthogonal axes (principal components) ordered by variance. Keep the top-K components → smaller dataset that retains most signal.
The math intuition: rotate the data so the new first axis points along the direction of maximum variance, the second along the next, etc. Then drop the last few axes, since low variance often (though not always) means mostly noise.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# 1. ALWAYS scale before PCA — variance is scale-dependent
X_scaled = StandardScaler().fit_transform(X)
# 2. Fit PCA
pca = PCA(n_components=0.95) # keep enough components for 95% variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced {X.shape[1]} → {X_reduced.shape[1]} components")
print("Explained variance per component:", pca.explained_variance_ratio_)
print("Cumulative:", pca.explained_variance_ratio_.cumsum())
# 3. Optionally inspect — which original features matter for each PC?
import pandas as pd
loadings = pd.DataFrame(
pca.components_,
columns=X.columns,
index=[f"PC{i+1}" for i in range(pca.n_components_)],
)
print(loadings.iloc[:3].T) # rows = original features, columns = first 3 components
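The "rotate onto the axes of maximum variance" intuition can also be sketched without scikit-learn. The version below is a hand-rolled illustration using an eigen-decomposition of the covariance matrix on synthetic two-column data; it is not scikit-learn's implementation, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two synthetic, strongly correlated columns (illustrative data only).
x = rng.normal(size=300)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=300)])

# Center and scale each column, mirroring what StandardScaler does.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decompose the covariance matrix; the eigenvectors are the new axes.
cov = np.cov(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending order; flip so the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Rotate the data onto the principal axes.
X_rotated = X_scaled @ eigenvectors
explained_ratio = eigenvalues / eigenvalues.sum()

print("explained variance ratio:", explained_ratio.round(3))
print("variance along new axes: ", X_rotated.var(axis=0, ddof=1).round(3))
```

Because the two columns are nearly redundant, almost all of the variance lands on the first rotated axis, so dropping the second axis loses very little.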
When to use PCA:
- Many correlated features (multicollinearity).
- Speed up training on wide datasets.
- Visualize high-dim data (n_components=2 or 3).
When NOT to use PCA:
- Trees / boosting models: they handle wide feature sets natively, and PCA destroys interpretability.
- When you need explainable models: PCs are linear combinations, hard to explain.
6. Feature selection¶
Different from PCA: keeps a subset of original features, doesn't transform them.
from sklearn.feature_selection import (
SelectKBest, f_classif, mutual_info_classif, # univariate
RFE, # recursive feature elimination
SelectFromModel, # use model importances
)
from sklearn.ensemble import RandomForestClassifier
# 1. Univariate — keep top-K by statistical test
selector = SelectKBest(score_func=f_classif, k=20)
X_sel = selector.fit_transform(X, y)
print("Kept features:", X.columns[selector.get_support()].tolist())
# 2. Recursive feature elimination — train, drop worst feature, repeat
rfe = RFE(estimator=RandomForestClassifier(n_estimators=50),
n_features_to_select=15, step=2)
rfe.fit(X, y)
# 3. From model — use model's built-in feature importance
sfm = SelectFromModel(
    estimator=RandomForestClassifier(n_estimators=100).fit(X, y),
    threshold="median",
    prefit=True,  # the estimator is already fitted, so transform() can be called without fit()
)
X_sel = sfm.transform(X)
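To show what "keep a subset of the original columns" means in practice, here is a simplified stand-in for univariate selection written with NumPy only. It scores each column by absolute Pearson correlation with a synthetic regression target. This is an assumption made for brevity and is not what SelectKBest with f_classif computes.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples, n_features = 400, 10
X = rng.normal(size=(n_samples, n_features))
# Only columns 2 and 5 influence the synthetic target.
y = 3.0 * X[:, 2] - 2.0 * X[:, 5] + 0.5 * rng.normal(size=n_samples)

# Score each column on its own by absolute Pearson correlation with y.
scores = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
)

k = 2
top_k = np.sort(np.argsort(scores)[::-1][:k])  # indices of the k best columns
X_selected = X[:, top_k]                        # original columns, untransformed

print("scores:", scores.round(2))
print("kept column indices:", top_k.tolist())
```

Unlike PCA, the kept columns are exactly the original ones, so they can still be named and explained. The trade-off is that one-column-at-a-time scoring cannot see features that only matter in combination.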
7. Putting it together — pipeline-ready¶
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
StandardScaler, FunctionTransformer, PolynomialFeatures,
)
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
def construct_ratios(X):
    X = X.copy()
    X["price_per_sqft"] = X["price"] / X["sqft"]
    X["age_x_income"] = X["age"] * X["income"]
    return X
pipe = Pipeline([
("construct", FunctionTransformer(construct_ratios, validate=False)),
("scale", StandardScaler()),
("poly", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
("pca", PCA(n_components=0.95)),
("clf", LogisticRegression(max_iter=500)),
])
pipe.fit(X_train, y_train)
8. Common pitfalls¶
- ❗ PCA without scaling. Features with bigger ranges dominate the variance computation; PCA finds the "tallest" axis, not the most informative one.
- ❗ Constructing features using the FULL dataset (incl. test). Same leakage problem as any other transformer. Put construction in a FunctionTransformer inside the pipeline if possible.
- ❗ Using PCA on tree-based models. It rarely helps and often hurts. Trees handle high-dimensional input natively, and PCA destroys their interpretability.
- ❗ PolynomialFeatures(degree=3) on 50 columns. Feature count explodes combinatorially (23,425 columns with include_bias=False) → out of memory or massive overfitting.
- ❗ Engineering 100 features then never measuring which mattered. Always check feature importances or coefficients post-hoc.
- ❗ Treating PCA components as the "real" features. They're linear combinations of all originals: hard to interpret, hard to deploy alongside the original feature schema. Use only when interpretability is a non-goal.
- ❗ Forgetting cyclical encoding for time-of-day / weekday. Raw hour=23 and hour=0 look far apart numerically but are adjacent in reality. Use sine/cosine.
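The combinatorial growth behind the PolynomialFeatures pitfall can be checked with plain arithmetic. This sketch counts the monomials of total degree 1 through d in n variables, which is the number of output columns when the bias column is excluded and interaction_only is off; the function name `n_polynomial_features` is a hypothetical helper, not a scikit-learn API.

```python
from math import comb


def n_polynomial_features(n_columns, degree):
    """Count monomials of total degree 1..degree in n_columns variables
    (the constant/bias term is excluded)."""
    return comb(n_columns + degree, degree) - 1


for n_columns, degree in [(2, 2), (20, 2), (50, 2), (50, 3)]:
    count = n_polynomial_features(n_columns, degree)
    print(f"{n_columns} columns, degree {degree}: {count} features")
```

Two columns at degree 2 give the five features listed earlier (age, bmi, age², age*bmi, bmi²). Fifty columns at degree 3 give 23,425 features, which at 8 bytes per float is roughly 187 kB per row before any model is trained.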
9. When to use vs not use¶
| Technique | When |
|---|---|
| Feature construction (ratios, interactions) | Always — biggest accuracy lever per minute of work. |
| Date/time splitting | Any datetime column. |
| Cyclical sin/cos encoding | Hours, weekdays, month-of-year. |
| PolynomialFeatures(degree=2) | Few features (< 20), linear model that needs non-linearity. |
| KBinsDiscretizer | Make linear models handle non-linear thresholds. |
| PCA | Many correlated features, linear/distance model, OK to lose interpretability. |
| SelectKBest | Quick wide → narrow feature pruning. |
| SelectFromModel(RandomForest) | Trust the tree's importances to pick top features. |
| RFE | Worth the compute, want a principled feature subset. |
| Skip all of this | Tree-based model on rich raw features (just measure importances). |
10. Cheatsheet¶
import numpy as np
import pandas as pd
# Construction
df["ratio"] = df["a"] / df["b"].replace(0, np.nan)
df["interaction"] = df["a"] * df["b"]
df["is_flag"] = (df["x"] > threshold).astype(int)
df["total"] = df[cols].sum(axis=1)
# Datetime splits
ts = pd.to_datetime(df["ts"])
df["year"] = ts.dt.year
df["month"] = ts.dt.month
df["hour"] = ts.dt.hour
df["weekday"] = ts.dt.weekday
df["is_weekend"] = (ts.dt.weekday >= 5).astype(int)
df["hour_sin"] = np.sin(2*np.pi*ts.dt.hour/24)
df["hour_cos"] = np.cos(2*np.pi*ts.dt.hour/24)
# String splits
df[["first", "last"]] = df["name"].str.split(" ", n=1, expand=True)
df["domain"] = df["email"].str.extract(r"@(.+)$", expand=False)  # expand=False returns a Series
# Binning
pd.cut(df["age"], bins=[0,18,30,50,120], labels=["minor","young","mid","old"])
from sklearn.preprocessing import KBinsDiscretizer
KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
# Polynomial
from sklearn.preprocessing import PolynomialFeatures
PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
# PCA (always after scaling)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
PCA(n_components=0.95) # keep enough for 95% variance
PCA(n_components=10) # keep top 10
# Feature selection
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
SelectKBest(score_func=mutual_info_classif, k=20)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
SelectFromModel(RandomForestClassifier(n_estimators=100), threshold="median")
11. Q&A — recall test¶
- Q: Cardinal rule of PCA? A: Scale first. PCA finds axes of maximum variance; without scaling, the largest-range feature dominates.
- Q: Why split a datetime into multiple features? A: Most models can't extract "weekday" from a raw timestamp. Splitting exposes daily, weekly, monthly, yearly patterns as separate signals.
- Q: When is cyclical encoding (sin/cos) the right call? A: Hours of day, days of week, months of year: anything where the high and low values are actually adjacent (23h ↔ 0h).
- Q: Difference between PCA and SelectKBest? A: PCA transforms features into new (linear combination) axes. SelectKBest picks a subset of the original features. PCA compresses and reduces noise; selection preserves interpretability.
- Q: Best single feature engineering move for a tabular ML project? A: Ratios and interactions informed by domain knowledge. A well-chosen ratio such as price_per_sqft can improve a linear model substantially.
- Q: Why is PCA a bad idea for Random Forest? A: RF handles high-dimensional input natively, and PCA destroys feature interpretability, so you can't explain which original feature mattered.
Practice¶
What does this print?
Expected: 2
Scale BEFORE applying PCA (PCA is sensitive to feature scale)
Expected: True
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
X = np.array([[1, 1000], [2, 2000], [3, 3000], [4, 4000]])
Xp = PCA(n_components=1).fit_transform(X) # bug: large-scale feature dominates first component
# After scaling, both features contribute meaningfully
print(Xp.shape == (4, 1))
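One possible way to complete the scaling exercise above is sketched here, assuming scikit-learn is installed. It fits PCA on the same array twice, unscaled and scaled, and prints the absolute first-component loadings so the difference is visible; the names `pca_raw` and `pca_scaled` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 1000], [2, 2000], [3, 3000], [4, 4000]], dtype=float)

# Unscaled: the second column's range is 1000x larger, so it dominates PC1.
pca_raw = PCA(n_components=1).fit(X)

# Scaled: both columns have unit variance before PCA sees them.
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=1).fit(X_scaled)
Xp = pca_scaled.transform(X_scaled)

# Absolute values, because the sign of a principal component is arbitrary.
raw_loadings = np.abs(pca_raw.components_[0])
scaled_loadings = np.abs(pca_scaled.components_[0])

print("PC1 loadings, unscaled:", raw_loadings.round(3))
print("PC1 loadings, scaled:  ", scaled_loadings.round(3))
print(Xp.shape == (4, 1))
```

The output shape is (4, 1) either way, which is why the shape check alone does not expose the bug. The loadings do: unscaled, the first component is almost entirely the second column, while after scaling both columns contribute equally.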
Quiz — Quick check¶
What you remember
Q1. Why scale features before PCA?
- PCA only works on integers
- PCA finds directions of maximum variance — without scaling, large-scale features dominate
- To make it faster
- Required for the algorithm to converge
Why: A feature in $1000s vs one in 0-1 range will appear to have ~1,000,000× the variance. PCA's first principal component will be almost entirely the large-scale feature, missing the actual structure.
Q2. What's a good rule of thumb for how many PCA components to keep?
- Always 2
- Enough components to explain ~95% of the variance (use pca.explained_variance_ratio_)
- Half the original features
- One per row
Why: The "elbow" of the cumulative explained variance plot tells you. 95% is a common threshold; 99% if you want minimal information loss; 80% for aggressive compression.
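The threshold rule in this answer amounts to a cumulative sum. As a rough sketch, the helper below (a hypothetical `components_needed`, not a scikit-learn function) returns the smallest number of leading components whose cumulative explained variance reaches a given threshold. The ratios are made-up values standing in for what pca.explained_variance_ratio_ might contain.

```python
from itertools import accumulate


def components_needed(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative explained
    variance reaches `threshold`."""
    running_totals = accumulate(explained_variance_ratio)
    for k, cumulative in enumerate(running_totals, start=1):
        if cumulative >= threshold - 1e-12:  # tolerate float rounding
            return k
    return len(explained_variance_ratio)


# Hypothetical ratios, sorted from largest to smallest as PCA reports them.
ratios = [0.55, 0.25, 0.10, 0.06, 0.03, 0.01]

for threshold in (0.80, 0.95, 0.99):
    k = components_needed(ratios, threshold)
    print(f"{threshold:.0%} variance -> {k} components")
```

For these ratios the three thresholds need 2, 4 and 5 components respectively, which shows how quickly the last few percent of variance become expensive.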
Q3. When does feature engineering matter most?
- Never — modern models handle everything
- Almost always: domain-specific features (e.g., days_since_last_purchase) often beat tuning model hyperparameters
- Only for deep learning
- Only for big data
Why: "Better features beat better models" is a common refrain. A well-chosen ratio, interaction term, or aggregation can produce gains that no hyperparameter tuning would match.
Common doubts¶
When should I use PCA vs feature selection?
PCA when you want to compress information while preserving variance — features become linear combinations of originals, losing interpretability. Feature selection when you want to keep original features but reduce count — better for interpretability, often preferred for production models.
How do I create date-related features?
Extract year, month, day_of_week, is_weekend, is_holiday, days_since_X, time_since_last_event. For cyclic features (hour, month), use sin/cos transforms so December (12) is close to January (1). Add quarter for seasonal patterns.
Should I do feature engineering before or after train/test split?
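Before the split is fine for deterministic, row-wise engineering (extracting day_of_week from a date), because the same logic applies to both sets and nothing is learned from the data. Anything that uses statistics of the data (target encoding, aggregations over time, scaling) must come after the split: fit on train only, then apply to test. Otherwise you leak test information into training.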
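The "fit on train only" rule can be sketched with a hand-written standardization in NumPy. The data and the 150/50 split are synthetic assumptions made for illustration; in a real project the same pattern is what a scikit-learn Pipeline does when it is fitted on the training set.

```python
import numpy as np

rng = np.random.default_rng(3)
values = rng.normal(loc=50.0, scale=10.0, size=200)  # synthetic feature column

# Split first, so test rows never influence the fitted statistics.
train, test = values[:150], values[150:]

# "Fit": learn mean and standard deviation from the training rows only.
train_mean = train.mean()
train_std = train.std()

# "Transform": apply the same training statistics to both splits.
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(f"train mean after scaling: {train_scaled.mean():.3f}")
print(f"test mean after scaling:  {test_scaled.mean():.3f}")
```

The training split ends up with mean 0 and standard deviation 1. The test split generally does not, and that is expected: it is scaled with numbers it had no part in producing, just as unseen production data would be.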