Feature Engineering & Dimensionality Reduction

1. Why this matters

The single biggest accuracy gain in ML is usually a smart engineered feature — not a fancier algorithm. Examples:

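House prices: raw (price, sqft) → add price_per_sqft. A ratio like this can lift a linear model substantially, but only when price is an input (for example, a past sale price); if price is the prediction target, the ratio leaks the label.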
Web traffic: raw timestamp → add weekday, hour, is_weekend, is_holiday — daily patterns emerge.
Customer churn: raw (signup_date, last_login) → add days_since_login, account_age_days — far more predictive than the raw dates.

PCA is the opposite direction — reduce features when there are too many for the model to handle efficiently.

2. Mental model

Two opposite-direction operations:

flowchart LR
    subgraph Construct [Construction — add signal]
      A1[raw cols] --> A2[A op B = new feature]
      A2 --> A3[Model sees richer features]
    end
    subgraph Reduce [Reduction — remove noise]
      B1[many cols] --> B2[PCA / selection / projection]
      B2 --> B3[Fewer cols, ~same signal]
    end

You usually construct first (add features), then optionally reduce (if you've created too many).

3. Feature construction

Combine existing features arithmetically or via domain rules.

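import numpy as np
import pandas as pd

# Ratios: often more meaningful than raw values.
# A zero denominator becomes NaN (to be imputed later) instead of being
# silently replaced by 1, which would produce a misleading ratio.
df["price_per_sqft"] = df["price"] / df["sqft"].replace(0, np.nan)
df["debt_to_income"] = df["debt"] / df["income"].replace(0, np.nan)
df["click_thru_rate"] = df["clicks"] / df["impressions"].replace(0, np.nan)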

# Aggregations
df["total_spend"] = df[["q1", "q2", "q3", "q4"]].sum(axis=1)
df["max_quarter"] = df[["q1", "q2", "q3", "q4"]].max(axis=1)

# Interactions
df["age_x_income"] = df["age"] * df["income"]
df["bmi_x_smoker"] = df["bmi"] * df["is_smoker"]

# Conditional flags
df["is_premium"] = (df["plan"].isin(["gold", "platinum"])).astype(int)
df["is_active"]  = (df["last_login_days_ago"] < 30).astype(int)

# Polynomial features (built-in)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X[["age", "bmi"]])
# Adds: age², bmi², age*bmi alongside the originals

The hardest part: thinking of which combinations matter. Domain knowledge wins here.

4. Feature splitting

Decompose a complex feature into simpler ones.

Strings:

# Full name → first / last
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)

# Email → domain
df["email_domain"] = df["email"].str.split("@").str[1]

# Phone → country code (assumes a fixed-width prefix such as "+44"; real country codes vary in length)
df["country_code"] = df["phone"].str[:3]

# Address parsing — usually use a dedicated parser like libpostal for production

Dates / datetimes — almost always splittable into multiple signals:

df["timestamp"] = pd.to_datetime(df["timestamp"])

df["year"]      = df["timestamp"].dt.year
df["month"]     = df["timestamp"].dt.month
df["day"]       = df["timestamp"].dt.day
df["hour"]      = df["timestamp"].dt.hour
df["weekday"]   = df["timestamp"].dt.weekday        # 0=Mon
df["weekofyear"]= df["timestamp"].dt.isocalendar().week
df["quarter"]   = df["timestamp"].dt.quarter
df["is_weekend"]= (df["weekday"] >= 5).astype(int)
df["is_month_start"] = df["timestamp"].dt.is_month_start.astype(int)

# Cyclical encoding — hour 23 and hour 0 are adjacent, not far apart
import numpy as np
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
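A quick way to see what the sine/cosine pair buys you is to measure distances between encoded hours. The helper names below (encode_hour, circle_distance) are illustrative, not part of any library; this is a minimal sketch using only NumPy.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_adjacent = circle_distance(23, 0)   # one hour apart, across midnight
d_neighbor = circle_distance(10, 11)  # one hour apart, mid-morning
d_opposite = circle_distance(0, 12)   # half a day apart

print(round(d_adjacent, 3), round(d_neighbor, 3), round(d_opposite, 3))
```

In the encoded space, 23h to 0h is exactly as close as 10h to 11h, while hours half a day apart sit at the maximum distance. The raw integer column would instead put 23 and 0 at opposite ends of its range.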

Numeric — binning (turn continuous into categorical):

from sklearn.preprocessing import KBinsDiscretizer

# Equal-frequency bins (each bin has ~same count)
kbd = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
df["age_bucket"] = kbd.fit_transform(df[["age"]]).ravel().astype(int)  # flatten (n, 1) output to 1-D

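# Custom bins. right=False makes each bin include its left edge,
# so age 0 is kept and age 18 falls in "young" rather than "minor".
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 18, 30, 50, 65, 120],
                         labels=["minor", "young", "mid", "senior", "elderly"],
                         right=False)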
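One detail worth checking when choosing bin edges: pd.cut closes each interval on the right by default, so a value equal to an edge lands in the lower bin and the lowest edge itself is excluded. The small comparison below uses made-up ages and shortened bins purely for illustration.

```python
import pandas as pd

ages = pd.Series([0, 17, 18, 30])
edges = [0, 18, 30, 50]
names = ["minor", "young", "mid"]

# Default right=True: intervals are (0, 18], (18, 30], (30, 50]
right_closed = pd.cut(ages, bins=edges, labels=names)

# right=False: intervals are [0, 18), [18, 30), [30, 50)
left_closed = pd.cut(ages, bins=edges, labels=names, right=False)

print(pd.DataFrame({
    "age": ages,
    "right=True": right_closed,
    "right=False": left_closed,
}))
```

With the default, age 0 becomes NaN and age 18 is labelled "minor"; with right=False, age 0 is kept and age 18 moves to "young". Which convention is correct depends on the domain rule you are encoding.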

5. Dimensionality reduction — PCA

Principal Component Analysis transforms features into orthogonal axes (principal components) ordered by variance. Keep the top-K components → smaller dataset that retains most of the variance.

The math intuition: rotate the data so the new first axis points along the direction of maximum variance, the second along the direction of next-highest variance orthogonal to the first, and so on. Drop the last few axes (low variance is often, though not always, noise; PCA never looks at the target, so a low-variance direction can still be predictive).

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. ALWAYS scale before PCA — variance is scale-dependent
X_scaled = StandardScaler().fit_transform(X)

# 2. Fit PCA
pca = PCA(n_components=0.95)        # keep enough components for 95% variance
X_reduced = pca.fit_transform(X_scaled)

print(f"Reduced {X.shape[1]} → {X_reduced.shape[1]} components")
print("Explained variance per component:", pca.explained_variance_ratio_)
print("Cumulative:", pca.explained_variance_ratio_.cumsum())

# 3. Optionally inspect — which original features matter for each PC?
import pandas as pd
loadings = pd.DataFrame(
    pca.components_,
    columns=X.columns,
    index=[f"PC{i+1}" for i in range(pca.n_components_)],
)
print(loadings.iloc[:3].T)   # top-3 components × all features
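The "rotate, then drop axes" intuition can be reproduced by hand with an eigendecomposition of the covariance matrix. The following is a minimal sketch on synthetic data (the mixing matrix and variable names are made up for illustration); it is one way to compute the same result, not a description of how scikit-learn implements PCA internally.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with three columns of very different variance (illustrative only).
rng = np.random.default_rng(0)
mix = np.array([[3.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.0, 0.1]])
X_toy = rng.normal(size=(200, 3)) @ mix

# 1. Center, then compute the covariance matrix of the features.
X_centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigenvectors of the covariance are the new axes; eigenvalues are the
#    variance along each axis. eigh returns ascending order, so reverse it.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Keep the top-2 axes and project the centered data onto them.
manual_ratio = eigvals / eigvals.sum()
X_manual = X_centered @ eigvecs[:, :2]

# Compare with scikit-learn (component signs may differ, so compare |values|).
pca_toy = PCA(n_components=2).fit(X_toy)
print(np.allclose(manual_ratio[:2], pca_toy.explained_variance_ratio_))
print(np.allclose(np.abs(X_manual), np.abs(pca_toy.transform(X_toy))))
```

The sign of each component is arbitrary, which is why the comparison uses absolute values.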

When to use PCA:

- Many correlated features (multicollinearity).
- Speed up training on wide datasets.
- Visualize high-dim data (n_components=2 or 3).

When NOT to use PCA:

- Trees / boosting models: they handle wide feature sets natively, and PCA destroys interpretability.
- When you need explainable models: PCs are linear combinations, hard to explain.

6. Feature selection

Different from PCA: keeps a subset of original features, doesn't transform them.

from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,    # univariate
    RFE,                                              # recursive feature elimination
    SelectFromModel,                                  # use model importances
)
from sklearn.ensemble import RandomForestClassifier

# 1. Univariate — keep top-K by statistical test
selector = SelectKBest(score_func=f_classif, k=20)
X_sel = selector.fit_transform(X, y)
print("Kept features:", X.columns[selector.get_support()].tolist())

# 2. Recursive feature elimination — train, drop worst feature, repeat
rfe = RFE(estimator=RandomForestClassifier(n_estimators=50),
          n_features_to_select=15, step=2)
rfe.fit(X, y)

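# 3. From model: use the model's built-in feature importances.
#    prefit=True tells SelectFromModel the estimator is already fitted,
#    so transform() can be called without calling sfm.fit() first.
sfm = SelectFromModel(
    estimator=RandomForestClassifier(n_estimators=100).fit(X, y),
    threshold="median",
    prefit=True,
)
X_sel = sfm.transform(X)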
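Because a selector learns from y, it is safest to evaluate it inside a pipeline so that it is re-fit on each training fold. The sketch below assumes synthetic data from make_classification and a logistic regression as the downstream model; both are stand-ins chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 30 columns, only 5 of which carry signal (illustrative).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=30, n_informative=5,
    n_redundant=0, random_state=0,
)

select_then_fit = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The selector is re-fit on the training part of every fold,
# so held-out rows never influence which columns are kept.
cv_scores = cross_val_score(select_then_fit, X_syn, y_syn, cv=5)
print("Mean CV accuracy:", cv_scores.mean())

# Fit once on all rows only to inspect which columns were chosen.
select_then_fit.fit(X_syn, y_syn)
kept_mask = select_then_fit.named_steps["select"].get_support()
print("Kept column indices:", kept_mask.nonzero()[0].tolist())
```

Selecting features on the full dataset first and cross-validating afterwards would let test rows influence the selection and inflate the score.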

7. Putting it together — pipeline-ready

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    StandardScaler, FunctionTransformer, PolynomialFeatures,
)
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def construct_ratios(X):
    X = X.copy()
    X["price_per_sqft"] = X["price"] / X["sqft"]
    X["age_x_income"]   = X["age"] * X["income"]
    return X

pipe = Pipeline([
    ("construct", FunctionTransformer(construct_ratios, validate=False)),
    ("scale",     StandardScaler()),
    ("poly",      PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("pca",       PCA(n_components=0.95)),
    ("clf",       LogisticRegression(max_iter=500)),
])
pipe.fit(X_train, y_train)

8. Common pitfalls

❗ PCA without scaling. Features with bigger ranges dominate the variance computation; PCA finds the "tallest" axis, not the most informative one.
❗ Constructing features using the FULL dataset (incl. test). Same leakage problem as any other transformer. Put construction in a FunctionTransformer inside the pipeline if possible.
❗ Using PCA on tree-based models. It rarely helps and often hurts. Trees handle high-dimensional input natively, and PCA destroys their interpretability.
❗ PolynomialFeatures(degree=3) on 50 columns. Feature count grows combinatorially (50 columns become 23,425 at degree 3) → out of memory or model overfits massively.
❗ Engineering 100 features then never measuring which mattered. Always check feature importances or coefficients post-hoc.
❗ Treating PCA components as the "real" features. They're linear combinations of all originals — hard to interpret, hard to deploy alongside the original feature schema. Use only when interpretability is a non-goal.
❗ Forgetting cyclical encoding for time-of-day / weekday. Raw hour=23 and hour=0 look far apart numerically but are adjacent in reality. Use sine/cosine.
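For the PolynomialFeatures pitfall above, the output width can be computed before anything is allocated. The helper below (poly_feature_count is an illustrative name, not a library function) uses the standard count of monomials up to a given degree and cross-checks one case against scikit-learn.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def poly_feature_count(n_features, degree):
    """Number of polynomial terms up to `degree`, excluding the bias column."""
    return comb(n_features + degree, degree) - 1

for n_cols, deg in [(5, 2), (50, 2), (50, 3)]:
    print(n_cols, "columns, degree", deg, "->", poly_feature_count(n_cols, deg))

# Cross-check against scikit-learn on a one-row dummy input;
# fit() only needs the column count to determine the output width.
pf = PolynomialFeatures(degree=3, include_bias=False).fit(np.zeros((1, 50)))
sk_count = pf.n_output_features_
print("scikit-learn reports:", sk_count)
```

Running this kind of check first makes it obvious when a degree or column count needs to be cut back, or when interaction_only=True is the more realistic choice.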

9. When to use vs not use

Technique	When
Feature construction (ratios, interactions)	Always — biggest accuracy lever per minute of work.
Date/time splitting	Any datetime column.
Cyclical sin/cos encoding	Hours, weekdays, month-of-year.
`PolynomialFeatures(degree=2)`	Few features (< 20), linear model that needs non-linearity.
`KBinsDiscretizer`	Make linear models handle non-linear thresholds.
PCA	Many correlated features, linear/distance model, OK to lose interpretability.
`SelectKBest`	Quick wide → narrow feature pruning.
`SelectFromModel(RandomForest)`	Trust the tree's importances to pick top features.
`RFE`	When you can afford the extra compute and want a principled feature subset.
Skip all of this	Tree-based model on rich raw features (just measure importances).

10. Cheatsheet

import numpy as np
import pandas as pd

# Construction
df["ratio"]       = df["a"] / df["b"].replace(0, np.nan)
df["interaction"] = df["a"] * df["b"]
df["is_flag"]     = (df["x"] > threshold).astype(int)
df["total"]       = df[cols].sum(axis=1)

# Datetime splits
ts = pd.to_datetime(df["ts"])
df["year"]    = ts.dt.year
df["month"]   = ts.dt.month
df["hour"]    = ts.dt.hour
df["weekday"] = ts.dt.weekday
df["is_weekend"] = (ts.dt.weekday >= 5).astype(int)
df["hour_sin"] = np.sin(2*np.pi*ts.dt.hour/24)
df["hour_cos"] = np.cos(2*np.pi*ts.dt.hour/24)

# String splits
df[["first", "last"]] = df["name"].str.split(" ", n=1, expand=True)
df["domain"] = df["email"].str.extract(r"@(.+)$")

# Binning
pd.cut(df["age"], bins=[0,18,30,50,120], labels=["minor","young","mid","old"])

from sklearn.preprocessing import KBinsDiscretizer
KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

# Polynomial
from sklearn.preprocessing import PolynomialFeatures
PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

# PCA (always after scaling)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
PCA(n_components=0.95)            # keep enough for 95% variance
PCA(n_components=10)              # keep top 10

# Feature selection
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
SelectKBest(score_func=mutual_info_classif, k=20)

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
SelectFromModel(RandomForestClassifier(n_estimators=100), threshold="median")

11. Q&A — recall test

Q: Cardinal rule of PCA? A: Scale first. PCA finds axes of maximum variance; without scaling, the largest-range feature dominates.
Q: Why split a datetime into multiple features? A: The model can't extract "weekday" from a raw timestamp. Splitting exposes daily, weekly, monthly, yearly patterns as separate signals.
Q: When is cyclical encoding (sin/cos) the right call? A: Hours of day, days of week, months of year — anything where the high and low values are actually adjacent (23h ↔ 0h).
Q: Difference between PCA and SelectKBest? A: PCA transforms features into new (linear combination) axes. SelectKBest picks a subset of the original features. PCA compresses correlated features into fewer axes; selection keeps the original columns and therefore their interpretability.
Q: Best single feature engineering move for a tabular ML project? A: Ratios and interactions informed by domain knowledge. A well-chosen ratio such as price_per_sqft can improve a linear model far more than swapping algorithms.
Q: Why is PCA a bad idea for Random Forest? A: RF handles high-dimensional input natively, and PCA destroys feature interpretability — you can't explain which original feature mattered.

Practice

What does this print?

Expected: 2

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
print(PCA(n_components=2).fit_transform(X).shape[1])

Fix the bug: scale BEFORE applying PCA (PCA is sensitive to feature scale)

Expected: True

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
X = np.array([[1, 1000], [2, 2000], [3, 3000], [4, 4000]])
Xp = PCA(n_components=1).fit_transform(X)      # bug: large-scale feature dominates first component
# After scaling, both features contribute meaningfully
print(Xp.shape == (4, 1))

Quiz — Quick check

What you remember

Q1. Why scale features before PCA?

PCA only works on integers
PCA finds directions of maximum variance — without scaling, large-scale features dominate
To make it faster
Required for the algorithm to converge

Why: A feature in $1000s vs one in 0-1 range will appear to have ~1,000,000× the variance. PCA's first principal component will be almost entirely the large-scale feature, missing the actual structure.
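The effect is easy to observe directly. The sketch below builds two independent synthetic features on very different scales (the distributions and seed are arbitrary choices for illustration) and compares the first component with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two independent features: one in the thousands, one in the 0-1 range.
rng = np.random.default_rng(42)
big = rng.normal(loc=5000, scale=1000, size=500)
small = rng.uniform(0, 1, size=500)
X_mixed = np.column_stack([big, small])

# Unscaled: the first component points almost entirely along `big`.
pc1_raw = np.abs(PCA(n_components=1).fit(X_mixed).components_[0])

# Scaled: both features have unit variance, so neither dominates by scale alone.
X_std = StandardScaler().fit_transform(X_mixed)
ratio_std = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("PC1 weights (raw):", pc1_raw.round(4))
print("Explained variance ratio (scaled):", ratio_std.round(3))
```

On the raw data the weight on the small-range feature is close to zero; after scaling, the two components share the variance roughly evenly, as expected for independent features.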

Q2. What's a good rule of thumb for how many PCA components to keep?

Always 2
Enough components to explain ~95% of the variance (use pca.explained_variance_ratio_)
Half the original features
One per row

Why: The "elbow" of the cumulative explained variance plot tells you. 95% is a common threshold; 99% if you want minimal information loss; 80% for aggressive compression.
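To make the threshold rule concrete, the cumulative curve can be computed once and the cut-off read from it. This is a minimal sketch on the iris dataset, chosen here only because it ships with scikit-learn; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris, _ = load_iris(return_X_y=True)
X_iris_std = StandardScaler().fit_transform(X_iris)

# Fit with all components, then find the first k whose cumulative ratio reaches 95%.
cumulative = PCA().fit(X_iris_std).explained_variance_ratio_.cumsum()
k_manual = int(np.argmax(cumulative >= 0.95)) + 1

# Same threshold via the float shortcut.
k_auto = PCA(n_components=0.95).fit(X_iris_std).n_components_

print("Cumulative explained variance:", cumulative.round(3))
print("k (manual):", k_manual, "| k (n_components=0.95):", k_auto)
```

Printing or plotting the cumulative values also shows how much is gained by each extra component, which is the information the "elbow" heuristic relies on.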

Q3. When does feature engineering matter most?

Never — modern models handle everything
Almost always — domain-specific features (e.g., days_since_last_purchase) often beat tuning model hyperparameters
Only for deep learning
Only for big data

Why: "Better features beat better models" is a common refrain. A well-chosen ratio, interaction term, or aggregation can produce gains that no hyperparameter tuning would match.

Common doubts

When should I use PCA vs feature selection?

PCA when you want to compress information while preserving variance — features become linear combinations of originals, losing interpretability. Feature selection when you want to keep original features but reduce count — better for interpretability, often preferred for production models.

How do I create date-related features?

Extract year, month, day_of_week, is_weekend, is_holiday, days_since_X, time_since_last_event. For cyclic features (hour, month), use sin/cos transforms so December (12) is close to January (1). Add quarter for seasonal patterns.
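The days_since_X style of feature is a subtraction against a reference date. In the sketch below, the column names, the sample dates, and the as_of snapshot date are all hypothetical; in practice the reference date should be the moment the prediction is made, so that no future information is used.

```python
import pandas as pd

# Hypothetical customer records.
df_dates = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-01", "2023-06-15"]),
    "last_login":  pd.to_datetime(["2024-01-01", "2023-12-25"]),
})

# Assumed snapshot date: the point in time at which features are computed.
as_of = pd.Timestamp("2024-01-10")

df_dates["days_since_login"] = (as_of - df_dates["last_login"]).dt.days
df_dates["account_age_days"] = (as_of - df_dates["signup_date"]).dt.days

print(df_dates)
```

Using a fixed snapshot date rather than "today" keeps the feature reproducible when the pipeline is re-run later.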

Should I do feature engineering before or after train/test split?

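Before the split, if the engineering is deterministic and row-wise (extracting day_of_week from a date): the same logic applies to both sets. After the split, fitting on train only, if the engineering uses statistics learned from the data (target encoding, group aggregations, scaling). Otherwise information from the test set leaks into training.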
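For the statistics-based case, the pattern is "fit on train, transform both". The sketch below uses StandardScaler on random data as a stand-in for any transformer that learns from the data; the arrays and seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_all = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
y_all = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

# Learn the statistics (mean, std) from the training rows only...
scaler = StandardScaler().fit(X_tr)

# ...then apply the same frozen statistics to both splits.
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("Train means after scaling:", X_tr_scaled.mean(axis=0).round(6))
print("Test means after scaling: ", X_te_scaled.mean(axis=0).round(3))
```

The test means are close to zero but not exactly zero, which is the expected sign that the test rows did not contribute to the fitted statistics.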