Encoding Categorical Features

1. Why this matters

Most tabular datasets you'll handle include categorical columns: country, plan_type, device_os, occupation. Models can't multiply against "premium", so these values have to be converted to numbers first. The encoder choice affects accuracy, model size, and inference latency, and the wrong choice can silently teach the model a false ordering.

2. Mental model

Categorical features split by two axes:

Cardinality	Ordered (small < medium < large)	Unordered (red, green, blue)
Low (< ~10)	OrdinalEncoder	OneHotEncoder
High (> 50)	OrdinalEncoder	Target / Frequency / Hash encoding

flowchart TD
    A[Categorical column] --> Q{Natural order?}
    Q -->|yes: XS, S, M, L, XL| O[OrdinalEncoder<br/>map to 0,1,2,3,4]
    Q -->|no| C{Cardinality}
    C -->|< 10-15| OH[OneHotEncoder]
    C -->|> 50| T[Target / Frequency / Hashing]

3. Method 1: OrdinalEncoder (ordered categories)

Maps categories to integers respecting an order you provide.

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

X = pd.DataFrame({"size": ["S", "M", "L", "XL", "M", "S"]})

oe = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
# pass the ORDER explicitly — sklearn doesn't know what makes sense
X["size_enc"] = oe.fit_transform(X[["size"]])
# → 0.0, 1.0, 2.0, 3.0, 1.0, 0.0  (OrdinalEncoder returns floats by default)

Critical: if you don't pass categories=, sklearn sorts the categories (alphabetically for strings), which is rarely what you want. For the sizes above it would learn L < M < S < XL.
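To make the difference concrete, here is a minimal sketch that encodes the same column twice, once with the default sorted order and once with an explicit order. The variable names (default_enc, explicit_enc) are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"size": ["S", "M", "L", "XL", "M", "S"]})

# Default: categories are sorted, so the learned order is L, M, S, XL.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(X[["size"]]).ravel()
print(default_enc.categories_[0].tolist())  # ['L', 'M', 'S', 'XL']
print(default_codes.tolist())               # [2.0, 1.0, 0.0, 3.0, 1.0, 2.0]

# Explicit: the integer codes follow the order we pass in.
explicit_enc = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
explicit_codes = explicit_enc.fit_transform(X[["size"]]).ravel()
print(explicit_codes.tolist())              # [0.0, 1.0, 2.0, 3.0, 1.0, 0.0]
```

With the default, "L" gets the smallest code and "S" sits between "M" and "XL", so any model that reads the column as a magnitude sees a scrambled scale.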

Use for:
- size: S < M < L < XL
- education: HS < BA < MA < PhD
- rating: 1-star < 2-star < ... < 5-star
- priority: low < medium < high

Don't use for:
- color, country, category, device_model (no real ordering)

4. Method 2: OneHotEncoder (unordered categories)

Creates a 0/1 column per category.

from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X_enc = ohe.fit_transform(X[["color"]])
# →
# [[0, 0, 1],    blue=0, green=0, red=1
#  [0, 1, 0],
#  [1, 0, 0],
#  [0, 0, 1]]
print(ohe.get_feature_names_out())  # ['color_blue', 'color_green', 'color_red']

Common options:

OneHotEncoder(
    handle_unknown="ignore",   # categories not seen in fit → all zeros (vs error)
    sparse_output=True,         # store as scipy sparse; saves RAM with many categories (scikit-learn >= 1.2)
    drop="first",               # drop one column to avoid the dummy-variable trap (linear models)
    # drop="if_binary",         # alternative: only drop for binary features (pass drop= once, not twice)
    min_frequency=10,           # group categories seen fewer than 10 times into 'infrequent_sklearn'
    max_categories=20,          # cap output columns per feature, counting the infrequent bucket
)

Use for: color, country, payment_method, gender, device_type (low-cardinality, unordered).

Don't use when: the column has more than ~30 unique values. The column count explodes, tree models have to split on many sparse indicator columns, and training slows down.

5. Method 3: Pandas shortcut for OHE

df_enc = pd.get_dummies(df, columns=["color", "country"], drop_first=False)
# Returns a DataFrame; ideal for EDA. Less ideal for production
# because it doesn't remember categories across fit/transform.

For production, use OneHotEncoder inside a Pipeline — it remembers categories.
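The following sketch shows the failure mode on a toy train/test split: get_dummies derives its columns from whatever data it is given, while a fitted OneHotEncoder keeps the training columns. The frames and values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "purple"]})  # "purple" was never seen in training

# get_dummies: the column set depends on the frame it is called on
train_cols = pd.get_dummies(train, columns=["color"]).columns.tolist()
test_cols = pd.get_dummies(test, columns=["color"]).columns.tolist()
print(train_cols)  # ['color_blue', 'color_green', 'color_red']
print(test_cols)   # ['color_purple', 'color_red']

# OneHotEncoder: columns are fixed at fit time
ohe = OneHotEncoder(handle_unknown="ignore").fit(train[["color"]])
test_enc = ohe.transform(test[["color"]]).toarray()
print(test_enc.shape)  # (2, 3), same three columns as training
```

A model trained on three columns cannot consume the two-column test frame, which is why the fitted encoder is the safer choice once data leaves the notebook.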

6. Method 4: Target / Mean Encoding (high cardinality)

For columns with hundreds of categories (zip_code, user_id, product_sku), one-hot is impractical. Replace each category with the mean of the target for that category.

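# DIY (with care: target leakage risk)
# y_train is the target Series aligned with X_train; keep y out of the feature frame
target_map = y_train.groupby(X_train["zip_code"]).mean()
global_mean = y_train.mean()
X_train["zip_enc"] = X_train["zip_code"].map(target_map)  # in-sample means leak y; prefer out-of-fold values
X_test["zip_enc"]  = X_test["zip_code"].map(target_map).fillna(global_mean)  # unseen zip → global mean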

Or use the well-tested category_encoders package:

# pip install category_encoders
from category_encoders import TargetEncoder

te = TargetEncoder(cols=["zip_code"], smoothing=10)
X_train_enc = te.fit_transform(X_train, y_train)
X_test_enc  = te.transform(X_test)

Critical: target encoding must be fit on train only, ideally within cross-validation folds. Otherwise it leaks the target.
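One way to encode "within cross-validation folds" is sketched below: each training row is encoded with category means computed on the other folds, so a row never sees its own target. The helper name oof_target_encode and the toy data are hypothetical, and the sketch applies no smoothing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(cats: pd.Series, y: pd.Series, n_splits: int = 5, seed: int = 0) -> pd.Series:
    """Encode each row with the target mean of its category, computed on the other folds."""
    encoded = pd.Series(np.nan, index=cats.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cats):
        y_fit, c_fit = y.iloc[fit_idx], cats.iloc[fit_idx]
        fold_means = y_fit.groupby(c_fit).mean()
        # categories missing from the fitting folds fall back to those folds' overall mean
        fold_values = cats.iloc[enc_idx].map(fold_means).fillna(y_fit.mean())
        encoded.iloc[enc_idx] = fold_values.to_numpy()
    return encoded


# Toy data: "rare" appears once with y = 1; every other row has y = 0.
zip_code = pd.Series(["a"] * 9 + ["rare"])
y = pd.Series([0] * 9 + [1])
zip_enc = oof_target_encode(zip_code, y)
print(zip_enc.tolist())  # the "rare" row gets 0.0, not its own target of 1
```

A naive in-sample mean would encode the "rare" row as 1.0, which is exactly its label. For the test set, the usual approach is to fit the mapping once on the full training set and apply it unchanged.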

Use for: high-cardinality categoricals where one-hot would create thousands of columns.

7. Method 5: Frequency / Count Encoding

Replace each category with its frequency in the training data.

freq = X_train["category"].value_counts(normalize=True).to_dict()
X_train["category_freq"] = X_train["category"].map(freq)
X_test["category_freq"]  = X_test["category"].map(freq).fillna(0)

Use for: Tree-based models with high-cardinality categoricals. No leakage risk (no target involved).
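A small self-contained run of the same idea, on invented data, shows one caveat worth knowing: categories that occur equally often receive the same encoded value and become indistinguishable to the model.

```python
import pandas as pd

train = pd.DataFrame({"category": ["a", "a", "b", "b", "c"]})
test = pd.DataFrame({"category": ["a", "c", "z"]})  # "z" is unseen

# relative frequency of each category in the training data
freq = train["category"].value_counts(normalize=True)

train_freq = train["category"].map(freq)
test_freq = test["category"].map(freq).fillna(0.0)  # unseen category → 0

print(test_freq.tolist())       # [0.4, 0.2, 0.0]
print(freq["a"] == freq["b"])   # True: "a" and "b" collide
```

If such collisions matter for the task, frequency encoding is often combined with another encoding of the same column rather than used alone.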

8. Putting it together — ColumnTransformer

The right pattern for any non-trivial dataset:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

ordered_cats   = ["education"]
unordered_cats = ["country", "plan_type", "device_os"]
numeric        = ["age", "tenure_months"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale",  StandardScaler()),
        ]), numeric),
        ("ord", OrdinalEncoder(
            categories=[["HS","BA","MA","PhD"]],
            handle_unknown="use_encoded_value",
            unknown_value=-1,
        ), ordered_cats),
        ("ohe", OneHotEncoder(
            handle_unknown="ignore",
            min_frequency=10,        # group rare into 'infrequent'
        ), unordered_cats),
    ],
    remainder="drop",
)

pipe = Pipeline([("prep", preprocess), ("clf", RandomForestClassifier())])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

9. Architecture / Flow

flowchart TD
    DF[DataFrame] --> CT[ColumnTransformer]
    CT --> N[numeric: impute + scale]
    CT --> O[ordered cats: OrdinalEncoder]
    CT --> H[unordered cats: OneHotEncoder]
    N --> CC[concat]
    O --> CC
    H --> CC
    CC --> M[Model]

10. Common pitfalls

❗ Using OrdinalEncoder on unordered categories. Assigning color: red=0, green=1, blue=2 teaches the model that blue > green > red.
❗ Forgetting handle_unknown="ignore" on OneHotEncoder. A new category in production → exception. With "ignore", unknown categories become all-zero (treated as "unknown").
❗ Not passing categories= to OrdinalEncoder. Default is alphabetical: ["L","M","S","XL"] → L=0, M=1, S=2, XL=3. Wrong order.
❗ One-hot encoding a high-cardinality column. OneHotEncoder(zip_code) → 40,000 columns → blown memory + worse model.
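❗ Target encoding without CV. Leaks target info; train accuracy looks great, test fails. Compute the encoding out-of-fold: fit the encoder on the training folds only (for example by keeping it inside the Pipeline you cross-validate), or use sklearn.preprocessing.TargetEncoder (scikit-learn >= 1.3), whose fit_transform cross-fits internally.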
❗ pd.get_dummies() for production. It doesn't remember categories across train/test — new test data with a new category gives a different column set. Use OneHotEncoder inside a Pipeline.
❗ Using drop="first" where it isn't needed. Dropping a column only helps linear models with an intercept, which suffer from the dummy-variable trap; trees are fine without dropping.

11. When to use vs not use

Encoder	When
`OrdinalEncoder`	Categories have a meaningful order (size, education, rating).
`OneHotEncoder`	Categories unordered, cardinality < ~30. Default for linear models.
`pd.get_dummies`	Quick EDA only. Not for production pipelines.
Target encoding	High cardinality (50+), can manage leakage with CV.
Frequency encoding	High cardinality + tree-based models. Safe (no target leakage).
Hashing encoder	Very high cardinality (>10k), streaming, OOV-safe.
Embeddings	Deep learning over categorical (e.g., entity embeddings). Beyond classical ML.
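The hashing row is the only option in the table without an example elsewhere on this page. A minimal sketch using scikit-learn's FeatureHasher follows; the "zip=..." token format and n_features=16 are arbitrary choices for illustration, and real use would pick a much larger width to limit collisions.

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of string tokens; each token is hashed into one of n_features columns.
hasher = FeatureHasher(n_features=16, input_type="string")

rows = [["zip=94103"], ["zip=10001"], ["zip=00000"]]
hashed = hasher.transform(rows)  # no fit needed: the hash function is fixed
print(hashed.shape)              # (3, 16)

dense = hashed.toarray()
print(abs(dense).sum(axis=1))    # one non-zero entry (+1 or -1) per row
```

Because nothing is learned from the data, a value that appears for the first time in production is hashed like any other, at the cost of occasional collisions between unrelated categories.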

12. Cheatsheet

# Ordinal — when there's a natural order
from sklearn.preprocessing import OrdinalEncoder
OrdinalEncoder(
    categories=[["S","M","L","XL"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

# One-hot — default for unordered
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder(
    handle_unknown="ignore",
    sparse_output=False,         # True to save RAM with many cats
    drop=None,                    # or "first" / "if_binary" for linear models
    min_frequency=10,             # group rare into 'infrequent_sklearn'
    max_categories=20,
)

# Pandas shortcut (EDA only)
pd.get_dummies(df, columns=["color"], drop_first=True)

# Apply different encoders per column
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([
    ("ord", OrdinalEncoder(...), ["size", "edu"]),
    ("ohe", OneHotEncoder(...),  ["color", "country"]),
])

# High-cardinality
from category_encoders import (TargetEncoder, CountEncoder, HashingEncoder)
TargetEncoder(cols=["zip"], smoothing=10).fit_transform(X_train, y_train)

# Inspect output column names
ohe.get_feature_names_out()
ct.get_feature_names_out()

13. Q&A — recall test

Q: When to use Ordinal vs One-Hot? A: Ordinal when categories have a meaningful order (small < medium < large). One-Hot when they don't (red, green, blue).
Q: Why pass categories= to OrdinalEncoder? A: Without it, sklearn uses alphabetical order — which is rarely the order you actually want for an ordered category.
Q: How do you handle unknown categories at inference time? A: OneHotEncoder(handle_unknown="ignore") — new categories produce all-zero rows. OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1) — new categories become -1.
Q: What's the danger with target encoding? A: Target leakage. Using y to compute features inflates training accuracy, and the model then underperforms on unseen data. Mitigate with out-of-fold target encoding: compute each row's encoding from folds that don't contain that row.
Q: Pandas get_dummies() vs sklearn OneHotEncoder() — which for production? A: sklearn OneHotEncoder — it remembers categories from training, so test data with a new category doesn't break the model. get_dummies is fine for EDA only.
Q: What's the dummy-variable trap? A: With k one-hot columns, one is redundant (perfectly predicted by the others). For linear models with intercept, drop one with drop="first". Trees are unaffected.
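The redundancy behind the dummy-variable trap can be checked numerically. In this sketch (toy data, illustrative variable names) the design matrix with an intercept plus all k one-hot columns has more columns than its rank, while the drop="first" version is full rank.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"color": ["red", "green", "blue", "red", "green", "blue"]})
intercept = np.ones((len(X), 1))

full = OneHotEncoder().fit_transform(X[["color"]]).toarray()                 # 3 columns
dropped = OneHotEncoder(drop="first").fit_transform(X[["color"]]).toarray()  # 2 columns

full_design = np.hstack([intercept, full])        # 4 columns
dropped_design = np.hstack([intercept, dropped])  # 3 columns

rank_full = np.linalg.matrix_rank(full_design)
rank_dropped = np.linalg.matrix_rank(dropped_design)
print(full_design.shape[1], rank_full)        # 4 3 -> one column is redundant
print(dropped_design.shape[1], rank_dropped)  # 3 3 -> full column rank
```

The one-hot columns always sum to 1, which is the intercept column, so an unregularized linear model has no unique solution until one of them is removed.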

Practice

What does this print?

Expected: (3, 4)

import pandas as pd
df = pd.DataFrame({"city": ["A", "B", "C"], "x": [1, 2, 3]})
print(pd.get_dummies(df, columns=["city"]).shape)

Fix the snippet so it handles unseen categories at inference (it shouldn't crash on a new city)

Expected: True

from sklearn.preprocessing import OneHotEncoder
import numpy as np
ohe = OneHotEncoder(sparse_output=False)         # bug: default raises on unknown categories at transform
ohe.fit([["Mumbai"], ["Delhi"]])
try:
    ohe.transform([["Pune"]])
    ok = True
except ValueError:
    ok = False
print(ok)

Quiz — Quick check

What you remember

Q1. What's the right encoding for an ordinal feature like "small/medium/large"?

One-hot encoding
Ordinal encoding (small=0, medium=1, large=2)
Label encoding
Hashing

Why: Ordinal features have meaningful order. One-hot loses that order. Ordinal encoding preserves it — tree models exploit the ordering; linear models capture monotonic relationships.

Q2. When should you use target encoding instead of one-hot?

Never — it's deprecated
When a categorical has very high cardinality (1000s of unique values) — one-hot would explode the feature space
Only for binary classification
For numeric features

Why: With 10,000 unique ZIP codes, one-hot creates 10,000 columns. Target encoding replaces each category with the mean of y for that category — one column, fast, useful. Be careful about leakage: encode using only training-fold targets.

Q3. Why pass handle_unknown="ignore" to OneHotEncoder?

To allow null values
So new categories at inference time become all zeros instead of raising an error
Speeds up training
Removes rare categories

Why: Production data drifts — new categories appear after deployment. Without handle_unknown="ignore", your serving pipeline crashes the first time a new value appears. Always set it.

Common doubts

When is get_dummies (Pandas) vs OneHotEncoder (sklearn) the right choice?

Use OneHotEncoder in production pipelines — it remembers categories from training and handles unknown values cleanly. Use get_dummies for quick exploration in notebooks; it has no concept of "fit" and creates different columns each time depending on the data.

How do I encode dates?

Don't one-hot them. Extract features: year, month, day_of_week, is_weekend, days_since_signup. For cyclic features (hour, month), use sin/cos transforms: sin(2π × month/12) and cos(2π × month/12) so December (12) is close to January (1).
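The sin/cos transform above can be sketched as follows. The three sample months and the helper dist are chosen only to show that December lands next to January once the month is placed on a circle.

```python
import numpy as np
import pandas as pd

months = pd.Series([1, 6, 12], index=["jan", "jun", "dec"])

# Map month numbers onto a circle: 12 months = one full turn
angle = 2 * np.pi * months / 12
cyc = pd.DataFrame({"month_sin": np.sin(angle), "month_cos": np.cos(angle)})


def dist(a: str, b: str) -> float:
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(cyc.loc[a] - cyc.loc[b]))


print(round(dist("jan", "dec"), 2))  # 0.52: neighbours on the circle
print(round(dist("jan", "jun"), 2))  # 1.93: almost opposite
```

On the raw integer scale January and December are 11 apart, the largest possible gap, which is the distortion the two-column encoding removes.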

What about high-cardinality categorical features like user IDs?

Don't directly encode them as features — too sparse, can leak. Common patterns: (1) drop them and learn from aggregated user features, (2) use entity embeddings (deep learning), (3) target-encode with smoothing, (4) use frequency encoding (replace each ID with its count).