Encoding Categorical Features
1. Why this matters
Most tabular datasets include categorical columns such as country, plan_type, device_os, or occupation. A model cannot do arithmetic on a string like "premium", so each of these columns has to be converted to numbers first. The encoder choice affects accuracy, model size, and inference latency, and the wrong choice silently teaches the model an ordering that does not exist.
2. Mental model
Categorical features split along two axes: whether the categories are ordered, and how many distinct values they have.
| | Ordered (small < medium < large) | Unordered (red, green, blue) |
|---|---|---|
| Low-cardinality (< ~10) | OrdinalEncoder | OneHotEncoder |
| High-cardinality (> 50) | OrdinalEncoder | Target / Frequency / Hash encoding |
flowchart TD
A[Categorical column] --> Q{Natural order?}
Q -->|yes: XS, S, M, L, XL| O[OrdinalEncoder<br/>map to 0,1,2,3,4]
Q -->|no| C{Cardinality}
C -->|< 10-15| OH[OneHotEncoder]
C -->|> 50| T[Target / Frequency / Hashing]
3. Method 1: OrdinalEncoder (ordered categories)
Maps categories to integers respecting an order you provide.
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
X = pd.DataFrame({"size": ["S", "M", "L", "XL", "M", "S"]})
oe = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
# pass the ORDER explicitly — sklearn doesn't know what makes sense
X["size_enc"] = oe.fit_transform(X[["size"]])
# → 0, 1, 2, 3, 1, 0
Critical: if you don't pass categories=, sklearn sorts the unique values it finds (alphabetical order for strings), which is rarely the order you want.
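To make the default ordering concrete, here is a small sketch that fits one encoder without categories= and one with it on the same toy column. The variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"size": ["S", "M", "L", "XL", "M", "S"]})

# No categories= given: the encoder sorts the unique strings it sees.
default_enc = OrdinalEncoder().fit(X[["size"]])
default_order = list(default_enc.categories_[0])
default_codes = default_enc.transform(X[["size"]]).ravel().tolist()

# Explicit order: codes follow S < M < L < XL.
explicit_enc = OrdinalEncoder(categories=[["S", "M", "L", "XL"]]).fit(X[["size"]])
explicit_codes = explicit_enc.transform(X[["size"]]).ravel().tolist()

print(default_order)   # ['L', 'M', 'S', 'XL']
print(default_codes)   # [2.0, 1.0, 0.0, 3.0, 1.0, 2.0]
print(explicit_codes)  # [0.0, 1.0, 2.0, 3.0, 1.0, 0.0]
```

With the default, L gets the smallest code and S sits between M and XL, so any model that reads the integers as magnitudes sees a scrambled size scale.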
Use for:
- size: S < M < L < XL
- education: HS < BA < MA < PhD
- rating: 1-star < 2-star < ... < 5-star
- priority: low < medium < high
Don't use for:
- color, country, category, device_model — no real ordering.
4. Method 2: OneHotEncoder (unordered categories)
Creates a 0/1 column per category.
from sklearn.preprocessing import OneHotEncoder
X = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X_enc = ohe.fit_transform(X[["color"]])
# →
# [[0, 0, 1], blue=0, green=0, red=1
# [0, 1, 0],
# [1, 0, 0],
# [0, 0, 1]]
print(ohe.get_feature_names_out()) # ['color_blue', 'color_green', 'color_red']
Common options:
OneHotEncoder(
    handle_unknown="ignore",  # categories not seen in fit → all zeros (vs error)
    sparse_output=True,       # store as scipy sparse; saves RAM with many categories
    drop="first",             # drop one column to avoid the dummy-variable trap (linear models)
    # drop="if_binary",       # alternative to drop="first": only drop for binary features
    min_frequency=10,         # group rare categories into 'infrequent_sklearn'
    max_categories=20,        # cap at top-N categories
)
Use for:
- color, country, payment_method, gender, device_type — low-cardinality unordered.
Don't use when:
- > ~30 unique values: explodes column count, hurts tree models, slows training.
5. Method 3: Pandas shortcut for OHE
df_enc = pd.get_dummies(df, columns=["color", "country"], drop_first=False)
# Returns a DataFrame; ideal for EDA. Less ideal for production
# because it doesn't remember categories across fit/transform.
For production, use OneHotEncoder inside a Pipeline — it remembers categories.
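The column-mismatch problem can be reproduced in a few lines. This sketch uses two made-up frames; the reindex step is one possible workaround for ad-hoc analysis, not a substitute for a fitted encoder.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "purple"]})

# get_dummies only looks at the frame it is given, so the columns differ.
train_cols = list(pd.get_dummies(train, columns=["color"]).columns)
test_cols = list(pd.get_dummies(test, columns=["color"]).columns)
print(train_cols)  # ['color_blue', 'color_green', 'color_red']
print(test_cols)   # ['color_purple', 'color_red']

# Workaround: force the test frame onto the training columns.
aligned = pd.get_dummies(test, columns=["color"]).reindex(
    columns=train_cols, fill_value=0
)
```

After reindexing, the "purple" row is all zeros and the unseen column is dropped, which mirrors what OneHotEncoder(handle_unknown="ignore") does automatically.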
6. Method 4: Target / Mean Encoding (high cardinality)
For columns with hundreds of categories (zip_code, user_id, product_sku), one-hot is impractical. Replace each category with the mean of the target for that category.
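# DIY (with care: target leakage risk).
# Assumes X_train still carries the target as a column named "y".
target_map = X_train.groupby("zip_code")["y"].mean().to_dict()
X_train["zip_enc"] = X_train["zip_code"].map(target_map)
# Categories unseen in training fall back to the global training mean.
X_test["zip_enc"] = X_test["zip_code"].map(target_map).fillna(X_train["y"].mean())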
Or use the well-tested category_encoders package:
# pip install category_encoders
from category_encoders import TargetEncoder
te = TargetEncoder(cols=["zip_code"], smoothing=10)
X_train_enc = te.fit_transform(X_train, y_train)
X_test_enc = te.transform(X_test)
Critical: target encoding must be fit on train only, ideally within cross-validation folds. Otherwise it leaks the target.
Use for: high-cardinality categoricals where one-hot would create thousands of columns.
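One way to keep target encoding inside cross-validation folds is to encode each training row using targets from the other folds only. The helper below, oof_target_encode, is a hypothetical name written for this sketch; it is not part of sklearn or category_encoders, and it applies no smoothing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, n_splits=5, seed=0):
    """Hypothetical helper: encode each row from the other folds' targets."""
    cat = cat.reset_index(drop=True)
    y = y.reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(cat):
        fit_y = y.iloc[fit_idx]
        # Category means computed without the rows being encoded.
        means = fit_y.groupby(cat.iloc[fit_idx]).mean()
        # Categories absent from the fitting folds fall back to their mean target.
        fold_codes = cat.iloc[enc_idx].map(means).fillna(fit_y.mean())
        encoded.iloc[enc_idx] = fold_codes.to_numpy()
    return encoded

# Toy data: "rare" appears once, and its single row has y = 1.
zip_code = pd.Series(["a"] * 9 + ["rare"])
y = pd.Series([0] * 9 + [1])

encoded = oof_target_encode(zip_code, y)
naive = zip_code.map(y.groupby(zip_code).mean())

print(naive.iloc[-1])    # 1.0 (the row's own target leaks into its feature)
print(encoded.iloc[-1])  # 0.0 (computed from the other rows only)
```

The naive encoding hands the model the answer for the "rare" row, while the out-of-fold version never sees that row's target.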
7. Method 5: Frequency / Count Encoding
Replace each category with its frequency in the training data.
freq = X_train["category"].value_counts(normalize=True).to_dict()
X_train["category_freq"] = X_train["category"].map(freq)
X_test["category_freq"] = X_test["category"].map(freq).fillna(0)
Use for: Tree-based models with high-cardinality categoricals. No leakage risk (no target involved).
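A worked example on made-up data shows both the mapping and one limitation worth knowing: categories that occur equally often receive the same code.

```python
import pandas as pd

X_train = pd.DataFrame({"category": ["a", "a", "b", "b", "c"]})
X_test = pd.DataFrame({"category": ["a", "b", "d"]})  # "d" is unseen

freq = X_train["category"].value_counts(normalize=True)
train_freq = X_train["category"].map(freq).tolist()
test_freq = X_test["category"].map(freq).fillna(0.0).tolist()

print(train_freq)  # [0.4, 0.4, 0.4, 0.4, 0.2]
print(test_freq)   # [0.4, 0.4, 0.0]
```

Here "a" and "b" both map to 0.4, so the model can no longer tell them apart through this feature alone.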
8. Putting it together: ColumnTransformer
The right pattern for any non-trivial dataset:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

ordered_cats = ["education"]
unordered_cats = ["country", "plan_type", "device_os"]
numeric = ["age", "tenure_months"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric),
        ("ord", OrdinalEncoder(
            categories=[["HS", "BA", "MA", "PhD"]],
            handle_unknown="use_encoded_value",
            unknown_value=-1,
        ), ordered_cats),
        ("ohe", OneHotEncoder(
            handle_unknown="ignore",
            min_frequency=10,  # group rare into 'infrequent'
        ), unordered_cats),
    ],
    remainder="drop",
)

pipe = Pipeline([("prep", preprocess), ("clf", RandomForestClassifier())])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
9. Architecture / Flow
flowchart TD
DF[DataFrame] --> CT[ColumnTransformer]
CT --> N[numeric: impute + scale]
CT --> O[ordered cats: OrdinalEncoder]
CT --> H[unordered cats: OneHotEncoder]
N --> CC[concat]
O --> CC
H --> CC
CC --> M[Model]
10. Common pitfalls
- ❗ Using OrdinalEncoder on unordered categories. Assigning color: red=0, green=1, blue=2 teaches the model that blue > green > red.
- ❗ Forgetting handle_unknown="ignore" on OneHotEncoder. A new category in production → exception. With "ignore", unknown categories become all-zero (treated as "unknown").
- ❗ Not passing categories= to OrdinalEncoder. Default is alphabetical: ["L","M","S","XL"] → L=0, M=1, S=2, XL=3. Wrong order.
- ❗ One-hot encoding a high-cardinality column. OneHotEncoder(zip_code) → 40,000 columns → blown memory + worse model.
- ❗ Target encoding without CV. Leaks target info; train accuracy looks great, test fails. Use category_encoders with proper CV or wrap in OutOfFoldTargetEncoder.
- ❗ pd.get_dummies() for production. It doesn't remember categories across train/test, so new test data with a new category gives a different column set. Use OneHotEncoder inside a Pipeline.
- ❗ Dropping the wrong column with drop="first". Use drop="first" only for linear models that suffer the dummy variable trap; trees are fine without dropping.
11. When to use vs not use
| Encoder | When |
|---|---|
| OrdinalEncoder | Categories have a meaningful order (size, education, rating). |
| OneHotEncoder | Categories unordered, cardinality < ~30. Default for linear models. |
| pd.get_dummies | Quick EDA only. Not for production pipelines. |
| Target encoding | High cardinality (50+), can manage leakage with CV. |
| Frequency encoding | High cardinality + tree-based models. Safe (no target leakage). |
| Hashing encoder | Very high cardinality (>10k), streaming, OOV-safe. |
| Embeddings | Deep learning over categorical (e.g., entity embeddings). Beyond classical ML. |
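The hashing row can be illustrated with a small sketch. Note the swap: this uses sklearn's FeatureHasher rather than the HashingEncoder from category_encoders, because its behaviour is simple to show. The "zip=" prefix and n_features=16 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Stateless: no fit step, so unseen values are handled like any other.
hasher = FeatureHasher(n_features=16, input_type="string")

# Each sample is a list of string tokens; here one token per row.
rows = [["zip=94107"], ["zip=10001"], ["zip=94107"]]
hashed = hasher.transform(rows).toarray()

nonzeros_per_row = np.count_nonzero(hashed, axis=1).tolist()
same_value_same_row = bool((hashed[0] == hashed[2]).all())

print(hashed.shape)         # (3, 16)
print(nonzeros_per_row)     # [1, 1, 1]
print(same_value_same_row)  # True
```

The output width is fixed up front regardless of how many distinct values appear, at the cost of possible collisions between different categories.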
12. Cheatsheet
# Ordinal: when there's a natural order
from sklearn.preprocessing import OrdinalEncoder
OrdinalEncoder(
    categories=[["S", "M", "L", "XL"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

# One-hot: default for unordered
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder(
    handle_unknown="ignore",
    sparse_output=False,  # True to save RAM with many cats
    drop=None,            # or "first" / "if_binary" for linear models
    min_frequency=10,     # group rare into 'infrequent_sklearn'
    max_categories=20,
)

# Pandas shortcut (EDA only)
pd.get_dummies(df, columns=["color"], drop_first=True)

# Apply different encoders per column
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([
    ("ord", OrdinalEncoder(...), ["size", "edu"]),
    ("ohe", OneHotEncoder(...), ["color", "country"]),
])

# High-cardinality
from category_encoders import TargetEncoder, CountEncoder, HashingEncoder
TargetEncoder(cols=["zip"], smoothing=10).fit_transform(X_train, y_train)

# Inspect output column names
ohe.get_feature_names_out()
ct.get_feature_names_out()
13. Q&A: recall test
- Q: When to use Ordinal vs One-Hot? A: Ordinal when categories have a meaningful order (small < medium < large). One-Hot when they don't (red, green, blue).
- Q: Why pass categories= to OrdinalEncoder? A: Without it, sklearn uses alphabetical order, which is rarely the order you actually want for an ordered category.
- Q: How do you handle unknown categories at inference time? A: OneHotEncoder(handle_unknown="ignore"): new categories produce all-zero rows. OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1): new categories become -1.
- Q: What's the danger with target encoding? A: Target leakage. Using y to compute features inflates training accuracy, and the model then underperforms on unseen data in production. Mitigate with out-of-fold target encoding via category_encoders or proper CV.
- Q: Pandas get_dummies() vs sklearn OneHotEncoder(): which for production? A: sklearn OneHotEncoder. It remembers categories from training, so test data with a new category doesn't break the model. get_dummies is fine for EDA only.
- Q: What's the dummy-variable trap? A: With k one-hot columns, one is redundant (perfectly predicted by the others). For linear models with intercept, drop one with drop="first". Trees are unaffected.
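The OrdinalEncoder half of the unknown-category answer can be sketched as follows. "Bootcamp" is a made-up value that is absent from the declared category list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(
    categories=[["HS", "BA", "MA", "PhD"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
enc.fit(pd.DataFrame({"education": ["HS", "BA", "MA", "PhD"]}))

# "MA" keeps its position in the declared order; the unseen value maps to -1.
new = pd.DataFrame({"education": ["MA", "Bootcamp"]})
codes = enc.transform(new).ravel().tolist()
print(codes)  # [2.0, -1.0]
```

Because -1 sits below every real code, a linear model will treat unknowns as "lower than HS", which may or may not be acceptable; tree models can isolate the value with a single split.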
Quiz: Quick check
What you remember
Q1. What's the right encoding for an ordinal feature like "small/medium/large"?
- One-hot encoding
- Ordinal encoding (small=0, medium=1, large=2)
- Label encoding
- Hashing
Why: Ordinal features have meaningful order. One-hot loses that order. Ordinal encoding preserves it — tree models exploit the ordering; linear models capture monotonic relationships.
Q2. When should you use target encoding instead of one-hot?
- Never — it's deprecated
- When a categorical has very high cardinality (1000s of unique values) — one-hot would explode the feature space
- Only for binary classification
- For numeric features
Why: With 10,000 unique ZIP codes, one-hot creates 10,000 columns. Target encoding replaces each category with the mean of y for that category: one column, fast, useful. Be careful about leakage: encode using only training-fold targets.
Q3. Why pass handle_unknown="ignore" to OneHotEncoder?
- To allow null values
- So new categories at inference time become all zeros instead of raising an error
- Speeds up training
- Removes rare categories
Why: Production data drifts, and new categories appear after deployment. Without handle_unknown="ignore", your serving pipeline crashes the first time a new value appears. Always set it.
Common doubts
When is get_dummies (Pandas) vs OneHotEncoder (sklearn) the right choice?
Use OneHotEncoder in production pipelines — it remembers categories from training and handles unknown values cleanly. Use get_dummies for quick exploration in notebooks; it has no concept of "fit" and creates different columns each time depending on the data.
How do I encode dates?
Don't one-hot them. Extract features: year, month, day_of_week, is_weekend, days_since_signup. For cyclic features (hour, month), use sin/cos transforms: sin(2π × month/12) and cos(2π × month/12) so December (12) is close to January (1).
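The sin/cos transform can be sketched on a toy month column; the distance helper is a hypothetical function added only to compare how far apart months end up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"month": [1, 6, 12]})

# Map each month to a point on the unit circle.
angle = 2 * np.pi * df["month"] / 12
df["month_sin"] = np.sin(angle)
df["month_cos"] = np.cos(angle)

def distance(a, b):
    """Euclidean distance between two months in (sin, cos) space."""
    cols = ["month_sin", "month_cos"]
    pa = df.loc[df["month"] == a, cols].to_numpy()[0]
    pb = df.loc[df["month"] == b, cols].to_numpy()[0]
    return float(np.linalg.norm(pa - pb))

dec_to_jan = distance(12, 1)
jan_to_jun = distance(1, 6)
print(round(dec_to_jan, 3))  # 0.518
print(round(jan_to_jun, 3))  # 1.932
```

On the raw integer scale, December and January are 11 apart; after the transform they are closer to each other than January is to June.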
What about high-cardinality categorical features like user IDs?
Don't directly encode them as features — too sparse, can leak. Common patterns: (1) drop them and learn from aggregated user features, (2) use entity embeddings (deep learning), (3) target-encode with smoothing, (4) use frequency encoding (replace each ID with its count).