Unsupervised Learning & Clustering

1. Why this matters

You won't always have labels. Real-world unsupervised use cases:

Customer segmentation: group users by behavior to target marketing.
Anomaly detection: points that don't fit any cluster are suspicious.
Document grouping: cluster news articles by topic.
Image compression / feature learning: encode images into a few "prototype" vectors.
Preprocessing: PCA / t-SNE / UMAP for visualization, or PCA / UMAP as features for supervised models.

2. Mental model

flowchart TD
    A[Unlabeled X] --> Q{What do you need?}
    Q -->|Group similar rows| C{Cluster shape?}
    Q -->|Reduce dimensions for viz| V[t-SNE / UMAP]
    Q -->|Reduce dims for downstream model| P[PCA]
    Q -->|Anomaly detection| AN[IsolationForest / LOF / OneClassSVM]
    C -->|Globular / known k| K[K-Means]
    C -->|Arbitrary shape, noise tolerance| D[DBSCAN]
    C -->|Want a dendrogram / hierarchy| H[AgglomerativeClustering]

3. K-Means

The classic clustering algorithm. Place k centroids (randomly, or with the k-means++ seeding that scikit-learn uses by default), assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until the assignments stop changing.
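
The assign/update loop described above can be sketched in plain NumPy. This is a teaching sketch, not scikit-learn's implementation: the function name `lloyd_kmeans`, the plain random initialization, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain assign/update iterations (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct rows of X (plain random init, not k-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    inertia_history = []
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        inertia_history.append(d2[np.arange(len(X)), labels].sum())
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:           # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                          # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels, inertia_history

# Toy data: three tight groups (made up for this sketch).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels_demo, history = lloyd_kmeans(X_demo, k=3)
print("iterations:", len(history), "final inertia:", round(history[-1], 2))
```

Inertia never increases from one iteration to the next, but a single random start can still settle in a poor local optimum, which is why `KMeans` offers `n_init` restarts.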

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(
    n_clusters=4,
    init="k-means++",       # smart initialization (default)
    n_init=10,              # restart 10× and keep the best
    max_iter=300,
    random_state=42,
).fit(X_scaled)

labels = km.labels_                        # cluster index per sample
centroids = km.cluster_centers_
print("Inertia (lower = tighter):", km.inertia_)

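# Predict on new data. X_new must be scaled the same way as the training data;
# here three already-scaled rows stand in for unseen points.
X_new = X_scaled[:3]
new_labels = km.predict(X_new)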

Key params:

n_clusters: the big one. Choose via elbow / silhouette.
n_init: number of restarts with different centroid seeds. Default "auto", which in scikit-learn 1.4+ means 1 run for init="k-means++" and 10 for init="random". Higher = more reliable, slower.
init="k-means++": smart initialization. Use the default.

Always scale before K-Means — it's distance-based.
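
To see why scaling matters, here is a small sketch on made-up data: one column carries the real group structure, the other is noise on a much larger scale. The data, the helper name `fit_ari`, and the use of the adjusted Rand index as a yardstick are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Column 0 carries the real group structure (means 0 and 10, unit spread).
signal = np.concatenate([rng.normal(0, 1, n), rng.normal(10, 1, n)])
# Column 1 is pure noise, but on a scale about 1000x larger.
noise = rng.normal(0, 1000, 2 * n)
X_demo = np.column_stack([signal, noise])
y_true = np.repeat([0, 1], n)

def fit_ari(X):
    """Cluster into 2 groups and compare with the known grouping."""
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(y_true, pred)

ari_raw = fit_ari(X_demo)
ari_scaled = fit_ari(StandardScaler().fit_transform(X_demo))
print(f"ARI unscaled: {ari_raw:.2f}   ARI scaled: {ari_scaled:.2f}")
```

Without scaling, K-Means splits on the noisy large-range column and the agreement with the true groups is close to chance; after scaling, the real structure is recovered.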

4. Choosing `k`: Elbow & Silhouette

Elbow method: plot inertia (sum of squared distances to centroid) vs k. Pick the "elbow" where adding more clusters stops paying off:

import numpy as np
inertias = []
ks = range(1, 11)
for k in ks:
    inertias.append(KMeans(n_clusters=k, n_init=10, random_state=42)
                      .fit(X_scaled).inertia_)

plt.plot(ks, inertias, "bo-")
plt.xlabel("k"); plt.ylabel("inertia")
plt.title("Elbow Method")

Silhouette score: measures how similar each point is to its own cluster vs others. Range [-1, +1]; higher is better.

from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 11):                          # silhouette undefined for k=1
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    scores.append(silhouette_score(X_scaled, km.labels_))

best_k = np.argmax(scores) + 2
print(f"Best k by silhouette: {best_k}")

Rule of thumb: use silhouette for the principled answer; elbow as a sanity check.
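
For intuition, the silhouette of a single point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The sketch below computes that by hand for one point and compares it with `silhouette_samples`; the toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

i = 0                                            # inspect a single point
dists = np.linalg.norm(X_demo - X_demo[i], axis=1)
own = labels_demo == labels_demo[i]

# a: mean distance to the other members of the point's own cluster
a = dists[own].sum() / (own.sum() - 1)           # the point's distance to itself is 0
# b: mean distance to the members of the nearest other cluster
b = min(dists[labels_demo == c].mean()
        for c in np.unique(labels_demo) if c != labels_demo[i])

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X_demo, labels_demo)[i]
print(f"manual: {s_manual:.4f}   sklearn: {s_sklearn:.4f}")
```

`silhouette_score` is simply the mean of these per-point values, so a few badly placed points can hide behind a decent average. Inspecting `silhouette_samples` per cluster is a useful follow-up.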

5. DBSCAN: arbitrary shapes, no `k` upfront

Density-Based Spatial Clustering of Applications with Noise. Defines clusters as dense regions separated by sparse ones. Doesn't need k — finds it automatically. Marks low-density points as noise (label -1).

from sklearn.cluster import DBSCAN

db = DBSCAN(
    eps=0.5,                # max neighborhood radius
    min_samples=5,          # min points to form a dense region
    metric="euclidean",
).fit(X_scaled)

labels = db.labels_                      # -1 means noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise    = (labels == -1).sum()

Pros:

Finds non-globular clusters (concentric rings, half-moons).
Marks outliers explicitly.
No need to pre-specify k.

Cons:

Sensitive to eps. Use the k-distance plot to pick it:

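from sklearn.neighbors import NearestNeighbors
# n_neighbors matches min_samples; each point counts as its own nearest neighbor here
nn = NearestNeighbors(n_neighbors=5).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
plt.plot(np.sort(distances[:, -1]))   # look for an "elbow": that's a good eps
plt.xlabel("points sorted by distance"); plt.ylabel("distance to 5th neighbor (incl. self)")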

Degrades on very high-dimensional data: the curse of dimensionality makes density estimates unreliable.
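
A quick way to see the non-globular advantage is the two-moons toy dataset. The sketch below compares K-Means and DBSCAN against the known moon membership; the dataset settings and `eps=0.4` are assumptions chosen for this example, not recommended defaults.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)
X_moons = StandardScaler().fit_transform(X_moons)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(X_moons)

ari_km = adjusted_rand_score(y_moons, km_labels)
ari_db = adjusted_rand_score(y_moons, db_labels)
n_found = len(set(db_labels)) - (1 if -1 in db_labels else 0)   # -1 is noise, not a cluster
print(f"K-Means ARI: {ari_km:.2f}")
print(f"DBSCAN  ARI: {ari_db:.2f}  (clusters: {n_found}, noise points: {(db_labels == -1).sum()})")
```

K-Means cuts the moons with a straight boundary, while DBSCAN follows each dense arc. On real data the result depends heavily on eps, so treat the value here as specific to this toy set.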

6. Hierarchical / Agglomerative Clustering

Build a tree (dendrogram) by repeatedly merging the two closest clusters. Cut the tree at any height to get a clustering.

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Visualize the hierarchy first
Z = linkage(X_scaled, method="ward")        # "ward" minimizes within-cluster variance
plt.figure(figsize=(10, 4))
dendrogram(Z, truncate_mode="level", p=4)

# Then commit to a cluster count
agg = AgglomerativeClustering(n_clusters=4, linkage="ward").fit(X_scaled)
labels = agg.labels_
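
"Cut the tree at any height" can also be done directly on the SciPy linkage matrix, without refitting. A minimal sketch using `scipy.cluster.hierarchy.fcluster`; the toy blobs and the chosen cut levels are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_blobs(n_samples=120, centers=[[0, 0], [6, 6], [0, 6], [6, 0]],
                       cluster_std=0.5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

Z_demo = linkage(X_demo, method="ward")      # build the tree once

# Cut the same tree at several levels without refitting.
for k in (2, 4, 8):
    cut = fcluster(Z_demo, t=k, criterion="maxclust")    # flat labels start at 1
    print(k, "clusters requested ->", len(np.unique(cut)), "found")

# Alternatively, cut at a merge height (the y-axis of the dendrogram).
height = Z_demo[-4, 2]                       # height of the 4th-from-last merge
by_height = fcluster(Z_demo, t=height, criterion="distance")
by_count = fcluster(Z_demo, t=4, criterion="maxclust")
print("same partition:", adjusted_rand_score(by_height, by_count) == 1.0)
```

Cutting by count is convenient when the business dictates the number of segments; cutting by height is the natural choice when you read a threshold off the dendrogram.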

Linkage criteria:

Linkage	Merges based on
`"ward"`	Increase in within-cluster variance. Default in AgglomerativeClustering; Euclidean distances only.
`"average"`	Average distance between cluster members.
`"complete"`	Max distance between members (compact clusters).
`"single"`	Min distance (can produce "chaining").

Use when:

You want to inspect the hierarchy (taxonomies, phylogenetics).
Your dataset is small (< 10k rows; memory grows as O(n²)).
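
The O(n²) memory note is easy to make concrete. The arithmetic below assumes the implementation stores every pairwise distance once as a 64-bit float (the "condensed" form SciPy uses); the helper name is made up for this sketch.

```python
def condensed_matrix_gb(n_rows, bytes_per_value=8):
    """Size in GB of the n*(n-1)/2 pairwise distances stored as float64."""
    n_pairs = n_rows * (n_rows - 1) // 2
    return n_pairs * bytes_per_value / 1e9

for n_rows in (1_000, 10_000, 100_000):
    print(f"{n_rows:>7} rows -> {condensed_matrix_gb(n_rows):8.2f} GB")
```

Under that assumption, 10k rows need about 0.4 GB for the distances alone, and 100k rows need about 40 GB, which is why the 10k guideline exists.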

7. Evaluating clustering: no ground truth

Internal metrics (no labels needed):

Metric	Higher is better	What it measures
Silhouette score	Yes (range [-1, 1])	Cohesion vs separation per point
Calinski-Harabasz	Yes	Ratio of between/within dispersion
Davies-Bouldin	No (lower = better)	Average ratio of within/between distances

from sklearn.metrics import (
    silhouette_score, calinski_harabasz_score, davies_bouldin_score,
)

print("Silhouette  :", silhouette_score(X_scaled, labels))
print("Calinski-H  :", calinski_harabasz_score(X_scaled, labels))
print("Davies-Bould:", davies_bouldin_score(X_scaled, labels))

If you DO have ground truth labels (e.g., from a held-out labeled subset), use:

from sklearn.metrics import (
    adjusted_rand_score, normalized_mutual_info_score, homogeneity_completeness_v_measure,
)
adjusted_rand_score(y_true, labels)
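
These label-based metrics compare partitions, not cluster IDs, so renaming clusters does not change the score. A tiny sketch with hand-written label lists (made up for illustration):

```python
from sklearn.metrics import (
    adjusted_rand_score, normalized_mutual_info_score,
    homogeneity_completeness_v_measure,
)

y_true_demo = [0, 0, 0, 1, 1, 1]
same_groups_renamed = [1, 1, 1, 0, 0, 0]     # identical partition, IDs swapped
one_point_moved = [0, 0, 1, 1, 1, 1]         # third point put in the wrong group

ari_renamed = adjusted_rand_score(y_true_demo, same_groups_renamed)
nmi_renamed = normalized_mutual_info_score(y_true_demo, same_groups_renamed)
ari_moved = adjusted_rand_score(y_true_demo, one_point_moved)
h, c, v = homogeneity_completeness_v_measure(y_true_demo, one_point_moved)

print("renamed IDs -> ARI:", ari_renamed, "NMI:", round(nmi_renamed, 3))
print("one mistake -> ARI:", round(ari_moved, 3), "V-measure:", round(v, 3))
```

Plain accuracy would score the renamed labels as 0% correct, which is why it is the wrong tool for clustering output.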

8. PCA / t-SNE / UMAP for visualization

Reducing to 2D so you can SEE the structure:

# PCA — fast, linear, deterministic
from sklearn.decomposition import PCA
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# t-SNE — non-linear, slow, beautiful for visualization
from sklearn.manifold import TSNE
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

# UMAP — non-linear, faster than t-SNE, often better
# pip install umap-learn
import umap
X_2d = umap.UMAP(n_components=2, n_neighbors=15, random_state=42).fit_transform(X_scaled)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="tab10")

Choose by purpose:

Method	Use when
PCA	Production / downstream features. Need a reproducible linear projection.
t-SNE	One-off visualization. Quality > speed. Don't use for downstream features.
UMAP	One-off visualization OR features. Faster than t-SNE, often preserves more global structure, and can transform new data.
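
One practical reason PCA suits production: a fitted PCA can project rows it has never seen, while scikit-learn's TSNE only offers `fit_transform`. A minimal sketch on random data (the arrays are placeholders, not a real dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(200, 10))
X_new_demo = rng.normal(size=(5, 10))

pca = PCA(n_components=2).fit(X_train_demo)      # learn the projection once
Z_new = pca.transform(X_new_demo)                # reuse it on unseen rows
print("projected shape:", Z_new.shape)
print("variance kept by 2 components:", round(float(pca.explained_variance_ratio_.sum()), 3))

# scikit-learn's TSNE has fit_transform only: there is no transform() for new rows.
print("PCA has transform: ", hasattr(pca, "transform"))
print("TSNE has transform:", hasattr(TSNE(n_components=2), "transform"))
```

Checking `explained_variance_ratio_` before trusting a 2D PCA plot is worthwhile: if two components keep only a small share of the variance, the picture may hide most of the structure.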

9. End-to-end: customer segmentation

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# 1. Load behavioral features
df = pd.read_csv("customers.csv")
features = ["age", "tenure_months", "monthly_spend", "session_count", "support_tickets"]

# 2. Scale
X = StandardScaler().fit_transform(df[features])

# 3. Pick k
from sklearn.metrics import silhouette_score
scores = [silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X))
          for k in range(2, 9)]
best_k = scores.index(max(scores)) + 2

# 4. Final fit
km = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
df["cluster"] = km.labels_

# 5. Interpret each cluster
profile = df.groupby("cluster")[features].mean().round(1)
print(profile)
# Each row = a "persona" you can give to marketing

# 6. Visualize in 2D
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=km.labels_, cmap="tab10", alpha=0.6)

10. Common pitfalls

❗ K-Means without scaling. Large-range features dominate distance → meaningless clusters.
❗ Eyeballing k. Use silhouette + elbow. Or domain knowledge ("we have 4 customer tiers").
❗ K-Means on non-globular data. Two interlocking moons → K-Means fails. Use DBSCAN.
❗ DBSCAN with default eps=0.5 and not scaling. eps is in original feature units. Without scaling it's meaningless.
❗ Using t-SNE 2D coords as features for a supervised model. t-SNE preserves local structure but distorts distances and is non-deterministic — bad for downstream models. Use PCA or UMAP if you need features.
❗ Treating clusters as if they were classes. Cluster IDs are arbitrary: cluster 0 in one run ≠ cluster 0 in the next. If you re-fit on new data, match each new cluster to the old cluster with the nearest centroid before comparing labels.
❗ No interpretation step. Clusters without a profile (mean values per feature, per cluster) are useless. Always profile and name them ("budget shoppers", "power users").
❗ Silhouette on small clusters. With < 30 points per cluster the score is noisy. Collect more data or merge tiny clusters.
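
For the arbitrary-cluster-ID pitfall, one way to line up a re-fit with an earlier fit is to match centroids one-to-one. The sketch below uses `scipy.optimize.linear_sum_assignment` for the matching; that choice, the toy blobs, and the variable names are assumptions for illustration, and other matching schemes are possible.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                       cluster_std=0.7, random_state=0)

old = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)
new = KMeans(n_clusters=4, n_init=10, random_state=123).fit(X_demo)   # IDs may be permuted

# Cost matrix: distance from every new centroid (rows) to every old centroid (columns).
cost = cdist(new.cluster_centers_, old.cluster_centers_)
new_ids, old_ids = linear_sum_assignment(cost)    # one-to-one match, minimal total distance
new_to_old = np.empty(4, dtype=int)
new_to_old[new_ids] = old_ids

remapped = new_to_old[new.labels_]
print("agreement before remap:", (new.labels_ == old.labels_).mean())
print("agreement after remap: ", (remapped == old.labels_).mean())
```

This only makes sense when both fits found roughly the same groups. If the matched centroids are far apart, the segments themselves have shifted and need re-profiling rather than relabeling.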

11. When to use what

Goal	First try	If that fails
Group similar rows, expect globular clusters	K-Means	Mini-Batch K-Means for big data
Arbitrary cluster shape, want to find outliers	DBSCAN	HDBSCAN (handles varying density)
Want a hierarchy you can cut at any level	AgglomerativeClustering(linkage="ward")	—
Detect anomalies	IsolationForest / LocalOutlierFactor	OneClassSVM
Visualize 2D embedding	UMAP	t-SNE for prettier plots, PCA for speed
Reduce dims for downstream model	PCA	UMAP or autoencoder
Cluster very high-dim data (text, images)	K-Means on embeddings	HDBSCAN on UMAP-reduced
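
The "Detect anomalies" row has no example elsewhere in these notes, so here is a minimal IsolationForest sketch. The data is synthetic with five hand-planted outliers, and `contamination=0.05` is an assumed anomaly share that you would have to estimate for real data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(200, 2))
planted = np.array([[8, 8], [-8, 7], [9, -8], [-7, -9], [0, 12]], dtype=float)
X_demo = np.vstack([inliers, planted])           # planted outliers are the last 5 rows

iso = IsolationForest(
    n_estimators=200,
    contamination=0.05,      # assumed share of anomalies; a guess you must supply
    random_state=0,
).fit(X_demo)

pred = iso.predict(X_demo)                # +1 = inlier, -1 = anomaly
scores = iso.decision_function(X_demo)    # lower = more anomalous
print("flagged:", (pred == -1).sum(), "of", len(X_demo))
print("all planted outliers flagged:", bool((pred[-5:] == -1).all()))
```

Because `contamination` fixes the share of flagged points, some ordinary edge points get flagged too. Ranking by `decision_function` and reviewing the lowest scores is often more useful than the hard -1/+1 labels.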

12. Cheatsheet

from sklearn.cluster import (
    KMeans, MiniBatchKMeans,
    DBSCAN,
    AgglomerativeClustering,
    Birch,
    MeanShift,
)
from sklearn.decomposition import PCA, TruncatedSVD, NMF
from sklearn.manifold import TSNE, MDS
from sklearn.metrics import (
    silhouette_score, calinski_harabasz_score, davies_bouldin_score,
    adjusted_rand_score, normalized_mutual_info_score,
)
from sklearn.preprocessing import StandardScaler

# K-Means canonical pattern
X_s = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_s)
km.labels_, km.cluster_centers_, km.inertia_

# Pick k by silhouette
import numpy as np
ks = range(2, 11)
scores = [silhouette_score(X_s,
                           KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_s))
          for k in ks]
best_k = ks[np.argmax(scores)]

# DBSCAN
DBSCAN(eps=0.5, min_samples=5).fit(X_s)
# tune eps via k-distance plot

# Agglomerative
AgglomerativeClustering(n_clusters=4, linkage="ward").fit(X_s)

# Big data → MiniBatch
MiniBatchKMeans(n_clusters=10, batch_size=1024, random_state=42).fit(X_s)

# 2D viz
import umap   # pip install umap-learn
X_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(X_s)

# Cluster profiling (DON'T skip)
df["cluster"] = km.labels_
df.groupby("cluster").mean(numeric_only=True)   # skip text columns instead of raising
df.groupby("cluster").size()

13. Q&A: recall test

Q: Why scale before K-Means? A: K-Means uses Euclidean distance. Without scaling, larger-range features dominate, distorting clusters. Always scale.
Q: Elbow vs silhouette for picking k? A: Elbow plots inertia vs k and looks for a kink — heuristic, eyeballed. Silhouette gives a numeric score per k — pick the maximum. Use silhouette as the primary; elbow as sanity check.
Q: When does K-Means fail and DBSCAN win? A: Non-globular cluster shapes (two interlocking moons, concentric rings). K-Means forces convex clusters; DBSCAN follows density.
Q: What does DBSCAN label -1 mean? A: Noise. The point is not a core point and is not within eps of any core point, so it belongs to no cluster. Useful for anomaly detection.
Q: Can you use t-SNE coordinates as features for a downstream model? A: No — t-SNE is non-deterministic, distorts distances, and doesn't generalize to new data. Use PCA or UMAP for that.
Q: What's the single most important step AFTER clustering? A: Profile the clusters. Compute the mean / mode of each feature per cluster and name them. Without that, cluster IDs are meaningless to the business.

Practice

What does this print?

Expected: 3

from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1,1],[1,2],[2,1], [10,10],[10,11], [20,20],[20,21],[21,20]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(len(np.unique(km.labels_)))

Fix the bug: scale the features before KMeans (it's distance-based)

Expected: True

from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 1000], [2, 2000], [10, 1500], [11, 2500]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # bug: large-scale feature dominates clustering
print(km.n_iter_ > 0)

Quiz: Quick check

What you remember

Q1. How do you choose the number of clusters k for KMeans?

Always use k=3
The elbow method — plot inertia vs k, pick the "elbow" where adding more clusters stops helping much
Try k=1 and increase until accuracy is good
Use the dimensions of the data

Why: The elbow method or silhouette score. KMeans doesn't tell you the "right" k — you have to pick based on the data structure and business meaning.

Q2. Which clustering algorithm doesn't need you to specify the number of clusters?

KMeans
DBSCAN (finds clusters based on density)
Hierarchical (needs a cut threshold)
Gaussian Mixture

Why: DBSCAN groups densely-connected points. Points in low-density regions become noise. You set the density parameters (eps, min_samples) instead.

Q3. Why scale features before KMeans?

Required by sklearn
KMeans uses Euclidean distance — features with larger scale dominate the distance metric
To make it faster
Removes outliers

Why: A feature in millions dominates a feature in [0, 1]. The clusters will essentially split on the high-scale feature only. StandardScaler puts everything on equal footing.

Common doubts

When should I use clustering in real projects?

Customer segmentation, anomaly detection, image quantization, document grouping, and as a feature for downstream supervised models. It's not always the answer — sometimes simple aggregations or business rules are clearer.

How do I evaluate clustering quality?

Without labels, use silhouette score, Davies-Bouldin index, or Calinski-Harabasz score — all measure cluster compactness and separation. With labels (rare in real clustering), use ARI or NMI. The business judgment of cluster meaning matters more than any metric.

What's the curse of dimensionality for clustering?

In high dimensions, all points become roughly equidistant, so distance-based clustering breaks down. Density-based methods such as DBSCAN suffer too, because their density estimates rely on the same distances. Solutions: reduce dimensions first (PCA, UMAP) and then cluster, or use a metric better suited to the data (cosine distance for text/embeddings).
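
The "roughly equidistant" effect is easy to observe. The sketch below measures how much farther the farthest random point is than the nearest one, relative to the nearest, as the number of dimensions grows. Uniform random data and the helper name `relative_contrast` are assumptions for illustration.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

for n_dims in (2, 10, 100, 1000):
    print(f"{n_dims:>5} dims -> relative contrast {relative_contrast(n_dims):.2f}")
```

As the contrast shrinks toward zero, "nearest" and "farthest" neighbors become almost indistinguishable, which is what undermines both centroid assignments and eps-neighborhoods.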