GroupBy — Split, Apply, Combine¶
groupby is the most powerful tool in Pandas. The three-step pattern: split the data into groups, apply a function to each, combine the results.
The classic "sum per category"¶
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai","Pune"],
"sales": [100, 200, 150, 250, 80, 175, 120],
})
# Total sales per city
print(df.groupby("city")["sales"].sum())
In SQL: SELECT city, SUM(sales) FROM df GROUP BY city;
What groupby actually does¶
1. SPLIT — rows are bucketed by city
↓
2. APPLY — sum is computed per bucket
↓
3. COMBINE — results stacked into one Series/DataFrame
The result has the group key as the index.
All the common aggregations¶
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
"salary": [50000, 80000, 75000, 60000, 90000, 100000],
})
g = df.groupby("city")
print("count :", g["salary"].count().to_dict())
print("sum :", g["salary"].sum().to_dict())
print("mean :", g["salary"].mean().to_dict())
print("min :", g["salary"].min().to_dict())
print("max :", g["salary"].max().to_dict())
print("std :", g["salary"].std().round(0).to_dict())
print("median:", g["salary"].median().to_dict())
print("first :", g["salary"].first().to_dict())
print("last :", g["salary"].last().to_dict())
Multiple aggregations at once — .agg()¶
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
"salary": [50000, 80000, 75000, 60000, 90000, 100000],
"age": [25, 30, 35, 40, 22, 50],
})
# Multiple functions on one column
print(df.groupby("city")["salary"].agg(["count", "sum", "mean", "min", "max"]))
Different aggs on different columns¶
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
"salary": [50000, 80000, 75000, 60000, 90000, 100000],
"age": [25, 30, 35, 40, 22, 50],
})
print(df.groupby("city").agg({
"salary": ["mean", "max"],
"age": ["mean", "min"],
}))
Named aggregations — clean column names¶
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
"salary": [50000, 80000, 75000, 60000, 90000, 100000],
"age": [25, 30, 35, 40, 22, 50],
})
result = df.groupby("city").agg(
avg_salary = ("salary", "mean"),
max_salary = ("salary", "max"),
avg_age = ("age", "mean"),
count = ("city", "size"),
)
print(result)
Clean column names — no MultiIndex.
Group by multiple columns¶
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Mumbai","Mumbai","Delhi","Delhi","Pune","Pune"],
"dept": ["Eng","HR","Eng","Eng","HR","Eng","HR"],
"salary": [100,80,120,90,70,110,60],
})
# Total salary per (city, dept)
print(df.groupby(["city", "dept"])["salary"].sum())
print()
# As a DataFrame (unstack the inner level into columns)
print(df.groupby(["city", "dept"])["salary"].sum().unstack(fill_value=0))
Custom aggregation function¶
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
"salary": [50000, 80000, 75000, 60000, 90000, 100000],
})
# Pass a lambda
print(df.groupby("city")["salary"].agg(lambda s: s.max() - s.min()))
print()
# Named — using your own function
def salary_range(s):
return s.max() - s.min()
print(df.groupby("city")["salary"].agg(range=salary_range))
.transform() — broadcast back to original shape¶
agg shrinks data (one row per group). transform keeps the original shape and broadcasts the group result.
Common use: per-group normalization.
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
"salary": [50000, 80000, 75000, 60000, 90000, 100000],
})
# Salary minus city mean
df["salary_z"] = df["salary"] - df.groupby("city")["salary"].transform("mean")
print(df)
This is great for ML feature engineering — "how much above the city's average is this person earning?"
.filter() — keep entire groups that pass¶
Keep only cities with > 2 employees:
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Mumbai","Mumbai","Delhi","Pune","Pune","Pune"],
"name": ["A","B","C","D","E","F","G"],
})
# Keep groups with size > 2
result = df.groupby("city").filter(lambda g: len(g) > 2)
print(result)
Iterating over groups¶
For inspection or manual logic:
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Delhi","Mumbai","Delhi"],
"salary": [50000, 60000, 75000, 80000],
})
for name, group in df.groupby("city"):
print(f"=== {name} ===")
print(group)
print()
.size() vs .count()¶
size()— number of rows in each group (includes NaN).count()— non-null count per column.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"city": ["Mumbai","Mumbai","Delhi","Delhi","Delhi"],
"score": [85, np.nan, 70, 80, np.nan],
})
print("size:")
print(df.groupby("city").size())
print()
print("count (per column):")
print(df.groupby("city").count())
Top-K per group¶
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
df = pd.DataFrame({
"city": rng.choice(["Mumbai","Delhi","Pune"], size=15),
"name": [f"User{i}" for i in range(1, 16)],
"salary": rng.integers(40000, 150000, size=15),
})
# Top-2 highest-paid per city
top2 = (
df
.sort_values("salary", ascending=False)
.groupby("city")
.head(2)
)
print(top2.sort_values(["city","salary"], ascending=[True, False]))
Pivot table — a shortcut over groupby¶
import pandas as pd
df = pd.DataFrame({
"city": ["Mumbai","Mumbai","Delhi","Delhi","Pune","Pune"],
"dept": ["Eng","HR","Eng","HR","Eng","HR"],
"salary": [100, 80, 90, 70, 110, 60],
})
# Pivot — city as rows, dept as cols, mean salary in cells
print(df.pivot_table(
index="city",
columns="dept",
values="salary",
aggfunc="mean",
fill_value=0,
))
More on this in Chapter 12 — Reshaping.
A practical example — sales analysis¶
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
n = 100
sales = pd.DataFrame({
"date": pd.date_range("2025-01-01", periods=n),
"region": rng.choice(["North","South","East","West"], size=n),
"product": rng.choice(["A","B","C"], size=n),
"revenue": rng.integers(100, 1000, size=n),
"discount": rng.uniform(0, 0.3, size=n).round(2),
})
# Total revenue per region
print("Revenue per region:")
print(sales.groupby("region")["revenue"].sum().sort_values(ascending=False))
print()
# Per (region, product) summary
print("Per region+product (mean revenue):")
print(
sales.groupby(["region", "product"])
.agg(avg_revenue=("revenue","mean"), txns=("revenue","size"))
.round(0)
)
Cheatsheet¶
| Goal | Code |
|---|---|
| Total per group | df.groupby("g")["v"].sum() |
| Mean per group | df.groupby("g")["v"].mean() |
| Multiple aggs | df.groupby("g")["v"].agg(["sum","mean","max"]) |
| Per-col aggs | df.groupby("g").agg({"a":"sum", "b":"mean"}) |
| Named aggs | df.groupby("g").agg(x=("a","mean"), y=("b","sum")) |
| Group by multi cols | df.groupby(["c1","c2"])["v"].sum() |
| Group-wise transform | df.groupby("g")["v"].transform("mean") |
| Keep groups passing condition | df.groupby("g").filter(lambda g: ...) |
| Top-K per group | df.sort_values("v", ascending=False).groupby("g").head(K) |
| Group size | df.groupby("g").size() |
| Pivot table | df.pivot_table(index="r", columns="c", values="v", aggfunc=...) |
Common pitfalls¶
- ❗
groupbyreturns a special object, not a DataFrame — you need to chain an aggregation:.sum(),.agg(...), etc. - ❗ Result has group key as index — use
.reset_index()to make it a column again. - ❗
as_index=False—df.groupby("g", as_index=False).sum()keeps the key as a column. - ❗ NaN keys are dropped — pass
dropna=Falseto include them. - ❗
applyis slow — preferagg,transform, or vectorized ops. - ❗
size()vscount()—sizeis row count (NaN included);countis per-column non-null count.
Practice¶
What does this print?
Expected: 450
Get one row per (city, dept) with total salary as a NORMAL DataFrame (city/dept as columns, not the index)
Expected: True
Quiz — Quick check¶
What you remember
Q1. df.groupby("g") by itself returns…
- A new DataFrame
- A list of groups
- A special
GroupByobject — you need to chain an aggregation - An error
Why: The split phase has happened; the apply phase hasn't. You must follow up with
.sum(),.agg(...),.transform(...), etc.
Q2. What's the difference between agg and transform?
- No difference
-
aggcollapses each group to one row;transformkeeps the original shape, broadcasting the group result -
transformis faster -
aggis for strings
Why:
df.groupby("g")["v"].mean()(with agg) returns one row per group.df.groupby("g")["v"].transform("mean")returns a column with each row showing its group's mean — same length as the original.
Q3. Why use as_index=False in df.groupby("g", as_index=False)?
- To get the group key as a regular column instead of the index
- To make grouping faster
- To handle NaN
- To allow multiple keys
Why: By default the group key becomes the index of the result.
as_index=Falsekeeps it as a column — often more convenient for further operations.
Common doubts¶
What's a 'named aggregation' and when should I use it?
Named agg syntax: df.groupby("g").agg(avg_sales=("sales", "mean"), n=("sales", "size")). Gives you clean column names without MultiIndex. Strongly preferred over .agg({"sales": ["mean", "size"]}) which produces nested column names.
Why is groupby().apply() so slow?
Because apply calls your function once per group in Python — interpreted, with object overhead. For aggregations, prefer agg or transform (vectorized). Only fall back to apply for genuinely custom per-group logic that can't be expressed otherwise.
How do I aggregate to one row per group with multiple columns AND different aggs per column?
Use named aggregation: df.groupby("g").agg(x=("a", "mean"), y=("b", "sum"), z=("c", "max")). Each output column gets its own aggregation, with a clean name.