GroupBy — Split, Apply, Combine¶

groupby is the most powerful tool in Pandas. The three-step pattern: split the data into groups, apply a function to each, combine the results.

The classic "sum per category"¶

import pandas as pd

df = pd.DataFrame({
    "city":  ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai","Pune"],
    "sales": [100, 200, 150, 250, 80, 175, 120],
})

# Total sales per city
print(df.groupby("city")["sales"].sum())

city
Delhi     450
Mumbai    425
Pune      200
Name: sales, dtype: int64

In SQL: SELECT city, SUM(sales) FROM df GROUP BY city;

What `groupby` actually does¶

1. SPLIT — rows are bucketed by city
        ↓
2. APPLY — sum is computed per bucket
        ↓
3. COMBINE — results stacked into one Series/DataFrame

The result has the group key as the index.

All the common aggregations¶

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
})

g = df.groupby("city")

print("count :", g["salary"].count().to_dict())
print("sum   :", g["salary"].sum().to_dict())
print("mean  :", g["salary"].mean().to_dict())
print("min   :", g["salary"].min().to_dict())
print("max   :", g["salary"].max().to_dict())
print("std   :", g["salary"].std().round(0).to_dict())
print("median:", g["salary"].median().to_dict())
print("first :", g["salary"].first().to_dict())
print("last  :", g["salary"].last().to_dict())

Multiple aggregations at once — `.agg()`¶

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
    "age":    [25, 30, 35, 40, 22, 50],
})

# Multiple functions on one column
print(df.groupby("city")["salary"].agg(["count", "sum", "mean", "min", "max"]))

Different aggs on different columns¶

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
    "age":    [25, 30, 35, 40, 22, 50],
})

print(df.groupby("city").agg({
    "salary": ["mean", "max"],
    "age":    ["mean", "min"],
}))

Named aggregations — clean column names¶

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
    "age":    [25, 30, 35, 40, 22, 50],
})

result = df.groupby("city").agg(
    avg_salary = ("salary", "mean"),
    max_salary = ("salary", "max"),
    avg_age    = ("age",    "mean"),
    count      = ("city",   "size"),
)
print(result)

Clean column names — no MultiIndex.

Group by multiple columns¶

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Mumbai","Mumbai","Delhi","Delhi","Pune","Pune"],
    "dept":   ["Eng","HR","Eng","Eng","HR","Eng","HR"],
    "salary": [100,80,120,90,70,110,60],
})

# Total salary per (city, dept)
print(df.groupby(["city", "dept"])["salary"].sum())
print()

# As a DataFrame (unstack the inner level into columns)
print(df.groupby(["city", "dept"])["salary"].sum().unstack(fill_value=0))

Custom aggregation function¶

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
})

# Pass a lambda
print(df.groupby("city")["salary"].agg(lambda s: s.max() - s.min()))
print()

# Named — using your own function
def salary_range(s):
    return s.max() - s.min()

print(df.groupby("city")["salary"].agg(range=salary_range))

`.transform()` — broadcast back to original shape¶

agg shrinks data (one row per group). transform keeps the original shape and broadcasts the group result.

Common use: per-group normalization.

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
})

# Salary minus city mean
df["salary_z"] = df["salary"] - df.groupby("city")["salary"].transform("mean")
print(df)

This is great for ML feature engineering — "how much above the city's average is this person earning?"

`.filter()` — keep entire groups that pass¶

Keep only cities with > 2 employees:

import pandas as pd

df = pd.DataFrame({
    "city": ["Mumbai","Mumbai","Mumbai","Delhi","Pune","Pune","Pune"],
    "name": ["A","B","C","D","E","F","G"],
})

# Keep groups with size > 2
result = df.groupby("city").filter(lambda g: len(g) > 2)
print(result)

Iterating over groups¶

For inspection or manual logic:

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi"],
    "salary": [50000, 60000, 75000, 80000],
})

for name, group in df.groupby("city"):
    print(f"=== {name} ===")
    print(group)
    print()

`.size()` vs `.count()`¶

size() — number of rows in each group (includes NaN).
count() — non-null count per column.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "city":  ["Mumbai","Mumbai","Delhi","Delhi","Delhi"],
    "score": [85, np.nan, 70, 80, np.nan],
})

print("size:")
print(df.groupby("city").size())
print()

print("count (per column):")
print(df.groupby("city").count())

Top-K per group¶

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city":   rng.choice(["Mumbai","Delhi","Pune"], size=15),
    "name":   [f"User{i}" for i in range(1, 16)],
    "salary": rng.integers(40000, 150000, size=15),
})

# Top-2 highest-paid per city
top2 = (
    df
    .sort_values("salary", ascending=False)
    .groupby("city")
    .head(2)
)
print(top2.sort_values(["city","salary"], ascending=[True, False]))

Pivot table — a shortcut over groupby¶

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Mumbai","Delhi","Delhi","Pune","Pune"],
    "dept":   ["Eng","HR","Eng","HR","Eng","HR"],
    "salary": [100, 80, 90, 70, 110, 60],
})

# Pivot — city as rows, dept as cols, mean salary in cells
print(df.pivot_table(
    index="city",
    columns="dept",
    values="salary",
    aggfunc="mean",
    fill_value=0,
))

More on this in Chapter 12 — Reshaping.

A practical example — sales analysis¶

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 100

sales = pd.DataFrame({
    "date":     pd.date_range("2025-01-01", periods=n),
    "region":   rng.choice(["North","South","East","West"], size=n),
    "product":  rng.choice(["A","B","C"], size=n),
    "revenue":  rng.integers(100, 1000, size=n),
    "discount": rng.uniform(0, 0.3, size=n).round(2),
})

# Total revenue per region
print("Revenue per region:")
print(sales.groupby("region")["revenue"].sum().sort_values(ascending=False))
print()

# Per (region, product) summary
print("Per region+product (mean revenue):")
print(
    sales.groupby(["region", "product"])
         .agg(avg_revenue=("revenue","mean"), txns=("revenue","size"))
         .round(0)
)

Cheatsheet¶

Goal	Code
Total per group	`df.groupby("g")["v"].sum()`
Mean per group	`df.groupby("g")["v"].mean()`
Multiple aggs	`df.groupby("g")["v"].agg(["sum","mean","max"])`
Per-col aggs	`df.groupby("g").agg({"a":"sum", "b":"mean"})`
Named aggs	`df.groupby("g").agg(x=("a","mean"), y=("b","sum"))`
Group by multi cols	`df.groupby(["c1","c2"])["v"].sum()`
Group-wise transform	`df.groupby("g")["v"].transform("mean")`
Keep groups passing condition	`df.groupby("g").filter(lambda g: ...)`
Top-K per group	`df.sort_values("v", ascending=False).groupby("g").head(K)`
Group size	`df.groupby("g").size()`
Pivot table	`df.pivot_table(index="r", columns="c", values="v", aggfunc=...)`

Common pitfalls¶

❗ groupby returns a special object, not a DataFrame — you need to chain an aggregation: .sum(), .agg(...), etc.
❗ Result has group key as index — use .reset_index() to make it a column again.
❗ as_index=False — df.groupby("g", as_index=False).sum() keeps the key as a column.
❗ NaN keys are dropped — pass dropna=False to include them.
❗ apply is slow — prefer agg, transform, or vectorized ops.
❗ size() vs count() — size is row count (NaN included); count is per-column non-null count.

Practice¶

What does this print?

Expected: 450

import pandas as pd
df = pd.DataFrame({"city": ["A","B","A","B"], "sales": [100, 200, 150, 250]})
print(df.groupby("city")["sales"].sum()["B"])

Get one row per (city, dept) with total salary as a NORMAL DataFrame (city/dept as columns, not the index)

Expected: True

import pandas as pd
df = pd.DataFrame({
    "city": ["Mumbai","Mumbai","Delhi","Delhi"],
    "dept": ["Eng","HR","Eng","HR"],
    "salary": [100, 80, 90, 70],
})
result = df.groupby(["city","dept"])["salary"].sum()        # bug: returns a Series with MultiIndex
print(isinstance(result, pd.DataFrame))

Quiz — Quick check¶

What you remember

Q1. df.groupby("g") by itself returns…

A new DataFrame
A list of groups
A special GroupBy object — you need to chain an aggregation
An error

Why: The split phase has happened; the apply phase hasn't. You must follow up with .sum(), .agg(...), .transform(...), etc.

Q2. What's the difference between agg and transform?

No difference
agg collapses each group to one row; transform keeps the original shape, broadcasting the group result
transform is faster
agg is for strings

Why: df.groupby("g")["v"].mean() (with agg) returns one row per group. df.groupby("g")["v"].transform("mean") returns a column with each row showing its group's mean — same length as the original.

Q3. Why use as_index=False in df.groupby("g", as_index=False)?

To get the group key as a regular column instead of the index
To make grouping faster
To handle NaN
To allow multiple keys

Why: By default the group key becomes the index of the result. as_index=False keeps it as a column — often more convenient for further operations.

Common doubts¶

What's a 'named aggregation' and when should I use it?

Named agg syntax: df.groupby("g").agg(avg_sales=("sales", "mean"), n=("sales", "size")). Gives you clean column names without MultiIndex. Strongly preferred over .agg({"sales": ["mean", "size"]}) which produces nested column names.

Why is groupby().apply() so slow?

Because apply calls your function once per group in Python — interpreted, with object overhead. For aggregations, prefer agg or transform (vectorized). Only fall back to apply for genuinely custom per-group logic that can't be expressed otherwise.

How do I aggregate to one row per group with multiple columns AND different aggs per column?

Use named aggregation: df.groupby("g").agg(x=("a", "mean"), y=("b", "sum"), z=("c", "max")). Each output column gets its own aggregation, with a clean name.

What's next¶

→ Merge, Join, Concatenate