Skip to content

GroupBy — Split, Apply, Combine

groupby is the most powerful tool in Pandas. The three-step pattern: split the data into groups, apply a function to each, combine the results.

The classic "sum per category"

import pandas as pd

df = pd.DataFrame({
    "city":  ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai","Pune"],
    "sales": [100, 200, 150, 250, 80, 175, 120],
})

# Total sales per city
print(df.groupby("city")["sales"].sum())
city
Delhi     450
Mumbai    425
Pune      200
Name: sales, dtype: int64

In SQL: SELECT city, SUM(sales) FROM df GROUP BY city;

What groupby actually does

1. SPLIT — rows are bucketed by city
2. APPLY — sum is computed per bucket
3. COMBINE — results stacked into one Series/DataFrame

The result has the group key as the index.

All the common aggregations

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
})

g = df.groupby("city")

print("count :", g["salary"].count().to_dict())
print("sum   :", g["salary"].sum().to_dict())
print("mean  :", g["salary"].mean().to_dict())
print("min   :", g["salary"].min().to_dict())
print("max   :", g["salary"].max().to_dict())
print("std   :", g["salary"].std().round(0).to_dict())
print("median:", g["salary"].median().to_dict())
print("first :", g["salary"].first().to_dict())
print("last  :", g["salary"].last().to_dict())

Multiple aggregations at once — .agg()

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
    "age":    [25, 30, 35, 40, 22, 50],
})

# Multiple functions on one column
print(df.groupby("city")["salary"].agg(["count", "sum", "mean", "min", "max"]))

Different aggs on different columns

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
    "age":    [25, 30, 35, 40, 22, 50],
})

print(df.groupby("city").agg({
    "salary": ["mean", "max"],
    "age":    ["mean", "min"],
}))

Named aggregations — clean column names

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
    "age":    [25, 30, 35, 40, 22, 50],
})

result = df.groupby("city").agg(
    avg_salary = ("salary", "mean"),
    max_salary = ("salary", "max"),
    avg_age    = ("age",    "mean"),
    count      = ("city",   "size"),
)
print(result)

Clean column names — no MultiIndex.

Group by multiple columns

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Mumbai","Mumbai","Delhi","Delhi","Pune","Pune"],
    "dept":   ["Eng","HR","Eng","Eng","HR","Eng","HR"],
    "salary": [100,80,120,90,70,110,60],
})

# Total salary per (city, dept)
print(df.groupby(["city", "dept"])["salary"].sum())
print()

# As a DataFrame (unstack the inner level into columns)
print(df.groupby(["city", "dept"])["salary"].sum().unstack(fill_value=0))

Custom aggregation function

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
})

# Pass a lambda
print(df.groupby("city")["salary"].agg(lambda s: s.max() - s.min()))
print()

# Named — using your own function
def salary_range(s):
    return s.max() - s.min()

print(df.groupby("city")["salary"].agg(range=salary_range))

.transform() — broadcast back to original shape

agg shrinks data (one row per group). transform keeps the original shape and broadcasts the group result.

Common use: per-group normalization.

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
    "salary": [50000, 80000, 75000, 60000, 90000, 100000],
})

# Salary minus city mean
df["salary_z"] = df["salary"] - df.groupby("city")["salary"].transform("mean")
print(df)

This is great for ML feature engineering — "how much above the city's average is this person earning?"

.filter() — keep entire groups that pass

Keep only cities with > 2 employees:

import pandas as pd

df = pd.DataFrame({
    "city": ["Mumbai","Mumbai","Mumbai","Delhi","Pune","Pune","Pune"],
    "name": ["A","B","C","D","E","F","G"],
})

# Keep groups with size > 2
result = df.groupby("city").filter(lambda g: len(g) > 2)
print(result)

Iterating over groups

For inspection or manual logic:

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi"],
    "salary": [50000, 60000, 75000, 80000],
})

for name, group in df.groupby("city"):
    print(f"=== {name} ===")
    print(group)
    print()

.size() vs .count()

  • size() — number of rows in each group (includes NaN).
  • count() — non-null count per column.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "city":  ["Mumbai","Mumbai","Delhi","Delhi","Delhi"],
    "score": [85, np.nan, 70, 80, np.nan],
})

print("size:")
print(df.groupby("city").size())
print()

print("count (per column):")
print(df.groupby("city").count())

Top-K per group

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city":   rng.choice(["Mumbai","Delhi","Pune"], size=15),
    "name":   [f"User{i}" for i in range(1, 16)],
    "salary": rng.integers(40000, 150000, size=15),
})

# Top-2 highest-paid per city
top2 = (
    df
    .sort_values("salary", ascending=False)
    .groupby("city")
    .head(2)
)
print(top2.sort_values(["city","salary"], ascending=[True, False]))

Pivot table — a shortcut over groupby

import pandas as pd

df = pd.DataFrame({
    "city":   ["Mumbai","Mumbai","Delhi","Delhi","Pune","Pune"],
    "dept":   ["Eng","HR","Eng","HR","Eng","HR"],
    "salary": [100, 80, 90, 70, 110, 60],
})

# Pivot — city as rows, dept as cols, mean salary in cells
print(df.pivot_table(
    index="city",
    columns="dept",
    values="salary",
    aggfunc="mean",
    fill_value=0,
))

More on this in Chapter 12 — Reshaping.

A practical example — sales analysis

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 100

sales = pd.DataFrame({
    "date":     pd.date_range("2025-01-01", periods=n),
    "region":   rng.choice(["North","South","East","West"], size=n),
    "product":  rng.choice(["A","B","C"], size=n),
    "revenue":  rng.integers(100, 1000, size=n),
    "discount": rng.uniform(0, 0.3, size=n).round(2),
})

# Total revenue per region
print("Revenue per region:")
print(sales.groupby("region")["revenue"].sum().sort_values(ascending=False))
print()

# Per (region, product) summary
print("Per region+product (mean revenue):")
print(
    sales.groupby(["region", "product"])
         .agg(avg_revenue=("revenue","mean"), txns=("revenue","size"))
         .round(0)
)

Cheatsheet

Goal Code
Total per group df.groupby("g")["v"].sum()
Mean per group df.groupby("g")["v"].mean()
Multiple aggs df.groupby("g")["v"].agg(["sum","mean","max"])
Per-col aggs df.groupby("g").agg({"a":"sum", "b":"mean"})
Named aggs df.groupby("g").agg(x=("a","mean"), y=("b","sum"))
Group by multi cols df.groupby(["c1","c2"])["v"].sum()
Group-wise transform df.groupby("g")["v"].transform("mean")
Keep groups passing condition df.groupby("g").filter(lambda g: ...)
Top-K per group df.sort_values("v", ascending=False).groupby("g").head(K)
Group size df.groupby("g").size()
Pivot table df.pivot_table(index="r", columns="c", values="v", aggfunc=...)

Common pitfalls

  • groupby returns a special object, not a DataFrame — you need to chain an aggregation: .sum(), .agg(...), etc.
  • Result has group key as index — use .reset_index() to make it a column again.
  • as_index=Falsedf.groupby("g", as_index=False).sum() keeps the key as a column.
  • NaN keys are dropped — pass dropna=False to include them.
  • apply is slow — prefer agg, transform, or vectorized ops.
  • size() vs count()size is row count (NaN included); count is per-column non-null count.

Practice

What does this print?

Expected: 450

import pandas as pd
df = pd.DataFrame({"city": ["A","B","A","B"], "sales": [100, 200, 150, 250]})
print(df.groupby("city")["sales"].sum()["B"])

Get one row per (city, dept) with total salary as a NORMAL DataFrame (city/dept as columns, not the index)

Expected: True

import pandas as pd
df = pd.DataFrame({
    "city": ["Mumbai","Mumbai","Delhi","Delhi"],
    "dept": ["Eng","HR","Eng","HR"],
    "salary": [100, 80, 90, 70],
})
result = df.groupby(["city","dept"])["salary"].sum()        # bug: returns a Series with MultiIndex
print(isinstance(result, pd.DataFrame))

Quiz — Quick check

What you remember

Q1. df.groupby("g") by itself returns…

  • A new DataFrame
  • A list of groups
  • A special GroupBy object — you need to chain an aggregation
  • An error

Why: The split phase has happened; the apply phase hasn't. You must follow up with .sum(), .agg(...), .transform(...), etc.

Q2. What's the difference between agg and transform?

  • No difference
  • agg collapses each group to one row; transform keeps the original shape, broadcasting the group result
  • transform is faster
  • agg is for strings

Why: df.groupby("g")["v"].mean() (with agg) returns one row per group. df.groupby("g")["v"].transform("mean") returns a column with each row showing its group's mean — same length as the original.

Q3. Why use as_index=False in df.groupby("g", as_index=False)?

  • To get the group key as a regular column instead of the index
  • To make grouping faster
  • To handle NaN
  • To allow multiple keys

Why: By default the group key becomes the index of the result. as_index=False keeps it as a column — often more convenient for further operations.

Common doubts

What's a 'named aggregation' and when should I use it?

Named agg syntax: df.groupby("g").agg(avg_sales=("sales", "mean"), n=("sales", "size")). Gives you clean column names without MultiIndex. Strongly preferred over .agg({"sales": ["mean", "size"]}) which produces nested column names.

Why is groupby().apply() so slow?

Because apply calls your function once per group in Python — interpreted, with object overhead. For aggregations, prefer agg or transform (vectorized). Only fall back to apply for genuinely custom per-group logic that can't be expressed otherwise.

How do I aggregate to one row per group with multiple columns AND different aggs per column?

Use named aggregation: df.groupby("g").agg(x=("a", "mean"), y=("b", "sum"), z=("c", "max")). Each output column gets its own aggregation, with a clean name.

What's next

Merge, Join, Concatenate