Real-World Examples¶
Five small end-to-end projects combining everything from chapters 1–13.
1. Cleaning a messy customer dataset¶
import pandas as pd
import numpy as np
# Messy raw data
df = pd.DataFrame({
"Name": ["alice", " bob ", "Carol", "DAVE", "alice", None, "eve"],
"Email": ["alice@gmail.com", "BOB@YAHOO.COM", "carol@?", "dave@gmail.com",
"alice@gmail.com", "frank@x.com", None],
"Age": [25, "thirty", 35, np.nan, 25, 40, "?"],
"Joined": ["2024-01-15","2024/02/20","20240310","2024-04-05","2024-01-15","2024-05-15","2024-06-01"],
})
print("Before:")
print(df)
print()
# 1. Normalize names — strip whitespace, title case
df["Name"] = df["Name"].str.strip().str.title()
# 2. Normalize emails — lowercase
df["Email"] = df["Email"].str.lower()
# 3. Coerce age to numeric — invalid → NaN
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
# 4. Parse dates with multiple formats
df["Joined"] = pd.to_datetime(df["Joined"], format="mixed", errors="coerce")
# 5. Drop rows missing critical fields
df = df.dropna(subset=["Name", "Email"])
# 6. De-duplicate
df = df.drop_duplicates(subset=["Email"])
# 7. Fill missing age with median
df["Age"] = df["Age"].fillna(df["Age"].median())
# 8. Validate emails — drop bad ones
df = df[df["Email"].str.match(r"^[\w\.-]+@[\w\.-]+\.\w+$", na=False)]
# 9. Reset index
df = df.reset_index(drop=True)
print("After:")
print(df)
2. Sales analytics — group, aggregate, rank¶
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
n = 500
sales = pd.DataFrame({
"date": pd.date_range("2025-01-01", periods=n, freq="h"),
"region": rng.choice(["North","South","East","West"], size=n),
"product": rng.choice(["A","B","C","D"], size=n),
"qty": rng.integers(1, 10, size=n),
"price": rng.uniform(10, 100, size=n).round(2),
})
sales["revenue"] = sales["qty"] * sales["price"]
sales["month"] = sales["date"].dt.to_period("M")
# Top 3 products per region
top = (
sales
.groupby(["region", "product"], as_index=False)
.agg(total_revenue=("revenue","sum"))
.sort_values(["region","total_revenue"], ascending=[True, False])
.groupby("region")
.head(3)
)
print("Top 3 products per region:")
print(top)
print()
# Monthly summary
monthly = (
sales
.groupby("month")
.agg(
revenue = ("revenue","sum"),
orders = ("revenue","size"),
avg_ord = ("revenue","mean"),
)
.round(0)
)
print("Monthly summary:")
print(monthly)
3. Joining + analyzing customer orders¶
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
users = pd.DataFrame({
"user_id": range(1, 11),
"name": [f"User{i}" for i in range(1, 11)],
"city": rng.choice(["Mumbai","Delhi","Pune"], size=10),
"signup": pd.date_range("2024-01-01", periods=10, freq="W"),
})
orders = pd.DataFrame({
"order_id": range(101, 131),
"user_id": rng.integers(1, 11, size=30),
"amount": rng.uniform(100, 1000, size=30).round(2),
"date": pd.date_range("2025-01-01", periods=30, freq="2D"),
})
# Per-user spending summary
user_summary = (
orders
.groupby("user_id", as_index=False)
.agg(
order_count = ("amount","size"),
total_spent = ("amount","sum"),
avg_order = ("amount","mean"),
)
.round(2)
)
# Join with users for context
report = users.merge(user_summary, on="user_id", how="left").fillna(0)
print(report)
print()
# Per-city totals
print("Per-city totals:")
print(
report
.groupby("city")
.agg(
customers = ("user_id","count"),
revenue = ("total_spent","sum"),
avg_revenue = ("total_spent","mean"),
)
.round(2)
)
4. Time-series — daily traffic with rolling averages¶
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
n = 90
df = pd.DataFrame({
"visits": np.maximum(0, (1000 + 100*np.sin(np.linspace(0, 6, n)) + rng.normal(0, 80, n))).astype(int),
}, index=pd.date_range("2025-01-01", periods=n, freq="D"))
# 7-day moving average to smooth out daily noise
df["ma7"] = df["visits"].rolling(7).mean().round(0)
df["ma30"] = df["visits"].rolling(30).mean().round(0)
# Day-of-week effect
df["weekday"] = df.index.day_name()
by_weekday = df.groupby("weekday")["visits"].mean().round(0).reindex([
"Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"
])
print("Last 10 days:")
print(df.tail(10))
print()
print("Average by weekday:")
print(by_weekday)
5. Pivot table for an executive report¶
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
"date": pd.date_range("2025-01-01", periods=n, freq="3D"),
"country": rng.choice(["IN","US","UK"], size=n),
"channel": rng.choice(["organic","paid","social"], size=n),
"revenue": rng.integers(100, 1000, size=n),
})
df["month"] = df["date"].dt.to_period("M")
# Monthly revenue by country × channel
report = df.pivot_table(
index="month",
columns=["country", "channel"],
values="revenue",
aggfunc="sum",
fill_value=0,
margins=True,
margins_name="TOTAL",
)
# Flatten the MultiIndex columns for clean output
report.columns = [f"{c}_{ch}" if c != "" else ch for c, ch in report.columns]
print(report)
6. Bonus — find churn risk customers¶
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
n = 50
customers = pd.DataFrame({
"user_id": range(1, n+1),
"name": [f"User{i}" for i in range(1, n+1)],
"signed_up": pd.date_range("2024-01-01", periods=n, freq="D"),
"last_login": pd.date_range("2024-12-01", periods=n, freq="W"),
"monthly_plan": rng.choice(["free","basic","pro"], size=n, p=[0.5, 0.3, 0.2]),
})
# Days since last login
today = pd.Timestamp("2025-06-01")
customers["days_inactive"] = (today - customers["last_login"]).dt.days
# Churn risk = high inactivity AND on a paid plan
at_risk = customers[
(customers["days_inactive"] > 60) &
(customers["monthly_plan"] != "free")
].sort_values("days_inactive", ascending=False)
print(f"At-risk customers ({len(at_risk)} found):")
print(at_risk[["name", "monthly_plan", "days_inactive"]].head(15))
# By plan
print("\nAt-risk count by plan:")
print(at_risk["monthly_plan"].value_counts())
Pandas — what to do next¶
You've covered the working core of Pandas:
- ✅ Creating Series and DataFrames from any source
- ✅ Loading CSV/JSON/Excel/SQL/Parquet
- ✅ Inspecting with head/info/describe/dtypes/isna
- ✅ Selecting with
[],.loc,.iloc,.query - ✅ Filtering with conditions,
.isin,.between,.str,.query - ✅ Modifying columns — assign, rename, drop, apply, map, cut, dummies
- ✅ Missing data —
isna,fillna,dropna,interpolate, group-wise fill - ✅ Sorting —
sort_values,nlargest,rank - ✅ GroupBy —
agg,transform,filter, named aggs, top-K per group - ✅ Merge / Concat — inner/left/right/outer, multi-key, indicator
- ✅ Reshape —
pivot,pivot_table,melt,stack/unstack,crosstab,explode - ✅ Time series — date_range,
dt, resample, rolling, shift
Where to go next¶
- NumPy — what Pandas runs on. Knowing NumPy lets you write faster custom logic.
- Machine Learning — sklearn assumes Pandas DataFrames. You're ready.
- Matplotlib / Seaborn / Plotly — for charts. Pandas integrates seamlessly:
df.plot(...). - Polars — a faster, newer Pandas-like library worth knowing about.
Practice ideas¶
- Download a CSV from Kaggle and reproduce one of the examples above on it.
- Pick a public API, load it with
pd.read_json, and analyze. - Build a notebook that cleans one of your own messy datasets end-to-end.
Practice¶
What does this print?
Expected: 2
Reset the index so it's 0..N-1 after filtering
Expected: 0
Quiz — Quick check¶
What you remember
Q1. What's the most common first step in a real Pandas pipeline?
-
df.plot() - Inspecting with
head,info,describe,isna().sum()to understand the shape and quality - Saving to Parquet
- Setting a datetime index
Why: You can't transform data well if you don't understand it. Always profile first. This routine catches the gotchas (missing values, wrong dtypes, duplicates) before they bite you downstream.
Q2. When chaining several operations (df.merge(...).groupby(...).agg(...)), why prefer this over creating intermediate variables?
- Faster
- Reads as a pipeline — clearer intent, no leftover temp names cluttering scope
- Required by Pandas
- Avoids copies
Why: Chaining communicates "this is one transformation". Heavily-named temp variables suggest the author wanted them for debugging — fine while developing, but production code is cleaner chained.
Q3. What's the right way to validate that a merge produced the expected row count?
- Print the result
- Pass
validate=to merge ("one_to_one","one_to_many", etc.) — raises if reality differs from your assumption - Use
assert len(result) == ... - No way to check
Why:
validate=is built into merge for exactly this. It catches duplicate-key bugs that would otherwise inflate your data silently.
Common doubts¶
When should I move from Pandas to Polars / DuckDB / Dask?
When Pandas starts hurting: row counts > 10 million on a single machine, multi-machine work, or when query times become annoying. Polars is the easiest jump (similar API, much faster). DuckDB is amazing for ad-hoc SQL over Parquet files. Dask is for "Pandas at scale" if you must stay in the API.
How do I handle truly messy real-world data?
Step-by-step. Don't try to do everything in one chain. (1) Load. (2) Inspect (head, info, isna). (3) Fix obvious things one at a time (dtypes, NaN markers). (4) De-duplicate. (5) Validate (do counts match what you expect?). (6) Re-inspect. Cleaning is iterative, not linear.
Is it OK to chain 20 operations in a single pipeline?
Yes — Pandas pipelines often run that long. The convention is to put each step on its own line for readability. If a step needs explanation, add a comment above it. Add .pipe(my_function) to factor out reusable transformations.