Merge, Join, Concatenate¶
How to combine multiple DataFrames into one. Three main verbs:
| Function | Like SQL... |
|---|---|
pd.concat([df1, df2]) |
UNION — stack on top or side-by-side |
pd.merge(df1, df2, on="key") |
JOIN — match on a column |
df1.join(df2) |
Like merge, but uses the index |
concat — stack DataFrames¶
Vertical (stack rows)¶
import pandas as pd
df1 = pd.DataFrame({"name": ["Alice","Bob"], "age": [25, 30]})
df2 = pd.DataFrame({"name": ["Carol","Dave"], "age": [35, 40]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)
ignore_index=True gives a fresh 0..N-1 index.
Horizontal (side-by-side)¶
import pandas as pd
a = pd.DataFrame({"name": ["Alice","Bob"]})
b = pd.DataFrame({"age": [25, 30]})
print(pd.concat([a, b], axis=1))
When columns differ¶
import pandas as pd
df1 = pd.DataFrame({"name":["Alice"], "age":[25]})
df2 = pd.DataFrame({"name":["Bob"], "city":["Delhi"]})
print(pd.concat([df1, df2], ignore_index=True))
Missing columns get NaN.
merge — SQL-style joins¶
import pandas as pd
users = pd.DataFrame({
"user_id": [1, 2, 3, 4],
"name": ["Alice","Bob","Carol","Dave"],
})
orders = pd.DataFrame({
"order_id": [101, 102, 103, 104, 105],
"user_id": [1, 1, 2, 3, 5], # user 5 doesn't exist
"total": [200, 350, 100, 500, 175],
})
# Inner join (default) — only rows matching in both
print(pd.merge(users, orders, on="user_id"))
The four join types¶
import pandas as pd
a = pd.DataFrame({"id":[1,2,3,4], "name":["A","B","C","D"]})
b = pd.DataFrame({"id":[3,4,5,6], "score":[80,90,70,60]})
print("INNER (default) — only matching rows:")
print(pd.merge(a, b, on="id"))
print()
print("LEFT — keep all of a, fill missing b with NaN:")
print(pd.merge(a, b, on="id", how="left"))
print()
print("RIGHT — keep all of b:")
print(pd.merge(a, b, on="id", how="right"))
print()
print("OUTER — keep everything from both:")
print(pd.merge(a, b, on="id", how="outer"))
Visualizing joins¶
left.df right.df
+----+ +----+
| 1 | | 3 |
| 2 | | 4 |
| 3 | ⋈ | 5 |
| 4 | | 6 |
+----+ +----+
INNER: {3, 4} ← intersection
LEFT : {1, 2, 3, 4} ← all from left
RIGHT: {3, 4, 5, 6} ← all from right
OUTER: {1, 2, 3, 4, 5, 6} ← union
Different column names for the key¶
import pandas as pd
users = pd.DataFrame({"uid":[1,2,3], "name":["A","B","C"]})
orders = pd.DataFrame({"user_id":[1,1,2], "amount":[100,150,200]})
print(pd.merge(
users, orders,
left_on="uid",
right_on="user_id",
))
Merge on multiple keys¶
import pandas as pd
df1 = pd.DataFrame({
"city": ["Mumbai","Mumbai","Delhi"],
"year": [2024, 2025, 2024],
"pop": [12, 13, 30],
})
df2 = pd.DataFrame({
"city": ["Mumbai","Delhi","Delhi"],
"year": [2024, 2024, 2025],
"gdp": [100, 200, 220],
})
print(pd.merge(df1, df2, on=["city","year"], how="outer"))
Handling duplicate column names¶
When both DataFrames have a column not used for joining with the same name, Pandas adds _x and _y:
import pandas as pd
a = pd.DataFrame({"id":[1,2], "name":["Alice","Bob"], "age":[25,30]})
b = pd.DataFrame({"id":[1,2], "name":["Alice","Bob"], "age":[26,31]}) # age differs
print(pd.merge(a, b, on=["id","name"]))
print()
# Custom suffixes
print(pd.merge(a, b, on=["id","name"], suffixes=("_old","_new")))
indicator=True — see which side a row came from¶
import pandas as pd
a = pd.DataFrame({"id":[1,2,3,4]})
b = pd.DataFrame({"id":[3,4,5,6]})
result = pd.merge(a, b, on="id", how="outer", indicator=True)
print(result)
The _merge column tells you "left_only", "right_only", or "both" — useful for debugging.
df.merge — method form¶
Same as pd.merge but works in a chain:
import pandas as pd
a = pd.DataFrame({"id":[1,2,3], "name":["A","B","C"]})
b = pd.DataFrame({"id":[2,3,4], "score":[80,90,70]})
result = (
a
.merge(b, on="id", how="left")
.fillna({"score": 0})
)
print(result)
join — index-based¶
import pandas as pd
a = pd.DataFrame({"name": ["Alice","Bob","Carol"]}, index=[1,2,3])
b = pd.DataFrame({"city": ["Mumbai","Delhi","Pune"]}, index=[2,3,4])
# Joins on the INDEX
print(a.join(b, how="inner"))
print()
print(a.join(b, how="left"))
print()
print(a.join(b, how="outer"))
join is just merge with left_index=True, right_index=True. Use merge when in doubt.
Anti-join — "rows in A not in B"¶
There's no built-in anti-join. Use indicator:
import pandas as pd
active = pd.DataFrame({"user_id":[1,2,3,4,5]})
churned = pd.DataFrame({"user_id":[2,4]})
merged = active.merge(churned, on="user_id", how="left", indicator=True)
still_active = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(still_active)
A realistic example — joining 3 tables¶
import pandas as pd
users = pd.DataFrame({
"user_id": [1, 2, 3],
"name": ["Alice","Bob","Carol"],
"city": ["Mumbai","Delhi","Pune"],
})
orders = pd.DataFrame({
"order_id": [101, 102, 103, 104],
"user_id": [1, 1, 2, 3],
"product_id":[1001, 1002, 1001, 1003],
"amount": [200, 350, 150, 500],
})
products = pd.DataFrame({
"product_id": [1001, 1002, 1003],
"product_name":["T-Shirt","Mug","Notebook"],
"category": ["apparel","drinkware","stationery"],
})
# Chain merges
report = (
orders
.merge(users, on="user_id")
.merge(products, on="product_id")
[["name","city","product_name","category","amount"]]
)
print(report)
Cheatsheet¶
| Goal | Code |
|---|---|
| Stack rows | pd.concat([df1, df2], ignore_index=True) |
| Stack columns | pd.concat([df1, df2], axis=1) |
| Inner join | pd.merge(a, b, on="k") |
| Left join | pd.merge(a, b, on="k", how="left") |
| Right join | ... how="right" |
| Outer join | ... how="outer" |
| Multi-key join | on=["k1","k2"] |
| Different key names | left_on=..., right_on=... |
| Show source side | indicator=True |
| Index-based join | a.join(b) |
| Anti-join | merge + _merge == "left_only" |
| Custom suffixes | suffixes=("_a","_b") |
Common pitfalls¶
- ❗ Many-to-many merges — if both sides have duplicates on the key, you get the cartesian product. Often a bug. Use
validate="one_to_many"to enforce. - ❗ Wrong
how— default isinner; you may silently lose rows. - ❗
ignore_index=Trueforgotten onconcat— you get duplicate index values. - ❗ Different dtypes on the key —
intvsstrwon't match. Cast before merging. - ❗ Column-name collisions — Pandas adds
_x/_y. Customize withsuffixes=. - ❗
joinvsmerge—joinis index-based by default.mergedefaults to column-based. Easy to confuse.
Practice¶
What does this print?
Expected: 2
Keep ALL users, even those without orders (use LEFT join)
Expected: 4
Quiz — Quick check¶
What you remember
Q1. What does the default pd.merge(a, b, on="id") do?
- Inner join — only rows where
idexists in both - Outer join
- Left join
- Concatenates the DataFrames
Why: The default
how="inner"returns the intersection. If you want to preserve all rows from one side, usehow="left"(most common in practice).
Q2. What's the difference between concat and merge?
- No difference
-
concatstacks DataFrames (vertically or horizontally);mergedoes relational joins on a key column -
concatis faster -
mergeis deprecated
Why:
concatis the UNION operation (just append rows).mergeis the JOIN operation (match by key). Different purposes.
Q3. Why use indicator=True on a merge?
- Makes the merge faster
- Adds a
_mergecolumn showing whether each row came from left, right, or both - Validates the keys
- Enables outer joins
Why: Great for debugging — "why is this row in my result?" or for anti-joins ("rows only in left, not in right" → filter where
_merge == "left_only").
Common doubts¶
How do I avoid silently exploding rows when merging?
Pass validate="one_to_many" (or one_to_one, many_to_one). It raises if the actual relationship doesn't match what you expect. Saves hours of debugging when a duplicated key creates a cartesian product.
Why are my dtypes mismatched after merging?
Most often: NaN got introduced (left join with no match) and bumped an int column to float. Fix by filling and casting back: result["c"] = result["c"].fillna(0).astype(int). Or use nullable Int64.
Should I use concat or merge when combining two DataFrames with the same columns?
Use concat to stack rows — same columns, more rows. Use merge to add columns — same rows (matched by key), more columns. They solve different problems.