Merge, Join, Concatenate¶

How to combine multiple DataFrames into one. Three main verbs:

Function	Like SQL...
`pd.concat([df1, df2])`	`UNION` — stack on top or side-by-side
`pd.merge(df1, df2, on="key")`	`JOIN` — match on a column
`df1.join(df2)`	Like merge, but uses the index

`concat` — stack DataFrames¶

Vertical (stack rows)¶

import pandas as pd

df1 = pd.DataFrame({"name": ["Alice","Bob"], "age": [25, 30]})
df2 = pd.DataFrame({"name": ["Carol","Dave"], "age": [35, 40]})

result = pd.concat([df1, df2], ignore_index=True)
print(result)

ignore_index=True gives a fresh 0..N-1 index.

Horizontal (side-by-side)¶

import pandas as pd

a = pd.DataFrame({"name": ["Alice","Bob"]})
b = pd.DataFrame({"age":  [25, 30]})

print(pd.concat([a, b], axis=1))

When columns differ¶

import pandas as pd

df1 = pd.DataFrame({"name":["Alice"], "age":[25]})
df2 = pd.DataFrame({"name":["Bob"], "city":["Delhi"]})

print(pd.concat([df1, df2], ignore_index=True))

Missing columns get NaN.

`merge` — SQL-style joins¶

import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "name":    ["Alice","Bob","Carol","Dave"],
})

orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "user_id":  [1, 1, 2, 3, 5],            # user 5 doesn't exist
    "total":    [200, 350, 100, 500, 175],
})

# Inner join (default) — only rows matching in both
print(pd.merge(users, orders, on="user_id"))

The four join types¶

import pandas as pd

a = pd.DataFrame({"id":[1,2,3,4], "name":["A","B","C","D"]})
b = pd.DataFrame({"id":[3,4,5,6], "score":[80,90,70,60]})

print("INNER (default) — only matching rows:")
print(pd.merge(a, b, on="id"))
print()

print("LEFT — keep all of a, fill missing b with NaN:")
print(pd.merge(a, b, on="id", how="left"))
print()

print("RIGHT — keep all of b:")
print(pd.merge(a, b, on="id", how="right"))
print()

print("OUTER — keep everything from both:")
print(pd.merge(a, b, on="id", how="outer"))

Visualizing joins¶

left.df       right.df
+----+        +----+
| 1  |        | 3  |
| 2  |        | 4  |
| 3  |   ⋈    | 5  |
| 4  |        | 6  |
+----+        +----+

INNER:  {3, 4}                  ← intersection
LEFT :  {1, 2, 3, 4}            ← all from left
RIGHT:  {3, 4, 5, 6}            ← all from right
OUTER:  {1, 2, 3, 4, 5, 6}      ← union

Different column names for the key¶

import pandas as pd

users  = pd.DataFrame({"uid":[1,2,3], "name":["A","B","C"]})
orders = pd.DataFrame({"user_id":[1,1,2], "amount":[100,150,200]})

print(pd.merge(
    users, orders,
    left_on="uid",
    right_on="user_id",
))

Merge on multiple keys¶

import pandas as pd

df1 = pd.DataFrame({
    "city": ["Mumbai","Mumbai","Delhi"],
    "year": [2024, 2025, 2024],
    "pop":  [12, 13, 30],
})
df2 = pd.DataFrame({
    "city": ["Mumbai","Delhi","Delhi"],
    "year": [2024, 2024, 2025],
    "gdp":  [100, 200, 220],
})

print(pd.merge(df1, df2, on=["city","year"], how="outer"))

Handling duplicate column names¶

When both DataFrames have a column not used for joining with the same name, Pandas adds _x and _y:

import pandas as pd

a = pd.DataFrame({"id":[1,2], "name":["Alice","Bob"], "age":[25,30]})
b = pd.DataFrame({"id":[1,2], "name":["Alice","Bob"], "age":[26,31]})    # age differs

print(pd.merge(a, b, on=["id","name"]))
print()

# Custom suffixes
print(pd.merge(a, b, on=["id","name"], suffixes=("_old","_new")))

`indicator=True` — see which side a row came from¶

import pandas as pd

a = pd.DataFrame({"id":[1,2,3,4]})
b = pd.DataFrame({"id":[3,4,5,6]})

result = pd.merge(a, b, on="id", how="outer", indicator=True)
print(result)

The _merge column tells you "left_only", "right_only", or "both" — useful for debugging.

`df.merge` — method form¶

Same as pd.merge but works in a chain:

import pandas as pd

a = pd.DataFrame({"id":[1,2,3], "name":["A","B","C"]})
b = pd.DataFrame({"id":[2,3,4], "score":[80,90,70]})

result = (
    a
    .merge(b, on="id", how="left")
    .fillna({"score": 0})
)
print(result)

`join` — index-based¶

import pandas as pd

a = pd.DataFrame({"name": ["Alice","Bob","Carol"]}, index=[1,2,3])
b = pd.DataFrame({"city": ["Mumbai","Delhi","Pune"]}, index=[2,3,4])

# Joins on the INDEX
print(a.join(b, how="inner"))
print()
print(a.join(b, how="left"))
print()
print(a.join(b, how="outer"))

join is just merge with left_index=True, right_index=True. Use merge when in doubt.

Anti-join — "rows in A not in B"¶

There's no built-in anti-join. Use indicator:

import pandas as pd

active   = pd.DataFrame({"user_id":[1,2,3,4,5]})
churned  = pd.DataFrame({"user_id":[2,4]})

merged = active.merge(churned, on="user_id", how="left", indicator=True)
still_active = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(still_active)

A realistic example — joining 3 tables¶

import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name":    ["Alice","Bob","Carol"],
    "city":    ["Mumbai","Delhi","Pune"],
})

orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id":  [1, 1, 2, 3],
    "product_id":[1001, 1002, 1001, 1003],
    "amount":   [200, 350, 150, 500],
})

products = pd.DataFrame({
    "product_id": [1001, 1002, 1003],
    "product_name":["T-Shirt","Mug","Notebook"],
    "category":   ["apparel","drinkware","stationery"],
})

# Chain merges
report = (
    orders
    .merge(users, on="user_id")
    .merge(products, on="product_id")
    [["name","city","product_name","category","amount"]]
)
print(report)

Cheatsheet¶

Goal	Code
Stack rows	`pd.concat([df1, df2], ignore_index=True)`
Stack columns	`pd.concat([df1, df2], axis=1)`
Inner join	`pd.merge(a, b, on="k")`
Left join	`pd.merge(a, b, on="k", how="left")`
Right join	`... how="right"`
Outer join	`... how="outer"`
Multi-key join	`on=["k1","k2"]`
Different key names	`left_on=..., right_on=...`
Show source side	`indicator=True`
Index-based join	`a.join(b)`
Anti-join	merge + `_merge == "left_only"`
Custom suffixes	`suffixes=("_a","_b")`

Common pitfalls¶

❗ Many-to-many merges — if both sides have duplicates on the key, you get the cartesian product. Often a bug. Use validate="one_to_many" to enforce.
❗ Wrong how — default is inner; you may silently lose rows.
❗ ignore_index=True forgotten on concat — you get duplicate index values.
❗ Different dtypes on the key — int vs str won't match. Cast before merging.
❗ Column-name collisions — Pandas adds _x/_y. Customize with suffixes=.
❗ join vs merge — join is index-based by default. merge defaults to column-based. Easy to confuse.

Practice¶

What does this print?

Expected: 2

import pandas as pd
a = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
b = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})
print(len(pd.merge(a, b, on="id")))      # inner join — only matching ids

Keep ALL users, even those without orders (use LEFT join)

Expected: 4

import pandas as pd
users  = pd.DataFrame({"user_id": [1, 2, 3, 4]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [100, 200, 150]})
result = pd.merge(users, orders, on="user_id")   # bug: inner join drops users 3 and 4
print(len(result))

Quiz — Quick check¶

What you remember

Q1. What does the default pd.merge(a, b, on="id") do?

Inner join — only rows where id exists in both
Outer join
Left join
Concatenates the DataFrames

Why: The default how="inner" returns the intersection. If you want to preserve all rows from one side, use how="left" (most common in practice).

Q2. What's the difference between concat and merge?

No difference
concat stacks DataFrames (vertically or horizontally); merge does relational joins on a key column
concat is faster
merge is deprecated

Why: concat is the UNION operation (just append rows). merge is the JOIN operation (match by key). Different purposes.

Q3. Why use indicator=True on a merge?

Makes the merge faster
Adds a _merge column showing whether each row came from left, right, or both
Validates the keys
Enables outer joins

Why: Great for debugging — "why is this row in my result?" or for anti-joins ("rows only in left, not in right" → filter where _merge == "left_only").

Common doubts¶

How do I avoid silently exploding rows when merging?

Pass validate="one_to_many" (or one_to_one, many_to_one). It raises if the actual relationship doesn't match what you expect. Saves hours of debugging when a duplicated key creates a cartesian product.

Why are my dtypes mismatched after merging?

Most often: NaN got introduced (left join with no match) and bumped an int column to float. Fix by filling and casting back: result["c"] = result["c"].fillna(0).astype(int). Or use nullable Int64.

Should I use concat or merge when combining two DataFrames with the same columns?

Use concat to stack rows — same columns, more rows. Use merge to add columns — same rows (matched by key), more columns. They solve different problems.

What's next¶

→ Reshaping — pivot, melt, stack