Skip to content

Merge, Join, Concatenate

How to combine multiple DataFrames into one. Three main verbs:

Function Like SQL...
pd.concat([df1, df2]) UNION — stack on top or side-by-side
pd.merge(df1, df2, on="key") JOIN — match on a column
df1.join(df2) Like merge, but uses the index

concat — stack DataFrames

Vertical (stack rows)

import pandas as pd

df1 = pd.DataFrame({"name": ["Alice","Bob"], "age": [25, 30]})
df2 = pd.DataFrame({"name": ["Carol","Dave"], "age": [35, 40]})

result = pd.concat([df1, df2], ignore_index=True)
print(result)

ignore_index=True gives a fresh 0..N-1 index.

Horizontal (side-by-side)

import pandas as pd

a = pd.DataFrame({"name": ["Alice","Bob"]})
b = pd.DataFrame({"age":  [25, 30]})

print(pd.concat([a, b], axis=1))

When columns differ

import pandas as pd

df1 = pd.DataFrame({"name":["Alice"], "age":[25]})
df2 = pd.DataFrame({"name":["Bob"], "city":["Delhi"]})

print(pd.concat([df1, df2], ignore_index=True))

Missing columns get NaN.

merge — SQL-style joins

import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "name":    ["Alice","Bob","Carol","Dave"],
})

orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "user_id":  [1, 1, 2, 3, 5],            # user 5 doesn't exist
    "total":    [200, 350, 100, 500, 175],
})

# Inner join (default) — only rows matching in both
print(pd.merge(users, orders, on="user_id"))

The four join types

import pandas as pd

a = pd.DataFrame({"id":[1,2,3,4], "name":["A","B","C","D"]})
b = pd.DataFrame({"id":[3,4,5,6], "score":[80,90,70,60]})

print("INNER (default) — only matching rows:")
print(pd.merge(a, b, on="id"))
print()

print("LEFT — keep all of a, fill missing b with NaN:")
print(pd.merge(a, b, on="id", how="left"))
print()

print("RIGHT — keep all of b:")
print(pd.merge(a, b, on="id", how="right"))
print()

print("OUTER — keep everything from both:")
print(pd.merge(a, b, on="id", how="outer"))

Visualizing joins

left.df       right.df
+----+        +----+
| 1  |        | 3  |
| 2  |        | 4  |
| 3  |   ⋈    | 5  |
| 4  |        | 6  |
+----+        +----+

INNER:  {3, 4}                  ← intersection
LEFT :  {1, 2, 3, 4}            ← all from left
RIGHT:  {3, 4, 5, 6}            ← all from right
OUTER:  {1, 2, 3, 4, 5, 6}      ← union

Different column names for the key

import pandas as pd

users  = pd.DataFrame({"uid":[1,2,3], "name":["A","B","C"]})
orders = pd.DataFrame({"user_id":[1,1,2], "amount":[100,150,200]})

print(pd.merge(
    users, orders,
    left_on="uid",
    right_on="user_id",
))

Merge on multiple keys

import pandas as pd

df1 = pd.DataFrame({
    "city": ["Mumbai","Mumbai","Delhi"],
    "year": [2024, 2025, 2024],
    "pop":  [12, 13, 30],
})
df2 = pd.DataFrame({
    "city": ["Mumbai","Delhi","Delhi"],
    "year": [2024, 2024, 2025],
    "gdp":  [100, 200, 220],
})

print(pd.merge(df1, df2, on=["city","year"], how="outer"))

Handling duplicate column names

When both DataFrames have a column not used for joining with the same name, Pandas adds _x and _y:

import pandas as pd

a = pd.DataFrame({"id":[1,2], "name":["Alice","Bob"], "age":[25,30]})
b = pd.DataFrame({"id":[1,2], "name":["Alice","Bob"], "age":[26,31]})    # age differs

print(pd.merge(a, b, on=["id","name"]))
print()

# Custom suffixes
print(pd.merge(a, b, on=["id","name"], suffixes=("_old","_new")))

indicator=True — see which side a row came from

import pandas as pd

a = pd.DataFrame({"id":[1,2,3,4]})
b = pd.DataFrame({"id":[3,4,5,6]})

result = pd.merge(a, b, on="id", how="outer", indicator=True)
print(result)

The _merge column tells you "left_only", "right_only", or "both" — useful for debugging.

df.merge — method form

Same as pd.merge but works in a chain:

import pandas as pd

a = pd.DataFrame({"id":[1,2,3], "name":["A","B","C"]})
b = pd.DataFrame({"id":[2,3,4], "score":[80,90,70]})

result = (
    a
    .merge(b, on="id", how="left")
    .fillna({"score": 0})
)
print(result)

join — index-based

import pandas as pd

a = pd.DataFrame({"name": ["Alice","Bob","Carol"]}, index=[1,2,3])
b = pd.DataFrame({"city": ["Mumbai","Delhi","Pune"]}, index=[2,3,4])

# Joins on the INDEX
print(a.join(b, how="inner"))
print()
print(a.join(b, how="left"))
print()
print(a.join(b, how="outer"))

join is just merge with left_index=True, right_index=True. Use merge when in doubt.

Anti-join — "rows in A not in B"

There's no built-in anti-join. Use indicator:

import pandas as pd

active   = pd.DataFrame({"user_id":[1,2,3,4,5]})
churned  = pd.DataFrame({"user_id":[2,4]})

merged = active.merge(churned, on="user_id", how="left", indicator=True)
still_active = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(still_active)

A realistic example — joining 3 tables

import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name":    ["Alice","Bob","Carol"],
    "city":    ["Mumbai","Delhi","Pune"],
})

orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id":  [1, 1, 2, 3],
    "product_id":[1001, 1002, 1001, 1003],
    "amount":   [200, 350, 150, 500],
})

products = pd.DataFrame({
    "product_id": [1001, 1002, 1003],
    "product_name":["T-Shirt","Mug","Notebook"],
    "category":   ["apparel","drinkware","stationery"],
})

# Chain merges
report = (
    orders
    .merge(users, on="user_id")
    .merge(products, on="product_id")
    [["name","city","product_name","category","amount"]]
)
print(report)

Cheatsheet

Goal Code
Stack rows pd.concat([df1, df2], ignore_index=True)
Stack columns pd.concat([df1, df2], axis=1)
Inner join pd.merge(a, b, on="k")
Left join pd.merge(a, b, on="k", how="left")
Right join ... how="right"
Outer join ... how="outer"
Multi-key join on=["k1","k2"]
Different key names left_on=..., right_on=...
Show source side indicator=True
Index-based join a.join(b)
Anti-join merge + _merge == "left_only"
Custom suffixes suffixes=("_a","_b")

Common pitfalls

  • Many-to-many merges — if both sides have duplicates on the key, you get the cartesian product. Often a bug. Use validate="one_to_many" to enforce.
  • Wrong how — default is inner; you may silently lose rows.
  • ignore_index=True forgotten on concat — you get duplicate index values.
  • Different dtypes on the keyint vs str won't match. Cast before merging.
  • Column-name collisions — Pandas adds _x/_y. Customize with suffixes=.
  • join vs mergejoin is index-based by default. merge defaults to column-based. Easy to confuse.

Practice

What does this print?

Expected: 2

import pandas as pd
a = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
b = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})
print(len(pd.merge(a, b, on="id")))      # inner join — only matching ids

Keep ALL users, even those without orders (use LEFT join)

Expected: 4

import pandas as pd
users  = pd.DataFrame({"user_id": [1, 2, 3, 4]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [100, 200, 150]})
result = pd.merge(users, orders, on="user_id")   # bug: inner join drops users 3 and 4
print(len(result))

Quiz — Quick check

What you remember

Q1. What does the default pd.merge(a, b, on="id") do?

  • Inner join — only rows where id exists in both
  • Outer join
  • Left join
  • Concatenates the DataFrames

Why: The default how="inner" returns the intersection. If you want to preserve all rows from one side, use how="left" (most common in practice).

Q2. What's the difference between concat and merge?

  • No difference
  • concat stacks DataFrames (vertically or horizontally); merge does relational joins on a key column
  • concat is faster
  • merge is deprecated

Why: concat is the UNION operation (just append rows). merge is the JOIN operation (match by key). Different purposes.

Q3. Why use indicator=True on a merge?

  • Makes the merge faster
  • Adds a _merge column showing whether each row came from left, right, or both
  • Validates the keys
  • Enables outer joins

Why: Great for debugging — "why is this row in my result?" or for anti-joins ("rows only in left, not in right" → filter where _merge == "left_only").

Common doubts

How do I avoid silently exploding rows when merging?

Pass validate="one_to_many" (or one_to_one, many_to_one). It raises if the actual relationship doesn't match what you expect. Saves hours of debugging when a duplicated key creates a cartesian product.

Why are my dtypes mismatched after merging?

Most often: NaN got introduced (left join with no match) and bumped an int column to float. Fix by filling and casting back: result["c"] = result["c"].fillna(0).astype(int). Or use nullable Int64.

Should I use concat or merge when combining two DataFrames with the same columns?

Use concat to stack rows — same columns, more rows. Use merge to add columns — same rows (matched by key), more columns. They solve different problems.

What's next

Reshaping — pivot, melt, stack