Skip to content

Reading & Writing Files

Real data lives in files. Pandas can read and write most common formats — CSV, JSON, Excel, SQL, Parquet, Pickle — usually with one function call.

CSV — the universal format

Read

import pandas as pd
from io import StringIO

# Fake a CSV in memory (so this runs in your browser)
csv_data = """name,age,city,salary
Alice,25,Mumbai,50000
Bob,30,Delhi,60000
Carol,35,Pune,75000
Dave,40,Bangalore,90000
"""

df = pd.read_csv(StringIO(csv_data))
print(df)
print("\ndtypes:")
print(df.dtypes)

In real code:

df = pd.read_csv("data.csv")

Common read_csv options

import pandas as pd
from io import StringIO

data = """id;name;age;joined_at
1;Alice;25;2024-01-15
2;Bob;NA;2024-02-20
3;Carol;35;2024-03-10
"""

df = pd.read_csv(
    StringIO(data),
    sep=";",                           # custom separator (European CSV)
    parse_dates=["joined_at"],         # auto-parse this column as datetime
    na_values=["NA", "?", "-"],        # treat these strings as missing
    dtype={"id": "int32"},              # set column types
    encoding="utf-8",
)
print(df)
print()
print(df.dtypes)

Writing CSV

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age":  [25, 30],
})

# Save to CSV
df.to_csv("/tmp/output.csv", index=False)

# Read it back to verify
df2 = pd.read_csv("/tmp/output.csv")
print(df2)

Always pass index=False unless you want the row index saved as a column.

Streaming a large CSV in chunks

# For a 10 GB CSV — don't load it all at once
total = 0
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    total += chunk["amount"].sum()
print(total)

JSON

import pandas as pd
from io import StringIO

json_data = """[
    {"name": "Alice", "age": 25, "city": "Mumbai"},
    {"name": "Bob",   "age": 30, "city": "Delhi"},
    {"name": "Carol", "age": 35, "city": "Pune"}
]"""

df = pd.read_json(StringIO(json_data))
print(df)

Save back:

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

df.to_json("/tmp/out.json", orient="records", indent=2)

# Read it back
with open("/tmp/out.json") as f:
    print(f.read())

Nested JSON — pd.json_normalize

For deeply-nested JSON, flatten with json_normalize:

import pandas as pd

records = [
    {"id": 1, "name": "Alice", "addr": {"city": "Mumbai", "pin": "400001"}},
    {"id": 2, "name": "Bob",   "addr": {"city": "Delhi",  "pin": "110001"}},
]
df = pd.json_normalize(records)
print(df)

Excel — xlsx files

# Requires `openpyxl`:  pip install openpyxl
df = pd.read_excel("data.xlsx")
df = pd.read_excel("data.xlsx", sheet_name="Q1")
df = pd.read_excel("data.xlsx", sheet_name=["Q1", "Q2"])  # all listed sheets → dict

df.to_excel("output.xlsx", index=False)

Excel is slow compared to CSV/Parquet. For large data, convert to Parquet.

SQL — read from a database

from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pwd@host:5432/dbname")
# or: create_engine("sqlite:///local.db")

df = pd.read_sql("SELECT * FROM users WHERE age > 25", engine)
df = pd.read_sql_table("users", engine)

# Write back
df.to_sql("users_clean", engine, if_exists="replace", index=False)

Always parameterize to avoid SQL injection:

df = pd.read_sql(
    "SELECT * FROM users WHERE city = %(c)s",
    engine,
    params={"c": "Mumbai"},
)

Parquet — the modern columnar format

Faster and much smaller than CSV. The right choice for repeated reads.

# Requires pyarrow:  pip install pyarrow

df.to_parquet("data.parquet")
df = pd.read_parquet("data.parquet")
Format 1M rows × 10 cols Read speed
CSV ~120 MB slow
JSON ~200 MB very slow
Parquet ~15 MB very fast
Pickle ~80 MB fast (but Python-only)

For any dataset you'll re-read more than once, convert once to Parquet.

Pickle — Python's native format

Saves any DataFrame exactly, including types like dates, categoricals, custom objects.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

df.to_pickle("/tmp/df.pkl")
df2 = pd.read_pickle("/tmp/df.pkl")
print(df2)

Don't load pickles from untrusted sources — they can execute arbitrary code.

URLs work too

Most read functions accept a URL directly:

url = "https://raw.githubusercontent.com/datasets/iris/master/data/iris.csv"
df = pd.read_csv(url)

Auto-detect column types

import pandas as pd
from io import StringIO

data = """id,name,score,joined,active
1,Alice,85.5,2024-01-15,true
2,Bob,72.3,2024-02-20,false
3,Carol,91.0,2024-03-10,true
"""

df = pd.read_csv(StringIO(data), parse_dates=["joined"])
print(df.dtypes)
Pandas detected dtype
1, 2, 3 int64
85.5, 72.3 float64
Alice, Bob object (Python string)
2024-01-15 (with parse_dates) datetime64[ns]
true, false object (we'd need to map manually for bool)

Reading multiple files

import glob
import pandas as pd

# Combine many CSVs into one DataFrame
files = glob.glob("logs/2025-*.csv")
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(df.shape)

Quick comparison

Format Pros Cons When
CSV Universal, human-readable Slow, no types sharing data, small files
JSON Nested data, web APIs Slow, verbose API responses
Excel Business-friendly Slow, requires openpyxl reports
SQL Live data Needs DB driver warehouses, production data
Parquet Fast, small, typed Binary, needs pyarrow re-reading datasets
Pickle Saves anything Python-only, unsafe quick local caching

Common pitfalls

  • Encoding errors — set encoding="utf-8" or encoding="latin-1" for legacy data.
  • Wrong separator — European CSVs use ;. Always pass sep= explicitly.
  • Date columns parsed as strings — pass parse_dates=[...] to read_csv.
  • index=False forgotten on to_csv — saves the row index as an extra column.
  • Loading huge CSVs into RAM — use chunksize or convert to Parquet.
  • SQL injection — never f-string user input into a query. Use params=.

Practice

What does this print?

Expected: a b

import pandas as pd
from io import StringIO
csv = "a,b\n1,2\n3,4"
print(pd.read_csv(StringIO(csv)))

Save to CSV without the row index appearing as a column

Expected: name,age

import pandas as pd
df = pd.DataFrame({"name": ["Alice"], "age": [25]})
df.to_csv("/tmp/out.csv")            # bug: writes an Unnamed: 0 index column
with open("/tmp/out.csv") as f:
    print(f.readline().strip())

Quiz — Quick check

What you remember

Q1. Why pass parse_dates=["col"] to read_csv?

  • To validate the data
  • To automatically convert that column to datetime64 instead of a string
  • To sort by date
  • To handle timezones

Why: Without parse_dates, date-looking strings are loaded as object (string) and you can't use .dt accessors or do date arithmetic.

Q2. Which format is smallest and fastest to re-read?

  • CSV
  • JSON
  • Parquet
  • Excel

Why: Parquet is a columnar binary format — typically 5–10× smaller than CSV and many times faster to read. Use it for any dataset you'll touch more than once.

Q3. Reading a 10 GB CSV — best approach?

  • Buy more RAM
  • Stream in chunks with pd.read_csv("big.csv", chunksize=100_000) and process each chunk
  • Use Excel
  • Pickle it first

Why: chunksize returns an iterator of DataFrames. You process and aggregate one chunk at a time without ever loading the whole file. For repeated access, convert once to Parquet.

Common doubts

Why should I never load pickle files from untrusted sources?

pickle.load() can execute arbitrary Python code during deserialization. A malicious pickle file can install malware. Only use pickle for files you produced yourself. For cross-team sharing, use Parquet, JSON, or CSV.

What's the right way to parameterize SQL queries?

Use params= in pd.read_sql. Never f-string user input into a query — that's textbook SQL injection. pd.read_sql("SELECT * FROM t WHERE id = %(i)s", conn, params={"i": user_id}) is safe.

Which encoding do I use when read_csv complains?

Try encoding="utf-8" first (the default). If you see UnicodeDecodeError, try encoding="latin-1" (almost any byte is valid latin-1), or encoding="cp1252" (common for Windows-exported files), or encoding="utf-16" (Excel-exported tab-separated files).

What's next

Inspecting Data