Reading & Writing Files¶

Real data lives in files. Pandas can read and write most common formats — CSV, JSON, Excel, SQL, Parquet, Pickle — usually with one function call.

CSV — the universal format¶

Read¶

import pandas as pd
from io import StringIO

# Fake a CSV in memory (so this runs in your browser)
csv_data = """name,age,city,salary
Alice,25,Mumbai,50000
Bob,30,Delhi,60000
Carol,35,Pune,75000
Dave,40,Bangalore,90000
"""

df = pd.read_csv(StringIO(csv_data))
print(df)
print("\ndtypes:")
print(df.dtypes)

In real code:

df = pd.read_csv("data.csv")

Common `read_csv` options¶

import pandas as pd
from io import StringIO

data = """id;name;age;joined_at
1;Alice;25;2024-01-15
2;Bob;NA;2024-02-20
3;Carol;35;2024-03-10
"""

df = pd.read_csv(
    StringIO(data),
    sep=";",                           # custom separator (European CSV)
    parse_dates=["joined_at"],         # auto-parse this column as datetime
    na_values=["NA", "?", "-"],        # treat these strings as missing
    dtype={"id": "int32"},              # set column types
    encoding="utf-8",
)
print(df)
print()
print(df.dtypes)

Writing CSV¶

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age":  [25, 30],
})

# Save to CSV
df.to_csv("/tmp/output.csv", index=False)

# Read it back to verify
df2 = pd.read_csv("/tmp/output.csv")
print(df2)

Always pass index=False unless you want the row index saved as a column.

Streaming a large CSV in chunks¶

# For a 10 GB CSV — don't load it all at once
total = 0
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    total += chunk["amount"].sum()
print(total)

JSON¶

import pandas as pd
from io import StringIO

json_data = """[
    {"name": "Alice", "age": 25, "city": "Mumbai"},
    {"name": "Bob",   "age": 30, "city": "Delhi"},
    {"name": "Carol", "age": 35, "city": "Pune"}
]"""

df = pd.read_json(StringIO(json_data))
print(df)

Save back:

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

df.to_json("/tmp/out.json", orient="records", indent=2)

# Read it back
with open("/tmp/out.json") as f:
    print(f.read())

Nested JSON — `pd.json_normalize`¶

For deeply-nested JSON, flatten with json_normalize:

import pandas as pd

records = [
    {"id": 1, "name": "Alice", "addr": {"city": "Mumbai", "pin": "400001"}},
    {"id": 2, "name": "Bob",   "addr": {"city": "Delhi",  "pin": "110001"}},
]
df = pd.json_normalize(records)
print(df)

Excel — xlsx files¶

# Requires `openpyxl`:  pip install openpyxl
df = pd.read_excel("data.xlsx")
df = pd.read_excel("data.xlsx", sheet_name="Q1")
df = pd.read_excel("data.xlsx", sheet_name=["Q1", "Q2"])  # all listed sheets → dict

df.to_excel("output.xlsx", index=False)

Excel is slow compared to CSV/Parquet. For large data, convert to Parquet.

SQL — read from a database¶

from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pwd@host:5432/dbname")
# or: create_engine("sqlite:///local.db")

df = pd.read_sql("SELECT * FROM users WHERE age > 25", engine)
df = pd.read_sql_table("users", engine)

# Write back
df.to_sql("users_clean", engine, if_exists="replace", index=False)

Always parameterize to avoid SQL injection:

df = pd.read_sql(
    "SELECT * FROM users WHERE city = %(c)s",
    engine,
    params={"c": "Mumbai"},
)

Parquet — the modern columnar format¶

Faster and much smaller than CSV. The right choice for repeated reads.

# Requires pyarrow:  pip install pyarrow

df.to_parquet("data.parquet")
df = pd.read_parquet("data.parquet")

Format	1M rows × 10 cols	Read speed
CSV	~120 MB	slow
JSON	~200 MB	very slow
Parquet	~15 MB	very fast
Pickle	~80 MB	fast (but Python-only)

For any dataset you'll re-read more than once, convert once to Parquet.

Pickle — Python's native format¶

Saves any DataFrame exactly, including types like dates, categoricals, custom objects.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

df.to_pickle("/tmp/df.pkl")
df2 = pd.read_pickle("/tmp/df.pkl")
print(df2)

Don't load pickles from untrusted sources — they can execute arbitrary code.

URLs work too¶

Most read functions accept a URL directly:

url = "https://raw.githubusercontent.com/datasets/iris/master/data/iris.csv"
df = pd.read_csv(url)

Auto-detect column types¶

import pandas as pd
from io import StringIO

data = """id,name,score,joined,active
1,Alice,85.5,2024-01-15,true
2,Bob,72.3,2024-02-20,false
3,Carol,91.0,2024-03-10,true
"""

df = pd.read_csv(StringIO(data), parse_dates=["joined"])
print(df.dtypes)

Pandas detected	dtype
`1`, `2`, `3`	`int64`
`85.5`, `72.3`	`float64`
`Alice`, `Bob`	`object` (Python string)
`2024-01-15` (with `parse_dates`)	`datetime64[ns]`
`true`, `false`	`object` (we'd need to map manually for `bool`)

Reading multiple files¶

import glob
import pandas as pd

# Combine many CSVs into one DataFrame
files = glob.glob("logs/2025-*.csv")
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(df.shape)

Quick comparison¶

Format	Pros	Cons	When
CSV	Universal, human-readable	Slow, no types	sharing data, small files
JSON	Nested data, web APIs	Slow, verbose	API responses
Excel	Business-friendly	Slow, requires openpyxl	reports
SQL	Live data	Needs DB driver	warehouses, production data
Parquet	Fast, small, typed	Binary, needs pyarrow	re-reading datasets
Pickle	Saves anything	Python-only, unsafe	quick local caching

Common pitfalls¶

❗ Encoding errors — set encoding="utf-8" or encoding="latin-1" for legacy data.
❗ Wrong separator — European CSVs use ;. Always pass sep= explicitly.
❗ Date columns parsed as strings — pass parse_dates=[...] to read_csv.
❗ index=False forgotten on to_csv — saves the row index as an extra column.
❗ Loading huge CSVs into RAM — use chunksize or convert to Parquet.
❗ SQL injection — never f-string user input into a query. Use params=.

Practice¶

What does this print?

Expected: a b

import pandas as pd
from io import StringIO
csv = "a,b\n1,2\n3,4"
print(pd.read_csv(StringIO(csv)))

Save to CSV without the row index appearing as a column

Expected: name,age

import pandas as pd
df = pd.DataFrame({"name": ["Alice"], "age": [25]})
df.to_csv("/tmp/out.csv")            # bug: writes an Unnamed: 0 index column
with open("/tmp/out.csv") as f:
    print(f.readline().strip())

Quiz — Quick check¶

What you remember

Q1. Why pass parse_dates=["col"] to read_csv?

To validate the data
To automatically convert that column to datetime64 instead of a string
To sort by date
To handle timezones

Why: Without parse_dates, date-looking strings are loaded as object (string) and you can't use .dt accessors or do date arithmetic.

Q2. Which format is smallest and fastest to re-read?

CSV
JSON
Parquet
Excel

Why: Parquet is a columnar binary format — typically 5–10× smaller than CSV and many times faster to read. Use it for any dataset you'll touch more than once.

Q3. Reading a 10 GB CSV — best approach?

Buy more RAM
Stream in chunks with pd.read_csv("big.csv", chunksize=100_000) and process each chunk
Use Excel
Pickle it first

Why: chunksize returns an iterator of DataFrames. You process and aggregate one chunk at a time without ever loading the whole file. For repeated access, convert once to Parquet.

Common doubts¶

Why should I never load pickle files from untrusted sources?

pickle.load() can execute arbitrary Python code during deserialization. A malicious pickle file can install malware. Only use pickle for files you produced yourself. For cross-team sharing, use Parquet, JSON, or CSV.

What's the right way to parameterize SQL queries?

Use params= in pd.read_sql. Never f-string user input into a query — that's textbook SQL injection. pd.read_sql("SELECT * FROM t WHERE id = %(i)s", conn, params={"i": user_id}) is safe.

Which encoding do I use when read_csv complains?

Try encoding="utf-8" first (the default). If you see UnicodeDecodeError, try encoding="latin-1" (almost any byte is valid latin-1), or encoding="cp1252" (common for Windows-exported files), or encoding="utf-16" (Excel-exported tab-separated files).

What's next¶

→ Inspecting Data