Reading & Writing Files¶
Real data lives in files. Pandas can read and write most common formats — CSV, JSON, Excel, SQL, Parquet, Pickle — usually with one function call.
CSV — the universal format¶
Read¶
import pandas as pd
from io import StringIO
# Fake a CSV in memory (so this runs in your browser)
csv_data = """name,age,city,salary
Alice,25,Mumbai,50000
Bob,30,Delhi,60000
Carol,35,Pune,75000
Dave,40,Bangalore,90000
"""
df = pd.read_csv(StringIO(csv_data))
print(df)
print("\ndtypes:")
print(df.dtypes)
In real code:
Common read_csv options¶
import pandas as pd
from io import StringIO
data = """id;name;age;joined_at
1;Alice;25;2024-01-15
2;Bob;NA;2024-02-20
3;Carol;35;2024-03-10
"""
df = pd.read_csv(
StringIO(data),
sep=";", # custom separator (European CSV)
parse_dates=["joined_at"], # auto-parse this column as datetime
na_values=["NA", "?", "-"], # treat these strings as missing
dtype={"id": "int32"}, # set column types
encoding="utf-8",
)
print(df)
print()
print(df.dtypes)
Writing CSV¶
import pandas as pd
df = pd.DataFrame({
"name": ["Alice", "Bob"],
"age": [25, 30],
})
# Save to CSV
df.to_csv("/tmp/output.csv", index=False)
# Read it back to verify
df2 = pd.read_csv("/tmp/output.csv")
print(df2)
Always pass index=False unless you want the row index saved as a column.
Streaming a large CSV in chunks¶
# For a 10 GB CSV — don't load it all at once
total = 0
for chunk in pd.read_csv("big.csv", chunksize=100_000):
total += chunk["amount"].sum()
print(total)
JSON¶
import pandas as pd
from io import StringIO
json_data = """[
{"name": "Alice", "age": 25, "city": "Mumbai"},
{"name": "Bob", "age": 30, "city": "Delhi"},
{"name": "Carol", "age": 35, "city": "Pune"}
]"""
df = pd.read_json(StringIO(json_data))
print(df)
Save back:
import pandas as pd
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
df.to_json("/tmp/out.json", orient="records", indent=2)
# Read it back
with open("/tmp/out.json") as f:
print(f.read())
Nested JSON — pd.json_normalize¶
For deeply-nested JSON, flatten with json_normalize:
import pandas as pd
records = [
{"id": 1, "name": "Alice", "addr": {"city": "Mumbai", "pin": "400001"}},
{"id": 2, "name": "Bob", "addr": {"city": "Delhi", "pin": "110001"}},
]
df = pd.json_normalize(records)
print(df)
Excel — xlsx files¶
# Requires `openpyxl`: pip install openpyxl
df = pd.read_excel("data.xlsx")
df = pd.read_excel("data.xlsx", sheet_name="Q1")
df = pd.read_excel("data.xlsx", sheet_name=["Q1", "Q2"]) # all listed sheets → dict
df.to_excel("output.xlsx", index=False)
Excel is slow compared to CSV/Parquet. For large data, convert to Parquet.
SQL — read from a database¶
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pwd@host:5432/dbname")
# or: create_engine("sqlite:///local.db")
df = pd.read_sql("SELECT * FROM users WHERE age > 25", engine)
df = pd.read_sql_table("users", engine)
# Write back
df.to_sql("users_clean", engine, if_exists="replace", index=False)
Always parameterize to avoid SQL injection:
Parquet — the modern columnar format¶
Faster and much smaller than CSV. The right choice for repeated reads.
# Requires pyarrow: pip install pyarrow
df.to_parquet("data.parquet")
df = pd.read_parquet("data.parquet")
| Format | 1M rows × 10 cols | Read speed |
|---|---|---|
| CSV | ~120 MB | slow |
| JSON | ~200 MB | very slow |
| Parquet | ~15 MB | very fast |
| Pickle | ~80 MB | fast (but Python-only) |
For any dataset you'll re-read more than once, convert once to Parquet.
Pickle — Python's native format¶
Saves any DataFrame exactly, including types like dates, categoricals, custom objects.
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df.to_pickle("/tmp/df.pkl")
df2 = pd.read_pickle("/tmp/df.pkl")
print(df2)
Don't load pickles from untrusted sources — they can execute arbitrary code.
URLs work too¶
Most read functions accept a URL directly:
Auto-detect column types¶
import pandas as pd
from io import StringIO
data = """id,name,score,joined,active
1,Alice,85.5,2024-01-15,true
2,Bob,72.3,2024-02-20,false
3,Carol,91.0,2024-03-10,true
"""
df = pd.read_csv(StringIO(data), parse_dates=["joined"])
print(df.dtypes)
| Pandas detected | dtype |
|---|---|
1, 2, 3 |
int64 |
85.5, 72.3 |
float64 |
Alice, Bob |
object (Python string) |
2024-01-15 (with parse_dates) |
datetime64[ns] |
true, false |
object (we'd need to map manually for bool) |
Reading multiple files¶
import glob
import pandas as pd
# Combine many CSVs into one DataFrame
files = glob.glob("logs/2025-*.csv")
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(df.shape)
Quick comparison¶
| Format | Pros | Cons | When |
|---|---|---|---|
| CSV | Universal, human-readable | Slow, no types | sharing data, small files |
| JSON | Nested data, web APIs | Slow, verbose | API responses |
| Excel | Business-friendly | Slow, requires openpyxl | reports |
| SQL | Live data | Needs DB driver | warehouses, production data |
| Parquet | Fast, small, typed | Binary, needs pyarrow | re-reading datasets |
| Pickle | Saves anything | Python-only, unsafe | quick local caching |
Common pitfalls¶
- ❗ Encoding errors — set
encoding="utf-8"orencoding="latin-1"for legacy data. - ❗ Wrong separator — European CSVs use
;. Always passsep=explicitly. - ❗ Date columns parsed as strings — pass
parse_dates=[...]toread_csv. - ❗
index=Falseforgotten onto_csv— saves the row index as an extra column. - ❗ Loading huge CSVs into RAM — use
chunksizeor convert to Parquet. - ❗ SQL injection — never f-string user input into a query. Use
params=.
Practice¶
What does this print?
Expected: a b
Save to CSV without the row index appearing as a column
Expected: name,age
Quiz — Quick check¶
What you remember
Q1. Why pass parse_dates=["col"] to read_csv?
- To validate the data
- To automatically convert that column to
datetime64instead of a string - To sort by date
- To handle timezones
Why: Without
parse_dates, date-looking strings are loaded asobject(string) and you can't use.dtaccessors or do date arithmetic.
Q2. Which format is smallest and fastest to re-read?
- CSV
- JSON
- Parquet
- Excel
Why: Parquet is a columnar binary format — typically 5–10× smaller than CSV and many times faster to read. Use it for any dataset you'll touch more than once.
Q3. Reading a 10 GB CSV — best approach?
- Buy more RAM
- Stream in chunks with
pd.read_csv("big.csv", chunksize=100_000)and process each chunk - Use Excel
- Pickle it first
Why:
chunksizereturns an iterator of DataFrames. You process and aggregate one chunk at a time without ever loading the whole file. For repeated access, convert once to Parquet.
Common doubts¶
Why should I never load pickle files from untrusted sources?
pickle.load() can execute arbitrary Python code during deserialization. A malicious pickle file can install malware. Only use pickle for files you produced yourself. For cross-team sharing, use Parquet, JSON, or CSV.
What's the right way to parameterize SQL queries?
Use params= in pd.read_sql. Never f-string user input into a query — that's textbook SQL injection. pd.read_sql("SELECT * FROM t WHERE id = %(i)s", conn, params={"i": user_id}) is safe.
Which encoding do I use when read_csv complains?
Try encoding="utf-8" first (the default). If you see UnicodeDecodeError, try encoding="latin-1" (almost any byte is valid latin-1), or encoding="cp1252" (common for Windows-exported files), or encoding="utf-16" (Excel-exported tab-separated files).