Data Loading — CSV, JSON, SQL, APIs, Web Scraping¶
1. Why this matters¶
Every ML project starts with "how do I get the data." Most beginners can read a CSV but get stuck when the data is a paginated JSON API, a SQL warehouse, or behind a login wall. The five techniques here cover ~95% of practical sources.
2. Mental model¶
Every loader produces (or can produce) a DataFrame — rows × columns with named types:
flowchart LR
A[CSV file] --> P[pd.read_csv]
B[JSON file/string] --> Q[pd.read_json / json_normalize]
C[SQL DB] --> R[pd.read_sql + SQLAlchemy]
D[REST API] --> S[requests.get → json → DataFrame]
E[HTML page] --> T[BeautifulSoup or pd.read_html]
P --> DF[DataFrame]
Q --> DF
R --> DF
S --> DF
T --> DF
3. Core concepts¶
pd.read_csv— most flexible loader. Knowssep,header,dtype,parse_dates,chunksize(for streaming).pd.json_normalize— flattens nested JSON into a flat table.- SQLAlchemy engine — connection string +
pd.read_sql(query, engine). requests— HTTP client.r.json()returns a Python dict.pd.read_html(url)— parses ALL<table>tags on a page, returns a list of DataFrames.BeautifulSoup— for arbitrary HTML scraping when there's no<table>shortcut.- Pagination — APIs return data in pages. Loop with
?page=Nuntil empty.
4. Code — CSV¶
import pandas as pd
# Standard
df = pd.read_csv("data.csv")
# Big-CSV tricks
df = pd.read_csv(
"data.csv",
sep=",",
header=0, # row index of column names
names=["a", "b", "c"], # override column names
usecols=["a", "b"], # load only some columns
dtype={"a": "int32", "b": "category"}, # save RAM
parse_dates=["created_at"],
na_values=["NA", "?", "-"],
encoding="utf-8",
low_memory=False,
nrows=10_000, # load only first N rows for preview
)
# Stream a 10GB file in chunks
total = 0
for chunk in pd.read_csv("huge.csv", chunksize=100_000):
total += chunk["amount"].sum()
print(total)
# Multiple files → one DataFrame
import glob
df = pd.concat(
(pd.read_csv(f) for f in glob.glob("logs/*.csv")),
ignore_index=True,
)
# Write
df.to_csv("clean.csv", index=False)
df.to_parquet("clean.parquet") # ~10x smaller + faster than CSV
5. Code — JSON¶
import json, pandas as pd
# Simple — flat JSON
df = pd.read_json("data.json")
df = pd.read_json("https://api.example.com/data.json") # also works on URLs
# Nested JSON — flatten with json_normalize
data = json.load(open("nested.json"))
# Example: data = {"users": [{"id":1, "addr":{"city":"NYC"}}, ...]}
df = pd.json_normalize(
data,
record_path=["users"],
meta=["org_name"],
sep="_", # nested keys become "addr_city"
)
# JSON Lines (one JSON object per line — common in logs)
df = pd.read_json("events.jsonl", lines=True)
6. Code — SQL¶
from sqlalchemy import create_engine
import pandas as pd
# 1. Build engine from a connection string
engine = create_engine("postgresql://user:pwd@host:5432/dbname")
# Or:
# engine = create_engine("sqlite:///local.db")
# engine = create_engine("mysql+pymysql://user:pwd@host/db")
# 2. Read a query
df = pd.read_sql(
"SELECT user_id, ts, amount FROM orders WHERE ts > '2025-01-01'",
engine,
parse_dates=["ts"],
)
# 3. Parameterize (NEVER string-interpolate — SQL injection risk)
df = pd.read_sql(
"SELECT * FROM orders WHERE user_id = %(uid)s",
engine,
params={"uid": 42},
)
# 4. Big result — stream in chunks
for chunk in pd.read_sql("SELECT * FROM big_table", engine, chunksize=50_000):
process(chunk)
# 5. Write back
df.to_sql("orders_clean", engine, if_exists="replace", index=False, method="multi")
7. Code — REST APIs¶
import requests
import pandas as pd
# Single request
r = requests.get(
"https://api.example.com/v1/users",
headers={"Authorization": "Bearer YOUR_TOKEN",
"User-Agent": "research-script/0.1"},
params={"limit": 100},
timeout=10,
)
r.raise_for_status()
df = pd.DataFrame(r.json()["data"])
# Paginated API — loop until exhausted
def fetch_all(url, headers, page_param="page"):
rows, page = [], 1
while True:
r = requests.get(url, headers=headers,
params={page_param: page, "per_page": 100},
timeout=10)
r.raise_for_status()
batch = r.json()
if not batch: # empty list → done
break
rows.extend(batch)
page += 1
return pd.DataFrame(rows)
df = fetch_all("https://api.example.com/users", {"Authorization": "Bearer X"})
# Rate-limited APIs — add backoff
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=Retry(
total=5, backoff_factor=0.5,
status_forcelist=[429, 500, 502, 503, 504],
)))
8. Code — Web scraping¶
For HTML tables — one line:
import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")
# tables is a list — pick the one you want
gdp = tables[0]
print(gdp.head())
For arbitrary HTML — BeautifulSoup:
import requests
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get("https://news.example.com",
headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")
rows = []
for card in soup.select("article.post-card"):
rows.append({
"title": card.select_one("h2").get_text(strip=True),
"author": card.select_one(".author").get_text(strip=True),
"url": card.select_one("a")["href"],
})
df = pd.DataFrame(rows)
For JavaScript-rendered sites — use Playwright or Selenium (returns rendered HTML, then BeautifulSoup):
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
page.goto("https://app.example.com/dashboard")
page.wait_for_selector(".data-row")
html = page.content()
browser.close()
soup = BeautifulSoup(html, "html.parser")
9. Common pitfalls¶
- ❗ Encoding errors on CSV. Set
encoding="utf-8"or try"latin-1"for legacy data. - ❗ Bad CSV separator detection. European CSVs use
;. Always setsep=explicitly. - ❗ Loading huge CSVs eagerly. Use
chunksizeor convert to Parquet. - ❗ SQL injection via string formatting. Use parameter binding (
params={...}) — neverf"... {user_input} ...". - ❗ Hammering rate-limited APIs. Add backoff, respect
Retry-Afterheaders. - ❗ Scraping without checking
robots.txtand ToS. Some sites forbid scraping; some require attribution or rate limits. - ❗ Forgetting
User-Agent. Many sites block requests with a missing or default UA. - ❗ Caching nothing. Re-running a scrape from scratch every time wastes hours. Pickle / Parquet intermediate results.
10. When to use what¶
| Source format | Use |
|---|---|
| Local file < 1 GB | pd.read_csv |
| Local file > 1 GB | chunksize OR convert to Parquet |
| Many small files | glob + pd.concat |
| Nested JSON | pd.json_normalize |
| Production DB | pd.read_sql + SQLAlchemy + parameterized queries |
| REST API | requests + session + retry adapter |
| HTML table | pd.read_html |
| Arbitrary HTML | requests + BeautifulSoup |
| JS-rendered site | Playwright / Selenium + BeautifulSoup |
| Streaming events | Kafka / Kinesis (out of scope here) |
11. Cheatsheet¶
# CSV
pd.read_csv(path, sep=",", parse_dates=["dt"], dtype={"id": "int32"})
df.to_parquet(path) # always prefer Parquet for re-use
# JSON
pd.read_json(path)
pd.json_normalize(data, record_path=["users"], meta=["org"])
pd.read_json(path, lines=True) # JSON Lines
# SQL
from sqlalchemy import create_engine
e = create_engine("postgresql://...")
pd.read_sql("SELECT * FROM t WHERE c=%(v)s", e, params={"v": x})
# API
import requests
r = requests.get(url, headers={...}, params={...}, timeout=10); r.raise_for_status()
df = pd.DataFrame(r.json())
# HTML
pd.read_html(url) # quick tables
BeautifulSoup(r.text, "html.parser").select(".x")
12. Q&A — recall test¶
-
Q: Fastest way to extract every
<table>from an HTML page into DataFrames? A:pd.read_html(url)— returns a list of DataFrames, one per<table>. -
Q: How do you safely query a DB with user input? A: Use parameterized queries —
pd.read_sql(query, engine, params={...}). Never string-format SQL. -
Q: Best practice for streaming a 50 GB CSV? A:
pd.read_csv(path, chunksize=N)returns an iterator of DataFrames; process chunks one at a time. Or convert once to Parquet for faster re-reads. -
Q: How do you flatten
[{"id":1, "addr":{"city":"NYC"}}]into a table? A:pd.json_normalize(records)— produces columnsid,addr.city. -
Q: Difference between
pd.read_jsonandpd.json_normalize? A:read_jsonreads a JSON file/URL when the structure is already flat or tabular.json_normalizeis for nested in-memory dicts/lists you want to flatten.
Practice¶
What does this print?
Expected: (120, 4)
Use stratified split so train/test have the same class balance
Expected: True
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
same_dist = np.allclose(np.bincount(y_tr)/len(y_tr), np.bincount(y_te)/len(y_te), atol=0.05)
print(same_dist) # bug: without stratify, class proportions can drift
Quiz — Quick check¶
What you remember
Q1. Why split data into train/test sets?
- To make training faster
- To estimate how the model performs on unseen data — the test set simulates production
- To save disk space
- Required by sklearn
Why: Without a held-out test set, you can't tell if your model has truly learned or just memorized. Test scores predict deployment performance.
Q2. When training a classifier, why use stratify=y in train_test_split?
- Reduces overfitting
- Preserves the class distribution in both train and test (essential for imbalanced classes)
- Makes the split deterministic
- Required for cross-validation
Why: Without stratify, an imbalanced split could put almost all the rare class in train and none in test (or vice versa) — useless for evaluation.
Q3. What's random_state=42 for?
- Makes the random split reproducible — same split every run
- Improves the model
- A NumPy version requirement
- Random initialization for the model
Why: Reproducibility. Anyone running your code gets the same train/test split and can verify your numbers.
Common doubts¶
What's the right train/test ratio?
Common defaults: 80/20 or 70/30. With huge datasets (>1M rows), even 99/1 leaves enough in test. For small datasets (<1000 rows), use cross-validation instead of a single split — every row participates as test data across the folds.
When should I add a validation set separate from test?
When you're tuning hyperparameters or doing model selection. Train on train, tune on validation, hold out test for the final "this is how the model actually performs" measurement. Or use cross_val_score on train and only touch test once.
Why does the train/test split need to be done BEFORE preprocessing?
Because preprocessing parameters (scaler means, encoder categories, imputer medians) are derived from data. If you fit them on the full dataset, the test set has "seen" the training process — data leakage. Always split first, fit preprocessing on train only.