Data Loading — CSV, JSON, SQL, APIs, Web Scraping¶

1. Why this matters¶

Every ML project starts with "how do I get the data." Most beginners can read a CSV but get stuck when the data is a paginated JSON API, a SQL warehouse, or behind a login wall. The five techniques here cover ~95% of practical sources.

2. Mental model¶

Every loader produces (or can produce) a DataFrame — rows × columns with named types:

flowchart LR
    A[CSV file] --> P[pd.read_csv]
    B[JSON file/string] --> Q[pd.read_json / json_normalize]
    C[SQL DB] --> R[pd.read_sql + SQLAlchemy]
    D[REST API] --> S[requests.get → json → DataFrame]
    E[HTML page] --> T[BeautifulSoup or pd.read_html]
    P --> DF[DataFrame]
    Q --> DF
    R --> DF
    S --> DF
    T --> DF

3. Core concepts¶

pd.read_csv — most flexible loader. Knows sep, header, dtype, parse_dates, chunksize (for streaming).
pd.json_normalize — flattens nested JSON into a flat table.
SQLAlchemy engine — connection string + pd.read_sql(query, engine).
requests — HTTP client. r.json() returns a Python dict.
pd.read_html(url) — parses ALL <table> tags on a page, returns a list of DataFrames.
BeautifulSoup — for arbitrary HTML scraping when there's no <table> shortcut.
Pagination — APIs return data in pages. Loop with ?page=N until empty.

4. Code — CSV¶

import pandas as pd

# Standard
df = pd.read_csv("data.csv")

# Big-CSV tricks
df = pd.read_csv(
    "data.csv",
    sep=",",
    header=0,                       # row index of column names
    names=["a", "b", "c"],          # override column names
    usecols=["a", "b"],             # load only some columns
    dtype={"a": "int32", "b": "category"},  # save RAM
    parse_dates=["created_at"],
    na_values=["NA", "?", "-"],
    encoding="utf-8",
    low_memory=False,
    nrows=10_000,                   # load only first N rows for preview
)

# Stream a 10GB file in chunks
total = 0
for chunk in pd.read_csv("huge.csv", chunksize=100_000):
    total += chunk["amount"].sum()
print(total)

# Multiple files → one DataFrame
import glob
df = pd.concat(
    (pd.read_csv(f) for f in glob.glob("logs/*.csv")),
    ignore_index=True,
)

# Write
df.to_csv("clean.csv", index=False)
df.to_parquet("clean.parquet")     # ~10x smaller + faster than CSV

5. Code — JSON¶

import json, pandas as pd

# Simple — flat JSON
df = pd.read_json("data.json")
df = pd.read_json("https://api.example.com/data.json")   # also works on URLs

# Nested JSON — flatten with json_normalize
data = json.load(open("nested.json"))
# Example: data = {"users": [{"id":1, "addr":{"city":"NYC"}}, ...]}
df = pd.json_normalize(
    data,
    record_path=["users"],
    meta=["org_name"],
    sep="_",                    # nested keys become "addr_city"
)

# JSON Lines (one JSON object per line — common in logs)
df = pd.read_json("events.jsonl", lines=True)

6. Code — SQL¶

from sqlalchemy import create_engine
import pandas as pd

# 1. Build engine from a connection string
engine = create_engine("postgresql://user:pwd@host:5432/dbname")
# Or:
# engine = create_engine("sqlite:///local.db")
# engine = create_engine("mysql+pymysql://user:pwd@host/db")

# 2. Read a query
df = pd.read_sql(
    "SELECT user_id, ts, amount FROM orders WHERE ts > '2025-01-01'",
    engine,
    parse_dates=["ts"],
)

# 3. Parameterize (NEVER string-interpolate — SQL injection risk)
df = pd.read_sql(
    "SELECT * FROM orders WHERE user_id = %(uid)s",
    engine,
    params={"uid": 42},
)

# 4. Big result — stream in chunks
for chunk in pd.read_sql("SELECT * FROM big_table", engine, chunksize=50_000):
    process(chunk)

# 5. Write back
df.to_sql("orders_clean", engine, if_exists="replace", index=False, method="multi")

7. Code — REST APIs¶

import requests
import pandas as pd

# Single request
r = requests.get(
    "https://api.example.com/v1/users",
    headers={"Authorization": "Bearer YOUR_TOKEN",
             "User-Agent": "research-script/0.1"},
    params={"limit": 100},
    timeout=10,
)
r.raise_for_status()
df = pd.DataFrame(r.json()["data"])

# Paginated API — loop until exhausted
def fetch_all(url, headers, page_param="page"):
    rows, page = [], 1
    while True:
        r = requests.get(url, headers=headers,
                         params={page_param: page, "per_page": 100},
                         timeout=10)
        r.raise_for_status()
        batch = r.json()
        if not batch:                # empty list → done
            break
        rows.extend(batch)
        page += 1
    return pd.DataFrame(rows)

df = fetch_all("https://api.example.com/users", {"Authorization": "Bearer X"})

# Rate-limited APIs — add backoff
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=Retry(
    total=5, backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
)))

8. Code — Web scraping¶

For HTML tables — one line:

import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")
# tables is a list — pick the one you want
gdp = tables[0]
print(gdp.head())

For arbitrary HTML — BeautifulSoup:

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get("https://news.example.com",
                 headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")

rows = []
for card in soup.select("article.post-card"):
    rows.append({
        "title": card.select_one("h2").get_text(strip=True),
        "author": card.select_one(".author").get_text(strip=True),
        "url":   card.select_one("a")["href"],
    })

df = pd.DataFrame(rows)

For JavaScript-rendered sites — use Playwright or Selenium (returns rendered HTML, then BeautifulSoup):

# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://app.example.com/dashboard")
    page.wait_for_selector(".data-row")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")

9. Common pitfalls¶

❗ Encoding errors on CSV. Set encoding="utf-8" or try "latin-1" for legacy data.
❗ Bad CSV separator detection. European CSVs use ;. Always set sep= explicitly.
❗ Loading huge CSVs eagerly. Use chunksize or convert to Parquet.
❗ SQL injection via string formatting. Use parameter binding (params={...}) — never f"... {user_input} ...".
❗ Hammering rate-limited APIs. Add backoff, respect Retry-After headers.
❗ Scraping without checking robots.txt and ToS. Some sites forbid scraping; some require attribution or rate limits.
❗ Forgetting User-Agent. Many sites block requests with a missing or default UA.
❗ Caching nothing. Re-running a scrape from scratch every time wastes hours. Pickle / Parquet intermediate results.

10. When to use what¶

Source format	Use
Local file < 1 GB	`pd.read_csv`
Local file > 1 GB	`chunksize` OR convert to Parquet
Many small files	`glob` + `pd.concat`
Nested JSON	`pd.json_normalize`
Production DB	`pd.read_sql` + SQLAlchemy + parameterized queries
REST API	`requests` + session + retry adapter
HTML table	`pd.read_html`
Arbitrary HTML	`requests` + `BeautifulSoup`
JS-rendered site	Playwright / Selenium + BeautifulSoup
Streaming events	Kafka / Kinesis (out of scope here)

11. Cheatsheet¶

# CSV
pd.read_csv(path, sep=",", parse_dates=["dt"], dtype={"id": "int32"})
df.to_parquet(path)                              # always prefer Parquet for re-use

# JSON
pd.read_json(path)
pd.json_normalize(data, record_path=["users"], meta=["org"])
pd.read_json(path, lines=True)                   # JSON Lines

# SQL
from sqlalchemy import create_engine
e = create_engine("postgresql://...")
pd.read_sql("SELECT * FROM t WHERE c=%(v)s", e, params={"v": x})

# API
import requests
r = requests.get(url, headers={...}, params={...}, timeout=10); r.raise_for_status()
df = pd.DataFrame(r.json())

# HTML
pd.read_html(url)                                # quick tables
BeautifulSoup(r.text, "html.parser").select(".x")

12. Q&A — recall test¶

Q: Fastest way to extract every <table> from an HTML page into DataFrames? A: pd.read_html(url) — returns a list of DataFrames, one per <table>.
Q: How do you safely query a DB with user input? A: Use parameterized queries — pd.read_sql(query, engine, params={...}). Never string-format SQL.
Q: Best practice for streaming a 50 GB CSV? A: pd.read_csv(path, chunksize=N) returns an iterator of DataFrames; process chunks one at a time. Or convert once to Parquet for faster re-reads.
Q: How do you flatten [{"id":1, "addr":{"city":"NYC"}}] into a table? A: pd.json_normalize(records) — produces columns id, addr.city.
Q: Difference between pd.read_json and pd.json_normalize? A: read_json reads a JSON file/URL when the structure is already flat or tabular. json_normalize is for nested in-memory dicts/lists you want to flatten.

Practice¶

What does this print?

Expected: (120, 4)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape)        # 80% of 150 = 120 rows

Use stratified split so train/test have the same class balance

Expected: True

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
same_dist = np.allclose(np.bincount(y_tr)/len(y_tr), np.bincount(y_te)/len(y_te), atol=0.05)
print(same_dist)         # bug: without stratify, class proportions can drift

Quiz — Quick check¶

What you remember

Q1. Why split data into train/test sets?

To make training faster
To estimate how the model performs on unseen data — the test set simulates production
To save disk space
Required by sklearn

Why: Without a held-out test set, you can't tell if your model has truly learned or just memorized. Test scores predict deployment performance.

Q2. When training a classifier, why use stratify=y in train_test_split?

Reduces overfitting
Preserves the class distribution in both train and test (essential for imbalanced classes)
Makes the split deterministic
Required for cross-validation

Why: Without stratify, an imbalanced split could put almost all the rare class in train and none in test (or vice versa) — useless for evaluation.

Q3. What's random_state=42 for?

Makes the random split reproducible — same split every run
Improves the model
A NumPy version requirement
Random initialization for the model

Why: Reproducibility. Anyone running your code gets the same train/test split and can verify your numbers.

Common doubts¶

What's the right train/test ratio?

Common defaults: 80/20 or 70/30. With huge datasets (>1M rows), even 99/1 leaves enough in test. For small datasets (<1000 rows), use cross-validation instead of a single split — every row participates as test data across the folds.

When should I add a validation set separate from test?

When you're tuning hyperparameters or doing model selection. Train on train, tune on validation, hold out test for the final "this is how the model actually performs" measurement. Or use cross_val_score on train and only touch test once.

Why does the train/test split need to be done BEFORE preprocessing?

Because preprocessing parameters (scaler means, encoder categories, imputer medians) are derived from data. If you fit them on the full dataset, the test set has "seen" the training process — data leakage. Always split first, fit preprocessing on train only.