Skip to content

Data Loading — CSV, JSON, SQL, APIs, Web Scraping

1. Why this matters

Every ML project starts with "how do I get the data." Most beginners can read a CSV but get stuck when the data is a paginated JSON API, a SQL warehouse, or behind a login wall. The five techniques here cover ~95% of practical sources.

2. Mental model

Every loader produces (or can produce) a DataFramerows × columns with named types:

flowchart LR
    A[CSV file] --> P[pd.read_csv]
    B[JSON file/string] --> Q[pd.read_json / json_normalize]
    C[SQL DB] --> R[pd.read_sql + SQLAlchemy]
    D[REST API] --> S[requests.get → json → DataFrame]
    E[HTML page] --> T[BeautifulSoup or pd.read_html]
    P --> DF[DataFrame]
    Q --> DF
    R --> DF
    S --> DF
    T --> DF

3. Core concepts

  • pd.read_csv — most flexible loader. Knows sep, header, dtype, parse_dates, chunksize (for streaming).
  • pd.json_normalize — flattens nested JSON into a flat table.
  • SQLAlchemy engine — connection string + pd.read_sql(query, engine).
  • requests — HTTP client. r.json() returns a Python dict.
  • pd.read_html(url) — parses ALL <table> tags on a page, returns a list of DataFrames.
  • BeautifulSoup — for arbitrary HTML scraping when there's no <table> shortcut.
  • Pagination — APIs return data in pages. Loop with ?page=N until empty.

4. Code — CSV

import pandas as pd

# Standard
df = pd.read_csv("data.csv")

# Big-CSV tricks
df = pd.read_csv(
    "data.csv",
    sep=",",
    header=0,                       # row index of column names
    names=["a", "b", "c"],          # override column names
    usecols=["a", "b"],             # load only some columns
    dtype={"a": "int32", "b": "category"},  # save RAM
    parse_dates=["created_at"],
    na_values=["NA", "?", "-"],
    encoding="utf-8",
    low_memory=False,
    nrows=10_000,                   # load only first N rows for preview
)

# Stream a 10GB file in chunks
total = 0
for chunk in pd.read_csv("huge.csv", chunksize=100_000):
    total += chunk["amount"].sum()
print(total)

# Multiple files → one DataFrame
import glob
df = pd.concat(
    (pd.read_csv(f) for f in glob.glob("logs/*.csv")),
    ignore_index=True,
)

# Write
df.to_csv("clean.csv", index=False)
df.to_parquet("clean.parquet")     # ~10x smaller + faster than CSV

5. Code — JSON

import json, pandas as pd

# Simple — flat JSON
df = pd.read_json("data.json")
df = pd.read_json("https://api.example.com/data.json")   # also works on URLs

# Nested JSON — flatten with json_normalize
data = json.load(open("nested.json"))
# Example: data = {"users": [{"id":1, "addr":{"city":"NYC"}}, ...]}
df = pd.json_normalize(
    data,
    record_path=["users"],
    meta=["org_name"],
    sep="_",                    # nested keys become "addr_city"
)

# JSON Lines (one JSON object per line — common in logs)
df = pd.read_json("events.jsonl", lines=True)

6. Code — SQL

from sqlalchemy import create_engine
import pandas as pd

# 1. Build engine from a connection string
engine = create_engine("postgresql://user:pwd@host:5432/dbname")
# Or:
# engine = create_engine("sqlite:///local.db")
# engine = create_engine("mysql+pymysql://user:pwd@host/db")

# 2. Read a query
df = pd.read_sql(
    "SELECT user_id, ts, amount FROM orders WHERE ts > '2025-01-01'",
    engine,
    parse_dates=["ts"],
)

# 3. Parameterize (NEVER string-interpolate — SQL injection risk)
df = pd.read_sql(
    "SELECT * FROM orders WHERE user_id = %(uid)s",
    engine,
    params={"uid": 42},
)

# 4. Big result — stream in chunks
for chunk in pd.read_sql("SELECT * FROM big_table", engine, chunksize=50_000):
    process(chunk)

# 5. Write back
df.to_sql("orders_clean", engine, if_exists="replace", index=False, method="multi")

7. Code — REST APIs

import requests
import pandas as pd

# Single request
r = requests.get(
    "https://api.example.com/v1/users",
    headers={"Authorization": "Bearer YOUR_TOKEN",
             "User-Agent": "research-script/0.1"},
    params={"limit": 100},
    timeout=10,
)
r.raise_for_status()
df = pd.DataFrame(r.json()["data"])

# Paginated API — loop until exhausted
def fetch_all(url, headers, page_param="page"):
    rows, page = [], 1
    while True:
        r = requests.get(url, headers=headers,
                         params={page_param: page, "per_page": 100},
                         timeout=10)
        r.raise_for_status()
        batch = r.json()
        if not batch:                # empty list → done
            break
        rows.extend(batch)
        page += 1
    return pd.DataFrame(rows)

df = fetch_all("https://api.example.com/users", {"Authorization": "Bearer X"})

# Rate-limited APIs — add backoff
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=Retry(
    total=5, backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
)))

8. Code — Web scraping

For HTML tables — one line:

import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")
# tables is a list — pick the one you want
gdp = tables[0]
print(gdp.head())

For arbitrary HTML — BeautifulSoup:

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get("https://news.example.com",
                 headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")

rows = []
for card in soup.select("article.post-card"):
    rows.append({
        "title": card.select_one("h2").get_text(strip=True),
        "author": card.select_one(".author").get_text(strip=True),
        "url":   card.select_one("a")["href"],
    })

df = pd.DataFrame(rows)

For JavaScript-rendered sites — use Playwright or Selenium (returns rendered HTML, then BeautifulSoup):

# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://app.example.com/dashboard")
    page.wait_for_selector(".data-row")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")

9. Common pitfalls

  • Encoding errors on CSV. Set encoding="utf-8" or try "latin-1" for legacy data.
  • Bad CSV separator detection. European CSVs use ;. Always set sep= explicitly.
  • Loading huge CSVs eagerly. Use chunksize or convert to Parquet.
  • SQL injection via string formatting. Use parameter binding (params={...}) — never f"... {user_input} ...".
  • Hammering rate-limited APIs. Add backoff, respect Retry-After headers.
  • Scraping without checking robots.txt and ToS. Some sites forbid scraping; some require attribution or rate limits.
  • Forgetting User-Agent. Many sites block requests with a missing or default UA.
  • Caching nothing. Re-running a scrape from scratch every time wastes hours. Pickle / Parquet intermediate results.

10. When to use what

Source format Use
Local file < 1 GB pd.read_csv
Local file > 1 GB chunksize OR convert to Parquet
Many small files glob + pd.concat
Nested JSON pd.json_normalize
Production DB pd.read_sql + SQLAlchemy + parameterized queries
REST API requests + session + retry adapter
HTML table pd.read_html
Arbitrary HTML requests + BeautifulSoup
JS-rendered site Playwright / Selenium + BeautifulSoup
Streaming events Kafka / Kinesis (out of scope here)

11. Cheatsheet

# CSV
pd.read_csv(path, sep=",", parse_dates=["dt"], dtype={"id": "int32"})
df.to_parquet(path)                              # always prefer Parquet for re-use

# JSON
pd.read_json(path)
pd.json_normalize(data, record_path=["users"], meta=["org"])
pd.read_json(path, lines=True)                   # JSON Lines

# SQL
from sqlalchemy import create_engine
e = create_engine("postgresql://...")
pd.read_sql("SELECT * FROM t WHERE c=%(v)s", e, params={"v": x})

# API
import requests
r = requests.get(url, headers={...}, params={...}, timeout=10); r.raise_for_status()
df = pd.DataFrame(r.json())

# HTML
pd.read_html(url)                                # quick tables
BeautifulSoup(r.text, "html.parser").select(".x")

12. Q&A — recall test

  • Q: Fastest way to extract every <table> from an HTML page into DataFrames? A: pd.read_html(url) — returns a list of DataFrames, one per <table>.

  • Q: How do you safely query a DB with user input? A: Use parameterized queries — pd.read_sql(query, engine, params={...}). Never string-format SQL.

  • Q: Best practice for streaming a 50 GB CSV? A: pd.read_csv(path, chunksize=N) returns an iterator of DataFrames; process chunks one at a time. Or convert once to Parquet for faster re-reads.

  • Q: How do you flatten [{"id":1, "addr":{"city":"NYC"}}] into a table? A: pd.json_normalize(records) — produces columns id, addr.city.

  • Q: Difference between pd.read_json and pd.json_normalize? A: read_json reads a JSON file/URL when the structure is already flat or tabular. json_normalize is for nested in-memory dicts/lists you want to flatten.

Practice

What does this print?

Expected: (120, 4)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape)        # 80% of 150 = 120 rows

Use stratified split so train/test have the same class balance

Expected: True

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
same_dist = np.allclose(np.bincount(y_tr)/len(y_tr), np.bincount(y_te)/len(y_te), atol=0.05)
print(same_dist)         # bug: without stratify, class proportions can drift

Quiz — Quick check

What you remember

Q1. Why split data into train/test sets?

  • To make training faster
  • To estimate how the model performs on unseen data — the test set simulates production
  • To save disk space
  • Required by sklearn

Why: Without a held-out test set, you can't tell if your model has truly learned or just memorized. Test scores predict deployment performance.

Q2. When training a classifier, why use stratify=y in train_test_split?

  • Reduces overfitting
  • Preserves the class distribution in both train and test (essential for imbalanced classes)
  • Makes the split deterministic
  • Required for cross-validation

Why: Without stratify, an imbalanced split could put almost all the rare class in train and none in test (or vice versa) — useless for evaluation.

Q3. What's random_state=42 for?

  • Makes the random split reproducible — same split every run
  • Improves the model
  • A NumPy version requirement
  • Random initialization for the model

Why: Reproducibility. Anyone running your code gets the same train/test split and can verify your numbers.

Common doubts

What's the right train/test ratio?

Common defaults: 80/20 or 70/30. With huge datasets (>1M rows), even 99/1 leaves enough in test. For small datasets (<1000 rows), use cross-validation instead of a single split — every row participates as test data across the folds.

When should I add a validation set separate from test?

When you're tuning hyperparameters or doing model selection. Train on train, tune on validation, hold out test for the final "this is how the model actually performs" measurement. Or use cross_val_score on train and only touch test once.

Why does the train/test split need to be done BEFORE preprocessing?

Because preprocessing parameters (scaler means, encoder categories, imputer medians) are derived from data. If you fit them on the full dataset, the test set has "seen" the training process — data leakage. Always split first, fit preprocessing on train only.