Exploratory Data Analysis (EDA)¶
1. Why this matters¶
Skip EDA and you'll waste a week chasing model accuracy on data that has duplicates, leakage, wrong types, or features with no signal. Experienced ML practitioners routinely spend a large share of project time here (often a third or more); beginners try to skip it.
A solid EDA pass tells you:
- Are the labels balanced?
- Which features are constant/near-constant (useless)?
- Which features are highly correlated (redundant)?
- Where are the outliers and missing values?
- Which features look predictive on a first pass?
2. Mental model¶
EDA proceeds in widening circles:
flowchart TB
A[1. Shape & dtypes<br/>df.info, df.shape] --> B[2. Descriptive stats<br/>df.describe]
B --> C[3. Univariate<br/>each column alone]
C --> D[4. Bivariate<br/>pairs: corr / groupby]
D --> E[5. Target relationship<br/>feature vs y]
E --> F[6. Decisions:<br/>clean, drop, engineer]
Don't try to plot everything — focus on what's relevant to your target.
3. Step 1: First look¶
import pandas as pd
df = pd.read_csv("titanic.csv")
df.shape # (rows, cols)
df.info() # dtypes + non-null counts
df.head() # first 5 rows
df.tail() # last 5
df.sample(5) # random 5 (best for big files)
df.columns.tolist()
df.dtypes
df.isna().sum().sort_values(ascending=False) # missing-value count per column
df.duplicated().sum() # duplicate rows
What to notice:
- Columns with most rows missing → drop or impute.
- Object columns that look numeric → bad parsing.
- Suspicious dtype object for dates → fix with pd.to_datetime.
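The last two bullets can be sketched on a toy frame. The column names and values here are hypothetical, and the date format is an assumption; the point is only that errors="coerce" converts unparseable entries to NaN/NaT so you can count them instead of hitting an exception.

```python
import pandas as pd

# Hypothetical toy frame: "price" and "signup" were both read in as strings.
raw = pd.DataFrame({
    "price": ["10.5", "7", "n/a", "3.25"],
    "signup": ["2024-01-05", "2024-02-17", "not a date", "2024-03-01"],
})

# errors="coerce" turns unparseable entries into NaN / NaT instead of raising.
fixed = raw.assign(
    price=pd.to_numeric(raw["price"], errors="coerce"),
    signup=pd.to_datetime(raw["signup"], format="%Y-%m-%d", errors="coerce"),
)

print(fixed.dtypes)
print(fixed.isna().sum())  # number of values per column that failed to parse
```

Comparing isna() counts before and after the conversion shows how many values were silently bad, which is usually worth a look before imputing.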
4. Step 2: Descriptive statistics¶
df.describe() # numeric: count, mean, std, min, 25%, 50%, 75%, max
df.describe(include="object") # categorical: count, unique, top, freq
df.describe(include="all") # everything
# Per-column quick stats
df["age"].agg(["mean", "median", "std", "min", "max", "skew", "kurt"])
# Targets
df["churned"].value_counts(normalize=True) # class balance as proportions
| Metric | What it tells you |
|---|---|
| mean vs median | If they diverge, distribution is skewed |
| std | Spread. Very small → constant feature |
| min/max | Outliers, sentinel values (-999, 9999) |
| count | Implied missing-value count vs len(df) |
| skew | > 1 = right-skewed; consider log/power transform |
| kurt | High = heavy tails / outliers |
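As a small illustration of the min/max and count rows, the sketch below uses a made-up age column in which -999 stands for "unknown" (the sentinel value and the data are assumptions). The sentinel drags the mean far from the median, and replacing it with NaN makes the missing values visible to count().

```python
import numpy as np
import pandas as pd

# Hypothetical column: plausible ages plus a -999 sentinel meaning "unknown".
age = pd.Series([22, 35, 41, 29, 58, -999, 33, 47, -999, 26], name="age")

print(age.agg(["mean", "median", "min"]))  # min exposes the sentinel

# Treat the sentinel as missing, then recompute the same statistics.
clean = age.replace(-999, np.nan)
print(clean.agg(["mean", "median", "min"]))

# count() skips NaN, so the gap to len() is the implied missing-value count.
print("implied missing:", len(clean) - clean.count())
```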
Automated profiling with ydata-profiling, formerly pandas-profiling (one-liner full report, great for a first pass):
# pip install ydata-profiling
from ydata_profiling import ProfileReport
ProfileReport(df, title="Profile", minimal=True).to_file("profile.html")
5. Step 3: Univariate analysis¶
Look at each column in isolation.
Numeric columns:
import matplotlib.pyplot as plt
import seaborn as sns
for col in df.select_dtypes("number").columns:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col].dropna(), kde=True, ax=axes[0]).set(title=f"{col} dist")
    sns.boxplot(x=df[col], ax=axes[1]).set(title=f"{col} box")
    plt.tight_layout()
    plt.show()
Look for:
- Strong skew → log/power transform later.
- Multimodal distributions → maybe two sub-populations.
- Tail outliers → investigate before deciding to remove.
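To see what "log transform for strong skew" does in practice, here is a minimal sketch on synthetic data. The income column is generated from a log-normal distribution purely for illustration, and np.log1p is one common choice of transform, not the only one.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed column (log-normal draws); the name is hypothetical.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=5_000), name="income")

# log1p(x) = log(1 + x); it is safe at zero, unlike a plain log.
log_income = np.log1p(income)

print("skew before:", round(income.skew(), 2))
print("skew after: ", round(log_income.skew(), 2))
```

The skew should drop from well above 1 to roughly 0. On real data the improvement is rarely this clean, so re-plot the histogram after transforming.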
Categorical columns:
for col in df.select_dtypes(["object", "category"]).columns:
    print(col, df[col].nunique(), "unique")
    print(df[col].value_counts().head(10))
    print("---")
# Plot top categories
df["plan_type"].value_counts().head(10).plot.bar()
Look for:
- High cardinality (>50 unique) — encoding will explode columns; consider hashing or target encoding.
- Long tail — group rare categories into "other".
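Grouping the long tail into "other" can be sketched as follows. The plan names and the 5% frequency cutoff are illustrative assumptions; pick a threshold that suits your data.

```python
import pandas as pd

# Hypothetical categorical column with two rare levels.
plans = pd.Series(
    ["basic"] * 50 + ["pro"] * 30 + ["team"] * 15 + ["edu"] * 3 + ["trial"] * 2,
    name="plan_type",
)

# Relative frequency of each level; anything under 5% counts as rare here.
freq = plans.value_counts(normalize=True)
rare = freq[freq < 0.05].index

# Keep common levels as they are, replace rare ones with "other".
grouped = plans.where(~plans.isin(rare), "other")
print(grouped.value_counts())
```

If you adopt this, compute the rare set on the training data only and reuse it for any later data, so unseen categories also fall into "other".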
6. Step 4: Bivariate analysis¶
Now pairs — especially "feature vs target."
Numeric vs numeric — correlation:
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
# Top correlations with the target
corr_with_y = corr["target"].drop("target").abs().sort_values(ascending=False).head(15) # drop the target's own 1.0
|corr| > 0.9 between two features → drop one (redundancy).
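Reading redundant pairs off a heatmap gets tedious beyond a few columns. One possible way to list them programmatically is to mask everything except the upper triangle of the correlation matrix, so each pair shows up once. The data below is synthetic and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
height_cm = rng.normal(170, 10, n)
X = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,       # exact rescaling of height_cm: redundant
    "weight_kg": rng.normal(70, 12, n),  # generated independently of height
})

corr = X.corr().abs()

# Keep the strict upper triangle: no self-correlations, each pair listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

redundant = pairs[pairs > 0.9]
print(redundant)
```

Which member of a pair to drop is a judgment call; keeping the one with fewer missing values or the clearer meaning is a reasonable default.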
Numeric vs categorical — grouped stats:
df.groupby("plan_type")["churned"].mean().sort_values(ascending=False)
df.groupby("region")[["revenue", "tenure_months"]].agg(["mean", "median", "count"])
A categorical column where groupby(...)["target"].mean() varies wildly → predictive feature.
Numeric vs numeric — scatter / regression:
sns.scatterplot(data=df, x="age", y="charges", hue="smoker", alpha=0.5)
sns.lmplot(data=df, x="age", y="charges", hue="smoker")
Categorical vs categorical — contingency / chi-square:
pd.crosstab(df["region"], df["churned"], normalize="index")
from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(pd.crosstab(df["region"], df["churned"]))
print("p-value:", p) # < 0.05 → evidence against independence (says nothing about effect size)
Pair plot for a quick visual summary (small datasets only):
sns.pairplot(df[cols], hue="target") # cols: a short list of column names
7. Step 5: Relate everything to the target¶
Pick 5-10 features that look most promising and produce a one-page summary:
def feature_vs_target(df, feat, target):
    # Categorical or low-cardinality feature: target mean and row count per level
    if df[feat].dtype == "object" or df[feat].nunique() < 20:
        return df.groupby(feat)[target].agg(["mean", "count"]).sort_values("mean", ascending=False)
    # Otherwise: Pearson correlation between the feature and the target
    return df[[feat, target]].corr().iloc[0, 1]

for f in ["age", "plan_type", "tenure_months", "region"]:
    print(f, "→")
    print(feature_vs_target(df, f, "churned"))
    print()
This is what you bring to feature engineering — concrete evidence of which columns matter.
8. Common pitfalls¶
- ❗ Skipping EDA entirely — leads to garbage-in-garbage-out modeling.
- ❗ Plotting everything blindly. 200-column datasets need targeted plots, not 200 histograms.
- ❗ Confusing correlation with causation. Correlated features are useful for prediction, but say nothing about why.
- ❗ Ignoring class imbalance. A 99/1 imbalance means accuracy is useless and you need different sampling / metrics.
- ❗ Trusting corr() for non-linear relationships. Pearson correlation only catches linear ones. Use Spearman (df.corr(method="spearman")) for monotonic relationships, or mutual information for arbitrary non-linear ones.
- ❗ Using the test set in EDA. Look at train only. Looking at test creates subtle "data peeking" leakage.
- ❗ Letting outliers dominate visualizations. Use df.quantile([0.01, 0.99]) to clip the displayed range: keep the data, change the plot range.
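The correlation pitfall is easy to reproduce on synthetic data. In this sketch (all columns are made up), one variable is a steep but monotonic function of x and the other is U-shaped. Spearman recovers the monotonic relationship, but neither coefficient detects the U-shape.

```python
import numpy as np
import pandas as pd

x = np.linspace(-3, 3, 201)
demo = pd.DataFrame({
    "x": x,
    "monotonic": np.exp(2 * x),  # non-linear but always increasing
    "u_shaped": x ** 2,          # fully determined by x, but not monotonic
})

# Correlation of every column with x under both methods.
pearson = demo.corr(method="pearson")["x"]
spearman = demo.corr(method="spearman")["x"]

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(2))
```

For the U-shaped column both numbers sit near zero even though x determines it exactly, which is why a scatter plot or a mutual-information estimate is the safer check.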
9. Cheatsheet¶
# Shape & types
df.shape; df.info(); df.dtypes; df.columns
df.isna().sum().sort_values(ascending=False)
df.duplicated().sum()
# Descriptive
df.describe()
df.describe(include="all")
df["col"].value_counts(normalize=True, dropna=False)
# Numeric univariate
df["col"].hist(); df["col"].plot.box(); df["col"].plot.kde()
df["col"].agg(["mean","median","std","skew","kurt"])
# Categorical univariate
df["col"].value_counts()
df["col"].nunique()
# Bivariate
df.corr(numeric_only=True)
df.corr(method="spearman", numeric_only=True)
df.groupby("cat_col")["num_col"].agg(["mean","median","count"])
pd.crosstab(df["a"], df["b"], normalize="index")
# Visuals (seaborn shorthand)
import seaborn as sns
sns.histplot(df["x"], kde=True)
sns.boxplot(data=df, x="cat", y="num")
sns.scatterplot(data=df, x="a", y="b", hue="c")
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", center=0)
sns.pairplot(df[cols], hue="target")
# Auto-report
from ydata_profiling import ProfileReport
ProfileReport(df, minimal=True).to_file("eda.html")
10. Q&A — recall test¶
- Q: What does df.describe() tell you that df.head() doesn't? A: Summary statistics: count, mean, std, min/max, quartiles. head() is just the first rows; describe() summarizes the distribution.
- Q: Mean > median by a lot. What's the data likely telling you? A: Right-skewed distribution. Long tail of high values. Often a candidate for log or power transform.
- Q: When does Pearson correlation mislead? A: Non-linear relationships. Two strongly related variables can have ~0 Pearson correlation if the relationship is U-shaped, and an understated one if it is sigmoid. Spearman handles monotonic curves; for non-monotonic shapes, use scatter plots or mutual information.
- Q: How do you check class balance? A: df["target"].value_counts(normalize=True) gives proportions per class.
- Q: What signals a redundant feature? A: |corr| > 0.9 with another feature (redundant), or near-zero variance (df[col].nunique() <= 1 or df[col].std() ≈ 0), which makes the feature uninformative.
- Q: Should EDA touch the test set? A: No, only train. Looking at test introduces subtle leakage; you may unconsciously tune choices to features that happen to look promising in test.
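The near-zero-variance check from the Q&A can be sketched for a whole frame at once. The toy columns and the 0.85 "dominant value" cutoff are assumptions for illustration, not a standard threshold.

```python
import pandas as pd

# Hypothetical frame with one constant and one near-constant column.
frame = pd.DataFrame({
    "country": ["US"] * 8,                        # constant
    "is_test_account": [0, 0, 0, 0, 0, 0, 0, 1],  # near-constant
    "tenure_months": [1, 5, 12, 3, 24, 8, 17, 2],
})

# Strictly constant columns: at most one distinct value (NaN counted as a value).
constant = [c for c in frame.columns if frame[c].nunique(dropna=False) <= 1]

# Share of rows taken by the single most common value in each column.
top_share = frame.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
near_constant = top_share[top_share >= 0.85].index.tolist()

print("constant:", constant)
print("near-constant:", near_constant)
```

The share-of-top-value test works for categorical and numeric columns alike, whereas std() only applies to numeric ones.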
Practice¶
What does this print?
Expected: 4
Compute the correlation between numeric columns only (without erroring on the string column)
Expected: (2, 2)
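One possible solution to the second exercise, assuming a frame with two numeric columns and one string column (the exercise's own data is not shown here, so this frame is a stand-in):

```python
import pandas as pd

# Stand-in data: two numeric columns plus one string column.
practice = pd.DataFrame({
    "age": [22, 35, 41, 29],
    "charges": [1200.0, 3400.0, 5100.0, 2100.0],
    "smoker": ["no", "yes", "yes", "no"],
})

# numeric_only=True skips the string column instead of raising on it.
corr = practice.corr(numeric_only=True)
print(corr.shape)  # (2, 2)
```

practice.select_dtypes("number").corr() gives the same matrix and also works on pandas versions that predate the numeric_only argument.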
Quiz — Quick check¶
What you remember
Q1. Which method gives count, mean, std, min, max, and quartiles in one shot?
- df.info()
- df.describe()
- df.head()
- df.dtypes
Why: describe() is the go-to EDA summary for numeric columns. Pass include="all" to also profile object columns.
Q2. What's a "target leakage" feature you might find during EDA?
- A feature with missing values
- A feature that's a near-perfect predictor because it contains information about the target that wouldn't be available at prediction time
- A feature with high cardinality
- A feature with outliers
Why: Example: predicting whether a sale will close, with a commission_paid feature. That feature only exists after the sale closed, so it leaks future information. Features that look "too good to be true" usually are.
Q3. Why visualize the distribution of each feature?
- To make the report look pretty
- To spot skew, outliers, multimodality, and decide what transformations might help
- To save memory
- Required by sklearn
Why: Histograms catch right-skewed features (log-transform candidates), bimodal distributions (suggesting hidden subgroups), and outliers. EDA is mostly visualization.
Common doubts¶
Should I do EDA on the train set only or on the whole dataset?
Train set only. Looking at test data during EDA introduces "researcher leakage" — you may unconsciously prefer choices that happen to work well on test. Strict separation of train/test from day one.
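In practice this means splitting before the first plot. A minimal pandas-only sketch, with a made-up frame and an arbitrary 80/20 split:

```python
import pandas as pd

# Hypothetical dataset: 100 rows with a binary target.
full = pd.DataFrame({"feature": range(100), "churned": [0, 1] * 50})

# Hold out 20% up front; the fixed random_state keeps the split reproducible.
train = full.sample(frac=0.8, random_state=42)
test = full.drop(train.index)

# From here on, every EDA call uses train only.
print(train.shape, test.shape)
print(train["churned"].value_counts(normalize=True))
```

For imbalanced targets a stratified split is usually preferable, for example scikit-learn's train_test_split with the stratify argument.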
How long should EDA take?
Often 30-50% of the total project time. It's the highest-leverage activity — every modeling decision (which features to engineer, which model to try, what metric to optimize) flows from understanding the data. Skipping EDA usually means rebuilding the project later.
What's the most useful single chart for EDA?
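For tabular data with a manageable number of columns: pairwise scatter plots colored by target (sns.pairplot(df, hue="target")). Shows feature distributions, pairwise relationships, and how well the classes separate, all in one view. With a continuous target, hue is harder to read, so plot each feature against the target instead.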