Machine Learning

Status

✅ Notes complete — 14 subtopics distilled from the CampusX 100 Days of ML playlist, day-by-day repo, and scikit-learn docs.

Reading order

The chapters follow the ML workflow from top to bottom, so read them in order the first time through. After that, skip to whichever chapter matches what you're working on.

#	Topic	Read
1	Introduction — types of ML, end-to-end workflow	01-introduction
2	Data loading — CSV, JSON, SQL, API, scraping	02-data-loading
3	Exploratory Data Analysis	03-eda
4	Feature Scaling — Standard / MinMax / Robust	04-feature-scaling
5	Encoding categorical features	05-encoding
6	Pipelines & Transformers	06-pipelines
7	Missing data — imputation, KNN, iterative	07-missing-data
8	Outliers — Z-score, IQR, percentiles	08-outliers
9	Feature engineering & PCA	09-feature-engineering
10	Linear regression & gradient descent	10-linear-regression
11	Polynomial regression & regularization (Ridge/Lasso/ElasticNet)	11-polynomial-and-regularization
12	Logistic regression & classification metrics	12-logistic-regression-and-classification
13	Ensembles — Random Forest, AdaBoost, GBM, Stacking, XGBoost, LightGBM	13-ensembles
14	Unsupervised — K-Means, DBSCAN, hierarchical	14-unsupervised-and-clustering

The big picture

flowchart TB
    subgraph Data [Data Layer 1-3]
      A1[Intro] --> A2[Load] --> A3[EDA]
    end
    subgraph Prep [Feature Prep 4-9]
      B1[Scaling] --> B2[Encoding] --> B3[Pipelines]
      B3 --> B4[Missing] --> B5[Outliers] --> B6[Feat. Eng + PCA]
    end
    subgraph Models [Supervised 10-13]
      C1[Linear] --> C2[Polynomial + Reg]
      C2 --> C3[Logistic / Class] --> C4[Ensembles]
    end
    subgraph U [Unsupervised 14]
      D1[K-Means / DBSCAN]
    end
    A3 -.-> B1
    B6 -.-> C1
    B6 -.-> D1

Learning roadmap

What I should be able to do after this topic

Frame any problem as supervised, unsupervised, or something else (for example semi-supervised or reinforcement learning).
Build a leak-free end-to-end pipeline in scikit-learn from data load to deployable model.
Pick the right metric for the problem (and explain why accuracy alone is often misleading, for example on imbalanced classes).
Build, evaluate, and tune a baseline + competitive model for any tabular dataset.
Spot data leakage, overfitting, and class-imbalance issues in someone else's notebook.
Choose between linear / tree-based / boosting / clustering based on the task.
Interpret feature importances responsibly (permutation, SHAP).
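
Several of the goals above (a leak-free pipeline, a metric other than accuracy, a baseline to compare against, permutation importance) can be sketched together in a few lines of scikit-learn. This is a minimal illustration, not the method from the notes: the toy data is synthetic, and the column names (`age`, `income`, `city`), the logistic-regression model, and the choice of balanced accuracy are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, imbalanced toy data (hypothetical columns, for illustration only).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
X.loc[rng.choice(n, 50, replace=False), "income"] = np.nan  # inject missing values
y = (X["age"] + rng.normal(0, 5, n) > 55).astype(int)  # positives are the minority

# All preprocessing lives inside the pipeline, so imputation and scaling
# statistics are learned from the training split only (no leakage from test).
numeric, categorical = ["age", "income"], ["city"]
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_train, y_train)

# Baseline: always predict the majority class seen in training.
majority = y_train.mode()[0]
baseline_pred = np.full(len(y_test), majority)
model_pred = model.predict(X_test)

baseline_acc = accuracy_score(y_test, baseline_pred)
baseline_bal = balanced_accuracy_score(y_test, baseline_pred)
model_bal = balanced_accuracy_score(y_test, model_pred)
print(f"baseline accuracy:          {baseline_acc:.3f}")
print(f"baseline balanced accuracy: {baseline_bal:.3f}")
print(f"model balanced accuracy:    {model_bal:.3f}")

# Permutation importance on held-out data, scored with the same metric.
result = permutation_importance(
    model, X_test, y_test,
    scoring="balanced_accuracy", n_repeats=10, random_state=0,
)
importances = pd.Series(result.importances_mean, index=X.columns)
print(importances.sort_values(ascending=False))
```

The majority-class baseline scores high on plain accuracy simply because positives are rare, while its balanced accuracy is no better than chance. That gap is the usual argument for picking the metric before comparing models.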