Machine Learning
Status
✅ Notes complete — 14 subtopics distilled from the CampusX 100 Days of ML playlist, day-by-day repo, and scikit-learn docs.
Reading order
The chapters follow the ML workflow from top to bottom, so read them in order the first time through. After that, skip to whichever chapter matches what you're working on.
| # | Topic | Read |
|---|---|---|
| 1 | Introduction — types of ML, end-to-end workflow | 01-introduction |
| 2 | Data loading — CSV, JSON, SQL, API, scraping | 02-data-loading |
| 3 | Exploratory Data Analysis | 03-eda |
| 4 | Feature Scaling — Standard / MinMax / Robust | 04-feature-scaling |
| 5 | Encoding categorical features | 05-encoding |
| 6 | Pipelines & Transformers | 06-pipelines |
| 7 | Missing data — imputation, KNN, iterative | 07-missing-data |
| 8 | Outliers — Z-score, IQR, percentiles | 08-outliers |
| 9 | Feature engineering & PCA | 09-feature-engineering |
| 10 | Linear regression & gradient descent | 10-linear-regression |
| 11 | Polynomial regression & regularization (Ridge/Lasso/ElasticNet) | 11-polynomial-and-regularization |
| 12 | Logistic regression & classification metrics | 12-logistic-regression-and-classification |
| 13 | Ensembles — Random Forest, AdaBoost, GBM, Stacking, XGBoost, LightGBM | 13-ensembles |
| 14 | Unsupervised — K-Means, DBSCAN, hierarchical | 14-unsupervised-and-clustering |
The big picture
flowchart TB
    subgraph Data [Data Layer 1-3]
        A1[Intro] --> A2[Load] --> A3[EDA]
    end
    subgraph Prep [Feature Prep 4-9]
        B1[Scaling] --> B2[Encoding] --> B3[Pipelines]
        B3 --> B4[Missing] --> B5[Outliers] --> B6[Feat. Eng + PCA]
    end
    subgraph Models [Supervised 10-13]
        C1[Linear] --> C2[Polynomial + Reg]
        C2 --> C3[Logistic / Class] --> C4[Ensembles]
    end
    subgraph U [Unsupervised 14]
        D1[K-Means / DBSCAN]
    end
    A3 -.-> B1
    B6 -.-> C1
    B6 -.-> D1
Learning roadmap
- ML mental model — features, labels, train/val/test, leakage
- Loading data from any common source (CSV, JSON, SQL, REST, scraping)
- Train/val/test splits, stratified, cross-validation
- Supervised: regression — linear, polynomial, Ridge, Lasso, ElasticNet
- Supervised: classification — logistic regression, metrics, threshold tuning
- Ensembles — Random Forest, AdaBoost, Gradient Boosting, XGBoost, LightGBM, Stacking
- Unsupervised: clustering (K-Means, DBSCAN, hierarchical), dimensionality reduction (PCA)
- Model evaluation — choosing the right metric for the task
- Overfitting, regularization, bias-variance
- Feature engineering, scaling, encoding, missing-data handling, outlier treatment
- Sklearn Pipeline & ColumnTransformer: leak-proof production patterns
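As a rough illustration of the "leak-proof" pattern named in the last roadmap item, the sketch below wraps imputation, scaling, and encoding inside a scikit-learn `Pipeline` with a `ColumnTransformer`, then cross-validates the whole object. Because preprocessing is fitted inside each fold, statistics such as the median or the scaler's mean never see the held-out rows. The toy data, column names (`age`, `income`, `city`), and the choice of `LogisticRegression` are assumptions made for the example, not something taken from the notes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names and values are invented for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# Inject some missing values so the imputer has work to do.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
# Synthetic binary target loosely driven by income.
y = (df["income"] + rng.normal(0, 5_000, n) > 50_000).astype(int)

numeric = ["age", "income"]
categorical = ["city"]

# Numeric columns: impute, then scale. Categorical columns: one-hot encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Preprocessing and model live in one estimator, so every CV fold
# refits the imputer, scaler, and encoder on its own training rows only.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, df, y, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.3f}")
```

The leaky alternative would be to call `fit_transform` on the full dataset first and split afterwards; keeping every fitted step inside the pipeline is what avoids that.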
What I should be able to do after this topic
- Frame any problem as supervised / unsupervised / something else.
- Build a leak-free end-to-end pipeline in scikit-learn from data load to deployable model.
- Pick the right metric for the problem (and explain why accuracy alone is often misleading, for example on imbalanced classes).
- Build, evaluate, and tune a baseline + competitive model for any tabular dataset.
- Spot data leakage, overfitting, and class-imbalance issues in someone else's notebook.
- Choose between linear / tree-based / boosting / clustering based on the task.
- Interpret feature importances responsibly (permutation, SHAP).
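To make the metric-selection goal above concrete, here is a small hand-rolled sketch (plain Python, no libraries) of how accuracy can look good on an imbalanced problem while the model is useless for the minority class. The 95/5 class split and the always-predict-majority "model" are assumptions chosen for illustration.

```python
# Hypothetical labels: 95 negatives and 5 positives (imbalanced binary problem).
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Treat undefined ratios (zero denominator) as 0.0 in this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.95 0.0 0.0
```

The 95% accuracy comes entirely from the class ratio; recall of 0.0 shows that not a single positive case was found, which is why precision, recall, or a related metric is usually the better choice here.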