Machine Learning

Status

Notes complete — 14 subtopics distilled from the CampusX 100 Days of ML playlist, day-by-day repo, and scikit-learn docs.

Reading order

Read in order: the chapters follow the ML workflow from top to bottom. If you are working on a specific problem, skip to the chapter that matches it.

| # | Topic | Read |
|---|-------|------|
| 1 | Introduction: types of ML, end-to-end workflow | 01-introduction |
| 2 | Data loading: CSV, JSON, SQL, API, scraping | 02-data-loading |
| 3 | Exploratory Data Analysis | 03-eda |
| 4 | Feature Scaling: Standard / MinMax / Robust | 04-feature-scaling |
| 5 | Encoding categorical features | 05-encoding |
| 6 | Pipelines & Transformers | 06-pipelines |
| 7 | Missing data: imputation, KNN, iterative | 07-missing-data |
| 8 | Outliers: Z-score, IQR, percentiles | 08-outliers |
| 9 | Feature engineering & PCA | 09-feature-engineering |
| 10 | Linear regression & gradient descent | 10-linear-regression |
| 11 | Polynomial regression & regularization (Ridge/Lasso/ElasticNet) | 11-polynomial-and-regularization |
| 12 | Logistic regression & classification metrics | 12-logistic-regression-and-classification |
| 13 | Ensembles: Random Forest, AdaBoost, GBM, Stacking, XGBoost, LightGBM | 13-ensembles |
| 14 | Unsupervised: K-Means, DBSCAN, hierarchical | 14-unsupervised-and-clustering |

The big picture

flowchart TB
    subgraph Data [Data Layer 1-3]
      A1[Intro] --> A2[Load] --> A3[EDA]
    end
    subgraph Prep [Feature Prep 4-9]
      B1[Scaling] --> B2[Encoding] --> B3[Pipelines]
      B3 --> B4[Missing] --> B5[Outliers] --> B6[Feat. Eng + PCA]
    end
    subgraph Models [Supervised 10-13]
      C1[Linear] --> C2[Polynomial + Reg]
      C2 --> C3[Logistic / Class] --> C4[Ensembles]
    end
    subgraph U [Unsupervised 14]
      D1[K-Means / DBSCAN]
    end
    A3 -.-> B1
    B6 -.-> C1
    B6 -.-> D1

Learning roadmap

  • ML mental model — features, labels, train/val/test, leakage
  • Loading data from any common source (CSV, JSON, SQL, REST, scraping)
  • Train/val/test splits, stratified sampling, cross-validation
  • Supervised: regression — linear, polynomial, Ridge, Lasso, ElasticNet
  • Supervised: classification — logistic regression, metrics, threshold tuning
  • Ensembles — Random Forest, AdaBoost, Gradient Boosting, XGBoost, LightGBM, Stacking
  • Unsupervised: clustering (K-Means, DBSCAN, hierarchical), dimensionality reduction (PCA)
  • Model evaluation — choosing the right metric for the task
  • Overfitting, regularization, bias-variance
  • Feature engineering, scaling, encoding, missing-data handling, outlier treatment
  • Sklearn Pipeline & ColumnTransformer — leak-proof production patterns
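
The last roadmap item can be sketched as follows. This is a minimal illustration rather than the pattern used in the chapters themselves: the toy DataFrame, its column names (`age`, `income`, `city`), and the choice of logistic regression are all assumptions made for the example. The point is that imputation, scaling, and encoding live inside the pipeline, so cross-validation refits them on each training fold and never sees the held-out fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: names and distributions are invented for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# Inject some missing values so the imputer has work to do.
df.loc[rng.choice(n, 20, replace=False), "income"] = np.nan
# Binary target loosely driven by age, plus noise.
y = (df["age"] + rng.normal(0, 5, n) > 40).astype(int)

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encode, ignoring categories unseen at fit time.
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

# Preprocessing and model are one estimator, so each CV fold fits
# the imputer, scaler, and encoder on that fold's training rows only.
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, df, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```

Calling `StandardScaler().fit(df[...])` on the full dataset before splitting would leak test-fold statistics into training; wrapping the steps in one `Pipeline` is what avoids that here.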

What I should be able to do after this topic

  • Frame any problem as supervised / unsupervised / something else.
  • Build a leak-free end-to-end pipeline in scikit-learn from data load to deployable model.
  • Pick the right metric for the problem (and explain why accuracy is often misleading, for example on imbalanced classes).
  • Build, evaluate, and tune a baseline + competitive model for any tabular dataset.
  • Spot data leakage, overfitting, and class-imbalance issues in someone else's notebook.
  • Choose between linear / tree-based / boosting / clustering based on the task.
  • Interpret feature importances responsibly (permutation, SHAP).
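
To make the point about metrics and class imbalance concrete, here is a small dependency-free sketch. The labels and both sets of predictions are made up for illustration, and `binary_metrics` is a hypothetical helper, not a library function. It compares a model that always predicts the majority class with an imperfect model that actually finds most positives.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Invented imbalanced dataset: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A: always predicts the majority (negative) class.
always_negative = [0] * 100

# Model B: catches 4 of the 5 positives but raises 10 false alarms.
imperfect = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

baseline = binary_metrics(y_true, always_negative)
candidate = binary_metrics(y_true, imperfect)

print(baseline)   # accuracy 0.95, precision 0.0, recall 0.0
print(candidate)  # accuracy 0.89, precision ~0.286, recall 0.8
```

Model A wins on accuracy while finding no positives at all, which is why precision, recall, or a related metric is usually the better choice when the positive class is rare.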