Homemade Machine Learning: ML algorithms from scratch in Python

Python implementations of popular machine learning algorithms from scratch with interactive Jupyter notebooks and mathematical explanations. Learn the fundamentals by building ML algorithms yourself.

  1. Step 1

    What is Homemade Machine Learning?

    Homemade Machine Learning is an educational repository by Oleksii Trekhleb (@trekhleb) that implements popular machine learning algorithms from scratch in Python. Unlike typical ML tutorials that rely on library one-liners, this project focuses on understanding the mathematics and fundamentals behind each algorithm.

    The repository contains:

    • Python implementations of ML algorithms built on NumPy (no high-level ML libraries)
    • Interactive Jupyter Notebook demos for hands-on experimentation
    • Mathematical explanations and theory for each algorithm
    • Real-world datasets and visualization examples
    • Support for both supervised and unsupervised learning

    Key Learning Value:

    • Understand the math behind ML algorithms
    • See how algorithms work step-by-step
    • Experiment with training data and hyperparameters in real-time
    • Build intuition before using production ML libraries

    Note: These implementations are intentionally educational and not optimized for production use. For production ML, use established libraries like scikit-learn, TensorFlow, or PyTorch.

  2. Step 2

    Repository architecture

    The repository follows a clean structure that separates algorithm implementations, interactive demos, and supporting data:

    Core Structure:

    homemade-machine-learning/
    ├── homemade/               # Algorithm implementations
    │   ├── linear_regression/
    │   ├── logistic_regression/
    │   ├── k_means/
    │   ├── neural_network/
    │   ├── anomaly_detection/
    │   └── utils/             # Shared utilities (features, hypothesis, etc.)
    ├── notebooks/             # Interactive Jupyter demos
    │   ├── linear_regression/
    │   ├── logistic_regression/
    │   ├── k_means/
    │   ├── neural_network/
    │   └── anomaly_detection/
    ├── data/                  # Training datasets (CSV files)
    └── images/                # Documentation assets
    

    Organization Pattern: Each algorithm follows the same pattern:

    1. Implementation in homemade/<algorithm>/ with math documentation
    2. One or more demo notebooks in notebooks/<algorithm>/
    3. Datasets in data/ referenced by notebooks

    This structure makes it easy to:

    • Navigate between theory (code) and practice (notebooks)
    • Run demos independently
    • Compare different algorithm approaches
  3. Step 3

    Technology stack

    The project uses a minimal, focused tech stack centered on scientific Python libraries for numerical computing and visualization.

    Core Language:

    • Python 3.6+ (developed against 3.6; the pinned 2018-era package versions listed below may need updating on much newer interpreters)

    Scientific Computing:

    • NumPy 1.15.3 — Core numerical computing library for matrix operations, linear algebra, and vectorized calculations. The foundation of all algorithm implementations.
    • Pandas 0.23.4 — Data manipulation and CSV reading for dataset loading
    • SciPy 1.1.0 — Scientific computing utilities (optimization, statistics)

    Visualization:

    • Matplotlib 3.0.1 — Primary plotting library for 2D charts, scatter plots, decision boundaries
    • Plotly 3.4.1 — Interactive 3D visualizations and advanced plots

    Development Tools:

    • Jupyter 1.0.0 — Interactive notebook environment for running demos
    • Pylint 2.1.1 — Python linter for code quality (configured via pylintrc)

    CI/CD:

    • Travis CI — Automated linting on commits (configured via .travis.yml)

    Why This Stack? The deliberately minimal dependencies keep the focus on algorithmic fundamentals rather than framework abstractions. NumPy provides the numerical primitives (matrix multiplication, vectorized arithmetic, etc.) on top of which the cost functions and gradients are implemented, while Jupyter enables interactive experimentation.
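
    As a small illustration of what these primitives provide, the sketch below computes linear predictions for every example twice: once with an explicit Python loop and once with a single matrix product. The data and the theta values are made up for the example and are not taken from the repository.

```python
import numpy as np

# Hypothetical toy data: 3 examples, 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
theta = np.array([[0.5], [1.0], [-1.0]])  # bias weight + one weight per feature

# Prepend a column of ones so the bias is handled by the same matrix product.
X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])

# Loop version: one weighted sum per example.
loop_predictions = np.array(
    [[sum(w * v for w, v in zip(theta[:, 0], row))] for row in X_with_bias]
)

# Vectorized version: a single matrix multiplication for all examples.
vectorized_predictions = X_with_bias @ theta

print(vectorized_predictions.ravel())
```

    Both versions give the same numbers; the vectorized form is shorter and lets NumPy do the arithmetic in compiled code.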

    Tech Stack:
    
    ├── Python 3.6+
    ├── NumPy 1.15.3        (matrix ops, linear algebra)
    ├── Pandas 0.23.4       (data loading)
    ├── SciPy 1.1.0         (optimization, statistics)
    ├── Matplotlib 3.0.1    (2D plotting)
    ├── Plotly 3.4.1        (3D visualization)
    ├── Jupyter 1.0.0       (notebooks)
    └── Pylint 2.1.1        (code quality)
  4. Step 4

    Installation and setup

    Getting started with Homemade Machine Learning requires Python 3.6+ and the scientific computing dependencies listed in requirements.txt.

    Clone the Repository:

    git clone https://github.com/trekhleb/homemade-machine-learning.git
    cd homemade-machine-learning
    

    Create a Virtual Environment (Recommended):

    # Using venv (Python 3.6+)
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    

    Install Dependencies:

    pip install -r requirements.txt
    

    This installs:

    • jupyter (notebook environment)
    • matplotlib (visualization)
    • numpy (numerical computing)
    • pandas (data manipulation)
    • plotly (interactive plots)
    • scipy (scientific utilities)
    • pylint (linting)

    Verify Installation:

    python -c "import numpy, pandas, matplotlib, jupyter; print('All dependencies installed!')"
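
    The one-liner above stops at the first missing package. A slightly longer check, sketched below, reports every package and its version. The helper name check_modules and the module list are assumptions made for this example, not part of the repository.

```python
import importlib

# Import names of the packages this guide lists as dependencies.
REQUIRED_MODULES = ["numpy", "pandas", "scipy", "matplotlib", "plotly", "jupyter"]


def check_modules(module_names):
    """Return a dict mapping each module name to its version, or None if missing."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module exposes __version__, so fall back to a placeholder.
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for name, version in check_modules(REQUIRED_MODULES).items():
        status = "MISSING" if version is None else version
        print(f"{name:<12} {status}")
```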
    
  5. Step 5

    Launching Jupyter notebooks

    The repository includes 11 interactive Jupyter notebooks that demonstrate each algorithm with real datasets.

    Launch Jupyter Locally:

    # From the repository root
    jupyter notebook
    

    This starts the Jupyter server and opens your browser at http://localhost:8888. Navigate to the notebooks/ folder to access the demos.

    Online Options (No Installation Required):

    1. NBViewer (Read-Only Preview):

      • Fast online preview of notebooks
      • View code, charts, and results
      • Cannot modify or run code
      • All demo links in the README point to NBViewer
    2. Binder (Interactive):

      • Full interactive notebook environment in your browser
      • Can modify code and re-run cells
      • Click the "Execute on Binder" button on any NBViewer page
      • Takes ~2 minutes to build the environment

    Notebook Organization: Notebooks are grouped by algorithm type:

    • linear_regression/ — 3 demos (univariate, multivariate, non-linear)
    • logistic_regression/ — 4 demos (linear boundary, non-linear, MNIST, Fashion MNIST)
    • k_means/ — 1 demo (Iris clustering)
    • neural_network/ — 2 demos (MNIST, Fashion MNIST)
    • anomaly_detection/ — 1 demo (Gaussian distribution)

    # Launch a specific notebook directly
    jupyter notebook notebooks/linear_regression/univariate_linear_regression_demo.ipynb
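
    To see which demos are present in a local clone, a short script can count the notebooks in each algorithm folder. This is a minimal sketch that assumes it is run from the repository root; the function name count_notebooks is made up for the example.

```python
from pathlib import Path


def count_notebooks(notebooks_dir):
    """Return {subfolder name: number of .ipynb files} for a notebooks directory."""
    counts = {}
    for path in sorted(Path(notebooks_dir).glob("*/*.ipynb")):
        counts[path.parent.name] = counts.get(path.parent.name, 0) + 1
    return counts


if __name__ == "__main__":
    # Assumes the current working directory is the repository root.
    for folder, count in count_notebooks("notebooks").items():
        print(f"{folder}: {count} notebook(s)")
```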
  6. Step 6

    Algorithm implementations overview

    The repository implements algorithms across three main categories:

    Supervised Learning

    Regression (predicting continuous values):

    • Linear Regression — Draw a line/plane through data points
      • Univariate: Single feature prediction
      • Multivariate: Multiple features
      • Non-linear: Polynomial and sinusoid features
      • Use cases: Stock prices, sales forecasting, trend analysis

    Classification (categorizing data into classes):

    • Logistic Regression — Binary and multi-class classification
      • Linear boundaries for simple separation
      • Non-linear boundaries using feature engineering
      • Multivariate for high-dimensional data (MNIST digits, Fashion MNIST)
      • Use cases: Spam detection, language detection, image recognition
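
    The core of logistic regression fits in a few lines: a sigmoid turns a linear score into a probability, and cross-entropy measures how far those probabilities are from the labels. The sketch below uses made-up data and hand-picked parameters; it is not the repository's implementation.

```python
import numpy as np


def sigmoid(z):
    """Map any real value to the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy_cost(X, y, theta):
    """Average cross-entropy between labels y (0 or 1) and predicted probabilities."""
    probabilities = sigmoid(X @ theta)
    eps = 1e-12  # avoids log(0)
    return float(-np.mean(y * np.log(probabilities + eps)
                          + (1 - y) * np.log(1 - probabilities + eps)))


# Hypothetical toy data: a bias column plus one feature, two classes.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[0.0], [0.0], [1.0], [1.0]])

untrained = np.zeros((2, 1))            # predicts 0.5 for every example
separating = np.array([[0.0], [3.0]])   # puts the decision boundary at feature = 0

print(cross_entropy_cost(X, y, untrained))   # close to ln(2)
print(cross_entropy_cost(X, y, separating))  # much smaller
```

    Training amounts to searching for the theta that makes this cost as small as possible.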

    Unsupervised Learning

    Clustering (grouping similar data):

    • K-means — Partition data into K clusters
      • Iterative centroid refinement
      • Demo: Iris flower clustering
      • Use cases: Market segmentation, image compression, data analysis
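
    The iterative centroid refinement can be sketched as two alternating steps: assign each example to its nearest centroid, then move each centroid to the mean of its assigned examples. The function below is a simplified illustration on made-up 2D points, with a fixed iteration count and random initialization chosen for the example.

```python
import numpy as np


def k_means(X, num_clusters, num_iterations=10, seed=0):
    """Plain K-means: returns (centroids, cluster index for each example)."""
    rng = np.random.RandomState(seed)
    # Start from randomly chosen training examples.
    centroids = X[rng.choice(X.shape[0], num_clusters, replace=False)].astype(float)

    for _ in range(num_iterations):
        # Assignment step: distance from every example to every centroid.
        differences = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
        labels = np.argmin(np.linalg.norm(differences, axis=2), axis=1)

        # Update step: move each centroid to the mean of its assigned examples.
        for k in range(num_clusters):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)

    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2D points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
centroids, labels = k_means(X, num_clusters=2)
print(labels)
```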

    Anomaly Detection (identifying outliers):

    • Gaussian Distribution — Statistical anomaly detection
      • Model normal behavior with Gaussian distribution
      • Flag rare events based on probability threshold
      • Demo: Server monitoring (latency, throughput)
      • Use cases: Fraud detection, intrusion detection, system health
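
    A minimal sketch of the Gaussian approach: estimate a mean and variance per feature from normal examples, multiply the per-feature densities, and flag anything whose probability falls below a threshold. The server metrics and the threshold value epsilon below are invented for illustration.

```python
import numpy as np


def fit_gaussian(X):
    """Estimate the per-feature mean and variance of the 'normal' examples."""
    return X.mean(axis=0), X.var(axis=0)


def gaussian_probability(X, mean, variance):
    """Product of independent per-feature Gaussian densities for each example."""
    densities = np.exp(-((X - mean) ** 2) / (2 * variance)) / np.sqrt(2 * np.pi * variance)
    return densities.prod(axis=1)


# Hypothetical server metrics: columns are (latency, throughput).
normal_examples = np.array([[10.0, 100.0], [11.0, 98.0], [9.0, 102.0],
                            [10.5, 101.0], [9.5, 99.0]])
mean, variance = fit_gaussian(normal_examples)

new_examples = np.array([[10.2, 100.5],   # close to the training examples
                         [25.0, 40.0]])   # far from anything seen in training
probabilities = gaussian_probability(new_examples, mean, variance)

epsilon = 1e-3  # assumed threshold; in practice it is tuned on labelled examples
is_anomaly = probabilities < epsilon
print(is_anomaly)
```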

    Neural Networks

    • Multilayer Perceptron (MLP) — Feedforward neural network
      • Multiple hidden layers with activation functions
      • Backpropagation for training
      • Demos: Handwritten digit recognition, clothing classification
      • Use cases: General-purpose ML, image recognition, voice recognition
  7. Step 7

    Example: Linear regression walkthrough

    Let's walk through the univariate linear regression example to understand the structure.

    Implementation Location: homemade/linear_regression/linear_regression.py

    This file contains:

    • The LinearRegression class
    • Hypothesis function (linear equation)
    • Cost function (mean squared error)
    • Gradient descent optimization
    • Prediction method
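
    How these pieces fit together can be sketched in a single function: the hypothesis is a matrix product, the cost is the mean squared error, and gradient descent repeatedly steps against the gradient of that cost. This is a simplified stand-in written for this guide, not the repository's LinearRegression class, and the data is generated so the expected answer is known.

```python
import numpy as np


def train_linear_regression(X, y, alpha=0.1, num_iterations=500):
    """Batch gradient descent on the MSE cost; returns (theta, cost_history)."""
    num_examples = X.shape[0]
    X_b = np.hstack([np.ones((num_examples, 1)), X])  # bias column
    theta = np.zeros((X_b.shape[1], 1))
    cost_history = []

    for _ in range(num_iterations):
        errors = X_b @ theta - y                        # hypothesis minus labels
        cost_history.append(float((errors ** 2).sum() / (2 * num_examples)))
        gradient = (X_b.T @ errors) / num_examples      # derivative of the cost
        theta -= alpha * gradient

    return theta, cost_history


# Hypothetical data generated from y = 2x + 1, so the expected result is known.
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = 2.0 * X + 1.0

theta, cost_history = train_linear_regression(X, y, alpha=0.5, num_iterations=2000)
print(theta.ravel())                      # approaches [1, 2]
print(cost_history[0], cost_history[-1])  # the cost shrinks as training proceeds
```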

    Demo Notebook: notebooks/linear_regression/univariate_linear_regression_demo.ipynb

    What the Demo Does:

    1. Loads a dataset (country happiness scores vs GDP)
    2. Visualizes the raw data as a scatter plot
    3. Trains a linear regression model
    4. Plots the regression line through the data
    5. Shows cost function convergence over iterations
    6. Makes predictions on new data points

    Key Learning Points:

    • See gradient descent in action (cost decreasing)
    • Understand how learning rate affects convergence
    • Experiment with different features (polynomial, etc.)
    • Visualize overfitting vs underfitting

    Try It Yourself:

    jupyter notebook notebooks/linear_regression/univariate_linear_regression_demo.ipynb
    

    Modify the learning rate, iterations, or add polynomial features to see how the model changes.

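    # Example based on the implementation
    import pandas as pd
    from homemade.linear_regression import LinearRegression
    
    # Load data (GDP vs Happiness). The CSV has a header row and text columns,
    # so it is read with Pandas rather than np.loadtxt.
    # Column names are taken from the CSV header; adjust them if your copy differs.
    data = pd.read_csv('data/world-happiness-report-2017.csv')
    X = data[['Economy..GDP.per.Capita.']].values  # GDP column, shape (n, 1)
    y = data[['Happiness.Score']].values           # Happiness column, shape (n, 1)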
    
    # Train model
    model = LinearRegression(X, y)
    model.train(alpha=0.01, num_iterations=500)
    
    # Make predictions
    predictions = model.predict(X)
    
    # Visualize results (see notebook for full plotting code)
  8. Step 8

    Example: Neural network MNIST demo

    The neural network implementation showcases a more advanced algorithm with the classic MNIST handwritten digit recognition task.

    Implementation: homemade/neural_network/multilayer_perceptron.py

    Features:

    • Configurable layer architecture (input → hidden → output)
    • Sigmoid activation functions
    • Backpropagation for weight updates
    • Mini-batch gradient descent
    • Regularization support
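
    To make the layer architecture concrete, the sketch below runs a forward pass through a 784 → 128 → 10 network with sigmoid activations. It is an illustration only: the weight initialization range, the helper names, and the random input are assumptions for the example, and backpropagation is left out.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_weights(layers, epsilon=0.12, seed=0):
    """One weight matrix per layer transition, with an extra column for the bias unit."""
    rng = np.random.RandomState(seed)
    return [rng.uniform(-epsilon, epsilon, size=(n_out, n_in + 1))
            for n_in, n_out in zip(layers[:-1], layers[1:])]


def forward(X, weights):
    """Propagate a batch of examples through every layer; returns output activations."""
    activation = X
    for theta in weights:
        with_bias = np.hstack([np.ones((activation.shape[0], 1)), activation])
        activation = sigmoid(with_bias @ theta.T)
    return activation


layers = [784, 128, 10]                          # input, hidden, output sizes
weights = init_weights(layers)
batch = np.random.RandomState(1).rand(5, 784)    # 5 fake "images" with values in [0, 1)

output = forward(batch, weights)
print(output.shape)             # (5, 10): one score per digit class
print(output.argmax(axis=1))    # predicted class per example (untrained, so arbitrary)
```

    Training adjusts the weight matrices so that the largest output score lines up with the correct digit.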

    Demo Notebook: notebooks/neural_network/multilayer_perceptron_demo.ipynb

    Dataset:

    • 60,000 training images (28×28 pixels)
    • 10,000 test images
    • 10 digit classes (0-9)
    • Each image flattened to 784 features

    What You'll Learn:

    • How neural networks transform data through layers
    • Impact of hidden layer size on accuracy
    • Training progress visualization (accuracy over epochs)
    • Overfitting detection
    • Confusion matrix interpretation
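
    A confusion matrix is simple to build by hand, which helps when interpreting one. In the sketch below, row i and column j count the examples whose true class is i and predicted class is j; the labels are made up for a 3-class problem.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[i, j] counts examples whose true class is i and predicted class is j."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true, predicted] += 1
    return matrix


# Hypothetical labels for a 3-class problem.
true_labels = [0, 0, 1, 1, 2, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0, 2]

matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
accuracy = np.trace(matrix) / matrix.sum()   # correct predictions sit on the diagonal
print(matrix)
print(f"accuracy: {accuracy:.2%}")
```

    Off-diagonal cells show which classes are confused with each other, which a single accuracy number hides.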

    Typical Results:

    • Training accuracy: ~95-97%
    • Test accuracy: ~93-95%
    • Training time: 5-10 minutes (CPU)

    Experimentation Ideas:

    • Add more hidden layers
    • Change layer sizes (128 → 256 neurons)
    • Adjust learning rate
    • Enable/disable regularization
    • Compare with Fashion MNIST dataset
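
    # Neural network configuration example
    # X_train, y_train, X_test, y_test are assumed to be prepared as in the demo notebook.
    # Argument names follow this guide; check multilayer_perceptron.py for the exact signatures.
    from homemade.neural_network import MultilayerPerceptron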
    
    # Network architecture
    layers = [
        784,   # Input: 28×28 pixels flattened
        128,   # Hidden layer: 128 neurons
        10     # Output: 10 digit classes
    ]
    
    # Train model
    model = MultilayerPerceptron(X_train, y_train, layers)
    model.train(
        alpha=0.1,              # Learning rate
        lambda_param=0.0,       # Regularization
        num_iterations=500,     # Epochs
        batch_size=100          # Mini-batch size
    )
    
    # Evaluate
    accuracy = model.evaluate(X_test, y_test)
    print(f'Test accuracy: {accuracy:.2%}')
  9. Step 9

    Educational approach and learning path

    Homemade Machine Learning follows a pedagogical progression from simple to complex algorithms.

    Recommended Learning Path:

    1. Start with Linear Regression (Easiest)

      • Univariate demo first (single feature)
      • Then multivariate (multiple features)
      • Finally non-linear (feature engineering)
      • Builds intuition for cost functions and gradient descent
    2. Move to Logistic Regression

      • Linear boundary demo (natural extension of linear regression)
      • Non-linear boundary (feature engineering revisited)
      • Multivariate MNIST (high-dimensional classification)
    3. Explore Unsupervised Learning

      • K-means clustering (simpler than classification)
      • Anomaly detection (introduces probability distributions)
    4. Tackle Neural Networks (Most Complex)

      • Builds on all previous concepts
      • Combines gradient descent, classification, and feature learning
      • MNIST provides concrete benchmark

    Mathematical Prerequisites:

    • Linear algebra (matrices, vectors, dot products)
    • Calculus (derivatives, partial derivatives, chain rule)
    • Basic probability and statistics
    • Understanding of cost functions and optimization

    Course Reference: Most of the code and explanations are based on Andrew Ng's Machine Learning course (Coursera), which makes it easy to cross-reference them with the video lectures.

    Learning Progression:
    
    1. Linear Regression
       └─ Univariate → Multivariate → Non-linear
    
    2. Logistic Regression
       └─ Linear boundary → Non-linear → MNIST
    
    3. Unsupervised Learning
       ├─ K-means clustering
       └─ Anomaly detection
    
    4. Neural Networks
       └─ MLP → MNIST → Fashion MNIST
  10. Step 10

    Datasets included

    The repository includes several datasets in the data/ folder, most of them real-world:

    Regression Datasets:

    • World Happiness Report 2017 — Country happiness scores with economic indicators (GDP, freedom, generosity, etc.)
      • Used for: Linear regression demos
      • Features: Economy GDP, social support, life expectancy, freedom
      • Target: Happiness score

    Classification Datasets:

    • Iris Flower Dataset — Classic ML dataset with 3 flower species

      • 150 samples, 4 features (sepal/petal length and width)
      • Used for: Logistic regression, K-means clustering
    • MNIST Handwritten Digits — 70,000 grayscale images (60k train, 10k test)

      • 28×28 pixels per image
      • 10 classes (digits 0-9)
      • Used for: Logistic regression, neural networks
    • Fashion MNIST — Alternative to MNIST with clothing items

      • Same format as MNIST (28×28 grayscale)
      • 10 classes (t-shirt, trouser, dress, coat, sandal, etc.)
      • Used for: Logistic regression, neural networks

    Anomaly Detection:

    • Server Metrics — Synthetic dataset of server operational parameters
      • Features: Latency, throughput
      • Contains normal and anomalous behavior examples

    All datasets are loaded via NumPy or Pandas and include preprocessing examples in the notebooks.

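    # Example dataset loading patterns
    import pandas as pd
    
    # CSV loading with Pandas (the file has a header row and text columns)
    data = pd.read_csv('data/world-happiness-report-2017.csv')
    X = data[['Economy..GDP.per.Capita.', 'Freedom']].values  # Features
    y = data[['Happiness.Score']].values                      # Target
    
    # Alternative source for Iris: scikit-learn
    # (not in requirements.txt; install it separately)
    from sklearn import datasets
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # First 2 features for visualization
    y = iris.target
    
    # Alternative source for MNIST: Keras datasets
    # (not in requirements.txt; install it separately)
    from keras.datasets import mnist
    (X_train, y_train), (X_test, y_test) = mnist.load_data()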
  11. Step 11

    Code structure and utilities

    The homemade/utils/ directory contains shared utilities used across multiple algorithms:

    Features Module (features/):

    • prepare_for_training() — Normalize data and add bias column
    • normalize() — Feature scaling (zero mean, unit variance)
    • generate_polynomials() — Create polynomial features for non-linear regression
    • generate_sinusoids() — Create sinusoidal features
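
    What normalization and the bias column do to the data can be seen in a few lines of NumPy. The functions below are simplified stand-ins written for this guide, with made-up input; they are not the repository's own utilities.

```python
import numpy as np


def normalize_features(X):
    """Scale every column to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # avoid dividing by zero for constant columns
    return (X - mean) / std, mean, std


def add_bias_column(X):
    """Prepend a column of ones so the intercept is learned like any other weight."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


# Hypothetical features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

X_normalized, mean, std = normalize_features(X)
X_ready = add_bias_column(X_normalized)
print(X_ready)
```

    The returned mean and std need to be kept, because new examples must be scaled with the same values before prediction.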

    Hypothesis Module:

    • linear_hypothesis() — Linear prediction function
    • sigmoid() — Logistic activation function

    Cost Functions:

    • Mean Squared Error (regression)
    • Cross-Entropy Loss (classification)
    • Regularization terms

    Optimization:

    • Gradient descent implementation
    • Mini-batch gradient descent
    • Learning rate scheduling helpers
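
    Mini-batch gradient descent differs from the batch version mainly in how the data is fed to each update. A minimal sketch of that batching step, with a hypothetical helper name and made-up data, could look like this:

```python
import numpy as np


def iterate_minibatches(X, y, batch_size, seed=0):
    """Yield shuffled (X_batch, y_batch) pairs that together cover the data once."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch = order[start:start + batch_size]
        yield X[batch], y[batch]


# Hypothetical data: 10 examples, 3 features.
X = np.arange(30, dtype=float).reshape(10, 3)
y = np.arange(10, dtype=float).reshape(10, 1)

batch_sizes = [X_batch.shape[0] for X_batch, _ in iterate_minibatches(X, y, batch_size=4)]
print(batch_sizes)   # [4, 4, 2]: the last batch holds the remainder
```

    Each yielded pair would then drive one gradient update, so one pass over the generator corresponds to one epoch.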

    Plotting Utilities:

    • Decision boundary visualization
    • Cost function convergence plots
    • Confusion matrices
    • 3D surface plots for regression

    Why This Matters: Understanding these utilities is crucial because they reveal the common patterns across all ML algorithms (feature scaling, cost computation, gradient calculation). The main algorithm classes focus on the unique aspects while delegating these shared concerns to utilities.

    # Example utility usage
    # (X is assumed to be a NumPy array of shape (num_examples, num_features))
    from homemade.utils.features import generate_polynomials, prepare_for_training
    
    # Normalize features and add bias
    X_normalized, features_mean, features_std = prepare_for_training(X)
    
    # Generate polynomial features (degree 2)
    X_poly = generate_polynomials(X, polynomial_degree=2)
    
    # Common pattern in all algorithms:
    # 1. Prepare features (normalize + bias)
    # 2. Initialize parameters (theta)
    # 3. Compute cost and gradients
    # 4. Update parameters via gradient descent
    # 5. Repeat until convergence
  12. Step 12

    Development and testing

    The repository includes development tooling for code quality and testing.

    Linting with Pylint: The project uses Pylint with a custom configuration (pylintrc) to maintain code quality.

    # Run linter on all implementations
    pylint ./homemade
    

    The pylintrc file contains project-specific rules and is used in CI.

    Continuous Integration: Travis CI automatically runs linting on every commit.

    Configuration (.travis.yml):

    • Python 3.6 environment
    • Installs dependencies from requirements.txt
    • Runs pylint ./homemade
    • Email notifications disabled

    Testing Approach: While the repository doesn't include formal unit tests (pytest suite), testing happens through:

    1. Interactive notebook execution (visual validation)
    2. Algorithm convergence verification
    3. Accuracy metrics on known datasets
    4. Comparison with expected results from Andrew Ng's course
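
    Convergence verification can also be automated in a lightweight way. The sketch below is one possible check, not something the repository ships: it assumes a training routine that returns its cost history as a list, and the tolerance value is arbitrary.

```python
def assert_converged(cost_history, tolerance=1e-6):
    """Fail loudly if the cost ever increases or is still changing at the end."""
    for previous, current in zip(cost_history, cost_history[1:]):
        assert current <= previous + 1e-12, "cost increased: learning rate may be too high"
    assert abs(cost_history[-2] - cost_history[-1]) < tolerance, "cost has not flattened yet"


# Hypothetical cost history that halves on every iteration.
cost_history = [2.0 ** -i for i in range(40)]
assert_converged(cost_history)
print("converged")
```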

    Contributing Guidelines: See CONTRIBUTING.md for guidelines on:

    • Code style and formatting
    • Adding new algorithms
    • Improving documentation
    • Submitting issues and pull requests

    # Development workflow
    
    # 1. Install dev dependencies
    pip install -r requirements.txt
    
    # 2. Make changes to algorithm implementations
    vim homemade/linear_regression/linear_regression.py
    
    # 3. Test via notebook
    jupyter notebook notebooks/linear_regression/...
    
    # 4. Run linter
    pylint ./homemade
    
    # 5. Commit (Travis CI will lint automatically)
    git add .
    git commit -m "Improve gradient descent convergence"
    git push
  13. Step 13

    Related projects and alternatives

    Oleksii Trekhleb (@trekhleb) maintains several related educational ML projects:

    Other Projects by the Same Author:

    1. machine-learning-octave — Octave/MATLAB version of this repository

      • GitHub: https://github.com/trekhleb/machine-learning-octave
      • Uses Octave instead of Python
      • Follows the same educational approach
      • Direct implementations from Andrew Ng's course
    2. Interactive Machine Learning Experiments — Web-based ML playground

      • GitHub: https://github.com/trekhleb/machine-learning-experiments
      • Live demos in the browser
      • Uses TensorFlow.js
      • More visual, less mathematical
    3. Homemade GPT (JavaScript) — GPT implementation from scratch

      • GitHub: https://github.com/trekhleb/homemade-gpt-js
      • Focus on transformer architecture
      • TypeScript/JavaScript implementation

    When to Use Homemade Machine Learning:

    • Learning ML fundamentals from scratch
    • Understanding mathematical foundations
    • Preparing for ML interviews
    • Teaching ML concepts
    • Transitioning from theory (courses) to code

    When to Use Production Libraries Instead:

    • Building production ML systems
    • Working with large datasets
    • Deploying models to production
    • Time-critical development
    • Advanced deep learning architectures

    Production ML Libraries:

    • scikit-learn (classical ML)
    • TensorFlow / PyTorch (deep learning)
    • XGBoost / LightGBM (gradient boosting)
    • Keras (high-level neural networks)
  14. Step 14

    Resources and community

    Official Resources:

    • GitHub Repository: https://github.com/trekhleb/homemade-machine-learning
    • Author: Oleksii Trekhleb (@trekhleb)
    • License: MIT License (open source, commercial use allowed)
    • Stars: 23,000+ (as of 2024)

    Learning Resources:

    • Andrew Ng's ML Course: https://www.coursera.org/learn/machine-learning

      • Free course on Coursera
      • Most algorithms in this repo are based on this course
      • Highly recommended companion resource
    • NBViewer Links: Embedded in README for each algorithm

      • Fast read-only preview of notebooks
      • No installation required
    • Binder: Interactive notebook execution

      • Full Jupyter environment in browser
      • Click "Execute on Binder" in NBViewer

    Community Support:

    • GitHub Issues: Bug reports and feature requests
    • GitHub Discussions: Questions and community help
    • Pull Requests: Contributions welcome (see CONTRIBUTING.md)
    • Code of Conduct: See CODE_OF_CONDUCT.md

    Supporting the Project:

    • GitHub Sponsors: https://github.com/sponsors/trekhleb
    • Patreon: https://www.patreon.com/trekhleb

    The project is actively maintained with regular updates and improvements.

    Quick Links:
    
    Repository: https://github.com/trekhleb/homemade-machine-learning
    Author: @trekhleb
    License: MIT
    Stars: 23K+
    
    Learning:
    ├─ Andrew Ng Course: coursera.org/learn/machine-learning
    ├─ NBViewer: Read-only notebook previews
    └─ Binder: Run notebooks in browser
    
    Support:
    ├─ GitHub Issues (bugs)
    ├─ GitHub Discussions (questions)
    └─ Sponsors / Patreon (funding)
