Homemade Machine Learning: ML algorithms from scratch in Python

Python implementations of popular machine learning algorithms from scratch with interactive Jupyter notebooks and mathematical explanations. Learn the fundamentals by building ML algorithms yourself.

Step 1
What is Homemade Machine Learning?
Homemade Machine Learning is an educational repository by Oleksii Trekhleb (@trekhleb) that implements popular machine learning algorithms from scratch in Python. Unlike typical ML tutorials that rely on library one-liners, this project focuses on understanding the mathematics and fundamentals behind each algorithm.

The repository contains:
- Python implementations of ML algorithms built on NumPy (no high-level ML libraries)
- Interactive Jupyter Notebook demos for hands-on experimentation
- Mathematical explanations and theory for each algorithm
- Real-world datasets and visualization examples
- Support for both supervised and unsupervised learning
Key Learning Value:
- Understand the math behind ML algorithms
- See how algorithms work step-by-step
- Experiment with training data and hyperparameters in real-time
- Build intuition before using production ML libraries
Note: These implementations are intentionally educational and not optimized for production use. For production ML, use established libraries like scikit-learn, TensorFlow, or PyTorch.

Step 2

Repository architecture

The repository follows a clean structure that separates algorithm implementations, interactive demos, and supporting data:

Core Structure:

```
homemade-machine-learning/
├── homemade/               # Algorithm implementations
│   ├── linear_regression/
│   ├── logistic_regression/
│   ├── k_means/
│   ├── neural_network/
│   ├── anomaly_detection/
│   └── utils/             # Shared utilities (features, hypothesis, etc.)
├── notebooks/             # Interactive Jupyter demos
│   ├── linear_regression/
│   ├── logistic_regression/
│   ├── k_means/
│   ├── neural_network/
│   └── anomaly_detection/
├── data/                  # Training datasets (CSV files)
└── images/                # Documentation assets
```

Organization Pattern: Each algorithm follows the same pattern:

- Implementation in homemade/<algorithm>/ with math documentation
- One or more demo notebooks in notebooks/<algorithm>/
- Datasets in data/ referenced by notebooks

This structure makes it easy to:

- Navigate between theory (code) and practice (notebooks)
- Run demos independently
- Compare different algorithm approaches

Step 3
Technology stack
The project uses a minimal, focused tech stack centered on scientific Python libraries for numerical computing and visualization.

Core Language:
- Python 3.6+ (originally 3.6, compatible with newer versions)
Scientific Computing:
- NumPy 1.15.3 — Core numerical computing library for matrix operations, linear algebra, and vectorized calculations. The foundation of all algorithm implementations.
- Pandas 0.23.4 — Data manipulation and CSV reading for dataset loading
- SciPy 1.1.0 — Scientific computing utilities (optimization, statistics)
Visualization:
- Matplotlib 3.0.1 — Primary plotting library for 2D charts, scatter plots, decision boundaries
- Plotly 3.4.1 — Interactive 3D visualizations and advanced plots
Development Tools:
- Jupyter 1.0.0 — Interactive notebook environment for running demos
- Pylint 2.1.1 — Python linter for code quality (configured via pylintrc)
CI/CD:
- Travis CI — Automated linting on commits (configured via .travis.yml)
Why This Stack? The deliberately minimal dependencies keep the focus on algorithmic fundamentals rather than framework abstractions. NumPy provides the numerical primitives (matrix multiplication, vectorized element-wise operations, etc.), so gradients and cost functions are written out by hand, while Jupyter enables interactive experimentation.
```
Tech Stack:

├── Python 3.6+
├── NumPy 1.15.3        (matrix ops, linear algebra)
├── Pandas 0.23.4       (data loading)
├── SciPy 1.1.0         (optimization, statistics)
├── Matplotlib 3.0.1    (2D plotting)
├── Plotly 3.4.1        (3D visualization)
├── Jupyter 1.0.0       (notebooks)
└── Pylint 2.1.1        (code quality)
```

Step 4

Installation and setup

Getting started with Homemade Machine Learning requires Python 3.6+ and installing the scientific computing dependencies.

Clone the Repository:

```
git clone https://github.com/trekhleb/homemade-machine-learning.git
cd homemade-machine-learning
```

Create a Virtual Environment (Recommended):

```
# Using venv (Python 3.6+)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install Dependencies:

```
pip install -r requirements.txt
```

This installs:

- jupyter (notebook environment)
- matplotlib (visualization)
- numpy (numerical computing)
- pandas (data manipulation)
- plotly (interactive plots)
- scipy (scientific utilities)
- pylint (linting)

Verify Installation:

```
python -c "import numpy, pandas, matplotlib, jupyter; print('All dependencies installed!')"
```

```
# Clone repository
git clone https://github.com/trekhleb/homemade-machine-learning.git
cd homemade-machine-learning

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import numpy, pandas, matplotlib, jupyter; print('Ready!')"
```
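Because the stack lists specific pinned versions, it can help to see which versions actually ended up in the environment. The following is a minimal sketch, not part of the repository: the `installed_versions` helper and the `PACKAGES` list are hypothetical names, and `importlib.metadata` is only in the standard library from Python 3.8 onward.

```python
# Sketch: report installed versions of the packages named in the dependency list.
# Requires Python 3.8+ for importlib.metadata.
from importlib import metadata

# Distribution names taken from the dependency list in this guide.
PACKAGES = ["numpy", "pandas", "scipy", "matplotlib", "plotly", "jupyter", "pylint"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:<12} {version or 'NOT INSTALLED'}")
```

Comparing the printed versions with the pinned ones is a quick way to spot a mismatch before running the notebooks.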

Step 5
Launching Jupyter notebooks
The repository includes 11 interactive Jupyter notebooks that demonstrate each algorithm with real datasets.

Launch Jupyter Locally:
```
# From the repository root
jupyter notebook
```
This starts the Jupyter server and opens your browser at http://localhost:8888. Navigate to the notebooks/ folder to access the demos.

Online Options (No Installation Required):
1. NBViewer (Read-Only Preview):
  - Fast online preview of notebooks
  - View code, charts, and results
  - Cannot modify or run code
  - All demo links in the README point to NBViewer
2. Binder (Interactive):
  - Full interactive notebook environment in your browser
  - Can modify code and re-run cells
  - Click "Execute on Binder" button in any NBViewer page
  - Takes ~2 minutes to build the environment
Notebook Organization: Notebooks are grouped by algorithm type:
- linear_regression/ — 3 demos (univariate, multivariate, non-linear)
- logistic_regression/ — 4 demos (linear boundary, non-linear, MNIST, Fashion MNIST)
- k_means/ — 1 demo (Iris clustering)
- neural_network/ — 2 demos (MNIST, Fashion MNIST)
- anomaly_detection/ — 1 demo (Gaussian distribution)
```
# Local execution
jupyter notebook
# → Opens http://localhost:8888
# → Navigate to notebooks/ folder

# Alternative: Launch a specific notebook
jupyter notebook notebooks/linear_regression/univariate_linear_regression_demo.ipynb
```
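To confirm the notebook grouping on a local checkout, the folders can be scanned from Python. This is a small sketch under the assumption that the demos live in `notebooks/<algorithm>/*.ipynb` as described above; the `notebooks_by_algorithm` helper is a hypothetical name, not something the repository provides.

```python
from pathlib import Path


def notebooks_by_algorithm(notebooks_dir="notebooks"):
    """Return {subfolder name: sorted list of .ipynb file names}."""
    root = Path(notebooks_dir)
    if not root.is_dir():
        return {}
    return {
        folder.name: sorted(nb.name for nb in folder.glob("*.ipynb"))
        for folder in sorted(root.iterdir())
        # Skip hidden folders such as .ipynb_checkpoints.
        if folder.is_dir() and not folder.name.startswith(".")
    }


if __name__ == "__main__":
    demos = notebooks_by_algorithm()
    for algorithm, names in demos.items():
        print(f"{algorithm}: {len(names)} notebook(s)")
    print("total:", sum(len(names) for names in demos.values()))
```

Run it from the repository root; if the folder is missing, the function returns an empty dictionary instead of raising.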
Step 6
Algorithm implementations overview
The repository implements algorithms across three main categories:

Supervised Learning

Regression (predicting continuous values):
- Linear Regression — Draw a line/plane through data points
  - Univariate: Single feature prediction
  - Multivariate: Multiple features
  - Non-linear: Polynomial and sinusoid features
  - Use cases: Stock prices, sales forecasting, trend analysis
Classification (categorizing data into classes):
- Logistic Regression — Binary and multi-class classification
  - Linear boundaries for simple separation
  - Non-linear boundaries using feature engineering
  - Multivariate for high-dimensional data (MNIST digits, Fashion MNIST)
  - Use cases: Spam detection, language detection, image recognition
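To make the classification idea concrete, here is a minimal NumPy sketch of binary logistic regression trained with batch gradient descent. It illustrates the general technique and is not the repository's code; the function names, the toy data, and the hyperparameter values are assumptions chosen for the example.

```python
import numpy as np


def sigmoid(z):
    """Map real values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic(X, y, alpha=0.1, num_iterations=2000):
    """Fit binary logistic regression with batch gradient descent.

    X: (m, n) features, y: (m,) labels in {0, 1}. Returns theta of shape (n + 1,).
    """
    m = X.shape[0]
    X_b = np.hstack([np.ones((m, 1)), X])  # prepend a bias column of ones
    theta = np.zeros(X_b.shape[1])
    for _ in range(num_iterations):
        predictions = sigmoid(X_b @ theta)
        gradient = X_b.T @ (predictions - y) / m  # gradient of the cross-entropy cost
        theta -= alpha * gradient
    return theta


def predict(X, theta):
    """Return class 1 where the predicted probability is at least 0.5."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(X_b @ theta) >= 0.5).astype(int)


# Toy 1-D data: negative feature values are class 0, positive values are class 1.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = train_logistic(X, y)
print(predict(X, theta))
```

A non-linear boundary follows the same recipe: only the feature matrix changes, for example by adding squared or product terms before training.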
Unsupervised Learning

Clustering (grouping similar data):
- K-means — Partition data into K clusters
  - Iterative centroid refinement
  - Demo: Iris flower clustering
  - Use cases: Market segmentation, image compression, data analysis
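The iterative centroid refinement mentioned above alternates two steps: assign every point to its nearest centroid, then move each centroid to the mean of its points. The sketch below shows that loop in plain NumPy; it is an illustration under assumed names (`k_means`, the toy points, the fixed iteration count), not the repository's implementation.

```python
import numpy as np


def k_means(X, k, num_iterations=10, seed=0):
    """Cluster the rows of X into k groups; returns (centroids, labels)."""
    rng = np.random.RandomState(seed)
    # Initialise centroids with k distinct training points.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(num_iterations):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Two well-separated toy groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = k_means(X, k=2)
print(centroids)
print(labels)
```

A fixed iteration count keeps the sketch short; a fuller version would stop once the assignments no longer change.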
Anomaly Detection (identifying outliers):
- Gaussian Distribution — Statistical anomaly detection
  - Model normal behavior with Gaussian distribution
  - Flag rare events based on probability threshold
  - Demo: Server monitoring (latency, throughput)
  - Use cases: Fraud detection, intrusion detection, system health
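The two steps above, modelling normal behaviour with a Gaussian and flagging low-probability points, fit in a few lines of NumPy. This is a hedged sketch: the data is randomly generated to stand in for latency and throughput, the threshold `epsilon` is an arbitrary illustrative value, and features are treated as independent.

```python
import numpy as np


def fit_gaussian(X):
    """Estimate the per-feature mean and variance from normal examples."""
    return X.mean(axis=0), X.var(axis=0)


def density(X, mu, sigma2):
    """Product of independent per-feature Gaussian densities for each row."""
    coeff = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    exponent = -((X - mu) ** 2) / (2.0 * sigma2)
    return np.prod(coeff * np.exp(exponent), axis=1)


# Synthetic "normal" behaviour: two features with different scales.
rng = np.random.RandomState(0)
normal = rng.normal(loc=[10.0, 50.0], scale=[1.0, 5.0], size=(200, 2))
mu, sigma2 = fit_gaussian(normal)

epsilon = 1e-4  # hypothetical threshold; in practice tuned on labelled examples
candidates = np.array([[10.0, 50.0],   # close to the mean
                       [20.0, 5.0]])   # far from the mean in both features
is_anomaly = density(candidates, mu, sigma2) < epsilon
print(is_anomaly)
```

The threshold is the main tuning knob: lowering it flags fewer points, raising it flags more.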
Neural Networks
- Multilayer Perceptron (MLP) — Feedforward neural network
  - Multiple hidden layers with activation functions
  - Backpropagation for training
  - Demos: Handwritten digit recognition, clothing classification
  - Use cases: General-purpose ML, image recognition, voice recognition
Step 7
Example: Linear regression walkthrough
Let's walk through the univariate linear regression example to understand the structure.

Implementation Location: homemade/linear_regression/linear_regression.py

This file contains:
- The LinearRegression class
- Hypothesis function (linear equation)
- Cost function (mean squared error)
- Gradient descent optimization
- Prediction method
Demo Notebook: notebooks/linear_regression/univariate_linear_regression_demo.ipynb

What the Demo Does:
1. Loads a dataset (country happiness scores vs GDP)
2. Visualizes the raw data as a scatter plot
3. Trains a linear regression model
4. Plots the regression line through the data
5. Shows cost function convergence over iterations
6. Makes predictions on new data points
Key Learning Points:
- See gradient descent in action (cost decreasing)
- Understand how learning rate affects convergence
- Experiment with different features (polynomial, etc.)
- Visualize overfitting vs underfitting
Try It Yourself:
```
jupyter notebook notebooks/linear_regression/univariate_linear_regression_demo.ipynb
```
Modify the learning rate, iterations, or add polynomial features to see how the model changes.
```
# Example usage of the implementation
import pandas as pd
from homemade.linear_regression import LinearRegression

# Load data (GDP vs. happiness). The CSV has a header row and text columns,
# so it is read with pandas rather than np.loadtxt.
data = pd.read_csv('data/world-happiness-report-2017.csv')

# Column names are assumed from the 2017 report; check data.columns.
X = data[['Economy..GDP.per.Capita.']].values  # GDP feature, shape (m, 1)
y = data[['Happiness.Score']].values           # Happiness target, shape (m, 1)

# Train model
model = LinearRegression(X, y)
model.train(alpha=0.01, num_iterations=500)

# Make predictions
predictions = model.predict(X)

# Visualize results (see notebook for full plotting code)
```
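The pieces listed for the implementation (hypothesis, cost, gradient descent) can be sketched independently of the repository. The version below is a minimal illustration with assumed function names and toy data; the cost uses the 1/(2m) scaling common in Andrew Ng's course, which may differ in detail from the repository's code.

```python
import numpy as np


def hypothesis(X_b, theta):
    """Linear prediction for a feature matrix that already has a bias column."""
    return X_b @ theta


def cost(X_b, y, theta):
    """Half mean squared error: sum of squared errors divided by 2m."""
    errors = hypothesis(X_b, theta) - y
    return float(errors @ errors) / (2 * len(y))


def gradient_descent(X, y, alpha=0.1, num_iterations=500):
    """Return (theta, cost_history) after a fixed number of gradient steps."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias column
    theta = np.zeros(X_b.shape[1])
    cost_history = []
    for _ in range(num_iterations):
        gradient = X_b.T @ (hypothesis(X_b, theta) - y) / len(y)
        theta -= alpha * gradient
        cost_history.append(cost(X_b, y, theta))
    return theta, cost_history


# Toy data lying exactly on the line y = 1 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X[:, 0]
theta, cost_history = gradient_descent(X, y)
print(theta)
print(cost_history[0], cost_history[-1])
```

Raising `alpha` too far makes the cost grow instead of shrink, which is the learning-rate effect the demo notebook lets you explore.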
Step 8
Example: Neural network MNIST demo
The neural network implementation showcases a more advanced algorithm with the classic MNIST handwritten digit recognition task.

Implementation: homemade/neural_network/multilayer_perceptron.py

Features:
- Configurable layer architecture (input → hidden → output)
- Sigmoid activation functions
- Backpropagation for weight updates
- Mini-batch gradient descent
- Regularization support
Demo Notebook: notebooks/neural_network/multilayer_perceptron_demo.ipynb

Dataset:
- 60,000 training images (28×28 pixels)
- 10,000 test images
- 10 digit classes (0-9)
- Each image flattened to 784 features
What You'll Learn:
- How neural networks transform data through layers
- Impact of hidden layer size on accuracy
- Training progress visualization (accuracy over epochs)
- Overfitting detection
- Confusion matrix interpretation
Typical Results:
- Training accuracy: ~95-97%
- Test accuracy: ~93-95%
- Training time: 5-10 minutes (CPU)
Experimentation Ideas:
- Add more hidden layers
- Change layer sizes (128 → 256 neurons)
- Adjust learning rate
- Enable/disable regularization
- Compare with Fashion MNIST dataset
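```
# Neural network configuration example
import numpy as np
from homemade.neural_network import MultilayerPerceptron

# X_train, y_train, X_test, y_test are assumed to be NumPy arrays prepared
# as in the demo notebook (one flattened image per row, labels 0-9).

# Network architecture
layers = [
    784,   # Input: 28×28 pixels flattened
    128,   # Hidden layer: 128 neurons
    10     # Output: 10 digit classes
]

# Train model. Argument names here are an assumption; confirm them against
# homemade/neural_network/multilayer_perceptron.py before running.
epsilon = 0.12  # Range of the random initial weights (assumed value)
model = MultilayerPerceptron(X_train, y_train, layers, epsilon)
model.train(
    regularization_param=0,   # Regularization strength
    max_iterations=500,       # Gradient descent iterations
    alpha=0.1                 # Learning rate
)

# Evaluate: compare predicted labels with the true labels
predictions = model.predict(X_test)
accuracy = np.mean(np.asarray(predictions).flatten() == np.asarray(y_test).flatten())
print(f'Test accuracy: {accuracy:.2%}')
```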
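To see how data moves through the layers, the forward pass can be written in a few lines of NumPy. This sketch is an illustration only: the helper names, the initialisation range, and the random stand-in input are assumptions, and training via backpropagation is left out.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic activation."""
    return 1.0 / (1.0 + np.exp(-z))


def init_thetas(layers, epsilon=0.12, seed=0):
    """One weight matrix per layer transition, shape (n_out, n_in + 1) for the bias."""
    rng = np.random.RandomState(seed)
    return [
        rng.uniform(-epsilon, epsilon, size=(n_out, n_in + 1))
        for n_in, n_out in zip(layers[:-1], layers[1:])
    ]


def forward(X, thetas):
    """Propagate a batch of rows through every layer and return the output activations."""
    activation = X
    for theta in thetas:
        with_bias = np.hstack([np.ones((activation.shape[0], 1)), activation])
        activation = sigmoid(with_bias @ theta.T)
    return activation


layers = [784, 128, 10]
thetas = init_thetas(layers)

# Random stand-in for five flattened 28x28 images (not real MNIST data).
X_fake = np.random.RandomState(1).rand(5, 784)
outputs = forward(X_fake, thetas)
predicted = outputs.argmax(axis=1)  # index of the most active output unit per row
print(outputs.shape, predicted.shape)
```

With untrained weights the predicted classes are meaningless; the point is the shape of the computation, where each layer adds a bias column, multiplies by its weights, and applies the sigmoid.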
Step 9
Educational approach and learning path
Homemade Machine Learning follows a pedagogical progression from simple to complex algorithms.

Recommended Learning Path:
1. Start with Linear Regression (Easiest)
  - Univariate demo first (single feature)
  - Then multivariate (multiple features)
  - Finally non-linear (feature engineering)
  - Builds intuition for cost functions and gradient descent
2. Move to Logistic Regression
  - Linear boundary demo (natural extension of linear regression)
  - Non-linear boundary (feature engineering revisited)
  - Multivariate MNIST (high-dimensional classification)
3. Explore Unsupervised Learning
  - K-means clustering (simpler than classification)
  - Anomaly detection (introduces probability distributions)
4. Tackle Neural Networks (Most Complex)
  - Builds on all previous concepts
  - Combines gradient descent, classification, and feature learning
  - MNIST provides concrete benchmark
Mathematical Prerequisites:
- Linear algebra (matrices, vectors, dot products)
- Calculus (derivatives, partial derivatives, chain rule)
- Basic probability and statistics
- Understanding of cost functions and optimization
Course Reference: Most of the code and explanations are based on Andrew Ng's Machine Learning course on Coursera, which makes it easy to cross-reference them with the video lectures.
```
Learning Progression:

1. Linear Regression
   └─ Univariate → Multivariate → Non-linear

2. Logistic Regression
   └─ Linear boundary → Non-linear → MNIST

3. Unsupervised Learning
   ├─ K-means clustering
   └─ Anomaly detection

4. Neural Networks
   └─ MLP → MNIST → Fashion MNIST
```
Step 10
Datasets included
The repository includes several real-world datasets in the data/ folder:

Regression Datasets:
- World Happiness Report 2017 — Country happiness scores with economic indicators (GDP, freedom, generosity, etc.)
  - Used for: Linear regression demos
  - Features: Economy GDP, social support, life expectancy, freedom
  - Target: Happiness score
Classification Datasets:
- Iris Flower Dataset — Classic ML dataset with 3 flower species
  - 150 samples, 4 features (sepal/petal length and width)
  - Used for: Logistic regression, K-means clustering
- MNIST Handwritten Digits — 70,000 grayscale images (60k train, 10k test)
  - 28×28 pixels per image
  - 10 classes (digits 0-9)
  - Used for: Logistic regression, neural networks
- Fashion MNIST — Alternative to MNIST with clothing items
  - Same format as MNIST (28×28 grayscale)
  - 10 classes (t-shirt, trouser, dress, coat, sandal, etc.)
  - Used for: Logistic regression, neural networks
Anomaly Detection:
- Server Metrics — Synthetic dataset of server operational parameters
  - Features: Latency, throughput
  - Contains normal and anomalous behavior examples
All datasets are loaded via NumPy or Pandas and include preprocessing examples in the notebooks.
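```
# Example dataset loading patterns
import pandas as pd

# CSV loading with pandas (the file has a header row and text columns)
data = pd.read_csv('data/world-happiness-report-2017.csv')
# Column names are assumed from the 2017 report; check data.columns.
X = data[['Economy..GDP.per.Capita.', 'Freedom']].values  # Features
y = data[['Happiness.Score']].values                       # Target

# Iris dataset via scikit-learn
# (scikit-learn is not in requirements.txt; install it separately)
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2]  # First 2 features for visualization
y = iris.target

# MNIST via Keras datasets
# (Keras is not in requirements.txt; install it separately)
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
```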
Step 11
Code structure and utilities
The homemade/utils/ directory contains shared utilities used across multiple algorithms:

Features Module (features/):
- prepare_for_training() — Normalize data and add bias column
- normalize() — Feature scaling (zero mean, unit variance)
- generate_polynomials() — Create polynomial features for non-linear regression
- generate_sinusoids() — Create sinusoidal features
Hypothesis Module:
- linear_hypothesis() — Linear prediction function
- sigmoid() — Logistic activation function
Cost Functions:
- Mean Squared Error (regression)
- Cross-Entropy Loss (classification)
- Regularization terms
Optimization:
- Gradient descent implementation
- Mini-batch gradient descent
- Learning rate scheduling helpers
Plotting Utilities:
- Decision boundary visualization
- Cost function convergence plots
- Confusion matrices
- 3D surface plots for regression
Why This Matters: Understanding these utilities is crucial because they reveal the common patterns across all ML algorithms (feature scaling, cost computation, gradient calculation). The main algorithm classes focus on the unique aspects while delegating these shared concerns to utilities.
```
# Example utility usage
from homemade.utils.features import prepare_for_training

# Normalize features and add bias
X_normalized, features_mean, features_std = prepare_for_training(X)

# Generate polynomial features (degree 2)
from homemade.utils.features import generate_polynomials
X_poly = generate_polynomials(X, polynomial_degree=2)

# Common pattern in all algorithms:
# 1. Prepare features (normalize + bias)
# 2. Initialize parameters (theta)
# 3. Compute cost and gradients
# 4. Update parameters via gradient descent
# 5. Repeat until convergence
```
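What "normalize + bias" means in step 1 of the common pattern can be shown without the repository's helpers. The functions below are a stand-alone sketch with assumed names (`normalize`, `add_bias`); they are not the signatures of the utilities in homemade/utils/.

```python
import numpy as np


def normalize(X):
    """Scale each column to zero mean and unit variance; also return the statistics."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - mean) / std, mean, std


def add_bias(X):
    """Prepend a column of ones so the first parameter acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


# Two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_norm, mean, std = normalize(X)
X_ready = add_bias(X_norm)

# New data must be scaled with the training statistics, not its own.
x_new = (np.array([[2.0, 250.0]]) - mean) / std
print(X_ready)
print(x_new)
```

Keeping the mean and standard deviation around matters at prediction time, which is presumably why the repository's `prepare_for_training` returns them alongside the processed data.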
Step 12
Development and testing
The repository includes development tooling for code quality; validation relies on running the notebooks rather than on a formal test suite.

Linting with Pylint: The project uses Pylint with a custom configuration (pylintrc) to maintain code quality.
```
# Run linter on all implementations
pylint ./homemade
```
The pylintrc file contains project-specific rules and is used in CI.

Continuous Integration: Travis CI automatically runs linting on every commit.

Configuration (.travis.yml):
- Python 3.6 environment
- Installs dependencies from requirements.txt
- Runs pylint ./homemade
- Email notifications disabled
Testing Approach: While the repository doesn't include formal unit tests (pytest suite), testing happens through:
1. Interactive notebook execution (visual validation)
2. Algorithm convergence verification
3. Accuracy metrics on known datasets
4. Comparison with expected results from Andrew Ng's course
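Convergence verification can be partly automated even without a test suite. The sketch below assumes a list of per-iteration cost values (called `cost_history` here, a hypothetical name) and reports the first iteration at which the cost went up; it is an illustration, not a check the repository ships.

```python
def first_cost_increase(cost_history, tolerance=0.0):
    """Return the index of the first iteration whose cost exceeds the previous one.

    Returns None when the cost never rises by more than `tolerance`.
    """
    for index in range(1, len(cost_history)):
        if cost_history[index] > cost_history[index - 1] + tolerance:
            return index
    return None


# A well-behaved run: the cost shrinks at every step.
print(first_cost_increase([4.0, 2.0, 1.0, 0.5]))

# A run where the cost rises at iteration 2, e.g. with a learning rate set too high.
print(first_cost_increase([4.0, 2.0, 2.5, 1.0]))
```

A small positive `tolerance` is useful when floating-point noise causes tiny increases near the minimum.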
Contributing Guidelines: See CONTRIBUTING.md for guidelines on:
- Code style and formatting
- Adding new algorithms
- Improving documentation
- Submitting issues and pull requests
```
# Development workflow

# 1. Install dev dependencies
pip install -r requirements.txt

# 2. Make changes to algorithm implementations
vim homemade/linear_regression/linear_regression.py

# 3. Test via notebook
jupyter notebook notebooks/linear_regression/...

# 4. Run linter
pylint ./homemade

# 5. Commit (Travis CI will lint automatically)
git add .
git commit -m "Improve gradient descent convergence"
git push
```
Step 13
Related projects and alternatives
Oleksii Trekhleb (@trekhleb) maintains several related educational ML projects:

Other Projects by the Same Author:
1. machine-learning-octave — Octave/MATLAB version of this repository
  - GitHub: https://github.com/trekhleb/machine-learning-octave
  - Uses Octave instead of Python
  - Follows the same educational approach
  - Direct implementations from Andrew Ng's course
2. Interactive Machine Learning Experiments — Web-based ML playground
  - GitHub: https://github.com/trekhleb/machine-learning-experiments
  - Live demos in the browser
  - Uses TensorFlow.js
  - More visual, less mathematical
3. Homemade GPT (JavaScript) — GPT implementation from scratch
  - GitHub: https://github.com/trekhleb/homemade-gpt-js
  - Focus on transformer architecture
  - TypeScript/JavaScript implementation
When to Use Homemade Machine Learning:
- Learning ML fundamentals from scratch
- Understanding mathematical foundations
- Preparing for ML interviews
- Teaching ML concepts
- Transitioning from theory (courses) to code
When to Use Production Libraries Instead:
- Building production ML systems
- Working with large datasets
- Deploying models to production
- Time-critical development
- Advanced deep learning architectures
Production ML Libraries:
- scikit-learn (classical ML)
- TensorFlow / PyTorch (deep learning)
- XGBoost / LightGBM (gradient boosting)
- Keras (high-level neural networks)
Step 14
Resources and community
Official Resources:
- GitHub Repository: https://github.com/trekhleb/homemade-machine-learning
- Author: Oleksii Trekhleb (@trekhleb)
- License: MIT License (open source, commercial use allowed)
- Stars: 23,000+ (as of 2024)
Learning Resources:
- Andrew Ng's ML Course: https://www.coursera.org/learn/machine-learning
  - Free course on Coursera
  - Most algorithms in this repo are based on this course
  - Highly recommended companion resource
- NBViewer Links: Embedded in README for each algorithm
  - Fast read-only preview of notebooks
  - No installation required
- Binder: Interactive notebook execution
  - Full Jupyter environment in browser
  - Click "Execute on Binder" in NBViewer
Community Support:
- GitHub Issues: Bug reports and feature requests
- GitHub Discussions: Questions and community help
- Pull Requests: Contributions welcome (see CONTRIBUTING.md)
- Code of Conduct: See CODE_OF_CONDUCT.md
Supporting the Project:
- GitHub Sponsors: https://github.com/sponsors/trekhleb
- Patreon: https://www.patreon.com/trekhleb
The project is actively maintained with regular updates and improvements.
```
Quick Links:

Repository: https://github.com/trekhleb/homemade-machine-learning
Author: @trekhleb
License: MIT
Stars: 23K+

Learning:
├─ Andrew Ng Course: coursera.org/learn/machine-learning
├─ NBViewer: Read-only notebook previews
└─ Binder: Run notebooks in browser

Support:
├─ GitHub Issues (bugs)
├─ GitHub Discussions (questions)
└─ Sponsors / Patreon (funding)
```
