InstructLab - Framework for Fine-tuning LLMs with Synthetic Data
Framework for fine-tuning LLMs with synthetic data using the LAB (Large-Scale Alignment for ChatBots) method. Now refactored into SDG Hub and Training Hub components.
Overview
InstructLab (ilab) is a framework for fine-tuning Large Language Models (LLMs) using synthetic data via the LAB (Large-Scale Alignment for ChatBots) method.
⚠ Heads up: The original instructlab/instructlab repository was archived on April 23, 2026. Development continues in separate component repositories.
Project Evolution & Architecture
As of September 2, 2025, InstructLab was restructured into separate repositories for improved maintainability.
Project Evolution Summary

=== Original (Archived) ===
- Repository: instructlab/instructlab
- Stars: ~12,000
- Status: Archived (April 23, 2026)
- Last release: v0.26.1 (May 5, 2025)
- URL: https://github.com/instructlab/instructlab

New Component Repositories:
1. SDG Hub (Synthetic Data Generation)
   - URL: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
   - Stars: ~142
   - Status: Actively maintained
2. Training Hub
   - URL: https://github.com/Red-Hat-AI-Innovation-Team/training_hub
   - Status: Actively maintained
3. Taxonomy (Still Active)
   - URL: https://github.com/instructlab/taxonomy
   - Status: Community contributions ongoing
Technology Stack & Requirements
InstructLab core technology stack and system requirements:
Core Technologies:
- Python: 85.5% of codebase
- Shell: 9.3%
- Jupyter: 2.9%
- Dockerfile: 1.6%
- License: Apache-2.0

System Requirements:
- Python >= 3.10 (v0.26.1 removed Python 3.10 support, so that release needs 3.11 or newer)
- RAM: 16GB+ recommended (32GB+ for production)
- Storage: 20GB+ for models and data
- GPU: Optional but recommended (NVIDIA CUDA)

Core Dependencies:
- instructlab-core: CLI functionality
- instructlab-sdg: Synthetic data generation
- instructlab-training: Training pipeline
- transformers: HuggingFace models
- torch: PyTorch
- llama_cpp_python: Local inference

Optional:
- vllm: High-performance inference
- bitsandbytes: Quantization (4bit/8bit)
- accelerate: Distributed training
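The requirements above can be checked before installing anything. The following is a minimal sketch using only the Python standard library; the function name `check_prerequisites` and the two threshold constants are illustrative and not part of InstructLab. It covers the Python version and free disk space only, since the standard library has no portable way to read total RAM. For v0.26.1, raise `MIN_PYTHON` to `(3, 11)` in line with the note on Python 3.10.

```python
import shutil
import sys

# Thresholds copied from the requirements list; the names are hypothetical.
MIN_PYTHON = (3, 10)
MIN_FREE_DISK_GB = 20


def check_prerequisites(path=".", version_info=None):
    """Return a dict mapping each check name to True or False."""
    # Compare only (major, minor); version_info can be injected for testing.
    version = tuple((version_info or sys.version_info)[:2])
    # Free space on the filesystem that holds `path`, in decimal gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python": version >= MIN_PYTHON,
        "disk": free_gb >= MIN_FREE_DISK_GB,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'below the listed minimum'}")
```

Pass the directory where models will be stored as `path` so the disk check looks at the right filesystem.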
Installation Options
Multiple installation methods depending on your needs:
# Option 1: Archived original version (last stable release)
pip install instructlab==0.26.1

# Option 2: From source (archived)
git clone https://github.com/instructlab/instructlab.git
cd instructlab
git checkout v0.26.1
pip install -e ".[all]"   # quoted so the brackets survive shells such as zsh

# Option 3: New SDG Hub (recommended for new projects)
git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
cd sdg_hub
pip install -e .

# Option 4: Training Hub
git clone https://github.com/Red-Hat-AI-Innovation-Team/training_hub.git
cd training_hub
pip install -e .

# Verify installation
ilab --version               # Original package
python -c "import sdg_hub"   # New package

# Clone taxonomy (active repository)
git clone https://github.com/instructlab/taxonomy.git
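The verification commands above can also be run from Python, which is handy when several of the options are installed side by side. This is a sketch, not an official check: `instructlab` and `sdg_hub` come from the commands above, while the module name `training_hub` is an assumption based on the repository name.

```python
import importlib.util
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a pip distribution, or None."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def is_importable(module):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module) is not None


if __name__ == "__main__":
    print("instructlab version:", installed_version("instructlab"))
    # "training_hub" is an assumed module name (taken from the repo name).
    for module in ("sdg_hub", "training_hub"):
        print(f"{module} importable:", is_importable(module))
```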
Quick Start Workflow
Basic workflow demonstrating InstructLab core functionality:
# Step 1: Download a model (Granite-7b recommended)
ilab model download --model granite-7b

# Alternative: Use HuggingFace directly
huggingface-cli download ibm-granite/granite-7b-lab \
  --local-dir ./granite-7b

# Step 2: Start chat with base model
ilab model chat --model ./granite-7b

# Step 3: Generate synthetic data from taxonomy
ilab data generate \
  --taxonomy-path ./taxonomy \
  --output-dir ./generated/ \
  --seed 42

# Step 4: Train model with synthetic data
ilab model train \
  --model ./granite-7b \
  --data-path ./generated/data.jsonl \
  --output-dir ./trained_model/ \
  --num-epochs 5 \
  --batch-size 8

# Step 5: Evaluate the trained model
ilab model evaluate \
  --model ./trained_model/model.pth \
  --benchmark mmlu

# Step 6: Test new model
ilab model chat --model ./trained_model/model.pth
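Before training on the generated file, it is worth looking at a few records. The sketch below reads the first records of a JSONL file (one JSON object per line) with the standard library; the helper name `preview_jsonl` is illustrative, and the field names inside each record depend on the generator, so none are assumed here.

```python
import json
from itertools import islice


def preview_jsonl(path, limit=5):
    """Parse and return up to `limit` records from a JSONL file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not break parsing.
        lines = (line for line in handle if line.strip())
        for line in islice(lines, limit):
            records.append(json.loads(line))
    return records
```

For example, `preview_jsonl("./generated/data.jsonl")` returns a list of at most five dicts, and printing `sorted(record)` for each one shows which keys the data actually carries.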
Taxonomy Structure & Usage
Taxonomy is the knowledge base that powers synthetic data generation:
# Clone and explore taxonomy
git clone https://github.com/instructlab/taxonomy.git
cd taxonomy

# View structure (paths are relative to the cloned repository)
ls knowledge/              # Knowledge topics
ls compositional_skills/   # Task examples

# Validate taxonomy before use
ilab taxonomy diff --taxonomy-path .

# Generate data from specific topic
ilab data generate \
  --taxonomy-path knowledge/python/ \
  --output-dir ./py_data/

# Create custom taxonomy file (for example taxonomy/knowledge/my-topic.yaml)
cat > my_topic.yaml << 'EOF'
name: python_basics
description: "Python programming basics"
tasks:
  - type: knowledge
    questions:
      - text: "What is a Python list?"
        answered_question: |
          A mutable ordered collection using square brackets.
EOF

# Validate and generate
ilab taxonomy diff --taxonomy-path my_topic.yaml
ilab data generate --taxonomy-path ./my_topic.yaml

# Preview generated data
head -n 5 ./generated/knowledge.jsonl
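To get a feel for how much material a taxonomy checkout contains, the YAML files can be counted per top-level directory. This is a small standard-library sketch; `count_taxonomy_files` is a hypothetical helper, and it only counts files with a `.yaml` extension without validating their contents.

```python
from collections import Counter
from pathlib import Path


def count_taxonomy_files(root):
    """Count *.yaml files under each top-level directory of `root`."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.yaml"):
        relative = path.relative_to(root)
        # Files directly in the root are grouped under ".".
        top = relative.parts[0] if len(relative.parts) > 1 else "."
        counts[top] += 1
    return dict(counts)


if __name__ == "__main__":
    for directory, total in sorted(count_taxonomy_files("./taxonomy").items()):
        print(f"{directory}: {total} YAML file(s)")
```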
Training Parameters Reference
Key training configuration options:
# Basic Training
ilab model train \
  --model ./granite-7b \
  --taxonomy-path ./taxonomy \
  --output-dir ./output

# Advanced Training
ilab model train \
  --model ./granite-7b \
  --data-path ./generated/data.jsonl \
  --output-dir ./trained/ \
  --num-epochs 10 \
  --batch-size 16 \
  --learning-rate 4.0e-5 \
  --device cuda \
  --quantization 4bit \
  --max-seq-length 2048 \
  --seed 42 \
  --gradient-checkpointing

# Parameter Reference:
--model                    Path to base model (required)
--taxonomy-path            Path to taxonomy YAML files (required unless --data-path is given)
--data-path                Path to generated JSONL data
--output-dir               Output directory for trained model (required)
--num-epochs               Training epochs (default: 1)
--batch-size               Training batch size (default: 1)
--learning-rate            Learning rate (default: 4.0e-5)
--device                   cuda, cpu, or auto (default: auto)
--quantization             4bit or 8bit quantization
--max-seq-length           Max sequence length (default: 128)
--seed                     Random seed for reproducibility
--gradient-checkpointing   Enable memory optimization
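When training runs are launched from scripts, it can help to keep the options in one typed object and render the command line from it. The sketch below mirrors the flags and defaults in the parameter reference above; the `TrainOptions` class is hypothetical, and whether an installed `ilab` accepts each flag should be confirmed with `ilab model train --help`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainOptions:
    """Training options with the defaults listed in the reference above."""
    model: str
    output_dir: str
    data_path: Optional[str] = None
    num_epochs: int = 1
    batch_size: int = 1
    learning_rate: float = 4.0e-5
    device: str = "auto"
    max_seq_length: int = 128
    quantization: Optional[str] = None
    gradient_checkpointing: bool = False

    def to_argv(self):
        """Render the options as an argument list for subprocess.run."""
        if self.quantization not in (None, "4bit", "8bit"):
            raise ValueError("quantization must be '4bit', '8bit', or None")
        argv = [
            "ilab", "model", "train",
            "--model", self.model,
            "--output-dir", self.output_dir,
            "--num-epochs", str(self.num_epochs),
            "--batch-size", str(self.batch_size),
            "--learning-rate", str(self.learning_rate),
            "--device", self.device,
            "--max-seq-length", str(self.max_seq_length),
        ]
        # Optional flags are emitted only when they are set.
        if self.data_path:
            argv += ["--data-path", self.data_path]
        if self.quantization:
            argv += ["--quantization", self.quantization]
        if self.gradient_checkpointing:
            argv.append("--gradient-checkpointing")
        return argv
```

Building a list rather than a shell string avoids quoting problems with paths that contain spaces; the result can be passed directly to `subprocess.run(argv, check=True)`.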
Configuration & Environment
Configuration file management:
# Config file location: ~/.config/instructlab/config.yaml

# View current configuration
ilab config show

# Example config
cat > ~/.config/instructlab/config.yaml << 'EOF'
generic:
  debug_level: INFO
model:
  chat:
    model_path: /path/to/granite-7b
taxonomy:
  path: /path/to/taxonomy
cli:
  sdg_backend: null  # or 'ray' for distributed
serve:
  model_path: /path/to/model
  device: auto
EOF

# Environment variables
export ILAB_MODEL_PATH=/path/to/model
export ILAB_TAXONOMY_PATH=/path/to/taxonomy
export ILAB_DEVICE=cuda

# Override with CLI flags
ilab --config ~/my-config.yaml model chat --model ./custom-model
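With three places a setting can come from, it helps to be explicit about which one wins. The sketch below shows one common layering: CLI flag first, then environment variable, then config file, then a default. Only the "CLI flags override" part is stated above; ranking environment variables ahead of the config file is an assumption, and `resolve_setting` is a hypothetical helper rather than InstructLab code.

```python
import os


def resolve_setting(cli_value, env_name, config, config_key,
                    default=None, environ=None):
    """Pick a value by precedence: CLI flag, env var, config file, default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    # Assumed order: a non-empty environment variable beats the config file.
    if environ.get(env_name):
        return environ[env_name]
    # Walk a dotted key such as "serve.device" through the nested config dict.
    node = config
    for part in config_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

For instance, with a parsed config of `{"serve": {"device": "auto"}}` and `ILAB_DEVICE=cuda` exported, `resolve_setting(None, "ILAB_DEVICE", config, "serve.device")` yields `"cuda"` under this ordering.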
Supported Models
Recommended models for InstructLab:
Recommended Models:
1. IBM Granite (Primary)
   - granite-7b-lab (Most tested)
   - granite-3.0-3b (Faster)
2. LLaMA Family
   - LLaMA-3.1-8B (Meta)
   - LLaMA-2-13b
3. Merlinite
   - merlinite-7b

Size Guidelines:
- 3B: Fast, ~6GB VRAM
- 7B: Recommended, ~12GB VRAM
- 13B+: Better quality, 16-24GB VRAM

Hardware Requirements (7B models):
- Minimum: 16GB RAM + 8GB GPU VRAM
- Recommended: 32GB RAM + 16GB GPU VRAM

Cloud Alternatives:
- Google Colab
- AWS SageMaker
- Lambda Labs
- RunPod
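The size guidelines above are rough figures. A back-of-the-envelope estimate of the memory needed just to hold the weights is parameter count times bytes per parameter, which the sketch below computes. It ignores activations, optimizer state, and KV cache, so real usage is higher; the `overhead` factor in `fits_in_vram` is a hypothetical safety margin, not a measured value.

```python
def weight_memory_gb(params_billion, bits_per_param=16):
    """Memory for model weights alone, in decimal gigabytes."""
    # 1e9 parameters * (bits / 8) bytes each = (params_billion * bits / 8) GB.
    return params_billion * bits_per_param / 8


def fits_in_vram(params_billion, bits_per_param, vram_gb, overhead=1.2):
    """Rough check: weights plus an assumed overhead factor fit in VRAM."""
    return weight_memory_gb(params_billion, bits_per_param) * overhead <= vram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"7B at {bits}-bit: {weight_memory_gb(7, bits):.1f} GB of weights")
```

This is why quantization is the first suggestion for out-of-memory errors: going from 16-bit to 4-bit weights cuts the weight footprint to a quarter.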
Troubleshooting Guide
Common issues and solutions:
Common Issues:
1. CUDA/GPU Issues
   Check: python -c "import torch; print(torch.cuda.is_available())"
   Fix: Install a PyTorch build that matches your CUDA version
2. Out of Memory (OOM)
   Fix:
   - Enable 4bit/8bit quantization
   - Reduce batch_size
   - Enable gradient checkpointing
   - Reduce max_seq_length
3. Slow Generation
   Fix:
   - Use vLLM for faster inference
   - Reduce taxonomy files
   - Enable caching
4. Taxonomy Errors
   Check: ilab taxonomy diff --taxonomy-path my_file.yaml
   Fix: Validate YAML syntax, check required fields
5. Dependency Conflicts
   Fix:
   - Use virtual environment
   - Check Python version (>=3.10)
   - Update pip
6. Quantization Issues
   Fix:
   - Install bitsandbytes
   - Check GPU compatibility
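The one-line CUDA check above raises an ImportError when PyTorch itself is missing, which can hide the real problem. A slightly more defensive version is sketched below; the function name `cuda_status` and its three return strings are illustrative.

```python
def cuda_status():
    """Report whether PyTorch is installed and whether it can see a GPU."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    # torch.cuda.is_available() is False for CPU-only builds and when the
    # installed build does not match the system's CUDA driver.
    return "cuda available" if torch.cuda.is_available() else "cpu only"


if __name__ == "__main__":
    print(cuda_status())
```

A result of "cpu only" on a machine with an NVIDIA GPU usually points back to issue 1: the installed PyTorch build and the CUDA version do not match.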
Resources & Community
Important resources for learning and support:
Repositories:
- Original (Archived): https://github.com/instructlab/instructlab
- SDG Hub: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
- Training Hub: https://github.com/Red-Hat-AI-Innovation-Team/training_hub
- Taxonomy: https://github.com/instructlab/taxonomy

Documentation:
- Website: https://instructlab.ai
- Docs: https://docs.instructlab.ai

Research:
- LAB Paper: https://arxiv.org/abs/2403.01081

Community:
- Governance: https://docs.instructlab.ai/community/GOVERNANCE/
- Discussions: https://github.com/instructlab/instructlab/discussions
- Security: https://github.com/instructlab/.github/blob/main/SECURITY.md