TechSetupGuides
Advancedllamallmfine-tuninginferenceragpeftloralibpytorchtransformersnlpmachine-learningmetaai

Llama Cookbook - LLaMA Fine-tuning Framework

Official comprehensive guide for LLaMA model inference, fine-tuning, RAG, and end-to-end use cases. Includes examples for domain adaptation and LLM applications.

  1. Step 1

    Overview

    Llama Cookbook (formerly llama-recipes) is Meta's official companion project to the Llama models. It provides comprehensive examples and recipes for getting started with inference, fine-tuning for domain adaptation, RAG (Retrieval Augmented Generation), and end-to-end use cases with the Llama model family.

  2. Step 2

    Technology Stack

    Llama Cookbook is built with the following technologies:

    Name: llama-cookbook (formerly llama-recipes)
    License: MIT
    Stars: ~18,337
    Owner: meta-llama
    Repo: https://github.com/meta-llama/llama-cookbook
    Website: https://www.llama.com
    
    Languages:
    - Jupyter Notebook (Primary: 13M+ lines)
    - Python (~800K lines)
    - Java (~56K lines)
    - JavaScript (~35K lines)
    - Kotlin (~11K lines)
    
    Core Dependencies:
    - PyTorch >=2.2 - Deep learning framework
    - Accelerate - Training acceleration
    - Transformers >=4.45.1 - Hugging Face models
    - Peft - Parameter-efficient fine-tuning
    - LoraLib - Low-Rank Adaptation
    - Datasets - Hugging Face datasets
    - bitsandbytes - 8/4-bit quantization
    
    Optional Dependencies:
    - vllm: High-performance inference
    - auditnlg: Sensitive topics safety checker
    - langchain: LangChain integration
    - tests: pytest-mock for testing
    
    Key Features:
    - Inference for Llama models
    - Fine-tuning with Full Finetuning, LoRA, QLoRA
    - RAG (Retrieval Augmented Generation)
    - End-to-end use cases and applications
    - Model distillation
    - 3P integrations with providers
    - Multi-GPU and distributed training
    - Jupyter notebooks for interactive exploration
  3. Step 3

    Installation

    Install Llama Cookbook using pip or from source.

    # Install with pip (recommended)
    pip install llama-cookbook
    
    # Install with optional dependencies
    pip install llama-cookbook[tests]           # For unit tests
    pip install llama-cookbook[vllm]            # For vLLM inference
    pip install llama-cookbook[auditnlg]        # For safety checker
    pip install llama-cookbook[langchain]       # For LangChain examples
    
    # Install multiple optional dependencies
    pip install llama-cookbook[tests,vllm,auditnlg]
    
    # Install from source (for development)
    git clone https://github.com/meta-llama/llama-cookbook.git
    cd llama-cookbook
    pip install -U pip setuptools
    pip install -e .
    
    # For development with all dependencies
    cd llama-cookbook
    pip install -e .[tests,auditnlg,vllm,langchain]
  4. Step 4

    PyTorch Installation (CUDA-aware)

    Install PyTorch with the correct CUDA version for your GPU.

    # Check your CUDA version
    nvidia-smi
    
    # Install PyTorch with CUDA 11.8 (recommended)
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    
    # Install PyTorch with CUDA 12.1 (for H100 and newer GPUs)
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    
    # Verify installation
    python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')
  5. Step 5

    Getting Llama Models

    Access Llama models from Hugging Face Hub.

    # Log into Hugging Face
    pip install huggingface_hub
    huggingface-cli login
    
    # Visit https://huggingface.co/meta-llama to access models
    # Models with 'hf' suffix are already converted to HF format
    
    # Example model names:
    # - meta-llama/Llama-3.3-70B-Instruct-hf
    # - meta-llama/Llama-3.2-1B-hf
    # - meta-llama/Llama-3.2-1B-Instruct-hf
    # - meta-llama/Llama-3.1-8B-Instruct-hf
    
    # Note: Models with 'hf' suffix require NO conversion step
  6. Step 6

    Basic Inference

    Run inference with a Llama model.

    # Using transformers library
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    
    tokenizer = AutoTokenizer.from_pretrained(
        "meta-llama/Llama-3.2-1B-Instruct-hf",
        token="YOUR_HF_TOKEN"
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-1B-Instruct-hf",
        torch_dtype="auto",
        device_map="auto",
        token="YOUR_HF_TOKEN"
    )
    
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=256,
        temperature=0.7
    )
    
    result = pipe("What is machine learning?")
    print(result[0]['generated_text'])
  7. Step 7

    Fine-tuning with LoRA

    Perform parameter-efficient fine-tuning using LoRA.

    # Finetuning with LoRA
    cd src/llama_cookbook/finetuning/
    
    # Using the provided scripts
    python finetune_lora.py \
        --model_name meta-llama/Llama-3.2-1B-Instruct-hf \
        --dataset_name your-dataset \
        --max_seq_length 2048 \
        --per_device_train_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --num_train_epochs 3 \
        --learning_rate 2e-4 \
        --output_dir ./results
  8. Step 8

    Fine-tuning with QLoRA

    Quantized LoRA for memory-efficient fine-tuning.

    # Finetuning with QLoRA (4-bit quantization)
    # QLoRA allows 7B+ models to fit on single consumer GPU (24GB)
    
    python finetune_qlora.py \
        --model_name meta-llama/Llama-3.1-8B-Instruct-hf \
        --use_peft True \
        --load_in_4bit True \
        --bnb_4bit_quant_type nf4 \
        --lora_r 64 \
        --lora_alpha 16 \
        --dataset_name your-dataset \
        --max_seq_length 2048 \
        --output_dir ./qlora_results
    ⚠ Heads up: QLoRA requires GPU with at least 16GB VRAM for 7B models, 24GB+ for larger models.
  9. Step 9

    RAG (Retrieval Augmented Generation)

    Build RAG systems with Llama models.

    # RAG with Llama Cookbook
    # Check the src/llama_cookbook/rag/ directory for examples
    
    # Key components:
    # 1. Document loading and chunking
    # 2. Embedding models
    # 3. Vector stores (FAISS, etc.)
    # 4. Llama model for generation
    
    # See RAG examples in:
    # - getting-started/RAG/
    # - src/llama_cookbook/rag/
  10. Step 10

    Using Jupyter Notebooks

    Interactive exploration with Jupyter notebooks.

    # Start Jupyter
    cd getting-started/
    jupyter notebook
    
    # Key notebooks:
    # - build_with_llama_api.ipynb   (Llama API integration)
    # - build_with_llama_4.ipynb     (5M context, Llama 4)
    
    # Prerequisites:
    pip install jupyter jupyterlab notebook
  11. Step 11

    Configuration Options

    Key configuration parameters.

    Fine-tuning Parameters:
    
    Model:
    --model_name      : Path to model
    --output_dir      : Output directory
    --use_peft        : Enable parameter-efficient fine-tuning
    --peft_method     : peft, lora, or qlora
    
    LoRA:
    --lora_r          : LoRA rank (default: 64)
    --lora_alpha      : LoRA alpha (default: 16)
    --lora_dropout    : LoRA dropout (default: 0.1)
    
    Data:
    --dataset_name    : HF dataset or local path
    --max_seq_length  : Maximum sequence length
    
    Training:
    --per_device_train_batch_size : Batch size per GPU
    --gradient_accumulation_steps : Accumulation steps
    --num_train_epochs    : Number of epochs
    --learning_rate       : Learning rate (e.g., 2e-4)
    --lr_scheduler_type   : cosine, linear, constant
    
    Optimization:
    --fp16              : Use 16-bit precision
    --bf16              : Use bfloat16 precision
    --gradient_checkpointing : Enable checkpointing
  12. Step 12

    Multi-GPU Training

    Distributed training across multiple GPUs.

    # Using accelerate (recommended)
    accelerate config
    
    # Run with accelerate
    accelerate launch finetune_script.py \
        --model_name meta-llama/Llama-3.1-8B-Instruct-hf \
        --use_peft True \
        --peft_method lora
    
    # Check GPU availability
    nvidia-smi
    python -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"
  13. Step 13

    vLLM Inference

    High-throughput inference using vLLM.

    # Install vLLM
    pip install llama-cookbook[vllm]
    
    # Or install separately
    pip install vllm
    
    # Usage with vllm package:
    from vllm import LLM, SamplingParams
    
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct-hf",
        tensor_parallel_size=4,
        max_model_len=8192
    )
    
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=512
    )
    
    outputs = llm.generate(
        prompts=["What is machine learning?"],
        sampling_params=sampling_params
    )
    
    for output in outputs:
        print(output.outputs[0].text)
  14. Step 14

    Repository Structure

    Understanding the organization:

    llama-cookbook/
    ├── 3p-integrations/       # Provider-specific recipes
    ├── end-to-end-use-cases/  # Complete applications
    │   ├── whatsapp_llama_4_bot/
    │   ├── research_paper_analyzer/
    │   └── ...
    ├── getting-started/       # Core tutorials
    │   ├── Inference examples
    │   ├── Finetuning examples
    │   ├── RAG examples
    │   ├── build_with_llama_api.ipynb
    │   └── build_with_llama_4.ipynb
    ├── src/                   # llama-recipes library
    │   ├── llama_cookbook/
    │   └── docs/              # Fine-tuning FAQ
    ├── pyproject.toml
    └── README.md
  15. Step 15

    End-to-End Use Cases

    Complete application examples.

    Available Use Cases:
    
    1. WhatsApp Llama 4 Bot
       - Path: end-to-end-use-cases/whatsapp_llama_4_bot/
       - Integrate Llama API with WhatsApp
    
    2. Research Paper Analyzer
       - Path: end-to-end-use-cases/research_paper_analyzer/
       - Analyze academic papers with Llama 4
    
    3. Book Character Mind Map
       - Path: end-to-end-use-cases/book-character-mindmap/
       - Create character relationships from books
    
    4. 5M Token Long Context
       - Path: getting-started/build_with_llama_4.ipynb
       - Handle extremely long documents (Llama 4)
  16. Step 16

    Key Features

    Llama Cookbook capabilities:

    1. Inference: Text generation with Llama models
    2. Fine-tuning: Full, LoRA, QLoRA methods
    3. RAG: Retrieval augmented generation
    4. Distillation: Knowledge transfer
    5. End-to-End: Complete applications
    6. 3P Integrations: Provider recipes
    7. Multi-GPU: Distributed training
    8. Jupyter: Interactive notebooks
    9. Quantization: 8-bit, 4-bit
    10. vLLM: High-throughput inference
    11. Safety: AuditNLG checker
    12. LangChain: Integration
    13. Long Context: 5M tokens (Llama 4)
    14. Llama API: Official API
    15. Hugging Face Native
  17. Step 17

    Use Cases

    Ideal applications:

    1. Domain Adaptation: Fine-tune for specific domains
    2. Custom Models: Task-specific variants
    3. RAG Systems: Knowledge-grounded apps
    4. Chatbots: Conversational AI
    5. Content Generation: Text workflows
    6. Code Generation: Programming
    7. Research: LLM experimentation
    8. Production: Enterprise deployments
    9. API Integration: Cloud solutions
    10. Cost Reduction: Smaller models
  18. Step 18

    FAQ

    Frequently asked questions.

    Q: What happened to llama-recipes?
    A: Renamed to llama-cookbook.
    
    Q: Links broken/folders missing?
    A: Repo was refactored. Use archive-main branch.
    
    Q: Where to find model details?
    A: https://www.llama.com
    
    Q: How to access models?
    A: Visit https://huggingface.co/meta-llama
    
    Q: 'hf' suffix vs original models?
    A: 'hf' = already converted to HF format.
    
    Q: Minimum GPU requirement?
    A: 7B with QLoRA: 16GB VRAM. 70B: Multi-GPU.
  19. Step 19

    Resources

    Additional resources.

    Main Resources:
    - Repository: https://github.com/meta-llama/llama-cookbook
    - Llama Models: https://www.llama.com
    - Llama API: https://llama.developer.meta.com
    - Hugging Face: https://huggingface.co/meta-llama
    - Models: https://github.com/meta-llama/llama-models
    - Synthetic Data Kit: https://github.com/meta-llama/synthetic-data-kit
    - Llama Prompt Ops: https://github.com/meta-llama/llama-prompt-ops
    - Contributing: See CONTRIBUTING.md

Feature requests

Sign in to suggest features or vote on existing ones.

No feature requests yet.

Discussion

0 people marked this as worked·Sign in to mark your own.

Sign in to join the discussion.

No comments yet.