DeepSeek Open Infrastructure Index: Production AI Infrastructure Toolkit

A comprehensive collection of production-tested AI infrastructure tools for AGI development, including GPU kernels, distributed systems, parallel file systems, and MoE training components.

  1. Step 1

    What is Open Infrastructure Index?

    DeepSeek's Open Infrastructure Index is a curated collection of production-tested AI infrastructure components that power DeepSeek's online services, released as open source to support community-driven work on efficient AGI development. The index repository is licensed under CC0-1.0 (public domain dedication); each component lives in its own repository with its own license file, so check that file before commercial or research use.

    The V3/R1 inference system built on these components reports a throughput of 73.7k input and 14.8k output tokens per second per H800 node.

  2. Step 2

    Core components overview

    The Open Infrastructure Index consists of seven major components, each solving specific challenges in AI infrastructure:

    FlashMLA: Efficient Multi-head Latent Attention (MLA) decoding kernel for Hopper GPUs with BF16 support and a paged KV cache (block size 64).

    DeepEP: The first open-source expert-parallel (EP) communication library for MoE model training and inference, supporting both intra-node (NVLink) and inter-node (RDMA) communication.

    DeepGEMM: FP8 GEMM library supporting dense and MoE operations, achieving up to 1350+ FP8 TFLOPS on Hopper GPUs.

    DualPipe: Bidirectional pipeline parallelism algorithm for computation-communication overlap in model training.

    EPLB: Expert-parallel load balancer for mixture-of-experts models.

    3FS (Fire-Flyer File System): Parallel file system achieving 6.6 TiB/s aggregate read throughput, supporting training data, checkpoints, and vector search.

    Smallpond: Lightweight data processing framework built on top of 3FS and DuckDB for petabyte-scale dataset processing.

  3. Step 3

    Technology stack

    The components of the Open Infrastructure Index target the following hardware and software:

    Hardware requirements:

    • NVIDIA Hopper GPUs (H800, SM90) or newer (SM100)
    • High-bandwidth NVLink for GPU-to-GPU communication
    • RDMA-capable networks (200-400Gbps InfiniBand demonstrated)
    • Thousands of SSDs for distributed storage (3FS)
    • Multi-NUMA domain configurations

    Software stack:

    • Programming Languages: CUDA, C++20, Python 3.8+, Rust 1.75+
    • Deep Learning: PyTorch 2.0+, NCCL 2.30.4+
    • CUDA Toolkit: 12.3+ (12.9+ recommended for SM100)
    • Build Systems: CMake, Python setuptools
    • Libraries: CUTLASS 4.0+, {fmt}, NVSHMEM
    • Database: DuckDB (for Smallpond), FoundationDB 7.1+ (for 3FS)
    • Other: libfuse 3.16.1+, various compression and SSL libraries

    Key innovations:

    • JIT (Just-In-Time) compilation for CUDA kernels at runtime
    • FP8 precision support for improved throughput
    • Disaggregated storage architecture
    • Computation-communication overlap techniques
  4. Step 4

    FlashMLA: Efficient MLA decoding kernel

    FlashMLA provides optimized attention kernels for Multi-head Latent Attention (MLA) on Hopper GPUs, supporting Multi-Query Attention (MQA) and Multi-Head Attention (MHA) modes with sparse and dense operations.

    Requirements:

    • GPU: SM90 or SM100 architecture
    • CUDA: 12.8+ (12.9+ for SM100)
    • PyTorch: 2.0+
    • Languages: C++ (49%), CUDA (39%), Python (12%)
    # Clone and install FlashMLA
    git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla
    cd flash-mla
    git submodule update --init --recursive
    pip install -v .
    
    # The installation compiles CUDA kernels for optimized attention
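
    The paged KV cache listed in the component overview stores each sequence's keys and values in fixed-size blocks (64 tokens per block for FlashMLA) that are addressed through a per-sequence block table. The following is a minimal pure-Python sketch of that bookkeeping only. `BlockAllocator` and its methods are hypothetical names chosen for illustration; they are not part of the FlashMLA API, and no GPU memory is involved.

```python
import math

BLOCK_SIZE = 64  # tokens per KV-cache block (the block size quoted for FlashMLA)


class BlockAllocator:
    """Hypothetical helper: hands out physical cache blocks from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def reserve(self, seq_id: int, seq_len: int) -> list:
        """Ensure seq_id owns enough blocks to hold seq_len tokens."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = math.ceil(seq_len / BLOCK_SIZE)
        while len(table) < needed:
            if not self.free:
                raise MemoryError("KV cache is out of free blocks")
            table.append(self.free.pop())
        return table

    def locate(self, seq_id: int, token_pos: int) -> tuple:
        """Map a logical token position to (physical block, offset in block)."""
        table = self.block_tables[seq_id]
        return table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free.extend(self.block_tables.pop(seq_id, []))


allocator = BlockAllocator(num_blocks=8)
allocator.reserve(seq_id=0, seq_len=130)  # 130 tokens need 3 blocks
allocator.reserve(seq_id=1, seq_len=64)   # 64 tokens fit exactly in 1 block
block, offset = allocator.locate(seq_id=0, token_pos=129)
print(f"token 129 of sequence 0 lives in block {block} at offset {offset}")
```

    Because blocks are allocated on demand, sequences of very different lengths can share one cache pool without reserving space for the longest possible sequence up front.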
  5. Step 5

    DeepEP: Expert-parallel communication library

    DeepEP is the first open-source expert-parallel communication library designed for efficient MoE model operations. All kernels compile at runtime via JIT, requiring no CUDA compilation during installation.

    Requirements:

    • GPU: Hopper (SM90) or architectures with SM90 PTX ISA support
    • Python: 3.8+
    • PyTorch: 2.10+
    • CUDA: 12.3+
    • NCCL: 2.30.4+
    • NVSHMEM (for legacy method support)
    • Languages: CUDA (49.8%), C++ (25.1%), Python (24.1%)
    # Install NCCL via pip for automatic library discovery
    pip install "nvidia-nccl-cu13>=2.30.4" --no-deps
    
    # Clone and build DeepEP
    git clone https://github.com/deepseek-ai/DeepEP.git
    cd DeepEP
    python setup.py build
    python setup.py install
    
    # Test the installation
    python -c "import deep_ep; print('DeepEP installed successfully')"
  6. Step 6

    DeepGEMM: FP8 GEMM library

    DeepGEMM delivers high-performance FP8 matrix operations for dense and MoE computations, reaching up to 1350+ FP8 TFLOPS on Hopper GPUs. Like DeepEP, it compiles its kernels at runtime with a JIT module, so no CUDA kernel compilation is needed during installation.

    Requirements:

    • GPU: SM90 or SM100 architecture
    • CUDA: 12.3+ (12.9+ recommended)
    • Python: 3.8+
    • PyTorch: 2.1+
    • C++20 compatible compiler
    • CUTLASS: 4.0+
    • {fmt} library
    • Languages: CUDA (48.2%), C++ (35.2%), Python (16.2%)
    # Clone with submodules
    git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
    cd DeepGEMM
    
    # Development setup
    ./develop.sh
    
    # Production installation
    ./install.sh
    
    # Verify installation
    python -c "import deep_gemm; print('DeepGEMM ready')"
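
    FP8 GEMM libraries preserve accuracy by pairing the low-precision values with higher-precision scaling factors. The sketch below illustrates per-block scaling into the FP8 E4M3 range in plain Python. The block size of 128 and the helper names are assumptions made for illustration; the code neither calls DeepGEMM nor emulates FP8 rounding.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def block_scales(values, block_size=128):
    """Return one scale per block so that block / scale fits in [-448, 448]."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Avoid division by zero for all-zero blocks.
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_to_fp8_range(values, block_size=128):
    """Divide each block by its scale; a real kernel would then cast to FP8."""
    scales = block_scales(values, block_size)
    scaled = [v / scales[i // block_size] for i, v in enumerate(values)]
    return scaled, scales


# Hypothetical activations: a small-valued block followed by a large-valued one.
activations = [0.5] * 128 + [896.0] * 128
scaled, scales = scale_to_fp8_range(activations)
print("scales per block:", scales)
print("largest scaled magnitude:", max(abs(v) for v in scaled))
```

    Scaling each block separately keeps a single outlier from forcing the whole tensor into a coarse range; the GEMM result is multiplied by the stored scales afterwards to recover the original magnitude.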
  7. Step 7

    DualPipe: Bidirectional pipeline parallelism

    DualPipe implements bidirectional pipeline parallelism for efficient computation-communication overlap during model training. It was designed for the DeepSeek V3/R1 training workflows.

    Requirements:

    • PyTorch: 2.0+
    • Languages: Python (100%)

    Note: Real-world applications require implementing a custom overlapped_forward_backward method tailored to your specific module architecture.

    # Clone DualPipe
    git clone https://github.com/deepseek-ai/DualPipe.git
    cd DualPipe
    
    # Run example scripts
    python examples/example_dualpipe.py
    python examples/example_dualpipev.py
    
    # For production use, customize the overlapped_forward_backward method
    # in your training pipeline
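
    To get a feel for why a bidirectional schedule helps, the pipeline bubble (idle time per step) can be estimated from per-chunk timings. The sketch below uses the bubble expressions published alongside DualPipe for a standard 1F1B schedule and for DualPipe. All timings are hypothetical values chosen for illustration, not measurements, and the function names are not part of the DualPipe package.

```python
def bubble_1f1b(pp: int, f: float, b: float) -> float:
    """Bubble time of a 1F1B schedule: (PP - 1) * (F + B)."""
    return (pp - 1) * (f + b)


def bubble_dualpipe(pp: int, f_and_b: float, b: float, w: float) -> float:
    """Bubble time of DualPipe: (PP / 2 - 1) * (F&B + B - 3W)."""
    return (pp / 2 - 1) * (f_and_b + b - 3 * w)


# Hypothetical per-chunk timings in milliseconds (not measurements).
PP = 8          # number of pipeline stages
F = 1.0         # forward chunk
B = 2.0         # full backward chunk
W = 0.6         # weight-gradient part of the backward chunk
F_AND_B = 2.4   # one forward and one backward chunk overlapped with each other

one_f_one_b = bubble_1f1b(PP, F, B)
dualpipe = bubble_dualpipe(PP, F_AND_B, B, W)
print(f"1F1B bubble:     {one_f_one_b:.1f} ms")
print(f"DualPipe bubble: {dualpipe:.1f} ms")
```

    The smaller bubble comes at a cost: DualPipe keeps two copies of the model parameters per device, so the trade-off depends on how much memory the pipeline stages can spare.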
  8. Step 8

    EPLB: Expert-parallel load balancer

    EPLB computes a balanced expert replication and placement plan across GPUs for mixture-of-experts models, based on estimated expert loads.

    Requirements:

    • PyTorch (version not explicitly specified)
    • Languages: Python (100%)
    # Clone EPLB
    git clone https://github.com/deepseek-ai/eplb.git
    cd eplb
    
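    # Usage example, adapted from the EPLB README
    # (run from the repository root so that eplb.py is importable)
    python << 'EOF'
    import torch
    import eplb
    
    # Estimated load per logical expert, one row per MoE layer
    weight = torch.tensor([[90, 132, 40, 61, 104, 165, 39, 4, 73, 56, 183, 86],
                           [20, 107, 104, 64, 19, 197, 187, 157, 172, 86, 16, 27]])
    
    # Compute expert rebalancing plan for 16 physical expert slots,
    # 4 expert groups, 2 nodes, and 8 GPUs in total
    phy2log, log2phy, logcnt = eplb.rebalance_experts(weight, 16, 4, 2, 8)
    print(phy2log)  # physical slot -> logical expert, per layer
    EOF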
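
    The two ideas behind a replication and placement plan, giving hot experts extra replicas and then spreading the replicas evenly over GPUs, can be sketched with a simple greedy heuristic. This is a simplified illustration under stated assumptions, not EPLB's actual algorithm: the function names are hypothetical, node and group constraints are ignored, and the loads are made-up numbers.

```python
def replicate(loads, num_slots):
    """Greedily give extra replicas to the expert with the highest load per replica."""
    counts = [1] * len(loads)
    for _ in range(num_slots - len(loads)):
        hottest = max(range(len(loads)), key=lambda e: loads[e] / counts[e])
        counts[hottest] += 1
    return counts


def place(loads, counts, num_gpus):
    """Assign replicas, heaviest first, to the least-loaded GPU with a free slot."""
    slots_per_gpu = sum(counts) // num_gpus
    replicas = sorted(
        ((loads[e] / counts[e], e) for e in range(len(loads)) for _ in range(counts[e])),
        reverse=True,
    )
    gpu_experts = [[] for _ in range(num_gpus)]
    gpu_load = [0.0] * num_gpus
    for load, expert in replicas:
        candidates = [g for g in range(num_gpus) if len(gpu_experts[g]) < slots_per_gpu]
        target = min(candidates, key=lambda g: gpu_load[g])
        gpu_experts[target].append(expert)
        gpu_load[target] += load
    return gpu_experts


loads = [90.0, 10.0, 40.0, 20.0]         # hypothetical per-expert token counts
counts = replicate(loads, num_slots=8)   # 8 physical slots for 4 logical experts
placement = place(loads, counts, num_gpus=4)
print("replicas per expert:", counts)
print("experts on each GPU:", placement)
```

    The sketch assumes the number of slots is a multiple of the number of GPUs, which mirrors the divisibility requirement a fixed number of expert slots per GPU implies.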
  9. Step 9

    3FS: Fire-Flyer parallel file system

    3FS is a high-performance parallel file system designed for AI workloads, achieving 6.6 TiB/s aggregate read throughput. It uses a disaggregated architecture combining thousands of SSDs across storage nodes with RDMA-capable networks.

    Requirements:

    • System libraries: cmake, libuv1-dev, liblz4-dev, liblzma-dev, libdouble-conversion-dev
    • Flags, logging, and test libraries: libgflags-dev, libgoogle-glog-dev, libgtest-dev, libgmock-dev
    • SSL and compression utilities
    • FoundationDB: 7.1+
    • libfuse: 3.16.1+
    • Rust: 1.75.0+
    • Compiler: clang-14 (primary), g++10 or g++11 (alternatives)
    • OS: Ubuntu 20.04/22.04, openEuler, OpenCloudOS, TencentOS
    • Languages: C++ (87%), Rust (4.4%), Python (2.1%)
    # Install system dependencies (Ubuntu)
    sudo apt-get update
    sudo apt-get install -y cmake libuv1-dev liblz4-dev liblzma-dev \
      libdouble-conversion-dev libgflags-dev libgoogle-glog-dev \
      libgtest-dev libgmock-dev clang-14
    
    # Clone 3FS
    git clone https://github.com/deepseek-ai/3FS.git
    cd 3FS
    git submodule update --init --recursive
    
    # Build with CMake
    mkdir build && cd build
    cmake .. -DCMAKE_C_COMPILER=clang-14 -DCMAKE_CXX_COMPILER=clang++-14
    make -j$(nproc)
    
    # Docker alternative for pre-configured environments
    # docker pull <3fs-image> # Check repository for official images
    ⚠ Heads up: 3FS requires significant infrastructure setup including FoundationDB cluster deployment, RDMA-capable networking, and distributed SSD storage. Refer to the official documentation for production deployment guidance.
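
    Before starting a long build, it can help to confirm that the command-line tools used by the build steps above are on the PATH. The following is a small preflight sketch using only the Python standard library. The tool list is an assumption derived from the commands and requirements in this step (`cargo` stands in for the Rust toolchain) and should be adjusted for your distribution.

```python
import shutil

# Tool names derived from the build commands in this step; adjust as needed.
REQUIRED_TOOLS = ["git", "cmake", "make", "clang-14", "clang++-14", "cargo"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("All required build tools were found on PATH.")
```

    This only checks for executables; development libraries such as libuv1-dev or FoundationDB still need to be verified through the package manager.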
  10. Step 10

    Smallpond: Data processing on 3FS

    Smallpond is a lightweight data processing framework built on top of 3FS and DuckDB, designed for petabyte-scale datasets without long-running services. It leverages DuckDB's query engine for local processing while integrating with 3FS for distributed storage.

    Requirements:

    • Python: 3.8 to 3.12
    • DuckDB: high-performance analytical database engine
    • 3FS: distributed file system backend
    • Languages: Python (100%)
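    # Install Smallpond via pip
    pip install smallpond
    
    # For development (quoted so the shell does not expand the brackets)
    pip install "smallpond[dev]"
    
    # For building documentation
    pip install "smallpond[docs]"
    
    # Run unit tests (from a clone of the repository)
    pytest
    
    # Basic usage example, adapted from the project's quick start
    # (prices.parquet is a placeholder input file; in a cluster, input and
    # output paths would typically live on 3FS)
    python << 'EOF'
    import smallpond
    
    # Initialize a session
    sp = smallpond.init()
    
    # Load the dataset and partition it by key
    df = sp.read_parquet("prices.parquet")
    df = df.repartition(3, hash_by="ticker")
    
    # Run a DuckDB SQL query on each partition
    df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)
    
    # Write the results and print them
    df.write_parquet("output/")
    print(df.to_pandas())
    EOF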
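
    The idea of running a local query engine over partitioned data can be shown without either dependency: rows are hash-partitioned by key so that every partition can be aggregated independently, and the partial results are then merged. The sketch below is a pure-Python illustration with made-up rows and hypothetical helper names; it is not Smallpond's API and does not use DuckDB.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Split rows so that all rows with the same key land in the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is stable across processes, unlike Python's built-in hash() for str.
        index = zlib.crc32(str(row[key]).encode()) % num_partitions
        parts[index].append(row)
    return parts


def min_max_by_ticker(partition):
    """Aggregate one partition: (min price, max price) per ticker."""
    stats = {}
    for row in partition:
        lo, hi = stats.get(row["ticker"], (row["price"], row["price"]))
        stats[row["ticker"]] = (min(lo, row["price"]), max(hi, row["price"]))
    return stats


rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.5},
    {"ticker": "AAA", "price": 12.0},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 8.0},
]
partitions = hash_partition(rows, key="ticker", num_partitions=3)

results = {}
for part in partitions:  # each partition could run on a different worker
    results.update(min_max_by_ticker(part))
print(results)
```

    Because a given ticker never appears in two partitions, the per-partition results can be merged with a plain dictionary update and no second aggregation pass.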
  11. Step 11

    Integration and workflow

    The Open Infrastructure Index components work together to create a complete AI training and inference pipeline:

    Training workflow:

    1. Store training data in 3FS for high-throughput access
    2. Use Smallpond to preprocess and prepare datasets
    3. Leverage DualPipe for efficient pipeline parallelism
    4. Apply DeepEP for expert-parallel communication in MoE models
    5. Use EPLB to balance expert loads across GPUs
    6. Accelerate matrix operations with DeepGEMM FP8 kernels

    Inference workflow:

    1. Load model checkpoints from 3FS
    2. Use FlashMLA for optimized attention operations
    3. Apply DeepGEMM for fast FP8 inference
    4. Leverage DeepEP for distributed MoE inference

    Performance considerations:

    • FP8 precision trades minimal accuracy for significant throughput gains
    • JIT compilation allows kernel optimization at runtime
    • Disaggregated storage separates compute and storage scaling
    • RDMA networks minimize communication overhead
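    # Example: how the components fit into one MoE step
    # Illustrative outline, not runnable as-is: the tensors and sizes are
    # placeholders, and only the eplb call mirrors an actual function signature.
    # See each repository's README for the real DeepEP, DeepGEMM, and FlashMLA calls.
    import torch
    import deep_ep      # DeepEP: expert-parallel dispatch and combine
    import deep_gemm    # DeepGEMM: FP8 GEMM kernels
    import eplb         # EPLB: expert replication and placement planning
    import flash_mla    # FlashMLA: MLA attention kernels for decoding
    
    # 1. Compute load balancing plan from per-layer expert load statistics
    phy2log, log2phy, logcnt = eplb.rebalance_experts(
        expert_loads, num_replicas, num_groups, num_nodes, num_gpus
    )
    
    # 2. Dispatch tokens to their experts with DeepEP (all-to-all across GPUs)
    # 3. Run the expert MLPs as FP8 grouped GEMMs with DeepGEMM
    # 4. Combine the expert outputs with DeepEP (the reverse all-to-all)
    # 5. At inference time, run attention over the paged KV cache with FlashMLA
    #
    # FP8 casting and scaling are handled explicitly around the GEMM calls;
    # torch.autocast supports only float16 and bfloat16 on CUDA.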
  12. Step 12

    Hardware and deployment considerations

    Successfully deploying the Open Infrastructure Index requires careful hardware planning:

    GPU requirements:

    • NVIDIA Hopper (H800) or newer GPUs are required by the GPU kernel libraries (FlashMLA, DeepEP, DeepGEMM)
    • SM90 architecture minimum; SM100 supported with CUDA 12.9+
    • Multiple GPUs per node connected via NVLink for optimal performance
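
    The GPU and CUDA constraints above can be encoded as a small compatibility check. The sketch below is a hedged illustration that uses only the version thresholds quoted in this guide (SM90 minimum, CUDA 12.3+ in general, CUDA 12.9+ for SM100); `check_gpu` is a hypothetical helper, and individual components may impose stricter limits.

```python
def check_gpu(compute_capability, cuda_version):
    """Return "ok" or a reason string, given (major, minor) version tuples."""
    if compute_capability < (9, 0):
        return "unsupported: SM90 (Hopper) or newer is required"
    if cuda_version < (12, 3):
        return "unsupported: CUDA 12.3 or newer is required"
    if compute_capability >= (10, 0) and cuda_version < (12, 9):
        return "unsupported: SM100 requires CUDA 12.9 or newer"
    return "ok"


# Hypothetical configurations for illustration.
for capability, cuda in [((9, 0), (12, 4)), ((8, 0), (12, 4)), ((10, 0), (12, 8))]:
    print(capability, cuda, "->", check_gpu(capability, cuda))
```

    On a machine with PyTorch installed, the inputs can be obtained from `torch.cuda.get_device_capability()` and by parsing the `torch.version.cuda` string.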

    Network requirements:

    • 200-400Gbps InfiniBand or equivalent RDMA-capable networking
    • Low-latency interconnects for inter-node communication
    • Dedicated storage network for 3FS (optional but recommended)

    Storage requirements:

    • Thousands of SSDs for 3FS deployment (production scale)
    • High IOPS and bandwidth capabilities
    • Multi-NUMA domain configurations for optimal access patterns

    Software environment:

    • Linux distribution: Ubuntu 20.04/22.04, openEuler, OpenCloudOS, or TencentOS
    • CUDA driver matching toolkit version
    • NCCL properly configured for multi-node communication
    • FoundationDB cluster for 3FS metadata management

    Cloud deployment: Major cloud providers offer Hopper GPU instances (H100, H200) that meet the GPU requirements. Matching the performance DeepSeek reports, however, likely also depends on bare-metal deployment with custom networking and storage configurations.

    ⚠ Heads up: The Open Infrastructure Index is designed for large-scale AI infrastructure. Small-scale deployments may not realize the full performance benefits and may be better served by simpler alternatives.
  13. Step 13

    Academic papers and documentation

    DeepSeek has published academic papers detailing the design and performance of their infrastructure:

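    SC24 (Supercomputing 2024): "Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning", covering distributed training systems and storage architectures

    ISCA25 (International Symposium on Computer Architecture 2025): "Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures", covering scaling challenges, memory efficiency, and hardware considerations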

    These publications provide deep technical insights into the design decisions, performance characteristics, and optimization techniques used throughout the infrastructure stack.

  14. Step 14

    Resources and community

    Main repository: https://github.com/deepseek-ai/open-infra-index

    Individual component repositories:

    • FlashMLA: https://github.com/deepseek-ai/FlashMLA
    • DeepEP: https://github.com/deepseek-ai/DeepEP
    • DeepGEMM: https://github.com/deepseek-ai/DeepGEMM
    • DualPipe: https://github.com/deepseek-ai/DualPipe
    • EPLB: https://github.com/deepseek-ai/eplb
    • 3FS: https://github.com/deepseek-ai/3FS
    • Smallpond: https://github.com/deepseek-ai/smallpond

    License: CC0-1.0 (public domain dedication) for the index repository; see each component repository for its own license

    Language support: Primary documentation in English; Chinese language documentation available

    Community: Each repository has its own issue tracker and contribution guidelines

