
Ollama on a dedicated NVMe with performance tuning

Repurpose a second NVMe as Ollama model storage, install Ollama, and tune it for a 24 GB RTX 3090 — KV cache quantization, FlashAttention, model context windows, and a curated model stack with routing guidance.

  1. Step 1

    Prerequisites

    This guide assumes you've already finished NVIDIA GPU passthrough to an Ubuntu VM on Proxmox. That guide covers the underlying setup this one builds on. If you're picking up mid-stack, scan its 'Overview' step to confirm your environment matches.

  2. Step 2

    Set up the model NVMe

    Dedicate a fast NVMe to Ollama's model store. The original build used a second 1.8 TB NVMe at /dev/nvme1n1 — substitute your own device path throughout.

    If the drive has any existing partitions or filesystems, deal with them first using your preferred tools (lvremove / vgremove / pvremove for LVM, mdadm --zero-superblock for old mdraid, etc.) before wiping.

    # Wipe the drive
    wipefs -a /dev/nvme1n1
    sgdisk --zap-all /dev/nvme1n1
    # Partition and format
    parted /dev/nvme1n1 mklabel gpt
    parted /dev/nvme1n1 mkpart primary ext4 0% 100%
    mkfs.ext4 /dev/nvme1n1p1
    # Mount permanently
    mkdir -p /mnt/models
    blkid /dev/nvme1n1p1  # note UUID
    echo 'UUID=<uuid>  /mnt/models  ext4  defaults  0  2' >> /etc/fstab
    mount -a
    systemctl daemon-reload
    # Register with Proxmox
    pvesm add dir model-storage --path /mnt/models --content images,rootdir
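    The `<uuid>` placeholder in the fstab line above has to be filled in by hand. A minimal sketch of one way to skip the copy-paste step, assuming `blkid -s UUID -o value` prints the bare UUID of the partition; `fstab_line` is a hypothetical helper name used only for illustration:

```shell
# Hypothetical helper: print an fstab entry for an ext4 filesystem.
# $1 = filesystem UUID, $2 = mount point
fstab_line() {
    printf 'UUID=%s  %s  ext4  defaults  0  2\n' "$1" "$2"
}

# Example use on the Proxmox host (left commented so nothing is written by accident):
# fstab_line "$(blkid -s UUID -o value /dev/nvme1n1p1)" /mnt/models >> /etc/fstab
```

    Printing the line first and appending it only after a visual check keeps a malformed entry out of /etc/fstab.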
  3. Step 3

    Allocate Disk to Ollama VM

    Add a 500 GB virtual disk backed by the model-storage NVMe. Directory-backed storage doesn't support snapshots — exclude this disk from snapshots and backups explicitly so neither operation fails on the VM.

    # Add 500GB virtual disk on model-storage to the VM
    qm set 100 --scsi1 model-storage:500
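    # Exclude the disk from backups with backup=0 (confirm the volume name with `qm config 100` first).
    # snapshot=0 only leaves QEMU's temporary-write snapshot mode off; raw images on dir storage cannot be snapshotted either way.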
    qm set 100 --scsi1 model-storage:100/vm-100-disk-0.raw,size=500G,backup=0,snapshot=0
  4. Step 4

    Format and Mount Inside VM

    Inside the VM, format the new virtual disk ext4 and mount it permanently at /mnt/models. This is where all Ollama model weights will live.

    # SSH into VM
    ssh ollama@192.168.20.110
    lsblk  # confirm the new 500G disk really is /dev/sdb before formatting
    sudo mkfs.ext4 /dev/sdb
    sudo mkdir -p /mnt/models
    sudo blkid /dev/sdb  # note UUID
    echo 'UUID=<uuid>  /mnt/models  ext4  defaults  0  2' | sudo tee -a /etc/fstab
    sudo mount -a
    sudo systemctl daemon-reload
    df -h /mnt/models  # should show ~467GB available
  5. Step 5

    Install Ollama

    Run the official Ollama install script. It detects the NVIDIA GPU automatically and installs a systemd service that starts on boot.

    curl -fsSL https://ollama.com/install.sh | sh
    # Auto-detects NVIDIA GPU
    # Creates ollama systemd service
    # Verify
    systemctl status ollama
    ollama list
  6. Step 6

    Point Ollama at the Model NVMe

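    By default, the Ollama systemd service stores models under the ollama user's home directory (/usr/share/ollama/.ollama/models on a standard Linux install; ~/.ollama/models when run as a regular user). Override this to use the 500 GB virtual disk on the dedicated NVMe: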

    sudo mkdir -p /mnt/models/ollama
    sudo chown ollama:ollama /mnt/models/ollama
    # Create systemd override
    sudo systemctl edit ollama
    # In the editor that opens, add these two lines (without the leading "#"), then save and exit:
    #   [Service]
    #   Environment="OLLAMA_MODELS=/mnt/models/ollama"
    sudo systemctl daemon-reload
    sudo systemctl restart ollama
    # Verify override
    sudo cat /etc/systemd/system/ollama.service.d/override.conf
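    The interactive `systemctl edit` step can also be scripted. The sketch below is one possible approach, not the installer's own method: `make_dropin` is a hypothetical helper that prints drop-in content for any number of NAME=value pairs, and the commented usage assumes the override path shown in this step.

```shell
# Hypothetical helper: print a systemd drop-in that sets environment variables.
# Each argument is one NAME=value pair.
make_dropin() {
    printf '[Service]\n'
    for pair in "$@"; do
        printf 'Environment="%s"\n' "$pair"
    done
}

# Example use inside the VM (commented out; writes to the path shown in this step):
# sudo mkdir -p /etc/systemd/system/ollama.service.d
# make_dropin OLLAMA_MODELS=/mnt/models/ollama |
#     sudo tee /etc/systemd/system/ollama.service.d/override.conf
# sudo systemctl daemon-reload && sudo systemctl restart ollama
```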
  7. Step 7

    Performance Tuning

    Three systemd drop-in files configure Ollama. All live in /etc/systemd/system/ollama.service.d/ (system service, not user service). The effect of each:

    • override.conf: sets the OLLAMA_MODELS path (created during initial setup).
    • kv-cache.conf: OLLAMA_KV_CACHE_TYPE=q8_0 stores the KV cache in 8-bit instead of fp16, roughly halving KV VRAM usage with negligible quality loss. Critical for fitting 32K+ context on 24 GB. Ollama applies the quantized cache only when FlashAttention is enabled, so this setting depends on the next one.
    • performance.conf: OLLAMA_FLASH_ATTENTION=1 enables FlashAttention for a 10–30% throughput improvement. OLLAMA_KEEP_ALIVE=20m keeps models loaded for 20 minutes after the last request. Keeping the timeout bounded helps when multiple runtimes share the GPU: once it expires, VRAM is freed so a different agent can load its model.

    Note: deepseek-r1:32b is capped at 12K context. At 32K the model weights (19 GB) plus KV cache exceed 24 GB VRAM and the load fails with an out-of-memory error. Set num_ctx 12288 in its Modelfile.

    # /etc/systemd/system/ollama.service.d/kv-cache.conf
    [Service]
    Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
    # /etc/systemd/system/ollama.service.d/performance.conf
    [Service]
    Environment="OLLAMA_FLASH_ATTENTION=1"
    Environment="OLLAMA_KEEP_ALIVE=20m"
    sudo systemctl daemon-reload && sudo systemctl restart ollama
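    To see why the KV cache type and the context window matter on a 24 GB card, here is a rough back-of-the-envelope estimate. Everything in it is an assumption for illustration: `kv_cache_mib` is a hypothetical helper, the model shape (64 layers, 8 KV heads, head dimension 128) is an assumed shape for a 32B-class model rather than a value taken from this guide, and q8_0 is assumed to pack 32 values into 34-byte blocks. Real usage also includes compute buffers that this ignores.

```shell
# Hypothetical helper: estimate KV cache size in MiB.
# $1 = layers, $2 = KV heads, $3 = head dimension, $4 = context length (tokens)
# $5/$6 = bytes per cached element as a fraction (fp16: 2/1; q8_0: 34/32,
#         assuming 34-byte blocks holding 32 values)
kv_cache_mib() {
    # The factor of 2 covers the separate K and V tensors.
    echo $(( 2 * $1 * $2 * $3 * $4 * $5 / $6 / 1048576 ))
}

# Assumed shape for a 32B-class model: 64 layers, 8 KV heads, head dimension 128.
kv_cache_mib 64 8 128 32768 2 1     # fp16 at 32K context -> 8192
kv_cache_mib 64 8 128 32768 34 32   # q8_0 at 32K context -> 4352
kv_cache_mib 64 8 128 12288 34 32   # q8_0 at 12K context -> 1632
```

    Under these assumptions, q8_0 at 32K adds roughly 4.25 GiB on top of 19 GB of weights, which leaves little or no room for compute buffers on a 24 GB card, while 12K adds about 1.6 GiB.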
  8. Step 8

    Model Stack

    All models use Q4_K_M quantization unless noted and are configured with the context windows below via Modelfile. The RTX 3090 has 24 GB VRAM: leave at least 2–3 GB headroom and do not load multiple large models simultaneously.

    Model                              Use                            Context  VRAM    Notes
    llama3.2:1b                        Sanity check / GPU test        32K      1.3 GB  Q8_0
    qwen3-coder:30b                    Default / coding / tool calls  32K      18 GB   MoE, ~147 tok/s
    qwen3.5:27b                        General reasoning              64K      17 GB   Q4_K_M, thinking adaptive
    deepseek-r1:32b                    Deep reasoning                 12K      19 GB   Q4_K_M, avoid tool-use, ~38 tok/s
    dolphin3                           Uncensored / fast              32K      5 GB    Q4_K_M, ~149 tok/s
    huihui_ai/qwen3.5-abliterated:27b  Uncensored reasoning           64K      17 GB   Q4_K_M
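    The headroom rule can be checked before loading a model. This is a minimal sketch under stated assumptions: `fits_in_vram` is a hypothetical helper, all sizes are in MiB, and the commented usage assumes a single GPU so that `nvidia-smi` returns one line per query.

```shell
# Hypothetical helper: succeed if a model fits while keeping headroom free.
# $1 = VRAM currently used (MiB), $2 = total VRAM (MiB),
# $3 = model footprint (MiB),     $4 = headroom to keep free (MiB)
fits_in_vram() {
    [ $(( $1 + $3 + $4 )) -le "$2" ]
}

# Example use (commented out; needs an NVIDIA driver and a single GPU):
# used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits)
# total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits)
# fits_in_vram "$used" "$total" 19456 2048 && echo "ok to load" || echo "too tight"
```

    With 24576 MiB total, a 19 GB model (19456 MiB) plus 2 GiB of headroom fits on an idle card but not while another large model is still resident.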
  9. Step 9

    Model Routing Guidance

    When routing tasks from Claude Code to local models via the Ollama MCP server, use this as a guide:

    • Default local route: qwen3-coder:30b — best tool calling reliability, RL-trained on SWE-bench
    • General reasoning: qwen3.5:27b — use for non-coding tasks; supports adaptive thinking
    • Complex reasoning: deepseek-r1:32b — capped at 12K context (OOMs at 32K); avoid for tool-use tasks
    • Uncensored general: dolphin3
    • Uncensored reasoning: huihui_ai/qwen3.5-abliterated:27b
    • Quick tests: llama3.2:1b

    Note: Route to uncensored models only when content filters on standard models are blocking a legitimate task, for creative writing with mature themes, or for security research/red-teaming. Never use them as the default.
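    The routing list can be captured as a small lookup so scripts pick a model consistently. This is only a sketch: `route_model` and its category names are hypothetical labels invented for illustration, and unknown categories fall back to the default local route.

```shell
# Hypothetical helper: map a task category to a model from the list above.
route_model() {
    case "$1" in
        coding|tools|default)   echo "qwen3-coder:30b" ;;
        reasoning)              echo "qwen3.5:27b" ;;
        deep-reasoning)         echo "deepseek-r1:32b" ;;
        uncensored)             echo "dolphin3" ;;
        uncensored-reasoning)   echo "huihui_ai/qwen3.5-abliterated:27b" ;;
        test)                   echo "llama3.2:1b" ;;
        *)                      echo "qwen3-coder:30b" ;;   # fall back to the default route
    esac
}

route_model deep-reasoning   # prints deepseek-r1:32b
```

    A caller could then run something like `ollama run "$(route_model coding)" '<prompt>'`, keeping the uncensored categories as explicit opt-ins rather than defaults.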
  10. Step 10

    Pull and Configure

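    Each model requires a Modelfile to set its context window. The pattern below uses 32K (num_ctx 32768); substitute the value from the model stack for other models, for example 65536 for the 64K models and 12288 for deepseek-r1:32b. The pattern for each: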

    # Check disk space first
    df -h /mnt/models
    # Pull base model
    ollama pull <model-name>
    # Create Modelfile with 32K context
    cat > /tmp/modelfile << 'EOF'
    FROM <model-name>
    PARAMETER num_ctx 32768
    EOF
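    # Build a temporary copy with the context baked in, then swap it in under the original name.
    # (After the rm, FROM <model-name> no longer resolves locally, so copy the temporary model instead of re-creating it.)
    ollama create <model-name>-32k -f /tmp/modelfile
    ollama rm <model-name>
    ollama cp <model-name>-32k <model-name>
    ollama rm <model-name>-32k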
    # Verify GPU inference
    ollama run llama3.2:1b 'say hello in one sentence'
    watch -n1 nvidia-smi  # monitor GPU usage in separate session
    # Confirm final model list
    ollama list
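    Because the context window differs per model (32768 for the 32K models, 65536 for the 64K ones, 12288 for deepseek-r1:32b), generating the Modelfile from two arguments avoids editing the heredoc each time. A minimal sketch, assuming the two-line Modelfile shown in this step is all that is needed; `make_modelfile` and the model name `deepseek-r1-12k` are hypothetical:

```shell
# Hypothetical helper: print a Modelfile that pins the context window.
# $1 = base model name, $2 = context length in tokens
make_modelfile() {
    printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2"
}

# Print the Modelfile for the 12K deep-reasoning model.
make_modelfile deepseek-r1:32b 12288

# Example use (commented out; needs a running Ollama):
# make_modelfile deepseek-r1:32b 12288 > /tmp/modelfile
# ollama create deepseek-r1-12k -f /tmp/modelfile
# ollama show deepseek-r1-12k --modelfile   # inspect the stored parameters
```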
  11. Step 11

    Verify Model Storage Location

    Confirm that Ollama is writing model data to the NVMe-backed mount at /mnt/models/ollama rather than to its default directory.

    ls /mnt/models/ollama/
    # Should show: blobs  manifests
    ollama list
    # Models listed here are stored on the NVMe
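    A directory listing shows that files exist, but not that the running service is actually configured to use that path. The sketch below reads the service environment instead. It assumes `systemctl show ollama --property=Environment` prints a single space-separated line and that the path contains no spaces; `models_dir_from_env` is a hypothetical helper.

```shell
# Hypothetical helper: extract the OLLAMA_MODELS value from the output of
# `systemctl show ollama --property=Environment`, read on stdin.
models_dir_from_env() {
    # One variable per line, then strip the optional "Environment=" prefix.
    tr ' ' '\n' | sed -n 's/^\(Environment=\)\{0,1\}OLLAMA_MODELS=//p'
}

# Example use inside the VM (commented out; needs the ollama service):
# systemctl show ollama --property=Environment | models_dir_from_env
```

    If the command prints nothing, the override from the earlier step is not in effect and models are still going to the default location.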
  12. Step 12

    Next: continue building the stack

    With this layer in place, the next guide in the series is Self-host OpenClaw with HTTPS, Brave search, and GitHub access. Run the OpenClaw agent gateway on your Ollama VM behind an nginx TLS reverse proxy. Wire up the Brave search skill, give it authenticated GitHub access for code tasks, and configure a CLAUDE.md so remote Claude Code sessions have persistent context.
