Intermediate · ollama · nvme · ubuntu · performance-tuning · local-llm · rtx-3090

Ollama on a dedicated NVMe with performance tuning

Repurpose a second NVMe as Ollama model storage, install Ollama, and tune it for a 24 GB RTX 3090 — KV cache quantization, FlashAttention, model context windows, and a curated model stack with routing guidance.

Step 1
Prerequisites
This guide assumes you've already finished NVIDIA GPU passthrough to an Ubuntu VM on Proxmox. That guide covers the underlying setup this one builds on. If you're picking up mid-stack, scan its 'Overview' step to confirm your environment matches.

Step 2

Set up the model NVMe

Dedicate a fast NVMe to Ollama's model store. Run the commands in this step as root on the Proxmox host. The original build used a second 1.8 TB NVMe at /dev/nvme1n1; substitute your own device path throughout.

If the drive has any existing partitions or filesystems, deal with them first using your preferred tools (lvremove / vgremove / pvremove for LVM, mdadm --zero-superblock for old mdraid, etc.) before wiping.

```
# Wipe the drive (destructive: double-check the device path first)
wipefs -a /dev/nvme1n1
sgdisk --zap-all /dev/nvme1n1
# Partition and format
parted /dev/nvme1n1 mklabel gpt
parted /dev/nvme1n1 mkpart primary ext4 0% 100%
mkfs.ext4 /dev/nvme1n1p1
# Mount permanently
mkdir -p /mnt/models
blkid /dev/nvme1n1p1  # note UUID
echo 'UUID=<uuid>  /mnt/models  ext4  defaults  0  2' >> /etc/fstab
# Reload systemd so it picks up the new fstab entry, then mount
systemctl daemon-reload
mount -a
# Register with Proxmox
pvesm add dir model-storage --path /mnt/models --content images,rootdir
```
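
To avoid copying the UUID by hand, the fstab entry can be generated from blkid output. The sketch below is one way to do that; the helper name fstab_entry is illustrative and not part of the original build, and it only prints the line so it can be reviewed before it is appended.

```shell
# fstab_entry UUID MOUNTPOINT
# Prints an ext4 fstab line in the same format used in this step.
# (Hypothetical helper name; it only prints and never touches /etc/fstab.)
fstab_entry() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: fstab_entry UUID MOUNTPOINT" >&2
        return 1
    fi
    printf 'UUID=%s  %s  ext4  defaults  0  2\n' "$1" "$2"
}
```

For example, `fstab_entry "$(blkid -s UUID -o value /dev/nvme1n1p1)" /mnt/models` prints the entry for the new partition; append it to /etc/fstab once it looks right.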

Step 3
Allocate Disk to Ollama VM
Add a 500 GB virtual disk backed by the model-storage NVMe. The commands below use VM ID 100; substitute your own. Raw disk images on directory-backed storage don't support snapshots, so exclude this disk from snapshots and backups explicitly so that neither operation fails on the VM.
```
# Add 500GB virtual disk on model-storage to the VM
qm set 100 --scsi1 model-storage:500
# Exclude from snapshots (dir storage doesn't support snapshots)
qm set 100 --scsi1 model-storage:100/vm-100-disk-0.raw,size=500G,backup=0,snapshot=0
```
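
A quick way to confirm the flags were applied is to inspect the VM configuration. The following is a minimal sketch that assumes `qm config` prints one `key: value` line per device with comma-separated disk options; the helper name disk_has_option is hypothetical.

```shell
# disk_has_option SLOT OPTION
# Reads `qm config <vmid>` output on stdin and succeeds if the given disk
# slot (e.g. scsi1) carries the given option (e.g. backup=0).
# (Hypothetical helper; slot and option names come from the commands above.)
disk_has_option() {
    # Keep only the slot's line, split its options one per line,
    # then look for an exact match.
    grep "^$1:" | tr ',' '\n' | grep -qx "$2"
}
```

Usage on the Proxmox host might look like `qm config 100 | disk_has_option scsi1 backup=0 && echo "scsi1 is excluded from backups"`.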

Step 4

Format and Mount Inside VM

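Inside the VM, format the new virtual disk as ext4 and mount it permanently at /mnt/models. This is where all Ollama model weights will live. The commands assume the new disk appears as /dev/sdb; confirm that with lsblk before formatting.

```
# SSH into VM
ssh ollama@<ollama-vm-ip>
lsblk  # confirm the new 500G disk is /dev/sdb before formatting
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /mnt/models
sudo blkid /dev/sdb  # note UUID
echo 'UUID=<uuid>  /mnt/models  ext4  defaults  0  2' | sudo tee -a /etc/fstab
# Reload systemd so it picks up the new fstab entry, then mount
sudo systemctl daemon-reload
sudo mount -a
df -h /mnt/models  # should show ~467GB available
```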
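
Because mkfs.ext4 destroys whatever is on the target, it can help to guard the format step with a check that the disk carries no filesystem yet. This is a sketch only: the helper name looks_unformatted is hypothetical, and it assumes the disk is /dev/sdb as in the commands above.

```shell
# looks_unformatted
# Reads `lsblk -n -o FSTYPE <device>` output on stdin and succeeds only if
# no filesystem type is reported for the device or any of its partitions.
# (Hypothetical helper; a blank disk yields only empty lines from lsblk.)
looks_unformatted() {
    ! grep -q '[^[:space:]]'
}
```

One possible use is `lsblk -n -o FSTYPE /dev/sdb | looks_unformatted && sudo mkfs.ext4 /dev/sdb`, which skips formatting when a filesystem is already present.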

Step 5
Install Ollama
Run the official Ollama install script. It detects the NVIDIA GPU automatically and installs a systemd service that starts on boot.
```
curl -fsSL https://ollama.com/install.sh | sh
# Auto-detects NVIDIA GPU
# Creates ollama systemd service
# Verify
systemctl status ollama
ollama list
```
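
Beyond systemctl status, one way to confirm the server answers requests is to query its HTTP API. The sketch below assumes the default listen address of localhost:11434 and the /api/version endpoint; the helper name version_from_json is illustrative, and the parsing uses plain sed so jq is not required.

```shell
# version_from_json
# Reads a JSON object such as {"version":"1.2.3"} on stdin and prints the
# value of its "version" field. (Hypothetical helper; the sample version
# string is a placeholder, not a real release number.)
version_from_json() {
    sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

With the service running, `curl -fsS http://localhost:11434/api/version | version_from_json` should print the installed version; no output suggests the server is not reachable on that address.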

Step 6

Point Ollama at the Model NVMe

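By default, the Ollama systemd service on Linux stores models under /usr/share/ollama/.ollama/models (a per-user install uses ~/.ollama/models). Override this to use the dedicated 500 GB model disk: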

```
sudo mkdir -p /mnt/models/ollama
sudo chown ollama:ollama /mnt/models/ollama
# Create systemd override (opens an editor)
sudo systemctl edit ollama
# Add these two lines in the editor, then save and exit:
[Service]
Environment="OLLAMA_MODELS=/mnt/models/ollama"
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify override
sudo cat /etc/systemd/system/ollama.service.d/override.conf
```
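
Reading override.conf shows what was written, not what the running service actually received. A complementary check is to ask systemd for the effective environment. The sketch below assumes `systemctl show` prints a single `Environment=` line with space-separated assignments and that the values contain no spaces; the helper name env_value is hypothetical.

```shell
# env_value NAME
# Reads `systemctl show <unit> --property=Environment` output on stdin and
# prints the value assigned to NAME, if present.
# (Hypothetical helper; assumes values contain no spaces.)
env_value() {
    sed 's/^Environment=//' | tr ' ' '\n' | sed -n "s/^$1=//p"
}
```

For example, `systemctl show ollama --property=Environment | env_value OLLAMA_MODELS` should print /mnt/models/ollama once the override is active.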

Step 7
Performance Tuning
Three systemd drop-in files configure Ollama. All live in /etc/systemd/system/ollama.service.d/ (the system service, not a user service):
- override.conf: the OLLAMA_MODELS path (created in Step 6)
- kv-cache.conf: KV cache quantization
- performance.conf: FlashAttention and keep-alive

Effect of each setting:
- kv-cache.conf: OLLAMA_KV_CACHE_TYPE=q8_0 stores the KV cache in 8-bit instead of fp16, roughly halving KV VRAM usage with negligible quality loss. This is critical for fitting 32K+ context on 24 GB. Ollama only applies a quantized KV cache when FlashAttention is enabled, so this setting depends on performance.conf.
- performance.conf: OLLAMA_FLASH_ATTENTION=1 enables FlashAttention for a 10–30% throughput improvement. OLLAMA_KEEP_ALIVE=20m keeps models loaded for 20 minutes after the last request. A relatively short timeout helps when multiple runtimes share the GPU: VRAM is freed quickly so a different agent can load its model without waiting.

Note: deepseek-r1:32b is capped at 12K context. At 32K, the model weights (19 GB) plus KV cache exceed 24 GB of VRAM and the load fails with an out-of-memory error. Set num_ctx 12288 in its Modelfile.
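```
# Write /etc/systemd/system/ollama.service.d/kv-cache.conf
sudo tee /etc/systemd/system/ollama.service.d/kv-cache.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF
# Write /etc/systemd/system/ollama.service.d/performance.conf
sudo tee /etc/systemd/system/ollama.service.d/performance.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=20m"
EOF
# Apply the new drop-ins
sudo systemctl daemon-reload && sudo systemctl restart ollama
```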
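
The effect of KV cache quantization on the 12K cap can be checked with rough arithmetic. The sketch below estimates cache size from the model shape. The shape used for deepseek-r1:32b (64 layers, 8 KV heads, head dimension 128) is an assumption based on its Qwen2.5-32B base and is not stated in this guide, and the helper name kv_cache_mib is hypothetical. It ignores compute buffers and other runtime overhead.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX TYPE
# Estimates KV cache size in MiB:
#   elements = 2 (K and V) * layers * kv_heads * head_dim * context tokens
#   f16  stores 2 bytes per element
#   q8_0 stores 34 bytes per block of 32 elements (32 int8 values + one fp16 scale)
kv_cache_mib() {
    elements=$(( 2 * $1 * $2 * $3 * $4 ))
    case "$5" in
        f16)  bytes=$(( elements * 2 )) ;;
        q8_0) bytes=$(( elements * 34 / 32 )) ;;
        *)    echo "unknown cache type: $5" >&2; return 1 ;;
    esac
    echo $(( bytes / 1048576 ))
}

# Assumed shape: 64 layers, 8 KV heads, head_dim 128 (not taken from this guide).
kv_cache_mib 64 8 128 32768 f16    # prints 8192
kv_cache_mib 64 8 128 32768 q8_0   # prints 4352
kv_cache_mib 64 8 128 12288 q8_0   # prints 1632
```

Under these assumptions, 19 GB of weights plus about 4.25 GiB of q8_0 cache at 32K leaves almost nothing for compute buffers on a 24 GB card, which is consistent with the out-of-memory behavior noted above, while 12K needs only about 1.6 GiB.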

Step 8

Model Stack

All models use Q4_K_M quantization unless noted and are configured with the context windows below via Modelfile. The RTX 3090 has 24GB VRAM — leave at least 2–3GB headroom and do not load multiple large models simultaneously.

Model                              Use                            Context  VRAM    Notes
---------------------------------  -----------------------------  -------  ------  ---------------------------------
llama3.2:1b                        Sanity check / GPU test        32K      1.3 GB  Q8_0
qwen3-coder:30b                    Default / coding / tool calls  32K      18 GB   MoE, ~147 tok/s
qwen3.5:27b                        General reasoning              64K      17 GB   Q4_K_M, adaptive thinking
deepseek-r1:32b                    Deep reasoning                 12K      19 GB   Q4_K_M, avoid tool-use, ~38 tok/s
dolphin3                           Uncensored / fast              32K      5 GB    Q4_K_M, ~149 tok/s
huihui_ai/qwen3.5-abliterated:27b  Uncensored reasoning           64K      17 GB   Q4_K_M
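
The headroom rule can be checked against live numbers from nvidia-smi. The following is a sketch, assuming the CSV output of `--query-gpu=memory.used,memory.total` with `noheader,nounits` (one `used, total` line per GPU, in MiB); the helper name vram_headroom_ok and the 2048 MiB threshold in the usage example are illustrative.

```shell
# vram_headroom_ok MIN_MIB
# Reads "used, total" lines (MiB) on stdin, prints the free VRAM for each
# GPU, and fails if any GPU has less than MIN_MIB free.
# (Hypothetical helper name.)
vram_headroom_ok() {
    awk -F',' -v min="$1" '
        { free = $2 - $1; print free; if (free < min) bad = 1 }
        END { exit bad + 0 }
    '
}
```

With a model loaded, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_headroom_ok 2048` prints the free MiB and returns a non-zero status when less than about 2 GB remains.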

Step 9
Model Routing Guidance
When routing tasks from Claude Code to local models via the Ollama MCP server, use this as a guide:
- Default local route: qwen3-coder:30b — best tool calling reliability, RL-trained on SWE-bench
- General reasoning: qwen3.5:27b — use for non-coding tasks; supports adaptive thinking
- Complex reasoning: deepseek-r1:32b — capped at 12K context (OOMs at 32K); avoid for tool-use tasks
- Uncensored general: dolphin3
- Uncensored reasoning: huihui_ai/qwen3.5-abliterated:27b
- Quick tests: llama3.2:1b

Note: Route to uncensored models only when content filters on standard models are blocking a legitimate task, for creative writing with mature themes, or for security research/red-teaming. Never use them as the default.
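
The routing list above can be expressed as a small lookup for scripts that call Ollama directly. This is only a sketch: the task labels (coding, general, and so on) and the function name pick_model are invented for illustration, and unknown labels fall back to the default local route.

```shell
# pick_model TASK
# Maps a task label to a model name following the routing list above.
# (Hypothetical helper; the task labels are illustrative, not a standard.)
pick_model() {
    case "$1" in
        coding|tools)          echo "qwen3-coder:30b" ;;
        general)               echo "qwen3.5:27b" ;;
        deep-reasoning)        echo "deepseek-r1:32b" ;;
        uncensored)            echo "dolphin3" ;;
        uncensored-reasoning)  echo "huihui_ai/qwen3.5-abliterated:27b" ;;
        test)                  echo "llama3.2:1b" ;;
        *)                     echo "qwen3-coder:30b" ;;  # default local route
    esac
}
```

A call such as `ollama run "$(pick_model general)" 'summarize this changelog'` then selects the model by task. Uncensored models are never the fallback, in line with the note above.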

Step 10

Pull and Configure

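Each model needs a Modelfile to set its context window. The pattern below uses 32K (32768 tokens); substitute the value from the Model Stack table for other models, for example 65536 for 64K or 12288 for deepseek-r1:32b: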

```
# Check disk space first
df -h /mnt/models
# Pull base model
ollama pull <model-name>
# Create Modelfile with 32K context
cat > /tmp/modelfile << 'EOF'
FROM <model-name>
PARAMETER num_ctx 32768
EOF
# Recreate under the same name with the context baked in:
# build a temporary copy, remove the original, then rebuild from the copy
ollama create <model-name>-32k -f /tmp/modelfile
ollama rm <model-name>
# Point the Modelfile at the temporary copy, since the original is gone
sed -i 's|^FROM .*|FROM <model-name>-32k|' /tmp/modelfile
ollama create <model-name> -f /tmp/modelfile
ollama rm <model-name>-32k
# Verify GPU inference
ollama run llama3.2:1b 'say hello in one sentence'
watch -n1 nvidia-smi  # monitor GPU usage in separate session
# Confirm final model list
ollama list
```
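
Since the context window differs per model, the Modelfile can be generated from the model name and the context label used in the Model Stack table. The sketch below shows one way to do that; the helper names ctx_tokens and modelfile_for are hypothetical.

```shell
# ctx_tokens LABEL
# Converts a context label such as 32K into a token count (32768).
# Plain numbers are passed through unchanged.
ctx_tokens() {
    case "$1" in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# modelfile_for MODEL CTX_LABEL
# Prints a two-line Modelfile that sets num_ctx for the given base model.
# (Hypothetical helper names; the output matches the pattern above.)
modelfile_for() {
    printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$(ctx_tokens "$2")"
}
```

For example, `modelfile_for deepseek-r1:32b 12K > /tmp/modelfile` writes a Modelfile with num_ctx 12288, which can then be passed to ollama create as in the pattern above.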

Step 11

Verify Model Storage Location

Confirm that Ollama is writing model data to the NVMe-backed mount rather than the default location:

```
ls /mnt/models/ollama/
# Should show: blobs  manifests
ollama list
# Models listed here are stored on the NVMe
```
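
Listing the directory confirms that files exist, but not that the directory sits on the mounted disk; if the mount failed, the same path would quietly live on the root filesystem. A minimal sketch of that extra check follows, using POSIX `df -P`; the helper name on_mount is hypothetical, and it assumes mount point paths contain no spaces.

```shell
# on_mount PATH MOUNTPOINT
# Succeeds if PATH is stored on the filesystem mounted at MOUNTPOINT.
# (Hypothetical helper; reads the last column of `df -P`, the mount point.)
on_mount() {
    actual="$(df -P "$1" | awk 'NR == 2 { print $NF }')"
    [ "$actual" = "$2" ]
}
```

Inside the VM, `on_mount /mnt/models/ollama /mnt/models && echo "models are on the dedicated disk"` should print the confirmation message.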

Step 12
Next: continue building the stack
With this layer in place, the next guide in the series is Self-host OpenClaw with HTTPS, Brave search, and GitHub access. Run the OpenClaw agent gateway on your Ollama VM behind an nginx TLS reverse proxy. Wire up the Brave search skill, give it authenticated GitHub access for code tasks, and configure a CLAUDE.md so remote Claude Code sessions have persistent context.
