
Pro 6000 Blackwell Optimized Models

One-click deploy templates for NVIDIA RTX Pro 6000 Blackwell GPUs with 96GB VRAM. Run 70B+ parameter models at full performance.

Overview

The Pro 6000 Blackwell section features AI models specifically optimized for NVIDIA RTX Pro 6000 Blackwell GPUs with 96GB VRAM. These pre-configured templates enable one-click deployment of large language models, taking full advantage of the Blackwell architecture.

Key Features

  • 96GB VRAM Capacity - Run 70B+ parameter models
  • Blackwell Architecture - Latest NVIDIA GPU technology
  • vLLM Optimized - Pre-configured for maximum throughput
  • One-Click Deploy - Launch production-ready models instantly
  • AWQ/GPTQ Support - Quantized models for efficiency
  • FP8 Precision - Leverage Blackwell's native FP8 support

Hardware Specifications

NVIDIA RTX Pro 6000 Blackwell

Specification        Value
GPU Memory           96 GB GDDR7
Memory Bandwidth     1.8 TB/s
CUDA Cores           18,176
Tensor Cores         568 (5th gen)
TDP                  350W
FP8 Performance      2,500+ TFLOPS
FP16 Performance     1,250+ TFLOPS
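
To confirm which GPU your instance actually exposes before deploying, a quick nvidia-smi query (standard query fields) reports the model, total VRAM, and driver version:

nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv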

Model Size Guidelines

Workload                Fit         Notes
7B models (FP16)        Excellent   ~14GB weights, room for long context
13B models (FP16)       Excellent   ~26GB weights, fast inference
34B models (FP16)       Good        ~68GB weights, fits with care
70B models (FP8)        Good        ~70GB weights, fits with care
70B models (AWQ INT4)   Good        ~35-40GB weights, ample KV-cache headroom
70B models (FP16)       Won't fit   ~140GB needed, use multi-GPU
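
These estimates follow a simple rule of thumb: weight memory ≈ parameter count x bytes per parameter (2 bytes for FP16/BF16, 1 byte for FP8, roughly 0.5 bytes for INT4 AWQ/GPTQ), plus headroom for the KV cache. A quick sanity check for a 70B model:

# Weight memory in GB for a 70B model at FP16, FP8, and INT4
python3 -c "p = 70e9; print(p*2/1e9, p*1/1e9, p*0.5/1e9)"   # -> 140.0 70.0 35.0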

Pre-Configured Models

Llama 3.1 70B Instruct

  • Model: meta-llama/Llama-3.1-70B-Instruct
  • VRAM Required: ~70GB (FP8)
  • Context Length: 128K tokens
  • Use Cases: General assistant, coding, analysis
  • Performance: ~50 tokens/sec
# FP8 weight quantization keeps the 70B model at roughly 70GB; FP16 weights (~140GB) do not fit on a single 96GB GPU
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
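
Once the server is up, a quick request against the OpenAI-compatible endpoint confirms it is serving (host and port match the flags above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'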

Qwen 2.5 72B Instruct

  • Model: Qwen/Qwen2.5-72B-Instruct
  • VRAM Required: ~72GB (FP8)
  • Context Length: 128K tokens
  • Use Cases: Multilingual, reasoning, math
  • Performance: ~45 tokens/sec
# FP8 quantization is required to fit the 72B model in 96GB; BF16 weights alone need ~145GB
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

DeepSeek R1 Distill 32B

  • Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
  • VRAM Required: ~65GB (FP16)
  • Context Length: 64K tokens
  • Use Cases: Reasoning, chain-of-thought, math
  • Performance: ~80 tokens/sec
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype float16 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90

Gemma 2 27B Instruct

  • Model: google/gemma-2-27b-it
  • VRAM Required: ~54GB (FP16)
  • Context Length: 8K tokens
  • Use Cases: Fast inference, general tasks
  • Performance: ~90 tokens/sec
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-2-27b-it \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

Llama 3.1 70B AWQ (Quantized)

  • Model: hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
  • VRAM Required: ~38GB
  • Context Length: 128K tokens
  • Performance: ~65 tokens/sec (faster than the FP8 build)
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --host 0.0.0.0 \
  --port 8000 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95

Model Comparison

Performance vs VRAM

Model               VRAM    Tokens/sec   Context
Llama 3.1 70B FP8   ~70GB   ~50          128K
Llama 3.1 70B AWQ   ~38GB   ~65          128K
Qwen 2.5 72B FP8    ~72GB   ~45          128K
DeepSeek R1 32B     ~65GB   ~80          64K
Gemma 2 27B         ~54GB   ~90          8K

Recommended by Use Case

Use Case            Recommended Model   Why
General Assistant   Llama 3.1 70B       Best all-around quality
Coding              DeepSeek R1 32B     Specialized training
Multilingual        Qwen 2.5 72B        Excellent non-English
Fast Inference      Gemma 2 27B         Highest throughput
Long Context        Qwen 2.5 72B AWQ    128K with efficiency
Math/Reasoning      DeepSeek R1 32B     Chain-of-thought

Optimal vLLM Settings

# General-purpose baseline for a single 96GB GPU
# (add --enforce-eager only as an OOM workaround; it disables CUDA graphs and lowers throughput)
python -m vllm.entrypoints.openai.api_server \
  --model YOUR_MODEL_HERE \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --max-num-seqs 32 \
  --api-key YOUR_API_KEY

Memory Optimization Tips

  1. Reduce max-model-len if you don't need long context
  2. Use quantized models (AWQ preferred over GPTQ)
  3. Lower max-num-seqs for less KV cache usage
  4. Set gpu-memory-utilization to 0.90-0.95
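
To check how much headroom these settings actually leave, watch VRAM usage while sending a few representative requests:

watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv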

Troubleshooting

Out of Memory (OOM)

Symptoms: CUDA OOM error during model loading or inference

Solutions:

  1. Use quantized model (AWQ/GPTQ)
  2. Reduce --max-model-len
  3. Lower --max-num-seqs
  4. Set --gpu-memory-utilization 0.85
  5. Use --enforce-eager to disable CUDA graphs
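
Applied together, a conservative fallback launch looks roughly like this (the model name is a placeholder):

python -m vllm.entrypoints.openai.api_server \
  --model YOUR_MODEL_HERE \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.85 \
  --enforce-eager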

Slow Loading

Symptoms: Model takes >5 minutes to load

Solutions:

  1. Use --load-format auto or safetensors
  2. Pre-download model: huggingface-cli download MODEL_ID
  3. Use local SSD storage for model weights
  4. Enable persistent storage for caching
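
For example, to pre-download weights onto fast local storage before the first launch (the cache path is illustrative; any local SSD directory works):

# Point the Hugging Face cache at local SSD, then fetch the weights ahead of time
export HF_HOME=/mnt/local-ssd/huggingface
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct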

Low Throughput

Symptoms: Tokens/sec lower than expected

Solutions:

  1. Enable --enable-chunked-prefill
  2. Increase --max-num-seqs for batching
  3. Use AWQ quantization (faster than FP16!)
  4. Check for thermal throttling in GPU metrics
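
To rule out thermal or power throttling (item 4), check the active throttle reasons and current clocks:

nvidia-smi -q -d PERFORMANCE                                               # look for "Clocks Throttle Reasons"
nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.sm --format=csv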

Performance Benchmarks

Tested on Pro 6000 Blackwell (96GB)

Model               Batch 1   Batch 8   Batch 32
Llama 3.1 70B AWQ   42 t/s    180 t/s   450 t/s
Qwen 2.5 72B GPTQ   38 t/s    160 t/s   400 t/s
DeepSeek R1 32B     65 t/s    280 t/s   680 t/s
Gemma 2 27B         78 t/s    340 t/s   820 t/s

t/s = tokens per second, output tokens only

Security Recommendations

API Authentication

Always enable API key authentication in production:

# Generate secure key
API_KEY=$(openssl rand -hex 32)

# Start with auth
python -m vllm.entrypoints.openai.api_server \
  --model YOUR_MODEL \
  --api-key $API_KEY
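
Clients then pass the key as a standard bearer token; requests without it are rejected:

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer $API_KEY"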

Need Help?

Contact us at support@packet.ai