Pro 6000 Blackwell Optimized Models
One-click deploy templates for NVIDIA RTX Pro 6000 Blackwell GPUs with 96GB VRAM. Run 70B+ parameter models at full performance.
Overview
The Pro 6000 Blackwell section features AI models specifically optimized for NVIDIA RTX Pro 6000 Blackwell GPUs with 96GB VRAM. These pre-configured templates enable one-click deployment of large language models, taking full advantage of the Blackwell architecture.
Key Features
- 96GB VRAM Capacity - Run 70B+ parameter models
- Blackwell Architecture - Latest NVIDIA GPU technology
- vLLM Optimized - Pre-configured for maximum throughput
- One-Click Deploy - Launch production-ready models instantly
- AWQ/GPTQ Support - Quantized models for efficiency
- FP8 Precision - Leverage Blackwell's native FP8 support
Hardware Specifications
NVIDIA RTX Pro 6000 Blackwell
| Specification | Value |
|---|---|
| GPU Memory | 96 GB GDDR7 |
| Memory Bandwidth | 1.8 TB/s |
| CUDA Cores | 18,176 |
| Tensor Cores | 568 (5th gen) |
| TDP | 350W |
| FP8 Performance | 2,500+ TFLOPS |
| FP16 Performance | 1,250+ TFLOPS |
Model Size Guidelines
| Workload | Fit | Notes |
|---|---|---|
| 7B models (FP16) | Excellent | ~14GB, room for long context |
| 13B models (FP16) | Excellent | ~26GB, fast inference |
| 34B models (FP16) | Good | ~68GB, fits with room to spare |
| 70B models (AWQ/FP8) | Good | ~35GB (AWQ INT4) to ~70GB (FP8) |
| 70B models (FP16) | Marginal | ~140GB needed, use multi-GPU |
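As a rough sizing rule, weight memory is about the parameter count times the bytes per parameter (2 for FP16/BF16, 1 for FP8, ~0.5 for AWQ/GPTQ INT4), plus headroom for KV cache and activations. A minimal sketch of that arithmetic (the helper below is illustrative, not part of any template):
# Rough weight-memory estimate; leave extra VRAM for KV cache and activations
estimate_weight_gb() {
  # $1 = parameters in billions, $2 = bytes per parameter
  python3 -c "print(f'{$1 * $2:.0f} GB')"
}
estimate_weight_gb 70 2    # FP16/BF16 -> ~140 GB (does not fit on one GPU)
estimate_weight_gb 70 1    # FP8       -> ~70 GB
estimate_weight_gb 70 0.5  # AWQ INT4  -> ~35 GB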
Pre-Configured Models
Llama 3.1 70B Instruct
- Model: meta-llama/Llama-3.1-70B-Instruct
- VRAM Required: ~70GB (FP8)
- Context Length: 128K tokens
- Use Cases: General assistant, coding, analysis
- Performance: ~50 tokens/sec
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--quantization fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.95
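Once the server is up, it exposes an OpenAI-compatible API on port 8000. A quick smoke test with curl (adjust the host and model name to your deployment):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'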
Qwen 2.5 72B Instruct
- Model: Qwen/Qwen2.5-72B-Instruct
- VRAM Required: ~72GB (FP8)
- Context Length: 128K tokens
- Use Cases: Multilingual, reasoning, math
- Performance: ~45 tokens/sec
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--quantization fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.95 \
--trust-remote-code
DeepSeek R1 Distill 32B
- Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
- VRAM Required: ~65GB (FP16)
- Context Length: 64K tokens
- Use Cases: Reasoning, chain-of-thought, math
- Performance: ~80 tokens/sec
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
--max-model-len 16384 \
--gpu-memory-utilization 0.90
Gemma 2 27B Instruct
- Model: google/gemma-2-27b-it
- VRAM Required: ~54GB (FP16)
- Context Length: 8K tokens
- Use Cases: Fast inference, general tasks
- Performance: ~90 tokens/sec
python -m vllm.entrypoints.openai.api_server \
--model google/gemma-2-27b-it \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
Llama 3.1 70B AWQ (Quantized)
- Model: hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
- VRAM Required: ~38GB
- Context Length: 128K tokens
- Performance: ~65 tokens/sec (faster than the FP8 deployment)
python -m vllm.entrypoints.openai.api_server \
--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
--host 0.0.0.0 \
--port 8000 \
--quantization awq \
--max-model-len 32768 \
--gpu-memory-utilization 0.95
Model Comparison
Performance vs VRAM
| Model | VRAM | Tokens/sec | Context |
|---|---|---|---|
| Llama 3.1 70B FP8 | 70GB | ~50 | 128K |
| Llama 3.1 70B AWQ | 38GB | ~65 | 128K |
| Qwen 2.5 72B FP8 | 72GB | ~45 | 128K |
| DeepSeek R1 32B | 65GB | ~80 | 64K |
| Gemma 2 27B | 54GB | ~90 | 8K |
Recommended by Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| General Assistant | Llama 3.1 70B | Best all-around quality |
| Coding | DeepSeek R1 32B | Reasoning-distilled, strong at code |
| Multilingual | Qwen 2.5 72B | Excellent non-English |
| Fast Inference | Gemma 2 27B | Highest throughput |
| Long Context | Qwen 2.5 72B AWQ | 128K with efficiency |
| Math/Reasoning | DeepSeek R1 32B | Chain-of-thought |
Optimal vLLM Settings
python -m vllm.entrypoints.openai.api_server \
--model YOUR_MODEL_HERE \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--enforce-eager \
--enable-chunked-prefill \
--max-num-seqs 32 \
--api-key YOUR_API_KEY
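Before routing traffic to the server, you can verify it is ready; vLLM's OpenAI-compatible server exposes a /health endpoint and the standard /v1/models listing (pass the API key once --api-key is set):
# Liveness check (prints 200 when the engine is up)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
# Confirm the served model is registered
curl -s http://localhost:8000/v1/models -H "Authorization: Bearer YOUR_API_KEY"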
Memory Optimization Tips
- Reduce max-model-len if you don't need long context
- Use quantized models (AWQ preferred over GPTQ)
- Lower max-num-seqs for less KV cache usage
- Set gpu-memory-utilization to 0.90-0.95
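To confirm how much headroom these settings actually leave, watch VRAM usage while the server handles requests, for example:
# Sample GPU memory every 2 seconds
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2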
Troubleshooting
Out of Memory (OOM)
Symptoms: CUDA OOM error during model loading or inference
Solutions:
- Use quantized model (AWQ/GPTQ)
- Reduce --max-model-len
- Lower --max-num-seqs
- Set --gpu-memory-utilization 0.85
- Use --enforce-eager to disable CUDA graphs
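Combining several of these mitigations, a reduced-memory launch might look like this (values are illustrative; tune them to your workload):
python -m vllm.entrypoints.openai.api_server \
--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
--quantization awq \
--max-model-len 8192 \
--max-num-seqs 8 \
--gpu-memory-utilization 0.85 \
--enforce-eager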
Slow Loading
Symptoms: Model takes >5 minutes to load
Solutions:
- Use --load-format auto or safetensors
- Pre-download model: huggingface-cli download MODEL_ID
- Use local SSD storage for model weights
- Enable persistent storage for caching
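For example, to cache weights on local or persistent storage before the first launch (the cache path below is an assumption; HF_HOME controls where huggingface-cli and vLLM look for downloads):
# Point the Hugging Face cache at fast local/persistent storage
export HF_HOME=/workspace/hf-cache
# Pre-download so vLLM loads from local disk instead of streaming from the Hub
huggingface-cli download hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4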
Low Throughput
Symptoms: Tokens/sec lower than expected
Solutions:
- Enable --enable-chunked-prefill
- Increase --max-num-seqs for batching
- Use AWQ quantization (often faster than FP8/FP16)
- Check for thermal throttling in GPU metrics
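A quick way to check for thermal or power throttling while a load test is running:
# Watch temperature, SM clock, and power draw every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,clocks.sm,power.draw --format=csv -l 5
# Detailed report, including active clock throttle reasons
nvidia-smi -q -d PERFORMANCE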
Performance Benchmarks
Tested on Pro 6000 Blackwell (96GB)
| Model | Batch 1 | Batch 8 | Batch 32 |
|---|---|---|---|
| Llama 3.1 70B AWQ | 42 t/s | 180 t/s | 450 t/s |
| Qwen 2.5 72B GPTQ | 38 t/s | 160 t/s | 400 t/s |
| DeepSeek R1 32B | 65 t/s | 280 t/s | 680 t/s |
| Gemma 2 27B | 78 t/s | 340 t/s | 820 t/s |
t/s = tokens per second, output tokens only
Security Recommendations
API Authentication
Always enable API key authentication in production:
# Generate secure key
API_KEY=$(openssl rand -hex 32)
# Start with auth
python -m vllm.entrypoints.openai.api_server \
--model YOUR_MODEL \
--api-key $API_KEY
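Clients then authenticate with a standard OpenAI-style bearer token, for example:
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "YOUR_MODEL", "messages": [{"role": "user", "content": "ping"}]}'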
Need Help?
Contact us at support@packet.ai
