GPU Metrics Dashboard
Real-time GPU monitoring and analytics for optimal performance tracking.
Overview
The GPU Metrics Dashboard provides comprehensive monitoring of your GPU instances. Track utilization, memory usage, temperature, and inference performance in real-time to optimize your AI workloads.
Key Features
- Real-time Monitoring - Live updates every 5 seconds
- GPU Utilization - Track compute usage percentage
- Memory Tracking - Monitor VRAM allocation and usage
- Temperature Monitoring - Watch GPU thermals
- Inference Metrics - Tokens/sec, queue depth, latency
- Historical Data - View trends over time
Dashboard Sections
GPU Overview Cards
Quick summary at the top showing:
- GPU Utilization - Current compute usage (%)
- Memory Usage - VRAM used / total (e.g., 42GB / 96GB)
- Temperature - Current GPU temperature
- Power Draw - Current power consumption (W)
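If you prefer a quick command-line check, the same four values can be read with nvidia-smi. A minimal sketch, assuming a single GPU at index 0:

```bash
# Utilization, VRAM used/total, temperature, and power draw for GPU 0
nvidia-smi -i 0 \
  --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv,noheader
```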
Utilization Chart
Time-series graph showing GPU usage over time:
- Compute utilization percentage
- Memory utilization percentage
- Hover for exact values at any point
- Zoom and pan for detailed analysis
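To capture the same time series outside the dashboard (for example, to line it up with application logs), nvidia-smi can emit samples at a fixed interval. A small sketch that appends 5-second samples to a CSV file of your choosing:

```bash
# Log compute and memory utilization every 5 seconds
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,utilization.memory \
  --format=csv,noheader \
  -l 5 >> gpu-usage.csv
```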
Memory Breakdown
Detailed VRAM allocation:
| Component | Description |
|---|---|
| Model Weights | Memory used by model parameters |
| KV Cache | Memory for attention key-value cache |
| Activation Memory | Memory for intermediate computations |
| System Reserved | CUDA and driver overhead |
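For a rough sense of how these components add up, weights scale with parameter count and precision, while the KV cache scales with layer count, KV heads, head dimension, bytes per value, context length, and batch size. A back-of-the-envelope sketch; the model shape below is assumed purely for illustration:

```bash
# Illustrative VRAM arithmetic (assumed 70B-parameter model served in FP16)
PARAMS_B=70            # parameters, in billions (assumption)
BYTES_PER_PARAM=2      # FP16
echo "Weights: ~$(( PARAMS_B * BYTES_PER_PARAM )) GB"

# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes
LAYERS=80; KV_HEADS=8; HEAD_DIM=128; KV_BYTES=2   # assumed model shape (FP16 cache)
PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES ))
echo "KV cache: ${PER_TOKEN} bytes/token, ~$(( PER_TOKEN * 4096 / 1024 / 1024 )) MB per 4096-token sequence"
```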
Inference Metrics
Performance statistics for your model:
| Metric | Description | Typical Range |
|---|---|---|
| Tokens/sec | Output generation speed | 30-150 t/s |
| TTFT | Time to first token | 20-200 ms |
| Queue Depth | Pending requests | 0-32 |
| Active Requests | Currently processing | 1-64 |
| Batch Size | Current batch size | 1-128 |
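To sanity-check the tokens/sec figure yourself, you can time a single request against the OpenAI-compatible endpoint and divide the completion tokens by the elapsed time. A minimal sketch, assuming jq and bc are available and YOUR-IP:PORT / YOUR-MODEL are placeholders for your deployment:

```bash
# Time one completion request and compute output tokens per second
START=$(date +%s.%N)
RESP=$(curl -s http://YOUR-IP:PORT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "YOUR-MODEL", "prompt": "Hello, world", "max_tokens": 256}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
echo "scale=1; $TOKENS / ($END - $START)" | bc
```

Note that this measures one request end to end (including TTFT), so it will read lower than the aggregate throughput the dashboard reports under concurrent load.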
Understanding Metrics
GPU Utilization
Percentage of GPU compute being used:
- 0-20% - Underutilized, consider smaller instance
- 20-60% - Light load, room for more requests
- 60-85% - Healthy utilization
- 85-100% - Heavy load, may need more capacity
Memory Usage
VRAM consumption guidelines:
- Model weights - Fixed based on model size
- KV cache - Grows with context length and batch size
- Headroom - Keep 5-10% free for safety
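To check your current headroom from the command line, a quick sketch assuming a single GPU at index 0:

```bash
# Percentage of VRAM in use and remaining headroom in MiB (GPU 0)
nvidia-smi -i 0 --query-gpu=memory.used,memory.total --format=csv,noheader,nounits \
  | awk -F', ' '{ printf "VRAM used: %.0f%% (%d MiB free)\n", $1*100/$2, $2-$1 }'
```

Keep in mind that vLLM pre-allocates most of its VRAM budget at startup, so nvidia-smi typically shows high usage even when the KV cache is lightly loaded; the dashboard's KV cache metrics are the better signal for cache pressure.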
Temperature
GPU thermal status:
- <70°C - Cool, excellent
- 70-80°C - Normal operating range
- 80-85°C - Warm but safe
- >85°C - Hot, may throttle
Time Range Selection
View metrics over different periods:
- Live - Real-time streaming (5s intervals)
- 1 Hour - Recent performance
- 24 Hours - Daily patterns
- 7 Days - Weekly trends
- Custom - Select specific date range
Performance Optimization
Improve Throughput
- Increase `max-num-seqs` for more concurrent requests
- Enable `--enable-chunked-prefill` for better batching
- Use quantized models (AWQ/GPTQ) for higher throughput
- Reduce `max-model-len` if long context is not needed
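These settings are passed when the server is launched. A hedged sketch of a throughput-oriented launch, assuming a recent vLLM release with the `vllm serve` entrypoint; the values are illustrative starting points, not recommendations for every model:

```bash
# Illustrative throughput-oriented launch (adjust values for your model and GPU)
vllm serve YOUR-MODEL \
  --max-num-seqs 128 \
  --enable-chunked-prefill \
  --max-model-len 8192 \
  --quantization awq   # only if YOUR-MODEL is an AWQ checkpoint
```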
Reduce Latency
- Use smaller models for faster responses
- Enable `--enforce-eager` for more predictable latency
- Decrease batch size for lower individual request latency
- Pre-warm the model with a few requests after startup
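The pre-warm step can be as simple as a short loop of small requests once the server is up. A minimal sketch against the OpenAI-compatible endpoint (YOUR-IP:PORT and YOUR-MODEL are placeholders):

```bash
# Send a few short warm-up requests after startup
for i in 1 2 3; do
  curl -s http://YOUR-IP:PORT/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "YOUR-MODEL", "prompt": "warm-up", "max_tokens": 8}' > /dev/null
done
```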
Memory Optimization
- Lower `gpu-memory-utilization` if you see OOM errors
- Reduce `max-model-len` to free KV cache memory
- Use quantized models to reduce memory footprint
- Decrease `max-num-seqs` for less KV cache usage
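The same flags, shown together in a memory-conservative launch. As above, this is a sketch with illustrative values, assuming the `vllm serve` entrypoint:

```bash
# Illustrative memory-conservative launch
vllm serve YOUR-MODEL \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --max-num-seqs 32
```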
Alerting
Set up alerts for important thresholds:
| Alert Type | Threshold | Action |
|---|---|---|
| High Temperature | >85°C | Check cooling, reduce load |
| Memory Full | >95% | Reduce batch size or context |
| Low Utilization | <10% | Consider smaller instance |
| High Queue Depth | >50 | Add capacity or optimize |
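For a simple script-level safety net alongside dashboard alerts, you can poll nvidia-smi and flag threshold breaches. A rough sketch; the thresholds mirror the table above, while the 30-second interval and plain echo notifications are placeholders for whatever alerting you actually use:

```bash
# Poll GPU 0 every 30 seconds and warn on high temperature or near-full VRAM
while true; do
  read -r TEMP USED TOTAL < <(nvidia-smi -i 0 \
    --query-gpu=temperature.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits | tr -d ',')
  MEM_PCT=$(( USED * 100 / TOTAL ))
  [ "$TEMP" -gt 85 ]    && echo "$(date) ALERT: GPU temperature ${TEMP}C exceeds 85C"
  [ "$MEM_PCT" -gt 95 ] && echo "$(date) ALERT: VRAM usage at ${MEM_PCT}%"
  sleep 30
done
```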
CLI Monitoring
SSH into your instance for detailed monitoring:
nvidia-smi
```bash
# One-time snapshot
nvidia-smi

# Continuous monitoring (every 1 second)
nvidia-smi -l 1

# GPU utilization only
nvidia-smi --query-gpu=utilization.gpu --format=csv
```
nvtop
```bash
# Interactive GPU monitor (like htop for GPUs)
nvtop
```
vLLM Metrics
```bash
# Check vLLM logs for performance
tail -f ~/hf-workspace/vllm.log | grep "tokens/s"
```
Prometheus Integration
Export metrics to Prometheus for advanced monitoring:
```bash
# vLLM exposes metrics at /metrics
curl http://YOUR-IP:PORT/metrics

# Example metrics:
vllm:num_requests_running
vllm:num_requests_waiting
vllm:gpu_cache_usage_perc
vllm:avg_generation_throughput_toks_per_s
```
Need Help?
Contact us at support@packet.ai
