
GPU Metrics Dashboard

Real-time GPU monitoring and analytics for optimal performance tracking.

Overview

The GPU Metrics Dashboard provides comprehensive monitoring of your GPU instances. Track utilization, memory usage, temperature, and inference performance in real-time to optimize your AI workloads.

Key Features

  • Real-time Monitoring - Live updates every 5 seconds
  • GPU Utilization - Track compute usage percentage
  • Memory Tracking - Monitor VRAM allocation and usage
  • Temperature Monitoring - Watch GPU thermals
  • Inference Metrics - Tokens/sec, queue depth, latency
  • Historical Data - View trends over time

Dashboard Sections

GPU Overview Cards

Quick summary at the top showing (a matching CLI query follows the list):

  • GPU Utilization - Current compute usage (%)
  • Memory Usage - VRAM used / total (e.g., 42GB / 96GB)
  • Temperature - Current GPU temperature
  • Power Draw - Current power consumption (W)
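The same four values can be read straight from the driver with nvidia-smi; the fields below are standard --query-gpu fields, and the output formatting is up to you.

# Pull the overview-card values directly from the driver
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv,noheader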

Utilization Chart

Time-series graph showing GPU usage over time (a logging command follows the list):

  • Compute utilization percentage
  • Memory utilization percentage
  • Hover for exact values at any point
  • Zoom and pan for detailed analysis
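If you want the raw numbers behind the chart, nvidia-smi can log the same utilization series to a file; the 5-second interval and the gpu_util.csv filename below are just illustrative choices.

# Append a timestamped utilization sample every 5 seconds (Ctrl-C to stop)
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory \
  --format=csv -l 5 >> gpu_util.csv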

Memory Breakdown

Detailed VRAM allocation (a worked sizing example follows the table):

Component           Description
Model Weights       Memory used by model parameters
KV Cache            Memory for attention key-value cache
Activation Memory   Memory for intermediate computations
System Reserved     CUDA and driver overhead
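As a rough sizing sketch, assume a Llama-3.1-8B-class model (32 layers, 8 KV heads, head dimension 128) served in FP16; the exact figures depend on your model and serving configuration.

Model weights (FP16)     ≈ 8B params x 2 bytes                                  ≈ 16 GB
KV cache per token       = 2 (K and V) x 32 layers x 8 KV heads x 128 dims x 2 bytes ≈ 128 KB
One 8K-token request     ≈ 8,192 x 128 KB                                       ≈ 1 GB
32 concurrent requests   ≈ 32 GB of KV cache

Activation memory and the system-reserved overhead from the table add to this total.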

Inference Metrics

Performance statistics for your model (a quick TTFT check follows the table):

Metric            Description               Typical Range
Tokens/sec        Output generation speed   30-150 t/s
TTFT              Time to first token       20-200 ms
Queue Depth       Pending requests          0-32
Active Requests   Currently processing      1-64
Batch Size        Current batch size        1-128
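A quick way to approximate TTFT from outside the dashboard is to time a streaming request with curl: its time_starttransfer variable reports time to the first response byte, which is close to time-to-first-token for a streaming completion. The endpoint, model name, and prompt below are placeholders for your own deployment.

# Approximate TTFT by timing the first byte of a streaming completion
curl -s -o /dev/null -w "time to first byte: %{time_starttransfer}s\n" \
  http://YOUR-IP:PORT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "YOUR-MODEL", "prompt": "Hello", "max_tokens": 32, "stream": true}'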

Understanding Metrics

GPU Utilization

Percentage of GPU compute being used:

  • 0-20% - Underutilized, consider smaller instance
  • 20-60% - Light load, room for more requests
  • 60-85% - Healthy utilization
  • 85-100% - Heavy load, may need more capacity

Memory Usage

VRAM consumption guidelines:

  • Model weights - Fixed based on model size
  • KV cache - Grows with context length and batch size
  • Headroom - Keep 5-10% free for safety

Temperature

GPU thermal status (a throttling check follows the list):

  • <70°C - Cool, excellent
  • 70-80°C - Normal operating range
  • 80-85°C - Warm but safe
  • >85°C - Hot, may throttle
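To confirm whether heat is actually slowing the GPU down, you can ask the driver for its active throttle reasons; the query field below is a standard nvidia-smi field, though support varies by GPU and driver version.

# Check current temperature and whether the GPU is throttling
nvidia-smi --query-gpu=temperature.gpu,clocks_throttle_reasons.active --format=csv,noheader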

Time Range Selection

View metrics over different periods:

  • Live - Real-time streaming (5s intervals)
  • 1 Hour - Recent performance
  • 24 Hours - Daily patterns
  • 7 Days - Weekly trends
  • Custom - Select specific date range

Performance Optimization

Improve Throughput

  • Increase max-num-seqs for more concurrent requests
  • Enable --enable-chunked-prefill for better batching
  • Use quantized models (AWQ/GPTQ) for higher throughput
  • Reduce max-model-len if long context is not needed (see the example launch below)
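A throughput-oriented launch that combines these flags might look like the following; the model name and specific values are illustrative, not recommendations for your hardware.

# Example throughput-oriented vLLM launch (illustrative values)
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --quantization awq \
  --max-num-seqs 128 \
  --enable-chunked-prefill \
  --max-model-len 8192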

Reduce Latency

  • Use smaller models for faster responses
  • Enable --enforce-eager for more predictable latency
  • Decrease batch size for lower individual request latency
  • Pre-warm the model with a few requests after startup (sketch below)
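For latency-sensitive serving, a launch plus a warm-up request might look like this; the model name, port, and values are placeholders to adapt.

# Example latency-oriented launch (illustrative values)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enforce-eager \
  --max-num-seqs 8

# Pre-warm with one short request once the server is up
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "warm up", "max_tokens": 1}' > /dev/null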

Memory Optimization

  • Lower gpu-memory-utilization if seeing OOM errors
  • Reduce max-model-len to free KV cache memory
  • Use quantized models to reduce memory footprint
  • Decrease max-num-seqs for less KV cache usage (example flags below)
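If you are hitting OOM errors, these flags can be tightened together; the numbers below are a conservative starting point to adapt, not a tuned configuration.

# Example memory-conservative flags (illustrative values)
vllm serve YOUR-MODEL \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --max-num-seqs 16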

Alerting

Set up alerts for important thresholds (a simple polling sketch follows the table):

Alert Type         Threshold   Action
High Temperature   >85°C       Check cooling, reduce load
Memory Full        >95%        Reduce batch size or context
Low Utilization    <10%        Consider smaller instance
High Queue Depth   >50         Add capacity or optimize
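If you do not yet have an alerting stack in place, a minimal local sketch is a shell loop that polls nvidia-smi and prints a warning when the temperature threshold from the table is crossed; the 30-second interval and single-GPU assumption are illustrative.

# Minimal local temperature alert loop (sketch; assumes one GPU)
while true; do
  TEMP=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
  if [ "$TEMP" -gt 85 ]; then
    echo "$(date): GPU at ${TEMP}°C - check cooling or reduce load"
  fi
  sleep 30
done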

CLI Monitoring

SSH into your instance for detailed monitoring:

nvidia-smi

# One-time snapshot
nvidia-smi

# Continuous monitoring (every 1 second)
nvidia-smi -l 1

# GPU utilization only
nvidia-smi --query-gpu=utilization.gpu --format=csv

nvtop

# Interactive GPU monitor (like htop for GPUs)
nvtop

vLLM Metrics

# Check vLLM logs for performance
tail -f ~/hf-workspace/vllm.log | grep "tokens/s"

Prometheus Integration

Export metrics to Prometheus for advanced monitoring:

# vLLM exposes metrics at /metrics
curl http://YOUR-IP:PORT/metrics

# Example metrics:
vllm:num_requests_running
vllm:num_requests_waiting
vllm:gpu_cache_usage_perc
vllm:avg_generation_throughput_toks_per_s
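Before wiring these into a full Prometheus setup, you can spot-check the same counters from the shell; the metric names match the list above, and YOUR-IP:PORT is your instance's endpoint.

# Watch queue depth and KV cache usage without a Prometheus server
watch -n 10 'curl -s http://YOUR-IP:PORT/metrics | grep -E "vllm:(num_requests_waiting|gpu_cache_usage_perc)"'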

Need Help?

Contact us at support@packet.ai