GPU Metrics Dashboard
Real-time GPU monitoring and analytics for optimal performance tracking.
Overview
The GPU Metrics Dashboard provides comprehensive monitoring of your GPU instances. Track utilization, memory usage, temperature, and inference performance in real-time to optimize your AI workloads.
Key Features
- Real-time Monitoring - Live updates every 5 seconds
- GPU Utilization - Track compute usage percentage
- Memory Tracking - Monitor VRAM allocation and usage
- Temperature Monitoring - Watch GPU thermals
- Inference Metrics - Tokens/sec, queue depth, latency
- Historical Data - View trends over time
Dashboard Sections
GPU Overview Cards
Quick summary at the top showing:
- GPU Utilization - Current compute usage (%)
- Memory Usage - VRAM used / total (e.g., 42GB / 96GB)
- Temperature - Current GPU temperature
- Power Draw - Current power consumption (W)
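If you prefer a quick command-line check, the same four values can be read with nvidia-smi. A minimal sketch, assuming a single GPU at index 0:

```bash
# Utilization, VRAM used/total, temperature, and power draw for GPU 0
nvidia-smi -i 0 \
  --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv,noheader
```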
Utilization Chart
Time-series graph showing GPU usage over time:
- Compute utilization percentage
- Memory utilization percentage
- Hover for exact values at any point
- Zoom and pan for detailed analysis
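To capture the same time series outside the dashboard (for example, to line it up with application logs), nvidia-smi can emit samples at a fixed interval. A small sketch that appends 5-second samples to a CSV file of your choosing:

```bash
# Log compute and memory utilization every 5 seconds
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,utilization.memory \
  --format=csv,noheader \
  -l 5 >> gpu-usage.csv
```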
Memory Breakdown
Detailed VRAM allocation:
| Component | Description |
|---|---|
| Model Weights | Memory used by model parameters |
| KV Cache | Memory for attention key-value cache |
| Activation Memory | Memory for intermediate computations |
| System Reserved | CUDA and driver overhead |
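For a rough sense of how these components add up, weights scale with parameter count and precision, while the KV cache scales with layer count, KV heads, head dimension, bytes per value, context length, and batch size. A back-of-the-envelope sketch; the model shape below is assumed purely for illustration:

```bash
# Illustrative VRAM arithmetic (assumed 70B-parameter model served in FP16)
PARAMS_B=70            # parameters, in billions (assumption)
BYTES_PER_PARAM=2      # FP16
echo "Weights: ~$(( PARAMS_B * BYTES_PER_PARAM )) GB"

# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes
LAYERS=80; KV_HEADS=8; HEAD_DIM=128; KV_BYTES=2   # assumed model shape (FP16 cache)
PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES ))
echo "KV cache: ${PER_TOKEN} bytes/token, ~$(( PER_TOKEN * 4096 / 1024 / 1024 )) MB per 4096-token sequence"
```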
Inference Metrics
Performance statistics for your model:
| Metric | Description | Typical Range |
|---|---|---|
| Tokens/sec | Output generation speed | 30-150 t/s |
| TTFT | Time to first token | 20-200 ms |
| Queue Depth | Pending requests | 0-32 |
| Active Requests | Currently processing | 1-64 |
| Batch Size | Current batch size | 1-128 |
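To sanity-check the tokens/sec figure yourself, you can time a single request against the OpenAI-compatible endpoint and divide the completion tokens by the elapsed time. A minimal sketch, assuming jq and bc are available and YOUR-IP:PORT / YOUR-MODEL are placeholders for your deployment:

```bash
# Time one completion request and compute output tokens per second
START=$(date +%s.%N)
RESP=$(curl -s http://YOUR-IP:PORT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "YOUR-MODEL", "prompt": "Hello, world", "max_tokens": 256}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
echo "scale=1; $TOKENS / ($END - $START)" | bc
```

Note that this measures one request end to end (including TTFT), so it will read lower than the aggregate throughput the dashboard reports under concurrent load.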
Understanding Metrics
GPU Utilization
Percentage of GPU compute being used:
- 0-20% - Underutilized, consider smaller instance
- 20-60% - Light load, room for more requests
- 60-85% - Healthy utilization
- 85-100% - Heavy load, may need more capacity
Memory Usage
VRAM consumption guidelines:
- Model weights - Fixed based on model size
- KV cache - Grows with context length and batch size
- Headroom - Keep 5-10% free for safety
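To check your current headroom from the command line, a quick sketch assuming a single GPU at index 0:

```bash
# Percentage of VRAM in use and remaining headroom in MiB (GPU 0)
nvidia-smi -i 0 --query-gpu=memory.used,memory.total --format=csv,noheader,nounits \
  | awk -F', ' '{ printf "VRAM used: %.0f%% (%d MiB free)\n", $1*100/$2, $2-$1 }'
```

Keep in mind that vLLM pre-allocates most of its VRAM budget at startup, so nvidia-smi typically shows high usage even when the KV cache is lightly loaded; the dashboard's KV cache metrics are the better signal for cache pressure.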
Temperature
GPU thermal status:
- <70°C - Cool, excellent
- 70-80°C - Normal operating range
- 80-85°C - Warm but safe
- >85°C - Hot, may throttle
Time Range Selection
View metrics over different periods:
- Live - Real-time streaming (5s intervals)
- 1 Hour - Recent performance
- 24 Hours - Daily patterns
- 7 Days - Weekly trends
- Custom - Select specific date range
Performance Optimization
Improve Throughput
- Increase `max-num-seqs` for more concurrent requests
- Enable `--enable-chunked-prefill` for better batching
- Use quantized models (AWQ/GPTQ) for higher throughput
- Reduce `max-model-len` if long context is not needed
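These settings are passed when the server is launched. A hedged sketch of a throughput-oriented launch, assuming a recent vLLM release with the `vllm serve` entrypoint; the values are illustrative starting points, not recommendations for every model:

```bash
# Illustrative throughput-oriented launch (adjust values for your model and GPU)
vllm serve YOUR-MODEL \
  --max-num-seqs 128 \
  --enable-chunked-prefill \
  --max-model-len 8192 \
  --quantization awq   # only if YOUR-MODEL is an AWQ checkpoint
```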
Reduce Latency
- Use smaller models for faster responses
- Enable `--enforce-eager` for more predictable latency
- Decrease batch size for lower individual request latency
- Pre-warm the model with a few requests after startup
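The pre-warm step can be as simple as a short loop of small requests once the server is up. A minimal sketch against the OpenAI-compatible endpoint (YOUR-IP:PORT and YOUR-MODEL are placeholders):

```bash
# Send a few short warm-up requests after startup
for i in 1 2 3; do
  curl -s http://YOUR-IP:PORT/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "YOUR-MODEL", "prompt": "warm-up", "max_tokens": 8}' > /dev/null
done
```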
Memory Optimization
- Lower `gpu-memory-utilization` if you see OOM errors
- Reduce `max-model-len` to free KV cache memory
- Use quantized models to reduce memory footprint
- Decrease `max-num-seqs` for less KV cache usage
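The same flags, shown together in a memory-conservative launch. As above, this is a sketch with illustrative values, assuming the `vllm serve` entrypoint:

```bash
# Illustrative memory-conservative launch
vllm serve YOUR-MODEL \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --max-num-seqs 32
```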
Alerting
Set up alerts for important thresholds:
| Alert Type | Threshold | Action |
|---|---|---|
| High Temperature | >85°C | Check cooling, reduce load |
| Memory Full | >95% | Reduce batch size or context |
| Low Utilization | <10% | Consider smaller instance |
| High Queue Depth | >50 | Add capacity or optimize |
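For a simple script-level safety net alongside dashboard alerts, you can poll nvidia-smi and flag threshold breaches. A rough sketch; the thresholds mirror the table above, while the 30-second interval and plain echo notifications are placeholders for whatever alerting you actually use:

```bash
# Poll GPU 0 every 30 seconds and warn on high temperature or near-full VRAM
while true; do
  read -r TEMP USED TOTAL < <(nvidia-smi -i 0 \
    --query-gpu=temperature.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits | tr -d ',')
  MEM_PCT=$(( USED * 100 / TOTAL ))
  [ "$TEMP" -gt 85 ]    && echo "$(date) ALERT: GPU temperature ${TEMP}C exceeds 85C"
  [ "$MEM_PCT" -gt 95 ] && echo "$(date) ALERT: VRAM usage at ${MEM_PCT}%"
  sleep 30
done
```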
CLI Monitoring
SSH into your instance for detailed monitoring:
nvidia-smi
```bash
# One-time snapshot
nvidia-smi

# Continuous monitoring (every 1 second)
nvidia-smi -l 1

# GPU utilization only
nvidia-smi --query-gpu=utilization.gpu --format=csv
```
nvtop
```bash
# Interactive GPU monitor (like htop for GPUs)
nvtop
```
vLLM Metrics
```bash
# Check vLLM logs for performance
tail -f ~/hf-workspace/vllm.log | grep "tokens/s"
```
Prometheus Integration
Export metrics to Prometheus for advanced monitoring:
```bash
# vLLM exposes metrics at /metrics
curl http://YOUR-IP:PORT/metrics

# Example metrics:
vllm:num_requests_running
vllm:num_requests_waiting
vllm:gpu_cache_usage_perc
vllm:avg_generation_throughput_toks_per_s
```
Need Help?
Contact us at support@packet.ai
