HuggingFace Model Deployment
Deploy HuggingFace models with one click. Get an OpenAI-compatible API endpoint in minutes.
Overview
The HuggingFace integration automatically:
- Provisions a GPU instance
- Downloads your selected model
- Starts a vLLM inference server
- Exposes an OpenAI-compatible API endpoint
Why vLLM?
vLLM is a high-throughput open-source LLM inference engine that uses PagedAttention for efficient KV-cache memory management and continuous batching to maximize throughput under concurrent load. Deployed models typically run 2-4x faster than naive serving implementations.
Quick Start
- Click HuggingFace in the sidebar
- Search for a model or browse the catalog
- Select a model
- Choose your GPU configuration
- Click Deploy
Your model will be ready in 5-10 minutes (depending on model size).
Recommended Models
These models are tested and optimized for Packet.ai deployment:
General Purpose
| Model | Size | Min GPUs | Best For |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 8B | 1x RTX 4090 | Fast general-purpose, coding, chat |
| meta-llama/Llama-3.1-70B-Instruct | 70B | 4x RTX 4090 | High-quality reasoning, complex tasks |
| mistralai/Mistral-7B-Instruct-v0.3 | 7B | 1x RTX 4090 | Efficient, fast inference |
| Qwen/Qwen2.5-7B-Instruct | 7B | 1x RTX 4090 | Multilingual, math, coding |
| google/gemma-2-9b-it | 9B | 1x RTX 4090 | Instruction following, creative |
Coding Specialists
| Model | Size | Min GPUs | Best For |
|---|---|---|---|
| Qwen/Qwen2.5-Coder-7B-Instruct | 7B | 1x RTX 4090 | Code generation, completion |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | 16B MoE | 1x RTX 4090 | Advanced code reasoning |
| codellama/CodeLlama-7b-Instruct-hf | 7B | 1x RTX 4090 | Code infilling, completion |
Small & Fast
| Model | Size | Min GPUs | Best For |
|---|---|---|---|
| microsoft/Phi-3.5-mini-instruct | 3.8B | 1x RTX 4090 | Ultra-fast, efficient |
| HuggingFaceH4/zephyr-7b-beta | 7B | 1x RTX 4090 | Chat, assistant |
Deployment Options
When deploying, you can configure:
| Option | Description | Recommendation |
|---|---|---|
| GPU Pool | Select from available GPU types | RTX 4090 for 7B models, A100 for 70B+ |
| GPU Count | Number of GPUs (1-8) | See GPU sizing guide below |
| Persistent Storage | Cache models for faster restarts | Enable for frequently used models |
| HuggingFace Token | Required for gated models | Required for Llama, Gemma, etc. |
Gated Models
Some models on HuggingFace require accepting terms before use. These include Llama, Gemma, and other popular models.
Setup Steps
- Accept License: Visit the model page on HuggingFace and click "Agree and access repository"
- Create Access Token: Go to huggingface.co/settings/tokens
  - Click "New token"
  - Select "Read" access
  - Copy the token (starts with `hf_`)
- Enter Token: Paste your token when deploying the model (you can verify the token first with the sketch below)
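If you want to confirm that your token can access a gated model before deploying, a minimal sketch using the huggingface_hub library is shown below. The token value is a placeholder, and the error classes are those exported by recent versions of huggingface_hub; treat the exact messages as assumptions rather than guaranteed behavior.
```python
# pip install huggingface_hub
from huggingface_hub import model_info
from huggingface_hub.utils import GatedRepoError, HfHubHTTPError

HF_TOKEN = "hf_..."  # placeholder: your read-access token
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

try:
    # Fetching model metadata requires the same access rights as downloading it
    model_info(MODEL_ID, token=HF_TOKEN)
    print(f"Token OK - you have access to {MODEL_ID}")
except GatedRepoError:
    print("Token is valid, but you have not accepted this model's license on HuggingFace yet.")
except HfHubHTTPError as err:
    print(f"Could not access {MODEL_ID}: {err}")
```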
Token Security
Your HuggingFace token is only used during model download and is never stored permanently. It's transmitted securely and deleted after the model is loaded.
Using Your Deployed Model
Once deployed, you'll receive an API endpoint like:
`http://35.190.160.152:20000/v1`
cURL
```bash
curl http://YOUR-IP:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100
  }'
```
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR-IP:PORT/v1",
    api_key="not-needed"  # No auth required for direct endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)
```
Streaming Responses
```python
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    max_tokens=500
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
JavaScript/TypeScript
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://YOUR-IP:PORT/v1',
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' }
  ],
  max_tokens: 100,  // the OpenAI SDK uses snake_case for request parameters
});

console.log(response.choices[0].message.content);
```
API Endpoints
Your vLLM server exposes these endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions (recommended) |
| `/v1/completions` | POST | Text completions (legacy) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check |
| `/version` | GET | vLLM version info |
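Because the server is OpenAI-compatible, you can also query `/v1/models` with the same SDK you use for completions. A minimal sketch (the endpoint is a placeholder, as above):
```python
from openai import OpenAI

client = OpenAI(base_url="http://YOUR-IP:PORT/v1", api_key="not-needed")

# /v1/models returns whatever the vLLM server currently has loaded
for model in client.models.list().data:
    print(model.id)  # e.g. meta-llama/Llama-3.1-8B-Instruct
```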
Deployment Status
Your deployment goes through these stages:
| Status | Description | Duration |
|---|---|---|
| Pending | GPU being provisioned | ~30 seconds |
| Deploying | Instance starting | ~1 minute |
| Installing | Dependencies being installed | ~2 minutes |
| Starting | vLLM starting, model downloading/loading | 2-10 minutes |
| Running | Ready to accept requests | - |
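If you want to script against a new deployment, one option is to poll the `/health` endpoint until the server starts answering. The sketch below uses only the standard library; the endpoint URL is a placeholder, and the 15-minute ceiling is an assumption based on the 5-10 minute deployment window above.
```python
import time
import urllib.request

ENDPOINT = "http://YOUR-IP:PORT"  # placeholder: your deployment's endpoint

def wait_until_ready(timeout_s: int = 900, interval_s: int = 15) -> bool:
    """Poll /health until the vLLM server responds, or give up after timeout_s."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{ENDPOINT}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # still provisioning, installing, or loading the model
        time.sleep(interval_s)
    return False

if wait_until_ready():
    print("Deployment is running and ready for requests.")
else:
    print("Timed out - check the deployment logs.")
```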
GPU Sizing Guide
Model size determines GPU requirements. Use this guide to choose the right configuration:
| Model Size | GPU Memory Needed | Recommended Config |
|---|---|---|
| 1-7B | ~16GB | 1x RTX 4090 (24GB) |
| 7-15B | ~20-32GB | 1-2x RTX 4090 or 1x A100 40GB |
| 30-34B | ~40-70GB | 2x A100 40GB or 4x RTX 4090 |
| 65-70B | ~140GB | 4x A100 40GB or 8x RTX 4090 |
| 70B+ Quantized | ~40-70GB | 2x A100 or 4x RTX 4090 with AWQ/GPTQ |
Memory Calculation
Rule of thumb: Each billion parameters needs ~2GB in FP16. A 7B model needs ~14GB, plus overhead for KV cache. Start with the minimum and scale up if you hit out-of-memory errors.
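As a quick sketch of that rule of thumb (the ~2GB per billion parameters figure covers FP16 weights only; the 20% overhead for KV cache and activations is an assumption, not an exact number):
```python
def estimate_gpu_memory_gb(params_billion: float, overhead_fraction: float = 0.2) -> float:
    """Rough FP16 estimate: ~2GB per billion parameters, plus overhead for
    the KV cache and activations (overhead_fraction is an assumption)."""
    weights_gb = params_billion * 2.0
    return weights_gb * (1 + overhead_fraction)

# 7B model: ~14GB of weights, ~17GB with overhead -> fits on 1x RTX 4090 (24GB)
print(f"{estimate_gpu_memory_gb(7):.1f} GB")
# 70B model: ~140GB of weights -> needs multiple GPUs (e.g. 4x A100 40GB)
print(f"{estimate_gpu_memory_gb(70):.1f} GB")
```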
Monitoring
View Deployment Logs
Click the Logs button on your deployment card to view real-time logs. You can also expand to full screen for detailed debugging.
Check Server Status
```bash
# Health check
curl http://YOUR-IP:PORT/health

# List loaded models
curl http://YOUR-IP:PORT/v1/models

# Check vLLM version
curl http://YOUR-IP:PORT/version
```
SSH Access for Debugging
For detailed debugging, SSH into your instance:
```bash
# Connect to instance
ssh -p <port> ubuntu@<host>

# View vLLM logs
tail -f ~/hf-workspace/vllm.log

# Check GPU utilization
nvidia-smi

# Watch GPU in real-time
watch -n 1 nvidia-smi
```
Using Persistent Storage
Enable persistent storage to:
- Cache downloaded models - Faster restarts (minutes → seconds)
- Save conversation logs - Keep inference logs
- Store fine-tuned adapters - Use custom LoRA adapters
With persistent storage, model downloads are cached and subsequent starts load directly from storage.
Troubleshooting
Model Not Loading
Check the deployment logs for errors. Common issues:
| Error | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Model too large for GPU(s) | Use more GPUs or a smaller model |
| `401 Unauthorized` | Gated model, no token | Accept terms and provide HF token |
| `Model not found` | Invalid model ID | Check spelling on HuggingFace |
| Connection timeout | Still downloading | Wait for model download to complete |
API Not Responding
- Check if deployment status is "Running"
- Verify the port is exposed (check the endpoint URL)
- Wait for model loading to complete (check logs)
- Try the health endpoint: `curl http://YOUR-IP:PORT/health`
Slow Responses
- First request slow: Model loading into GPU memory; subsequent requests are faster
- Consistently slow: Check GPU utilization with `nvidia-smi`
- High latency: Consider a smaller model or more GPUs
- Timeouts: Reduce `max_tokens` or add streaming (see the sketch below)
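If your client times out on long generations, you can raise the request timeout and stream tokens as they arrive. A minimal sketch with the OpenAI Python SDK; the 120-second timeout is an assumption to tune for your model and prompt length.
```python
from openai import OpenAI

# Longer per-request timeout for large generations (value is an assumption)
client = OpenAI(
    base_url="http://YOUR-IP:PORT/v1",
    api_key="not-needed",
    timeout=120.0,
)

# Streaming keeps the connection active by returning tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize the history of GPUs"}],
    max_tokens=300,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```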
Out of Memory
- Increase GPU count (Scale feature)
- Use a quantized model version (AWQ, GPTQ)
- Reduce `max_model_len` in the vLLM config
- Try a smaller model variant
Need Help?
Contact us at support@packet.ai
