HuggingFace Model Deployment
Deploy HuggingFace models with one click. Get an OpenAI-compatible API endpoint in minutes.
Overview
The HuggingFace integration automatically:
- Provisions a GPU instance
- Downloads your selected model
- Starts a vLLM inference server
- Exposes an OpenAI-compatible API endpoint
Why vLLM?
vLLM is a high-throughput open-source LLM inference engine that uses PagedAttention for efficient KV-cache memory management and continuous batching to maximize throughput under concurrent load. Deployed models typically run 2-4x faster than naive serving implementations.
Quick Start
- Click HuggingFace in the sidebar
- Search for a model or browse the catalog
- Select a model
- Choose your GPU configuration
- Click Deploy
Your model will be ready in 5-10 minutes (depending on model size).
Recommended Models
These models are tested and optimized for Packet.ai deployment:
General Purpose
| Model | Size | Min GPUs | Best For |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 8B | 1x RTX 4090 | Fast general-purpose, coding, chat |
| meta-llama/Llama-3.1-70B-Instruct | 70B | 4x RTX 4090 | High-quality reasoning, complex tasks |
| mistralai/Mistral-7B-Instruct-v0.3 | 7B | 1x RTX 4090 | Efficient, fast inference |
| Qwen/Qwen2.5-7B-Instruct | 7B | 1x RTX 4090 | Multilingual, math, coding |
| google/gemma-2-9b-it | 9B | 1x RTX 4090 | Instruction following, creative |
Coding Specialists
| Model | Size | Min GPUs | Best For |
|---|---|---|---|
| Qwen/Qwen2.5-Coder-7B-Instruct | 7B | 1x RTX 4090 | Code generation, completion |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | 16B MoE | 1x RTX 4090 | Advanced code reasoning |
| codellama/CodeLlama-7b-Instruct-hf | 7B | 1x RTX 4090 | Code infilling, completion |
Small & Fast
| Model | Size | Min GPUs | Best For |
|---|---|---|---|
| microsoft/Phi-3.5-mini-instruct | 3.8B | 1x RTX 4090 | Ultra-fast, efficient |
| HuggingFaceH4/zephyr-7b-beta | 7B | 1x RTX 4090 | Chat, assistant |
Deployment Options
When deploying, you can configure:
| Option | Description | Recommendation |
|---|---|---|
| GPU Pool | Select from available GPU types | RTX 4090 for 7B models, A100 for 70B+ |
| GPU Count | Number of GPUs (1-8) | See GPU sizing guide below |
| Persistent Storage | Cache models for faster restarts | Enable for frequently used models |
| HuggingFace Token | Required for gated models | Required for Llama, Gemma, etc. |
Gated Models
Some models on HuggingFace require accepting terms before use. These include Llama, Gemma, and other popular models.
Setup Steps
- Accept License: Visit the model page on HuggingFace and click "Agree and access repository"
- Create Access Token: Go to huggingface.co/settings/tokens
  - Click "New token"
  - Select "Read" access
  - Copy the token (starts with `hf_`)
- Enter Token: Paste your token when deploying the model (you can verify the token first with the sketch below)
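If you want to confirm that your token can access a gated model before deploying, a minimal sketch using the huggingface_hub library is shown below. The token value is a placeholder, and the error classes are those exported by recent versions of huggingface_hub; treat the exact messages as assumptions rather than guaranteed behavior.
```python
# pip install huggingface_hub
from huggingface_hub import model_info
from huggingface_hub.utils import GatedRepoError, HfHubHTTPError

HF_TOKEN = "hf_..."  # placeholder: your read-access token
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

try:
    # Fetching model metadata requires the same access rights as downloading it
    model_info(MODEL_ID, token=HF_TOKEN)
    print(f"Token OK - you have access to {MODEL_ID}")
except GatedRepoError:
    print("Token is valid, but you have not accepted this model's license on HuggingFace yet.")
except HfHubHTTPError as err:
    print(f"Could not access {MODEL_ID}: {err}")
```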
Token Security
Your HuggingFace token is only used during model download and is never stored permanently. It's transmitted securely and deleted after the model is loaded.
Using Your Deployed Model
Once deployed, you'll receive an API endpoint like:
`http://35.190.160.152:20000/v1`
cURL
```bash
curl http://YOUR-IP:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100
  }'
```
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR-IP:PORT/v1",
    api_key="not-needed"  # No auth required for direct endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)
```
Streaming Responses
```python
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    max_tokens=500
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
JavaScript/TypeScript
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://YOUR-IP:PORT/v1',
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' }
  ],
  max_tokens: 100,  // the OpenAI SDK uses snake_case for request parameters
});

console.log(response.choices[0].message.content);
```
API Endpoints
Your vLLM server exposes these endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions (recommended) |
| `/v1/completions` | POST | Text completions (legacy) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check |
| `/version` | GET | vLLM version info |
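Because the server is OpenAI-compatible, you can also query `/v1/models` with the same SDK you use for completions. A minimal sketch (the endpoint is a placeholder, as above):
```python
from openai import OpenAI

client = OpenAI(base_url="http://YOUR-IP:PORT/v1", api_key="not-needed")

# /v1/models returns whatever the vLLM server currently has loaded
for model in client.models.list().data:
    print(model.id)  # e.g. meta-llama/Llama-3.1-8B-Instruct
```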
Deployment Status
Your deployment goes through these stages:
| Status | Description | Duration |
|---|---|---|
| Pending | GPU being provisioned | ~30 seconds |
| Deploying | Instance starting | ~1 minute |
| Installing | Dependencies being installed | ~2 minutes |
| Starting | vLLM starting, model downloading/loading | 2-10 minutes |
| Running | Ready to accept requests | - |
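If you want to script against a new deployment, one option is to poll the `/health` endpoint until the server starts answering. The sketch below uses only the standard library; the endpoint URL is a placeholder, and the 15-minute ceiling is an assumption based on the 5-10 minute deployment window above.
```python
import time
import urllib.request

ENDPOINT = "http://YOUR-IP:PORT"  # placeholder: your deployment's endpoint

def wait_until_ready(timeout_s: int = 900, interval_s: int = 15) -> bool:
    """Poll /health until the vLLM server responds, or give up after timeout_s."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{ENDPOINT}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # still provisioning, installing, or loading the model
        time.sleep(interval_s)
    return False

if wait_until_ready():
    print("Deployment is running and ready for requests.")
else:
    print("Timed out - check the deployment logs.")
```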
GPU Sizing Guide
Model size determines GPU requirements. Use this guide to choose the right configuration:
| Model Size | GPU Memory Needed | Recommended Config |
|---|---|---|
| 1-7B | ~16GB | 1x RTX 4090 (24GB) |
| 7-15B | ~20-32GB | 1-2x RTX 4090 or 1x A100 40GB |
| 30-34B | ~40-70GB | 2x A100 40GB or 4x RTX 4090 |
| 65-70B | ~140GB | 4x A100 40GB or 8x RTX 4090 |
| 70B+ Quantized | ~40-70GB | 2x A100 or 4x RTX 4090 with AWQ/GPTQ |
Memory Calculation
Rule of thumb: Each billion parameters needs ~2GB in FP16. A 7B model needs ~14GB, plus overhead for KV cache. Start with the minimum and scale up if you hit out-of-memory errors.
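As a quick sketch of that rule of thumb (the ~2GB per billion parameters figure covers FP16 weights only; the 20% overhead for KV cache and activations is an assumption, not an exact number):
```python
def estimate_gpu_memory_gb(params_billion: float, overhead_fraction: float = 0.2) -> float:
    """Rough FP16 estimate: ~2GB per billion parameters, plus overhead for
    the KV cache and activations (overhead_fraction is an assumption)."""
    weights_gb = params_billion * 2.0
    return weights_gb * (1 + overhead_fraction)

# 7B model: ~14GB of weights, ~17GB with overhead -> fits on 1x RTX 4090 (24GB)
print(f"{estimate_gpu_memory_gb(7):.1f} GB")
# 70B model: ~140GB of weights -> needs multiple GPUs (e.g. 4x A100 40GB)
print(f"{estimate_gpu_memory_gb(70):.1f} GB")
```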
Monitoring
View Deployment Logs
Click the Logs button on your deployment card to view real-time logs. You can also expand to full screen for detailed debugging.
Check Server Status
```bash
# Health check
curl http://YOUR-IP:PORT/health

# List loaded models
curl http://YOUR-IP:PORT/v1/models

# Check vLLM version
curl http://YOUR-IP:PORT/version
```
SSH Access for Debugging
For detailed debugging, SSH into your instance:
```bash
# Connect to instance
ssh -p <port> ubuntu@<host>

# View vLLM logs
tail -f ~/hf-workspace/vllm.log

# Check GPU utilization
nvidia-smi

# Watch GPU in real-time
watch -n 1 nvidia-smi
```
Using Persistent Storage
Enable persistent storage to:
- Cache downloaded models - Faster restarts (minutes → seconds)
- Save conversation logs - Keep inference logs
- Store fine-tuned adapters - Use custom LoRA adapters
With persistent storage, model downloads are cached and subsequent starts load directly from storage.
Troubleshooting
Model Not Loading
Check the deployment logs for errors. Common issues:
| Error | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Model too large for GPU(s) | Use more GPUs or a smaller model |
| `401 Unauthorized` | Gated model, no token | Accept terms and provide HF token |
| `Model not found` | Invalid model ID | Check spelling on HuggingFace |
| Connection timeout | Still downloading | Wait for model download to complete |
API Not Responding
- Check if deployment status is "Running"
- Verify the port is exposed (check the endpoint URL)
- Wait for model loading to complete (check logs)
- Try the health endpoint: `curl http://YOUR-IP:PORT/health`
Slow Responses
- First request slow: Model loading into GPU memory; subsequent requests are faster
- Consistently slow: Check GPU utilization with `nvidia-smi`
- High latency: Consider a smaller model or more GPUs
- Timeouts: Reduce `max_tokens` or add streaming (see the sketch below)
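If your client times out on long generations, you can raise the request timeout and stream tokens as they arrive. A minimal sketch with the OpenAI Python SDK; the 120-second timeout is an assumption to tune for your model and prompt length.
```python
from openai import OpenAI

# Longer per-request timeout for large generations (value is an assumption)
client = OpenAI(
    base_url="http://YOUR-IP:PORT/v1",
    api_key="not-needed",
    timeout=120.0,
)

# Streaming keeps the connection active by returning tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize the history of GPUs"}],
    max_tokens=300,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```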
Out of Memory
- Increase GPU count (Scale feature)
- Use a quantized model version (AWQ, GPTQ)
- Reduce `max_model_len` in the vLLM config
- Try a smaller model variant
Need Help?
Contact us at support@packet.ai
