HuggingFace Model Deployment

Deploy HuggingFace models with one click. Get an OpenAI-compatible API endpoint in minutes.

Overview

The HuggingFace integration automatically:

  1. Provisions a GPU instance
  2. Downloads your selected model
  3. Starts a vLLM inference server
  4. Exposes an OpenAI-compatible API endpoint

Why vLLM?

vLLM is a high-throughput open-source LLM inference engine. It uses PagedAttention for efficient KV-cache memory management and continuous batching to keep the GPU busy across concurrent requests, so deployed models typically run 2-4x faster than naive serving implementations.

Quick Start

  1. Click HuggingFace in the sidebar
  2. Search for a model or browse the catalog
  3. Select a model
  4. Choose your GPU configuration
  5. Click Deploy

Your model will be ready in 5-10 minutes (depending on model size).

Recommended Models

These models are tested and optimized for Packet.ai deployment:

General Purpose

Model | Size | Min GPUs | Best For
meta-llama/Llama-3.1-8B-Instruct | 8B | 1x RTX 4090 | Fast general-purpose, coding, chat
meta-llama/Llama-3.1-70B-Instruct | 70B | 4x RTX 4090 | High-quality reasoning, complex tasks
mistralai/Mistral-7B-Instruct-v0.3 | 7B | 1x RTX 4090 | Efficient, fast inference
Qwen/Qwen2.5-7B-Instruct | 7B | 1x RTX 4090 | Multilingual, math, coding
google/gemma-2-9b-it | 9B | 1x RTX 4090 | Instruction following, creative

Coding Specialists

Model | Size | Min GPUs | Best For
Qwen/Qwen2.5-Coder-7B-Instruct | 7B | 1x RTX 4090 | Code generation, completion
deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | 16B MoE | 1x RTX 4090 | Advanced code reasoning
codellama/CodeLlama-7b-Instruct-hf | 7B | 1x RTX 4090 | Code infilling, completion

Small & Fast

Model | Size | Min GPUs | Best For
microsoft/Phi-3.5-mini-instruct | 3.8B | 1x RTX 4090 | Ultra-fast, efficient
HuggingFaceH4/zephyr-7b-beta | 7B | 1x RTX 4090 | Chat, assistant

Deployment Options

When deploying, you can configure:

Option | Description | Recommendation
GPU Pool | Select from available GPU types | RTX 4090 for 7B models, A100 for 70B+
GPU Count | Number of GPUs (1-8) | See GPU sizing guide below
Persistent Storage | Cache models for faster restarts | Enable for frequently used models
HuggingFace Token | Access token used to download gated models | Required for Llama, Gemma, etc.

Gated Models

Some models on HuggingFace require accepting terms before use. These include Llama, Gemma, and other popular models.

Setup Steps

  1. Accept License: Visit the model page on HuggingFace and click "Agree and access repository"
  2. Create Access Token: Go to huggingface.co/settings/tokens
    • Click "New token"
    • Select "Read" access
    • Copy the token (starts with hf_)
  3. Enter Token: Paste your token when deploying the model
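
Before deploying, you can optionally confirm that your token has access to a gated repository. Below is a minimal sketch using the huggingface_hub Python library (the repo ID is just an example; install with pip install huggingface_hub):

# Sketch: check that a HuggingFace read token can access a gated repo
# before deploying. The repo ID below is an example, not a requirement.
from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

token = "hf_..."  # your read token
repo_id = "meta-llama/Llama-3.1-8B-Instruct"  # example gated model

api = HfApi(token=token)
try:
    info = api.model_info(repo_id)
    print(f"Token can access {info.id}")
except GatedRepoError:
    print("No access yet - accept the license on the model page first.")
except RepositoryNotFoundError:
    print("Repository not found - check the model ID spelling.")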

Token Security

Your HuggingFace token is only used during model download and is never stored permanently. It's transmitted securely and deleted after the model is loaded.

Using Your Deployed Model

Once deployed, you'll receive an API endpoint like:

http://35.190.160.152:20000/v1

cURL

curl http://YOUR-IP:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR-IP:PORT/v1",
    api_key="not-needed"  # No auth required for direct endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)

Streaming Responses

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    max_tokens=500
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

JavaScript/TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://YOUR-IP:PORT/v1',
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' }
  ],
  max_tokens: 100,
});

console.log(response.choices[0].message.content);

API Endpoints

Your vLLM server exposes these endpoints:

Endpoint | Method | Description
/v1/chat/completions | POST | Chat completions (recommended)
/v1/completions | POST | Text completions (legacy)
/v1/models | GET | List loaded models
/health | GET | Health check
/version | GET | vLLM version info
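
For example, /v1/models tells you the exact model ID to pass in completion requests. A small Python sketch using the requests library (substitute your own endpoint):

# Sketch: ask the vLLM server which model(s) it has loaded, so requests
# use the exact model ID the server expects.
import requests

BASE_URL = "http://YOUR-IP:PORT/v1"  # replace with your deployment endpoint

resp = requests.get(f"{BASE_URL}/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # e.g. meta-llama/Llama-3.1-8B-Instruct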

Deployment Status

Your deployment goes through these stages:

Status | Description | Duration
Pending | GPU being provisioned | ~30 seconds
Deploying | Instance starting | ~1 minute
Installing | Dependencies being installed | ~2 minutes
Starting | vLLM starting, model downloading/loading | 2-10 minutes
Running | Ready to accept requests | -
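
If you script against a new deployment, you can wait for it to become ready by polling the health endpoint. A minimal sketch using the requests library (the timeout and interval values are arbitrary choices):

# Sketch: poll the vLLM /health endpoint until the server accepts requests,
# e.g. after a fresh deployment moves through the "Starting" stage.
import time
import requests

HEALTH_URL = "http://YOUR-IP:PORT/health"  # replace with your deployment endpoint

def wait_until_ready(timeout_s=900, interval_s=15):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(HEALTH_URL, timeout=5).status_code == 200:
                print("Server is ready.")
                return True
        except requests.RequestException:
            pass  # still provisioning or downloading the model
        time.sleep(interval_s)
    print("Timed out waiting for the server.")
    return False

wait_until_ready()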

GPU Sizing Guide

Model size determines GPU requirements. Use this guide to choose the right configuration:

Model Size | GPU Memory Needed | Recommended Config
1-7B | ~16GB | 1x RTX 4090 (24GB)
7-15B | ~20-32GB | 1-2x RTX 4090 or 1x A100 40GB
30-34B | ~40-70GB | 2x A100 40GB or 4x RTX 4090
65-70B | ~140GB | 4x A100 40GB or 8x RTX 4090
70B+ Quantized | ~40-70GB | 2x A100 or 4x RTX 4090 with AWQ/GPTQ

Memory Calculation

Rule of thumb: Each billion parameters needs ~2GB in FP16. A 7B model needs ~14GB, plus overhead for KV cache. Start with the minimum and scale up if you hit out-of-memory errors.
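The rule of thumb translates into a quick back-of-the-envelope check. A minimal Python sketch (the 20% KV-cache headroom is an assumed figure, not a measured one):

# Sketch: rough GPU memory estimate for serving a model in FP16.
# The KV-cache headroom factor is an assumption; real usage depends on
# context length and how many requests are batched concurrently.
def estimate_vram_gb(params_billion: float, kv_headroom: float = 0.2) -> float:
    weights_gb = params_billion * 2  # FP16 = 2 bytes per parameter
    return weights_gb * (1 + kv_headroom)

for size_b in (3.8, 7, 70):
    print(f"{size_b}B params -> roughly {estimate_vram_gb(size_b):.0f} GB")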

Monitoring

View Deployment Logs

Click the Logs button on your deployment card to view real-time logs. You can also expand to full screen for detailed debugging.

Check Server Status

# Health check
curl http://YOUR-IP:PORT/health

# List loaded models
curl http://YOUR-IP:PORT/v1/models

# Check vLLM version
curl http://YOUR-IP:PORT/version

SSH Access for Debugging

For detailed debugging, SSH into your instance:

# Connect to instance
ssh -p <port> ubuntu@<host>

# View vLLM logs
tail -f ~/hf-workspace/vllm.log

# Check GPU utilization
nvidia-smi

# Watch GPU in real-time
watch -n 1 nvidia-smi

Using Persistent Storage

Enable persistent storage to:

  • Cache downloaded models - Faster restarts (minutes → seconds)
  • Save conversation logs - Keep inference logs
  • Store fine-tuned adapters - Use custom LoRA adapters

With persistent storage, model downloads are cached and subsequent starts load directly from storage.

Troubleshooting

Model Not Loading

Check the deployment logs for errors. Common issues:

Error | Cause | Solution
CUDA out of memory | Model too large for GPU(s) | Use more GPUs or a smaller model
401 Unauthorized | Gated model, no token | Accept terms and provide HF token
Model not found | Invalid model ID | Check spelling on HuggingFace
Connection timeout | Still downloading | Wait for model download to complete

API Not Responding

  1. Check if deployment status is "Running"
  2. Verify the port is exposed (check the endpoint URL)
  3. Wait for model loading to complete (check logs)
  4. Try the health endpoint: curl http://YOUR-IP:PORT/health

Slow Responses

  • First request slow: The model is still loading into GPU memory; subsequent requests are faster
  • Consistently slow: Check GPU utilization with nvidia-smi
  • High latency: Consider a smaller model or more GPUs
  • Timeouts: Reduce max_tokens or switch to streaming responses

Out of Memory

  • Increase GPU count (Scale feature)
  • Use a quantized model version (AWQ, GPTQ)
  • Reduce max_model_len in vLLM config
  • Try a smaller model variant

Need Help?

Contact us at support@packet.ai