OpenAI-Compatible API Gateway

Use your deployed models with existing OpenAI SDKs and tools. Drop-in replacement for OpenAI APIs.

Overview

Packet.ai provides an OpenAI-compatible API proxy that routes requests to your deployed vLLM instance. This means you can use the same code, SDKs, and tools you use with OpenAI—just change the base URL and API key.
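
As a minimal illustration of that drop-in behavior: recent versions of the official openai Python package also pick up the base URL and key from the standard OPENAI_BASE_URL and OPENAI_API_KEY environment variables, so existing code that constructs a client with no arguments can usually be repointed without source changes (if your SDK version does not read OPENAI_BASE_URL, pass base_url explicitly as in the examples below).

# Assumes these are set in the environment before the script runs:
#   OPENAI_BASE_URL=https://dash.packet.ai/api/v1
#   OPENAI_API_KEY=pk_live_YOUR_API_KEY
from openai import OpenAI

client = OpenAI()  # no arguments: reads both values from the environment

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)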

Prerequisites

Before using the API Gateway, ensure you have:

  1. An active GPU subscription with a running pod
  2. vLLM deployed and running on your pod (via Hugging Face deployment or manual setup)
  3. Port 8000 exposed as a service using the "Expose Service" feature in your dashboard
  4. A Packet.ai API key created in your dashboard under API Keys

Key Features

Feature           Endpoint              Description
Chat Completions  /v1/chat/completions  Full OpenAI-compatible chat API with streaming
Text Completions  /v1/completions       Legacy completions endpoint
Streaming         All endpoints         Real-time Server-Sent Events (SSE)
Model Listing     /v1/models            List available models on your instance
Auto-Discovery    -                     Automatically finds your running vLLM instance

Quick Start

1. Get Your API Key

Create an API key in your dashboard under Settings → API Keys. Your key will look like:

pk_live_abc123...

2. Use the Packet.ai Proxy Endpoint

Point your OpenAI SDK to the Packet.ai API gateway:

https://dash.packet.ai/api/v1

3. Make Your First Request

curl https://dash.packet.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Note: Use "model": "auto" or omit the model field to automatically use whichever model is running on your instance.

SDK Examples

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="auto",  # Uses your deployed model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)

Python with Streaming

stream = client.chat.completions.create(
    model="your-model-id",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

JavaScript/TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://dash.packet.ai/api/v1',
  apiKey: 'pk_live_YOUR_API_KEY',
});

const response = await client.chat.completions.create({
  model: 'auto',  // Uses your deployed model
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' }
  ],
});

console.log(response.choices[0].message.content);

JavaScript with Streaming

const stream = await client.chat.completions.create({
  model: 'auto',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

cURL with Streaming

curl https://dash.packet.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY" \
  -N \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'

LangChain Integration

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://dash.packet.ai/api/v1",
    model="auto",
    api_key="pk_live_YOUR_API_KEY",
    temperature=0.7
)

response = llm.invoke("What is the capital of France?")
print(response.content)

LlamaIndex Integration

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="https://dash.packet.ai/api/v1",
    model="auto",
    api_key="pk_live_YOUR_API_KEY"
)

response = llm.complete("Hello, how are you?")
print(response)

API Reference

Chat Completions

POST /v1/chat/completions

Parameter          Type     Required  Description
model              string   Yes       Model ID or "auto" for auto-detection
messages           array    Yes       Array of message objects with role and content
max_tokens         integer  No        Maximum tokens to generate (default: model max)
temperature        float    No        Sampling temperature, 0-2 (default: 1.0)
top_p              float    No        Nucleus sampling parameter (default: 1.0)
stream             boolean  No        Enable streaming responses (default: false)
stop               array    No        Stop sequences to halt generation
frequency_penalty  float    No        Penalty for frequent tokens (-2.0 to 2.0)
presence_penalty   float    No        Penalty for present tokens (-2.0 to 2.0)
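
For illustration, a request that combines several of the optional parameters above might look like this (using the client from the Python SDK example; the values are arbitrary):

response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "List three uses for a GPU."},
    ],
    max_tokens=150,         # cap the response length
    temperature=0.7,
    top_p=0.9,
    stop=["\n\n"],          # halt at the first blank line
    frequency_penalty=0.5,
)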

Text Completions

POST /v1/completions

Parameter    Type     Required  Description
model        string   Yes       Model ID or "auto"
prompt       string   Yes       Text prompt for completion
max_tokens   integer  No        Maximum tokens to generate
temperature  float    No        Sampling temperature
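
A minimal call against this endpoint, mirroring the chat example above (the prompt and limits are arbitrary), could look like:

completion = client.completions.create(
    model="auto",
    prompt="Once upon a time",
    max_tokens=50,
    temperature=0.8,
)
print(completion.choices[0].text)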

List Models

GET /v1/models

Returns the list of available models on this endpoint.

curl https://dash.packet.ai/api/v1/models \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY"

Health Check

GET /health

Check if the inference server is running and ready.

curl http://YOUR-IP:PORT/health

Response Format

Chat Completion Response

{
  "id": "chatcmpl-123abc",
  "object": "chat.completion",
  "created": 1705651234,
  "model": "meta-llama/Llama-3.1-70B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 9,
    "total_tokens": 24
  }
}

Streaming Response

When stream: true, responses are sent as Server-Sent Events:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
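
If you are not using one of the SDKs above, a minimal sketch of consuming this stream directly with the Python requests library (same endpoint and key placeholders as the cURL example) is:

import json
import requests

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer pk_live_YOUR_API_KEY",
}
payload = {
    "model": "auto",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": True,
}

with requests.post(
    "https://dash.packet.ai/api/v1/chat/completions",
    headers=headers, json=payload, stream=True,
) as resp:
    for line in resp.iter_lines():
        # Each event arrives as a line beginning with "data: "
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)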

Error Handling

Error Response Format

{
  "error": {
    "message": "Description of the error",
    "type": "error_type",
    "code": "error_code"
  }
}

Common Error Codes

HTTP Status  Error Code           Description                    Resolution
400          invalid_request      Malformed request body         Check JSON syntax and required fields
401          invalid_api_key      Missing or invalid API key     Check the Authorization header
403          insufficient_quota   Account balance depleted       Add funds in the Billing section
404          model_not_found      Requested model not available  Check the model ID or use "auto"
429          rate_limit_exceeded  Too many requests              Implement exponential backoff
500          internal_error       Server error                   Retry the request or contact support
503          service_unavailable  No running inference endpoint  Start your GPU and deploy vLLM

Python Error Handling Example

from openai import OpenAI, APIError, RateLimitError, AuthenticationError

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

try:
    response = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response.choices[0].message.content)

except AuthenticationError as e:
    print(f"Authentication failed: {e}")
    # Check your API key

except RateLimitError as e:
    print(f"Rate limited: {e}")
    # Wait and retry with exponential backoff

except APIError as e:
    print(f"API error: {e}")
    # Inspect e.message / e.body to handle specific cases
    # (only HTTP errors carry a status_code; connection errors do not)

JavaScript Error Handling Example

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://dash.packet.ai/api/v1',
  apiKey: 'pk_live_YOUR_API_KEY',
});

try {
  const response = await client.chat.completions.create({
    model: 'auto',
    messages: [{ role: 'user', content: 'Hello!' }],
  });
  console.log(response.choices[0].message.content);
} catch (error) {
  if (error instanceof OpenAI.AuthenticationError) {
    console.error('Invalid API key');
  } else if (error instanceof OpenAI.RateLimitError) {
    console.error('Rate limited, retrying...');
    // Implement retry logic
  } else if (error instanceof OpenAI.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  }
}

Rate Limits

Rate limits depend on your account tier and current server load:

Tier      Requests/min  Tokens/min
Free      20            40,000
Standard  60            150,000
Premium   200           500,000

Rate Limit Headers

Responses include rate limit information:

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 55
X-RateLimit-Reset: 1705651300
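
To read these headers from the Python SDK, you can use its raw-response interface; a short sketch with a client configured as in the earlier examples:

raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)
print("Requests remaining this window:", raw.headers.get("X-RateLimit-Remaining"))

response = raw.parse()  # the usual ChatCompletion object
print(response.choices[0].message.content)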

Handling Rate Limits

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="auto",
                messages=messages
            )
        except RateLimitError:
            wait_time = 2 ** attempt  # Exponential backoff
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")

Best Practices

  • Set max_tokens - Always specify to control response length and costs
  • Use streaming - Better UX for long responses, shows progress immediately
  • Handle rate limits - Implement retry logic with exponential backoff
  • Monitor latency - First request may be slow while model loads
  • Use system prompts - Guide model behavior consistently
  • Cache responses - Cache results for identical queries to reduce cost and latency (see the sketch after this list)
  • Batch when possible - Use batch API for non-real-time workloads
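
A minimal in-memory sketch of the response-caching point above, keyed on the full request so different parameters don't collide (swap in Redis or similar for anything beyond a single process):

import json

_cache = {}

def cached_chat(client, messages, **params):
    # Serialize the request deterministically to use as a cache key
    key = json.dumps({"messages": messages, **params}, sort_keys=True)
    if key not in _cache:
        _cache[key] = client.chat.completions.create(
            model="auto", messages=messages, **params
        )
    return _cache[key]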

Troubleshooting

"No running inference endpoint found"

  • Make sure you have an active GPU subscription with a running pod
  • Deploy a model via Hugging Face deployment or manually start vLLM
  • Expose port 8000 as a service (Dashboard → Your Pod → Expose Service)
  • Wait for the vLLM server to fully start (check deployment logs)

Connection Refused / Timeout

  • Check if pod status is "Running" in your dashboard
  • Verify vLLM is running on port 8000 inside your pod
  • Wait for model to finish loading (check deployment logs)

Slow First Response

  • First request triggers model loading into GPU memory
  • Subsequent requests will be much faster
  • Enable persistent storage to cache models between restarts

Authentication Failed

  • Make sure you're using a valid Packet.ai API key (starts with pk_live_)
  • Include the Authorization: Bearer YOUR_KEY header
  • Check that your API key hasn't been revoked in the dashboard

Model Not Found

  • Use "model": "auto" to automatically detect
  • Check available models with GET /v1/models
  • Ensure the model ID matches exactly (case-sensitive)

Need Help?

Contact us at support@packet.ai