OpenAI-Compatible API Gateway
Use your deployed models with existing OpenAI SDKs and tools. Drop-in replacement for OpenAI APIs.
Overview
Packet.ai provides an OpenAI-compatible API proxy that routes requests to your deployed vLLM instance. This means you can use the same code, SDKs, and tools you use with OpenAI—just change the base URL and API key.
Prerequisites
Before using the API Gateway, ensure you have:
- An active GPU subscription with a running pod
- vLLM deployed and running on your pod (via Hugging Face deployment or manual setup)
- Port 8000 exposed as a service using the "Expose Service" feature in your dashboard
- A Packet.ai API key created in your dashboard under API Keys
Key Features
| Feature | Endpoint | Description |
|---|---|---|
| Chat Completions | /v1/chat/completions | Full OpenAI-compatible chat API with streaming |
| Text Completions | /v1/completions | Legacy completions endpoint |
| Streaming | All endpoints | Real-time Server-Sent Events (SSE) |
| Model Listing | /v1/models | List available models on your instance |
| Auto-Discovery | - | Automatically finds your running vLLM instance |
Quick Start
1. Get Your API Key
Create an API key in your dashboard under Settings → API Keys. Your key will look like:
pk_live_abc123...
2. Use the Packet.ai Proxy Endpoint
Point your OpenAI SDK to the Packet.ai API gateway:
https://dash.packet.ai/api/v1
3. Make Your First Request
curl https://dash.packet.ai/api/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-d '{
"model": "auto",
"messages": [
{"role": "user", "content": "Hello!"}
]
  }'
Note: Use "model": "auto" or omit the model field to automatically use whichever model is running on your instance.
SDK Examples
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
response = client.chat.completions.create(
model="auto", # Uses your deployed model
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about GPUs"}
],
max_tokens=100,
temperature=0.7
)
print(response.choices[0].message.content)
Python with Streaming
stream = client.chat.completions.create(
model="your-model-id",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
JavaScript/TypeScript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'https://dash.packet.ai/api/v1',
apiKey: 'pk_live_YOUR_API_KEY',
});
const response = await client.chat.completions.create({
model: 'auto', // Uses your deployed model
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Hello!' }
],
});
console.log(response.choices[0].message.content);
JavaScript with Streaming
const stream = await client.chat.completions.create({
model: 'auto',
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
process.stdout.write(content);
}
}
cURL with Streaming
curl https://dash.packet.ai/api/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-N \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Count to 10"}],
"stream": true
  }'
LangChain Integration
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="https://dash.packet.ai/api/v1",
model="auto",
api_key="pk_live_YOUR_API_KEY",
temperature=0.7
)
response = llm.invoke("What is the capital of France?")
print(response.content)
LlamaIndex Integration
from llama_index.llms.openai_like import OpenAILike
llm = OpenAILike(
api_base="https://dash.packet.ai/api/v1",
model="auto",
api_key="pk_live_YOUR_API_KEY"
)
response = llm.complete("Hello, how are you?")
print(response)
API Reference
Chat Completions
POST /v1/chat/completions
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID or "auto" for auto-detection |
| messages | array | Yes | Array of message objects with role and content |
| max_tokens | integer | No | Maximum tokens to generate (default: model max) |
| temperature | float | No | Sampling temperature 0-2 (default: 1.0) |
| top_p | float | No | Nucleus sampling parameter (default: 1.0) |
| stream | boolean | No | Enable streaming responses (default: false) |
| stop | array | No | Stop sequences to halt generation |
| frequency_penalty | float | No | Penalty for frequent tokens (-2.0 to 2.0) |
| presence_penalty | float | No | Penalty for present tokens (-2.0 to 2.0) |
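A request that exercises several of these parameters with the Python SDK might look like the following sketch (the values are illustrative, not recommendations):
from openai import OpenAI
client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)
response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain why GPUs are good at matrix math."}
    ],
    max_tokens=200,         # cap the response length
    temperature=0.7,        # moderate randomness
    top_p=0.9,              # nucleus sampling
    stop=["\n\n"],          # stop at the first blank line
    frequency_penalty=0.2,  # discourage repeated tokens
    presence_penalty=0.0
)
print(response.choices[0].message.content)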
Text Completions
POST /v1/completions
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID or "auto" |
| prompt | string | Yes | Text prompt for completion |
| max_tokens | integer | No | Maximum tokens to generate |
| temperature | float | No | Sampling temperature |
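The legacy endpoint is reached through the SDK's completions interface rather than chat.completions; a minimal sketch:
from openai import OpenAI
client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)
response = client.completions.create(
    model="auto",
    prompt="Once upon a time",
    max_tokens=50,
    temperature=0.8
)
print(response.choices[0].text)  # Note: .text here, not .message.content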
List Models
GET /v1/models
Returns the list of available models on this endpoint.
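With the Python SDK this maps to client.models.list() (a sketch, using the same base URL and key as above):
from openai import OpenAI
client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)
# Print the ID of every model served by your instance
for model in client.models.list():
    print(model.id)
The same request with curl: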
curl https://dash.packet.ai/api/v1/models \
-H "Authorization: Bearer pk_live_YOUR_API_KEY"Health Check
GET /health
Check if the inference server is running and ready.
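For a scripted readiness probe, something like the following works (a sketch assuming the requests library; substitute your pod's public IP and exposed port):
import requests
resp = requests.get("http://YOUR-IP:PORT/health", timeout=5)
print("ready" if resp.status_code == 200 else f"not ready ({resp.status_code})")
The same check with curl: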
curl http://YOUR-IP:PORT/health
Response Format
Chat Completion Response
{
"id": "chatcmpl-123abc",
"object": "chat.completion",
"created": 1705651234,
"model": "meta-llama/Llama-3.1-70B-Instruct",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 9,
"total_tokens": 24
}
}
Streaming Response
When stream: true, responses are sent as Server-Sent Events:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Error Handling
Error Response Format
{
"error": {
"message": "Description of the error",
"type": "error_type",
"code": "error_code"
}
}
Common Error Codes
| HTTP Status | Error Code | Description | Resolution |
|---|---|---|---|
| 400 | invalid_request | Malformed request body | Check JSON syntax and required fields |
| 401 | invalid_api_key | Missing or invalid API key | Check Authorization header |
| 403 | insufficient_quota | Account balance depleted | Add funds in Billing section |
| 404 | model_not_found | Requested model not available | Check model ID or use "auto" |
| 429 | rate_limit_exceeded | Too many requests | Implement exponential backoff |
| 500 | internal_error | Server error | Retry request or contact support |
| 503 | service_unavailable | No running inference endpoint | Start your GPU and deploy vLLM |
Python Error Handling Example
from openai import OpenAI, APIError, RateLimitError, AuthenticationError
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
try:
    response = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response.choices[0].message.content)
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
    # Check your API key
except RateLimitError as e:
    print(f"Rate limited: {e}")
    # Wait and retry with exponential backoff
except APIError as e:
    print(f"API error: {e}")
    # Handle based on error code
JavaScript Error Handling Example
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'https://dash.packet.ai/api/v1',
apiKey: 'pk_live_YOUR_API_KEY',
});
try {
const response = await client.chat.completions.create({
model: 'auto',
messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(response.choices[0].message.content);
} catch (error) {
if (error instanceof OpenAI.AuthenticationError) {
console.error('Invalid API key');
} else if (error instanceof OpenAI.RateLimitError) {
console.error('Rate limited, retrying...');
// Implement retry logic
} else if (error instanceof OpenAI.APIError) {
console.error(`API error: ${error.status} - ${error.message}`);
}
}
Rate Limits
Rate limits depend on your account tier and current server load:
| Tier | Requests/min | Tokens/min |
|---|---|---|
| Free | 20 | 40,000 |
| Standard | 60 | 150,000 |
| Premium | 200 | 500,000 |
Rate Limit Headers
Responses include rate limit information:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 55
X-RateLimit-Reset: 1705651300
Handling Rate Limits
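One proactive option is to read the headers listed above before you hit the limit. With the Python SDK, raw response headers are available via .with_raw_response (a sketch; header names as shown above):
from openai import OpenAI
client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(raw.headers.get("X-RateLimit-Remaining"))
print(raw.headers.get("X-RateLimit-Reset"))
completion = raw.parse()  # the usual ChatCompletion object
print(completion.choices[0].message.content)
The reactive alternative is to retry on RateLimitError with exponential backoff: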
import time
from openai import OpenAI, RateLimitError
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="auto",
                messages=messages
            )
        except RateLimitError:
            wait_time = 2 ** attempt  # Exponential backoff
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")
Best Practices
- Set max_tokens - Always specify to control response length and costs
- Use streaming - Better UX for long responses, shows progress immediately
- Handle rate limits - Implement retry logic with exponential backoff
- Monitor latency - First request may be slow while model loads
- Use system prompts - Guide model behavior consistently
- Cache responses - For identical queries, cache to reduce costs (see the sketch after this list)
- Batch when possible - Use batch API for non-real-time workloads
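A minimal in-process cache for the response-caching tip might look like this (a sketch; a production setup would typically use Redis or similar with an expiry policy):
import hashlib
import json
from openai import OpenAI
client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)
_cache = {}
def cached_chat(messages, **kwargs):
    # Key on the full request so different parameters don't collide
    key = hashlib.sha256(
        json.dumps({"messages": messages, **kwargs}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = client.chat.completions.create(
            model="auto", messages=messages, **kwargs
        )
    return _cache[key]
reply = cached_chat([{"role": "user", "content": "Hello!"}], max_tokens=50)
print(reply.choices[0].message.content)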
Troubleshooting
"No running inference endpoint found"
- Make sure you have an active GPU subscription with a running pod
- Deploy a model via Hugging Face deployment or manually start vLLM
- Expose port 8000 as a service (Dashboard → Your Pod → Expose Service)
- Wait for the vLLM server to fully start (check deployment logs)
Connection Refused / Timeout
- Check if pod status is "Running" in your dashboard
- Verify vLLM is running on port 8000 inside your pod
- Wait for model to finish loading (check deployment logs)
Slow First Response
- First request triggers model loading into GPU memory
- Subsequent requests will be much faster
- Enable persistent storage to cache models between restarts
Authentication Failed
- Make sure you're using a valid Packet.ai API key (starts with pk_live_)
- Include the Authorization: Bearer YOUR_KEY header
- Check that your API key hasn't been revoked in the dashboard
Model Not Found
- Use "model": "auto" to automatically detect the model
- Check available models with GET /v1/models
- Ensure the model ID matches exactly (case-sensitive)
Need Help?
Contact us at support@packet.ai
