OpenAI-Compatible API Gateway

Use your deployed models with existing OpenAI SDKs and tools. Drop-in replacement for OpenAI APIs.

Overview

Packet.ai provides an OpenAI-compatible API proxy that routes requests to your deployed vLLM instance. This means you can use the same code, SDKs, and tools you use with OpenAI—just change the base URL and API key.
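
As a minimal illustration of that drop-in behavior: recent versions of the official openai Python package also pick up the base URL and key from the standard OPENAI_BASE_URL and OPENAI_API_KEY environment variables, so existing code that constructs a client with no arguments can usually be repointed without source changes (if your SDK version does not read OPENAI_BASE_URL, pass base_url explicitly as in the examples below).

# Assumes these are set in the environment before the script runs:
#   OPENAI_BASE_URL=https://dash.packet.ai/api/v1
#   OPENAI_API_KEY=pk_live_YOUR_API_KEY
from openai import OpenAI

client = OpenAI()  # no arguments: reads both values from the environment

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)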

Prerequisites

Before using the API Gateway, ensure you have:

  1. An active GPU subscription with a running pod
  2. vLLM deployed and running on your pod (via Hugging Face deployment or manual setup)
  3. Port 8000 exposed as a service using the "Expose Service" feature in your dashboard
  4. A Packet.ai API key created in your dashboard under API Keys

Key Features

Feature           Endpoint              Description
Chat Completions  /v1/chat/completions  Full OpenAI-compatible chat API with streaming
Text Completions  /v1/completions       Legacy completions endpoint
Streaming         All endpoints         Real-time Server-Sent Events (SSE)
Model Listing     /v1/models            List available models on your instance
Auto-Discovery    -                     Automatically finds your running vLLM instance

Quick Start

1. Get Your API Key

Create an API key in your dashboard under Settings → API Keys. Your key will look like:

pk_live_abc123...

2. Use the Packet.ai Proxy Endpoint

Point your OpenAI SDK to the Packet.ai API gateway:

https://dash.packet.ai/api/v1

3. Make Your First Request

curl https://dash.packet.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Note: Use "model": "auto" or omit the model field to automatically use whichever model is running on your instance.

SDK Examples

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="auto",  # Uses your deployed model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)

Python with Streaming

stream = client.chat.completions.create(
    model="your-model-id",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

JavaScript/TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://dash.packet.ai/api/v1',
  apiKey: 'pk_live_YOUR_API_KEY',
});

const response = await client.chat.completions.create({
  model: 'auto',  // Uses your deployed model
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' }
  ],
});

console.log(response.choices[0].message.content);

JavaScript with Streaming

const stream = await client.chat.completions.create({
  model: 'auto',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

cURL with Streaming

curl https://dash.packet.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY" \
  -N \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'

LangChain Integration

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://dash.packet.ai/api/v1",
    model="auto",
    api_key="pk_live_YOUR_API_KEY",
    temperature=0.7
)

response = llm.invoke("What is the capital of France?")
print(response.content)

LlamaIndex Integration

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="https://dash.packet.ai/api/v1",
    model="auto",
    api_key="pk_live_YOUR_API_KEY"
)

response = llm.complete("Hello, how are you?")
print(response)

API Reference

Chat Completions

POST /v1/chat/completions

Parameter          Type     Required  Description
model              string   Yes       Model ID or "auto" for auto-detection
messages           array    Yes       Array of message objects with role and content
max_tokens         integer  No        Maximum tokens to generate (default: model max)
temperature        float    No        Sampling temperature, 0-2 (default: 1.0)
top_p              float    No        Nucleus sampling parameter (default: 1.0)
stream             boolean  No        Enable streaming responses (default: false)
stop               array    No        Stop sequences to halt generation
frequency_penalty  float    No        Penalty for frequent tokens (-2.0 to 2.0)
presence_penalty   float    No        Penalty for present tokens (-2.0 to 2.0)
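
For illustration, a request that combines several of the optional parameters above might look like this (using the client from the Python SDK example; the values are arbitrary):

response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "List three uses for a GPU."},
    ],
    max_tokens=150,         # cap the response length
    temperature=0.7,
    top_p=0.9,
    stop=["\n\n"],          # halt at the first blank line
    frequency_penalty=0.5,
)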

Text Completions

POST /v1/completions

Parameter    Type     Required  Description
model        string   Yes       Model ID or "auto"
prompt       string   Yes       Text prompt for completion
max_tokens   integer  No        Maximum tokens to generate
temperature  float    No        Sampling temperature
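
A minimal call against this endpoint, mirroring the chat example above (the prompt and limits are arbitrary), could look like:

completion = client.completions.create(
    model="auto",
    prompt="Once upon a time",
    max_tokens=50,
    temperature=0.8,
)
print(completion.choices[0].text)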

List Models

GET /v1/models

Returns the list of available models on this endpoint.

curl https://dash.packet.ai/api/v1/models \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY"

Health Check

GET /health

Check if the inference server is running and ready.

curl http://YOUR-IP:PORT/health

Response Format

Chat Completion Response

{
  "id": "chatcmpl-123abc",
  "object": "chat.completion",
  "created": 1705651234,
  "model": "meta-llama/Llama-3.1-70B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 9,
    "total_tokens": 24
  }
}

Streaming Response

When stream: true, responses are sent as Server-Sent Events:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
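
If you are not using one of the SDKs above, a minimal sketch of consuming this stream directly with the Python requests library (same endpoint and key placeholders as the cURL example) is:

import json
import requests

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer pk_live_YOUR_API_KEY",
}
payload = {
    "model": "auto",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": True,
}

with requests.post(
    "https://dash.packet.ai/api/v1/chat/completions",
    headers=headers, json=payload, stream=True,
) as resp:
    for line in resp.iter_lines():
        # Each event arrives as a line beginning with "data: "
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)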

Error Handling

Error Response Format

{
  "error": {
    "message": "Description of the error",
    "type": "error_type",
    "code": "error_code"
  }
}

Common Error Codes

HTTP Status  Error Code           Description                    Resolution
400          invalid_request      Malformed request body         Check JSON syntax and required fields
401          invalid_api_key      Missing or invalid API key     Check the Authorization header
403          insufficient_quota   Account balance depleted       Add funds in the Billing section
404          model_not_found      Requested model not available  Check the model ID or use "auto"
429          rate_limit_exceeded  Too many requests              Implement exponential backoff
500          internal_error       Server error                   Retry the request or contact support
503          service_unavailable  No running inference endpoint  Start your GPU and deploy vLLM

Python Error Handling Example

from openai import OpenAI, APIError, RateLimitError, AuthenticationError

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

try:
    response = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response.choices[0].message.content)

except AuthenticationError as e:
    print(f"Authentication failed: {e}")
    # Check your API key

except RateLimitError as e:
    print(f"Rate limited: {e}")
    # Wait and retry with exponential backoff

except APIError as e:
    print(f"API error: {e}")
    # Inspect e.message / e.body to handle specific cases
    # (only HTTP errors carry a status_code; connection errors do not)

JavaScript Error Handling Example

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://dash.packet.ai/api/v1',
  apiKey: 'pk_live_YOUR_API_KEY',
});

try {
  const response = await client.chat.completions.create({
    model: 'auto',
    messages: [{ role: 'user', content: 'Hello!' }],
  });
  console.log(response.choices[0].message.content);
} catch (error) {
  if (error instanceof OpenAI.AuthenticationError) {
    console.error('Invalid API key');
  } else if (error instanceof OpenAI.RateLimitError) {
    console.error('Rate limited, retrying...');
    // Implement retry logic
  } else if (error instanceof OpenAI.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  }
}

Rate Limits

Rate limits depend on your account tier and current server load:

Tier      Requests/min  Tokens/min
Free      20            40,000
Standard  60            150,000
Premium   200           500,000

Rate Limit Headers

Responses include rate limit information:

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 55
X-RateLimit-Reset: 1705651300
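
To read these headers from the Python SDK, you can use its raw-response interface; a short sketch with a client configured as in the earlier examples:

raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)
print("Requests remaining this window:", raw.headers.get("X-RateLimit-Remaining"))

response = raw.parse()  # the usual ChatCompletion object
print(response.choices[0].message.content)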

Handling Rate Limits

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="auto",
                messages=messages
            )
        except RateLimitError:
            wait_time = 2 ** attempt  # Exponential backoff
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")

Best Practices

  • Set max_tokens - Always specify to control response length and costs
  • Use streaming - Better UX for long responses, shows progress immediately
  • Handle rate limits - Implement retry logic with exponential backoff
  • Monitor latency - First request may be slow while model loads
  • Use system prompts - Guide model behavior consistently
  • Cache responses - Cache results for identical queries to reduce cost and latency (see the sketch after this list)
  • Batch when possible - Use batch API for non-real-time workloads
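
A minimal in-memory sketch of the response-caching point above, keyed on the full request so different parameters don't collide (swap in Redis or similar for anything beyond a single process):

import json

_cache = {}

def cached_chat(client, messages, **params):
    # Serialize the request deterministically to use as a cache key
    key = json.dumps({"messages": messages, **params}, sort_keys=True)
    if key not in _cache:
        _cache[key] = client.chat.completions.create(
            model="auto", messages=messages, **params
        )
    return _cache[key]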

Troubleshooting

"No running inference endpoint found"

  • Make sure you have an active GPU subscription with a running pod
  • Deploy a model via Hugging Face deployment or manually start vLLM
  • Expose port 8000 as a service (Dashboard → Your Pod → Expose Service)
  • Wait for the vLLM server to fully start (check deployment logs)

Connection Refused / Timeout

  • Check if pod status is "Running" in your dashboard
  • Verify vLLM is running on port 8000 inside your pod
  • Wait for model to finish loading (check deployment logs)

Slow First Response

  • First request triggers model loading into GPU memory
  • Subsequent requests will be much faster
  • Enable persistent storage to cache models between restarts

Authentication Failed

  • Make sure you're using a valid Packet.ai API key (starts with pk_live_)
  • Include the Authorization: Bearer YOUR_KEY header
  • Check that your API key hasn't been revoked in the dashboard

Model Not Found

  • Use "model": "auto" to automatically detect
  • Check available models with GET /v1/models
  • Ensure the model ID matches exactly (case-sensitive)

Need Help?

Contact us at support@packet.ai