Token Factory - Inference API
OpenAI-compatible inference API with real-time chat, batch processing, embeddings, structured outputs, function calling, and LoRA fine-tuning.
What is Token Factory?
Token Factory provides managed inference for large language models. Instead of managing your own GPU instance, you pay per token and we handle the infrastructure. Perfect for production APIs, batch processing, embeddings, and custom model fine-tuning.
Table of Contents
- Pricing
- Authentication
- Real-Time Chat Completions
- Structured Outputs (JSON Mode)
- Function Calling (Tools)
- Embeddings
- Batch Processing
- LoRA Fine-Tuning
- API Reference
- Error Handling
- Rate Limits
Pricing
Token Factory offers competitive pricing across different tiers:
| Tier | Input (per 1M) | Output (per 1M) | Best For |
|---|---|---|---|
| Real-time | $0.10 | $0.10 | Interactive apps, chatbots |
| Batch 1h | $0.07 | $0.07 | Time-sensitive bulk work |
| Batch 24h | $0.05 | $0.05 | Maximum cost savings |
| Embeddings | $0.02 | - | Semantic search, RAG |
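To estimate spend before committing to a tier, multiply token counts by the per-million rates above. A minimal sketch, with rates copied from the table (adjust if pricing changes):

# Estimate cost in USD for a workload, using the rates from the table above.
RATES = {  # (input, output) USD per 1M tokens
    "realtime": (0.10, 0.10),
    "batch_1h": (0.07, 0.07),
    "batch_24h": (0.05, 0.05),
}

def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = RATES[tier]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# 10M input tokens and 2M output tokens on the 24h batch tier:
print(f"${estimate_cost('batch_24h', 10_000_000, 2_000_000):.2f}")  # $0.60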
Authentication
All API requests require an API key. Get your key from Dashboard → API Keys.
Using Your API Key
# Include in Authorization header
Authorization: Bearer pk_live_YOUR_API_KEY
# Or as a query parameter (less secure)
?api_key=pk_live_YOUR_API_KEY
Real-Time Chat Completions
Use for interactive applications where users expect immediate responses.
cURL Example
curl -X POST https://dash.packet.ai/api/v1/chat/completions \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
"max_tokens": 500,
"temperature": 0.7
}'
Python Example
from openai import OpenAI
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
max_tokens=500,
temperature=0.7
)
print(response.choices[0].message.content)
JavaScript/TypeScript Example
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'https://dash.packet.ai/api/v1',
apiKey: 'pk_live_YOUR_API_KEY',
});
const response = await client.chat.completions.create({
model: 'meta-llama/Llama-3.1-8B-Instruct',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is machine learning?' }
],
max_tokens: 500,
temperature: 0.7,
});
console.log(response.choices[0].message.content);
Streaming Responses
For better UX, stream tokens as they're generated:
Python Streaming
stream = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Tell me a story about a robot"}],
max_tokens=1000,
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
JavaScript Streaming
const stream = await client.chat.completions.create({
model: 'meta-llama/Llama-3.1-8B-Instruct',
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model ID (e.g., "meta-llama/Llama-3.1-8B-Instruct") |
| messages | array | required | Conversation history with role/content objects |
| max_tokens | integer | 1024 | Maximum tokens to generate (1-4096) |
| temperature | number | 0.7 | Sampling randomness (0-2). Lower = more deterministic |
| top_p | number | 1.0 | Nucleus sampling parameter (0-1) |
| stream | boolean | false | Enable Server-Sent Events streaming |
| stop | array | null | Stop sequences (up to 4) |
| presence_penalty | number | 0 | Penalize repeated topics (-2 to 2) |
| frequency_penalty | number | 0 | Penalize repeated tokens (-2 to 2) |
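As an illustration of how these parameters combine, here is a request that asks for short, reproducible output with a custom stop sequence (the specific values are arbitrary; client is the one configured in the Python example above):

# Deterministic, length-limited completion using several of the parameters above.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "List three uses for embeddings."}],
    max_tokens=200,         # cap generation length and cost
    temperature=0,          # greedy decoding for near-deterministic output
    stop=["\n\n"],          # stop at the first blank line
    frequency_penalty=0.5,  # discourage repeated tokens
)
print(response.choices[0].message.content)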
Structured Outputs (JSON Mode)
Force the model to output valid JSON that matches a specific schema. Perfect for API responses, data extraction, and programmatic processing.
When to Use Structured Outputs
- Extracting structured data from text (names, dates, entities)
- Building APIs that need consistent response formats
- Data validation and form filling
- Converting natural language to structured commands
JSON Mode (Simple)
Request any valid JSON response:
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant that outputs JSON."},
{"role": "user", "content": "Extract the person's name and age from: John Smith is 35 years old."}
],
response_format={"type": "json_object"}
)
# Output: {"name": "John Smith", "age": 35}JSON Schema (Strict)
Enforce a specific schema for guaranteed structure:
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "Extract product information from the user's message."},
{"role": "user", "content": "I want to buy 3 apples at $2 each"}
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "product_order",
"strict": True,
"schema": {
"type": "object",
"properties": {
"product_name": {"type": "string"},
"quantity": {"type": "integer"},
"unit_price": {"type": "number"},
"total_price": {"type": "number"}
},
"required": ["product_name", "quantity", "unit_price", "total_price"]
}
}
}
)
# Output: {"product_name": "apples", "quantity": 3, "unit_price": 2.0, "total_price": 6.0}JavaScript JSON Schema Example
const response = await client.chat.completions.create({
model: 'meta-llama/Llama-3.1-8B-Instruct',
messages: [
{ role: 'system', content: 'Extract event information.' },
{ role: 'user', content: 'Meeting with John tomorrow at 3pm in Conference Room A' }
],
response_format: {
type: 'json_schema',
json_schema: {
name: 'calendar_event',
strict: true,
schema: {
type: 'object',
properties: {
title: { type: 'string' },
attendees: { type: 'array', items: { type: 'string' } },
date: { type: 'string' },
time: { type: 'string' },
location: { type: 'string' }
},
required: ['title', 'attendees', 'date', 'time', 'location']
}
}
}
});
Function Calling (Tools)
Enable the model to call functions/tools that you define. The model determines when to call functions and with what arguments.
Function Calling Use Cases
- Connecting LLMs to external APIs (weather, databases, etc.)
- Building AI agents that can take actions
- Multi-step workflows with tool use
- Structured data retrieval and manipulation
Defining Tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
},
{
"type": "function",
"function": {
"name": "search_database",
"description": "Search for products in the database",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"},
"category": {"type": "string", "description": "Product category"},
"max_results": {"type": "integer", "default": 10}
},
"required": ["query"]
}
}
}
]
Complete Function Calling Example (Python)
import json
from openai import OpenAI
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
# Define your tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}
}
]
# Your actual function implementation
def get_weather(location: str, unit: str = "celsius") -> dict:
# In production, call a real weather API
return {"location": location, "temperature": 22, "unit": unit, "condition": "sunny"}
# First API call - model decides to use a tool
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "What's the weather in Paris?"}],
tools=tools,
tool_choice="auto" # Let model decide when to use tools
)
# Check if model wants to call a function
message = response.choices[0].message
if message.tool_calls:
# Execute the function(s)
tool_results = []
for tool_call in message.tool_calls:
function_name = tool_call.function.name
function_args = json.loads(tool_call.function.arguments)
if function_name == "get_weather":
result = get_weather(**function_args)
tool_results.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result)
})
# Second API call - send results back to model
final_response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "What's the weather in Paris?"},
message, # Include the assistant's tool call
*tool_results # Include tool results
]
)
print(final_response.choices[0].message.content)
# Output: "The weather in Paris is currently sunny with a temperature of 22°C."Tool Choice Options
| Value | Behavior |
|---|---|
"auto" | Model decides whether to call functions (default) |
"none" | Never call functions |
"required" | Must call at least one function |
{"type": "function", "function": {"name": "get_weather"}} | Force a specific function |
Embeddings
Generate vector embeddings for text. Use for semantic search, RAG (Retrieval Augmented Generation), clustering, and similarity matching.
Basic Embedding Request
curl -X POST https://dash.packet.ai/api/v1/embeddings \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-ada-002",
"input": "The quick brown fox jumps over the lazy dog."
}'
Python Embeddings Example
from openai import OpenAI
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
# Single text embedding
response = client.embeddings.create(
model="text-embedding-ada-002",
input="Machine learning is a subset of artificial intelligence."
)
embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}") # e.g., 1536
# Multiple texts at once
response = client.embeddings.create(
model="text-embedding-ada-002",
input=[
"First document about machine learning",
"Second document about deep learning",
"Third document about neural networks"
]
)
for i, item in enumerate(response.data):
print(f"Document {i}: {len(item.embedding)} dimensions")Semantic Search Example
import numpy as np
from openai import OpenAI
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
# Your document corpus
documents = [
"Python is a programming language known for simplicity",
"JavaScript is used for web development",
"Machine learning uses algorithms to learn from data",
"React is a JavaScript library for building UIs",
"TensorFlow is a machine learning framework"
]
# Create embeddings for all documents
doc_response = client.embeddings.create(
model="text-embedding-ada-002",
input=documents
)
doc_embeddings = [d.embedding for d in doc_response.data]
# Create embedding for search query
query = "How do I build AI models?"
query_response = client.embeddings.create(
model="text-embedding-ada-002",
input=query
)
query_embedding = query_response.data[0].embedding
# Calculate cosine similarity
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Find most similar documents
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)
print("Most relevant documents:")
for idx, score in ranked[:3]:
print(f" {score:.3f}: {documents[idx]}")Embedding Parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | Embedding model (e.g., "text-embedding-ada-002") |
| input | string or array | Text(s) to embed. Max 8192 tokens per input. |
| encoding_format | string | "float" (default) or "base64" |
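When embedding a large corpus, send inputs in batches rather than one request per document. A minimal sketch (the batch size of 64 is an arbitrary choice; client is the one configured above):

def embed_corpus(texts: list[str], batch_size: int = 64) -> list[list[float]]:
    """Embed a list of texts in batches; returns one vector per input text."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = client.embeddings.create(
            model="text-embedding-ada-002",
            input=texts[i:i + batch_size],
        )
        # Results come back in input order, one item per text.
        vectors.extend(item.embedding for item in resp.data)
    return vectors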
Batch Processing
Process large volumes of requests with significant cost savings (up to 50% off).
Step 1: Prepare JSONL File
Create a file with one request per line:
{"custom_id": "req-001", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: AI is transforming industries..."}], "max_tokens": 200}}
{"custom_id": "req-002", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Machine learning enables..."}], "max_tokens": 200}}
{"custom_id": "req-003", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Deep learning uses neural..."}], "max_tokens": 200}}Step 2: Submit Batch Job
curl -X POST https://dash.packet.ai/api/v1/batch \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-F "file=@batch_requests.jsonl" \
-F "sla=24h"
# Response:
# {
# "id": "batch_abc123",
# "status": "queued",
# "created_at": "2024-01-15T10:00:00Z",
# "total_requests": 3
# }
Step 3: Check Status
curl https://dash.packet.ai/api/v1/batch/batch_abc123 \
-H "Authorization: Bearer pk_live_YOUR_API_KEY"
# Response:
# {
# "id": "batch_abc123",
# "status": "completed",
# "completed_requests": 3,
# "failed_requests": 0
# }
Step 4: Download Results
curl https://dash.packet.ai/api/v1/batch/batch_abc123/results \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-o results.jsonl
Batch SLA Options
| SLA | Completion Time | Discount |
|---|---|---|
| 1h | Within 1 hour | 30% off |
| 24h | Within 24 hours | 50% off |
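Putting the three steps together, here is a sketch of the full workflow using the requests library. It assumes the endpoints, fields, and status values shown in the curl examples above ("failed" as a terminal state is an assumption), and the 60-second poll interval is arbitrary:

import time
import requests

API = "https://dash.packet.ai/api/v1"
HEADERS = {"Authorization": "Bearer pk_live_YOUR_API_KEY"}

# Step 2: submit the JSONL file prepared in Step 1.
with open("batch_requests.jsonl", "rb") as f:
    job = requests.post(f"{API}/batch", headers=HEADERS,
                        files={"file": f}, data={"sla": "24h"}).json()

# Step 3: poll until the job reaches a terminal state (a webhook avoids polling).
while True:
    status = requests.get(f"{API}/batch/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(60)

# Step 4: download the results file.
results = requests.get(f"{API}/batch/{job['id']}/results", headers=HEADERS)
with open("results.jsonl", "wb") as out:
    out.write(results.content)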
LoRA Fine-Tuning
Customize models for your specific use case with efficient LoRA (Low-Rank Adaptation) fine-tuning.
What is LoRA?
LoRA adds small trainable parameters to a base model, enabling faster training, lower costs, and compact adapters (30-100MB) that can be loaded dynamically without modifying the base model.
Step 1: Create Adapter
Create LoRA adapters from the Token Factory tab in your dashboard, or via API:
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora \
-H "Authorization: Bearer YOUR_DASHBOARD_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "customer-support-v1",
"display_name": "Customer Support Model",
"base_model": "meta-llama/Llama-3.1-8B-Instruct",
"epochs": 3,
"learning_rate": 0.0002,
"rank": 16
}'
Step 2: Upload Training Data
Create a JSONL file with conversation examples:
{"messages": [{"role": "system", "content": "You are a helpful customer service agent."}, {"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "I'd be happy to help you track your order. Could you please provide your order number?"}]}
{"messages": [{"role": "user", "content": "I want to return this item"}, {"role": "assistant", "content": "I can help you with that return. Our return policy allows returns within 30 days. Would you like me to start the return process?"}]}
{"messages": [{"role": "user", "content": "How long does shipping take?"}, {"role": "assistant", "content": "Standard shipping typically takes 3-5 business days. Express shipping is 1-2 business days. Would you like to know the shipping cost for your location?"}]}Training Parameters
| Parameter | Default | Description |
|---|---|---|
| epochs | 3 | Training passes over the data (1-10) |
| learning_rate | 0.0002 | How fast the model adapts (0.0001-0.001) |
| rank | 16 | LoRA dimension; higher = more capacity (8, 16, 32, 64) |
| alpha | 32 | Scaling factor (typically 2x rank) |
| dropout | 0.05 | Regularization to prevent overfitting |
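Before uploading, validate the training file locally. This sketch checks the three requirements listed under Troubleshooting below: every line is valid JSON, every example has a messages array, and there are at least 10 examples (the filename is illustrative):

import json

def validate_training_file(path: str) -> None:
    examples = 0
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)  # each line must be valid JSON
            except json.JSONDecodeError as e:
                raise ValueError(f"line {lineno}: invalid JSON ({e})")
            if not isinstance(record.get("messages"), list):
                raise ValueError(f"line {lineno}: missing 'messages' array")
            examples += 1
    if examples < 10:
        raise ValueError(f"need at least 10 examples, found {examples}")
    print(f"OK: {examples} examples")

validate_training_file("training_data.jsonl")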
API Reference
Base URL
https://dash.packet.ai/api/v1
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /models | GET | List available models |
| /chat/completions | POST | Create chat completion |
| /completions | POST | Create text completion |
| /embeddings | POST | Create text embeddings |
| /batch | GET | List batch jobs |
| /batch | POST | Create batch job |
| /batch/:id | GET | Get batch job status |
| /batch/:id/results | GET | Download batch results |
| /batch/:id | DELETE | Cancel batch job |
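Because the API is OpenAI-compatible, these endpoints work through the standard SDK as well. For example, listing models with the client configured earlier:

# List the models currently available to your key via GET /models.
models = client.models.list()
for m in models.data:
    print(m.id)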
Error Handling
The API returns standard HTTP status codes with detailed error messages.
Error Response Format
{
"error": {
"message": "Invalid API key provided",
"type": "authentication_error",
"code": "invalid_api_key"
}
}
Common Error Codes
| HTTP Status | Error Type | Description |
|---|---|---|
| 400 | invalid_request | Malformed request or invalid parameters |
| 401 | authentication_error | Missing or invalid API key |
| 402 | insufficient_balance | Not enough credits in wallet |
| 404 | not_found | Model or resource not found |
| 429 | rate_limit_exceeded | Too many requests |
| 500 | server_error | Internal server error |
| 503 | service_unavailable | Model is loading or service is down |
Python Error Handling Example
from openai import OpenAI, APIError, RateLimitError, AuthenticationError
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
try:
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello"}]
)
except AuthenticationError:
print("Invalid API key - check your credentials")
except RateLimitError:
print("Rate limited - implement exponential backoff")
except APIError as e:
print(f"API error: {e.message}")Rate Limits
Rate limits protect the service and ensure fair usage.
| Tier | Requests/min | Tokens/min |
|---|---|---|
| Free | 20 | 40,000 |
| Standard | 60 | 150,000 |
| Pro | 200 | 500,000 |
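When you receive a 429, back off and retry rather than resending immediately. A minimal sketch using the RateLimitError exception from the error-handling example above (retry counts and delays are arbitrary):

import time
from openai import RateLimitError

def create_with_backoff(max_retries: int = 5, **kwargs):
    """Retry chat completions on 429 with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("still rate limited after retries")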
Rate limit headers are included in responses:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 55
X-RateLimit-Reset: 1705309200
Best Practices
For Real-Time Chat
- Use streaming for better user experience on long responses
- Set max_tokens to control response length and costs
- Include system prompts for consistent behavior
- Use temperature 0 for deterministic outputs
For Structured Outputs
- Use JSON schema when you need guaranteed structure
- Include examples in system prompt for complex schemas
- Validate output even with strict mode enabled (a minimal sketch follows this list)
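A minimal validation sketch, using the keys from the product_order schema shown earlier:

import json

raw = response.choices[0].message.content
try:
    order = json.loads(raw)
except json.JSONDecodeError:
    raise ValueError(f"model returned non-JSON output: {raw!r}")

# Check required keys even though strict mode should guarantee them.
missing = {"product_name", "quantity", "unit_price", "total_price"} - order.keys()
if missing:
    raise ValueError(f"missing keys: {missing}")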
For Function Calling
- Write clear descriptions for functions and parameters
- Use tool_choice="auto" to let model decide
- Handle edge cases where model may not call expected functions
For Batch Processing
- Use 24h SLA when possible for 50% savings
- Include custom_ids to match results to requests
- Validate JSONL before uploading
- Monitor job status via webhook or polling
Troubleshooting
"Model not found"
- Check the exact model name from the /v1/models endpoint
- Ensure the model is available and loaded
"Invalid training data format"
- Each line must be valid JSON
- Each example needs a messages array
- Minimum 10 examples required
"Insufficient wallet balance"
- Add funds in Dashboard → Billing
- Estimate costs before large batch jobs
"Context length exceeded"
- Reduce input tokens or use a model with larger context
- Summarize or truncate conversation history
Need Help?
Contact us at support@packet.ai
