Token Factory: How We Built a 98% Cheaper OpenAI Alternative
Token Factory is our managed inference API. It's OpenAI-compatible, which means you can literally swap out your base URL and keep using the OpenAI SDK. But here's the interesting part: we charge $0.10-0.15 per million tokens, compared to the $2.50-$10.00 per million OpenAI charges for GPT-4o. That's not a typo.
This post explains how it works, with real code examples.
The Architecture
Token Factory runs on vLLM, arguably the fastest open-source inference engine available. vLLM implements continuous batching, PagedAttention for efficient KV-cache management, and tensor parallelism for multi-GPU setups.
Here's what happens when you make a request:
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Your Code    │────▶│  Token Factory   │────▶│  vLLM Cluster   │
│  (OpenAI SDK)   │◀────│  Load Balancer   │◀────│  (GPU Servers)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌──────────────────┐
                        │  Usage Tracking  │
                        │    & Billing     │
                        └──────────────────┘
- Your request hits our API (OpenAI-compatible format)
- We authenticate via API key, check your wallet balance
- Request is routed to the optimal vLLM server based on model and load
- vLLM generates tokens using continuous batching
- We count tokens, deduct from your wallet, return the response
The key insight: open-source models on optimized infrastructure can match GPT-3.5 quality for most tasks at a fraction of the cost.
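If you prefer to see step 1 of that flow without the SDK, here's a minimal sketch of the same call made directly over HTTP with the requests library. It assumes nothing beyond what's documented in this post: the /api/v1/chat/completions endpoint, Bearer authentication, and OpenAI's chat completions request/response shape.
import requests

API_KEY = "your-packet-api-key"

# Same payload shape the OpenAI SDK would send for a chat completion.
resp = requests.post(
    "https://dash.packet.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in five words"}],
        "max_tokens": 50,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])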
Using the OpenAI SDK (Drop-In Replacement)
If you're already using OpenAI, migration takes about 30 seconds:
from openai import OpenAI

# Just change the base URL and API key
client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    max_tokens=500,
    temperature=0.7
)

print(response.choices[0].message.content)
That's it. Same SDK, same response format, same streaming support. Just a different (and cheaper) backend.
Streaming Responses
For chatbots and real-time applications, streaming is essential:
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Tokens arrive as they're generated. First token latency is typically 100-200ms, then tokens flow at 50-100 tokens/second depending on the model.
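If you want to sanity-check those numbers against your own workload, here's a rough sketch that times time-to-first-token and streaming throughput. It reuses the client configured earlier; treating each content chunk as roughly one token is an approximation, and the numbers you see will vary with model and load.
import time

start = time.monotonic()
first_token_time = None
content_chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.monotonic()  # first visible token
        content_chunks += 1

total = time.monotonic() - start
if first_token_time is not None:
    ttft = first_token_time - start
    gen_time = max(total - ttft, 1e-6)
    print(f"Time to first token: {ttft:.2f}s")
    # Each content chunk is roughly one token, so this is an approximation.
    print(f"~{content_chunks / gen_time:.0f} tokens/s after the first token")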
Using with LangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key",
    model="meta-llama/Llama-3.1-8B-Instruct"
)

response = llm.invoke("What is machine learning?")
Using with JavaScript/TypeScript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://dash.packet.ai/api/v1',
  apiKey: process.env.PACKET_API_KEY
});

const completion = await client.chat.completions.create({
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(completion.choices[0].message.content);
Batch Processing: 50% Off for Async Workloads
Not everything needs real-time responses. If you're processing documents, generating training data, or running evaluations, batch processing saves you serious money.
How Batch Pricing Works
| Tier | Price per 1M tokens | Turnaround |
|---|---|---|
| Real-time | $0.10 | Instant |
| Batch (1h SLA) | $0.07 | Within 1 hour |
| Batch (24h SLA) | $0.05 | Within 24 hours |
That's up to 50% savings over real-time pricing.
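The arithmetic behind that table is simple enough to script. A small sketch using the per-tier prices above (the token count is just an example):
# Price per 1M tokens for each tier, taken from the table above.
PRICE_PER_MILLION = {"realtime": 0.10, "batch_1h": 0.07, "batch_24h": 0.05}

def cost_usd(tokens: int, tier: str) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION[tier]

tokens = 50_000_000  # e.g. a 50M-token document summarization run
for tier in PRICE_PER_MILLION:
    print(f"{tier}: ${cost_usd(tokens, tier):.2f}")
# realtime: $5.00, batch_1h: $3.50, batch_24h: $2.50 -- i.e. 50% off at the 24h SLA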
Creating a Batch Job
First, prepare a JSONL file with your requests:
{"custom_id": "doc-001", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: AI is transforming industries..."}], "max_tokens": 200}}
{"custom_id": "doc-002", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Machine learning enables..."}], "max_tokens": 200}}
{"custom_id": "doc-003", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Neural networks are..."}], "max_tokens": 200}}
Each line is a separate request. The custom_id field lets you match results back to your original requests.
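You don't need to write the JSONL by hand. Here's a minimal sketch that generates it from a list of documents, mirroring the format above (the documents themselves are placeholders):
import json

documents = [
    ("doc-001", "AI is transforming industries..."),
    ("doc-002", "Machine learning enables..."),
    ("doc-003", "Neural networks are..."),
]

with open("requests.jsonl", "w") as f:
    for doc_id, text in documents:
        request = {
            "custom_id": doc_id,  # ties the result back to this document
            "body": {
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "messages": [{"role": "user", "content": f"Summarize: {text}"}],
                "max_tokens": 200,
            },
        }
        f.write(json.dumps(request) + "\n")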
Submit the batch:
curl -X POST https://dash.packet.ai/api/v1/batch \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@requests.jsonl" \
  -F "sla=24h"
Response:
{
  "id": "batch_abc123",
  "object": "batch",
  "status": "queued",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "sla": "24h",
  "total_requests": 3,
  "estimated_cost_cents": 15,
  "deadline": "2025-01-30T12:00:00Z"
}
Checking Batch Status
curl https://dash.packet.ai/api/v1/batch/batch_abc123 \
  -H "Authorization: Bearer YOUR_API_KEY"
Response shows progress:
{
  "id": "batch_abc123",
  "status": "processing",
  "total_requests": 3,
  "completed_requests": 2,
  "failed_requests": 0
}
Downloading Results
When the status is completed:
curl https://dash.packet.ai/api/v1/batch/batch_abc123/output \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -o results.jsonl
Results come back as JSONL:
{"custom_id": "doc-001", "response": {"choices": [{"message": {"content": "AI is revolutionizing..."}}]}, "usage": {"prompt_tokens": 45, "completion_tokens": 120}}
{"custom_id": "doc-002", "response": {"choices": [{"message": {"content": "Machine learning provides..."}}]}, "usage": {"prompt_tokens": 42, "completion_tokens": 115}}
Python Batch Client
import requests
import time

API_KEY = "your-api-key"
BASE_URL = "https://dash.packet.ai/api/v1"

def submit_batch(filepath, sla="24h"):
    with open(filepath, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/batch",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"sla": sla}
        )
    return response.json()

def wait_for_batch(batch_id):
    while True:
        response = requests.get(
            f"{BASE_URL}/batch/{batch_id}",
            headers={"Authorization": f"Bearer {API_KEY}"}
        )
        data = response.json()
        if data["status"] in ["completed", "failed"]:
            return data
        print(f"Progress: {data['completed_requests']}/{data['total_requests']}")
        time.sleep(30)

def download_results(batch_id, output_path):
    response = requests.get(
        f"{BASE_URL}/batch/{batch_id}/output",
        headers={"Authorization": f"Bearer {API_KEY}"}
    )
    with open(output_path, "w") as f:
        f.write(response.text)

# Usage
batch = submit_batch("requests.jsonl", sla="1h")
print(f"Batch {batch['id']} submitted")

result = wait_for_batch(batch["id"])
print(f"Batch completed: {result['completed_requests']} successful")

download_results(batch["id"], "results.jsonl")
LoRA Fine-Tuning: Custom Models Without the Infrastructure
LoRA (Low-Rank Adaptation) lets you customize a base model for your specific use case. Instead of training all 8 billion parameters, LoRA trains small adapter matrices (~30-100MB) that modify the model's behavior.
Why this matters:
- Fast training: Hours instead of days
- Low cost: $5-50 per training run instead of thousands
- Portable: Adapters are small files you can swap dynamically
- No infrastructure: We handle the GPUs
Creating a LoRA Adapter
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "customer-support-v1",
    "display_name": "Customer Support Specialist",
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "epochs": 3,
    "learning_rate": 0.0002,
    "rank": 16
  }'
Response:
{
  "id": "lora_xyz789",
  "name": "customer-support-v1",
  "base_model": "meta-llama/Llama-3.1-8B-Instruct",
  "training_status": "pending",
  "created_at": "2025-01-29T10:00:00Z"
}
Training Data Format
Prepare a JSONL file with conversation examples:
{"messages": [{"role": "system", "content": "You are a helpful customer support agent."}, {"role": "user", "content": "I want to return my order"}, {"role": "assistant", "content": "I'd be happy to help you with a return. Could you please provide your order number? Our return policy allows returns within 30 days of purchase."}]}
{"messages": [{"role": "user", "content": "My package hasn't arrived"}, {"role": "assistant", "content": "I apologize for the delay. Let me look up your tracking information. Could you share your order number so I can investigate this for you?"}]}
{"messages": [{"role": "user", "content": "Do you have this in blue?"}, {"role": "assistant", "content": "I'd be happy to check our inventory for you. Could you tell me which product you're interested in? I can then confirm color availability and sizes."}]}
Quality tips:
- 50-500 high-quality examples typically work well
- Be consistent in tone and format
- Include edge cases and variations
- Each example should demonstrate the behavior you want
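It's also worth validating the file before you upload it. A rough sanity-check sketch, based only on the format described above (one JSON object per line, a messages list of role/content pairs, and at least one assistant turn to learn from):
import json

VALID_ROLES = {"system", "user", "assistant"}

with open("training_data.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)  # raises if the line isn't valid JSON
        messages = example.get("messages", [])
        assert messages, f"line {line_no}: empty messages list"
        for m in messages:
            assert m.get("role") in VALID_ROLES, f"line {line_no}: bad role {m.get('role')!r}"
            assert isinstance(m.get("content"), str), f"line {line_no}: content must be a string"
        assert any(m["role"] == "assistant" for m in messages), \
            f"line {line_no}: needs an assistant turn to learn from"

print("training_data.jsonl looks well-formed")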
Uploading Training Data
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora/lora_xyz789/training-data \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@training_data.jsonl"
Starting Training
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora/lora_xyz789/train \
  -H "Authorization: Bearer YOUR_TOKEN"
Training typically takes 10-60 minutes depending on dataset size and epochs.
Using Your Fine-Tuned Model
Once training completes (training_status: "ready"), use it via the lora_adapter parameter:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "I need to return something"}
    ],
    extra_body={
        "lora_adapter": "lora_xyz789"
    }
)
The base model + your LoRA adapter combine at inference time. No model reloading required.
Training Parameters Explained
| Parameter | Default | Description |
|---|---|---|
| epochs | 3 | Number of passes through your data. More epochs = more learning, but risk of overfitting. Start with 3, increase if underfitting. |
| learning_rate | 0.0002 | How fast the model adapts. Lower = more stable but slower. Higher = faster but risk of instability. |
| rank | 16 | LoRA dimension. Higher = more capacity but a larger adapter. 8, 16, and 32 are common choices (see the size sketch below). |
| alpha | 32 | Scaling factor. Usually 2x the rank works well. |
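To see why adapters stay small even at higher rank, note that a LoRA adapter for a weight matrix of shape (out, in) adds only rank × (out + in) parameters instead of out × in. The sketch below uses illustrative, Llama-3.1-8B-style attention projection shapes; they are assumptions for illustration, not our exact training configuration.
# A LoRA adapter adds rank * (out_dim + in_dim) parameters per adapted matrix,
# instead of the out_dim * in_dim parameters of the full weight.
def lora_params(shapes, rank):
    return sum(rank * (out_dim + in_dim) for out_dim, in_dim in shapes)

# Illustrative shapes: q/k/v/o attention projections for a Llama-3.1-8B-style
# model (hidden size 4096, grouped-query KV dim 1024), repeated for 32 layers.
layer_shapes = [(4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096)]
shapes = layer_shapes * 32

for rank in (8, 16, 32):
    params = lora_params(shapes, rank)
    print(f"rank {rank}: ~{params / 1e6:.1f}M params, ~{params * 2 / 1e6:.0f} MB in fp16")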
The Economics: Why We're Cheaper
Let's do the math.
OpenAI GPT-4o-mini pricing:
- Input: $0.15 per 1M tokens
- Output: $0.60 per 1M tokens
OpenAI GPT-4o pricing:
- Input: $2.50 per 1M tokens
- Output: $10.00 per 1M tokens
Token Factory pricing (all tokens):
- Real-time: $0.10-0.15 per 1M tokens (varies by model)
- Batch 1h: $0.07-0.10 per 1M tokens
- Batch 24h: $0.05-0.08 per 1M tokens
For a typical chatbot processing 100M tokens/month (assuming an even split of input and output tokens):
| Provider | Monthly Cost |
|---|---|
| OpenAI GPT-4o | ~$625 |
| OpenAI GPT-4o-mini | ~$37.50 |
| Token Factory Real-time | ~$12 |
| Token Factory Batch 24h | ~$6 |
That's roughly 98% cheaper than GPT-4o, and 68-84% cheaper than GPT-4o-mini depending on the tier you choose.
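For transparency, here's the arithmetic behind that table, assuming a 50/50 split between input and output tokens and mid-range Token Factory rates:
def monthly_cost_usd(total_tokens, input_price, output_price, input_share=0.5):
    # Prices are per 1M tokens; assume half the tokens are input, half output.
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

tokens = 100_000_000  # 100M tokens per month
print(monthly_cost_usd(tokens, 2.50, 10.00))  # GPT-4o                       -> 625.0
print(monthly_cost_usd(tokens, 0.15, 0.60))   # GPT-4o-mini                  -> 37.5
print(monthly_cost_usd(tokens, 0.12, 0.12))   # Token Factory real-time, mid -> 12.0
print(monthly_cost_usd(tokens, 0.06, 0.06))   # Token Factory batch 24h, mid -> 6.0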
How We Achieve This
- Open-source models: Llama 3.1 8B matches GPT-3.5 quality for most tasks
- vLLM efficiency: Continuous batching means higher GPU utilization
- No margin stacking: We pass infrastructure savings directly to you
- Batch scheduling: 24h SLA lets us optimize GPU utilization further
API Reference
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/chat/completions | POST | OpenAI-compatible chat |
| /api/v1/models | GET | List available models |
| /api/v1/batch | POST | Create batch job |
| /api/v1/batch | GET | List batch jobs |
| /api/v1/batch/:id | GET | Get batch status |
| /api/v1/batch/:id/output | GET | Download results |
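Because the API is OpenAI-compatible, the same SDK client works against these endpoints too. For example, listing the available models (this assumes the models endpoint mirrors OpenAI's list response, which the SDK parses for you):
from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key",
)

# GET /api/v1/models through the SDK; iterating the page yields model objects.
for model in client.models.list():
    print(model.id)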
Authentication
All endpoints require an API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
Get your API key from Dashboard → API Keys.
Rate Limits
| Tier | Requests/min | Tokens/min |
|---|---|---|
| Free | 60 | 100K |
| Pro | 600 | 1M |
| Enterprise | Custom | Custom |
Error Handling
Errors follow OpenAI's format:
{
  "error": {
    "message": "Invalid API key",
    "type": "authentication_error",
    "param": null,
    "code": "invalid_api_key"
  }
}
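Since errors follow OpenAI's format, the OpenAI SDK's exception classes should map onto them cleanly (assuming standard HTTP status codes, e.g. 401 for authentication failures and 429 for rate limits). A minimal sketch; the backoff policy is just an example, not a recommendation:
import time
import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key",
)

def chat_with_retry(messages, retries=3):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="meta-llama/Llama-3.1-8B-Instruct",
                messages=messages,
            )
        except openai.AuthenticationError:
            raise  # a bad key won't fix itself; surface it immediately
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # back off, then retry
    raise RuntimeError("still rate limited after retries")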
Getting Started
- Sign up at dash.packet.ai
- Add funds to your wallet (start with $5)
- Create an API key in Dashboard → API Keys
- Start making requests using the OpenAI SDK
First 10,000 tokens are free. No credit card required to try.
Conclusion
Token Factory is what happens when you combine open-source models, optimized inference engines, and honest pricing. Same API you already know, 98% cheaper.
We're not trying to replace OpenAI for everything—GPT-4 is still unmatched for complex reasoning tasks. But for the 80% of use cases where Llama 3.1 is good enough, you shouldn't be paying enterprise prices.
Try it out. If it works for your use case, you'll save a lot of money. If it doesn't, you've lost nothing but a few minutes.
Questions? Email support@packet.ai or ping us on Twitter @packetai.
