Token Factory: How We Built a 98% Cheaper OpenAI Alternative
Token Factory is our managed inference API. It's OpenAI-compatible, which means you can literally swap out your base URL and keep using the OpenAI SDK. But here's the interesting part: we charge $0.10-0.15 per million tokens, compared to the $2.50-$10.00 per million OpenAI charges for GPT-4o. That's not a typo.
This post explains how it works, with real code examples.
The Architecture
Token Factory runs on vLLM, arguably the fastest open-source inference engine available. vLLM implements continuous batching, PagedAttention for efficient KV-cache management, and tensor parallelism for multi-GPU setups.
Here's what happens when you make a request:
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Your Code    │────▶│  Token Factory   │────▶│  vLLM Cluster   │
│  (OpenAI SDK)   │◀────│  Load Balancer   │◀────│  (GPU Servers)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌──────────────────┐
                        │  Usage Tracking  │
                        │    & Billing     │
                        └──────────────────┘
- Your request hits our API (OpenAI-compatible format)
- We authenticate via API key, check your wallet balance
- Request is routed to the optimal vLLM server based on model and load
- vLLM generates tokens using continuous batching
- We count tokens, deduct from your wallet, return the response
The key insight: open-source models on optimized infrastructure can match GPT-3.5 quality for most tasks at a fraction of the cost.
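If you prefer to see step 1 of that flow without the SDK, here's a minimal sketch of the same call made directly over HTTP with the requests library. It assumes nothing beyond what's documented in this post: the /api/v1/chat/completions endpoint, Bearer authentication, and OpenAI's chat completions request/response shape.
import requests

API_KEY = "your-packet-api-key"

# Same payload shape the OpenAI SDK would send for a chat completion.
resp = requests.post(
    "https://dash.packet.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in five words"}],
        "max_tokens": 50,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])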
Using the OpenAI SDK (Drop-In Replacement)
If you're already using OpenAI, migration takes about 30 seconds:
from openai import OpenAI

# Just change the base URL and API key
client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    max_tokens=500,
    temperature=0.7
)

print(response.choices[0].message.content)
That's it. Same SDK, same response format, same streaming support. Just a different (and cheaper) backend.
Streaming Responses
For chatbots and real-time applications, streaming is essential:
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Tokens arrive as they're generated. First token latency is typically 100-200ms, then tokens flow at 50-100 tokens/second depending on the model.
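If you want to sanity-check those numbers against your own workload, here's a rough sketch that times time-to-first-token and streaming throughput. It reuses the client configured earlier; treating each content chunk as roughly one token is an approximation, and the numbers you see will vary with model and load.
import time

start = time.monotonic()
first_token_time = None
content_chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.monotonic()  # first visible token
        content_chunks += 1

total = time.monotonic() - start
if first_token_time is not None:
    ttft = first_token_time - start
    gen_time = max(total - ttft, 1e-6)
    print(f"Time to first token: {ttft:.2f}s")
    # Each content chunk is roughly one token, so this is an approximation.
    print(f"~{content_chunks / gen_time:.0f} tokens/s after the first token")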
Using with LangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key",
    model="meta-llama/Llama-3.1-8B-Instruct"
)

response = llm.invoke("What is machine learning?")
Using with JavaScript/TypeScript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://dash.packet.ai/api/v1',
  apiKey: process.env.PACKET_API_KEY
});

const completion = await client.chat.completions.create({
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(completion.choices[0].message.content);
Batch Processing: 50% Off for Async Workloads
Not everything needs real-time responses. If you're processing documents, generating training data, or running evaluations, batch processing saves you serious money.
How Batch Pricing Works
| Tier | Price per 1M tokens | Turnaround |
|---|---|---|
| Real-time | $0.10 | Instant |
| Batch (1h SLA) | $0.07 | Within 1 hour |
| Batch (24h SLA) | $0.05 | Within 24 hours |
That's up to 50% savings over real-time pricing.
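The arithmetic behind that table is simple enough to script. A small sketch using the per-tier prices above (the token count is just an example):
# Price per 1M tokens for each tier, taken from the table above.
PRICE_PER_MILLION = {"realtime": 0.10, "batch_1h": 0.07, "batch_24h": 0.05}

def cost_usd(tokens: int, tier: str) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION[tier]

tokens = 50_000_000  # e.g. a 50M-token document summarization run
for tier in PRICE_PER_MILLION:
    print(f"{tier}: ${cost_usd(tokens, tier):.2f}")
# realtime: $5.00, batch_1h: $3.50, batch_24h: $2.50 -- i.e. 50% off at the 24h SLA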
Creating a Batch Job
First, prepare a JSONL file with your requests:
{"custom_id": "doc-001", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: AI is transforming industries..."}], "max_tokens": 200}}
{"custom_id": "doc-002", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Machine learning enables..."}], "max_tokens": 200}}
{"custom_id": "doc-003", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Neural networks are..."}], "max_tokens": 200}}
Each line is a separate request. The custom_id field lets you match results back to your original requests.
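You don't need to write the JSONL by hand. Here's a minimal sketch that generates it from a list of documents, mirroring the format above (the documents themselves are placeholders):
import json

documents = [
    ("doc-001", "AI is transforming industries..."),
    ("doc-002", "Machine learning enables..."),
    ("doc-003", "Neural networks are..."),
]

with open("requests.jsonl", "w") as f:
    for doc_id, text in documents:
        request = {
            "custom_id": doc_id,  # ties the result back to this document
            "body": {
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "messages": [{"role": "user", "content": f"Summarize: {text}"}],
                "max_tokens": 200,
            },
        }
        f.write(json.dumps(request) + "\n")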
Submit the batch:
curl -X POST https://dash.packet.ai/api/v1/batch \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@requests.jsonl" \
  -F "sla=24h"
Response:
{
  "id": "batch_abc123",
  "object": "batch",
  "status": "queued",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "sla": "24h",
  "total_requests": 3,
  "estimated_cost_cents": 15,
  "deadline": "2025-01-30T12:00:00Z"
}
Checking Batch Status
curl https://dash.packet.ai/api/v1/batch/batch_abc123 \
  -H "Authorization: Bearer YOUR_API_KEY"
Response shows progress:
{
  "id": "batch_abc123",
  "status": "processing",
  "total_requests": 3,
  "completed_requests": 2,
  "failed_requests": 0
}
Downloading Results
When the status is completed:
curl https://dash.packet.ai/api/v1/batch/batch_abc123/output \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -o results.jsonl
Results come back as JSONL:
{"custom_id": "doc-001", "response": {"choices": [{"message": {"content": "AI is revolutionizing..."}}]}, "usage": {"prompt_tokens": 45, "completion_tokens": 120}}
{"custom_id": "doc-002", "response": {"choices": [{"message": {"content": "Machine learning provides..."}}]}, "usage": {"prompt_tokens": 42, "completion_tokens": 115}}
Python Batch Client
import requests
import time

API_KEY = "your-api-key"
BASE_URL = "https://dash.packet.ai/api/v1"

def submit_batch(filepath, sla="24h"):
    with open(filepath, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/batch",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"sla": sla}
        )
    return response.json()

def wait_for_batch(batch_id):
    while True:
        response = requests.get(
            f"{BASE_URL}/batch/{batch_id}",
            headers={"Authorization": f"Bearer {API_KEY}"}
        )
        data = response.json()
        if data["status"] in ["completed", "failed"]:
            return data
        print(f"Progress: {data['completed_requests']}/{data['total_requests']}")
        time.sleep(30)

def download_results(batch_id, output_path):
    response = requests.get(
        f"{BASE_URL}/batch/{batch_id}/output",
        headers={"Authorization": f"Bearer {API_KEY}"}
    )
    with open(output_path, "w") as f:
        f.write(response.text)

# Usage
batch = submit_batch("requests.jsonl", sla="1h")
print(f"Batch {batch['id']} submitted")

result = wait_for_batch(batch["id"])
print(f"Batch completed: {result['completed_requests']} successful")

download_results(batch["id"], "results.jsonl")
LoRA Fine-Tuning: Custom Models Without the Infrastructure
LoRA (Low-Rank Adaptation) lets you customize a base model for your specific use case. Instead of training all 8 billion parameters, LoRA trains small adapter matrices (~30-100MB) that modify the model's behavior.
Why this matters:
- Fast training: Hours instead of days
- Low cost: $5-50 per training run instead of thousands
- Portable: Adapters are small files you can swap dynamically
- No infrastructure: We handle the GPUs
Creating a LoRA Adapter
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "customer-support-v1",
    "display_name": "Customer Support Specialist",
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "epochs": 3,
    "learning_rate": 0.0002,
    "rank": 16
  }'
Response:
{
  "id": "lora_xyz789",
  "name": "customer-support-v1",
  "base_model": "meta-llama/Llama-3.1-8B-Instruct",
  "training_status": "pending",
  "created_at": "2025-01-29T10:00:00Z"
}
Training Data Format
Prepare a JSONL file with conversation examples:
{"messages": [{"role": "system", "content": "You are a helpful customer support agent."}, {"role": "user", "content": "I want to return my order"}, {"role": "assistant", "content": "I'd be happy to help you with a return. Could you please provide your order number? Our return policy allows returns within 30 days of purchase."}]}
{"messages": [{"role": "user", "content": "My package hasn't arrived"}, {"role": "assistant", "content": "I apologize for the delay. Let me look up your tracking information. Could you share your order number so I can investigate this for you?"}]}
{"messages": [{"role": "user", "content": "Do you have this in blue?"}, {"role": "assistant", "content": "I'd be happy to check our inventory for you. Could you tell me which product you're interested in? I can then confirm color availability and sizes."}]}
Quality tips:
- 50-500 high-quality examples typically work well
- Be consistent in tone and format
- Include edge cases and variations
- Each example should demonstrate the behavior you want
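It's also worth validating the file before you upload it. A rough sanity-check sketch, based only on the format described above (one JSON object per line, a messages list of role/content pairs, and at least one assistant turn to learn from):
import json

VALID_ROLES = {"system", "user", "assistant"}

with open("training_data.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)  # raises if the line isn't valid JSON
        messages = example.get("messages", [])
        assert messages, f"line {line_no}: empty messages list"
        for m in messages:
            assert m.get("role") in VALID_ROLES, f"line {line_no}: bad role {m.get('role')!r}"
            assert isinstance(m.get("content"), str), f"line {line_no}: content must be a string"
        assert any(m["role"] == "assistant" for m in messages), \
            f"line {line_no}: needs an assistant turn to learn from"

print("training_data.jsonl looks well-formed")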
Uploading Training Data
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora/lora_xyz789/training-data \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@training_data.jsonl"
Starting Training
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora/lora_xyz789/train \
  -H "Authorization: Bearer YOUR_TOKEN"
Training typically takes 10-60 minutes depending on dataset size and epochs.
Using Your Fine-Tuned Model
Once training completes (training_status: "ready"), use it via the lora_adapter parameter:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "I need to return something"}
    ],
    extra_body={
        "lora_adapter": "lora_xyz789"
    }
)
The base model + your LoRA adapter combine at inference time. No model reloading required.
Training Parameters Explained
| Parameter | Default | Description |
|---|---|---|
| epochs | 3 | Number of passes through your data. More epochs = more learning, but risk of overfitting. Start with 3, increase if underfitting. |
| learning_rate | 0.0002 | How fast the model adapts. Lower = more stable but slower. Higher = faster but risk of instability. |
| rank | 16 | LoRA dimension. Higher = more capacity but a larger adapter. 8, 16, and 32 are common choices (see the size sketch below). |
| alpha | 32 | Scaling factor. Usually 2x the rank works well. |
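To see why adapters stay small even at higher rank, note that a LoRA adapter for a weight matrix of shape (out, in) adds only rank × (out + in) parameters instead of out × in. The sketch below uses illustrative, Llama-3.1-8B-style attention projection shapes; they are assumptions for illustration, not our exact training configuration.
# A LoRA adapter adds rank * (out_dim + in_dim) parameters per adapted matrix,
# instead of the out_dim * in_dim parameters of the full weight.
def lora_params(shapes, rank):
    return sum(rank * (out_dim + in_dim) for out_dim, in_dim in shapes)

# Illustrative shapes: q/k/v/o attention projections for a Llama-3.1-8B-style
# model (hidden size 4096, grouped-query KV dim 1024), repeated for 32 layers.
layer_shapes = [(4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096)]
shapes = layer_shapes * 32

for rank in (8, 16, 32):
    params = lora_params(shapes, rank)
    print(f"rank {rank}: ~{params / 1e6:.1f}M params, ~{params * 2 / 1e6:.0f} MB in fp16")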
The Economics: Why We're Cheaper
Let's do the math.
OpenAI GPT-4o-mini pricing:
- Input: $0.15 per 1M tokens
- Output: $0.60 per 1M tokens
OpenAI GPT-4o pricing:
- Input: $2.50 per 1M tokens
- Output: $10.00 per 1M tokens
Token Factory pricing (all tokens):
- Real-time: $0.10-0.15 per 1M tokens (varies by model)
- Batch 1h: $0.07-0.10 per 1M tokens
- Batch 24h: $0.05-0.08 per 1M tokens
For a typical chatbot processing 100M tokens/month (assuming an even split of input and output tokens):
| Provider | Monthly Cost |
|---|---|
| OpenAI GPT-4o | ~$625 |
| OpenAI GPT-4o-mini | ~$37.50 |
| Token Factory Real-time | ~$12 |
| Token Factory Batch 24h | ~$6 |
That's roughly 98% cheaper than GPT-4o, and 68-84% cheaper than GPT-4o-mini depending on the tier you choose.
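For transparency, here's the arithmetic behind that table, assuming a 50/50 split between input and output tokens and mid-range Token Factory rates:
def monthly_cost_usd(total_tokens, input_price, output_price, input_share=0.5):
    # Prices are per 1M tokens; assume half the tokens are input, half output.
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

tokens = 100_000_000  # 100M tokens per month
print(monthly_cost_usd(tokens, 2.50, 10.00))  # GPT-4o                       -> 625.0
print(monthly_cost_usd(tokens, 0.15, 0.60))   # GPT-4o-mini                  -> 37.5
print(monthly_cost_usd(tokens, 0.12, 0.12))   # Token Factory real-time, mid -> 12.0
print(monthly_cost_usd(tokens, 0.06, 0.06))   # Token Factory batch 24h, mid -> 6.0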
How We Achieve This
- Open-source models: Llama 3.1 8B matches GPT-3.5 quality for most tasks
- vLLM efficiency: Continuous batching means higher GPU utilization
- No margin stacking: We pass infrastructure savings directly to you
- Batch scheduling: 24h SLA lets us optimize GPU utilization further
API Reference
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/chat/completions | POST | OpenAI-compatible chat |
| /api/v1/models | GET | List available models |
| /api/v1/batch | POST | Create batch job |
| /api/v1/batch | GET | List batch jobs |
| /api/v1/batch/:id | GET | Get batch status |
| /api/v1/batch/:id/output | GET | Download results |
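Because the API is OpenAI-compatible, the same SDK client works against these endpoints too. For example, listing the available models (this assumes the models endpoint mirrors OpenAI's list response, which the SDK parses for you):
from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key",
)

# GET /api/v1/models through the SDK; iterating the page yields model objects.
for model in client.models.list():
    print(model.id)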
Authentication
All endpoints require an API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
Get your API key from Dashboard → API Keys.
Rate Limits
| Tier | Requests/min | Tokens/min |
|---|---|---|
| Free | 60 | 100K |
| Pro | 600 | 1M |
| Enterprise | Custom | Custom |
Error Handling
Errors follow OpenAI's format:
{
  "error": {
    "message": "Invalid API key",
    "type": "authentication_error",
    "param": null,
    "code": "invalid_api_key"
  }
}
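Since errors follow OpenAI's format, the OpenAI SDK's exception classes should map onto them cleanly (assuming standard HTTP status codes, e.g. 401 for authentication failures and 429 for rate limits). A minimal sketch; the backoff policy is just an example, not a recommendation:
import time
import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key",
)

def chat_with_retry(messages, retries=3):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="meta-llama/Llama-3.1-8B-Instruct",
                messages=messages,
            )
        except openai.AuthenticationError:
            raise  # a bad key won't fix itself; surface it immediately
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # back off, then retry
    raise RuntimeError("still rate limited after retries")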
Getting Started
- Sign up at dash.packet.ai
- Add funds to your wallet (start with $5)
- Create an API key in Dashboard → API Keys
- Start making requests using the OpenAI SDK
First 10,000 tokens are free. No credit card required to try.
Conclusion
Token Factory is what happens when you combine open-source models, optimized inference engines, and honest pricing. Same API you already know, 98% cheaper.
We're not trying to replace OpenAI for everything—GPT-4 is still unmatched for complex reasoning tasks. But for the 80% of use cases where Llama 3.1 is good enough, you shouldn't be paying enterprise prices.
Try it out. If it works for your use case, you'll save a lot of money. If it doesn't, you've lost nothing but a few minutes.
Questions? Email support@packet.ai or ping us on Twitter @packetai.
