Token Factory - Inference API
OpenAI-compatible inference API with real-time chat, batch processing, embeddings, structured outputs, function calling, and LoRA fine-tuning.
What is Token Factory?
Token Factory provides managed inference for large language models. Instead of managing your own GPU instance, you pay per token and we handle the infrastructure. Perfect for production APIs, batch processing, embeddings, and custom model fine-tuning.
Table of Contents
- Pricing
- Authentication
- Real-Time Chat Completions
- Structured Outputs (JSON Mode)
- Function Calling (Tools)
- Embeddings
- Batch Processing
- LoRA Fine-Tuning
- API Reference
- Error Handling
- Rate Limits
Pricing
Token Factory offers competitive pricing across different tiers:
| Tier | Input (per 1M) | Output (per 1M) | Best For |
|---|---|---|---|
| Real-time | $0.10 | $0.10 | Interactive apps, chatbots |
| Batch 1h | $0.07 | $0.07 | Time-sensitive bulk work |
| Batch 24h | $0.05 | $0.05 | Maximum cost savings |
| Embeddings | $0.02 | - | Semantic search, RAG |
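To estimate spend before committing to a tier, multiply token counts by the per-million rates above. A minimal sketch, with rates copied from the table (adjust if pricing changes):

# Estimate cost in USD for a workload, using the rates from the table above.
RATES = {  # (input, output) USD per 1M tokens
    "realtime": (0.10, 0.10),
    "batch_1h": (0.07, 0.07),
    "batch_24h": (0.05, 0.05),
}

def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = RATES[tier]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# 10M input tokens and 2M output tokens on the 24h batch tier:
print(f"${estimate_cost('batch_24h', 10_000_000, 2_000_000):.2f}")  # $0.60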
Authentication
All API requests require an API key. Get your key from Dashboard → API Keys.
Using Your API Key
# Include in Authorization header
Authorization: Bearer pk_live_YOUR_API_KEY
# Or as a query parameter (less secure)
?api_key=pk_live_YOUR_API_KEY
Real-Time Chat Completions
Use for interactive applications where users expect immediate responses.
cURL Example
curl -X POST https://dash.packet.ai/api/v1/chat/completions \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
"max_tokens": 500,
"temperature": 0.7
}'
Python Example
from openai import OpenAI
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
max_tokens=500,
temperature=0.7
)
print(response.choices[0].message.content)
JavaScript/TypeScript Example
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'https://dash.packet.ai/api/v1',
apiKey: 'pk_live_YOUR_API_KEY',
});
const response = await client.chat.completions.create({
model: 'meta-llama/Llama-3.1-8B-Instruct',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is machine learning?' }
],
max_tokens: 500,
temperature: 0.7,
});
console.log(response.choices[0].message.content);
Streaming Responses
For better UX, stream tokens as they're generated:
Python Streaming
stream = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Tell me a story about a robot"}],
max_tokens=1000,
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
JavaScript Streaming
const stream = await client.chat.completions.create({
model: 'meta-llama/Llama-3.1-8B-Instruct',
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model ID (e.g., "meta-llama/Llama-3.1-8B-Instruct") |
| messages | array | required | Conversation history with role/content objects |
| max_tokens | integer | 1024 | Maximum tokens to generate (1-4096) |
| temperature | number | 0.7 | Sampling randomness (0-2). Lower = more deterministic |
| top_p | number | 1.0 | Nucleus sampling parameter (0-1) |
| stream | boolean | false | Enable Server-Sent Events streaming |
| stop | array | null | Stop sequences (up to 4) |
| presence_penalty | number | 0 | Penalize repeated topics (-2 to 2) |
| frequency_penalty | number | 0 | Penalize repeated tokens (-2 to 2) |
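As an illustration of how these parameters combine, here is a request that asks for short, reproducible output with a custom stop sequence (the specific values are arbitrary; client is the one configured in the Python example above):

# Deterministic, length-limited completion using several of the parameters above.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "List three uses for embeddings."}],
    max_tokens=200,         # cap generation length and cost
    temperature=0,          # greedy decoding for near-deterministic output
    stop=["\n\n"],          # stop at the first blank line
    frequency_penalty=0.5,  # discourage repeated tokens
)
print(response.choices[0].message.content)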
Structured Outputs (JSON Mode)
Force the model to output valid JSON that matches a specific schema. Perfect for API responses, data extraction, and programmatic processing.
When to Use Structured Outputs
- Extracting structured data from text (names, dates, entities)
- Building APIs that need consistent response formats
- Data validation and form filling
- Converting natural language to structured commands
JSON Mode (Simple)
Request any valid JSON response:
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant that outputs JSON."},
{"role": "user", "content": "Extract the person's name and age from: John Smith is 35 years old."}
],
response_format={"type": "json_object"}
)
# Output: {"name": "John Smith", "age": 35}JSON Schema (Strict)
Enforce a specific schema for guaranteed structure:
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "Extract product information from the user's message."},
{"role": "user", "content": "I want to buy 3 apples at $2 each"}
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "product_order",
"strict": True,
"schema": {
"type": "object",
"properties": {
"product_name": {"type": "string"},
"quantity": {"type": "integer"},
"unit_price": {"type": "number"},
"total_price": {"type": "number"}
},
"required": ["product_name", "quantity", "unit_price", "total_price"]
}
}
}
)
# Output: {"product_name": "apples", "quantity": 3, "unit_price": 2.0, "total_price": 6.0}JavaScript JSON Schema Example
const response = await client.chat.completions.create({
model: 'meta-llama/Llama-3.1-8B-Instruct',
messages: [
{ role: 'system', content: 'Extract event information.' },
{ role: 'user', content: 'Meeting with John tomorrow at 3pm in Conference Room A' }
],
response_format: {
type: 'json_schema',
json_schema: {
name: 'calendar_event',
strict: true,
schema: {
type: 'object',
properties: {
title: { type: 'string' },
attendees: { type: 'array', items: { type: 'string' } },
date: { type: 'string' },
time: { type: 'string' },
location: { type: 'string' }
},
required: ['title', 'attendees', 'date', 'time', 'location']
}
}
}
});
Function Calling (Tools)
Enable the model to call functions/tools that you define. The model determines when to call functions and with what arguments.
Function Calling Use Cases
- Connecting LLMs to external APIs (weather, databases, etc.)
- Building AI agents that can take actions
- Multi-step workflows with tool use
- Structured data retrieval and manipulation
Defining Tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
},
{
"type": "function",
"function": {
"name": "search_database",
"description": "Search for products in the database",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"},
"category": {"type": "string", "description": "Product category"},
"max_results": {"type": "integer", "default": 10}
},
"required": ["query"]
}
}
}
]
Complete Function Calling Example (Python)
import json
from openai import OpenAI
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
# Define your tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}
}
]
# Your actual function implementation
def get_weather(location: str, unit: str = "celsius") -> dict:
# In production, call a real weather API
return {"location": location, "temperature": 22, "unit": unit, "condition": "sunny"}
# First API call - model decides to use a tool
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "What's the weather in Paris?"}],
tools=tools,
tool_choice="auto" # Let model decide when to use tools
)
# Check if model wants to call a function
message = response.choices[0].message
if message.tool_calls:
# Execute the function(s)
tool_results = []
for tool_call in message.tool_calls:
function_name = tool_call.function.name
function_args = json.loads(tool_call.function.arguments)
if function_name == "get_weather":
result = get_weather(**function_args)
tool_results.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result)
})
# Second API call - send results back to model
final_response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "What's the weather in Paris?"},
message, # Include the assistant's tool call
*tool_results # Include tool results
]
)
print(final_response.choices[0].message.content)
# Output: "The weather in Paris is currently sunny with a temperature of 22°C."Tool Choice Options
| Value | Behavior |
|---|---|
"auto" | Model decides whether to call functions (default) |
"none" | Never call functions |
"required" | Must call at least one function |
{"type": "function", "function": {"name": "get_weather"}} | Force a specific function |
Embeddings
Generate vector embeddings for text. Use for semantic search, RAG (Retrieval Augmented Generation), clustering, and similarity matching.
Basic Embedding Request
curl -X POST https://dash.packet.ai/api/v1/embeddings \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-ada-002",
"input": "The quick brown fox jumps over the lazy dog."
}'
Python Embeddings Example
from openai import OpenAI
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
# Single text embedding
response = client.embeddings.create(
model="text-embedding-ada-002",
input="Machine learning is a subset of artificial intelligence."
)
embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}") # e.g., 1536
# Multiple texts at once
response = client.embeddings.create(
model="text-embedding-ada-002",
input=[
"First document about machine learning",
"Second document about deep learning",
"Third document about neural networks"
]
)
for i, item in enumerate(response.data):
print(f"Document {i}: {len(item.embedding)} dimensions")Semantic Search Example
import numpy as np
from openai import OpenAI
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
# Your document corpus
documents = [
"Python is a programming language known for simplicity",
"JavaScript is used for web development",
"Machine learning uses algorithms to learn from data",
"React is a JavaScript library for building UIs",
"TensorFlow is a machine learning framework"
]
# Create embeddings for all documents
doc_response = client.embeddings.create(
model="text-embedding-ada-002",
input=documents
)
doc_embeddings = [d.embedding for d in doc_response.data]
# Create embedding for search query
query = "How do I build AI models?"
query_response = client.embeddings.create(
model="text-embedding-ada-002",
input=query
)
query_embedding = query_response.data[0].embedding
# Calculate cosine similarity
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Find most similar documents
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)
print("Most relevant documents:")
for idx, score in ranked[:3]:
print(f" {score:.3f}: {documents[idx]}")Embedding Parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | Embedding model (e.g., "text-embedding-ada-002") |
| input | string or array | Text(s) to embed. Max 8192 tokens per input. |
| encoding_format | string | "float" (default) or "base64" |
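When embedding a large corpus, send inputs in batches rather than one request per document. A minimal sketch (the batch size of 64 is an arbitrary choice; client is the one configured above):

def embed_corpus(texts: list[str], batch_size: int = 64) -> list[list[float]]:
    """Embed a list of texts in batches; returns one vector per input text."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = client.embeddings.create(
            model="text-embedding-ada-002",
            input=texts[i:i + batch_size],
        )
        # Results come back in input order, one item per text.
        vectors.extend(item.embedding for item in resp.data)
    return vectors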
Batch Processing
Process large volumes of requests with significant cost savings (up to 50% off).
Step 1: Prepare JSONL File
Create a file with one request per line:
{"custom_id": "req-001", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: AI is transforming industries..."}], "max_tokens": 200}}
{"custom_id": "req-002", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Machine learning enables..."}], "max_tokens": 200}}
{"custom_id": "req-003", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Deep learning uses neural..."}], "max_tokens": 200}}Step 2: Submit Batch Job
curl -X POST https://dash.packet.ai/api/v1/batch \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-F "file=@batch_requests.jsonl" \
-F "sla=24h"
# Response:
# {
# "id": "batch_abc123",
# "status": "queued",
# "created_at": "2024-01-15T10:00:00Z",
# "total_requests": 3
# }
Step 3: Check Status
curl https://dash.packet.ai/api/v1/batch/batch_abc123 \
-H "Authorization: Bearer pk_live_YOUR_API_KEY"
# Response:
# {
# "id": "batch_abc123",
# "status": "completed",
# "completed_requests": 3,
# "failed_requests": 0
# }
Step 4: Download Results
curl https://dash.packet.ai/api/v1/batch/batch_abc123/results \
-H "Authorization: Bearer pk_live_YOUR_API_KEY" \
-o results.jsonl
Batch SLA Options
| SLA | Completion Time | Discount |
|---|---|---|
| 1h | Within 1 hour | 30% off |
| 24h | Within 24 hours | 50% off |
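Putting the three steps together, here is a sketch of the full workflow using the requests library. It assumes the endpoints, fields, and status values shown in the curl examples above ("failed" as a terminal state is an assumption), and the 60-second poll interval is arbitrary:

import time
import requests

API = "https://dash.packet.ai/api/v1"
HEADERS = {"Authorization": "Bearer pk_live_YOUR_API_KEY"}

# Step 2: submit the JSONL file prepared in Step 1.
with open("batch_requests.jsonl", "rb") as f:
    job = requests.post(f"{API}/batch", headers=HEADERS,
                        files={"file": f}, data={"sla": "24h"}).json()

# Step 3: poll until the job reaches a terminal state (a webhook avoids polling).
while True:
    status = requests.get(f"{API}/batch/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(60)

# Step 4: download the results file.
results = requests.get(f"{API}/batch/{job['id']}/results", headers=HEADERS)
with open("results.jsonl", "wb") as out:
    out.write(results.content)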
LoRA Fine-Tuning
Customize models for your specific use case with efficient LoRA (Low-Rank Adaptation) fine-tuning.
What is LoRA?
LoRA adds small trainable parameters to a base model, enabling faster training, lower costs, and compact adapters (30-100MB) that can be loaded dynamically without modifying the base model.
Step 1: Create Adapter
Create LoRA adapters from the Token Factory tab in your dashboard, or via API:
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora \
-H "Authorization: Bearer YOUR_DASHBOARD_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "customer-support-v1",
"display_name": "Customer Support Model",
"base_model": "meta-llama/Llama-3.1-8B-Instruct",
"epochs": 3,
"learning_rate": 0.0002,
"rank": 16
}'
Step 2: Upload Training Data
Create a JSONL file with conversation examples:
{"messages": [{"role": "system", "content": "You are a helpful customer service agent."}, {"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "I'd be happy to help you track your order. Could you please provide your order number?"}]}
{"messages": [{"role": "user", "content": "I want to return this item"}, {"role": "assistant", "content": "I can help you with that return. Our return policy allows returns within 30 days. Would you like me to start the return process?"}]}
{"messages": [{"role": "user", "content": "How long does shipping take?"}, {"role": "assistant", "content": "Standard shipping typically takes 3-5 business days. Express shipping is 1-2 business days. Would you like to know the shipping cost for your location?"}]}Training Parameters
| Parameter | Default | Description |
|---|---|---|
| epochs | 3 | Training passes over the data (1-10) |
| learning_rate | 0.0002 | How fast the model adapts (0.0001-0.001) |
| rank | 16 | LoRA dimension; higher = more capacity (8, 16, 32, 64) |
| alpha | 32 | Scaling factor (typically 2x rank) |
| dropout | 0.05 | Regularization to prevent overfitting |
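Before uploading, validate the training file locally. This sketch checks the three requirements listed under Troubleshooting below: every line is valid JSON, every example has a messages array, and there are at least 10 examples (the filename is illustrative):

import json

def validate_training_file(path: str) -> None:
    examples = 0
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)  # each line must be valid JSON
            except json.JSONDecodeError as e:
                raise ValueError(f"line {lineno}: invalid JSON ({e})")
            if not isinstance(record.get("messages"), list):
                raise ValueError(f"line {lineno}: missing 'messages' array")
            examples += 1
    if examples < 10:
        raise ValueError(f"need at least 10 examples, found {examples}")
    print(f"OK: {examples} examples")

validate_training_file("training_data.jsonl")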
API Reference
Base URL
https://dash.packet.ai/api/v1
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /models | GET | List available models |
| /chat/completions | POST | Create chat completion |
| /completions | POST | Create text completion |
| /embeddings | POST | Create text embeddings |
| /batch | GET | List batch jobs |
| /batch | POST | Create batch job |
| /batch/:id | GET | Get batch job status |
| /batch/:id/results | GET | Download batch results |
| /batch/:id | DELETE | Cancel batch job |
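Because the API is OpenAI-compatible, these endpoints work through the standard SDK as well. For example, listing models with the client configured earlier:

# List the models currently available to your key via GET /models.
models = client.models.list()
for m in models.data:
    print(m.id)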
Error Handling
The API returns standard HTTP status codes with detailed error messages.
Error Response Format
{
"error": {
"message": "Invalid API key provided",
"type": "authentication_error",
"code": "invalid_api_key"
}
}
Common Error Codes
| HTTP Status | Error Type | Description |
|---|---|---|
| 400 | invalid_request | Malformed request or invalid parameters |
| 401 | authentication_error | Missing or invalid API key |
| 402 | insufficient_balance | Not enough credits in wallet |
| 404 | not_found | Model or resource not found |
| 429 | rate_limit_exceeded | Too many requests |
| 500 | server_error | Internal server error |
| 503 | service_unavailable | Model is loading or service is down |
Python Error Handling Example
from openai import OpenAI, APIError, RateLimitError, AuthenticationError
client = OpenAI(
base_url="https://dash.packet.ai/api/v1",
api_key="pk_live_YOUR_API_KEY"
)
try:
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello"}]
)
except AuthenticationError:
print("Invalid API key - check your credentials")
except RateLimitError:
print("Rate limited - implement exponential backoff")
except APIError as e:
print(f"API error: {e.message}")Rate Limits
Rate limits protect the service and ensure fair usage.
| Tier | Requests/min | Tokens/min |
|---|---|---|
| Free | 20 | 40,000 |
| Standard | 60 | 150,000 |
| Pro | 200 | 500,000 |
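When you receive a 429, back off and retry rather than resending immediately. A minimal sketch using the RateLimitError exception from the error-handling example above (retry counts and delays are arbitrary):

import time
from openai import RateLimitError

def create_with_backoff(max_retries: int = 5, **kwargs):
    """Retry chat completions on 429 with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("still rate limited after retries")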
Rate limit headers are included in responses:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 55
X-RateLimit-Reset: 1705309200
Best Practices
For Real-Time Chat
- Use streaming for better user experience on long responses
- Set max_tokens to control response length and costs
- Include system prompts for consistent behavior
- Use temperature 0 for deterministic outputs
For Structured Outputs
- Use JSON schema when you need guaranteed structure
- Include examples in system prompt for complex schemas
- Validate output even with strict mode enabled (a minimal sketch follows this list)
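A minimal validation sketch, using the keys from the product_order schema shown earlier:

import json

raw = response.choices[0].message.content
try:
    order = json.loads(raw)
except json.JSONDecodeError:
    raise ValueError(f"model returned non-JSON output: {raw!r}")

# Check required keys even though strict mode should guarantee them.
missing = {"product_name", "quantity", "unit_price", "total_price"} - order.keys()
if missing:
    raise ValueError(f"missing keys: {missing}")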
For Function Calling
- Write clear descriptions for functions and parameters
- Use tool_choice="auto" to let model decide
- Handle edge cases where model may not call expected functions
For Batch Processing
- Use 24h SLA when possible for 50% savings
- Include custom_ids to match results to requests
- Validate JSONL before uploading
- Monitor job status via webhook or polling
Troubleshooting
"Model not found"
- Check the exact model name from the /v1/models endpoint
- Ensure the model is available and loaded
"Invalid training data format"
- Each line must be valid JSON
- Each example needs a messages array
- Minimum 10 examples required
"Insufficient wallet balance"
- Add funds in Dashboard → Billing
- Estimate costs before large batch jobs
"Context length exceeded"
- Reduce input tokens or use a model with larger context
- Summarize or truncate conversation history
Need Help?
Contact us at support@packet.ai
