Token Factory - Inference API

OpenAI-compatible inference API with real-time chat, batch processing, embeddings, structured outputs, function calling, and LoRA fine-tuning.

What is Token Factory?

Token Factory provides managed inference for large language models. Instead of managing your own GPU instance, you pay per token and we handle the infrastructure. Perfect for production APIs, batch processing, embeddings, and custom model fine-tuning.

Pricing

Token Factory offers competitive pricing across different tiers:

| Tier | Input (per 1M) | Output (per 1M) | Best For |
|------|----------------|-----------------|----------|
| Real-time | $0.10 | $0.10 | Interactive apps, chatbots |
| Batch 1h | $0.07 | $0.07 | Time-sensitive bulk work |
| Batch 24h | $0.05 | $0.05 | Maximum cost savings |
| Embeddings | $0.02 | - | Semantic search, RAG |
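
For example, summarizing 10,000 documents at roughly 2,000 input and 300 output tokens each comes out to a few dollars. A back-of-the-envelope estimate in Python, using the rates from the table above (the workload sizes are illustrative):

# Rough cost estimate; workload sizes are made up for illustration.
docs = 10_000
input_tokens = docs * 2_000    # 20M input tokens
output_tokens = docs * 300     # 3M output tokens

total_millions = (input_tokens + output_tokens) / 1_000_000
print(f"Real-time: ${total_millions * 0.10:.2f}")   # $2.30
print(f"Batch 24h: ${total_millions * 0.05:.2f}")   # $1.15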

Authentication

All API requests require an API key. Get your key from Dashboard → API Keys.

Using Your API Key

# Include in Authorization header
Authorization: Bearer pk_live_YOUR_API_KEY

# Or as a query parameter (less secure)
?api_key=pk_live_YOUR_API_KEY
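
To sanity-check a key, list the available models with the key in the Authorization header. A minimal sketch using the requests library (it assumes the OpenAI-style {"data": [...]} response shape that OpenAI-compatible APIs return):

import requests

# List models (GET /models) to verify the key works.
resp = requests.get(
    "https://dash.packet.ai/api/v1/models",
    headers={"Authorization": "Bearer pk_live_YOUR_API_KEY"},
)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])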

Real-Time Chat Completions

Use for interactive applications where users expect immediate responses.

cURL Example

curl -X POST https://dash.packet.ai/api/v1/chat/completions \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ],
    "max_tokens": 500,
    "temperature": 0.7
  }'

Python Example

from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=500,
    temperature=0.7
)

print(response.choices[0].message.content)

JavaScript/TypeScript Example

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://dash.packet.ai/api/v1',
  apiKey: 'pk_live_YOUR_API_KEY',
});

const response = await client.chat.completions.create({
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is machine learning?' }
  ],
  max_tokens: 500,
  temperature: 0.7,
});

console.log(response.choices[0].message.content);

Streaming Responses

For better UX, stream tokens as they're generated:

Python Streaming

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story about a robot"}],
    max_tokens=1000,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

JavaScript Streaming

const stream = await client.chat.completions.create({
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

Request Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | string | required | Model ID (e.g., "meta-llama/Llama-3.1-8B-Instruct") |
| messages | array | required | Conversation history with role/content objects |
| max_tokens | integer | 1024 | Maximum tokens to generate (1-4096) |
| temperature | number | 0.7 | Sampling randomness (0-2). Lower = more deterministic |
| top_p | number | 1.0 | Nucleus sampling parameter (0-1) |
| stream | boolean | false | Enable Server-Sent Events streaming |
| stop | array | null | Stop sequences (up to 4) |
| presence_penalty | number | 0 | Penalize repeated topics (-2 to 2) |
| frequency_penalty | number | 0 | Penalize repeated tokens (-2 to 2) |
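
A request combining several of these parameters, using the client created earlier (the values are illustrative):

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "List three uses for embeddings."}],
    max_tokens=200,
    temperature=0.2,         # mostly deterministic
    top_p=0.9,               # nucleus sampling
    stop=["\n\n"],           # stop at the first blank line
    presence_penalty=0.5,    # discourage revisiting topics
    frequency_penalty=0.5,   # discourage repeating tokens
)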

Structured Outputs (JSON Mode)

Force the model to output valid JSON that matches a specific schema. Perfect for API responses, data extraction, and programmatic processing.

When to Use Structured Outputs

  • Extracting structured data from text (names, dates, entities)
  • Building APIs that need consistent response formats
  • Data validation and form filling
  • Converting natural language to structured commands

JSON Mode (Simple)

Request any valid JSON response:

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that outputs JSON."},
        {"role": "user", "content": "Extract the person's name and age from: John Smith is 35 years old."}
    ],
    response_format={"type": "json_object"}
)

# Output: {"name": "John Smith", "age": 35}

JSON Schema (Strict)

Enforce a specific schema for guaranteed structure:

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Extract product information from the user's message."},
        {"role": "user", "content": "I want to buy 3 apples at $2 each"}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "product_order",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                    "total_price": {"type": "number"}
                },
                "required": ["product_name", "quantity", "unit_price", "total_price"]
            }
        }
    }
)

# Output: {"product_name": "apples", "quantity": 3, "unit_price": 2.0, "total_price": 6.0}

JavaScript JSON Schema Example

const response = await client.chat.completions.create({
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  messages: [
    { role: 'system', content: 'Extract event information.' },
    { role: 'user', content: 'Meeting with John tomorrow at 3pm in Conference Room A' }
  ],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'calendar_event',
      strict: true,
      schema: {
        type: 'object',
        properties: {
          title: { type: 'string' },
          attendees: { type: 'array', items: { type: 'string' } },
          date: { type: 'string' },
          time: { type: 'string' },
          location: { type: 'string' }
        },
        required: ['title', 'attendees', 'date', 'time', 'location']
      }
    }
  }
});

Function Calling (Tools)

Enable the model to call functions/tools that you define. The model determines when to call functions and with what arguments.

Function Calling Use Cases

  • Connecting LLMs to external APIs (weather, databases, etc.)
  • Building AI agents that can take actions
  • Multi-step workflows with tool use
  • Structured data retrieval and manipulation

Defining Tools

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search for products in the database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "category": {"type": "string", "description": "Product category"},
                    "max_results": {"type": "integer", "default": 10}
                },
                "required": ["query"]
            }
        }
    }
]

Complete Function Calling Example (Python)

import json
from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

# Define your tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

# Your actual function implementation
def get_weather(location: str, unit: str = "celsius") -> dict:
    # In production, call a real weather API
    return {"location": location, "temperature": 22, "unit": unit, "condition": "sunny"}

# First API call - model decides to use a tool
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto"  # Let model decide when to use tools
)

# Check if model wants to call a function
message = response.choices[0].message
if message.tool_calls:
    # Execute the function(s)
    tool_results = []
    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        function_args = json.loads(tool_call.function.arguments)

        if function_name == "get_weather":
            result = get_weather(**function_args)

        tool_results.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })

    # Second API call - send results back to model
    final_response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"},
            message,  # Include the assistant's tool call
            *tool_results  # Include tool results
        ]
    )

    print(final_response.choices[0].message.content)
    # Output: "The weather in Paris is currently sunny with a temperature of 22°C."

Tool Choice Options

| Value | Behavior |
|-------|----------|
| "auto" | Model decides whether to call functions (default) |
| "none" | Never call functions |
| "required" | Must call at least one function |
| {"type": "function", "function": {"name": "get_weather"}} | Force a specific function |

Embeddings

Generate vector embeddings for text. Use for semantic search, RAG (Retrieval Augmented Generation), clustering, and similarity matching.

Basic Embedding Request

curl -X POST https://dash.packet.ai/api/v1/embeddings \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "The quick brown fox jumps over the lazy dog."
  }'

Python Embeddings Example

from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

# Single text embedding
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Machine learning is a subset of artificial intelligence."
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")  # e.g., 1536

# Multiple texts at once
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=[
        "First document about machine learning",
        "Second document about deep learning",
        "Third document about neural networks"
    ]
)

for i, item in enumerate(response.data):
    print(f"Document {i}: {len(item.embedding)} dimensions")

Semantic Search Example

import numpy as np
from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

# Your document corpus
documents = [
    "Python is a programming language known for simplicity",
    "JavaScript is used for web development",
    "Machine learning uses algorithms to learn from data",
    "React is a JavaScript library for building UIs",
    "TensorFlow is a machine learning framework"
]

# Create embeddings for all documents
doc_response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=documents
)
doc_embeddings = [d.embedding for d in doc_response.data]

# Create embedding for search query
query = "How do I build AI models?"
query_response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=query
)
query_embedding = query_response.data[0].embedding

# Calculate cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Find most similar documents
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

print("Most relevant documents:")
for idx, score in ranked[:3]:
    print(f"  {score:.3f}: {documents[idx]}")

Embedding Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| model | string | Embedding model (e.g., "text-embedding-ada-002") |
| input | string or array | Text(s) to embed. Max 8192 tokens per input. |
| encoding_format | string | "float" (default) or "base64" |

Batch Processing

Process large volumes of requests with significant cost savings (up to 50% off).

Step 1: Prepare JSONL File

Create a file with one request per line:

{"custom_id": "req-001", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: AI is transforming industries..."}], "max_tokens": 200}}
{"custom_id": "req-002", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Machine learning enables..."}], "max_tokens": 200}}
{"custom_id": "req-003", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize: Deep learning uses neural..."}], "max_tokens": 200}}

Step 2: Submit Batch Job

curl -X POST https://dash.packet.ai/api/v1/batch \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY" \
  -F "file=@batch_requests.jsonl" \
  -F "sla=24h"

# Response:
# {
#   "id": "batch_abc123",
#   "status": "queued",
#   "created_at": "2024-01-15T10:00:00Z",
#   "total_requests": 3
# }

Step 3: Check Status

curl https://dash.packet.ai/api/v1/batch/batch_abc123 \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY"

# Response:
# {
#   "id": "batch_abc123",
#   "status": "completed",
#   "completed_requests": 3,
#   "failed_requests": 0
# }

Step 4: Download Results

curl https://dash.packet.ai/api/v1/batch/batch_abc123/results \
  -H "Authorization: Bearer pk_live_YOUR_API_KEY" \
  -o results.jsonl
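
The same workflow in Python, translated from the cURL commands above (a sketch; production code should also handle failed or cancelled jobs):

import time
import requests

BASE = "https://dash.packet.ai/api/v1"
HEADERS = {"Authorization": "Bearer pk_live_YOUR_API_KEY"}

# Step 2: submit the batch job (multipart upload, 24h SLA)
with open("batch_requests.jsonl", "rb") as f:
    job = requests.post(f"{BASE}/batch", headers=HEADERS,
                        files={"file": f}, data={"sla": "24h"}).json()

# Step 3: poll until the job completes
while True:
    status = requests.get(f"{BASE}/batch/{job['id']}", headers=HEADERS).json()
    if status["status"] == "completed":
        break
    time.sleep(60)

# Step 4: download the results
results = requests.get(f"{BASE}/batch/{job['id']}/results", headers=HEADERS)
with open("results.jsonl", "wb") as f:
    f.write(results.content)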

Batch SLA Options

| SLA | Completion Time | Discount |
|-----|-----------------|----------|
| 1h | Within 1 hour | 30% off |
| 24h | Within 24 hours | 50% off |

LoRA Fine-Tuning

Customize models for your specific use case with efficient LoRA (Low-Rank Adaptation) fine-tuning.

What is LoRA?

LoRA trains a small set of additional low-rank parameters on top of a frozen base model, enabling faster training, lower costs, and compact adapters (30-100MB) that can be loaded dynamically without modifying the base model.
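
To see why adapters stay so small, compare the trainable parameter counts for a single weight matrix: LoRA learns two low-rank factors instead of updating the full matrix. A rough illustration (the layer size is typical of an 8B model; the numbers are for intuition only, not a spec of our training setup):

d, r = 4096, 16           # hidden size, LoRA rank

full_update = d * d       # ~16.8M params to update the full matrix
lora_update = 2 * d * r   # ~131K params for the two low-rank factors

print(f"Trainable params per matrix: {lora_update:,} vs {full_update:,}")
print(f"Reduction: {full_update / lora_update:.0f}x")  # 128x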

Step 1: Create Adapter

Create LoRA adapters from the Token Factory tab in your dashboard, or via API:

curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora \
  -H "Authorization: Bearer YOUR_DASHBOARD_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "customer-support-v1",
    "display_name": "Customer Support Model",
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "epochs": 3,
    "learning_rate": 0.0002,
    "rank": 16
  }'

Step 2: Upload Training Data

Create a JSONL file with conversation examples:

{"messages": [{"role": "system", "content": "You are a helpful customer service agent."}, {"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "I'd be happy to help you track your order. Could you please provide your order number?"}]}
{"messages": [{"role": "user", "content": "I want to return this item"}, {"role": "assistant", "content": "I can help you with that return. Our return policy allows returns within 30 days. Would you like me to start the return process?"}]}
{"messages": [{"role": "user", "content": "How long does shipping take?"}, {"role": "assistant", "content": "Standard shipping typically takes 3-5 business days. Express shipping is 1-2 business days. Would you like to know the shipping cost for your location?"}]}

Training Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| epochs | 3 | Training passes over the data (1-10) |
| learning_rate | 0.0002 | How fast the model adapts (0.0001-0.001) |
| rank | 16 | LoRA dimension; higher = more capacity (8, 16, 32, 64) |
| alpha | 32 | Scaling factor (typically 2x rank) |
| dropout | 0.05 | Regularization to prevent overfitting |

API Reference

Base URL

https://dash.packet.ai/api/v1

Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /models | GET | List available models |
| /chat/completions | POST | Create chat completion |
| /completions | POST | Create text completion |
| /embeddings | POST | Create text embeddings |
| /batch | GET | List batch jobs |
| /batch | POST | Create batch job |
| /batch/:id | GET | Get batch job status |
| /batch/:id/results | GET | Download batch results |
| /batch/:id | DELETE | Cancel batch job |

Error Handling

The API returns standard HTTP status codes with detailed error messages.

Error Response Format

{
  "error": {
    "message": "Invalid API key provided",
    "type": "authentication_error",
    "code": "invalid_api_key"
  }
}

Common Error Codes

| HTTP Status | Error Type | Description |
|-------------|------------|-------------|
| 400 | invalid_request | Malformed request or invalid parameters |
| 401 | authentication_error | Missing or invalid API key |
| 402 | insufficient_balance | Not enough credits in wallet |
| 404 | not_found | Model or resource not found |
| 429 | rate_limit_exceeded | Too many requests |
| 500 | server_error | Internal server error |
| 503 | service_unavailable | Model is loading or service is down |

Python Error Handling Example

from openai import OpenAI, APIError, RateLimitError, AuthenticationError

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="pk_live_YOUR_API_KEY"
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Hello"}]
    )
except AuthenticationError:
    print("Invalid API key - check your credentials")
except RateLimitError:
    print("Rate limited - implement exponential backoff")
except APIError as e:
    print(f"API error: {e.message}")

Rate Limits

Rate limits protect the service and ensure fair usage.

| Tier | Requests/min | Tokens/min |
|------|--------------|------------|
| Free | 20 | 40,000 |
| Standard | 60 | 150,000 |
| Pro | 200 | 500,000 |

Rate limit headers are included in responses:

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 55
X-RateLimit-Reset: 1705309200
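
When you hit a 429, wait and retry with exponentially increasing delays. A minimal sketch using the SDK (chat_with_backoff is a hypothetical helper, and client is the OpenAI client created earlier):

import time
from openai import RateLimitError

def chat_with_backoff(client, max_retries=5, **kwargs):
    # Retry on 429s, doubling the wait each time (1s, 2s, 4s, ...)
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("Still rate limited after retries")

response = chat_with_backoff(
    client,
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)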

Best Practices

For Real-Time Chat

  • Use streaming for better user experience on long responses
  • Set max_tokens to control response length and costs
  • Include system prompts for consistent behavior
  • Use temperature 0 for deterministic outputs

For Structured Outputs

  • Use JSON schema when you need guaranteed structure
  • Include examples in system prompt for complex schemas
  • Validate output even with strict mode enabled

For Function Calling

  • Write clear descriptions for functions and parameters
  • Use tool_choice="auto" to let model decide
  • Handle edge cases where model may not call expected functions

For Batch Processing

  • Use 24h SLA when possible for 50% savings
  • Include custom_ids to match results to requests
  • Validate JSONL before uploading
  • Monitor job status via webhook or polling

Troubleshooting

"Model not found"

  • Check the exact model name against the /models endpoint
  • Ensure model is available and loaded

"Invalid training data format"

  • Each line must be valid JSON
  • Each example needs a messages array
  • Minimum 10 examples required

"Insufficient wallet balance"

  • Add funds in Dashboard → Billing
  • Estimate costs before large batch jobs

"Context length exceeded"

  • Reduce input tokens or use a model with larger context
  • Summarize or truncate conversation history

Need Help?

Contact us at support@packet.ai