Cascade routing vs. direct pricing

Cascade routing starts every request at the cheapest provider. If the response quality is low, it escalates to the next tier. You only pay premium prices when the cheap answer was genuinely not good enough.
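The loop can be sketched in a few lines. Everything here is a stand-in: `call_provider` and `quality_score` are hypothetical placeholders for the real provider call and the quality check described later, and the provider names are illustrative only.

```python
# Sketch of a cascade router. `providers` is ordered cheapest-first.
def call_provider(name, prompt):
    # Placeholder: in practice, an API call to the named provider.
    return f"{name} answer to: {prompt}"

def quality_score(response):
    # Placeholder: in practice, a weighted check of quality signals.
    return 0.9 if "answer" in response else 0.2

def cascade(prompt, providers, threshold=0.7, max_escalations=3):
    for tier, name in enumerate(providers[:max_escalations + 1]):
        response = call_provider(name, prompt)
        if quality_score(response) >= threshold:
            return tier + 1, response  # accept at this tier
    return tier + 1, response  # out of escalations: keep the last answer

tier, answer = cascade("What is 2+2?", ["groq", "gemini", "anthropic", "openai"])
```

Most requests never leave the first iteration, which is where the cost savings come from.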


Provider costs per 1K tokens

| tier | provider  | model            | cost / 1K tokens |
|------|-----------|------------------|------------------|
| 1    | Groq      | Llama 3.3 70B    | $0.0003          |
| 2    | Gemini    | Gemini 2.0 Flash | $0.0005          |
| 3    | Anthropic | Claude Sonnet    | $0.0030          |
| 4    | OpenAI    | GPT-4o           | $0.0050          |

Tier 4 costs roughly 17x more per token than tier 1 ($0.0050 vs. $0.0003). Cascade routing sends 70-80% of requests to tier 1.


1,000 requests: cascade vs. direct

Simulated distribution based on typical agent workloads (~800 tokens/request average). Costs count only the tier that produced the accepted response.

| tier | provider   | requests | % of total | cost    |
|------|------------|----------|------------|---------|
| 1    | Groq       | 750      | 75%        | $0.1800 |
| 2    | Gemini     | 150      | 15%        | $0.0600 |
| 3    | Anthropic  | 70       | 7%         | $0.1680 |
| 4    | OpenAI     | 30       | 3%         | $0.1200 |
|      | CASCADE    | 1,000    | 100%       | $0.5280 |
|      | ALL GPT-4o |          |            | $4.0000 (cascade saves 87%) |
|      | ALL CLAUDE |          |            | $2.4000 (cascade saves 78%) |
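These totals are just a weighted sum over the table, assuming the stated ~800 tokens per request:

```python
PRICE = {1: 0.0003, 2: 0.0005, 3: 0.0030, 4: 0.0050}  # $ per 1K tokens
REQUESTS = {1: 750, 2: 150, 3: 70, 4: 30}
TOKENS_PER_REQ = 800

# Cost = requests x (tokens / 1000) x price-per-1K, summed over tiers
cascade = sum(REQUESTS[t] * TOKENS_PER_REQ / 1000 * PRICE[t] for t in PRICE)
all_gpt4o = 1000 * TOKENS_PER_REQ / 1000 * PRICE[4]

print(f"${cascade:.4f}")                       # $0.5280
print(f"saves {1 - cascade / all_gpt4o:.0%}")  # saves 87%
```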

How cascade routing decides

After each tier responds, four signals determine whether to accept or escalate.

| signal            | weight | what it checks |
|-------------------|--------|----------------|
| Refusal detection | 30%    | "As an AI...", "I cannot help with...", content policy triggers |
| Completeness      | 30%    | Did the response actually address the question? |
| Hedging language  | 25%    | "It depends", "I'm not sure", "generally speaking" |
| Length ratio      | 15%    | Response length vs. expected for the task complexity |

Combined score above 0.7 = accept. Below 0.7 = escalate to the next tier. Maximum 3 escalations.
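A minimal sketch of the weighted decision, assuming each signal has already been normalized to a 0..1 score (1 = good); the detectors that produce those scores are hypothetical and not shown:

```python
WEIGHTS = {
    "refusal": 0.30,       # 1.0 when no refusal phrases detected
    "completeness": 0.30,  # 1.0 when the question is fully addressed
    "hedging": 0.25,       # 1.0 when no hedging language detected
    "length_ratio": 0.15,  # 1.0 when length matches task complexity
}

def combined_score(signals):
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

def decide(signals, threshold=0.7):
    return "accept" if combined_score(signals) > threshold else "escalate"

# A confident, complete answer passes; a hedged partial refusal escalates.
good = {"refusal": 1.0, "completeness": 0.9, "hedging": 1.0, "length_ratio": 0.8}
bad = {"refusal": 0.0, "completeness": 0.4, "hedging": 0.2, "length_ratio": 0.6}
print(decide(good), decide(bad))  # accept escalate
```

Because refusal and completeness together carry 60% of the weight, a response can hedge a little and still be accepted, but an outright refusal almost always escalates.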


Try it

One env var change. OpenAI SDK compatible.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.pura.xyz/v1",
    api_key="pura_YOUR_KEY"  # free for 5,000 requests
)

response = client.chat.completions.create(
    model="auto",  # cascade routing picks the model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)

# Check response headers for cost info:
# X-Pura-Model, X-Pura-Cost, X-Pura-Tier
```
```shell
curl https://api.pura.xyz/v1/chat/completions \
  -H "Authorization: Bearer pura_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"What is 2+2?"}]}'
```
get an API key →
quickstart docs →