5 things that actually matter when picking an LLM API in 2026

Headline price doesn't matter much anymore. What actually moves the bill in 2026 is your input/output mix, reasoning-token billing, cache hit rate, output speed, and the open-vs-closed source decision.

1. Sticker price isn't the bill

The headline number on a pricing page — say, "$0.50 per 1M input tokens" — is the part of your bill you have the least control over. Cheap input pricing has more or less converged across the major providers in 2026: most flagship models now sit between $1 and $5 per million input tokens, and the mid-tier models (GPT-4o mini, Claude Haiku, Gemini Flash) are all within a few cents of each other.

The interesting variable is output price, which can be 3–5× the input rate, and output cost scales linearly with how many tokens your prompts make the model generate. Which model is cheaper depends on your mix. A model at $3 input / $15 output beats one at $1 input / $20 output only when output dominates: at these prices, when you send fewer than 2.5 input tokens per output token. For a RAG system returning short answers from long context, input dominates and the $1 / $20 model is the cheaper one. Always model your actual input/output ratio first, then pick a model, not the other way round.
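To make the ratio concrete, here is a minimal sketch in Python. The function name request_cost and the two token mixes are illustrative assumptions; the prices are the two example price points from the paragraph above.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request, given prices per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# The two example price points from the text: (input $/M, output $/M).
MODEL_A = (3.0, 15.0)
MODEL_B = (1.0, 20.0)

# Hypothetical token mixes (input tokens, output tokens), for illustration only.
WORKLOADS = {
    "long context, short answer": (10_000, 200),
    "short prompt, long answer": (500, 2_000),
}

for name, (tokens_in, tokens_out) in WORKLOADS.items():
    cost_a = request_cost(tokens_in, tokens_out, *MODEL_A)
    cost_b = request_cost(tokens_in, tokens_out, *MODEL_B)
    print(f"{name}: A=${cost_a:.4f}  B=${cost_b:.4f}")
```

With these assumed mixes, the long-context request costs about $0.033 on the $3 / $15 model and $0.014 on the $1 / $20 model, while the long-answer request costs about $0.0315 and $0.0405. The ranking flips at 2.5 input tokens per output token, which is where 3i + 15o equals 1i + 20o.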

2. Reasoning tokens are the new gotcha

The OpenAI o-series and Claude's extended-thinking modes both bill for reasoning tokens: the chain-of-thought the model generates internally before producing the visible answer. These tokens count as output. For a typical "think hard about this" prompt, the reasoning trace can be 5–20× longer than the final answer. So a question with 100 visible output tokens can be billed as 600 to 2,100 output tokens, and none of the extra appears in the visible answer.

Two implications: (1) for cost-sensitive workloads, a cheap reasoning model can cost more per task than a far pricier non-reasoning model; (2) when comparing prices, compare all-in cost per task, not just $/M output tokens. Providers' pricing pages rarely make this clear, so you have to test it on your own prompts.
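A rough sketch of the all-in calculation, in Python. The function names and both prices are hypothetical, chosen only to show the effect; the 100-token answer and the 20× reasoning ratio come from the figures above.

```python
def billed_output_tokens(visible_tokens, reasoning_ratio):
    """Output tokens billed for one task.

    reasoning_ratio is hidden reasoning tokens per visible token:
    0 for a non-reasoning model, 5 to 20 in the range cited above.
    """
    return visible_tokens * (1 + reasoning_ratio)


def output_cost(visible_tokens, reasoning_ratio, output_price_per_m):
    """Dollar cost of the output side of one task."""
    billed = billed_output_tokens(visible_tokens, reasoning_ratio)
    return billed * output_price_per_m / 1_000_000


# Hypothetical prices in $ per 1M output tokens, not taken from any pricing page.
CHEAP_REASONING_PRICE = 4.0
PRICEY_PLAIN_PRICE = 15.0

reasoning_cost = output_cost(100, 20, CHEAP_REASONING_PRICE)
plain_cost = output_cost(100, 0, PRICEY_PLAIN_PRICE)
print(f"reasoning model: ${reasoning_cost:.4f} per task")
print(f"non-reasoning model: ${plain_cost:.4f} per task")
```

Under these assumed prices the reasoning model bills 2,100 output tokens for about $0.0084 per task, against about $0.0015 for the non-reasoning model, even though its $/M rate is less than a third as high. Real ratios vary by prompt, so the reasoning_ratio should come from usage counts measured on your own traffic.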

3. Cache hit rate beats raw price

Prompt caching went from "nice to have" in 2024 to the single biggest cost lever in 2026. Cached input tokens are now priced at 10–25% of the standard rate at every major provider. For agent workloads — long system prompts repeated across many turns — cache hits can routinely knock 60–80% off your monthly bill.

The catch: cache hit rate depends on the structure of your prompts and on the provider's caching behaviour, not on the model's sticker price. Two providers quoting identical $/M rates can produce wildly different bills depending on how aggressively they deduplicate prefixes and how long they keep entries warm. Before committing to a provider for anything at scale, run a week of production traffic through their API and pull the cache-hit metric from your usage dashboard. The provider with the "more expensive" sticker price often wins.
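One way to compare providers is to fold the measured cache hit rate into a blended input price. The sketch below is in Python; the function name and both providers' numbers are made up for illustration, with cached discounts picked from the 10–25% range mentioned above.

```python
def effective_input_price(base_price_per_m, cache_hit_rate, cached_fraction_of_base):
    """Blended $ per 1M input tokens after caching.

    cache_hit_rate: share of input tokens served from cache (0 to 1).
    cached_fraction_of_base: cached price as a fraction of the base
        price, e.g. 0.10 means cached tokens cost 10% of the base rate.
    """
    cached_part = cache_hit_rate * base_price_per_m * cached_fraction_of_base
    uncached_part = (1 - cache_hit_rate) * base_price_per_m
    return cached_part + uncached_part


# Hypothetical providers: (base $/M input, measured hit rate, cached fraction).
PROVIDER_X = (3.0, 0.90, 0.10)  # higher sticker price, high hit rate
PROVIDER_Y = (2.0, 0.30, 0.25)  # lower sticker price, low hit rate

price_x = effective_input_price(*PROVIDER_X)
price_y = effective_input_price(*PROVIDER_Y)
print(f"provider X: ${price_x:.2f} per 1M input tokens")
print(f"provider Y: ${price_y:.2f} per 1M input tokens")
```

With these assumed numbers the $3 provider works out to about $0.57 per million input tokens and the $2 provider to about $1.55. The hit rates are the inputs that matter, and they have to come from your own traffic rather than from a pricing page.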

4. Output speed = unit economics

For chat UX, every doubling of tokens-per-second cuts the time to stream a full answer roughly in half (time to first token is a separate number). But for batch and background work, each in-flight request holds a concurrency slot for as long as it is generating, so output speed caps how many requests you can complete per hour under a given concurrency limit. That makes it a hard ceiling on your unit economics, not just a polish issue.

The fastest models in 2026 (Groq-hosted Llama variants, Gemini Flash, DeepSeek V4 at high throughput) can push 200–400 tokens per second. The slowest reasoning models can drop to 20–40 tps once the thinking trace kicks in. That's a 10× swing in throughput. If you're building anything real-time, measure tokens-per-second on a workload that looks like yours, not on the provider's marketing chart.
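A back-of-the-envelope sketch of that swing for batch work, in Python. The function name, the 500-token response length, and the concurrency limit of 50 are assumptions for illustration. The model is deliberately simple: it treats tokens per second as the per-request streaming rate and ignores time to first token, queueing, and rate limits other than concurrency.

```python
def requests_per_hour(tokens_per_second, output_tokens_per_request, max_concurrent):
    """Upper bound on completed requests per hour.

    Assumes every concurrency slot is always busy and each request
    spends all of its time streaming output at tokens_per_second.
    """
    return max_concurrent * 3600 * tokens_per_second / output_tokens_per_request


# Hypothetical workload: 500 output tokens per request, 50 concurrent slots.
OUTPUT_TOKENS = 500
MAX_CONCURRENT = 50

fast = requests_per_hour(300, OUTPUT_TOKENS, MAX_CONCURRENT)
slow = requests_per_hour(30, OUTPUT_TOKENS, MAX_CONCURRENT)
print(f"fast model: {fast:,.0f} requests/hour")
print(f"slow model: {slow:,.0f} requests/hour")
```

Under these assumptions a 300 tokens-per-second model tops out around 108,000 requests per hour and a 30 tokens-per-second model around 10,800, the same 10× gap as the raw speeds. Measured numbers will be lower, which is why the tokens-per-second input should come from a test on your own workload.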

5. Open vs closed source: the gap is now operational, not capability

In 2026, the strongest open-weight models (Llama 4, DeepSeek V4, Qwen 3) are within striking distance of GPT-5 and Claude Opus 4.6 on most benchmarks. The interesting question is no longer "is open source good enough" — for most workloads, it is. The interesting question is whether you want to run inference yourself.

The break-even point: at sustained throughput above roughly 50–100 million tokens per day, a dedicated GPU cluster running an open-weight model is cheaper than API access to a comparable closed model. Below that, the operational burden — uptime, scaling, model updates, quantisation choices — almost always tilts the maths back toward managed APIs. If your traffic is spiky or unpredictable, stay on APIs even if your peak rate makes self-hosting look cheap on paper.
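The break-even logic can be sketched in a few lines of Python. Both dollar figures below are hypothetical placeholders, not quotes from any vendor, and the model assumes the cluster cost is fixed per day and that the cluster can actually serve the volume. It leaves out engineering time, which is the operational burden described above.

```python
def break_even_tokens_per_day(cluster_cost_per_day, api_price_per_m):
    """Daily token volume at which the API bill equals the cluster cost."""
    return cluster_cost_per_day / api_price_per_m * 1_000_000


def cheaper_option(tokens_per_day, cluster_cost_per_day, api_price_per_m):
    """Return 'self-host' or 'api' for a sustained daily volume."""
    api_cost_per_day = tokens_per_day * api_price_per_m / 1_000_000
    return "self-host" if cluster_cost_per_day < api_cost_per_day else "api"


# Hypothetical inputs: all-in cluster cost and a blended API price.
CLUSTER_COST_PER_DAY = 600.0  # $ per day, assumed
API_PRICE_PER_M = 8.0         # $ per 1M tokens, blended input/output, assumed

threshold = break_even_tokens_per_day(CLUSTER_COST_PER_DAY, API_PRICE_PER_M)
print(f"break-even: {threshold:,.0f} tokens/day")
print(cheaper_option(20_000_000, CLUSTER_COST_PER_DAY, API_PRICE_PER_M))
print(cheaper_option(200_000_000, CLUSTER_COST_PER_DAY, API_PRICE_PER_M))
```

With these placeholder numbers the threshold is 75 million tokens per day: 20 million a day stays on the API, 200 million a day favours self-hosting. The calculation only holds for sustained volume; a spiky workload pays the cluster cost on quiet days too.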

Putting it together

Pick a model based on cost only after you've measured: your input/output ratio, your prompt structure (cache hit potential), whether you need reasoning, and your latency budget. The model tables on this site are sorted by raw input price — that's a starting point, not the answer. The answer comes from your own production logs.

Written by Allen Pan. Corrections or questions welcome — allen@xyzsleep.com.