Introduction
LLM integration is deceptively easy. Anyone can call an API and render a string. Shipping an LLM feature that stays fast, cheap, reliable, and accurate as you scale is a different problem — and it is mostly a boring engineering problem. This article covers the patterns we apply across every LLM integration we ship in 2026.
The underlying theme: treat LLM calls as unreliable, expensive, remote function calls with strict contracts and extensive observability. Every pattern here flows from that stance.
Model routing — pick the right model per task
Different tasks have different cost/quality sweet spots. Classification and routing run cheaply on Haiku or GPT-4.1-mini. Reasoning-heavy extraction and agentic decisions use Claude 3.5 Sonnet or GPT-4.1. Light drafting can use Gemini Flash. Building a simple router that picks a model per task reduces your average cost by 40–70% without hurting quality.
The router is a thin layer in front of every LLM call: take the task type, pick the model, pick the system prompt, pick the tool set, and hand off to the provider. Keep it typed. Keep it testable. Keep it swappable per-tenant for enterprise customers who want to use their own model credentials.
- Route cheap tasks to small models; reserve frontier models for reasoning
- A typed router saves 40–70% of average inference cost
- Support per-tenant model credentials for enterprise
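As a concrete illustration, here is a minimal sketch of that kind of typed router in TypeScript. The task names, placeholder model identifiers, prompts, and tool names are illustrative assumptions, not a recommendation for any particular provider SDK.

```typescript
// Minimal typed router sketch. Task names, model identifiers, prompts, and
// tools are illustrative assumptions; swap in whatever your provider layer expects.
type TaskType = "classification" | "extraction" | "drafting";

interface Route {
  model: string;
  systemPrompt: string;
  tools: string[];
}

const DEFAULT_ROUTES: Record<TaskType, Route> = {
  classification: { model: "small-fast-model", systemPrompt: "Classify the input.", tools: [] },
  extraction: { model: "frontier-model", systemPrompt: "Extract the requested fields.", tools: ["lookup_account"] },
  drafting: { model: "light-drafting-model", systemPrompt: "Draft a short reply.", tools: [] },
};

// Per-tenant overrides let enterprise customers bring their own models and credentials.
function resolveRoute(
  task: TaskType,
  tenantOverrides?: Partial<Record<TaskType, Route>>
): Route {
  return tenantOverrides?.[task] ?? DEFAULT_ROUTES[task];
}
```

Because a route is plain data, it is easy to unit-test, log, and override per tenant without touching call sites.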
Structured output — always, everywhere
Free-form text outputs from LLMs are a debugging nightmare in production. Every LLM call that feeds downstream logic should use structured output (JSON mode, tool call schemas, or Anthropic's structured output primitives). Define a Zod or Pydantic schema, validate on the way out, and log the delta between raw model output and validated output to catch schema drift.
When the model produces invalid output, retry once with a strict 'produce valid JSON matching this schema' prompt. If it fails again, log the failure and fall back gracefully. Never assume the model will return exactly what you asked for; design your system to tolerate the exceptions.
- Use JSON mode or tool schemas for anything feeding downstream logic
- Validate with Zod (TypeScript) or Pydantic (Python) on the way out
- Retry once on schema violation, then log and degrade gracefully
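A sketch of this validate-then-retry-once flow in TypeScript with Zod. The schema fields and the `generate` callback are assumptions standing in for your actual task and provider wrapper.

```typescript
import { z } from "zod";

// Illustrative schema; the fields are assumptions for a support-ticket example.
const TicketSchema = z.object({
  category: z.enum(["billing", "bug", "feature_request"]),
  urgency: z.number().int().min(1).max(5),
  summary: z.string(),
});
type Ticket = z.infer<typeof TicketSchema>;

// generate() stands in for your provider call; it returns raw model text.
async function extractTicket(
  input: string,
  generate: (prompt: string) => Promise<string>
): Promise<Ticket | null> {
  const basePrompt = `Extract a support ticket from the message below. Return only JSON.\n\n${input}`;
  const retryPrompt = `${basePrompt}\n\nYour previous output was invalid. Produce valid JSON matching the schema exactly.`;

  for (const prompt of [basePrompt, retryPrompt]) {
    const raw = await generate(prompt);
    const parsed = TicketSchema.safeParse(parseJson(raw));
    if (parsed.success) return parsed.data;
    console.warn("schema violation", { raw, issues: parsed.error.issues }); // log to catch schema drift
  }
  return null; // caller degrades gracefully
}

function parseJson(text: string): unknown {
  try { return JSON.parse(text); } catch { return undefined; }
}
```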
Caching — the cheapest performance win
Many LLM calls are redundant. The same summary request, the same classification, the same RAG query. Add a content-addressed cache (Redis keyed by a hash of input + model + system prompt + parameters) in front of every LLM call and you will see 20–50% cost reduction on typical workloads with no quality loss.
Anthropic and OpenAI both support prompt caching natively for long system prompts and retrieved context. Use it. A 5-minute cache on retrieved chunks typically cuts cost per RAG call in half while improving latency.
Be careful with caching for generative or user-specific outputs — cache only deterministic, non-personalized calls. The rule of thumb: if two users with the same input should get the same answer, cache it. Otherwise, don't.
- Content-addressed cache on deterministic calls: 20–50% cost reduction
- Use native prompt caching (Anthropic, OpenAI) for long system prompts
- Cache rule: same input = same answer → cache; otherwise don't
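A minimal content-addressed cache wrapper might look like the sketch below. The `CacheClient` interface is an assumption shaped like ioredis's `get`/`set`, and the default TTL is arbitrary; apply it only to deterministic, non-personalized calls.

```typescript
import { createHash } from "node:crypto";

// Redis-like client with get/set; ioredis exposes this shape.
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, mode: "EX", ttlSeconds: number): Promise<unknown>;
}

interface CallSpec {
  input: string;
  model: string;
  systemPrompt: string;
  params: Record<string, unknown>;
}

// Key is a hash of input + model + system prompt + parameters.
function cacheKey(spec: CallSpec): string {
  const digest = createHash("sha256").update(JSON.stringify(spec)).digest("hex");
  return `llm:${digest}`;
}

// Only wrap deterministic, non-personalized calls with this.
async function cachedCall(
  spec: CallSpec,
  cache: CacheClient,
  call: (spec: CallSpec) => Promise<string>,
  ttlSeconds = 86_400
): Promise<string> {
  const key = cacheKey(spec);
  const hit = await cache.get(key);
  if (hit !== null) return hit;
  const result = await call(spec);
  await cache.set(key, result, "EX", ttlSeconds);
  return result;
}
```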
Evals — the production quality gate
Without an eval harness, you cannot tell whether a prompt change improved or regressed quality. Build a harness from day one, with three tiers: deterministic tests (for structured outputs and classification), LLM-as-judge (for open-ended quality), and a regression corpus (real-world failures you have already seen).
Run evals on every prompt change, every model version upgrade, and every new feature. Block merges that regress any eval below threshold. Tools like Promptfoo, Braintrust, and Langfuse evals are all reasonable; you can also roll your own in 300 lines of TypeScript.
- Three eval tiers: deterministic, LLM-as-judge, regression corpus
- Run on every prompt/model change. Block regressions.
- Use Promptfoo, Braintrust, Langfuse — or roll your own
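If you roll your own, the deterministic tier is the easy part. Here is a sketch where the case shape, `classify` callback, and threshold are all assumptions; a real harness would add the judge tier and replay the regression corpus the same way.

```typescript
// Deterministic-tier eval sketch: run a corpus of cases through the system
// and fail CI when accuracy drops below a threshold. Case shape and threshold
// are illustrative assumptions.
interface EvalCase {
  input: string;
  expected: string; // exact expected label for classification-style tasks
}

async function runDeterministicEvals(
  cases: EvalCase[],
  classify: (input: string) => Promise<string>,
  threshold = 0.95
): Promise<void> {
  let passed = 0;
  for (const c of cases) {
    const actual = await classify(c.input);
    if (actual === c.expected) passed++;
    else console.error("eval failure", { input: c.input, expected: c.expected, actual });
  }
  const accuracy = passed / cases.length;
  console.log(`deterministic evals: ${passed}/${cases.length} (${(accuracy * 100).toFixed(1)}%)`);
  if (accuracy < threshold) {
    throw new Error(`eval regression: accuracy ${accuracy.toFixed(3)} below threshold ${threshold}`);
  }
}
```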
Retries, timeouts, and failure modes
LLM providers have outages, rate limits, and intermittent slow responses. Treat every call like a remote service call: reasonable timeout (30–60 seconds for most calls, longer for explicit reasoning), exponential backoff on retryable errors (429, 5xx), circuit breaker on sustained failure, and a documented degradation path (fall back to a smaller model, a cached answer, or a friendly error message).
Keep retry logic for safety-relevant failures separate. If the model returns an unsafe or policy-violating output, do not retry blindly; log it and route to human review. Retrying a safety failure with a different prompt is a common mistake that silently produces worse outcomes.
- Timeout + exponential backoff + circuit breaker on every LLM call
- Fallback path: smaller model → cached answer → friendly error
- Safety failures go to human review, not blind retry
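A sketch of the timeout-plus-backoff wrapper, assuming the provider call accepts an `AbortSignal` and surfaces HTTP status codes; the circuit breaker and fallback chain (smaller model, cached answer, friendly error) would wrap around this.

```typescript
// Timeout + exponential backoff sketch. The error shape, retryable-status
// check, and backoff schedule are illustrative assumptions.
class LLMCallError extends Error {
  constructor(message: string, readonly status?: number) {
    super(message);
  }
}

// Retry only on rate limits and server errors (429, 5xx).
function isRetryable(err: unknown): boolean {
  return err instanceof LLMCallError &&
    (err.status === 429 || (err.status !== undefined && err.status >= 500));
}

async function withRetries<T>(
  call: (signal: AbortSignal) => Promise<T>,
  { timeoutMs = 45_000, maxAttempts = 3 } = {}
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await call(controller.signal);
    } catch (err) {
      if (attempt === maxAttempts || !isRetryable(err)) throw err;
      // Exponential backoff with jitter: roughly 1s, 2s, 4s...
      const delayMs = 1000 * 2 ** (attempt - 1) * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    } finally {
      clearTimeout(timer);
    }
  }
  throw new LLMCallError("unreachable"); // satisfies the type checker
}
```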
Observability — the non-negotiable layer
Every LLM call should log: request ID, user ID (hashed where required), model, prompt, tool calls, response, latency, token usage, cost. Use Langfuse, Helicone, or OpenTelemetry. Build a dashboard that shows cost per feature, latency P50/P95/P99, and error rate by error class.
The highest-leverage practice: a trace viewer your product team actually uses. Product teams that can read real traces find and fix quality issues weeks faster than teams that rely on eval scores alone.
- Log: request ID, model, prompt, tools, response, latency, tokens, cost
- Langfuse, Helicone, or OpenTelemetry pipeline
- Product-accessible trace viewer catches what dashboards miss
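A sketch of a per-call logging wrapper that captures those fields. The `emit` sink is a placeholder for a Langfuse, Helicone, or OpenTelemetry exporter, and the token and cost fields are assumed to come back from your provider layer.

```typescript
import { createHash, randomUUID } from "node:crypto";

// One trace record per LLM call; field names mirror the list above.
interface LLMTrace {
  requestId: string;
  userIdHash: string;
  model: string;
  prompt: string;
  response: string;
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

async function tracedCall(
  userId: string,
  model: string,
  prompt: string,
  call: () => Promise<{ text: string; inputTokens: number; outputTokens: number; costUsd: number }>,
  emit: (trace: LLMTrace) => Promise<void> // placeholder exporter
): Promise<string> {
  const started = Date.now();
  const result = await call();
  await emit({
    requestId: randomUUID(),
    userIdHash: createHash("sha256").update(userId).digest("hex"), // hash where required
    model,
    prompt,
    response: result.text,
    latencyMs: Date.now() - started,
    inputTokens: result.inputTokens,
    outputTokens: result.outputTokens,
    costUsd: result.costUsd,
  });
  return result.text;
}
```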
Conclusion
LLM integration quality is mostly a function of engineering discipline. Route intelligently, structure your outputs, cache aggressively, evaluate continuously, retry with care, and observe everything. The patterns are not exotic. The discipline of applying them consistently is what separates production LLM systems from permanent prototypes.
