# ADR-009: Task-Based Model Routing
Why Owlat supports per-task LLM model selection instead of using a single model for all pipeline steps.
- Status: Accepted
- Date: 2026-03-24
## Context
The Agent Pipeline, Knowledge Graph, and file system all make LLM calls, but with different requirements:
| Task | Needs | Volume |
|---|---|---|
| Classification | Speed, structured output | Every inbound message |
| Security filter | Speed, low latency | Every inbound message |
| Knowledge extraction | Structured output, cost efficiency | Every processed message |
| File tagging | Summarization, cost efficiency | Every uploaded file |
| Action planning | Reasoning, tool use | Per-message (after classification) |
| Draft generation | Writing quality, tone matching | Per-message (when draft needed) |
| Context compaction | Summarization | When context exceeds budget |
Running everything on the most capable model (GPT-4o, Claude Sonnet) wastes resources — classification produces a simple enum and does not need the reasoning capacity of a frontier model. Running everything on the cheapest model (GPT-4o-mini, Llama 3) saves money but produces poor-quality drafts.
The options considered:
- Single model — simplest configuration, but forces a choice between quality and cost
- Per-step model configuration — five separate model environment variables, complex to configure
- Two-tier routing — fast model for high-volume structured tasks, capable model for reasoning and writing
## Decision
Extend the pluggable LLM provider with two-tier model routing:
```bash
LLM_MODEL=gpt-4o              # Fallback for all tasks
LLM_MODEL_CAPABLE=gpt-4o      # Drafting, planning, reasoning
LLM_MODEL_FAST=gpt-4o-mini    # Classification, extraction, summarization
```
The `getLLMProvider()` function accepts a task parameter that selects the appropriate tier:
```typescript
type ModelTask = 'classify' | 'draft' | 'extract' | 'plan' | 'guard' | 'summarize'

export function getLLMProvider(task: ModelTask = 'draft') {
  const isFastTask = ['classify', 'extract', 'guard', 'summarize'].includes(task)
  const model = isFastTask
    ? process.env.LLM_MODEL_FAST ?? process.env.LLM_MODEL ?? 'gpt-4o-mini'
    : process.env.LLM_MODEL_CAPABLE ?? process.env.LLM_MODEL ?? 'gpt-4o'
  // ... provider setup
}
```
Both `LLM_MODEL_CAPABLE` and `LLM_MODEL_FAST` fall back to `LLM_MODEL`, which means:
- Minimal configuration: set only `LLM_MODEL` and everything uses that model (same behavior as ADR-007)
- Cost optimization: set `LLM_MODEL_FAST=gpt-4o-mini` and `LLM_MODEL_CAPABLE=gpt-4o` to split by task
- Self-hosted simplicity: self-hosters running a single Ollama model set `LLM_MODEL=llama3` and both tiers use it
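The fallback chain can be exercised in isolation. This is a minimal sketch of the tier-resolution logic only: `resolveModel` is a hypothetical extraction of the model-selection lines from `getLLMProvider()`, with an `env` parameter standing in for `process.env` so the three configuration modes are easy to compare.

```typescript
// Sketch of the two-tier fallback resolution. resolveModel is illustrative,
// not an exported Owlat function; env stands in for process.env.
type ModelTask = 'classify' | 'draft' | 'extract' | 'plan' | 'guard' | 'summarize'

const FAST_TASKS: ModelTask[] = ['classify', 'extract', 'guard', 'summarize']

function resolveModel(task: ModelTask, env: Record<string, string | undefined>): string {
  return FAST_TASKS.includes(task)
    ? env.LLM_MODEL_FAST ?? env.LLM_MODEL ?? 'gpt-4o-mini'
    : env.LLM_MODEL_CAPABLE ?? env.LLM_MODEL ?? 'gpt-4o'
}

// Minimal configuration: one variable, both tiers resolve to it
resolveModel('classify', { LLM_MODEL: 'llama3' })  // 'llama3'
resolveModel('draft',    { LLM_MODEL: 'llama3' })  // 'llama3'

// Split configuration: fast and capable tiers diverge by task
resolveModel('extract', { LLM_MODEL_FAST: 'gpt-4o-mini', LLM_MODEL_CAPABLE: 'gpt-4o' })  // 'gpt-4o-mini'
resolveModel('plan',    { LLM_MODEL_FAST: 'gpt-4o-mini', LLM_MODEL_CAPABLE: 'gpt-4o' })  // 'gpt-4o'
```

Because both tier variables fall through to `LLM_MODEL` before the hard-coded defaults, an empty environment still resolves to a working model for every task.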
## Consequences
Enables:
- Cost reduction of 60-80% on high-volume tasks (classification, extraction) by using smaller models
- Quality preservation on tasks that need it (drafting, planning) by using capable models
- Simple upgrade path — start with one model, split later as volume grows
- Self-hosters can run a single model without any routing configuration
- Per-step cost tracking becomes meaningful (fast model calls are cheap, capable model calls are the real cost center)
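The per-step cost tracking point can be sketched as a small accumulator keyed by tier. Everything here is hypothetical (Owlat's actual tracking is not specified in this ADR), and the per-token prices are illustrative placeholders, not quoted provider rates.

```typescript
// Hypothetical per-tier cost accumulator; recordUsage and the price table
// are illustrative, not part of Owlat. Prices are placeholder values.
type Tier = 'fast' | 'capable'

const COST_PER_1K_TOKENS: Record<Tier, number> = {
  fast: 0.00015,   // placeholder small-model rate per 1K tokens
  capable: 0.0025, // placeholder frontier-model rate per 1K tokens
}

const usage: Record<Tier, number> = { fast: 0, capable: 0 }

function recordUsage(tier: Tier, tokens: number): void {
  usage[tier] += tokens
}

function totalCost(): number {
  return (usage.fast / 1000) * COST_PER_1K_TOKENS.fast
       + (usage.capable / 1000) * COST_PER_1K_TOKENS.capable
}
```

With calls tagged by tier, the report makes the split visible directly: high-volume fast-tier tokens contribute little, and the capable tier shows up as the real cost center.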
Trade-offs:
- Two additional environment variables to document and support
- Model compatibility assumptions — structured output behavior varies across models and may need per-model prompt adjustments
- Self-hosters using a single small model will see quality limitations on drafting tasks (same as ADR-007's existing trade-off)
- No per-organization model override yet — routing is system-wide, not tenant-configurable