ADR-009: Task-Based Model Routing

Why Owlat supports per-task LLM model selection instead of using a single model for all pipeline steps.

  • Status: Accepted
  • Date: 2026-03-24

Context

The Agent Pipeline, Knowledge Graph, and file system all make LLM calls, but with different requirements:

| Task                 | Needs                              | Volume                             |
|----------------------|------------------------------------|------------------------------------|
| Classification       | Speed, structured output           | Every inbound message              |
| Security filter      | Speed, low latency                 | Every inbound message              |
| Knowledge extraction | Structured output, cost efficiency | Every processed message            |
| File tagging         | Summarization, cost efficiency     | Every uploaded file                |
| Action planning      | Reasoning, tool use                | Per-message (after classification) |
| Draft generation     | Writing quality, tone matching     | Per-message (when draft needed)    |
| Context compaction   | Summarization                      | When context exceeds budget        |

Running everything on the most capable model (GPT-4o, Claude Sonnet) wastes resources — classification produces a simple enum and does not need the reasoning capacity of a frontier model. Running everything on the cheapest model (GPT-4o-mini, Llama 3) saves money but produces poor-quality drafts.

The options considered:

  1. Single model — simplest configuration, but forces a choice between quality and cost
  2. Per-step model configuration — five separate model environment variables, complex to configure
  3. Two-tier routing — fast model for high-volume structured tasks, capable model for reasoning and writing

Decision

Extend the pluggable LLM provider with two-tier model routing:

LLM_MODEL=gpt-4o              # Fallback for all tasks
LLM_MODEL_CAPABLE=gpt-4o      # Drafting, planning, reasoning
LLM_MODEL_FAST=gpt-4o-mini    # Classification, extraction, summarization

The getLLMProvider() function accepts a task parameter that selects the appropriate tier:

type ModelTask = 'classify' | 'draft' | 'extract' | 'plan' | 'guard' | 'summarize'

export function getLLMProvider(task: ModelTask = 'draft') {
  const isFastTask = ['classify', 'extract', 'guard', 'summarize'].includes(task)
  const model = isFastTask
    ? process.env.LLM_MODEL_FAST ?? process.env.LLM_MODEL ?? 'gpt-4o-mini'
    : process.env.LLM_MODEL_CAPABLE ?? process.env.LLM_MODEL ?? 'gpt-4o'
  // ... provider setup
}

Both LLM_MODEL_CAPABLE and LLM_MODEL_FAST fall back to LLM_MODEL, which means:

  • Minimal configuration: set only LLM_MODEL and everything uses that model (same behavior as ADR-007)
  • Cost optimization: set LLM_MODEL_FAST=gpt-4o-mini and LLM_MODEL_CAPABLE=gpt-4o to split by task
  • Self-hosted simplicity: self-hosters running a single Ollama model set LLM_MODEL=llama3 and both tiers use it
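The fallback chain can be illustrated as a pure function. This is a sketch for illustration only: resolveModel and the ModelEnv interface are hypothetical names, not part of Owlat's API, but the resolution logic mirrors the getLLMProvider snippet above.

```typescript
type ModelTask = 'classify' | 'draft' | 'extract' | 'plan' | 'guard' | 'summarize'

// Tasks routed to the fast tier, per the Decision section above
const FAST_TASKS: ReadonlySet<ModelTask> = new Set(['classify', 'extract', 'guard', 'summarize'])

// Hypothetical shape for the relevant environment variables
interface ModelEnv {
  LLM_MODEL?: string
  LLM_MODEL_CAPABLE?: string
  LLM_MODEL_FAST?: string
}

// Resolve the model for a task: tier-specific variable, then LLM_MODEL, then a default
function resolveModel(task: ModelTask, env: ModelEnv): string {
  return FAST_TASKS.has(task)
    ? env.LLM_MODEL_FAST ?? env.LLM_MODEL ?? 'gpt-4o-mini'
    : env.LLM_MODEL_CAPABLE ?? env.LLM_MODEL ?? 'gpt-4o'
}

// Minimal configuration: a single variable drives both tiers
console.log(resolveModel('classify', { LLM_MODEL: 'llama3' })) // llama3
console.log(resolveModel('draft', { LLM_MODEL: 'llama3' }))    // llama3

// Split configuration: tiers diverge by task
console.log(resolveModel('extract', { LLM_MODEL_FAST: 'gpt-4o-mini', LLM_MODEL_CAPABLE: 'gpt-4o' })) // gpt-4o-mini
console.log(resolveModel('plan', { LLM_MODEL_FAST: 'gpt-4o-mini', LLM_MODEL_CAPABLE: 'gpt-4o' }))    // gpt-4o
```

Keeping the resolution logic pure like this (environment passed in rather than read globally) also makes the routing table trivial to unit-test.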

Consequences

Enables:

  • Cost reduction of 60-80% on high-volume tasks (classification, extraction) by using smaller models
  • Quality preservation on tasks that need it (drafting, planning) by using capable models
  • Simple upgrade path — start with one model, split later as volume grows
  • Self-hosters can run a single model without any routing configuration
  • Per-step cost tracking becomes meaningful (fast model calls are cheap, capable model calls are the real cost center)

Trade-offs:

  • Two additional environment variables to document and support
  • Model compatibility assumptions — structured output behavior varies across models, may need per-model prompt adjustments
  • Self-hosters using a single small model will see quality limitations on drafting tasks (same as ADR-007's existing trade-off)
  • No per-organization model override yet — routing is system-wide, not tenant-configurable