ADR-009: Task-Based Model Routing

Why Owlat supports per-task LLM model selection instead of using a single model for all pipeline steps.

  • Status: Accepted
  • Date: 2026-03-24

Context

The Agent Pipeline, Knowledge Graph, and file system all make LLM calls, but with different requirements:

| Task                 | Needs                              | Volume                             |
|----------------------|------------------------------------|------------------------------------|
| Classification       | Speed, structured output           | Every inbound message              |
| Security filter      | Speed, low latency                 | Every inbound message              |
| Knowledge extraction | Structured output, cost efficiency | Every processed message            |
| File tagging         | Summarization, cost efficiency     | Every uploaded file                |
| Action planning      | Reasoning, tool use                | Per-message (after classification) |
| Draft generation     | Writing quality, tone matching     | Per-message (when draft needed)    |
| Context compaction   | Summarization                      | When context exceeds budget        |

Running everything on the most capable model (GPT-4o, Claude Sonnet) wastes resources — classification produces a simple enum and does not need the reasoning capacity of a frontier model. Running everything on the cheapest model (GPT-4o-mini, Llama 3) saves money but produces poor-quality drafts.

The options considered:

  1. Single model — simplest configuration, but forces a choice between quality and cost
  2. Per-step model configuration — five separate model environment variables, complex to configure
  3. Two-tier routing — fast model for high-volume structured tasks, capable model for reasoning and writing

Decision

Extend the pluggable LLM provider with two-tier model routing:

LLM_MODEL=gpt-4o              # Fallback for all tasks
LLM_MODEL_CAPABLE=gpt-4o      # Drafting, planning, reasoning
LLM_MODEL_FAST=gpt-4o-mini    # Classification, extraction, summarization

The getLLMProvider() function accepts a task parameter that selects the appropriate tier:

type ModelTask = 'classify' | 'draft' | 'extract' | 'plan' | 'guard' | 'summarize'

export function getLLMProvider(task: ModelTask = 'draft') {
  const isFastTask = ['classify', 'extract', 'guard', 'summarize'].includes(task)
  const model = isFastTask
    ? process.env.LLM_MODEL_FAST ?? process.env.LLM_MODEL ?? 'gpt-4o-mini'
    : process.env.LLM_MODEL_CAPABLE ?? process.env.LLM_MODEL ?? 'gpt-4o'
  // ... provider setup
}

Both LLM_MODEL_CAPABLE and LLM_MODEL_FAST fall back to LLM_MODEL, which means:

  • Minimal configuration: set only LLM_MODEL and everything uses that model (same behavior as ADR-007)
  • Cost optimization: set LLM_MODEL_FAST=gpt-4o-mini and LLM_MODEL_CAPABLE=gpt-4o to split by task
  • Self-hosted simplicity: self-hosters running a single Ollama model set LLM_MODEL=llama3 and both tiers use it
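The fallback chain can be illustrated as a pure function. This is a sketch for illustration only: resolveModel and the ModelEnv interface are hypothetical names, not part of Owlat's API, but the resolution logic mirrors the getLLMProvider snippet above.

```typescript
type ModelTask = 'classify' | 'draft' | 'extract' | 'plan' | 'guard' | 'summarize'

// Tasks routed to the fast tier, per the Decision section above
const FAST_TASKS: ReadonlySet<ModelTask> = new Set(['classify', 'extract', 'guard', 'summarize'])

// Hypothetical shape for the relevant environment variables
interface ModelEnv {
  LLM_MODEL?: string
  LLM_MODEL_CAPABLE?: string
  LLM_MODEL_FAST?: string
}

// Resolve the model for a task: tier-specific variable, then LLM_MODEL, then a default
function resolveModel(task: ModelTask, env: ModelEnv): string {
  return FAST_TASKS.has(task)
    ? env.LLM_MODEL_FAST ?? env.LLM_MODEL ?? 'gpt-4o-mini'
    : env.LLM_MODEL_CAPABLE ?? env.LLM_MODEL ?? 'gpt-4o'
}

// Minimal configuration: a single variable drives both tiers
console.log(resolveModel('classify', { LLM_MODEL: 'llama3' })) // llama3
console.log(resolveModel('draft', { LLM_MODEL: 'llama3' }))    // llama3

// Split configuration: tiers diverge by task
console.log(resolveModel('extract', { LLM_MODEL_FAST: 'gpt-4o-mini', LLM_MODEL_CAPABLE: 'gpt-4o' })) // gpt-4o-mini
console.log(resolveModel('plan', { LLM_MODEL_FAST: 'gpt-4o-mini', LLM_MODEL_CAPABLE: 'gpt-4o' }))    // gpt-4o
```

Keeping the resolution logic pure like this (environment passed in rather than read globally) also makes the routing table trivial to unit-test.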

Consequences

Enables:

  • Cost reduction of 60-80% on high-volume tasks (classification, extraction) by using smaller models
  • Quality preservation on tasks that need it (drafting, planning) by using capable models
  • Simple upgrade path — start with one model, split later as volume grows
  • Self-hosters can run a single model without any routing configuration
  • Per-step cost tracking becomes meaningful (fast model calls are cheap, capable model calls are the real cost center)

Trade-offs:

  • Two additional environment variables to document and support
  • Model compatibility assumptions — structured output behavior varies across models, may need per-model prompt adjustments
  • Self-hosters using a single small model will see quality limitations on drafting tasks (same as ADR-007's existing trade-off)
  • No per-organization model override yet — routing is system-wide, not tenant-configurable