ADR-008: Agent Process Architecture

Why Owlat uses a process-oriented architecture for the agent pipeline instead of a single sequential function.

  • Status: Accepted
  • Date: 2026-03-24

Context

The Agent Pipeline processes inbound messages through five steps: context retrieval, classification, action planning, draft generation, and routing. The simplest implementation is a single function that runs all five steps sequentially. But this creates problems:

  1. No partial retry — if draft generation fails (LLM timeout, rate limit), the entire pipeline restarts from context retrieval, wasting LLM calls
  2. No concurrency — one long-running pipeline blocks the next message from starting. A 10-second draft generation delays all subsequent messages
  3. No multi-intent handling — a single message asking about billing and requesting a feature needs two separate processing paths
  4. No observability — a monolithic function provides one timing metric, not per-step breakdown

Convex's serverless model fits this naturally: each internalAction can be scheduled, retried, and monitored independently.

Decision

Adopt a process-oriented architecture with three process types mapped to Convex primitives:

| Process  | Convex Primitive | Pipeline Steps  | Behavior |
|----------|------------------|-----------------|----------|
| Receiver | HTTP action      | Inbound webhook | Stateless — stores message, triggers security filter |
| Analyzer | internalAction   | Steps 1–2       | Context retrieval + classification. Can fork for multi-intent messages |
| Worker   | internalAction   | Steps 3–5       | Autonomous execution with state machine (running / waiting_for_input / done / failed) |

A coordinator mutation tracks the pipeline state in the inboundMessages table and schedules the next step when the previous one completes. Each step records its timing and result independently.
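The coordinator's step-transition logic can be sketched as plain TypeScript. This is a minimal illustration, not the actual Owlat schema: the type and function names here are hypothetical, and in the real pipeline this logic would live in a Convex mutation that updates the inboundMessages row and schedules the next internalAction (e.g. via Convex's scheduler).

```typescript
// The five pipeline steps, in order. The coordinator schedules the
// next step when the previous one reports completion.
type Step =
  | "context_retrieval"
  | "classification"
  | "action_planning"
  | "draft_generation"
  | "routing";

const ORDER: Step[] = [
  "context_retrieval",
  "classification",
  "action_planning",
  "draft_generation",
  "routing",
];

interface PipelineState {
  step: Step;
  stepTimings: Partial<Record<Step, number>>; // ms per completed step
}

// Record the finished step's timing and return the next step to
// schedule, or null when the pipeline is done.
function advance(state: PipelineState, elapsedMs: number): Step | null {
  state.stepTimings[state.step] = elapsedMs;
  const i = ORDER.indexOf(state.step);
  const next = ORDER[i + 1] ?? null;
  if (next) state.step = next;
  return next;
}
```

Because each transition records its own timing, per-step observability falls out of the same state the coordinator already maintains.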

Multi-intent branching

When classification detects multiple intents, the coordinator spawns parallel worker processes — one per intent. Each worker runs independently with its own context, action plan, and draft. The routing step merges results when appropriate.
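The fan-out/merge shape can be sketched as plain async TypeScript. The worker body below is a stand-in for the real planning, drafting, and routing steps, and all names are illustrative:

```typescript
interface Intent {
  kind: string;
  text: string;
}

interface WorkerResult {
  intent: string;
  draft: string;
}

// Stand-in for one worker process: each runs with its own context
// and produces a draft for a single intent.
async function runWorker(intent: Intent): Promise<WorkerResult> {
  return { intent: intent.kind, draft: `Draft for ${intent.kind}: ${intent.text}` };
}

// Coordinator fan-out: one worker per detected intent, run in
// parallel, followed by a merge step that combines the drafts.
async function handleIntents(intents: Intent[]): Promise<string> {
  const results = await Promise.all(intents.map(runWorker));
  return results.map((r) => r.draft).join("\n\n");
}
```

In the Convex deployment each branch would be its own scheduled internalAction rather than a Promise, so one slow branch cannot block the others from being retried or monitored independently.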

Retry semantics

Each process type has independent retry behavior:

  • Receiver — no retry (webhook delivery is the provider's responsibility)
  • Analyzer — retry up to 2 times with exponential backoff. If classification repeatedly fails, mark the message for human review
  • Worker — retry up to 3 times. On persistent failure, save partial progress (the action plan) and route to verification queue with a "needs manual intervention" flag
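The policy above can be expressed as data plus a small decision helper. The retry limits mirror this ADR; the 1-second backoff base and the function names are assumptions for illustration:

```typescript
type Process = "receiver" | "analyzer" | "worker";

// Maximum retry attempts per process type, per the policy above.
const MAX_RETRIES: Record<Process, number> = {
  receiver: 0, // webhook redelivery is the provider's responsibility
  analyzer: 2,
  worker: 3,
};

// Exponential backoff: base * 2^attempt. The 1s base is illustrative.
function backoffMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

// Decide what to do after a failure, given how many retries have
// already been consumed.
function onFailure(
  process: Process,
  retryCount: number
): "retry" | "human_review" | "verification_queue" | "drop" {
  if (retryCount < MAX_RETRIES[process]) return "retry";
  if (process === "analyzer") return "human_review";
  if (process === "worker") return "verification_queue";
  return "drop"; // receiver: never retried
}
```

Keeping the limits in a per-process table means the escalation paths (human review for the analyzer, verification queue for the worker) stay visible in one place rather than scattered across handlers.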

Consequences

Enables:

  • Independent retry per step — a failed draft generation retries only the draft, not the full pipeline
  • Concurrent processing — multiple messages process simultaneously across organizations
  • Per-step observability — timing, cost, and error metrics per pipeline step
  • Multi-intent handling — parallel worker branches for complex messages
  • Graceful degradation — partial pipeline results are still useful (a classification without a draft is better than nothing)

Trade-offs:

  • More schema fields for state tracking (processingState, retryCount, stepTimings)
  • Coordinator logic adds complexity compared with a simple sequential function
  • State machine transitions need careful testing to avoid stuck messages
  • Slightly higher latency than a single function call due to scheduling overhead (~50ms per step transition)