Desktop App & Advanced Agents
Technical architecture for the Owlat desktop app, visualization agent, graduated autonomy, and coding agents.
The final phase of the roadmap brings Owlat to the desktop as a native communication channel, adds specialized agents for visualization and coding, and introduces graduated autonomy where organizations fine-tune how much decision-making they delegate to AI.
Desktop App
Why Tauri
The desktop app is built with Tauri v2:
| | Tauri v2 | Electron |
|---|---|---|
| Binary size | ~5–10 MB | ~150–200 MB |
| Memory usage | Lower (shared OS webview) | Higher (bundled Chromium per app) |
| License | MIT | MIT |
| Backend | Rust | Node.js |
| Auto-update | Built-in updater | electron-updater |
| System tray | Native API | Tray API |
Tauri uses the OS native webview (WebKit on macOS, WebView2 on Windows, WebKitGTK on Linux) — no bundled Chromium. The app shell wraps the existing Nuxt web application, reusing ~95% of the UI code through packages/ui.
Architecture
```
apps/desktop/
  src-tauri/            # Rust backend
    src/
      main.rs           # Tauri app setup, window management
      tray.rs           # System tray with unread count badge
      notifications.rs  # Native OS notifications
      updater.rs        # Auto-update via Tauri updater plugin
    Cargo.toml
  src/                  # Frontend (shares web app code)
    main.ts             # Tauri-specific entry, Convex client setup
  tauri.conf.json       # Window config, permissions, deep links
  package.json
```
Key features
System tray — persistent tray icon showing unread count from the verification queue. Clicking opens the app to the inbox view.
Native notifications — when a new item enters the verification queue or a colleague sends a chat message, the app raises a native OS notification. Notifications are driven by Convex reactive queries: the desktop app subscribes to unread-count changes.
Deep links — owlat://thread/{threadId} opens a specific conversation. Deep links work from email notifications, browser bookmarks, and external tools.
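A minimal sketch of how the app might turn an incoming deep link into a route. The helper name `parseOwlatLink` and the return shape are assumptions for illustration; only the `owlat://thread/{threadId}` scheme comes from the text above.

```typescript
// Hypothetical deep-link parser — not the actual implementation.
type DeepLink = { kind: 'thread'; threadId: string } | null;

function tryParseUrl(url: string): URL | null {
  try {
    return new URL(url);
  } catch {
    return null; // not a valid URL at all
  }
}

function parseOwlatLink(url: string): DeepLink {
  const parsed = tryParseUrl(url);
  if (!parsed || parsed.protocol !== 'owlat:') return null;
  // owlat://thread/{threadId} → hostname 'thread', pathname '/{threadId}'
  if (parsed.hostname === 'thread') {
    const threadId = parsed.pathname.replace(/^\//, '');
    return threadId ? { kind: 'thread', threadId } : null;
  }
  return null; // unknown link kind
}
```

Rejecting unknown schemes and empty IDs up front keeps the handler safe when links arrive from untrusted sources such as email clients.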
Owlat as a channel — the desktop app is a first-class channel adapter in the same architecture. Internal chat messages flow through the same unifiedMessages table and the same agent pipeline as email or SMS. The chat adapter is native to Convex — messages are Convex mutations, delivery is real-time subscriptions. No WebSocket server needed beyond what Convex already provides.
Internal chat
Team communication within the desktop app:
- Direct messages — one-to-one conversations between organization members
- Channels — topic-based group conversations (similar to Slack channels)
- Thread-linked — every conversation can reference a conversationThread, linking internal discussion to customer communication
Internal chat messages are unifiedMessages with channel: 'chat' and memberId instead of contactId. The Knowledge Graph extracts knowledge from internal conversations the same way it does from customer emails.
Quick queries
Organization members can ask the system questions directly from the desktop app:
- "What is our current MRR?" → queries billing data
- "When did we last talk to Acme Corp?" → queries conversation threads
- "Show me the contract we signed with them" → semantic file search
Quick queries route through the Agent Pipeline with a specialized "query" classification. The agent retrieves context from the Knowledge Graph and file system, generates an answer with source citations, and renders it inline.
Visualization Agent
A specialized agent that takes data and builds interactive visualizations. It operates within the same Agent Pipeline — its outputs are artifacts that can land in the verification queue, render in conversations, or be pinned to dashboards.
How it works
User: "Show me our email delivery rates for the last 30 days"
→ Agent Pipeline classifies as visualization request
→ Visualization agent queries emailSends for the time range
→ Generates self-contained HTML/CSS/JS
→ Frontend renders in a sandboxed iframe
→ User can interact: hover for details, filter, change time range
The visualization agent generates raw HTML, CSS, and JavaScript — giving it full creative flexibility to produce any visual output. Unlike constrained charting libraries, this approach lets the agent build exactly what the data needs: charts, dashboards, data tables, animated progress trackers, interactive maps, or completely custom visualizations.
- Full flexibility — the agent writes HTML/CSS/JS directly, not limited to a charting library's vocabulary
- Interactive — JavaScript enables hover tooltips, click filtering, animated transitions, real-time updates
- Portable — visualizations are self-contained HTML bundles that can be saved, shared, embedded in reports, or pinned to dashboards
- Sandboxed — rendered in a sandboxed `<iframe>` with `sandbox="allow-scripts"` — no access to the parent page, Convex client, cookies, or navigation. The iframe communicates only via `postMessage` for resize events
Agent-generated code runs in a sandboxed iframe with no access to the host application. The sandbox attribute blocks top-navigation, form submission, popups, and same-origin access. Only allow-scripts is enabled so the visualization's own JavaScript can execute. This prevents any injected code from accessing user sessions, Convex data, or the DOM of the parent application.
Schema
```typescript
visualizations: defineTable({
  organizationId: v.string(),
  title: v.string(),
  description: v.optional(v.string()),
  html: v.string(), // Self-contained HTML document (HTML + CSS + JS)
  dataQuery: v.optional(v.string()), // Convex query to refresh data
  pinned: v.boolean(), // Pinned to dashboard
  createdBy: v.string(), // User or agent ID
  threadId: v.optional(v.id('conversationThreads')),
  createdAt: v.number(),
  updatedAt: v.number(),
})
  .index('by_organization', ['organizationId'])
  .index('by_organization_pinned', ['organizationId', 'pinned'])
```
Rendering
The frontend renders visualizations via a sandboxed iframe:
```html
<iframe
  :srcdoc="visualization.html"
  sandbox="allow-scripts"
  referrerpolicy="no-referrer"
  style="width: 100%; border: none;"
/>
```
The iframe uses srcdoc (no network fetch) and sandbox="allow-scripts" (JS executes, but no DOM access to the parent). A postMessage listener handles resize events so the iframe height adapts to content.
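The parent-side half of that resize handshake could look like the sketch below. The message shape (`{ type: 'owlat:resize', height }`) and the 4000px clamp are assumptions, not the documented protocol; the point is that messages from agent-generated code are validated before they touch layout.

```typescript
// Hypothetical resize-message validator for the postMessage listener.
type ResizeMessage = { type: 'owlat:resize'; height: number };

function parseResizeMessage(data: unknown): ResizeMessage | null {
  if (typeof data !== 'object' || data === null) return null;
  const msg = data as Record<string, unknown>;
  if (msg.type !== 'owlat:resize') return null;
  if (typeof msg.height !== 'number' || !Number.isFinite(msg.height) || msg.height < 0) {
    return null; // reject malformed or hostile payloads
  }
  // Clamp so a misbehaving visualization can't blow up the page layout
  return { type: 'owlat:resize', height: Math.min(msg.height, 4000) };
}

// In the browser this would be wired up roughly as:
// window.addEventListener('message', (e) => {
//   const msg = parseResizeMessage(e.data);
//   if (msg) iframe.style.height = `${msg.height}px`;
// });
```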
Adaptive Dashboard
The dashboard is not a static grid of widgets. It adapts to what the user needs right now — different in the morning than in the evening, different on Monday than on Friday, different for a support lead than for a marketing manager.
Context signals
The dashboard assembles itself from context signals:
| Signal | What it tells us | Example effect |
|---|---|---|
| Time of day | Morning = planning, evening = review | Morning: today's scheduled campaigns, overnight inbound queue. Evening: today's performance summary, pending items for tomorrow |
| Day of week | Monday = catch-up, Friday = wrap-up | Monday: weekend inbound backlog, week's campaign schedule. Friday: weekly metrics, unresolved threads |
| Role | What the user is responsible for | Support lead sees queue depth and SLA status. Marketing manager sees campaign performance and audience growth |
| Recent activity | What the user has been working on | If you spent the last hour on a campaign, the dashboard surfaces its real-time delivery stats |
| Pending items | What needs attention | Verification queue items, campaigns waiting for approval, threads assigned to you |
| Anomalies | What's unusual right now | Bounce rate spike, unusual inbound volume, delivery issues with a specific ISP |
How it works
The dashboard is composed of cards — each card is a self-contained unit that fetches its own data via Convex reactive queries. The dashboard layout engine decides which cards to show, in what order, based on the context signals above.
```typescript
dashboardLayouts: defineTable({
  organizationId: v.string(),
  memberId: v.string(), // Per-user layout
  // Context-driven layout rules
  rules: v.array(v.object({
    condition: v.object({
      timeRange: v.optional(v.object({ // e.g., { start: '06:00', end: '12:00' }
        start: v.string(),
        end: v.string(),
      })),
      dayOfWeek: v.optional(v.array(v.number())), // 0 = Sun, 1 = Mon, etc.
      role: v.optional(v.string()),
    }),
    cards: v.array(v.object({
      type: v.string(), // 'verification_queue', 'campaign_performance', 'inbound_summary', etc.
      size: v.union(v.literal('small'), v.literal('medium'), v.literal('large')),
      config: v.optional(v.string()), // Card-specific config (JSON)
    })),
    priority: v.number(), // Higher-priority rules override lower ones
  })),
  // Pinned cards always show regardless of context
  pinnedCards: v.optional(v.array(v.object({
    type: v.string(),
    size: v.union(v.literal('small'), v.literal('medium'), v.literal('large')),
    config: v.optional(v.string()),
  }))),
  updatedAt: v.number(),
})
  .index('by_member', ['organizationId', 'memberId'])
```
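One plausible shape for the layout engine's rule matching, using simplified versions of the schema's types. The function names and the "highest priority wins" tie-break are assumptions; the condition fields (time range, day of week, role) mirror the schema.

```typescript
// Illustrative rule matcher — a sketch, not the actual layout engine.
type Card = { type: string; size: 'small' | 'medium' | 'large' };
type Rule = {
  condition: { timeRange?: { start: string; end: string }; dayOfWeek?: number[]; role?: string };
  cards: Card[];
  priority: number;
};

function matchesCondition(rule: Rule, now: Date, role: string): boolean {
  const { timeRange, dayOfWeek, role: wantRole } = rule.condition;
  if (timeRange) {
    // 'HH:MM' strings compare correctly lexicographically
    const hhmm = now.toTimeString().slice(0, 5);
    if (hhmm < timeRange.start || hhmm >= timeRange.end) return false;
  }
  if (dayOfWeek && !dayOfWeek.includes(now.getDay())) return false;
  if (wantRole && wantRole !== role) return false;
  return true;
}

function selectCards(rules: Rule[], pinned: Card[], now: Date, role: string): Card[] {
  const matched = rules
    .filter((r) => matchesCondition(r, now, role))
    .sort((a, b) => b.priority - a.priority)[0]; // highest-priority matching rule wins
  return [...pinned, ...(matched?.cards ?? [])]; // pinned cards always show
}
```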
Card types
| Card | What it shows |
|---|---|
| verification_queue | Pending items count, oldest item age, category breakdown |
| campaign_performance | Active/recent campaign metrics (opens, clicks, delivery rate) |
| inbound_summary | Inbound volume, auto-resolved vs human-reviewed, avg response time |
| audience_growth | Contact growth trend, topic subscription changes |
| anomaly_alert | Bounce spikes, delivery issues, unusual patterns |
| scheduled_campaigns | Upcoming sends with countdown timers |
| thread_assignments | Open threads assigned to this user, by priority |
| weekly_summary | Week-over-week comparison of key metrics |
| knowledge_recent | Recently extracted knowledge entries (new facts, decisions) |
| visualization | A pinned visualization (renders the sandboxed HTML/CSS/JS) |
Smart defaults
New users get a sensible default layout generated from their role and the organization's active features. The agent can also suggest layout changes:
"You check campaign performance every morning but it's at the bottom of your dashboard. Want me to move it to the top for your morning view?"
Users can also manually drag, resize, pin, or remove cards — and create multiple named layouts they switch between. The adaptive engine learns from usage patterns: cards the user always expands get promoted, cards they always collapse get demoted or hidden.
Agent-generated dashboard cards
The Visualization Agent can produce cards that appear on the dashboard. When a user asks "Show me weekly churn by segment" and pins the result, it becomes a live dashboard card — re-querying the data on each load and re-rendering in its sandboxed iframe.
Agent Health & Monitoring
As the agent pipeline processes messages across multiple organizations concurrently, the system needs centralized monitoring — not just for debugging, but for safety. A misbehaving LLM provider, a spike in low-confidence classifications, or a surge in rejections is a signal that requires an automated response.
Metrics
The monitoring system tracks per-organization metrics via Convex scheduled functions:
| Metric | What it measures | How it is used |
|---|---|---|
| Queue depth | Unprocessed inbound messages | Alerts when backlog grows beyond threshold |
| Processing latency | Time from message receipt to draft ready | Detects LLM provider slowdowns |
| Classification accuracy | Human corrections vs auto-classifications | Tracks model quality over time |
| Auto-approve ratio | Auto-approved vs human-reviewed | Shows autonomy adoption |
| Rejection rate | Drafts rejected by humans, by category | Identifies categories needing improvement |
| LLM cost | Token usage per organization, per step | Budget tracking and alerting |
| Error rate | Failed pipeline runs, by step | Detects systemic issues |
Circuit breakers
Automated safety mechanisms that activate when metrics cross thresholds:
LLM provider failure — if the LLM error rate exceeds 20% over a 5-minute window, the pipeline pauses auto-responses and queues all messages for human review. The circuit breaker resets after 5 successful calls. This prevents cascading failures from sending garbled responses.
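The failure breaker above could be sketched roughly as follows. The class name and internals are illustrative; only the numbers (20% error rate, 5-minute window, 5-success reset) come from the description.

```typescript
// Hypothetical circuit breaker for LLM provider failures — a sketch only.
class LlmCircuitBreaker {
  private events: { at: number; ok: boolean }[] = [];
  private tripped = false;
  private successStreak = 0;

  constructor(
    private windowMs = 5 * 60_000,       // rolling 5-minute window
    private errorRateThreshold = 0.2,    // trip above 20% errors
    private resetAfterSuccesses = 5,     // close again after 5 good calls
  ) {}

  record(ok: boolean, now = Date.now()): void {
    this.events.push({ at: now, ok });
    this.events = this.events.filter((e) => now - e.at <= this.windowMs);
    if (this.tripped) {
      // While open, only consecutive successes can close the breaker
      this.successStreak = ok ? this.successStreak + 1 : 0;
      if (this.successStreak >= this.resetAfterSuccesses) {
        this.tripped = false;
        this.successStreak = 0;
      }
      return;
    }
    const errors = this.events.filter((e) => !e.ok).length;
    if (this.events.length > 0 && errors / this.events.length > this.errorRateThreshold) {
      this.tripped = true; // pause auto-responses; queue everything for human review
      this.successStreak = 0;
    }
  }

  isOpen(): boolean {
    return this.tripped; // open = auto-responses paused
  }
}
```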
Confidence degradation — if the average classification confidence drops below 0.6 for an organization over the last 50 messages, the system alerts the admin. Common cause: the organization's communication patterns have shifted and the agent needs updated context or knowledge entries.
Rejection spike — if humans reject more than 40% of drafts in a category over the last 24 hours, the system automatically tightens the auto-approval threshold for that category and surfaces a recommendation: "Agent drafts for billing questions are being rejected frequently — consider adding more billing context to the knowledge base."
Rate limiting — per-organization caps on daily LLM calls prevent runaway costs. Configurable in agentConfig:
```typescript
rateLimits: v.optional(v.object({
  maxDailyLLMCalls: v.number(), // e.g., 1000
  maxConcurrentPipelines: v.number(), // e.g., 5
  alertThresholdPercent: v.number(), // e.g., 80 — alert at 80% of daily cap
})),
```
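A minimal sketch of how the pipeline could enforce that config before each LLM call. The field names match the `rateLimits` fragment above; the function and the three-way decision are assumptions.

```typescript
// Illustrative rate-limit gate — not the actual enforcement code.
type RateLimits = {
  maxDailyLLMCalls: number;
  maxConcurrentPipelines: number;
  alertThresholdPercent: number;
};
type RateDecision = 'allow' | 'allow_with_alert' | 'block';

function checkRateLimit(limits: RateLimits, callsToday: number, activePipelines: number): RateDecision {
  if (callsToday >= limits.maxDailyLLMCalls) return 'block';       // daily cap hit
  if (activePipelines >= limits.maxConcurrentPipelines) return 'block'; // concurrency cap hit
  const usedPercent = (callsToday / limits.maxDailyLLMCalls) * 100;
  // Past the alert threshold, still allow but notify the admin
  return usedPercent >= limits.alertThresholdPercent ? 'allow_with_alert' : 'allow';
}
```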
Dashboard integration
Agent health surfaces as dashboard cards in the Adaptive Dashboard:
| Card | What it shows |
|---|---|
| agent_health | Pipeline status (healthy/degraded/paused), active circuit breakers, error rate |
| processing_queue | Current queue depth, average processing time, oldest unprocessed message |
| cost_breakdown | LLM token usage by step (classification, drafting, extraction), daily/weekly trend |
| accuracy_trend | Classification accuracy over time, rejection rate by category, confidence distribution |
These cards are available to all roles but prioritized for admin users by the adaptive layout engine.
Graduated Autonomy
Organizations control how much decision-making they delegate to agents. The system earns trust incrementally.
Per-category rules
```typescript
autonomyRules: defineTable({
  organizationId: v.string(),
  category: v.string(), // "support", "sales", "billing", etc.
  autoApproveThreshold: v.number(), // Confidence threshold (0–1)
  maxDailyAutoActions: v.number(), // Safety cap
  requiresHumanAbove: v.optional(v.number()), // e.g., dollar amount
  enabled: v.boolean(),
  createdAt: v.number(),
  updatedAt: v.number(),
})
  .index('by_organization', ['organizationId'])
  .index('by_organization_and_category', ['organizationId', 'category'])
```
Example configuration
| Category | Threshold | Daily Cap | Notes |
|---|---|---|---|
| Simple acknowledgments | 0.95 | 50 | "Thanks, we'll look into it" |
| Support FAQ | 0.90 | 30 | Standard answers with data lookup |
| Billing questions | 0.85 | 20 | Account-specific responses |
| Sales inquiries | — | — | Always human review |
| Complaints | — | — | Always human review + escalation |
The Agent Pipeline's routing step (Step 5) consults these rules instead of a single global threshold. As confidence in the system grows, organizations expand the boundaries — enabling auto-approval for more categories and lowering thresholds.
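The routing decision could be sketched as below. The rule shape follows the `autonomyRules` schema; the function name, the decision enum, and the "no rule means human review" default are assumptions consistent with the example table (sales and complaints have no threshold, so they always route to a human).

```typescript
// Illustrative Step-5 routing under per-category autonomy rules.
type AutonomyRule = {
  category: string;
  autoApproveThreshold: number;
  maxDailyAutoActions: number;
  requiresHumanAbove?: number; // e.g., dollar amount
  enabled: boolean;
};

function routeDraft(
  rule: AutonomyRule | undefined,
  confidence: number,
  autoActionsToday: number,
  amount?: number, // monetary value involved, if any
): 'auto_approve' | 'human_review' {
  if (!rule || !rule.enabled) return 'human_review'; // no rule = always human
  if (autoActionsToday >= rule.maxDailyAutoActions) return 'human_review'; // safety cap
  if (rule.requiresHumanAbove !== undefined && amount !== undefined && amount > rule.requiresHumanAbove) {
    return 'human_review'; // high-value action needs a person
  }
  return confidence >= rule.autoApproveThreshold ? 'auto_approve' : 'human_review';
}
```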
Feedback loop
When a human rejects an agent draft:
- The rejection reason is stored
- The agent's confidence calibration adjusts for similar future messages
- Recurring rejections for a category automatically tighten the threshold
- The system surfaces patterns: "Agent drafts for billing questions are rejected 30% of the time — consider adding more context to the billing knowledge base"
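The automatic tightening in that loop could be as simple as the sketch below. The 40% trigger matches the rejection-spike breaker described earlier; the 0.05 step size and the cap at 1.0 are assumptions for illustration.

```typescript
// Hypothetical threshold adjustment on a rejection spike — a sketch only.
function tightenThreshold(
  currentThreshold: number,
  rejectionRate: number,
  spikeThreshold = 0.4, // tighten when >40% of drafts are rejected
  step = 0.05,          // assumed step size
): number {
  if (rejectionRate <= spikeThreshold) return currentThreshold;
  // Raise the confidence bar; at 1.0 effectively everything goes to a human
  return Math.min(1, currentThreshold + step);
}
```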
Coding Agents
The most experimental phase: agents that take feature requests and produce working code.
Architecture
Coding agents run as a Docker sidecar — not within Convex — because they need file system access and git operations:
```
Feature request classified by Agent Pipeline
  → agentActions entry with type 'code_request'
  → Code worker picks up the task:
      1. Creates a branch
      2. AI SDK generates code with tool calling (read files, write files, run tests)
      3. Runs test suite
      4. Creates a PR with full context
  → PR link posted to verification queue
  → Developer reviews, improves, merges
```
The code worker communicates with Convex via the client SDK — reading task state, updating progress, posting results. It is an optional service in the Docker Compose configuration:
```yaml
code-worker:
  build: ./apps/code-worker
  volumes:
    - workspace:/workspace
  environment:
    - CONVEX_URL=http://convex:3210
    - LLM_PROVIDER=${LLM_PROVIDER}
    - LLM_BASE_URL=${LLM_BASE_URL}
    - LLM_API_KEY=${LLM_API_KEY}
    - LLM_MODEL=${LLM_MODEL}
  profiles:
    - dev # Only enabled for development-focused deployments
```
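The worker's interaction with Convex could be sketched as a polling loop. Everything here is an assumption: the task shape, the interface method names (`claimNextTask`, etc.), and the injection of the client as an interface so the loop logic stays testable without a running Convex deployment.

```typescript
// Hypothetical code-worker loop — illustrative only, not the real service.
type CodeTask = { id: string; request: string };

interface TaskClient {
  claimNextTask(): Promise<CodeTask | null>;
  reportProgress(id: string, status: string): Promise<void>;
  postResult(id: string, prUrl: string): Promise<void>;
}

// One iteration: claim a task, generate a PR, report back. Returns
// false when the queue is empty so a caller can back off and retry.
async function runOnce(
  client: TaskClient,
  generatePr: (t: CodeTask) => Promise<string>, // branch → codegen → tests → PR
): Promise<boolean> {
  const task = await client.claimNextTask();
  if (!task) return false; // nothing to do
  await client.reportProgress(task.id, 'generating');
  const prUrl = await generatePr(task);
  await client.postResult(task.id, prUrl); // surfaces in the verification queue
  return true;
}
```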
Coding agents are the furthest-out part of the vision. The architecture is designed to support them, but the implementation will evolve significantly based on advances in AI code generation capabilities.