Automation Internals
How the automation run engine works: the step walker, the lifecycle state machine, trigger fanout, the three step types, and the resilience cron.
This page is the developer reference for how trigger-based automations execute under the hood. For the product-level walkthrough of building automations in the UI, see Automations. Everything here lives in apps/api/convex/automations/.
An automation is one row in the automations table: a triggerType, an optional triggerConfig, a status, and an ordered list of automationSteps. When a trigger fires for a contact, the trigger fanout inserts an automationRuns row and schedules the step walker, which executes one step at a time and records each as an automationStepRuns row.
The run engine (step walker)
The walker is apps/api/convex/automations/stepWalker.ts — a 'use node' module that owns the actual execution of an automation. It is the single entry point for running steps; per-step logic lives in step modules it dispatches to.
A run advances one step per executeStep invocation. The walker contains no if (step.stepType === ...) branching: it looks up the step's module via stepModuleFor(step.stepType) and calls parseConfig, then execute. The module returns a StepOutcome of either { status: 'completed', emailSendId?, nextStepIndex? } or { status: 'failed', error }.
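The registry-style dispatch described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the real walker: only StepOutcome's fields, stepModuleFor, parseConfig, and execute come from this page, while the StepModule interface, the registry object, the toy delay module, and runStep are hypothetical names chosen for the example.

```typescript
// Outcome shape as described on this page.
type StepOutcome =
  | { status: 'completed'; emailSendId?: string; nextStepIndex?: number }
  | { status: 'failed'; error: string };

// Hypothetical module contract: the real one is likely async and takes more context.
interface StepModule<C> {
  parseConfig(raw: unknown): C;
  execute(config: C): StepOutcome;
}

// Toy delay module: by the time it executes the wait is over, so it just completes.
const delayModule: StepModule<{ duration: number }> = {
  parseConfig(raw) {
    const duration = (raw as { duration?: unknown } | null)?.duration;
    if (typeof duration !== 'number') {
      throw new Error('delay config needs a numeric duration');
    }
    return { duration };
  },
  execute: () => ({ status: 'completed' }),
};

// Hypothetical registry keyed by stepType.
const registry: Record<string, StepModule<any>> = { delay: delayModule };

function stepModuleFor(stepType: string): StepModule<any> {
  const mod = registry[stepType];
  if (!mod) throw new Error(`Unknown step type: ${stepType}`);
  return mod;
}

// The caller never branches on the step type; a thrown error becomes a failed outcome.
function runStep(step: { stepType: string; config: unknown }): StepOutcome {
  try {
    const mod = stepModuleFor(step.stepType);
    return mod.execute(mod.parseConfig(step.config));
  } catch (err) {
    return { status: 'failed', error: err instanceof Error ? err.message : String(err) };
  }
}
```

Adding a step kind in this shape means registering one more module; runStep itself stays unchanged.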
Atomic step claim
Two independent schedulers can target the same pending step: the original ctx.scheduler.runAfter(...) from when the step was scheduled, and the process pending delays cron (which re-dispatches any pending step whose delay has elapsed). To stop a delay-gated email going out twice, a fresh dispatch (retryCount === 0) must atomically claim the step before executing.
The claim is markStepExecuting in apps/api/convex/automations/stepExecutorQueries.ts — a single pending → executing compare-and-set. Only the first caller wins; a second invocation sees executing (or a later status), gets { claimed: false }, and silently drops. Retries (retryCount > 0) re-enter on a step run this same chain already owns, so they skip the claim.
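As an illustration of the claim rule, here is a small in-memory sketch of a pending → executing compare-and-set. The names claimStep and shouldExecute are hypothetical; the real markStepExecuting operates on a database row, and the assumption here is that its read-then-write is made atomic by running inside a single mutation.

```typescript
type StepRunStatus = 'pending' | 'executing' | 'completed' | 'failed' | 'skipped';

interface StepRun {
  status: StepRunStatus;
}

// Compare-and-set: only a row that is still 'pending' can be claimed.
function claimStep(stepRun: StepRun): { claimed: boolean } {
  if (stepRun.status !== 'pending') return { claimed: false };
  stepRun.status = 'executing';
  return { claimed: true };
}

// Fresh dispatches (retryCount === 0) must win the claim;
// retries re-enter a step run their chain already owns, so they skip it.
function shouldExecute(stepRun: StepRun, retryCount: number): boolean {
  if (retryCount > 0) return true;
  return claimStep(stepRun).claimed;
}
```

If the original scheduler call and the cron both dispatch the same step, the first shouldExecute call returns true and the second returns false, so the email goes out once.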
Retries
A step whose module returns failed (or throws) is retried with a fixed backoff schedule before the run is abandoned:
| Attempt | Delay before retry |
|---|---|
| 1st retry | 1s |
| 2nd retry | 5s |
| 3rd retry | 30s |
| After 3 retries | step marked failed, run cancelled |
MAX_RETRY_ATTEMPTS is 3 and RETRY_DELAYS_MS is [1000, 5000, 30000] (ms), both in apps/api/convex/lib/constants.ts. On the final failure the walker calls markStepFailed and then cancelAutomationRun — a step that exhausts its retries cancels the whole run. The walker also calls internal.automations.lifecycle.recordRunFailure; after 5 consecutive run failures (AUTOMATION_FAILURE_BREAKER_THRESHOLD) the automation is auto-paused via the circuit breaker.
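The retry table can be expressed as a small decision function. The two constants mirror the values quoted above; decideRetry and the RetryDecision type are hypothetical names, and the assumption is that retryCount counts the retries already attempted for the step run.

```typescript
const MAX_RETRY_ATTEMPTS = 3;
const RETRY_DELAYS_MS = [1000, 5000, 30000];

type RetryDecision = { retry: true; delayMs: number } | { retry: false };

// retryCount: number of retries already made for this step run (assumption).
function decideRetry(retryCount: number): RetryDecision {
  const delayMs = RETRY_DELAYS_MS[retryCount];
  if (retryCount >= MAX_RETRY_ATTEMPTS || delayMs === undefined) {
    // Retries exhausted: the caller marks the step failed and cancels the run.
    return { retry: false };
  }
  return { retry: true, delayMs };
}
```

With these values a step that keeps failing waits 1 s, 5 s, and 30 s across its three retries before the run is cancelled.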
Loop cap
A condition step can branch to any target index, including an earlier step (the editor allows it). Without a guard, a backward branch would loop forever and re-send the email step on every pass. The walker enforces MAX_STEPS_PER_RUN = 100: each successful claim bumps automationRuns.stepsExecuted, and when that count exceeds the cap the run is marked failed and cancelled with the message Automation exceeded 100 step executions — cancelled to prevent a loop. 100 is far above any legitimate linear automation length.
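A rough sketch of the cap check, assuming the counter is bumped once per successful claim and compared with a strict "greater than". The function name recordClaimAndCheckCap and the in-memory RunCounter shape are hypothetical.

```typescript
const MAX_STEPS_PER_RUN = 100;

interface RunCounter {
  stepsExecuted: number;
  status: 'running' | 'completed' | 'cancelled';
}

// Call once per successful step claim. Returns false when the cap is exceeded,
// at which point the run is cancelled instead of executing the step.
function recordClaimAndCheckCap(run: RunCounter): boolean {
  run.stepsExecuted += 1;
  if (run.stepsExecuted > MAX_STEPS_PER_RUN) {
    run.status = 'cancelled';
    return false;
  }
  return true;
}
```

Under these assumptions a two-step automation whose condition always branches back to step 0 executes exactly 100 steps, and the 101st claim cancels the run.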
Active-status guard
executeStep re-reads the parent automation on every step. If automation.status !== 'active', it marks the current step failed and cancels the run.
Pausing (or reverting) an automation does not retroactively touch in-flight runs, but the next step that comes due for any running run will find the automation non-active and cancel that run. In practice, pausing an automation drains its running cohort as each contact's next step fires — it is not a freeze-and-resume.
Automation lifecycle state machine
The automations.status machine lives in apps/api/convex/automations/lifecycle.ts — the single writer of status and its companion fields activatedAt, pausedAt, and updatedAt (ADR-0024). Three states, four legal edges:
draft → active (activate; validates trigger config + ≥1 step)
active → paused (pause)
paused → active (resume; re-validates trigger config + ≥1 step)
paused → draft (revertToDraft)
active → draft is refused as illegal_edge — admins must pause first. The legal-edge graph is a single LEGAL_EDGES constant rather than scattered if (status !== ...) checks.
The public mutations (activate, pause, resume, revertToDraft in automations/automations.ts) are thin auth shells: they check the automations:manage permission and dispatch to internal.automations.lifecycle.transition, which returns a typed outcome rather than throwing. The shell's reasonToMessage maps the typed reason to a user-facing string.
Transition outcomes
| reason (on ok: false) | Meaning | Shell message |
|---|---|---|
| automation_not_found | Row missing | "Automation not found" |
| illegal_edge | Edge not in LEGAL_EDGES | "Automation is not in a state that allows this transition" |
| no_steps | → active with zero steps | "Automation must have at least one step to be activated" |
| invalid_trigger_config | → active with missing trigger config | "Automation trigger is missing required configuration" |
A duplicate same-state attempt (from === to) is idempotent: it returns { ok: true, applied: 'recorded' } and writes an audit-log row marked no_op: true, but it does not patch the row or emit a PostHog event.
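The edge graph and typed outcomes can be sketched as below. The reason strings, the four legal edges, and applied: 'recorded' come from this page. The 'transitioned' value, the in-memory AutomationRow shape, and the order in which checks run are assumptions made for the example, and audit logging and PostHog calls are omitted.

```typescript
type Status = 'draft' | 'active' | 'paused';

type Reason = 'automation_not_found' | 'illegal_edge' | 'no_steps' | 'invalid_trigger_config';

type Outcome =
  | { ok: true; applied: 'transitioned' | 'recorded' } // 'transitioned' is a hypothetical label
  | { ok: false; reason: Reason };

// Single source of truth for the four legal edges.
const LEGAL_EDGES: Record<Status, readonly Status[]> = {
  draft: ['active'],
  active: ['paused'],
  paused: ['active', 'draft'],
};

// Hypothetical in-memory stand-in for an automations row plus its validation results.
interface AutomationRow {
  status: Status;
  stepCount: number;
  triggerConfigValid: boolean;
}

function transition(automation: AutomationRow | null, to: Status): Outcome {
  if (!automation) return { ok: false, reason: 'automation_not_found' };
  const from = automation.status;
  // Self-loop: recorded for audit purposes, but nothing is patched.
  if (from === to) return { ok: true, applied: 'recorded' };
  if (!LEGAL_EDGES[from].some((next) => next === to)) {
    return { ok: false, reason: 'illegal_edge' };
  }
  // Preconditions apply to every edge into 'active' (activate and resume).
  if (to === 'active') {
    if (automation.stepCount < 1) return { ok: false, reason: 'no_steps' };
    if (!automation.triggerConfigValid) return { ok: false, reason: 'invalid_trigger_config' };
  }
  automation.status = to;
  return { ok: true, applied: 'transitioned' };
}
```

Returning a typed outcome instead of throwing lets a thin caller map each reason to a user-facing message without parsing error strings.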
Effects per transition
Every real transition writes an audit-log row and a PostHog track_event; self-loops write only the audit row.
| Transition | Audit action | PostHog event |
|---|---|---|
| draft → active | automation.activated | automation_activated |
| active → paused | automation.paused | automation_paused |
| paused → active | automation.resumed | automation_resumed |
| paused → draft | automation.reverted_to_draft | automation_reverted_to_draft |
Any transition into active checks two preconditions before patching: at least one automationSteps row exists, and the trigger config is valid for the trigger type. Both checks run on draft → active and on paused → active, so a resume cannot silently re-enter a broken active state.
Lifetime stats counters (statsEntered, statsActive, statsCompleted) are not touched by the lifecycle — they are owned by the trigger fanout and by run completion/cancellation in stepExecutorQueries.ts.
Trigger fanout
Triggers are wired through per-kind modules registered in apps/api/convex/automations/triggers.ts. Four trigger kinds ship:
| Trigger kind | Fires when | triggerConfig |
|---|---|---|
| contact_created | A contact is created | none (always matches) |
| contact_updated | A watched property changes | { propertyKey } |
| event_received | A named event is sent for a contact | { eventName } |
| topic_subscribed | A contact subscribes to a topic | { topicId } |
The shared fireTrigger walker runs the fanout pipeline for one kind:
1. Fetch matching automations. Query active automations with this trigger via the by_status_trigger index (status = 'active', triggerType = kind).
2. Evaluate each. Per automation: narrow triggerConfig through the module's parseConfig, then evaluate module.matches(input, config). contact_created always matches; the other three compare the fired input against the persisted config.
3. Skip duplicates and empties. Skip the automation if the contact already has a running run for it (the by_automation_and_contact index filtered to status: 'running'), and skip if the automation has no steps.
4. Insert and schedule. Insert an automationRuns row at currentStepIndex: 0 with status: 'running', bump the statsEntered shard (statsActive is derived as entered − completed − cancelled by the rollup, not incremented), attach any triggerData the module built, and schedule internal.automations.stepWalker.startAutomationRun.
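The four fanout stages above can be sketched against in-memory arrays. This is an illustration only: the array filters stand in for the index queries, the AutomationRow and RunRow shapes are hypothetical, and stats bumps, triggerData, and the scheduler call are left out.

```typescript
interface AutomationRow {
  id: string;
  status: 'draft' | 'active' | 'paused';
  triggerType: string;
  triggerConfig?: unknown;
  stepCount: number;
}

interface RunRow {
  automationId: string;
  contactId: string;
  currentStepIndex: number;
  status: 'running' | 'completed' | 'cancelled';
}

// Per-kind trigger module: parseConfig and matches are the hooks named on this page.
interface TriggerModule<I, C> {
  parseConfig(raw: unknown): C;
  matches(input: I, config: C): boolean;
}

const eventReceived: TriggerModule<{ eventName: string }, { eventName: string }> = {
  parseConfig(raw) {
    const name = (raw as { eventName?: unknown } | undefined)?.eventName;
    if (typeof name !== 'string') throw new Error('event_received needs an eventName');
    return { eventName: name };
  },
  matches: (input, config) => input.eventName === config.eventName,
};

function fireTrigger<I, C>(
  kind: string,
  mod: TriggerModule<I, C>,
  input: I,
  contactId: string,
  automations: AutomationRow[],
  runs: RunRow[],
): RunRow[] {
  const started: RunRow[] = [];
  // 1. Fetch active automations for this trigger kind.
  const candidates = automations.filter((a) => a.status === 'active' && a.triggerType === kind);
  for (const a of candidates) {
    // 2. Narrow the persisted config, then evaluate the match.
    if (!mod.matches(input, mod.parseConfig(a.triggerConfig))) continue;
    // 3. Skip if the contact already has a running run, or the automation has no steps.
    const alreadyRunning = runs.some(
      (r) => r.automationId === a.id && r.contactId === contactId && r.status === 'running',
    );
    if (alreadyRunning || a.stepCount === 0) continue;
    // 4. Insert the run at step 0 (the real pipeline also schedules the walker here).
    const run: RunRow = { automationId: a.id, contactId, currentStepIndex: 0, status: 'running' };
    runs.push(run);
    started.push(run);
  }
  return started;
}
```

Firing the same event twice for one contact starts a run the first time and nothing the second time, because the running run from the first call is found in stage 3.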
The per-kind wrapper mutations (fireContactCreatedTrigger, fireContactUpdatedTrigger, fireEventReceivedTrigger, fireTopicSubscribedTrigger) are internal — they are called from the contact, topic, and event code paths. Events specifically come in through sendEvent (also internal), whose only public entry point is the API-key-authenticated POST /api/v1/events route; it is deliberately not on the public Convex client API so an anonymous caller cannot fabricate events.
startAutomationRun schedules step 0 (honoring its entry delay) and creates the first automationStepRuns row. From there, each completed step calls advanceToStep, which schedules the next step or, when the index runs past the last step, marks the run completed.
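The advance rule can be sketched as a pure function. The name advanceToStep comes from this page, but the signature, the RunProgress shape, and the Advance return type are hypothetical, and scheduling is reduced to a returned value.

```typescript
interface RunProgress {
  currentStepIndex: number;
  status: 'running' | 'completed' | 'cancelled';
}

type Advance = { kind: 'scheduled'; stepIndex: number } | { kind: 'completed' };

// Either move the run to the next step, or complete it when the index is past the last step.
function advanceToStep(run: RunProgress, nextStepIndex: number, stepCount: number): Advance {
  if (nextStepIndex >= stepCount) {
    run.status = 'completed';
    return { kind: 'completed' };
  }
  run.currentStepIndex = nextStepIndex;
  return { kind: 'scheduled', stepIndex: nextStepIndex };
}
```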
Step types
Three step kinds ship, registered in apps/api/convex/automations/steps.ts and each living under steps/<kind>/. The walker dispatches to them uniformly; only their parseConfig, execute, and optional entryDelay/enrichForQuery differ.
Email
steps/email/index.ts. Config is { emailTemplateId, subjectOverride? }. On execute it loads the template, resolves the org's default sender, composes the subject and body for the automation send kind (no tracking, no footer), and enqueues a transactionalSends row onto the transactional pool via internal.delivery.enqueue.enqueueNonCampaignSend (returning the enqueued send id as emailSendId). Provider resolution, dispatch, and the Send lifecycle transition all happen asynchronously on the worker, so completed here means the send was enqueued, not delivered.
Contacts that arrived via phone, SMS, WhatsApp, or generic channels have no email address. For those contacts the email step fails explicitly with Contact has no email address, so the run log records why no email was sent.
The email step is the only step that returns an emailSendId (the id of the enqueued Send row, not a provider message ID), which the walker stores on the step run via markStepCompleted. The step also implements enrichForQuery so getWithRelations can join the referenced template for the editor.
Delay
steps/delay/index.ts. Config is { duration, unit } where unit is minutes | hours | days | weeks. The delay step is the only kind that implements entryDelay: computeEntryDelay converts the config to milliseconds via delayConfigToMs, and the walker uses that as the look-ahead delay when scheduling the step.
The delay therefore happens before the step is dispatched, not during execution — by the time the delay step's execute runs, the wait has already elapsed, so execute is a no-op that immediately returns completed. The pending step's delayUntil is stamped on the automationStepRuns row so the cron can recover it.
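A minimal sketch of the unit conversion, assuming fixed-length units (a day is always 24 hours, with no calendar or time-zone handling). delayConfigToMs is named on this page; the MS_PER_UNIT table and the delayUntilFor helper are hypothetical.

```typescript
type DelayUnit = 'minutes' | 'hours' | 'days' | 'weeks';

interface DelayConfig {
  duration: number;
  unit: DelayUnit;
}

// Fixed-length units (assumption): no DST or calendar adjustments.
const MS_PER_UNIT: Record<DelayUnit, number> = {
  minutes: 60_000,
  hours: 3_600_000,
  days: 86_400_000,
  weeks: 604_800_000,
};

function delayConfigToMs(config: DelayConfig): number {
  return config.duration * MS_PER_UNIT[config.unit];
}

// Timestamp to stamp on the pending step run so a recovery job can find it later.
function delayUntilFor(config: DelayConfig, nowMs: number): number {
  return nowMs + delayConfigToMs(config);
}
```

For example, { duration: 2, unit: 'hours' } converts to 7,200,000 ms, which is both the look-ahead passed to the scheduler and the offset added to the current time for delayUntil.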
Condition
steps/condition/index.ts. Config is { condition, yesBranchStepIndex, noBranchStepIndex }. On execute it serializes the canonical Condition and evaluates it against the contact via internal.automations.steps.condition.queries.evaluateConditionForContact (an action can't read the DB directly, so evaluation runs in a query).
A truthy result branches to yesBranchStepIndex, a falsy result to noBranchStepIndex. The chosen index is returned as the outcome's nextStepIndex, overriding the default sequential currentStepIndex + 1. A null branch target means "fall through to the next sequential step." Because a branch can point backward, the condition step is exactly why the walker enforces the MAX_STEPS_PER_RUN loop cap.
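The branch selection reduces to a few lines. In this sketch the fall-through for a null target is resolved inside one hypothetical helper, resolveNextStepIndex; in the real code the module presumably returns nextStepIndex and the walker applies the sequential default.

```typescript
interface ConditionBranches {
  yesBranchStepIndex: number | null;
  noBranchStepIndex: number | null;
}

// Pick the branch for the evaluated result; a null target falls through to the next step.
function resolveNextStepIndex(
  branches: ConditionBranches,
  conditionResult: boolean,
  currentStepIndex: number,
): number {
  const target = conditionResult ? branches.yesBranchStepIndex : branches.noBranchStepIndex;
  return target ?? currentStepIndex + 1;
}
```

With { yesBranchStepIndex: 3, noBranchStepIndex: null } on a condition at index 1, a truthy result jumps to step 3 and a falsy result continues at step 2.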
Crons and resilience
The walker schedules each step's next step directly via ctx.scheduler.runAfter(...), so under normal operation no cron is needed to advance a run. The single safety-net cron is in apps/api/convex/crons.ts:
crons.interval(
  'process pending delays',
  { minutes: 5 },
  internal.automations.stepWalker.processPendingDelays,
);
processPendingDelays queries automationStepRuns via the by_status_and_delay_until index for rows that are still pending with a delayUntil in the past, and re-dispatches executeStep for each. This catches delay steps whose original runAfter was lost — for example, if the deployment was offline when the delay elapsed. Because a re-dispatch and the original schedule could both fire, this is precisely the duplicate the atomic step claim defends against: only one of them claims the pending → executing transition; the loser drops.
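The selection the cron performs can be sketched as a filter over in-memory rows. The array filter stands in for the by_status_and_delay_until index query, findOverduePending is a hypothetical name, and treating "in the past" as strictly less than now is an assumption.

```typescript
interface PendingCandidate {
  id: string;
  status: 'pending' | 'executing' | 'completed' | 'failed' | 'skipped';
  delayUntil?: number; // epoch ms; absent on steps without a delay
}

// Rows still pending whose delay has already elapsed are due for re-dispatch.
function findOverduePending(stepRuns: PendingCandidate[], nowMs: number): PendingCandidate[] {
  return stepRuns.filter(
    (r) => r.status === 'pending' && r.delayUntil !== undefined && r.delayUntil < nowMs,
  );
}
```

Each returned row would be re-dispatched to executeStep, where the atomic claim decides whether this dispatch or the original one actually runs the step.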
Run and step-run status spaces
automationRuns and automationStepRuns have their own status enums, distinct from the parent automation's lifecycle:
| Table | Statuses |
|---|---|
| automationRuns | running, completed, cancelled |
| automationStepRuns | pending, executing, completed, failed, skipped |
These are sibling state spaces — an in-flight running run is not a state of the parent automations.status machine. Run completion (completeAutomationRun) bumps the statsCompleted shard; cancellation (cancelAutomationRun) bumps the statsCancelled shard. statsActive is never written directly — it is derived as entered − completed − cancelled by the rollup.
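The derived counter can be illustrated with a small rollup function. The shard layout (one array of partial counts per counter) and the name rollupStatsActive are hypothetical; only the formula entered − completed − cancelled comes from this page.

```typescript
// Hypothetical shard layout: each counter is split into partial counts that are summed on read.
interface ShardedCounters {
  entered: number[];
  completed: number[];
  cancelled: number[];
}

const sumShards = (shards: number[]): number => shards.reduce((total, n) => total + n, 0);

// statsActive is never stored; it is derived from the three written counters.
function rollupStatsActive(counters: ShardedCounters): number {
  return sumShards(counters.entered) - sumShards(counters.completed) - sumShards(counters.cancelled);
}
```

Because nothing increments or decrements statsActive directly, a run that is cancelled after a crash cannot leave the active count permanently off by one; the value is recomputed from the counters on every rollup.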
A skipped step run is terminal and written by markStepsSkipped: when a condition step branches forward (its target is beyond the next sequential index), the walker records one skipped step run for each bypassed step before scheduling the target. This is the only producer of skipped — a backward branch or a normal +1 advance bypasses nothing. The funnel (getStepAnalytics / getAutomationStats) sums each step's statSkipped counter so the analytics reflect the steps a contact jumped over.
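Which steps count as bypassed follows directly from the indices. A minimal sketch, with bypassedStepIndices as a hypothetical helper name and no clamping to the automation's length:

```typescript
// Indices strictly between the current step and a forward branch target.
// A backward branch or a normal +1 advance yields an empty list.
function bypassedStepIndices(currentStepIndex: number, targetStepIndex: number): number[] {
  const bypassed: number[] = [];
  for (let i = currentStepIndex + 1; i < targetStepIndex; i++) {
    bypassed.push(i);
  }
  return bypassed;
}
```

A condition at index 1 that branches to index 4 bypasses steps 2 and 3, so two skipped step runs would be recorded before step 4 is scheduled.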