Wiring the dispatcher

This guide walks you through wiring up your dispatcher to kassette.

Before going any farther, please read Why kassette. It explains how durable execution for agentic workflows decomposes into these two parts:

kassette provides at-most-once step journaling, and
your dispatcher provides at-least-once invocation keyed by runId

Together these two pieces give effectively-once step completion. The dispatcher may invoke the same run more than once, but kassette replays the journal and skips work that already completed.

The dispatcher contract

The dispatcher has one job:

Invoke a run, and if the invocation crashes or doesn’t acknowledge before its deadline, invoke the same runId again.

Why this works. The dispatcher’s delivery key and the journal’s state key are the same runId, so at-least-once delivery composed with at-most-once journaling yields effectively-once step completion.

When to ack

Ack when kassette tells you there is nothing useful to retry.

What happened	What it means	Should ack?
`result.status === 'success'`	The run completed successfully	yes
`result.status === 'suspended'`	The run paused and is awaiting an event	yes
`result.status === 'failed'`	The run failed	yes
`TerminalRunError`, `CancelledError`, `EventPendingError`	Nothing to do, re-invocation throws identically	yes
`VersionMismatchError`, `MetadataMismatchError`, `ReplayMismatchError`, `JournalCorruptionError`	Journal is corrupted so replay will always fail	yes
Anything else (network blip, isolate crash, runtime tear-down)	Transient. Replay on next invocation resumes from the journal	no, re-deliver

As you can see above, the only case in which we “do not ack” is when the run has crashed or hung. In that case, we simply retry with the same runId, which kassette makes safe to do.

Dispatcher loop

Written out verbosely, the loop is:

invoke the run,
ack every settled result,
report settled failures before acking, and
retry only unknown/transient errors.

try {
  const result = op === 'start' ? await workflow.start(input, { runId }) : await workflow.resume(runId, event);

  // success, suspended, and failed are all settled results
  if (result.status === 'failed') {
    await reportError(runId, result.error);
  }

  msg.ack();
} catch (err) {
  // terminal run states, re-delivery won't change anything
  if (err instanceof TerminalRunError || err instanceof CancelledError || err instanceof EventPendingError) {
    msg.ack();
  } else if (
    // irrecoverable journal/replay problems
    err instanceof VersionMismatchError ||
    err instanceof MetadataMismatchError ||
    err instanceof ReplayMismatchError ||
    err instanceof JournalCorruptionError
  ) {
    await reportError(runId, err);
    msg.ack();
  } else {
    // transient failure so we should retry
    msg.retry({ delaySeconds: 2 });
  }
}

Why retries are safe

Retrying the same runId is safe because duplicate invocations map to the same journal.

If an older process resumes after a newer retry has taken over, kassette fences the zombie so it cannot append new journal entries. This protects the journal, but it doesn’t automatically make external side effects idempotent. If a process crashes after performing an external side effect but before kassette has recorded the step, retry may perform that side effect again. You should use idempotency keys for unsafe external operations.

See Correctness boundaries for the full guarantee.

Per-dispatcher mapping

Here’s how various dispatchers will map:

Dispatcher	Ack	Retry
Queues	`ack()` / delete the message	retry, or let visibility timeout expire
Job processors	return from the handler	throw
Container jobs	`exit 0`	exit non-zero
Postgres queues	delete the row / return	let the lock expire

Delays, events, and schedules

Some workflows will require the dispatcher to invoke the run later or from a different source.

For ctx.sleep(T), the dispatcher should invoke the run again after T has passed.

After calling ctx.suspend(), an external event such as a webhook, UI approval, or a callbacks should run workflow.resume(runId, event).

You can kick off scheduled runs via cron of course as well.

Dispatcher checklist

Decide what you’ll use for the dispatcher, eg: queue, job processor, container job, Postgres queue, or similar.
Mint a runId before dispatching so the first invocation is already idempotent.
Invoke start(input, { runId }) or resume(runId, event).
Ack successful, suspended, failed, cancelled, and otherwise terminal outcomes.
Do not ack transient failures, the dispatcher should retry the same runId.
Tune the recovery window: visibility timeout, lock duration, or equivalent.
Set a retry cap so failed runs don’t retry forever.

Reference implementation

See examples/cloudflare-queue.

This example uses a Cloudflare Worker journaling to R2, with Cloudflare Queues providing redelivery. If the consumer crashes or Cloudflare terminates it for exceeding a runtime limit, the unacknowledged message is delivered to another invocation according to the queue’s retry configuration. That invocation replays completed steps from R2 and continues the run.

Wiring the dispatcher#

The dispatcher contract#

When to ack#

Dispatcher loop#

Why retries are safe#

Per-dispatcher mapping#

Delays, events, and schedules#

Dispatcher checklist#

Reference implementation#