Wiring the dispatcher

This guide walks you through wiring up your dispatcher to kassette.

Before going any farther, please read Why kassette. It explains how durable execution for agentic workflows decomposes into these two parts:

  1. kassette provides at-most-once step journaling, and
  2. your dispatcher provides at-least-once invocation keyed by runId

Together these two pieces give effectively-once step completion. The dispatcher may invoke the same run more than once, but kassette replays the journal and skips work that already completed.

The dispatcher contract

The dispatcher has one job:

Invoke a run, and if the invocation crashes or doesn’t acknowledge before its deadline, invoke the same runId again.

Why this works. The dispatcher’s delivery key and the journal’s state key are the same runId, so at-least-once delivery composed with at-most-once journaling yields effectively-once step completion.

When to ack

Ack when kassette tells you there is nothing useful to retry.

What happenedWhat it meansShould ack?
result.status === 'success'The run completed successfullyyes
result.status === 'suspended'The run paused and is awaiting an eventyes
result.status === 'failed'The run failedyes
TerminalRunError, CancelledError, EventPendingErrorNothing to do, re-invocation throws identicallyyes
VersionMismatchError, MetadataMismatchError, ReplayMismatchError, JournalCorruptionErrorJournal is corrupted so replay will always failyes
Anything else (network blip, isolate crash, runtime tear-down)Transient. Replay on next invocation resumes from the journalno, re-deliver

As you can see above, the only case in which we “do not ack” is when the run has crashed or hung. In that case, we simply retry with the same runId, which kassette makes safe to do.

Dispatcher loop

Written out verbosely, the loop is:

  1. invoke the run,
  2. ack every settled result,
  3. report settled failures before acking, and
  4. retry only unknown/transient errors.
try {
  const result = op === 'start' ? await workflow.start(input, { runId }) : await workflow.resume(runId, event);

  // success, suspended, and failed are all settled results
  if (result.status === 'failed') {
    await reportError(runId, result.error);
  }

  msg.ack();
} catch (err) {
  // terminal run states, re-delivery won't change anything
  if (err instanceof TerminalRunError || err instanceof CancelledError || err instanceof EventPendingError) {
    msg.ack();
  } else if (
    // irrecoverable journal/replay problems
    err instanceof VersionMismatchError ||
    err instanceof MetadataMismatchError ||
    err instanceof ReplayMismatchError ||
    err instanceof JournalCorruptionError
  ) {
    await reportError(runId, err);
    msg.ack();
  } else {
    // transient failure so we should retry
    msg.retry({ delaySeconds: 2 });
  }
}

Why retries are safe

Retrying the same runId is safe because duplicate invocations map to the same journal.

If an older process resumes after a newer retry has taken over, kassette fences the zombie so it cannot append new journal entries. This protects the journal, but it doesn’t automatically make external side effects idempotent. If a process crashes after performing an external side effect but before kassette has recorded the step, retry may perform that side effect again. You should use idempotency keys for unsafe external operations.

See Correctness boundaries for the full guarantee.

Per-dispatcher mapping

Here’s how various dispatchers will map:

DispatcherAckRetry
Queuesack() / delete the messageretry, or let visibility timeout expire
Job processorsreturn from the handlerthrow
Container jobsexit 0exit non-zero
Postgres queuesdelete the row / returnlet the lock expire

Delays, events, and schedules

Some workflows will require the dispatcher to invoke the run later or from a different source.

For ctx.sleep(T), the dispatcher should invoke the run again after T has passed.

After calling ctx.suspend(), an external event such as a webhook, UI approval, or a callbacks should run workflow.resume(runId, event).

You can kick off scheduled runs via cron of course as well.

Dispatcher checklist

  1. Decide what you’ll use for the dispatcher, eg: queue, job processor, container job, Postgres queue, or similar.
  2. Mint a runId before dispatching so the first invocation is already idempotent.
  3. Invoke start(input, { runId }) or resume(runId, event).
  4. Ack successful, suspended, failed, cancelled, and otherwise terminal outcomes.
  5. Do not ack transient failures, the dispatcher should retry the same runId.
  6. Tune the recovery window: visibility timeout, lock duration, or equivalent.
  7. Set a retry cap so failed runs don’t retry forever.

Reference implementation

See examples/cloudflare-queue.

This example uses a Cloudflare Worker journaling to R2, with Cloudflare Queues providing redelivery. The queue visibility timeout acts as the failure detector: if the worker does not ack, the message is redelivered to a fresh isolate, which replays completed steps from R2 and continues the run.