Wiring the dispatcher
This guide walks you through wiring up your dispatcher to kassette.
Before going any farther, please read Why kassette. It explains how durable execution for agentic workflows decomposes into these two parts:
- kassette provides at-most-once step journaling, and
- your dispatcher provides at-least-once invocation keyed by
runId
Together these two pieces give effectively-once step completion. The dispatcher may invoke the same run more than once, but kassette replays the journal and skips work that already completed.
The dispatcher contract
The dispatcher has one job:
Invoke a run, and if the invocation crashes or doesn’t acknowledge before its deadline, invoke the same
runIdagain.
Why this works. The dispatcher’s delivery key and the journal’s state key are the same runId, so at-least-once delivery composed with at-most-once journaling yields effectively-once step completion.
When to ack
Ack when kassette tells you there is nothing useful to retry.
| What happened | What it means | Should ack? |
|---|---|---|
result.status === 'success' | The run completed successfully | yes |
result.status === 'suspended' | The run paused and is awaiting an event | yes |
result.status === 'failed' | The run failed | yes |
TerminalRunError, CancelledError, EventPendingError | Nothing to do, re-invocation throws identically | yes |
VersionMismatchError, MetadataMismatchError, ReplayMismatchError, JournalCorruptionError | Journal is corrupted so replay will always fail | yes |
| Anything else (network blip, isolate crash, runtime tear-down) | Transient. Replay on next invocation resumes from the journal | no, re-deliver |
As you can see above, the only case in which we “do not ack” is when the run has crashed or hung. In that case, we simply retry with the same runId, which kassette makes safe to do.
Dispatcher loop
Written out verbosely, the loop is:
- invoke the run,
- ack every settled result,
- report settled failures before acking, and
- retry only unknown/transient errors.
try {
const result = op === 'start' ? await workflow.start(input, { runId }) : await workflow.resume(runId, event);
// success, suspended, and failed are all settled results
if (result.status === 'failed') {
await reportError(runId, result.error);
}
msg.ack();
} catch (err) {
// terminal run states, re-delivery won't change anything
if (err instanceof TerminalRunError || err instanceof CancelledError || err instanceof EventPendingError) {
msg.ack();
} else if (
// irrecoverable journal/replay problems
err instanceof VersionMismatchError ||
err instanceof MetadataMismatchError ||
err instanceof ReplayMismatchError ||
err instanceof JournalCorruptionError
) {
await reportError(runId, err);
msg.ack();
} else {
// transient failure so we should retry
msg.retry({ delaySeconds: 2 });
}
}
Why retries are safe
Retrying the same runId is safe because duplicate invocations map to the same journal.
If an older process resumes after a newer retry has taken over, kassette fences the zombie so it cannot append new journal entries. This protects the journal, but it doesn’t automatically make external side effects idempotent. If a process crashes after performing an external side effect but before kassette has recorded the step, retry may perform that side effect again. You should use idempotency keys for unsafe external operations.
See Correctness boundaries for the full guarantee.
Per-dispatcher mapping
Here’s how various dispatchers will map:
| Dispatcher | Ack | Retry |
|---|---|---|
| Queues | ack() / delete the message | retry, or let visibility timeout expire |
| Job processors | return from the handler | throw |
| Container jobs | exit 0 | exit non-zero |
| Postgres queues | delete the row / return | let the lock expire |
Delays, events, and schedules
Some workflows will require the dispatcher to invoke the run later or from a different source.
For ctx.sleep(T), the dispatcher should invoke the run again after T has passed.
After calling ctx.suspend(), an external event such as a webhook, UI approval, or a callbacks should run workflow.resume(runId, event).
You can kick off scheduled runs via cron of course as well.
Dispatcher checklist
- Decide what you’ll use for the dispatcher, eg: queue, job processor, container job, Postgres queue, or similar.
- Mint a
runIdbefore dispatching so the first invocation is already idempotent. - Invoke
start(input, { runId })orresume(runId, event). - Ack successful, suspended, failed, cancelled, and otherwise terminal outcomes.
- Do not ack transient failures, the dispatcher should retry the same
runId. - Tune the recovery window: visibility timeout, lock duration, or equivalent.
- Set a retry cap so failed runs don’t retry forever.
Reference implementation
See examples/cloudflare-queue.
This example uses a Cloudflare Worker journaling to R2, with Cloudflare Queues providing redelivery. The queue visibility timeout acts as the failure detector: if the worker does not ack, the message is redelivered to a fresh isolate, which replays completed steps from R2 and continues the run.