Why kassette

Your agent loop just crashed on turn 27 of 30. By then it had already burned countless tokens, sent Slack messages, and even opened a PR. You need to restart the run, but how do you avoid repeating work that was expensive or unsafe to do the first time?

Everyone would like their code to be more resilient, but building that resilience yourself often requires a huge investment. Durable execution exists to take on that responsibility for you: the engine makes your code execution reliable and fault-tolerant, leaving you to focus on your business logic.

The catch is that durable execution engines are built for workloads sized very differently from the one an agent loop creates. They’re designed for huge numbers of very small operations running concurrently and coordinating distributed work. In an agent loop, we’re just trying to avoid repeating the expensive or unsafe steps we’ve already completed when we retry after a crash. In this narrow case, reaching for a general-purpose engine means extra services, third-party dependencies, setup, and runtime cost that you have no use for.

The question is not whether we should make our agent’s retries safe; it’s what making them safe actually requires.

The agentic workload

An agent run is a sequence of often expensive, non-idempotent turns. Being agentic means each turn depends on the previous one, and the model determines the next turn at runtime. The control flow is sequential, but LLMs are non-deterministic, so each run may take a different path.

This is what an agentic workload looks like:

  • A run usually has 10–100 turns
  • Each turn is LLM-bound, taking roughly 1–10 seconds
  • The control flow is sequential
  • The retry state is small: KB to MB, not GB
  • A run lives for seconds to days, depending on whether it waits on humans or external events

Just as important is what it doesn’t look like. We don’t have tens of thousands of tiny steps that require sub-millisecond latency, GB-sized storage needs, or concurrent multi-agent orchestration. An agentic workload is simple.

The design falls out

For agentic workflows, we don’t need a runtime that can replay every line of code. We just need to avoid repeating the expensive or unsafe steps that already finished: the LLM calls, tool executions, human approvals, and external events. We can record each step’s result after it completes and otherwise keep our own control flow (if, while, try/catch) without shoveling it into some new structure.

If retry safety only means “don’t repeat completed steps,” then the only state we need is the set of completed steps and their results, and that can be reduced to a journal. If we need to re-run later, we replay the already journaled steps from their recorded results instead of executing them again. We call this at-most-once journaling.
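A minimal sketch of that idea follows. The `Journal` class and its `step` method are hypothetical names chosen for illustration, not kassette’s API, and the journal here lives in memory only to keep the example short.

```python
from typing import Any, Callable, Dict


class Journal:
    """Hypothetical in-memory journal: step key -> recorded result."""

    def __init__(self) -> None:
        self.entries: Dict[str, Any] = {}

    def step(self, key: str, fn: Callable[[], Any]) -> Any:
        # Replay: the step already completed, so return its recorded result.
        if key in self.entries:
            return self.entries[key]
        # First execution: run the step, then record its result once it completes.
        result = fn()
        self.entries[key] = result
        return result


calls = []


def expensive_llm_call() -> str:
    # Stand-in for an LLM call or tool execution (assumption for illustration).
    calls.append("llm")
    return "open a PR"


journal = Journal()
first = journal.step("turn-1/llm", expensive_llm_call)   # executes the call
second = journal.step("turn-1/llm", expensive_llm_call)  # replays the result
```

The second call returns the recorded result without invoking `expensive_llm_call` again, and the surrounding code is free to use ordinary control flow between steps.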

Because we recover one run at a time, each run can journal to its own append-only log. We’ll write a single entry into that log for each step whose result we record. On a crash, resume, or fork, we read the entire log for that run, use it to rebuild the replay state, and continue from the first unfinished step.
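The recovery path can be sketched as below, assuming a JSON-lines file stands in for the run’s log. The entry shape, the function names, and the `run-123` id are all hypothetical, and the fixed step list exists only so the example has something to resume.

```python
import json
import os
import tempfile
from typing import Any, Dict


def append_entry(log_path: str, key: str, result: Any) -> None:
    # One JSON line per completed step; existing lines are never rewritten.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "result": result}) + "\n")


def rebuild_state(log_path: str) -> Dict[str, Any]:
    # Read the whole log for this run and rebuild the key -> result map.
    state: Dict[str, Any] = {}
    if not os.path.exists(log_path):
        return state
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            state.setdefault(entry["key"], entry["result"])  # first record wins
    return state


run_dir = tempfile.mkdtemp()
log_path = os.path.join(run_dir, "run-123.jsonl")  # hypothetical run id

append_entry(log_path, "turn-1", "plan")
append_entry(log_path, "turn-2", "tool output")

# A new process after a crash sees only the log, not the old process's memory.
state = rebuild_state(log_path)
steps_in_order = ["turn-1", "turn-2", "turn-3"]
next_step = next(k for k in steps_in_order if k not in state)
```

Here the resumed process would skip `turn-1` and `turn-2` and pick up at `turn-3`.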

This journal can’t live in memory since the whole point is the process may die. It needs persistent storage. Given the workload above (tens of writes per run and agent turns measured in seconds), we don’t care much about storage latency. A slow 30 ms object-store write is just noise next to an LLM call and its tool executions, so we can use the simplest storage that everyone has and that can easily be accessed from anywhere: an object store (S3, R2, GCS).
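To put rough numbers on “noise”, the arithmetic can be sketched using the figures above: a 30 ms write against turns of 1 to 10 seconds. One journal write per turn is an assumption made here for simplicity; a turn that records several steps would scale the result accordingly.

```python
WRITE_LATENCY_S = 0.030  # the 30 ms object-store write mentioned in the text


def journal_overhead(turn_seconds: float, writes_per_turn: int = 1) -> float:
    """Fraction of a turn's duration spent waiting on journal writes."""
    return (writes_per_turn * WRITE_LATENCY_S) / turn_seconds


fast_turn = journal_overhead(1.0)    # shortest turn in the workload
slow_turn = journal_overhead(10.0)   # longest turn in the workload
```

Under these assumptions the journal adds about 3% to a 1-second turn and about 0.3% to a 10-second turn.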

For a given agent run that executes as a sequential control flow, only one process needs to be able to write to that run’s journal at a time. If a retry overlaps with an older process still running, we need to ensure that the retry is the only active writer. The journal can use fencing to stop the older process: if the older process tries to write again, its write is rejected. That keeps coordination inside the journal, without a separate lock service, lease system, heartbeat, or scheduler state.
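One common way to fence is with a writer epoch, sketched below. This is an illustration under stated assumptions, not a description of how kassette implements fencing: `FencedJournal`, `acquire`, and `StaleWriterError` are hypothetical names, and the store is in memory.

```python
class StaleWriterError(Exception):
    """Raised when a fenced-off (older) writer tries to append."""


class FencedJournal:
    """Hypothetical journal that tracks the current writer epoch itself."""

    def __init__(self) -> None:
        self.epoch = 0
        self.entries = []

    def acquire(self) -> int:
        # Each new invocation takes the next epoch, fencing off earlier writers.
        self.epoch += 1
        return self.epoch

    def append(self, writer_epoch: int, key: str, result: str) -> None:
        # Reject writes from any writer that is not the latest one.
        if writer_epoch != self.epoch:
            raise StaleWriterError(f"epoch {writer_epoch} fenced by {self.epoch}")
        self.entries.append((key, result))


journal = FencedJournal()
old_writer = journal.acquire()       # original process
journal.append(old_writer, "turn-1", "plan")

new_writer = journal.acquire()       # retry overlaps with the old process
journal.append(new_writer, "turn-2", "tool output")

try:
    # The older process wakes up and tries to write again.
    journal.append(old_writer, "turn-2", "duplicate tool output")
    rejected = False
except StaleWriterError:
    rejected = True
```

On a real object store the epoch check and the append would have to happen atomically, for example through a conditional write where the store supports one.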

A run advances only when it’s actually working. Sometimes it will need to pause and wait on a person, webhook, CI job, or timeout. Since the journal holds everything needed to resume, our process can exit.
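A sketch of pausing on a human approval, assuming the external event’s result is written into the journal by whatever receives it. The `Suspended` exception, `await_event`, and the step keys are hypothetical names for illustration.

```python
from typing import Dict, Optional


class Suspended(Exception):
    """Signals that the run is waiting on something external and can exit."""


def await_event(journal: Dict[str, str], key: str) -> str:
    # If the event's result has been journaled, replay it; otherwise stop the run.
    if key in journal:
        return journal[key]
    raise Suspended(key)


def run(journal: Dict[str, str]) -> Optional[str]:
    try:
        journal.setdefault("draft", "PR description v1")  # completed step
        decision = await_event(journal, "approval")       # human approval
        return f"{journal['draft']} -> {decision}"
    except Suspended:
        return None  # the process can exit; the journal holds what resume needs


journal: Dict[str, str] = {}
first = run(journal)               # no approval yet, so the run suspends
journal["approval"] = "approved"   # the external event is recorded later
second = run(journal)              # re-invocation continues past the wait
```

Nothing stays resident between the two invocations; the second one rebuilds its position from the journal alone.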

Which raises the one thing a journal can’t do for itself: we still need a way to retry or resume a run after it has crashed or been paused. If you’ve been following along, you might have noticed two things: 1) anything that can access the journal can re-invoke a run, and 2) the journaling mechanism can run embedded within your own process, so we do not need a separate runtime. This splits the system we need for durable execution naturally into two halves: the journal (with at-most-once journaling) and a dispatcher that invokes the runs. And you already have a dispatcher in your stack that the journal can compose with.

The dispatcher half is already in your stack

Now we need a way to re-invoke a run after the process crashes or has exited to wait on something external.

All we need is a mechanism that notices a run has stopped and re-invokes it safely. What does safely mean? We can’t guarantee exactly-once invocation since processes die unpredictably. But we don’t need to. Through at-most-once journaling, our journal already ensures we can’t redo any work we’ve journaled. That makes repeat invocations harmless so long as they name the same run with the same idempotency key, which is what the journal keys replay on. That leaves just one requirement for the dispatcher: at-least-once invocation.

Here’s what we get when we compose the two halves:

at-least-once invocation combined with at-most-once journaling yields effectively-once step completion

So what does this dispatcher that re-invokes the run look like? Probably something you already have: a queue that retries a message until you acknowledge it, a background job that retries until the handler returns, or a container job that retries until it exits successfully. As long as it re-invokes with the same run id, kassette will replay completed steps instead of repeating them.
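The composition can be simulated end to end. In this sketch `dispatch` stands in for a retrying queue or job runner and `step` for the journal; both are hypothetical helpers written for illustration rather than kassette’s API, and the crash is injected by hand.

```python
from typing import Callable, Dict, List

side_effects: List[str] = []   # what the outside world actually saw
journal: Dict[str, str] = {}   # stand-in for the run's durable journal
crash_once = {"armed": True}


def step(key: str, fn: Callable[[], str]) -> str:
    # Replay a completed step, or execute it and record the result.
    if key not in journal:
        journal[key] = fn()
    return journal[key]


def send_slack() -> str:
    side_effects.append("slack")
    return "sent"


def open_pr() -> str:
    side_effects.append("pr")
    return "opened"


def handler(run_id: str) -> None:
    step(f"{run_id}/slack", send_slack)
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("process died mid-run")  # injected crash
    step(f"{run_id}/pr", open_pr)


def dispatch(run_id: str, max_attempts: int = 5) -> int:
    # At-least-once: re-invoke with the same run id until the handler returns.
    for attempt in range(1, max_attempts + 1):
        try:
            handler(run_id)
            return attempt
        except RuntimeError:
            continue
    raise RuntimeError("gave up")


attempts = dispatch("run-123")
```

The handler is invoked twice, but the Slack message is sent once and the PR is opened once, because the second invocation replays the journaled Slack step.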

If you can describe how a failed job gets retried in your stack, you already have the dispatcher. kassette is the only part you’re missing; see wiring the dispatcher.

Other durable execution systems bundle the dispatcher with the journal (a worker pool polling a queue, a built-in scheduler routing tasks), but nothing in this workload requires all that.

For agentic workflows, the right move is less machinery, not more.