Embeddable durability primitives
Retry agent workflows without repeating work.
kassette is a tiny, zero-dependency TypeScript library that makes agentic workflows durable. It journals completed steps, then replays them on retry after a crash, timeout, or redeploy.
What it does
It makes retries safe.
- Replay finished steps
  On retry, kassette replays the journal and returns recorded results instead of calling the model or tool again.
- Wait without running
  Suspend for a human, webhook, or CI job. The process exits, then replay continues when the event arrives.
- Use your existing retry system
  kassette is a library, not a runtime. Your queue, job runner, or webhook re-invokes the same runId.
- Keep state in a plain journal
  Each run is a readable JSONL journal on a filesystem or object store. It is the state, audit trail, and resume point.
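To make the replay idea concrete, here is a minimal, self-contained sketch of step journaling. It is not kassette's implementation: `MemoryJournal`, `makeContext`, and the simplified record shape are hypothetical names invented for illustration, and the journal is held in memory rather than in a JSONL file. It only shows the general pattern of recording a step result once and returning the recorded value on retry.

```typescript
// Illustrative sketch only. These names are hypothetical and are not kassette's API.
type StepRecord = { type: "step"; stepId: string; result: unknown };

class MemoryJournal {
  // Each entry is one JSON line, as it would appear in a JSONL file.
  lines: string[] = [];

  append(record: StepRecord): void {
    this.lines.push(JSON.stringify(record));
  }

  find(stepId: string): StepRecord | undefined {
    for (const line of this.lines) {
      const record = JSON.parse(line) as StepRecord;
      if (record.type === "step" && record.stepId === stepId) return record;
    }
    return undefined;
  }
}

function makeContext(journal: MemoryJournal) {
  return {
    async step<T>(stepId: string, fn: () => Promise<T>): Promise<T> {
      const recorded = journal.find(stepId);
      if (recorded) return recorded.result as T; // replay: do not call fn again
      const result = await fn();                 // live: run the work once
      journal.append({ type: "step", stepId, result });
      return result;
    },
  };
}

// Demo state: count model calls and fail the second step exactly once.
let modelCalls = 0;
let crashOnce = true;

async function workflow(journal: MemoryJournal): Promise<string> {
  const ctx = makeContext(journal);
  const analysis = await ctx.step("analyze", async () => {
    modelCalls += 1; // stands in for an expensive model call
    return { action: "restart" };
  });
  return ctx.step("apply-fix", async () => {
    if (crashOnce) {
      crashOnce = false;
      throw new Error("simulated crash");
    }
    return `applied ${analysis.action}`;
  });
}

// The first attempt fails in "apply-fix"; the retry reuses the same journal.
async function runWithOneRetry(): Promise<string> {
  const journal = new MemoryJournal();
  try {
    return await workflow(journal);
  } catch {
    return await workflow(journal);
  }
}
```

In this sketch the retry finds the "analyze" record in the journal, so the callback that stands in for the model call runs only once even though the workflow function runs twice.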
Example
Normal async code, durable steps.
import { kassette, LocalStorage } from '@usekassette/kassette';

const agent = kassette(async (ctx, ticket) => {
  const analysis = await ctx.step('analyze', () =>
    llm.chat('Diagnose this issue and recommend a fix', { ticket })
  );

  if (analysis.destructive) {
    // process can exit here; resume from anywhere via agent.resume()
    const approval = await ctx.suspend('human-approval');
    if (!approval.approved) return { outcome: 'skipped', reason: approval.notes };
  }

  const result = await ctx.step('apply-fix', () =>
    executeTool(analysis.suggestedAction)
  );
  return { outcome: 'resolved', result };
}, { storage: new LocalStorage('.kassette') });
await agent.start(ticket);

{"type":"start","session":1,"offset":0,"timestamp":"2026-05-08T14:21:03Z","metadata":{"ticket":{"id":"INC-4821","title":"Pod crashing on startup"}}}
{"type":"step","session":1,"offset":1,"timestamp":"2026-05-08T14:21:08Z","stepId":"analyze","name":"analyze","result":{"destructive":true,"suggestedAction":"restart-pod-7f3c","rationale":"OOM during init; restart releases stuck handle"}}
{"type":"suspend","session":1,"offset":2,"timestamp":"2026-05-08T14:21:08Z","reason":"Waiting for event: human-approval","waitingFor":"human-approval"}
{"type":"resume","session":2,"offset":3,"timestamp":"2026-05-08T14:47:12Z","eventName":"human-approval","value":{"approved":true,"notes":""}}
{"type":"step","session":2,"offset":4,"timestamp":"2026-05-08T14:47:14Z","stepId":"apply-fix","name":"apply-fix","result":{"ok":true,"podId":"7f3c"}}
{"type":"complete","session":2,"offset":5,"timestamp":"2026-05-08T14:47:14Z"}
Examples
Try the workflow closest to yours.
- Agent loop (agent-loop)
  A minimal durable think-act-observe loop. LLM calls and tool executions are wrapped in step(). After a crash, replay skips finished work and continues at the first unfinished step.
- Loan underwriting (loan-underwriting)
  Parallel data gathering with human approval gates. parallel() runs the credit check and property appraisal. Each branch can suspend() for reviewer sign-off. fork() can re-run the final decision with earlier analysis replayed.
- Deploy assistant (deploy-assistant)
  A webhook-driven deployment assistant. Uses suspend() for clarifying replies and production approvals, and sleep() while dispatched jobs take effect.
- Coding agent (coding-agent)
  Speculative branching and backtracking with fork(). Reuse a recorded plan while trying multiple implementations, or backtrack to planning if none pass.
- Vercel AI SDK (vercel-ai-sdk)
  Middleware for recording LLM calls through the Vercel AI SDK, including streaming responses. Replay returns the recorded response without calling the provider.
- Cloudflare Queue (cloudflare-queue)
  Adds a queue in front of the Worker. Queue redelivery acts as the crash detector: a fresh isolate replays from R2 and continues live.
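Several of these examples pause on suspend() until an outside event arrives. The sketch below shows one way such a pause could work on top of a journal. It is an assumption for illustration, not kassette's implementation: `SuspendSignal`, `makeContext`, `run`, and `deliverEvent` are hypothetical names, and the journal is an in-memory array. The idea is that suspending records a marker and unwinds the run, and a later run replays until it finds the recorded resume value.

```typescript
// Illustrative sketch only. These names are hypothetical and are not kassette's API.
type Entry =
  | { type: "suspend"; waitingFor: string }
  | { type: "resume"; eventName: string; value: unknown };

// Thrown to unwind the workflow when it has to wait for an external event.
class SuspendSignal {
  waitingFor: string;
  constructor(waitingFor: string) {
    this.waitingFor = waitingFor;
  }
}

function makeContext(journal: Entry[]) {
  return {
    async suspend<T>(eventName: string): Promise<T> {
      const resumed = journal.find(
        (e) => e.type === "resume" && e.eventName === eventName
      );
      // Replay: the event has already arrived, so return its recorded value.
      if (resumed && resumed.type === "resume") return resumed.value as T;
      // Live: record that the run is waiting, then stop executing.
      journal.push({ type: "suspend", waitingFor: eventName });
      throw new SuspendSignal(eventName);
    },
  };
}

type Ctx = ReturnType<typeof makeContext>;

type Outcome<R> =
  | { status: "suspended"; waitingFor: string }
  | { status: "complete"; result: R };

// Runs the workflow against the journal and reports whether it finished or paused.
async function run<R>(
  journal: Entry[],
  workflow: (ctx: Ctx) => Promise<R>
): Promise<Outcome<R>> {
  try {
    const result = await workflow(makeContext(journal));
    return { status: "complete", result };
  } catch (err) {
    if (err instanceof SuspendSignal) {
      return { status: "suspended", waitingFor: err.waitingFor };
    }
    throw err; // real failures still propagate to the caller
  }
}

// Called by whatever receives the event (for example a webhook handler).
function deliverEvent(journal: Entry[], eventName: string, value: unknown): void {
  journal.push({ type: "resume", eventName, value });
}

async function approvalFlow(ctx: Ctx): Promise<string> {
  const approval = await ctx.suspend<{ approved: boolean }>("human-approval");
  return approval.approved ? "resolved" : "skipped";
}
```

A first call to run(journal, approvalFlow) reports a suspended status. After deliverEvent(journal, "human-approval", { approved: true }), a second call with the same journal completes, since the replayed suspend() finds the recorded value instead of pausing again.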
Read next
The shortest path through the docs.
- 01 Quickstart (docs/quickstart)
  Build a local workflow that records an expensive step, pauses for review, inspects the JSONL journal, and resumes without repeating finished work.
- 02 Why kassette (docs/why-kassette)
  Why a simple journal fits agent runs: small state, expensive steps, sequential control flow, and your existing retry system.
- 03 Wiring the dispatcher (docs/wiring-dispatcher)
  Use a queue, job runner, webhook, or container job to retry the same runId, ack settled results, and let replay skip completed steps.
- 04 Storage backends (docs/storage-backends)
  Choose filesystem or object storage based on where the run may resume, then check the fencing, append, and retention tradeoffs.
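The dispatcher wiring mentioned in this list (retry the same runId, ack settled results, let replay skip completed steps) can be pictured as a small redelivery loop. The sketch below is a self-contained simulation under stated assumptions, not kassette's API or any real queue client: `dispatch`, `step`, `journals`, and the three-attempt limit are all hypothetical, and the per-run journal is an in-memory map.

```typescript
// Illustrative sketch only. These names are hypothetical and are not kassette's API.
type Message = { runId: string };
type Handler = (msg: Message) => Promise<void>;

// Redeliver the same message until the handler settles or attempts run out.
// Returns the attempt number on which the message was acknowledged.
async function dispatch(
  msg: Message,
  handler: Handler,
  maxAttempts = 3
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await handler(msg);
      return attempt; // ack: the run settled
    } catch {
      // nack: fall through and redeliver the same runId
    }
  }
  throw new Error(`run ${msg.runId} failed after ${maxAttempts} attempts`);
}

// One journal per runId, mapping step ids to recorded results.
const journals = new Map<string, Map<string, unknown>>();

async function step<T>(
  runId: string,
  stepId: string,
  fn: () => Promise<T>
): Promise<T> {
  let journal = journals.get(runId);
  if (!journal) {
    journal = new Map<string, unknown>();
    journals.set(runId, journal);
  }
  if (journal.has(stepId)) return journal.get(stepId) as T; // skip finished work
  const result = await fn();
  journal.set(stepId, result);
  return result;
}

// Demo state: record every live call and fail the "deploy" step once.
const calls: string[] = [];
let failuresLeft = 1;

const handler: Handler = async ({ runId }) => {
  await step(runId, "fetch", async () => {
    calls.push("fetch");
    return 42;
  });
  await step(runId, "deploy", async () => {
    calls.push("deploy");
    if (failuresLeft-- > 0) throw new Error("simulated timeout");
    return "ok";
  });
};
```

With this setup, dispatch({ runId: "run-1" }, handler) is acknowledged on the second attempt. The "fetch" callback runs once, while "deploy" runs twice because its first attempt never produced a result to record.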
Use it when
Skip the work you've already done.
Reach for kassette when the problem is not 'how do I run this again?' but 'how do I avoid doing the same work twice?' Your existing stack makes retries easy but not safe.