# Architecture

For the design rationale behind kassette, see [Why kassette](why-kassette.md).

## 1. The model

Each run has one journal, and that journal is the only source of truth.

Every invocation starts by reading the journal from the beginning to rebuild run state before running the workflow from the top. If a step is already recorded, kassette returns the recorded result instead of running it again. Execution goes live at the first unrecorded step, and new completed work is appended.

## 2. Concepts

A **run** is one execution of a task, identified by a `runId`. It owns exactly one journal, and that journal is isolated from other runs at the storage level. A run either reaches a terminal state (`completed`, `failed`, `cancelled`) or remains open so the caller can invoke it again with the same `runId` to pick up where it left off.

A **session** is one continuous execution attempt within a run, identified by a monotonically increasing session number. Every new session writes a session number greater than every number already in the run’s journal. An initial invocation, a retry after an interruption, and a resume after a suspend all open new sessions.

A **step** is one unit of recorded work inside a session. Each `record()` call is a step. The first time it runs, kassette runs the function and writes the result to the journal. In later sessions, kassette returns the journaled result instead, so the function does not run again.

## 3. Lifecycle

```
Run "task-abc"
  Session 1: step, step, step → crash
  Session 2: replay(3), step, step → suspend(waiting on CI pipeline)
             ↳ process exits, releases all resources
             ↳ caller re-invokes with the resume payload
  Session 3: replay(5), step → complete
```

A crash appends nothing. If the journal has no terminal entry, the run is still open and may be continued by a later session.

The state machine is derivable from the journal alone.

```
                                  ┌──step──┐
                                  │        │
                                  ▼        │
   ┌─────┐   start   ┌────────────────┐────┘  complete   ╔═══════════════╗
   │ new │──────────▶│   unsettled    │────────────────▶ ║   completed   ║
   └─────┘     ┌────▶└─┬───────────┬──┘                  ╚═══════════════╝
               │       │           │
               │       │ suspend   │ fail                ╔═══════════════╗
               │       │           └───────────────────▶ ║    failed     ║
        resume │       │                                 ╚═══════════════╝
               │       ▼
               │   ┌────────────────┐
               └───│   suspended    │
                   └───────┬────────┘
                           │  timeout                    ╔═══════════════╗
                           └───────────────────────────▶ ║   cancelled   ║
                                                         ╚═══════════════╝
```

Read the arrows as lifecycle transitions:

- `start` opens a session.
- `step` records completed work and leaves the run open.
- `suspend` ends the current session while the run waits for an event.
- `resume` opens a new session and records the event payload.
- `complete` writes a `complete` entry, making the run `completed`.
- `fail` writes an `error` entry, making the run `failed`.
- `timeout` is checked only when a later `start`, `resume`, or `fork` initializes the run; if the suspend deadline has expired, kassette writes `cancel`,
  making the run `cancelled`.

Only `completed`, `failed`, and `cancelled` are terminal states.
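
The transitions above can be expressed as a fold over the journal. The following is a minimal sketch, not kassette's implementation: it assumes each entry carries a `type` field using the entry names mentioned on this page, and the `Entry`, `RunState`, and `deriveState` names are hypothetical. It mirrors the diagram only and does not model deadlines.

```typescript
// Hypothetical entry shape: only a `type` field is assumed.
type EntryType = "start" | "step" | "suspend" | "resume" | "complete" | "error" | "cancel";
type RunState = "new" | "unsettled" | "suspended" | "completed" | "failed" | "cancelled";

interface Entry {
  type: EntryType;
}

// Fold the journal, in append order, into the run's current state.
function deriveState(journal: Entry[]): RunState {
  let state: RunState = "new";
  for (const entry of journal) {
    switch (entry.type) {
      case "start":
      case "step":
      case "resume":
        state = "unsettled";
        break;
      case "suspend":
        state = "suspended";
        break;
      case "complete":
        return "completed"; // terminal
      case "error":
        return "failed"; // terminal
      case "cancel":
        return "cancelled"; // terminal
    }
  }
  // No terminal entry: the run is still open (for example, after a crash).
  return state;
}
```

A journal that ends after a crash simply has no terminal entry, so `deriveState` reports `unsettled` and a later session may continue the run.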

## 4. Replay

Each new session runs the workflow again from the top, but previously recorded work is not repeated.

When execution reaches `record(name, fn)`, kassette looks for the matching journaled step. If it finds one, it returns the recorded result and skips `fn`. If it does not, it runs `fn` and appends the result as a new `step` entry.

Each `waitForEvent(name)` call is checked the same way. If the journal already has a matching `resume` entry, kassette returns the recorded value. Otherwise, it writes a `suspend` entry and unwinds the workflow so the process can exit.

Replay is correct only if each session reaches the same `record()` and `waitForEvent()` calls in the same order. Step IDs are positional (`name`, `name#2`, `name#3`), so removing, reordering, or conditionally skipping a call can attach an old result to the wrong code. Concurrent branches need scope-based namespacing; see [Concurrency](concurrency.md).

Any non-deterministic work, such as LLM calls or tool calls, should be wrapped in `record()` calls.
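
To make the positional matching concrete, here is a minimal in-memory sketch of replay. It is an illustration under assumptions, not kassette's code: the `Session` class, the `StepEntry` shape, and the synchronous `record` signature are hypothetical, and the real journal lives in storage rather than in an array.

```typescript
// Hypothetical step entry; `id` is the positional ID ("name", "name#2", ...).
interface StepEntry {
  type: "step";
  id: string;
  result: unknown;
}

class Session {
  private counts: Record<string, number> = {};
  private journal: StepEntry[];

  constructor(journal: StepEntry[]) {
    this.journal = journal;
  }

  // Return the journaled result if this call position is already recorded;
  // otherwise run fn and append its result.
  record<T>(name: string, fn: () => T): T {
    const n = (this.counts[name] || 0) + 1;
    this.counts[name] = n;
    const id = n === 1 ? name : name + "#" + n;
    const hit = this.journal.filter((e) => e.id === id)[0];
    if (hit !== undefined) return hit.result as T;
    const result = fn();
    this.journal.push({ type: "step", id, result });
    return result;
  }
}

// Two sessions over the same journal: the second one replays both steps.
const journal: StepEntry[] = [];
let calls = 0;
const workflow = (s: Session): number[] => [
  s.record("fetch", () => ++calls),
  s.record("fetch", () => ++calls),
];
const first = workflow(new Session(journal)); // runs fn twice
const second = workflow(new Session(journal)); // fn is not called again
```

Because the ID depends only on the name and on how many times that name has been seen, skipping the first `record("fetch", ...)` call in a later session would hand the first call's result to what used to be the second call.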

Changing the workflow's code between sessions can also shift the call order and silently break replay. The optional `version` field on `start()` is a deployment-level guard against this: it signals a version mismatch, but it does not identify which step changed and cannot prevent all drift. See [Versioning](versioning.md) for details.

## 5. Suspend, resume, and timeouts

When a workflow needs to wait on something, such as human approval or a webhook, it can suspend without keeping the process alive and resume later. The two halves of this mechanism are `waitForEvent` and `resume`.

To suspend, call `waitForEvent`. It writes a `suspend` entry under the current session, then throws an exception that returns control to the caller so the process can exit.

To resume, call `resume(runId, name, value)` to open a new session and record a `resume` entry with the event value passed in. On replay, `waitForEvent(name)` will find that entry and return its value instead of suspending again. (Calling `resume` more than once for the same event is safe because the first recorded value wins, and later values are ignored.)
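
The suspend and resume halves can be sketched over an in-memory journal. This is a hedged illustration: the entry fields, the `SuspendSignal` exception, and the function signatures (which take the journal directly instead of a `runId`) are assumptions made for the example, not kassette's actual API.

```typescript
// Hypothetical entry shapes for this sketch.
type Entry =
  | { type: "suspend"; session: number; name: string }
  | { type: "resume"; session: number; name: string; value: unknown };

// Hypothetical exception used to unwind the workflow on suspend.
class SuspendSignal extends Error {
  constructor(eventName: string) {
    super("suspended waiting for " + eventName);
    this.name = "SuspendSignal";
  }
}

function waitForEvent(journal: Entry[], session: number, name: string): unknown {
  // The first recorded resume for this event wins; later ones are ignored.
  for (const e of journal) {
    if (e.type === "resume" && e.name === name) return e.value;
  }
  journal.push({ type: "suspend", session, name });
  throw new SuspendSignal(name);
}

// Opens a new session (one number above the journal's maximum) and records the value.
function resume(journal: Entry[], name: string, value: unknown): number {
  let max = 0;
  for (const e of journal) if (e.session > max) max = e.session;
  const session = max + 1;
  journal.push({ type: "resume", session, name, value });
  return session;
}

const journal: Entry[] = [];
let suspended = false;
try {
  waitForEvent(journal, 1, "approval"); // no resume yet: suspends
} catch (e) {
  suspended = (e as Error).name === "SuspendSignal";
}
resume(journal, "approval", "yes");
const latest = resume(journal, "approval", "no"); // ignored on replay
const value = waitForEvent(journal, latest, "approval");
```

Scanning from the start of the journal is what makes repeated `resume` calls harmless in this sketch: the earliest matching entry is always found first.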

A suspend may include a deadline, but kassette does not poll or run timers. Deadlines are checked only when `start`, `resume`, or `fork` opens a new session. If the deadline has passed and no matching `resume` exists, kassette writes a `cancel` entry and the run becomes terminal.
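
The lazy deadline check might look like the sketch below, run once when a session opens. The `deadline` field (epoch milliseconds), the `reason` field, and the `cancelIfExpired` name are hypothetical; only the rule itself (expired deadline plus no matching `resume` means `cancel`) comes from the description above.

```typescript
// Hypothetical entry shapes; `deadline` is epoch milliseconds.
type Entry =
  | { type: "suspend"; name: string; deadline?: number }
  | { type: "resume"; name: string }
  | { type: "cancel"; reason: string };

// Called when a new session opens. Returns true if the run was cancelled.
function cancelIfExpired(journal: Entry[], now: number): boolean {
  for (const e of journal) {
    if (e.type !== "suspend" || e.deadline === undefined) continue;
    const eventName = e.name;
    const resumed = journal.some((r) => r.type === "resume" && r.name === eventName);
    if (!resumed && now > e.deadline) {
      journal.push({ type: "cancel", reason: "timeout" });
      return true;
    }
  }
  return false;
}
```

Passing `now` as a parameter keeps the check deterministic, which makes it easy to test without real clocks or timers.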

## 6. Properties of the journal

**Append-only.** Entries are never modified or deleted. Once work is recorded, it stays settled.

**Ordered.** Entries are read in append order. Replay depends on this because recorded results are matched to `record()` calls by position.

**Atomic.** Each entry is written completely or not at all.

**Fenced.** Only the current session can append. Writes from a superseded session are rejected.

**JSON.** The journal is newline-delimited JSON with one entry per line and one file per run. Inspection requires no library and no kassette dependency.
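
As an illustration of inspecting a journal without kassette, the sketch below parses newline-delimited JSON with nothing but `JSON.parse`. The sample lines and their field names are invented for the example and are not kassette's actual schema.

```typescript
// Illustrative journal contents; field names are assumptions.
const raw =
  [
    '{"type":"start","session":1}',
    '{"type":"step","session":1,"id":"fetch"}',
    '{"type":"suspend","session":1,"name":"approval"}',
  ].join("\n") + "\n";

// One JSON document per non-empty line, in append order.
function parseJournal(text: string): Array<Record<string, unknown>> {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Record<string, unknown>);
}

const entries = parseJournal(raw);
const types = entries.map((e) => e["type"]);
```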

**Self-contained.** Everything needed for replay is in the journal. As long as you can access the journal, the run can continue.

**Per-run isolated.** Each `runId` has its own journal, so unrelated runs cannot interfere with each other.

## 7. The single-writer invariant

Only the newest session for a run may append to its journal.

Whenever a session is opened, it gets a higher session number than any earlier session, and each entry appended during that session includes that number. Before appending, the storage backend checks whether the journal already contains a `start` entry with a higher session number. If it does, the append is rejected with a `FencedError`, so a zombie session (one that keeps running after being superseded) cannot write.
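
The fencing check can be sketched in a few lines. `FencedError` is the error named above, but its definition here, the `Entry` shape, and the `append` function are assumptions for illustration only.

```typescript
// Hypothetical entry shape: every entry carries its session number.
interface Entry {
  type: string;
  session: number;
}

class FencedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "FencedError";
  }
}

// Reject the append if a newer session has already started.
function append(journal: Entry[], entry: Entry): void {
  const superseded = journal.some(
    (e) => e.type === "start" && e.session > entry.session
  );
  if (superseded) {
    throw new FencedError("session " + entry.session + " was superseded");
  }
  journal.push(entry);
}
```

In this sketch a writer from session 1 fails as soon as a `start` entry for session 2 exists, while session 2 itself can keep appending.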

The local backend enforces this with a per-run lock file (`{runId}.lock`) so only one local process can write at a time. If the lock owner has died, the next process can reclaim the lock, but only after checking the journal for a newer session. That check prevents an old session from writing again after it has been superseded.

The remote object storage backend uses compare-and-swap (CAS) instead of a lock. Each run is one object. To append, kassette reads the object and its ETag, then writes back the full journal with `If-Match: <etag>`. If another writer got there first, the write fails its precondition and kassette retries; if the retry sees a higher session number, the writer is stale and exits with `FencedError`. For the full CAS and session-number fencing design, see [Object storage design](object-storage-design.md).
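
The read, check, and conditional-write loop can be sketched against an in-memory stand-in for an object store. Everything here is hypothetical scaffolding (`FakeObjectStore`, `putIfMatch`, `casAppend`, the retry limit, and the entry shape); it checks the session number on every read, which is a simplification of the design linked above.

```typescript
interface Entry {
  type: string;
  session: number;
}

class FencedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "FencedError";
  }
}

// In-memory stand-in for an object store with conditional writes.
class FakeObjectStore {
  private body = "";
  private version = 0;

  get(): { body: string; etag: string } {
    return { body: this.body, etag: String(this.version) };
  }

  // Mimics a PUT with `If-Match: <etag>`; returns false on a failed precondition.
  putIfMatch(body: string, etag: string): boolean {
    if (etag !== String(this.version)) return false;
    this.body = body;
    this.version += 1;
    return true;
  }
}

function highestSession(body: string): number {
  let max = 0;
  for (const line of body.split("\n")) {
    if (line === "") continue;
    const entry = JSON.parse(line) as Entry;
    if (entry.session > max) max = entry.session;
  }
  return max;
}

function casAppend(store: FakeObjectStore, entry: Entry, maxRetries: number = 5): void {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const current = store.get();
    if (highestSession(current.body) > entry.session) {
      throw new FencedError("session " + entry.session + " was superseded");
    }
    const next = current.body + JSON.stringify(entry) + "\n";
    if (store.putIfMatch(next, current.etag)) return;
    // Precondition failed: another writer won the race, so re-read and retry.
  }
  throw new Error("gave up after " + maxRetries + " CAS conflicts");
}
```

Writing back the whole journal on each append keeps the object store's role simple: it only has to compare one ETag, and the session check happens on data the writer has just read.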

## 8. Correctness boundaries

kassette guarantees at-most-once journaling, not at-most-once execution: a step's result is recorded at most once, but the step's function may run more than once.

If a step performs an external action and the process crashes before the result is written to the journal, kassette has no record of that action. On the next session, replay will run the step again.

With remote storage, an old session may also finish work it already started before it learns that a newer session has taken over. Its next journal write will be rejected, but the external action may already have happened.

For irreversible external actions, use idempotency keys or another deduplication method. kassette makes retries safe only after the result is journaled.
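
As a sketch of the idempotency-key approach, the example below derives a key from values that stay the same across sessions, so a re-executed step sends the same key and the external service can deduplicate. `FakePaymentApi` and `idempotencyKey` are hypothetical stand-ins; a real service defines its own key mechanism.

```typescript
// Stand-in for an external service that deduplicates by idempotency key.
class FakePaymentApi {
  private seen: Record<string, string> = {};
  charges = 0;
  totalCents = 0;

  charge(key: string, amountCents: number): string {
    const prior = this.seen[key];
    if (prior !== undefined) return prior; // duplicate request: no second charge
    this.charges += 1;
    this.totalCents += amountCents;
    const receipt = "receipt-" + this.charges;
    this.seen[key] = receipt;
    return receipt;
  }
}

// Build the key from stable inputs (run ID and step ID) so every retry matches.
function idempotencyKey(runId: string, stepId: string): string {
  return runId + ":" + stepId;
}

const api = new FakePaymentApi();
const key = idempotencyKey("task-abc", "charge-card");
const firstReceipt = api.charge(key, 500); // session 1, crashes before journaling
const retryReceipt = api.charge(key, 500); // session 2 re-runs the step
```

The key must not include anything that changes between sessions, such as a timestamp or the session number, or the retry would look like a new request.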
