# Operations

This guide covers the main production tasks for kassette: inspecting runs, debugging failures, forking runs, and cleaning up old journals.

## Agent observability

kassette is designed so an agent can inspect a run directly. The agent reads the journal and can answer:

- What already happened?
- Where did the run suspend or fail?
- Where can I safely fork?

### The bundled skill

The repo ships an agent skill at [`skills/kassette/SKILL.md`](../skills/kassette/SKILL.md). Use it as the operational guide for agents working with kassette. It covers storage discovery, journal basics, `jq` patterns, status interpretation, suspend/resume debugging, and safe forking. Use the skill together with the CLI. In practice, agents are the main expected users of the CLI.

### The CLI

`@usekassette/cli` ships a `kassette` command that works uniformly against both `file:` and `s3://` URLs.

```bash
kassette --storage <url> <verb> [args]
```

`--storage` accepts the same URL formats the workflow API uses through `LocalStorage` and `RemoteStorage`:

| Scheme  | Form                                 | Backend                               |
| ------- | ------------------------------------ | ------------------------------------- |
| `file:` | `file:<path>` (relative or absolute) | `LocalStorage`                        |
| `s3://` | `s3://<bucket>[/<prefix>]`           | `RemoteStorage` via `@usekassette/s3` |

`@usekassette/s3` is optional. Install it alongside `@usekassette/cli` if you want to use `s3://` URLs.
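
As an illustration of the two URL forms in the table, here is a minimal sketch of how a caller might tell them apart. `parseStorageUrl` and `StorageTarget` are hypothetical names, not part of kassette, and the CLI's own parsing may differ.

```typescript
// Hypothetical helper: it only mirrors the two URL forms from the table.
type StorageTarget =
  | { backend: 'local'; path: string }
  | { backend: 's3'; bucket: string; prefix: string };

function parseStorageUrl(url: string): StorageTarget {
  if (url.startsWith('s3://')) {
    // s3://<bucket>[/<prefix>]
    const rest = url.slice('s3://'.length);
    const slash = rest.indexOf('/');
    const bucket = slash === -1 ? rest : rest.slice(0, slash);
    const prefix = slash === -1 ? '' : rest.slice(slash + 1);
    if (bucket === '') throw new Error(`Missing bucket in storage URL: ${url}`);
    return { backend: 's3', bucket, prefix };
  }
  if (url.startsWith('file:')) {
    // file:<path>, relative or absolute
    const path = url.slice('file:'.length);
    if (path === '') throw new Error(`Missing path in storage URL: ${url}`);
    return { backend: 'local', path };
  }
  throw new Error(`Unsupported storage URL: ${url}`);
}
```

Under these assumptions, `parseStorageUrl('s3://bucket/runs')` yields the bucket `bucket` with the prefix `runs`, and `parseStorageUrl('file:.kassette')` yields the local path `.kassette`.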

The CLI has four verbs:

| Verb     | Form                                                                        | Output                                                           |
| -------- | --------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| `list`   | `list [--limit N]`                                                          | One `{"runId":"..."}` per line                                   |
| `status` | `status <runId>`                                                            | One [`RunStatus`](reference.md#runstatus--runstatus) JSON object |
| `dump`   | `dump <runId> [--offset N]`                                                 | Journal entries as JSONL, one per line                           |
| `fork`   | `fork <srcRunId> (--from-offset N \| --from-step <id>) [--new-run-id <id>]` | One `{"runId":"<new>"}`                                          |

`fork` is the only verb that writes anything. It creates a new run and never modifies the source run. You must pass exactly one fork point:

- `--from-offset N`
- `--from-step <id>`

`--from-step` cuts at the offset of the first step entry with that `stepId`.
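
The lookup rule for `--from-step` can be sketched over entries you have already read, for example with `storage.readAll(runId)`. The `JournalEntry` shape below is a reduced assumption, and `offsetForStep` is a hypothetical helper rather than a kassette export.

```typescript
// Minimal entry shape assumed for this sketch; real entries carry more fields.
interface JournalEntry {
  offset: number;
  type: string;
  stepId?: string | null;
}

// Hypothetical helper: returns the offset that `--from-step <id>` would cut at,
// following the stated rule (first `step` entry with a matching stepId).
function offsetForStep(entries: JournalEntry[], stepId: string): number {
  const match = entries.find((e) => e.type === 'step' && e.stepId === stepId);
  if (match === undefined) {
    throw new Error(`No step entry with stepId "${stepId}"`);
  }
  return match.offset;
}
```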

```bash
# Inspect storage
kassette --storage file:.kassette list                       # one runId per line
kassette --storage file:.kassette status <runId>             # JSON status object
kassette --storage file:.kassette dump <runId>               # JSONL journal entries
kassette --storage file:.kassette dump <runId> --offset 7    # entries from offset 7

# Fork
kassette --storage file:.kassette fork <runId> --from-offset 7
kassette --storage s3://bucket fork <runId> --from-step llm#3
```

### Reading a journal with jq

`kassette dump` emits JSONL, so you can pipe it into `jq`:

```bash
# Step sequence with offsets
kassette --storage file:.kassette dump <runId> | jq '{offset, type, stepId}'

# Just step results
kassette --storage file:.kassette dump <runId> | jq 'select(.type == "step") | {stepId, result}'

# Count entries by type
kassette --storage file:.kassette dump <runId> | jq -r .type | sort | uniq -c

# Session boundaries (each `start` is a new session)
kassette --storage file:.kassette dump <runId> | jq 'select(.type == "start") | {offset, session}'
```

For programmatic access, use `storage.readAll(runId)`. It returns every journal entry with its offset:

```typescript
const storage = new LocalStorage('.kassette');
const entries = await storage.readAll('run-abc');

for (const entry of entries) {
  if (entry.type === 'step') {
    console.log(`${entry.stepId} → ${JSON.stringify(entry.result)}`);
  }
}
```

`runStatus(entries)` transforms the journal into a single status object. See [Reference](reference.md#runstatus--runstatus) for the shape.

## Interpreting run status

`kassette status <runId>` reads the journal and returns one status object. A run has one of five states.

Apply these rules in order:

1. If the last entry is `complete`, the run is **completed**. The previous `step` entry's `result` is usually the agent's final output.
2. If the last entry is `error`, the run is **failed**. Read `message`, `name`, `stack`, and the last `step` entry before the error.
3. If the last entry is `cancel`, the run is **cancelled**. If `reason` is `"suspend_timeout_expired"`, the suspend deadline passed before `resume()` was called.
4. Otherwise, walk backward from the end of the journal, skipping `start` entries. If the first non-`start` entry you reach is `suspend`, the run is **suspended**. `waitingFor` tells you which event must arrive before the run can continue. Skipping `start` matters because a crash during resume can leave a `start` entry after the `suspend`; that does not change the run's state.
5. Otherwise, the run is **unsettled**. It may be running, crashed, abandoned, or empty. The journal alone cannot tell the difference. Use an out-of-band liveness signal if you need to know.
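
The five rules above can be written as a small function over the entry list. This is a sketch for reading along, not kassette's `runStatus` implementation; `classifyRun`, `RunState`, and the reduced `JournalEntry` shape are assumptions made for the example.

```typescript
// Only the entry type matters for these rules.
interface JournalEntry {
  type: string;
}

type RunState = 'completed' | 'failed' | 'cancelled' | 'suspended' | 'unsettled';

// Hypothetical helper that applies the rules in the order listed above.
function classifyRun(entries: JournalEntry[]): RunState {
  const last = entries[entries.length - 1];
  if (last === undefined) return 'unsettled'; // empty journal
  if (last.type === 'complete') return 'completed'; // rule 1
  if (last.type === 'error') return 'failed'; // rule 2
  if (last.type === 'cancel') return 'cancelled'; // rule 3

  // Rule 4: walk backward and skip `start` entries.
  for (const entry of [...entries].reverse()) {
    if (entry.type === 'start') continue;
    return entry.type === 'suspend' ? 'suspended' : 'unsettled';
  }
  return 'unsettled'; // rule 5: the journal holds only `start` entries
}
```

A journal ending in `suspend` followed by a `start` from a crashed resume still classifies as suspended, which is the case rule 4 is written for.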

### Debugging a stuck run

A stuck run is a run that should have finished but has not.

1. Run `kassette status <runId>`.
2. If the run is **suspended**, check why the resume event has not arrived. The journal shows the `waitingFor` event name. Trace that event through your webhook handler, queue consumer, scheduler, or dispatcher.
3. If the run is **unsettled** and no process is active, the process probably crashed before writing a terminal entry. Call `start()` again with the same run ID. kassette will open a new session, replay the journal, and continue from the first unfinished step.
4. If the run is **unsettled** and a process is active, wait or inspect that process. With `LocalStorage`, a `.lock` file with a live PID means a session is running. A lock with a dead PID will be reclaimed by the next process.
5. If the run is **failed** or **cancelled**, it is terminal. Read the journal for the reason. To try again, fork from before the failure or start a fresh run.
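
For the `LocalStorage` case in step 4, one way to check whether a PID is still alive from Node.js is to send signal `0`. This is a sketch: `isPidAlive` is a hypothetical helper, and how the PID is stored inside the `.lock` file is not covered here, so reading the file is left to the caller.

```typescript
// Hypothetical helper: reports whether a process with this PID exists on this machine.
function isPidAlive(pid: number): boolean {
  try {
    // Signal 0 performs the existence and permission checks without sending a signal.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    // Any other error (typically ESRCH) is treated as "not running".
    return (err as { code?: string }).code === 'EPERM';
  }
}
```

The check only makes sense on the machine that wrote the lock; a PID from another host says nothing about liveness.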

### Debugging a failed run

An `error` entry includes `message`, `name`, and `stack`.

To understand the failure, read the last `step` entry before the error. That shows what the agent had already done when it failed. Then find that step in the workflow code to see what would have run next.

To retry part of the run, fork from before the failure, fix the code or input, and continue from the fork. See [Versioning](versioning.md) to decide whether the change needs a `version` bump.

## Forking workflows

Forking is useful for any long run, but it matters most for AI agents. Agent runs can be expensive, and re-running from the beginning may not reproduce the same behavior because LLM calls can return different results.

Forking lets you avoid that. It copies the journal up to a chosen point and starts a new run from there. Completed steps replay from the journal, so you pay only for LLM calls made after the fork point.

### How it works

A programmatic `fork()` copies journal entries before the cut into a new run. It also handles session numbering and removes terminal entries such as `complete`, `error`, and `cancel`.

```typescript
import { fork, start } from '@usekassette/core';

const newRunId = await fork(storage, 'run-abc', { fromOffset: 13 });
const session = await start(storage, newRunId);
// replays memoized steps (offsets 0–12), then goes live from offset 13
```

The workflow API exposes the same operation on kassette:

```typescript
const result = await agent.fork({ runId: 'run-abc', fromOffset: 13 });
```

You can also fork from the CLI:

```bash
kassette --storage file:.kassette fork run-abc --from-offset 13
# {"runId":"<new-fork-runId>"}
```
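
To make the copy rule concrete, the sketch below models only which entries survive a fork. It is an illustration under assumptions, not the library's code: `entriesForFork` is a hypothetical name, and the session renumbering and storage writes that the real `fork()` performs are left out.

```typescript
// Reduced entry shape assumed for this sketch.
interface JournalEntry {
  offset: number;
  type: string;
}

const TERMINAL_TYPES = new Set(['complete', 'error', 'cancel']);

// Hypothetical model of the copy step: keep entries before the cut and
// drop terminal entries so the new run can continue.
function entriesForFork(entries: JournalEntry[], fromOffset: number): JournalEntry[] {
  return entries.filter(
    (e) => e.offset < fromOffset && !TERMINAL_TYPES.has(e.type),
  );
}
```

With `fromOffset: 13`, this keeps offsets 0 through 12, which matches the replay range noted in the `fork()` example.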

### Choosing the fork point

Inspect the journal to find where things went wrong:

```bash
kassette --storage file:.kassette dump run-abc | jq -c '{offset, type, stepId}'
```

```
{"offset":0,"type":"start","stepId":null}
{"offset":1,"type":"step","stepId":"llm"}
{"offset":2,"type":"step","stepId":"tool:lookup_order"}
{"offset":3,"type":"step","stepId":"llm#2"}
{"offset":4,"type":"step","stepId":"tool:check_refund_policy"}
{"offset":5,"type":"step","stepId":"llm#3"}
{"offset":6,"type":"step","stepId":"tool:process_refund"}   <-- fork here?
{"offset":7,"type":"step","stepId":"llm#4"}
{"offset":8,"type":"step","stepId":"tool:send_confirmation_email"}
{"offset":9,"type":"complete","stepId":null}
```

If the agent sent the wrong email at offset 8, fork from offset 7 to re-run the LLM call that decided what to send:

```bash
kassette --storage file:.kassette fork run-abc --from-offset 7
```

Or fork from offset 5 to let the LLM decide again after seeing the refund policy:

```bash
kassette --storage file:.kassette fork run-abc --from-offset 5
```
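
If you pick fork points like this often, the search can be scripted. The sketch below finds the nearest LLM step before a given offset. It assumes LLM step IDs start with `llm`, which is only the naming used in the sample journal above, and `nearestLlmStepBefore` is a hypothetical helper.

```typescript
// Reduced entry shape assumed for this sketch.
interface JournalEntry {
  offset: number;
  type: string;
  stepId?: string | null;
}

// Hypothetical helper: offset of the latest `llm*` step strictly before `offset`,
// or undefined if there is none.
function nearestLlmStepBefore(entries: JournalEntry[], offset: number): number | undefined {
  let found: number | undefined;
  for (const e of entries) {
    const isLlmStep =
      e.type === 'step' && typeof e.stepId === 'string' && e.stepId.startsWith('llm');
    if (isLlmStep && e.offset < offset && (found === undefined || e.offset > found)) {
      found = e.offset;
    }
  }
  return found;
}
```

Against the sample journal, asking for the LLM step before offset 8 gives 7, and asking again before offset 7 gives 5, the two fork points discussed above.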

### Caveats

**The agent code must not have changed** between the original run and the fork. Replay depends on deterministic call order. If the code changed, step IDs may bind memoized results to the wrong calls. Pass `version` on `start` to catch this as a `VersionMismatchError`. See [Versioning](versioning.md).

## Comparing forked runs

Two runs forked from the same point share the same journal prefix. Diff the runs to see where they diverged.

```bash
diff <(kassette --storage file:.kassette dump run-fork-1 | jq '{offset,type,stepId}') \
     <(kassette --storage file:.kassette dump run-fork-2 | jq '{offset,type,stepId}')
```

This compares the journal structure (offsets, entry types, and step IDs). To compare the actual step results, including LLM responses, diff only the step entries:

```bash
diff <(kassette --storage file:.kassette dump run-fork-1 | jq 'select(.type == "step") | {stepId, result}') \
     <(kassette --storage file:.kassette dump run-fork-2 | jq 'select(.type == "step") | {stepId, result}')
```
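
The same comparison can be done programmatically on two entry lists, for example from `storage.readAll`. The sketch below returns the position of the first differing entry. `firstDivergence` is a hypothetical helper, and it assumes both lists are ordered by offset so that position and offset line up.

```typescript
// Reduced entry shape assumed for this sketch.
interface JournalEntry {
  type: string;
  stepId?: string | null;
  result?: unknown;
}

// Hypothetical helper: index of the first entry where the two journals differ,
// or undefined if they are identical in type, stepId, and result.
function firstDivergence(a: JournalEntry[], b: JournalEntry[]): number | undefined {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    const left = a[i];
    const right = b[i];
    // JSON.stringify is a simple structural comparison; it is sensitive to key order.
    const sameResult = JSON.stringify(left.result) === JSON.stringify(right.result);
    if (left.type !== right.type || left.stepId !== right.stepId || !sameResult) {
      return i;
    }
  }
  // One journal is a strict prefix of the other: they diverge where the shorter one ends.
  return a.length === b.length ? undefined : shared;
}
```

Fields such as session numbers are deliberately ignored here, since they may differ between forks even when the steps match.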

## Using journals as test fixtures

A journal can act as a deterministic replay fixture. Copy a journal into your test suite, then run the agent against it. kassette replays the recorded steps and makes no LLM calls.

```typescript
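import assert from 'node:assert/strict';
import { copyFile, mkdir } from 'node:fs/promises';
import { start } from '@usekassette/core';
// Import LocalStorage the same way your workflow code does.

// Set up: copy a known-good journal into the test storage directory.
// mkdir ensures the directory exists before the copy.
await mkdir('.kassette-test', { recursive: true });
await copyFile('test/fixtures/happy-path.jsonl', '.kassette-test/test-run.jsonl');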

const storage = new LocalStorage('.kassette-test');
const session = await start(storage, 'test-run');

// The agent replays all memoized steps — zero LLM calls, deterministic output
const result = await agentLoop(session);
assert.equal(result, expectedOutput);
```

To create a fixture from a real run:

```bash
cp .kassette/run-abc.jsonl test/fixtures/happy-path.jsonl
```

To test a specific failure path, copy only the entries before the failure:

```bash
head -n 8 .kassette/run-abc.jsonl > test/fixtures/partial-run.jsonl
```

The test replays the memoized entries kept in the fixture (seven steps, if the first line is the `start` entry) and then reaches the first missing step. Stub the LLM at that point, return a controlled response, and assert what the agent does next.

## Retention and cleanup

kassette does not delete journals for you. After a run completes, its journal stays in storage until you remove it. This is intentional since the journal is the audit trail.

Storage grows with every run. A typical journal is small, but total usage is unbounded. Set a retention policy before storage growth becomes a problem.

### LocalStorage

For local storage, use a cron job or scheduled task to delete old journals:

```bash
# Delete journals older than 30 days
find .kassette -name '*.jsonl' -mtime +30 -delete
find .kassette -name '*.lock' -mtime +30 -delete
```

If you only want to delete completed, failed, or cancelled runs, scan runs with `kassette list` and check each run with `kassette status` before deleting it:

```bash
kassette --storage file:.kassette list \
  | xargs -I{} sh -c 'kassette --storage file:.kassette status {} | jq --arg id {} ...'
```
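
The decision itself can also be expressed over journal entries, using the rule that a run is terminal when its last entry is `complete`, `error`, or `cancel`. This is a sketch under assumptions: `canDelete` is a hypothetical helper, and the last-modified timestamp is something you supply (for example from the file's mtime).

```typescript
// Only the entry type matters for this check.
interface JournalEntry {
  type: string;
}

const TERMINAL_TYPES = new Set(['complete', 'error', 'cancel']);
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: true only for terminal runs older than the retention window.
// Suspended and unsettled runs are never eligible, whatever their age.
function canDelete(
  entries: JournalEntry[],
  lastModifiedMs: number,
  nowMs: number,
  retentionDays: number,
): boolean {
  const last = entries[entries.length - 1];
  const isTerminal = last !== undefined && TERMINAL_TYPES.has(last.type);
  const ageDays = (nowMs - lastModifiedMs) / MS_PER_DAY;
  return isTerminal && ageDays > retentionDays;
}
```

Keeping the terminal check separate from the age check means a long-suspended run is not removed just because its journal has not been written to recently.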

### RemoteStorage (S3)

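For S3 or an S3-compatible store such as R2 or GCS, set a lifecycle policy on the bucket or prefix where journals are stored. The example below uses the Amazon S3 lifecycle configuration format; other providers may expect a different schema. Adjust `Prefix` to match the prefix in your `s3://` URL: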

```json
{
  "Rules": [
    {
      "ID": "kassette-journal-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "kassette/" },
      "Expiration": { "Days": 90 }
    }
  ]
}
```

### Choosing a retention window

Choose a retention window based on these needs:

1. **Replay.** Do not expire suspended runs that may still resume. Keep journals longer than your longest expected suspend.
2. **Audit.** A journal records what the agent did. Keep terminal journals as long as your compliance, support, or postmortem process needs them.
3. **Forking.** Forks need the original journal. If you use forks for debugging or backtracking, keep terminal journals through that debugging window.
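
Because a single age-based rule has to satisfy all three needs, the window is effectively the largest of them. A minimal sketch of that arithmetic follows; the field names and the `safetyMarginDays` buffer are hypothetical, and the numbers in the usage note are examples rather than recommendations.

```typescript
// The three needs from the list above, expressed in days.
interface RetentionNeeds {
  longestSuspendDays: number; // replay: longest expected suspend
  auditDays: number; // audit: compliance, support, or postmortem window
  forkDebugDays: number; // forking: how long you fork from old runs
}

// Hypothetical helper: take the maximum, with a buffer on top of the longest
// suspend so journals outlive it rather than merely match it.
function retentionDays(needs: RetentionNeeds, safetyMarginDays = 7): number {
  return Math.max(
    needs.longestSuspendDays + safetyMarginDays,
    needs.auditDays,
    needs.forkDebugDays,
  );
}
```

For example, a 14-day longest suspend, a 90-day audit window, and a 30-day fork window give 90 days. If the longest suspend were 120 days instead, the suspend would dominate and the result would be 127.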

If you need long-term history but do not want it in active storage, archive before deleting. Use `kassette dump` to copy journals to colder storage, then expire them from the active path.
