# Object storage design

_This design relies on S3-style semantics: read-after-write consistency for object reads and atomic conditional writes (`If-Match` / `If-None-Match`) as supported on AWS S3 and Cloudflare R2._

kassette’s object storage backend, `RemoteStorage`, stores each run's journal as one JSONL object in an S3-compatible object store. Because object stores cannot append to objects, kassette emulates append using a CAS-protected read-modify-write loop.

## Why one object?

We could store the journal in other ways. Why store the full journal as one object?

As the [Why kassette](why-kassette.md#the-agentic-workload) doc explains, a one-object design fits kassette's target agentic workload: 10–100 sequential turns, often needing to wait on an LLM call, tool call, webhook, or human. Journal entries are usually KB-sized, so object-store latency on the write hot path is usually hidden under the agent's work, which is measured in seconds.

This design optimizes the read path because on start, resume, or fork, kassette needs to read the whole journal to rebuild run state. With one object, that is one read.

The tradeoff is write amplification. Each append uploads the full journal, so the total bytes uploaded over a run of `N` entries grows as `O(N²)`. That’s acceptable for kassette's target workload, where runs are short (10–100 turns) and entries are usually KB-sized.
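
To put a rough number on that quadratic growth, here is a back-of-the-envelope sketch. It assumes every entry is the same size and that append `k` re-uploads all `k` entries written so far. The 2 KiB entry size is an illustrative figure, not a kassette measurement, and `total_uploaded_bytes` is a hypothetical helper.

```python
def total_uploaded_bytes(num_entries: int, entry_bytes: int) -> int:
    """Bytes uploaded over a run when every append re-uploads the whole journal.

    Append k uploads k entries, so the total is
    entry_bytes * (1 + 2 + ... + N) = entry_bytes * N * (N + 1) / 2,
    which grows as O(N^2).
    """
    return entry_bytes * num_entries * (num_entries + 1) // 2


# Illustrative numbers only: 100 entries of 2 KiB each.
journal_size = 100 * 2048                    # final journal: 204,800 bytes
uploaded = total_uploaded_bytes(100, 2048)   # total uploaded: 10,342,400 bytes
print(journal_size, uploaded)
```

Under these assumptions a 200 KiB journal costs roughly 10 MB of uploads over the whole run: about 50x amplification, but small in absolute terms.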

Other designs optimize for different things, e.g.:

- A design like Delta Lake's, with one object per entry, makes appends cheap, but recovery does far more work: it has to list, fetch, order, and validate many entry objects.
- A WAL-style layout would write entries as separate objects as well, along with a small `meta.json` file containing the active session and next offset. Writers would then CAS `meta.json` instead of the individual journal entries. That avoids having to rewrite the full journal, but the tradeoff is that a commit is now spread across two writes instead of one, potentially leaving us with partial commits we need to reconcile.

In kassette’s case, it makes sense to accept the larger write payload in exchange for fewer reads and an overall simpler design.

## Serializing concurrent writes

Each append uses a CAS-protected read-modify-write loop:

1. Read the journal and its etag.
2. Add one JSONL line locally.
3. Upload the full journal object back with `If-Match: <etag>`.

The `If-Match` header provides compare-and-swap (CAS): object storage accepts the upload only if the object still has the etag we initially read. This is what serializes concurrent appends.
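
The three steps above can be sketched against an in-memory stand-in for the object store. This is a minimal illustration, not kassette's implementation: `InMemoryStore`, `put_if_match`, `append_entry`, the `type` field, and the retry limit are all hypothetical names and choices. The sketch also assumes the journal object already exists, and it leaves out the fencing check described under "Fencing zombie writers".

```python
import json
from typing import Tuple


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 response from the object store."""


class InMemoryStore:
    """Hypothetical stand-in for one journal object with If-Match support."""

    def __init__(self) -> None:
        self._body = b""
        self._version = 0

    def get(self) -> Tuple[bytes, str]:
        return self._body, f"v{self._version}"

    def put_if_match(self, body: bytes, etag: str) -> str:
        # Accept the write only if the caller read the current version.
        if etag != f"v{self._version}":
            raise PreconditionFailed(etag)
        self._body = body
        self._version += 1
        return f"v{self._version}"


def append_entry(store: InMemoryStore, entry: dict, max_attempts: int = 5) -> str:
    line = json.dumps(entry).encode() + b"\n"
    for _ in range(max_attempts):
        body, etag = store.get()                       # 1. read journal + etag
        new_body = body + line                         # 2. add one JSONL line locally
        try:
            return store.put_if_match(new_body, etag)  # 3. upload with If-Match
        except PreconditionFailed:
            continue  # our copy was stale: reread and try again
    raise RuntimeError("gave up after repeated CAS conflicts")


store = InMemoryStore()
append_entry(store, {"type": "start", "session": 1})
append_entry(store, {"type": "step", "session": 1})
print(store.get())
```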

Suppose two processes try to append to the same run:

1. Process A and process B both read the same journal at etag `E1`.
2. A appends its line and uploads with `If-Match: E1`. It’s accepted and the etag is updated to `E2`.
3. B appends its line to its (now stale) copy and uploads with `If-Match: E1`. It will be rejected with `412 Precondition Failed` because the current etag is now `E2`.
4. B reads the latest journal, now including A's line, appends its line to that version, and this time succeeds.

From this example you can see that the rejection is only a signal that B's copy was stale, not a lost update. CAS gives kassette a strongly consistent, ordered journal without needing a lock service.
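
The same interleaving can be replayed step by step in a few lines. This is a sketch only: `FakeObject` and its `put_if_match` method are hypothetical stand-ins for the object store, and the etags are generated as `E1`, `E2`, ... purely to match the example above.

```python
class PreconditionFailed(Exception):
    """Stands in for HTTP 412 Precondition Failed."""


class FakeObject:
    """Hypothetical in-memory object whose etag changes on every write."""

    def __init__(self, body: str) -> None:
        self.body = body
        self._writes = 1
        self.etag = "E1"

    def put_if_match(self, body: str, etag: str) -> None:
        if etag != self.etag:
            raise PreconditionFailed(f"expected {self.etag}, got {etag}")
        self.body = body
        self._writes += 1
        self.etag = f"E{self._writes}"


journal = FakeObject('{"n": 0}\n')

# 1. A and B both read the journal at etag E1.
a_body, a_etag = journal.body, journal.etag
b_body, b_etag = journal.body, journal.etag

# 2. A uploads first; the etag advances to E2.
journal.put_if_match(a_body + '{"from": "A"}\n', a_etag)

# 3. B's upload still carries If-Match: E1, so it is rejected.
try:
    journal.put_if_match(b_body + '{"from": "B"}\n', b_etag)
    b_rejected = False
except PreconditionFailed:
    b_rejected = True

# 4. B rereads (now seeing A's line) and retries against E2.
b_body, b_etag = journal.body, journal.etag
journal.put_if_match(b_body + '{"from": "B"}\n', b_etag)

print(journal.body)
```

Both lines end up in the journal, A's before B's, and B's rejected upload never touched the object.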

We’ve now established consistency for the journal, but that’s only one part of safely appending to it. The other danger is zombie processes.

## Fencing zombie writers

A zombie writer is an old kassette process that is still running after a newer one has taken over the same run. This most commonly happens when a dispatcher redelivers a timed-out job.

kassette fences zombies with session numbers stored in the journal:

- Every new session writes a `start` entry.
- That `start` entry uses a session number higher than any session already in the journal.
- Every later entry from that invocation carries the same session number.

The session with the highest number in the journal is the current writer.
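
As a rough sketch of how a session number might be derived from the journal, assuming every entry is a JSON object carrying a `session` field (the helper names `highest_session` and `next_session` and the `type` field are hypothetical):

```python
import json


def highest_session(journal_text: str) -> int:
    """Highest session number in the journal, or 0 for an empty journal."""
    sessions = [
        json.loads(line)["session"]
        for line in journal_text.splitlines()
        if line.strip()
    ]
    return max(sessions, default=0)


def next_session(journal_text: str) -> int:
    """A session number higher than any session already in the journal."""
    return highest_session(journal_text) + 1


journal_text = (
    '{"type": "start", "session": 1}\n'
    '{"type": "step", "session": 1}\n'
    '{"type": "start", "session": 2}\n'
)
print(highest_session(journal_text), next_session(journal_text))
```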

Here’s a simple example:

1. Session 1 starts and begins work.
2. Session 1 stalls before its next journal write.
3. The dispatcher starts session 2 for the same run.
4. Session 2 writes `start` with `session: 2`; that is now the highest session in the journal.
5. Session 1 wakes up and tries to append a `step` entry with `session: 1`.
6. Its CAS write fails because session 2 changed the object.
7. Session 1 rereads the journal, sees `start` with `session: 2`, and throws `FencedError` instead of retrying.

The important part is step 7. After a CAS failure, kassette does not blindly retry; it always rereads the journal and runs the fencing check again. That way, if a higher session has started, this writer knows it’s stale and must stop.
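
Steps 5 to 7 can be sketched as an append loop that runs the fencing check on every fresh read. This is an illustration under stated assumptions, not kassette's code: `InMemoryStore`, `RacingStore`, `fenced_append`, `highest_session`, the `type` field, and the retry limit are hypothetical. Only `FencedError` and the `session` field come from the design above. `RacingStore` exists only to force session 2's write to land between session 1's read and upload.

```python
import json
from typing import Optional, Tuple


class PreconditionFailed(Exception):
    """Stands in for HTTP 412 Precondition Failed."""


class FencedError(Exception):
    """Raised when a higher session has taken over the run."""


class InMemoryStore:
    """Hypothetical stand-in for one journal object with If-Match support."""

    def __init__(self, body: bytes = b"") -> None:
        self._body = body
        self._version = 1

    def get(self) -> Tuple[bytes, str]:
        return self._body, f"E{self._version}"

    def put_if_match(self, body: bytes, etag: str) -> str:
        if etag != f"E{self._version}":
            raise PreconditionFailed(etag)
        self._body = body
        self._version += 1
        return f"E{self._version}"


def encode(entry: dict) -> bytes:
    return json.dumps(entry).encode() + b"\n"


def highest_session(body: bytes) -> int:
    lines = [ln for ln in body.decode().splitlines() if ln.strip()]
    return max((json.loads(ln)["session"] for ln in lines), default=0)


def fenced_append(store: InMemoryStore, entry: dict, max_attempts: int = 5) -> str:
    for _ in range(max_attempts):
        body, etag = store.get()
        # Fencing check on every fresh read, including the reread after a 412.
        if highest_session(body) > entry["session"]:
            raise FencedError(f"session {entry['session']} has been superseded")
        try:
            return store.put_if_match(body + encode(entry), etag)
        except PreconditionFailed:
            continue  # stale copy: loop back to reread and re-check fencing
    raise RuntimeError("gave up after repeated CAS conflicts")


class RacingStore(InMemoryStore):
    """Demo helper: a rival write lands right after the first read."""

    def __init__(self, body: bytes, rival: dict) -> None:
        super().__init__(body)
        self._rival: Optional[dict] = rival

    def get(self) -> Tuple[bytes, str]:
        body, etag = super().get()
        if self._rival is not None:
            rival, self._rival = self._rival, None
            super().put_if_match(body + encode(rival), etag)
        return body, etag  # deliberately stale if the rival just wrote


store = RacingStore(encode({"type": "start", "session": 1}),
                    rival={"type": "start", "session": 2})
try:
    fenced_append(store, {"type": "step", "session": 1})
    outcome = "appended"
except FencedError:
    outcome = "fenced"
print(outcome)
```

In this run the zombie's first upload fails the CAS check, the reread reveals `session: 2`, and the loop stops with `FencedError` rather than appending a stale `step`.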

CAS and fencing are performing different jobs:

- CAS ensures concurrent appends never overwrite each other, so the journal stays consistent and ordered.
- The session number decides whether this process is still allowed to append.

Together they give kassette a single ordered journal with a single writer, all without the complexity of locks, leases, heartbeats, or a coordinator.
