Skip to main content
← Back to list
01Issue
FeatureShippedSwamp CLI
Assigneeskeeb

#232 Extend DatastoreSyncService.markDirty() with optional relPath argument

Opened by keeb · 5/4/2026· Shipped 5/4/2026

Problem statement

DatastoreSyncService.markDirty() is currently a no-argument method. Its only job is to flip a fast-path bypass flag — the contract is "something in the cache changed; revalidate on the next sync."

This shape was sized for the existing first-party datastore extensions (@swamp/s3-datastore, @swamp/gcs-datastore), where the remote backend is an opaque blob store. Those extensions detect changes via a full cache walk (stat + mtime/size diff against a remote-stored .datastore-index.json). A single boolean dirty flag is enough — knowing which path was written wouldn't change what they have to do, because they still need the walk to compute the diff.

For datastore extensions backed by per-key-queryable storage — document stores, KV stores, content-addressed blob stores with key-level metadata APIs, etc. — the no-arg shape forces a fundamentally more expensive implementation than the backend requires. These extensions can validate or update a single file in O(1) without touching anything else, but markDirty() doesn't tell them which file. They're forced to either (a) walk the entire cache on every sync or (b) maintain their own filesystem watcher to recover information that swamp-core already had at the moment of the write.

The cost shows up at scale: cache walks that dominate sync time when the cache holds tens of thousands of files, even when only a handful changed.

Proposed solution

Extend the interface to:

markDirty(relPath?: string): Promise<void>

Behavior:

  • When relPath is omitted: identical to today's contract — "something dirty, revalidate everything." Defensive bulk-invalidate.
  • When relPath is provided: the specified cache-relative path was written or deleted. Extensions MAY use this to track per-path dirty state and skip the walk; they MAY also ignore it and treat the call as a bulk-invalidate (preserving today's behavior).

swamp-core's repository layer would thread the relPath into existing call sites where it's already in scope. From a quick survey, every existing markDirty() call site in swamp-core has the path available as a local variable — unified_data_repository.ts, yaml_output_repository.ts, yaml_workflow_run_repository.ts, yaml_evaluated_definition_repository.ts. Bulk-style operations (rename, cascade delete) decompose into per-path calls already.

Who benefits

  • Authors of custom DatastoreProvider extensions targeting queryable backends — document stores (e.g. MongoDB), KV stores (e.g. Redis, etcd), object stores with per-key metadata APIs. These authors can implement pushChanged as O(actually-changed-files) instead of O(everything-in-cache).
  • Operators running large shared swamp deployments where the cache grows past ~10k files. Today the per-sync cost grows with cache size, not change rate.
  • Future first-party extensions targeting backends with cheap per-key access (a hypothetical @swamp/redis-datastore, @swamp/dynamodb-datastore, etc.).

The existing @swamp/s3-datastore and @swamp/gcs-datastore extensions are unaffected — their slow-path walk is structural to the backend's API surface, and they can ignore the new optional arg.

Why this is strictly additive

  • No breaking change for extension authors. markDirty(relPath?) with optional arg is source-compatible with markDirty(). Existing implementations continue to work.
  • No breaking change for swamp-core. Today's call sites can keep calling markDirty() — passing the path is opportunistic.
  • No correctness risk. An extension that ignores the path still gets the bulk-invalidate semantics it had before.

The change is essentially: an optional information channel for extensions that can use it.

Alternatives considered

  1. A separate markDirtyPath(relPath) method. Adds a parallel API surface; extensions would have to implement both or risk one path being missed. The optional-arg shape keeps a single contract.
  2. A filesystem watcher inside the extension. Recovers the path information swamp-core already has, at the cost of OS-specific code, race windows on rapid writes, and bugs that an explicit hook would never have. Strictly worse than threading the path through the existing call.
  3. Extension-level write interception. Would require the extension to provide its own writeFile API and have swamp-core route through it. Much larger surface area; doesn't compose well with swamp-core's existing repository abstraction.

Implementation footprint

Rough sketch (not authoritative):

  • Update DatastoreSyncService interface to accept optional relPath arg.
  • Update existing first-party extension stubs / implementations to accept (and ignore) the new arg.
  • In swamp-core's repository layer, thread the existing path local into the markDirty call. Estimated ~10–15 call sites across 4 files.
  • Update design/datastores.md to document the optional-arg semantics.
  • Add an interface-level unit test that covers both call shapes.
02Bog Flow
OPENTRIAGEDIN PROGRESSSHIPPED+ 1 MOREASSIGNED+ 11 MOREREVIEW+ 3 MOREPR_MERGEDSHIPPED

Shipped

5/4/2026, 11:57:50 PM

Click a lifecycle step above to view its details.

03Sludge Pulse
keeb assigned keeb5/4/2026, 9:26:39 PM
Editable. Press Enter to edit.

keeb commented 5/4/2026, 9:48:04 PM

A few things worth tightening before this lands — happy to plan an implementation once they're resolved.

1. The current signature isn't no-argument

markDirty already takes an options bag for AbortSignal symmetry with pullChanged/pushChanged:

// src/domain/datastore/datastore_sync_service.ts:61
markDirty(options?: DatastoreSyncOptions): Promise<void>;

That changes the proposed shape. markDirty(relPath?: string) collides with the existing options-object position. Three landing options, none of them "purely additive" without qualification:

  • markDirty(options?: DatastoreSyncOptions & { relPath?: string }) — fully source-compatible but the proposed positional shape is gone.
  • markDirty(relPath?: string, options?: DatastoreSyncOptions) — type-breaks any extension implementing markDirty(opts) today (no in-tree caller does, but the canonical interface allowed it).
  • Drop the options bag — same break, plus loses the AbortSignal channel.

The PR author needs to commit to one and document the migration.

2. The contract mismatch is bigger than the signature

The JSDoc at datastore_sync_service.ts:46-60 is explicit about when and why markDirty fires:

Must be called before ... any write into the cache directory ... so the next pushChanged fast path cannot short-circuit past the write ... Implementations must be idempotent and cheap ... Called before the write begins so a crash mid-write still leaves the watermark dirty.

markDirty is a watermark invalidator — it flips a clean/dirty flag so the next pushChanged knows it has work to do. The actual per-file sync work happens in pushChanged.

The proposal's claim is that the new arg lets queryable backends "validate or update a single file in O(1)." That doesn't fit the contract: at markDirty-time the bytes haven't been written yet, so a Mongo-style backend has nothing to push or validate. Whatever optimization the new arg enables has to flow through pushChanged, not happen inside markDirty.

3. Pick which design you actually want

The proposal is conflating two distinct designs:

Design A — path-tracking via markDirty. markDirty stays cheap, records relPaths into a Set, and pushChanged consumes that Set instead of walking the cache. This matches "extensions MAY use this to track per-path dirty state" but the proposal's wording suggests something stronger.

Design B — path-level sync. pushChanged itself learns to take a dirty-paths list (pushChanged({ paths })); markDirty is unchanged or repurposed. This is what "validate or update a single file in O(1)" actually requires.

Pick one. The two have different surface areas, different test stories, and different implications for the s3/gcs fingerprint fast path.

4. The "10–15 call sites" undersells the cascade

The internal MarkDirtyHook indirection (type MarkDirtyHook = () => Promise<void> at datastore_sync_service.ts:76) also has to grow the arg, and every notifyDirty() helper in the four repositories needs the path threaded in:

  • unified_data_repository.ts — 8 sites
  • yaml_output_repository.ts — 2 sites
  • yaml_workflow_run_repository.ts — 2 sites
  • yaml_evaluated_definition_repository.ts — 3 sites

Plus the wiring at repo_context.ts:437 and :566 that bridges syncService.markDirty() into the hook (and would now own absolute → cache-relative path conversion). Plus the packages/testing/datastore_types.ts mirror, which is shape-checked against the canonical type by testing_package_compat_test.ts. Plus design/datastores.md and the swamp-extension-datastore skill (api.md + examples.md).

Mechanical work, but not "purely additive, optional" — every caller plus the hook type needs to move.

What I'd ask

  1. Fix the "no-argument" claim and pick a signature that composes with the existing options bag.
  2. Decide explicitly whether you want path-level invalidation tracking (Design A — markDirty records, pushChanged consumes) or path-level sync (Design B — pushChanged takes paths). State the answer in the proposal.
  3. Acknowledge the MarkDirtyHook + call-site + wiring cascade so the implementation footprint isn't a surprise.

Once those land I'll finish the plan against whichever design you pick.

keeb commented 5/4/2026, 9:55:16 PM

Thanks — all three points are valid. Revisions:

1. Signature — extend the existing options bag

Conceding the "no-argument" claim. The current shape markDirty(options?: DatastoreSyncOptions) rules out the positional relPath I proposed. The right landing is:

interface MarkDirtyOptions extends DatastoreSyncOptions {
  relPath?: string;
}
markDirty(options?: MarkDirtyOptions): Promise<void>;

Fully source-compatible with today's markDirty() and markDirty({signal}) calls. The other two landing options the reviewer listed (positional relPath first, drop options bag) both break existing callers and lose AbortSignal symmetry — non-starters.

2. Committing to Design A

markDirty records the relPath into an in-memory dirty set (persisted to a sidecar so it survives process restart); pushChanged consumes that set instead of walking the cache.

This matches what markDirty actually is — a watermark invalidator called before the write, when bytes aren't on disk yet. The path argument turns it into a path-tracking invalidator. Per-key sync work still happens inside pushChanged, just driven by the recorded set instead of a full-cache walk. The reviewer's framing is correct: the win flows through pushChanged; the new arg lets it know which paths to look at.

Design B (pushChanged({ paths })) was tempting but requires swamp-core's coordinator to also maintain a dirty-paths list and pipe it through to the extension — a bigger structural change for the same end result.

The original wording ("validate or update a single file in O(1)") was wrong about timing. Corrected: markDirty records O(1); pushChanged later does the per-file work, sized to what was recorded rather than to the cache.

3. Acknowledging the cascade

The implementation is strictly additive at the behavior level — extensions ignoring the new field keep working — but the mechanical footprint extends past the four repository files I named:

  • Canonical DatastoreSyncService interface + MarkDirtyHook type at datastore_sync_service.ts:76 both grow the arg
  • notifyDirty() helpers in the four persistence repos (~15 sites total: unified_data_repository.ts 8, yaml_output_repository.ts 2, yaml_workflow_run_repository.ts 2, yaml_evaluated_definition_repository.ts 3)
  • Wiring at repo_context.ts:437 and :566 — and that wiring layer takes on the abs→cache-relative path conversion (it's the layer that owns the cache root)
  • packages/testing/datastore_types.ts mirror, kept in sync via testing_package_compat_test.ts
  • design/datastores.md contract section
  • swamp-extension-datastore skill (api.md + examples.md)

Mechanical work, but not "forget about it" — every caller plus the hook type moves.

What I'd ask now

Replan against these three corrections. Once a v2 plan lands I'll review and we can go from there.

Sign in to post a ripple.