Lab: Workflow run liveness: orphaned 'running' records when originating CLI process dies mid-run

Problem

When swamp workflow run is invoked from a long-lived agent (e.g. Claude Code), the agent's process can be killed mid-run for reasons unrelated to swamp itself — most commonly when the model session is compacted or the parent terminal is closed. When that happens today:

The workflow-run YAML on disk stays at status: running with every step still pending, even though the CLI process is gone and no further work will happen.
The .swamp/workflow-runs/.../<run-id>.log simply stops mid-stream with no terminal entry.
swamp workflow history get <name> continues to report status: running indefinitely.
Subsequent swamp workflow run <name> invocations still succeed and start a new run, but the previous run's record is never reconciled, so operators have to look at log mtimes and ps to figure out what's really alive.
For workflows whose steps mutate external state (e.g. our dirtyfrag-mitigate-prod, which fans out a SSH/QEMU-guest-agent command across ~120 hosts), a partially-completed run is operationally significant: some hosts are mitigated, others aren't, and the report-rendering steps (probe-* + the mitigation-status report) never run, so the verdict report is empty even though state has changed in the world.

Concrete repro this came from

PLT-549 prod mitigation rollout, 2026-05-08:

Agent invoked swamp workflow run dirtyfrag-mitigate-prod (run 0e4adac1-…, 20:55:03Z). CLI process killed by the agent's parent at 20:56:47Z mid-mitigate-vms (90/110 VMs done). The workflow record was eventually marked failed — so this case actually was reconciled, at the moment the parent killed the CLI.
Agent invoked it again (run 0e1ac86b-…, 20:57:13Z). CLI was killed again at 20:57:35Z, this time mid-mitigate-vms (20/110 VMs done). Run record stayed at status: running indefinitely. No reconciliation happened.

The asymmetry between the two runs (one marked failed, the other left running) suggests the existing kill-handling path runs in some cases but not others, and isn't watching the originating-process liveness as a primary signal.

Proposed solution

Either of these would solve our case; (b) is more useful but (a) is much cheaper to ship:

(a) Liveness-based reconciliation on read. Persist the originating CLI process PID + start time alongside the run record. When any read-side command (workflow history get/search, workflow run of the same workflow, report get) encounters a running record, sanity-check that PID/start-time combination on the local host. If the process is gone (or the start time doesn't match), transition the record to a terminal state — e.g. cancelled or a new abandoned — with a reason like originating-process-not-alive. Cheap, no new daemon, but only works on the original host.

(b) Daemonised execution / supervised resume. Have swamp workflow run spawn (or attach to) a small detached supervisor that owns the run lifecycle independent of the CLI. The CLI then becomes a thin client that streams logs from the supervisor and exits cleanly without affecting the run. Combined with an explicit swamp workflow cancel <run-id> for operator-driven cleanup. This is what you'd want long-term anyway for unattended scheduled runs.

What we'd want either way

A way for an agent to call something like swamp workflow reconcile (or have it happen automatically as part of workflow history/workflow run) that catches and terminates orphaned local runs without re-touching the world.
A clear distinction in the run status field between real failure (a step exited non-zero) and abandoned (the originating process disappeared) — the operator response is different (re-run vs. investigate).
The reconciliation event should appear in the run log so post-hoc forensics still work.

Why this matters

This is going to bite anyone driving swamp from an LLM agent or a long-lived editor session — exactly the agentic-CI workflow swamp is well-positioned for. The mitigate workflow is also a good motivating case: idempotent, but the orphaned record makes it look unsafe to re-run when actually the right move is "reconcile + re-run".

telemetry: emit child entries for follow-up action method invocations

Publish release-candidate / unstable extension versions

extension source rm leaves stale @local/. rows in catalog, blocking later pulls of registry types

extension push drops binaries: field from re-emitted archive manifest.yaml

macOS launchd autoupdate (club.swamp.autoupdate) silently fails — binary stays stale

Clicking @hivemq/honeycomb extension card on /extensions shows 'Something went wrong'

Docs: Update model-definitions.md and workflows.md for direct type execution

Docs: document binaries manifest field in extension-manifest.md

Add an official @swamp/ssh extension for general-purpose SSH (brownfield-friendly)

Accept and display binaries field from extension push metadata

Direct type execution: collapse model create + method run into one command

Per-method telemetry events for workflow runs

Workflow run liveness: orphaned 'running' records when originating CLI process dies mid-run

Provide a CLI-shape primer for AI agents to reduce rediscovery overhead

quality rubric: don't penalize extensions whose upstream constrains them to a single platform

Extension update rejects multiple .ts files extending the same target type within one local extension (regression)

swamp extension push deadlocks when invoked from inside a swamp workflow step

Collective-scoped auth keys + OIDC federation for CI publishing

forEach self.* in modelIdOrName not resolved in runtime execution path

Award leaderboard points for referrals and collective invites

Add agent harness detection and AiTool to telemetry

Workflow-level runtime expressions (env.*, vault.*) not resolved in driverConfig — docker driver receives literal ${{ ... }} strings

Implement W5: Per-fingerprint import URLs + subprocess test harness (extension catalog rearchitecture)

swamp config set crashes with YAML serialization error

datastore compact: VACUUM fails in compiled binary (SQLITE_LIMIT_ATTACHED=0)

Repo-level version gating: minSwampVersion high-water mark for team consistency

Docs: document self.* expressions in modelIdOrName during forEach

Docs: How-to guide for background autoupdating

Manifest version bumps silently ignored for existing local extension aggregates

materialiseExtensions misclassifies pulled rows when manifest name collides with a pulled extension

Local extension edits don't reliably trigger rebundle

Missing unique indexes on user.email and user.username allow duplicate users

Resolve self.* expressions in modelIdOrName during forEach expansion

discord-bot double-sends sign_up notifications

Discord bot sends duplicate signup notifications

Add 'swamp workflow list' as alias for 'swamp workflow search'

Add 'swamp auth status' as alias for 'swamp auth whoami'

Extension bundle cache does not invalidate on source edits

extension pull fails on @local/[email protected] phantom-claim collision when local repo has its own extension

Lab search by numeric issue ID returns no results

W3 sourceToRow writes empty source_mtime — should carry filesystem mtime through Source entity

Warm-start rebundleAndUpdateCatalog should respect terminal RowStates set by reconcile

Implement W4: KindAdapter + unified loader (extension catalog rearchitecture)

Docs: add vault read-secret command to reference manual

Extension layer garbage collection: prune catalog rows + evict orphaned bundles

Docs: document workflow concurrency limits in reference manual

Locally-sourced extension: source_mtime updates without regenerating stale bundle

Extension push: allow shipping executable host helpers (bin/mudroom blocker)

Vault expressions silently deliver __SWAMP_VSEC__ sentinels under the docker driver

First-class shell-shim support for extensions, with registry-level visibility

Local extension model bundles don't rebuild when source changes (no rebuild CLI; manual cache delete breaks the runner)

Configurable concurrency limits for workflow fan-out (forEach, parallel jobs/steps)

feat(security): redact sensitive method arg values from audit log

docs: document swamp datastore compact and GC WAL behaviour

UAT: swamp datastore compact reclaims WAL and catalog space

Plan v4 step 9 literal test untestable under current catalog PK semantics

Cross-process concurrency stress for W2 lifecycle services

UAT additions for W2 lifecycle services (Install/Remove/Upgrade)

Implement W3: ReconcileFromDisk + freshness-as-aggregate-query (extension catalog rearchitecture)

doctor extensions repair: clean catalog-only orphans

Extensions should be able to ship Claude Code skills

Pre-existing TOCTOU windows in YAML repo walkers (findAll directory level + findById)

`swamp datastore setup` migration is not resumable / leaves the repo in a partial state on failure

Built-in models must honor AbortSignal so --timeout works in practice

Reader-lock or lock-free read path for data list/get/search/query

data search: surface jobTag alongside workflowTag and stepTag (follow-up to #237)

global-arg silently strips unknown keys; should reject via strict Zod schema

redmine extension: adopt state machine patterns from @magistr for agent-driven workflows

performance degrades significantly with large SQLite catalog

summarise: timeout/hang on repos with many workflow runs

extension quality/fmt fail on pulled extensions (path mismatch)

vault: add read-secret CLI command for agent-driven secret retrieval

data query: stepName and jobName fields always empty in CEL results

Extension ecosystem: shared utility library, version sync tooling, and HTTP resilience patterns

Agentic CLI improvements: --json stdout isolation, array inputs, and --repo-dir consistency

data delete fails with "Directory not empty (os error 39)" when concurrent writes are active

Extract LockfileRepository (W2 prequel for swamp-club#231)

Extend DatastoreSyncService.markDirty() with optional relPath argument

Implement W2: Lifecycle services own the catalog write (extension catalog rearchitecture)

data·delete logger double-quotes the data name in log output

Workflow-level runtime expressions (env., vault.) not resolved in driverConfig — docker driver receives literal ${{ ... }} strings