#300 Workflow run liveness: orphaned 'running' records when originating CLI process dies mid-run
Opened by bixu · 5/8/2026
Problem
When swamp workflow run is invoked from a long-lived agent (e.g. Claude Code), the agent's process can be killed mid-run for reasons unrelated to swamp itself — most commonly when the model session is compacted or the parent terminal is closed. When that happens today:
- The workflow-run YAML on disk stays at status: running with every step still pending, even though the CLI process is gone and no further work will happen.
- The .swamp/workflow-runs/.../<run-id>.log simply stops mid-stream with no terminal entry.
- swamp workflow history get <name> continues to report status: running indefinitely.
- Subsequent swamp workflow run <name> invocations still succeed and start a new run, but the previous run's record is never reconciled, so operators have to look at log mtimes and ps to figure out what's really alive.
- For workflows whose steps mutate external state (e.g. our dirtyfrag-mitigate-prod, which fans out a SSH/QEMU-guest-agent command across ~120 hosts), a partially-completed run is operationally significant: some hosts are mitigated, others aren't, and the report-rendering steps (probe-* + the mitigation-status report) never run, so the verdict report is empty even though state has changed in the world.
Concrete repro this came from
PLT-549 prod mitigation rollout, 2026-05-08:
- Agent invoked swamp workflow run dirtyfrag-mitigate-prod (run 0e4adac1-…, 20:55:03Z). CLI process killed by the agent's parent at 20:56:47Z mid-mitigate-vms (90/110 VMs done). The workflow record was eventually marked failed, so this case actually was reconciled, at the moment the parent killed the CLI.
- Agent invoked it again (run 0e1ac86b-…, 20:57:13Z). CLI was killed again at 20:57:35Z, this time mid-mitigate-vms (20/110 VMs done). Run record stayed at status: running indefinitely. No reconciliation happened.
The asymmetry between the two runs (one marked failed, the other left running) suggests the existing kill-handling path runs in some cases but not others, and isn't watching the originating-process liveness as a primary signal.
Proposed solution
Either of these would solve our case; (b) is more useful but (a) is much cheaper to ship:
(a) Liveness-based reconciliation on read. Persist the originating CLI process PID + start time alongside the run record. When any read-side command (workflow history get/search, workflow run of the same workflow, report get) encounters a running record, sanity-check that PID/start-time combination on the local host. If the process is gone (or the start time doesn't match), transition the record to a terminal state — e.g. cancelled or a new abandoned — with a reason like originating-process-not-alive. Cheap, no new daemon, but only works on the original host.
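To make (a) concrete, here is a rough sketch of the PID + start-time check, in Python purely for illustration. It assumes a POSIX host, and Linux /proc for the start time; the Originator type and all function names are hypothetical and not part of swamp. Comparing the start time as well as the PID is what guards against PID reuse.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Originator:
    """Hypothetical fields persisted alongside the run record."""
    pid: int
    start_ticks: Optional[int]  # None when the start time could not be read


def read_start_ticks(pid: int) -> Optional[int]:
    """Start time of `pid` in clock ticks since boot, from /proc (Linux only)."""
    try:
        with open(f"/proc/{pid}/stat", "r", errors="replace") as f:
            stat = f.read()
    except OSError:
        return None  # no such process, or no /proc on this platform
    # Field 2 (comm) is parenthesised and may contain spaces, so parse after
    # the last ')'. starttime is field 22, i.e. index 19 counting from field 3.
    fields = stat[stat.rfind(")") + 1:].split()
    return int(fields[19]) if len(fields) > 19 else None


def pid_exists(pid: int) -> bool:
    """POSIX-only existence probe; signal 0 delivers nothing to the target."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def originator_alive(o: Originator) -> bool:
    if o.start_ticks is None:
        # No start time was recorded: fall back to the weaker PID-only check.
        return pid_exists(o.pid)
    # A missing process and a different start time (PID reuse) both mean "gone".
    return read_start_ticks(o.pid) == o.start_ticks


# What the CLI might record about itself when it starts a run.
me = Originator(os.getpid(), read_start_ticks(os.getpid()))
```

A read-side command would call originator_alive on the stored values and only transition the record when it returns False. As noted above, this only means something on the host that started the run.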
(b) Daemonised execution / supervised resume. Have swamp workflow run spawn (or attach to) a small detached supervisor that owns the run lifecycle independent of the CLI. The CLI then becomes a thin client that streams logs from the supervisor and exits cleanly without affecting the run. This would be combined with an explicit swamp workflow cancel <run-id> for operator-driven cleanup. This is what you'd want long-term anyway for unattended scheduled runs.
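The shape of (b) could look something like the sketch below, again in Python for illustration and assuming a POSIX host. spawn_detached and follow_log are hypothetical names, and the "supervisor" here is just an arbitrary command; how swamp would actually package its supervisor is not something this sketch tries to answer. The point is that the worker runs in its own session, so a terminal hangup or a signal sent to the CLI's process group does not reach it.

```python
import subprocess
import sys
import time
from pathlib import Path
from typing import Iterator


def spawn_detached(cmd: list[str], log_path: Path) -> subprocess.Popen:
    """Start `cmd` in its own session so it can outlive the calling CLI (POSIX)."""
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # setsid(): own session and process group
        )
    finally:
        log.close()  # the child keeps its own copy of the descriptor


def follow_log(log_path: Path, proc: subprocess.Popen) -> Iterator[bytes]:
    """Thin client: yield log output as it appears until the worker exits."""
    with open(log_path, "rb") as f:
        while True:
            chunk = f.read()
            if chunk:
                yield chunk
                continue
            if proc.poll() is not None:
                tail = f.read()  # drain anything written just before exit
                if tail:
                    yield tail
                return
            time.sleep(0.1)
```

In this sketch the client holds the Popen handle, so it only covers the "spawn" half. An "attach" path would need to find the supervisor again by run id, for example through a persisted PID or a socket, which is left out here.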
What we'd want either way
- A way for an agent to call something like swamp workflow reconcile (or have it happen automatically as part of workflow history/workflow run) that catches and terminates orphaned local runs without re-touching the world.
- A clear distinction in the run status field between real failure (a step exited non-zero) and abandoned (the originating process disappeared); the operator response is different (re-run vs. investigate).
- The reconciliation event should appear in the run log so post-hoc forensics still work.
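Taken together, the three points above might be sketched as a single reconcile pass, shown here in Python for illustration. Everything in it is an assumption: the record is a plain dict, the log event is a JSON line, and the field names are made up, since the real run-record and log formats are not described in this issue. The liveness check is passed in as a callable so the pass itself never touches external state or executes a step.

```python
import json
from datetime import datetime, timezone
from enum import Enum
from pathlib import Path
from typing import Callable


class RunStatus(str, Enum):
    RUNNING = "running"
    FAILED = "failed"        # a step exited non-zero
    ABANDONED = "abandoned"  # the originating process disappeared


def reconcile(record: dict, log_path: Path,
              is_alive: Callable[[dict], bool]) -> dict:
    """Close out an orphaned run without executing any of its steps."""
    if record.get("status") != RunStatus.RUNNING.value or is_alive(record):
        return record  # already terminal, or genuinely still running
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": "reconciled",
        "from": RunStatus.RUNNING.value,
        "to": RunStatus.ABANDONED.value,
        "reason": "originating-process-not-alive",
    }
    # Append to the run log so the transition is visible in later forensics.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")
    # Return a new record; persisting it is left to the caller.
    return {**record, "status": RunStatus.ABANDONED.value,
            "reason": event["reason"]}
```

Keeping abandoned separate from failed means a caller can branch on the status alone, without parsing the reason string.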
Why this matters
This is going to bite anyone driving swamp from an LLM agent or a long-lived editor session — exactly the agentic-CI workflow swamp is well-positioned for. The mitigate workflow is also a good motivating case: idempotent, but the orphaned record makes it look unsafe to re-run when actually the right move is "reconcile + re-run".
Open