01 Issue
Feature · Open · Swamp CLI
Assignees: None

#300 Workflow run liveness: orphaned 'running' records when originating CLI process dies mid-run

Opened by bixu · 5/8/2026

Problem

When swamp workflow run is invoked from a long-lived agent (e.g. Claude Code), the agent's process can be killed mid-run for reasons unrelated to swamp itself — most commonly when the model session is compacted or the parent terminal is closed. When that happens today:

  • The workflow-run YAML on disk stays at status: running with every step still pending, even though the CLI process is gone and no further work will happen.
  • The .swamp/workflow-runs/.../<run-id>.log simply stops mid-stream with no terminal entry.
  • swamp workflow history get <name> continues to report status: running indefinitely.
  • Subsequent swamp workflow run <name> invocations still succeed and start a new run, but the previous run's record is never reconciled, so operators have to look at log mtimes and ps to figure out what's really alive.
  • For workflows whose steps mutate external state (e.g. our dirtyfrag-mitigate-prod, which fans out an SSH/QEMU-guest-agent command across ~120 hosts), a partially-completed run is operationally significant: some hosts are mitigated, others aren't, and the report-rendering steps (probe-* + the mitigation-status report) never run, so the verdict report is empty even though state has changed in the world.
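
Until reconciliation exists, the manual check described above amounts to comparing a run log's modification time against the clock. A minimal Python sketch of that heuristic, assuming the operator already knows the log path; log_is_stale and the 300-second threshold are hypothetical and not part of swamp. A quiet log can also mean a slow step, so this only flags candidates for a follow-up ps check:

```python
import os
import time
from typing import Optional


def log_is_stale(log_path: str, max_idle_seconds: float = 300.0,
                 now: Optional[float] = None) -> bool:
    """Return True if the log at log_path has not been written to recently.

    Hypothetical helper: the default threshold is an arbitrary illustration value.
    """
    if now is None:
        now = time.time()
    # Seconds elapsed since the last write to the log file.
    idle = now - os.path.getmtime(log_path)
    return idle > max_idle_seconds
```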

Concrete repro this came from

PLT-549 prod mitigation rollout, 2026-05-08:

  1. Agent invoked swamp workflow run dirtyfrag-mitigate-prod (run 0e4adac1-…, 20:55:03Z). CLI process killed by the agent's parent at 20:56:47Z mid-mitigate-vms (90/110 VMs done). The workflow record was eventually marked failed — so this case actually was reconciled, at the moment the parent killed the CLI.
  2. Agent invoked it again (run 0e1ac86b-…, 20:57:13Z). CLI was killed again at 20:57:35Z, this time mid-mitigate-vms (20/110 VMs done). Run record stayed at status: running indefinitely. No reconciliation happened.

The asymmetry between the two runs (one marked failed, the other left running) suggests the existing kill-handling path runs in some cases but not others, and isn't watching the originating-process liveness as a primary signal.

Proposed solution

Either of these would solve our case; (b) is more useful but (a) is much cheaper to ship:

(a) Liveness-based reconciliation on read. Persist the originating CLI process PID + start time alongside the run record. When any read-side command (workflow history get/search, workflow run of the same workflow, report get) encounters a running record, sanity-check that PID/start-time combination on the local host. If the process is gone (or the start time doesn't match), transition the record to a terminal state — e.g. cancelled or a new abandoned — with a reason like originating-process-not-alive. Cheap, no new daemon, but only works on the original host.
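
To make (a) concrete, here is a minimal Python sketch of the check, not swamp's implementation. The RunRecord shape, the pid_start_ticks field, and the abandoned status are assumptions for illustration, and the start-time lookup reads /proc, so it is Linux-only. Comparing the start time as well as the PID guards against the PID having been reused by an unrelated process:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunRecord:
    # Hypothetical shape; swamp's real run record may differ.
    run_id: str
    status: str
    pid: Optional[int] = None
    pid_start_ticks: Optional[int] = None
    reason: Optional[str] = None


def proc_start_ticks(pid: int) -> Optional[int]:
    """Linux only: start time of pid in clock ticks since boot, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat", "r", errors="replace") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The command name (field 2) may contain spaces, so split after the last ')'.
    fields = stat[stat.rindex(")") + 2:].split()
    return int(fields[19])  # field 22 of /proc/<pid>/stat is starttime


def reconcile_on_read(
    record: RunRecord,
    lookup: Callable[[int], Optional[int]] = proc_start_ticks,
) -> RunRecord:
    """Mark a 'running' record abandoned if its originating process is gone."""
    # Terminal records, and records written before the PID was persisted, are left alone.
    if record.status != "running" or record.pid is None:
        return record
    # A missing process (None) or a different start time (PID reuse) both count as gone.
    if lookup(record.pid) != record.pid_start_ticks:
        record.status = "abandoned"
        record.reason = "originating-process-not-alive"
    return record
```

The lookup parameter exists so the check can be exercised without a real process; a read-side command would call reconcile_on_read(record) and persist the record if its status changed.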

(b) Daemonised execution / supervised resume. Have swamp workflow run spawn (or attach to) a small detached supervisor that owns the run lifecycle independent of the CLI. The CLI then becomes a thin client that streams logs from the supervisor and exits cleanly without affecting the run. Combined with an explicit swamp workflow cancel <run-id> for operator-driven cleanup. This is what you'd want long-term anyway for unattended scheduled runs.
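
The detachment half of (b) can be sketched in a few lines of Python. This is only an illustration under assumptions: spawn_detached is a hypothetical name, the supervisor command is whatever argv the caller passes, and start_new_session is POSIX-only. Putting the child in its own session means a terminal hangup or a signal sent to the CLI's process group does not reach it, though a parent that deliberately walks and kills its whole process tree still could:

```python
import subprocess
from typing import List


def spawn_detached(argv: List[str], log_path: str) -> subprocess.Popen:
    """Start argv in its own session, appending its output to log_path (POSIX only)."""
    # The parent's handle is closed when the 'with' block exits;
    # the child keeps its own inherited copy of the file descriptor.
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # child calls setsid() before exec
        )
```

A thin client would then follow log_path to stream output and could exit at any time without affecting the child.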

What we'd want either way

  • A way for an agent to call something like swamp workflow reconcile (or have it happen automatically as part of workflow history/workflow run) that detects orphaned local runs and moves their records to a terminal state without re-touching the world.
  • A clear distinction in the run status field between real failure (a step exited non-zero) and abandoned (the originating process disappeared), because the operator response differs: investigate a real failure, reconcile and re-run an abandoned run.
  • The reconciliation event should appear in the run log so post-hoc forensics still work.
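
The last two points could look roughly like the following Python sketch. Everything here is an assumption for illustration: the RunStatus values beyond those named above, the JSON-lines log format, and the field names are hypothetical, since swamp's actual log format is not described in this issue:

```python
import json
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class RunStatus(str, Enum):
    RUNNING = "running"
    FAILED = "failed"        # a step exited non-zero
    ABANDONED = "abandoned"  # hypothetical: the originating process disappeared


def reconciliation_event(run_id: str, previous: RunStatus, new: RunStatus,
                         reason: str, now: Optional[datetime] = None) -> str:
    """Build one JSON log line recording a status change made by reconciliation."""
    if now is None:
        now = datetime.now(timezone.utc)
    return json.dumps({
        "time": now.isoformat(),
        "event": "reconciled",
        "run_id": run_id,
        "from": previous.value,
        "to": new.value,
        "reason": reason,
    })


def append_event(log_path: str, line: str) -> None:
    """Append the event to the run log so the log ends with a terminal entry."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
```

Keeping failed and abandoned as separate values lets tooling branch on the status alone, and the appended line gives the otherwise truncated log an explicit final entry.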

Why this matters

This is going to bite anyone driving swamp from an LLM agent or a long-lived editor session — exactly the agentic-CI workflow swamp is well-positioned for. The mitigate workflow is also a good motivating case: idempotent, but the orphaned record makes it look unsafe to re-run when actually the right move is "reconcile + re-run".

02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED

Open

5/8/2026, 9:11:35 PM

No activity in this phase yet.
