01 Issue
Feature · Shipped · Extensions
Assignees: stack72

#226 Detect AWS CredentialsProviderError in summarizeSyncError and prepend SSO-expiration hint

Opened by stack72 · 5/4/2026 · Shipped 5/5/2026

Problem statement

When AWS SSO session credentials expire (Token is expired) and a model run reaches the lock-acquire path on an S3-backed datastore, the resulting failure surfaces as a raw CredentialsProviderError buried in the middle of an S3 SDK stack trace:

FTL error S3OperationError [CredentialsProviderError]: S3 putObjectConditional failed CredentialsProviderError — Token is expired. To refresh this SSO session run 'aws sso login' with the corresponding profile.
  at S3Client2.wrapError (.swamp/datastore-bundles/.../s3.js:55300:12)
  at S3Client2.run (.swamp/datastore-bundles/.../s3.js:55273:18)
  at async S3Client2.putObjectConditional (.../s3.js:55392:7)
  at async S3Lock.acquire (.../s3.js:55504:23)
  at async registerDatastoreSyncNamed (src/infrastructure/persistence/datastore_sync_coordinator.ts:269:7)
  at async acquireModelLocks (src/cli/repo_context.ts:810:5)
  ...
  [cause]: _CredentialsProviderError: Token is expired. To refresh this SSO session run 'aws sso login' with the corresponding profile.

The SDK message itself is already correct and even includes the right remediation hint (run 'aws sso login'). The problem is discoverability: the hint sits inside the SDK's wrapped error, embedded in a long stack, and the first thing the user reads is S3 putObjectConditional failed, which looks like a backend or networking failure rather than an expired auth session. In a recent Lab triage (swamp-club#224 → swamp-club#218 / swamp-club#219), the reporter initially attributed the failure to lock contention because the putObjectConditional framing primed them to look at the lock layer.

Proposed solution

Extend summarizeSyncError (and/or the lock-acquire error path it delegates to) to recognise CredentialsProviderError — and ideally other auth/permission classes from the AWS SDK (ExpiredTokenException, AccessDenied on the lock object, InvalidAccessKeyId) — and prepend a swamp-flavoured summary line above the SDK message. Something like:

ERR Datastore session expired: your AWS profile's SSO session is no longer valid.
    Run 'aws sso login --profile <profile>' to refresh, then retry.
    (S3OperationError [CredentialsProviderError]: Token is expired ...)

The swamp-flavoured line should:

  • name the cause in swamp's vocabulary ("datastore session expired") rather than S3's ("putObjectConditional failed"),
  • pull the AWS profile name from the active datastore config when available so the aws sso login hint is concrete,
  • be the first thing the user sees, with the SDK detail kept underneath for completeness.
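A minimal sketch of the three requirements above, assuming the SDK error is reachable through the standard `cause` chain and can be identified by its `name` property. The helper names (`findSessionError`, `summarizeAuthError`) and the `profile` parameter are hypothetical and do not reflect swamp's actual internals; the log-level prefix (`ERR`) is assumed to be added by the logger. Only the two session-expiry names are matched here, since `AccessDenied` and `InvalidAccessKeyId` would likely need their own wording.

```typescript
// Hypothetical sketch: function names and the profile parameter are
// illustrative assumptions, not swamp's real API.

// Error names treated as "session expired" for this headline (assumption:
// matching on `name` alone is sufficient).
const SESSION_ERROR_NAMES = new Set([
  "CredentialsProviderError",
  "ExpiredTokenException",
]);

// Walk the error and its `cause` chain; return the first session-class error.
// The `seen` set guards against a cyclic cause chain.
function findSessionError(err: unknown): Error | undefined {
  const seen = new Set<unknown>();
  let current: unknown = err;
  while (current instanceof Error && !seen.has(current)) {
    seen.add(current);
    if (SESSION_ERROR_NAMES.has(current.name)) return current;
    current = (current as { cause?: unknown }).cause;
  }
  return undefined;
}

// Build the swamp-flavoured summary, or return undefined when the error is
// not session-related so the caller can fall back to its existing output.
function summarizeAuthError(err: unknown, profile?: string): string | undefined {
  if (findSessionError(err) === undefined) return undefined;
  // Fall back to a placeholder when the datastore config has no profile name.
  const target = profile ?? "<profile>";
  const detail = err instanceof Error ? `${err.name}: ${err.message}` : String(err);
  return [
    "Datastore session expired: your AWS profile's SSO session is no longer valid.",
    `    Run 'aws sso login --profile ${target}' to refresh, then retry.`,
    `    (${detail})`, // SDK detail kept underneath for completeness
  ].join("\n");
}
```

Returning `undefined` for unrelated errors keeps the existing summary path untouched; only auth-class failures get the prepended lines.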

The same treatment fits naturally next to the existing summary path used by "enrich sync error surfacing with SDK metadata" (swamp-club#135 / PR #1200). That PR already enriches sync errors with SDK metadata; this is the auth-class extension of the same idea, applied specifically to S3Lock.acquire failures and the putObjectConditional path it uses.

Alternatives

  1. Leave as-is, rely on the SDK hint. The aws sso login instruction is already in the message — users who read carefully will find it. Rejected because the field experience says they don't: the surrounding S3OperationError framing misdirects to the lock layer.
  2. Handle this in the top-level CLI error formatter rather than summarizeSyncError. Possible, but summarizeSyncError is the natural place — it already classifies sync errors by category and is reached by both the pull path and the lock-acquire path. Centralising the auth detection here keeps the lock layer free of cross-cutting message-formatting logic.
  3. Add a startup-time aws sts get-caller-identity probe and refuse to run if it fails. Heavier, would slow every invocation, and doesn't help the mid-execution expiration case where creds were valid at startup and aged out during a long-running model. Better as a swamp doctor add-on if anything.

Context / origin

Surfaced by Lab triage on swamp-club#224. Reporter: blake.irvin. Hit on swamp 20260504.140403.0-sha.d4c9188f against a shared S3 datastore (HiveMQ collective). The lock-loop and Deno-panic angles in #224 are tracked elsewhere; this issue is purely the error-message/discoverability improvement that fell out of the same triage.

02 Bog Flow

Shipped

5/5/2026, 12:04:35 AM


03 Sludge Pulse
stack72 assigned stack72 · 5/4/2026, 5:23:40 PM