- AI is now called on every poll (every 2 min by default) with fresh
CF events + recent application logs, behaving like a sysadmin watching
the deploy in real time
- Verdicts changed from WAIT/CANCEL to CONTINUE/CANCEL; AI unreachable
defaults to CONTINUE (safe) instead of WAIT
- hang_threshold_minutes is now a hint to the AI rather than a hard trigger
- auto_cancel=false posts an advisory commit comment but keeps monitoring
- Three-step structure (Monitor → Analyse → Cancel) collapsed into one loop
- Log group name resolved once at startup rather than every AI retry
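The single-loop structure described in the list above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: `get_verdict` and `CANNED_VERDICTS` are hypothetical stand-ins for collecting CloudFormation events and logs and calling the AI, and they replay fixed verdicts so the loop runs without AWS access.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for "collect events + logs, then ask the AI".
# It replays canned verdicts so the loop can run without AWS or an API key.
CANNED_VERDICTS=(CONTINUE CONTINUE CANCEL)

get_verdict() {
  local poll_index=$1
  # Past the end of the canned list, fall back to CONTINUE.
  echo "${CANNED_VERDICTS[$poll_index]:-CONTINUE}"
}

# One loop: every poll asks for a verdict and acts on it immediately.
# Prints the number of polls performed before CANCEL (or before max_polls ran out).
monitor_loop() {
  local max_polls=$1 poll_interval_seconds=$2
  local poll=0 verdict
  while (( poll < max_polls )); do
    verdict=$(get_verdict "$poll")
    poll=$(( poll + 1 ))
    if [[ "$verdict" == "CANCEL" ]]; then
      break
    fi
    sleep "$poll_interval_seconds"
  done
  echo "$poll"
}

monitor_loop 5 0
```

With the canned verdicts above, the third poll returns CANCEL, so the loop stops there instead of waiting for a separate analysis or cancel step.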
- description: "Minutes with no new CloudFormation events before asking AI (keep under 40 with defaults: up to 3x this value is spent in AI retries within the 120-minute job timeout)"
+ description: "Minutes with no new CloudFormation events considered suspicious; passed to the AI as context"
  type: number
  required: false
  default: 10
  poll_interval_seconds:
- description: "How often to poll CloudFormation events (seconds)"
+ description: "How often to poll and run AI analysis (seconds)"
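Since the threshold is now only context for the AI rather than a trigger, one way to picture its use is a helper that turns the idle time into a sentence for the prompt. The sketch below is an assumption about how that could look: `idle_hint` and its arguments are hypothetical names, not taken from the workflow.

```shell
#!/usr/bin/env bash
# Build an idle-time sentence that could be included in the AI prompt.
# Arguments: epoch seconds of the last event, epoch seconds now, threshold in minutes.
# The threshold only changes the wording; it never triggers a cancel by itself.
idle_hint() {
  local last_event_epoch=$1 now_epoch=$2 threshold_minutes=$3
  local idle_minutes=$(( (now_epoch - last_event_epoch) / 60 ))
  if (( idle_minutes >= threshold_minutes )); then
    echo "No new CloudFormation events for ${idle_minutes} minutes (suspicious: threshold is ${threshold_minutes})."
  else
    echo "Last CloudFormation event was ${idle_minutes} minutes ago."
  fi
}

idle_hint 1000 1900 10
```

The final decision stays with the AI verdict; the helper only annotates the context it sees.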
- '{model:"gpt-4o",messages:[{role:"system",content:"You are a senior DevOps engineer expert in AWS CloudFormation and ECS deployments. Be concise and decisive."},{role:"user",content:("Is this CloudFormation deployment stuck (will not progress without intervention) or still making progress?\n\n" + $prompt + "\n\nReply with exactly CANCEL or WAIT on the first line, followed by a 2-sentence explanation.")}],max_tokens:300}')
+ '{model:"gpt-4o",messages:[
+ {role:"system",content:"You are a senior DevOps engineer monitoring an AWS CloudFormation/ECS deployment in real time. You see the stack status, recent CloudFormation events, and recent application logs on every check. Be concise and decisive."},
+ {role:"user",content:("Is this deployment progressing normally, or is it stuck/failing and needs to be cancelled?\n\n" + $prompt + "\n\nReply with CONTINUE or CANCEL on the first line, followed by one sentence explaining why.")}
AI_TEXT=$(echo "$RESPONSE" | jq -r '.choices[0].message.content // "WAIT Could not reach AI model."' 2>/dev/null || echo "WAIT Could not parse AI response.")
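According to the commit description, an unreachable or unparseable AI response is now treated as CONTINUE rather than WAIT. A minimal sketch of verdict parsing with that safe default might look like this; `parse_verdict` is a hypothetical helper, and the workflow's real parsing is not shown in this diff.

```shell
#!/usr/bin/env bash
# Extract the verdict from the first word of the first line of the AI reply.
# Anything that is not clearly CANCEL (empty reply, error text, other words)
# is treated as CONTINUE, so a broken AI call never cancels a deploy.
parse_verdict() {
  local ai_text=$1 first_line first_word
  first_line=${ai_text%%$'\n'*}
  read -r first_word _ <<< "$first_line"
  case "$first_word" in
    CANCEL*) echo "CANCEL" ;;
    *)       echo "CONTINUE" ;;
  esac
}

parse_verdict $'CANCEL\nThe ECS service has failed health checks for 30 minutes.'
```

Defaulting to CONTINUE means a failed API call costs one more poll interval at worst, whereas defaulting to CANCEL would roll back a healthy deploy.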
COMMENT_BODY+="---"$'\n'"*Posted by [reusable-cdk-deploy-monitor](https://github.com/geolonia/.github/blob/main/.github/workflows/reusable-cdk-deploy-monitor.yml)*"
+
+ gh api "repos/$REPO/commits/$SHA/comments" \
+ -f body="$COMMENT_BODY" 2>/dev/null || echo "Warning: failed to post commit comment"
+
+ if [[ "$AUTO_CANCEL" == "true" ]]; then
+ echo "Cancelling stack update: $STACK_NAME"
+ if aws cloudformation cancel-update-stack \
+ --stack-name "$STACK_NAME" \
+ --region "$AWS_REGION" 2>/tmp/cancel_error; then
+ echo "Stack update cancelled. CloudFormation is rolling back."
+ gh api "repos/$REPO/commits/$SHA/comments" \
+ -f body="### CDK Deploy Monitor -- Update Cancelled"$'\n\n'"Stack \`$STACK_NAME\` update was cancelled (AI verdict: CANCEL, auto_cancel: true)."$'\n'"CloudFormation is rolling back. The deploy job will fail with a rollback error -- this is expected." \
+ -f body="### CDK Deploy Monitor -- Cancel Attempt Failed"$'\n\n'"Tried to cancel \`$STACK_NAME\` but got: $CANCEL_ERR"$'\n\n'"The stack may have already completed or been cancelled manually." \
+ 2>/dev/null || true
+ fi
+ exit 0
+ else
+ echo "auto_cancel=false -- advisory comment posted, continuing to monitor."
+ fi
  fi

- AI_RETRIES=$(( AI_RETRIES + 1 ))
- if [[ "$AI_RETRIES" -lt "$MAX_AI_RETRIES" ]]; then
- echo "AI said WAIT -- sleeping ${HANG_THRESHOLD_MINUTES}m before retry..."
- sleep $(( HANG_THRESHOLD_MINUTES * 60 ))
- else
- echo "Max AI retries reached -- treating as CANCEL."
- echo "CANCEL" > /tmp/ai_verdict
- fi
+ sleep "$POLL_INTERVAL_SECONDS"
  done
-
- - name: Cancel stack update (if verdict is CANCEL and auto_cancel is true)
- if: always() && !cancelled()
- env:
- STACK_NAME: ${{ inputs.stack_name }}
- AWS_REGION: ${{ inputs.aws_region }}
- AUTO_CANCEL: ${{ inputs.auto_cancel }}
- GH_TOKEN: ${{ github.token }}
- REPO: ${{ github.repository }}
- SHA: ${{ github.sha }}
- run: |
- set -euo pipefail
-
- if [[ ! -f /tmp/ai_verdict ]]; then
- echo "No AI verdict -- nothing to cancel."
- exit 0
- fi
-
- VERDICT=$(cat /tmp/ai_verdict)
-
- if [[ "$VERDICT" != "CANCEL" ]]; then
- echo "AI verdict is $VERDICT -- skipping cancellation."
- exit 0
- fi
-
- if [[ "$AUTO_CANCEL" != "true" ]]; then
- echo "auto_cancel=false -- advisory comment posted, not cancelling stack."
- exit 0
- fi
-
- echo "Cancelling stack update: $STACK_NAME"
- if aws cloudformation cancel-update-stack \
- --stack-name "$STACK_NAME" \
- --region "$AWS_REGION" 2>/tmp/cancel_error; then
- echo "Stack update cancelled. CloudFormation is rolling back."
- CANCEL_COMMENT="### CDK Deploy Monitor -- Update Cancelled"$'\n\n'"Stack \`$STACK_NAME\` update was cancelled (AI verdict: CANCEL, auto_cancel: true)."$'\n'"CloudFormation is rolling back. The deploy job will fail with a rollback error -- this is expected."
- gh api \
- "repos/$REPO/commits/$SHA/comments" \
- -f body="$CANCEL_COMMENT" \
- 2>/dev/null || echo "Warning: failed to post cancel comment"
- CANCEL_COMMENT="### CDK Deploy Monitor -- Cancel Attempt Failed"$'\n\n'"Tried to cancel \`$STACK_NAME\` but got an error:"$'\n'"$CANCEL_ERR"$'\n\n'"The stack may have already completed or been cancelled manually."
- gh api \
- "repos/$REPO/commits/$SHA/comments" \
- -f body="$CANCEL_COMMENT" \
- 2>/dev/null || echo "Warning: failed to post error comment"
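The diff folds the separate cancel step into the monitoring loop, so each poll now ends in one of three outcomes. The sketch below summarises that decision as a pure function; `decide_action` and its return strings are hypothetical names used for illustration and do not appear in the workflow.

```shell
#!/usr/bin/env bash
# Map an AI verdict plus the auto_cancel input to what the loop does next.
#   continue            -> sleep for the poll interval and check again
#   cancel-and-exit     -> run cancel-update-stack, post a comment, stop monitoring
#   advise-and-continue -> post an advisory comment, then keep monitoring
decide_action() {
  local verdict=$1 auto_cancel=$2
  if [[ "$verdict" != "CANCEL" ]]; then
    echo "continue"
  elif [[ "$auto_cancel" == "true" ]]; then
    echo "cancel-and-exit"
  else
    echo "advise-and-continue"
  fi
}

decide_action CANCEL false
```

Keeping the decision in one place makes the `auto_cancel=false` case explicit: the verdict is reported, but the loop keeps polling rather than exiting.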