|
When a job is cancelled, there is a chance that the running software holds resources that need to be released. To handle this robustly, GitHub Actions should send signals to the running process so that the application can run its tear-down tasks properly. A gentle terminator would first send SIGINT, wait a few seconds, and if the app still doesn't exit, send SIGTERM, and eventually SIGKILL to force-terminate it. GitHub probably already manages this in some way, but at least I didn't find any documentation on the subject. It would be good to document how it behaves at the moment, so it's easier to propose changes if needed. Here is a nice document about the same issue for Jenkins: https://gist.github.com/datagrok/dfe9604cb907523f4a2f
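For illustration, a minimal shell sketch of the escalation described above (my own sketch; `gentle_kill` is a hypothetical helper, not anything GitHub ships):

```bash
# Escalate INT -> TERM -> KILL, giving the process ~10 s per signal to exit.
gentle_kill() {
  local pid=$1
  local sig
  for sig in INT TERM KILL; do
    kill -s "$sig" "$pid" 2>/dev/null || return 0  # already gone
    for _ in $(seq 1 10); do
      kill -0 "$pid" 2>/dev/null || return 0       # exited; we're done
      sleep 1
    done
  done
}
```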
|
Thanks for your feedback.
|
According to the introduction from the engineering team, after the user clicks "Cancel workflow":

- The server will re-evaluate the job-level `if` condition on all running jobs.
- If the job condition is `always()`, it will not get canceled.
- For the rest of the jobs that need cancellation, the server will send a cancellation message to all the runners.
- Each runner has 5 minutes to finish the cancellation process before the server force-terminates the job.
- The runner will re-evaluate the `if` condition on the currently running step.
- If the step condition is `always()`, it will not get canceled.
- Otherwise, the runner will send Ctrl-C to the action entry process (node for a javascript action, docker for c…

Hope this can help you understand better.
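To illustrate the `always()` point (a minimal sketch of my own; the step names and scripts are hypothetical):

```yaml
steps:
  - name: Do work
    run: ./long-task.sh          # hypothetical long-running script
  - name: Cleanup
    if: always()                 # re-evaluated on cancellation, so it still runs
    run: ./release-resources.sh  # hypothetical tear-down script
```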
|
Hi @BrightRan, I have a terraform/terragrunt action that, when it is cancelled, leaves the state file locked in AWS DynamoDB. Then I need to force-unlock it, and most of the resources are not in the state file, so my state is corrupted. And when I cancel the job, all I see in the Actions console is: … So it does not look like the job allowed terraform to do all the necessary steps to save state, release the lock, etc.
|
Hi, this is happening for us too; terraform does not get a chance to release the state locks. If it is working as documented, then it would be good to be able to extend the 7500 ms shutdown time.
|
I can also confirm that using … I've tested locally sending …
|
If you're using terraform, I'd suggest using Atlantis (Terraform Pull Request Automation). Yes, it means you're running a small VM somewhere, but it also means you don't have to worry about it being killed.
|
I confirmed it: the GitHub runner uses SIGKILL for cancel-in-progress.
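One way to check this yourself (a sketch under my own assumptions, not the poster's actual test): run a step that traps and logs signals, then cancel the job while it sleeps. If only SIGKILL is delivered, no trap output ever appears in the log.

```yaml
- name: Signal probe (cancel the job while this step runs)
  shell: bash
  run: |
    trap 'echo "got SIGINT"'  INT
    trap 'echo "got SIGTERM"' TERM
    sleep 300 &                  # background child so bash can run traps promptly
    pid=$!
    while kill -0 "$pid" 2>/dev/null; do
      wait "$pid" || true        # returns early each time a trapped signal arrives
    done
```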
|
Can we make job termination operate more gently?
|
This seems like a common issue for everybody running tasks like terraform that need to exit gracefully. You can and should disable … Is there any solution for this?
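The thing to disable is elided here, but later replies suggest it is the terraform wrapper. If so, with hashicorp/setup-terraform that would look like this (my sketch, using the action's `terraform_wrapper` input):

```yaml
- uses: hashicorp/setup-terraform@v2
  with:
    terraform_wrapper: false   # don't interpose the wrapper between the shell and terraform
```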
|
I wrote a demo to illustrate and confirm the behaviour described above, which really should be in the docs. At least on a Linux runner, "CTRL-C" means SIGINT and "CTRL-Break" (also CTRL-) means SIGTERM.
In this simple case the result observed was consistent with the docs above: SIGINT, about 7.5 s, then SIGTERM, about 2.5 s, and presumably SIGKILL plus cleanup. But note that I've … If I don't … It looks like the GitHub Actions runner probably waits for the session leader process to exit, then hard-kills anything under it when it exits. It doesn't appear to deliver signals to the process tree by signalling the process group; AFAICS it only signals the leader process. So the leader must install a signal handler that explicitly propagates signals to child processes and then waits for them to exit.
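A minimal sketch of that leader-side propagation (my own illustration, assuming a Linux runner and `bash`): trap the signals the runner delivers, forward them to the child, and wait for it to exit.

```yaml
- name: Long-running step that forwards signals to its child
  shell: bash
  run: |
    forward() {                       # forward the signal, then reap the child
      kill -s "$1" "$child" 2>/dev/null || true
      wait "$child"
    }
    trap 'forward INT'  INT
    trap 'forward TERM' TERM
    ./long-task.sh &                  # hypothetical workload
    child=$!
    wait "$child"
```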
|
I wrote this up better in a demo at https://github.com/ringerc/github-actions-signal-handling-demo since I wasn't satisfied with the answers from @BrightRan above, nor with my earlier quick tests. It's a right mess. It looks like you really need to rely on …
|
Has something changed in the way GitHub processes job termination?
|
Why is this marked as answered when it isn't really answered? How to avoid, e.g., …?
|
Could we maybe adjust the wrapper so that it invokes terraform via …?
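The suggestion is truncated, but one plausible reading (my assumption, not the poster's confirmed intent) is to have the wrapper replace itself with terraform via `exec`, so the runner's signals reach terraform directly instead of stopping at an intermediate parent:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: exec replaces this process with terraform,
# so SIGINT/SIGTERM delivered to the wrapper's PID land on terraform itself.
exec terraform "$@"
```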
|
This is the workaround I'm using now ...
|
Here is my workaround. Like @breathe, I don't use the terraform wrapper. Aside from that, I instead use tini to make sure all signals get propagated: … Which results in something like the following when the job gets cancelled: …
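The original snippets weren't preserved, so here is a sketch of what a tini-based setup can look like (my assumptions: an Ubuntu runner, tini installed from apt at /usr/bin/tini, and a custom `shell:` wrapper):

```yaml
- name: Install tini
  run: sudo apt-get update && sudo apt-get install -y tini

- name: Terraform apply (signals forwarded by tini)
  shell: /usr/bin/tini -g -- bash -e {0}   # -g forwards signals to the whole process group
  run: terraform apply -no-color -auto-approve
```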
|
I built a simple tool which speeds up the signal propagation, based on the material above: … It can just be put to …

```yaml
jobs:
  my-job:
    runs-on: ubuntu-latest
    steps:
      - name: Long-running step
        shell: signal-fanout {0}
        run: |
          for i in $(seq 1 30); do echo "$(date): $i"; sleep 1; done
```
|
FYI: We are using the following snippet as a workaround.

```yaml
- name: Terraform apply
  id: apply
  run: terraform apply -no-color -auto-approve

- name: Release lock if exists
  if: ${{ steps.apply.outcome == 'cancelled' && always() }}
  run: |
    lock_id=$(terraform plan -no-color -refresh=false 2>&1 | grep ' ID: ' | cut -d: -f2 | tr -d ' ' || true)
    if [[ -n "${lock_id}" ]]; then
      terraform force-unlock -force ${lock_id}
    fi
```
|
Can we raise this as a bug? I.e., the task is sent SIGKILL rather than SIGINT. Re-created:
> @jupe,
> According to the introduction from the engineering team, after the user clicks "Cancel workflow":
>
> - The server will re-evaluate the job-level `if` condition on all running jobs.
> - If the job condition is `always()`, it will not get canceled.
> - For the rest of the jobs that need cancellation, the server will send a cancellation message to all the runners.
> - Each runner has 5 minutes to finish the cancellation process before the server force-terminates the job.
> - The runner will re-evaluate the `if` condition on the currently running step.
> - If the step condition is `always()`, it will not get canceled.
> - Otherwise, the runner will send Ctrl-C to the action entry process (node for a javascript action, docker for c…