Skip to content

safe_sleep.sh rarely hangs indefinitely #3792

@AlekseiNikiforovIBM

Description

@AlekseiNikiforovIBM

Describe the bug
Very rarely on update of github actions runner safe_sleep.sh hangs forever:

$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
...
actions+   72298 96.3  0.0   4372  3328 ?        R    Apr02 1988:54 /bin/bash /home/actions-runner/safe_sleep.sh 1
...

I suspect it may happen sometimes if machine runs in cloud and is overloaded and/or overcommitted.

To Reproduce
Steps to reproduce the behavior:

  1. Download runner of version prior to version, for example, 2.322.
  2. Register and run runner.
  3. Runner updates itself. It can also take a task to complete in meantime.

Expected behavior
Update should not hang infinitely.

Runner Version and Platform

Runner 2.322
OS: Linux

Activity

mlugg

mlugg commented on Aug 13, 2025

@mlugg

script in question

The bug in this "safe sleep" script is obvious from looking at it: if the process is not scheduled for the one-second interval in which the loop would return (due to $SECONDS having the correct value), then it simply spins forever. That can easily happen on a CI machine under extreme load. When this happens, it's pretty bad: it completely breaks a runner until manual intervention. On Zig's CI runner machines, we observed multiple of these processes which had been running for hundreds of hours, silently taking down two runner services for weeks.

I don't understand how we got here. Even ignoring the pretty clear bug, what makes this Bash script "safer" than calling into the POSIX standard sleep utility? It doesn't seem to solve any problem; meanwhile, it's less portable and needlessly eats CPU time by busy-waiting.


The sloppy coding which is evident here, as well as the inaction on core Actions bugs (in line with the decay in quality of almost every part of GitHub's product), is forcing the Zig project to strongly consider moving away from GitHub Actions entirely. With this bug, and many others (severe workflow scheduling issues resulting in dozens of timeouts; logs randomly becoming inaccessible; random job cancellations without details; perpetually "pending" jobs), we can no longer trust that Actions can be used to implement reliable CI infrastructure. I personally would seriously encourage other projects, particularly any using self-hosted runners, to look carefully at the stability of Actions and ask themselves whether it is a solution worth sticking with long-term when compared with alternatives.

markozajc

markozajc commented on Aug 15, 2025

@markozajc

I don't understand how we got here.

This bugs me too, so I looked into it. It seems that a safe_sleep function was originally declared and used in run-helper.sh.template:

safe_sleep() {
if [ ! -x "$(command -v sleep)" ]; then
if [ ! -x "$(command -v ping)" ]; then
COUNT="0"
while [[ $COUNT != 5000 ]]; do
echo "SLEEP" > /dev/null
COUNT=$[$COUNT+1]
done
else
ping -c 5 127.0.0.1 > /dev/null
fi
else
sleep 5
fi
}
I assume it's named "safe" sleep because it doesn't rely on the presence of the sleep utility, however

  • /usr/bin/sleep is provided by the coreutils on Debian and Ubuntu, which is Priority: required and Essential: yes, so the binary will always be present on a correctly installed system,
  • the final fallback incorrectly assumes that writing "SLEEP" to /dev/null 5000 times somehow take 5 seconds, and
  • sleep is the only coreutils binary that has a fallback mechanism like this. Why not also do this for id, readlink and dirname?

Apparently this function was such a hit that #1707 opted to use it across the entire project. For some reason, this included rewriting it in a way that, like you mentioned, burns CPU cycles for no good reason and causes random hangs:

SECONDS=0
while [[ $SECONDS != $1 ]]; do
:
done

Why was this PR merged without so much as a comment? Why does it execute the helper script instead of exporting and importing it as a function? Why doesn't it check for the presence of sleep like the old one one did? At least the negative clause is effectively dead code in that case, so it wouldn't cause bugs.

mlugg

mlugg commented on Aug 15, 2025

@mlugg

Adding to the confusion, it turns out that someone submitted a perfectly good fix for the actual bug here -- while keeping the inexplicable busy-loop, so any hypothetical advantage of safe_sleep.sh would be retained -- over a year ago! That was #3157. Unfortunately, it sat completely untouched for over a year, before it was helpfully closed by a bot due to inactivity... despite being a perfectly good PR with a pending unfulfilled review request. This situation is baffling; it appears at least to me like this mission-critical repository is borderline unmaintained.

notcancername

notcancername commented on Aug 16, 2025

@notcancername

I assume it's named "safe" sleep because it doesn't rely on the presence of the sleep utility

If this is the reason, it's a pretty funny one considering Node is relied on quite heavily in GitHub Actions-land.

andrewrk

andrewrk commented on Aug 16, 2025

@andrewrk

I don't understand how we got here.

it's vibes all the way down

Summertime

Summertime commented on Aug 16, 2025

@Summertime

Runner relies on readlink, rm, dirname, cat, uname, seq (not posix!), bash (not posix!)...

Also ldd, sudo, grep, ldconfig, docker...

I think an assumption can be made that sleep (posix) will probably exist

Summertime

Summertime commented on Aug 16, 2025

@Summertime

https://lobste.rs/s/l4rowa/github_actions_safe_sleep_sh#c_ppqpea Additional exploration into the history of this code is here.

ashishb

ashishb commented on Aug 16, 2025

@ashishb

GitHub Actions do have these gotchas, and defaults are just not optimal.

  • No timeouts - so the job can run forever
  • No dependency caching - use cache explicitly
  • No cancellation of stale executions - e.g., if you push N times to your branch, all the prior N-1 executions will be running
  • Bad security permissions

There are fixes, except they are not very obvious to the first-time user.

@mlugg Happy to send some fixes that might help Zig (I just started using it, and it definitely is an interesting language).

mlugg

mlugg commented on Aug 16, 2025

@mlugg

We just discovered another 4 deadlocked safe_sleep.sh processes on one of Zig's CI runner machines, collectively saturating 4 CPU cores on infinite loops and reducing CI capacity by 30%. Two of them had been running for around a day, the other two for almost 2 weeks. This had certainly been a contributor to our ongoing CI issues. Why are Microsoft refusing to give this bug any attention? We should not need to babysit our CI runners to prevent them from becoming glorified space heaters.

mlugg

mlugg commented on Aug 16, 2025

@mlugg

@ashishb thanks for the suggestions -- we've decided to just migrate away from Actions ASAP since it's become clear that GitHub is a rapidly sinking ship.

mlugg

mlugg commented on Aug 16, 2025

@mlugg

As @kristoff-it rightly pointed out while streaming on Twitch yesterday, there's an even better question than "why is this buggy and inefficient script used instead of the universally-available sleep", that being: "why does the runner need to sleep in the first place"? Well, let's look at where safe_sleep.sh is used...

Waiting for Runner Termination

update.sh.template uses safe_sleep.sh to poll the runner's PID in /proc to see if it has terminated:

# wait for runner process to exit.
date "+[%F %T-%4N] Waiting for $runnerprocessname ($runnerpid) to complete" >> "$logfile" 2>&1
while [ -e /proc/$runnerpid ]
do
date "+[%F %T-%4N] Process $runnerpid still running" >> "$logfile" 2>&1
"$rootfolder"/safe_sleep.sh 2
done
date "+[%F %T-%4N] Process $runnerpid finished running" >> "$logfile" 2>&1
# start re-organize folders
date "+[%F %T-%4N] Sleep 1 more second to make sure process exited" >> "$logfile" 2>&1
"$rootfolder"/safe_sleep.sh 1

Firstly, this is another very obvious race condition -- nothing stops the runner from terminating, and another process from picking up that PID, within the 2-second interval, such that update.sh.template thinks the Actions runner is still running and continues to wait. Since Actions runners are long-lived (even if they handle just one job, a job could go on for hours!), PID wraparound could have easily happened, so this isn't an esoteric concern: on any system doing more than hosting a single Actions runner, there is a non-negligible chance that something happens to spawn a new process in the required interval and hence takes down a runner indefinitely (until one unlucky process happens to terminate). This is, again, extremely sloppy coding -- the process should just be monitored using an appropriate primitive. Perhaps the program that spawned the runner uses wait; perhaps the runner which spawns this script manually communicates its termination through a pipe; or perhaps the runner just doesn't spawn the script until it can make progress.

Secondly, because safe_sleep.sh spins a CPU needlessly, this causes a full CPU core to get locked up spinning, while a CI run is potentially in progress, waiting for that run to finish. I haven't been able to understand the maze of C# logic which ultimately calls this code, but we've observed a half dozen non-deadlocked safe_sleep.sh processes on a CI machine in normal operation, so I'm fairly confident in saying that the script is spawned while CI runs are in-progress. So on the machine I mentioned in a previous comment, not only were there 4 deadlocked safe_sleep.sh processes, there were also 6 which were being continuously respawned each time they terminated; re-rolling the dice for a deadlock, and causing a full 10 cores (1000% CPU usage) to be saturated on... waiting for timestamps.

Waiting for Docker Spawn

run-helper.sh.template uses safe_sleep.sh to poll docker ps to check whether Docker has spawned:

# Wait for docker to start
if [ ! -z "$RUNNER_WAIT_FOR_DOCKER_IN_SECONDS" ]; then
if [ "$RUNNER_WAIT_FOR_DOCKER_IN_SECONDS" -gt 0 ]; then
echo "Waiting for docker to be ready."
for i in $(seq "$RUNNER_WAIT_FOR_DOCKER_IN_SECONDS"); do
if docker ps > /dev/null 2>&1; then
echo "Docker is ready."
break
fi
"$DIR"/safe_sleep.sh 1
done
fi
fi

This seems like it's just the wrong approach again; synchronization should be ensured rather than just polling a few times. However, I won't ponder this for too long, because it seems like this is dead code: RUNNER_WAIT_FOR_DOCKER_IN_SECONDS does not appear anywhere else in the repository so presumably will never be set.

Delay between Runner Restarts

run-helper.sh.template uses safe_sleep.sh to implement a 5 second delay before restarting a failed runner:

elif [[ $returnCode == 2 ]]; then
echo "Runner listener exit with retryable error, re-launch runner in 5 seconds."
"$DIR"/safe_sleep.sh 5
exit 2
elif [[ $returnCode == 3 ]]; then

I find this quite amusing: a big reason to use a "retry delay" like this is to avoid wasting system resources on repeatedly retrying something that keeps failing, but thanks to the implementation of safe_sleep.sh, we're replacing "locking up a full CPU core on retrying something that failed" with "locking up a full CPU core on doing literally nothing".

Waiting for Update Completion

run-helper.sh.template uses safe_sleep.sh to wait for up to 30 seconds while update.sh.template (that's the one we saw earlier!) runs an update:

elif [[ $returnCode == 3 ]]; then
# Wait for 30 seconds or for flag file to exists for the runner update process finish
echo "Runner listener exit because of updating, re-launch runner after successful update"
for i in {0..30}; do
if test -f "$updateFile"; then
echo "Update finished successfully."
rm "$updateFile"
break
fi
"$DIR"/safe_sleep.sh 1
done
exit 2
elif [[ $returnCode == 4 ]]; then

(Weirdly, this case is duplicated almost exactly directly below; apparently exit codes 3 and 4 both mean "exiting for update", but I guess the author of this code was a little too scared to write an OR || in their condition, preferring to duplicate the block.)

Again, using polling here is a clearly bad approach. Incidentally, this case lines up with the first one I looked at to mean that as soon as an update is available, assuming none of the race conditions are hit, a full CPU core is locked up spinning on nothing right up until the update completes.

There at least isn't a deadlock bug here in normal operation, and even if update.finished were never created, the loop is capped at 30 seconds. Perhaps that's a lazy workaround for some of the bugs we've covered here, I don't know.


Summary

Here's where we're at. The safe_sleep.sh script:

  • ...attempts to solve a problem which obviously does not exist
  • ...solves that hypothetical problem in a significantly worse way than the safe_sleep function it replaced
  • ...locks up a full CPU core while running to spin on nothing
  • ...includes an obvious deadlock bug, which when hit locks up a full CPU core and disables the Actions runner until manual intervention
  • ...is called almost exclusively to wait for other tasks to complete
    • ...taking up system resources (one CPU core) and therefore likely slowing down the task being waited for
    • ...introducing another potential deadlock by incorrectly polling a PID
    • ...slowing down the whole system, including active Actions jobs, in fairly common situations

This report of the common deadlock bug, which can turn self-hosted runners into expensive heaters, has been ignored for months. A valid fix for that deadlock was PR'd over a year ago, went completely ignored by maintainers for over a year, before a bot closed it as stale because the maintainers of this repository chose to ignore it. We are having to babysit our Actions runners to occasionally kill these deadlocked spinloops. Users who aren't aware of this bug (virtually all of them!) and who use self-hosted runners probably have cores spinning away in an infinite loop as we speak for no coherent reason, wasting money, wasting electricity, and bottlenecking their workflows.

This situation is insanity. Can anyone at Microsoft look away from the electricity-to-slop converter that is Copilot for the 5 minutes necessary to write some vaguely competent code?

mlugg

mlugg commented on Aug 26, 2025

@mlugg

Ah, it looks like the maintainers have quietly re-opened and merged #3157, so the common deadlock is at least fixed. That was presumably thanks to the non-trivial amount of attention this issue has garnered.

Amusingly, that PR was merged without closing (or even responding to) this issue -- I can only assume out of embarrassment. Nonetheless, this issue is resolved now; OP can probably close it.

What isn't solved is the other potential deadlock I outline above (related to PID reuse), and (perhaps more importantly) the fact that this script hammers the CPU with a busy-loop every time it is run. It actually turns out that the latter issue is already tracked by #2380 -- open for over 2 years, with a fix #3143 (restoring the original behavior where sleep and ping are preferred) open and ignored for over 1 year. Hm, I'm getting déjà vu here...

mawkler

mawkler commented on Aug 29, 2025

@mawkler
TjlHope

TjlHope commented on Sep 14, 2025

@TjlHope

I wouldn't normally comment, but a) this is hilarious (as someone it hasn't bitten - I thought some of my old shell scripts were bad!) and b) I've had to try and restore a remote server before that was suffering process starvation, so have a bit of an interested about various core-utils as non-forking functions.

So in case anyone is interested sleep is actually fairly easy in both bash and busybox sh (though unfortunately not dash) with

sleep_() { read -t"$1" -u1 _dummy || :; }

(You tell read to timeout after $1 seconds from trying to read its own stdout, which is always silent, then return true.)

It took me ~5 minutes to come up with that, and it only took that long because of the number of times time appears in the bash man page before you hit the read builtin... So I just don't know why anyone would have ended up using spinning (or even ping) for this!?

Correction: I realised the read -t trick only works without stdio redirection. If you want it to work even if redirected, you need to force it to be a pipe (so unfortunately it does fork, but's still bash only):

sleep_() { read -t"$1" -u1 _dummy | read _dummy >/dev/null || :; }
$ time sleep_ 1.55 >/dev/null </dev/null

real    0m1.563s
user    0m0.000s
sys     0m0.013s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @andrewrk@ashishb@TjlHope@mlugg@mawkler

        Issue actions

          safe_sleep.sh rarely hangs indefinitely · Issue #3792 · actions/runner