Labels: bug
Describe the bug
Very rarely, on update of the GitHub Actions runner, `safe_sleep.sh` hangs forever:
$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
actions+ 72298 96.3 0.0 4372 3328 ? R Apr02 1988:54 /bin/bash /home/actions-runner/safe_sleep.sh 1
...
I suspect it may happen when the machine runs in the cloud and is overloaded and/or overcommitted.
To Reproduce
Steps to reproduce the behavior:
- Download a runner of a version prior to the latest, for example 2.322.
- Register and run runner.
- The runner updates itself. It may also pick up a job to complete in the meantime.
Expected behavior
The update should not hang indefinitely.
Runner Version and Platform
Runner 2.322
OS: Linux
mlugg commented on Aug 13, 2025
The script in question:
The bug in this "safe sleep" script is obvious from looking at it: if the process is not scheduled for the one-second interval in which the loop would return (due to `$SECONDS` having the correct value), then it simply spins forever. That can easily happen on a CI machine under extreme load. When this happens, it's pretty bad: it completely breaks a runner until manual intervention. On Zig's CI runner machines, we observed multiple of these processes which had been running for hundreds of hours, silently taking down two runner services for weeks.

I don't understand how we got here. Even ignoring the pretty clear bug, what makes this Bash script "safer" than calling into the POSIX-standard `sleep` utility? It doesn't seem to solve any problem; meanwhile, it's less portable and needlessly eats CPU time by busy-waiting.

The sloppy coding which is evident here, as well as the inaction on core Actions bugs (in line with the decay in quality of almost every part of GitHub's product), is forcing the Zig project to strongly consider moving away from GitHub Actions entirely. With this bug, and many others (severe workflow scheduling issues resulting in dozens of timeouts; logs randomly becoming inaccessible; random job cancellations without details; perpetually "pending" jobs), we can no longer trust that Actions can be used to implement reliable CI infrastructure. I personally would seriously encourage other projects, particularly any using self-hosted runners, to look carefully at the stability of Actions and ask themselves whether it is a solution worth sticking with long-term compared with the alternatives.
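The failure mode is easy to sketch in a few lines of bash -- an illustration of the behavior described, not the verbatim script:

```shell
#!/bin/bash
# Illustration of the hang, not the verbatim safe_sleep.sh. $SECONDS is
# bash's builtin whole-seconds-since-shell-start counter.
duration=${1:-1}
end=$(( SECONDS + duration ))

# BUG: exact-equality test. The loop exits only if it observes the exact
# second at which SECONDS == end; if the shell isn't scheduled during
# that one-second window (easy under heavy load), SECONDS skips past
# `end` and the loop spins forever, pinning a core.
while [ "$SECONDS" -ne "$end" ]; do
    # Safety valve for this sketch only -- the real script has nothing
    # like it, which is exactly why it can hang until killed.
    if [ "$SECONDS" -gt "$end" ]; then
        echo "missed the window; the real script would now spin forever" >&2
        break
    fi
done

# A monotonic comparison cannot miss the window:
#   while [ "$SECONDS" -lt "$end" ]; do :; done
```

Either way the loop burns a full core for its whole duration; the equality test merely decides whether that duration is bounded.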
markozajc commented on Aug 15, 2025
This bugs me too, so I looked into it. It seems that a `safe_sleep` function was originally declared and used in `run-helper.sh.template`:

runner/src/Misc/layoutroot/run-helper.sh.template
Lines 13 to 27 in f2347b7

A few things stand out:
- It checks for the presence of the `sleep` utility; however, `/usr/bin/sleep` is provided by `coreutils` on Debian and Ubuntu, which is `Priority: required` and `Essential: yes`, so the binary will always be present on a correctly installed system.
- The fallback assumes that redirecting to `/dev/null` 5000 times somehow takes 5 seconds.
- `sleep` is the only coreutils binary that has a fallback mechanism like this. Why not also do this for `id`, `readlink` and `dirname`?

Apparently this function was such a hit that #1707 opted to use it across the entire project. For some reason, this included rewriting it in a way that, like you mentioned, burns CPU cycles for no good reason and causes random hangs:
runner/src/Misc/layoutroot/safe_sleep.sh
Lines 3 to 6 in b917970
Why was this PR merged without so much as a comment? Why does it execute the helper script instead of exporting and importing it as a function? Why doesn't it check for the presence of `sleep` like the old one did? At least the negative clause is effectively dead code in that case, so it wouldn't cause bugs.

mlugg commented on Aug 15, 2025
Adding to the confusion, it turns out that someone submitted a perfectly good fix for the actual bug here -- while keeping the inexplicable busy-loop, so any hypothetical advantage of `safe_sleep.sh` would be retained -- over a year ago! That was #3157. Unfortunately, it sat completely untouched for over a year, before it was helpfully closed by a bot due to inactivity... despite being a perfectly good PR with a pending, unfulfilled review request. This situation is baffling; to me at least, it appears that this mission-critical repository is borderline unmaintained.

notcancername commented on Aug 16, 2025
If this is the reason, it's a pretty funny one considering Node is relied on quite heavily in GitHub Actions-land.
andrewrk commented on Aug 16, 2025
it's vibes all the way down
Summertime commented on Aug 16, 2025
Runner relies on `readlink`, `rm`, `dirname`, `cat`, `uname`, `seq` (not POSIX!), `bash` (not POSIX!)...

Also `ldd`, `sudo`, `grep`, `ldconfig`, `docker`...

I think an assumption can be made that `sleep` (POSIX) will probably exist.

Summertime commented on Aug 16, 2025
Additional exploration into the history of this code is here: https://lobste.rs/s/l4rowa/github_actions_safe_sleep_sh#c_ppqpea
ashishb commented on Aug 16, 2025
GitHub Actions does have these gotchas, and the defaults are just not optimal.
There are fixes, but they are not very obvious to the first-time user.
@mlugg Happy to send some fixes that might help Zig (I just started using it, and it definitely is an interesting language).
mlugg commented on Aug 16, 2025
We just discovered another 4 deadlocked `safe_sleep.sh` processes on one of Zig's CI runner machines, collectively saturating 4 CPU cores with infinite loops and reducing CI capacity by 30%. Two of them had been running for around a day, the other two for almost 2 weeks. This has certainly been a contributor to our ongoing CI issues. Why are Microsoft refusing to give this bug any attention? We should not need to babysit our CI runners to prevent them from becoming glorified space heaters.

mlugg commented on Aug 16, 2025
@ashishb thanks for the suggestions -- we've decided to just migrate away from Actions ASAP since it's become clear that GitHub is a rapidly sinking ship.
mlugg commented on Aug 16, 2025
As @kristoff-it rightly pointed out while streaming on Twitch yesterday, there's an even better question than "why is this buggy and inefficient script used instead of the universally-available `sleep`?", namely: "why does the runner need to sleep in the first place?" Well, let's look at where `safe_sleep.sh` is used...

Waiting for Runner Termination

`update.sh.template` uses `safe_sleep.sh` to poll the runner's PID in `/proc` to see if it has terminated:

runner/src/Misc/layoutbin/update.sh.template
Lines 28 to 39 in 0ebdf9e
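As a rough illustration of the pattern being described -- a reconstruction for this discussion, not the actual template -- it amounts to something like:

```shell
#!/bin/bash
# Rough reconstruction of the described pattern -- NOT the actual
# update.sh.template. For a self-contained demo we watch a short-lived
# background process; the real script watches the runner's PID and calls
# safe_sleep.sh 2 instead of sleep 2.
sleep 1 &
runnerpid=$!

# Poll /proc/<pid> until the watched process disappears. This is racy:
# if the process exits and the kernel hands the same PID to an unrelated
# long-lived process, this loop never terminates. For a child process,
# `wait "$runnerpid"` would be the race-free primitive.
while [ -e "/proc/$runnerpid" ]; do
    sleep 2
done
echo "process $runnerpid has terminated"
```

Note that `wait` only works from the process's own parent; the broader point is that some positive notification of termination (a `wait`, a pipe closing) avoids the race entirely, where PID polling cannot.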
Firstly, this is another very obvious race condition -- nothing stops the runner from terminating, and another process from picking up that PID, within the 2-second interval, such that `update.sh.template` thinks the Actions runner is still running and continues to wait. Since Actions runners are long-lived (even if they handle just one job, a job could go on for hours!), PID wraparound could easily have happened, so this isn't an esoteric concern: on any system doing more than hosting a single Actions runner, there is a non-negligible chance that something spawns a new process in the required interval and hence takes down a runner indefinitely (until one unlucky process happens to terminate). This is, again, extremely sloppy coding -- the process should just be monitored using an appropriate primitive. Perhaps the program that spawned the runner uses `wait`; perhaps the runner which spawns this script manually communicates its termination through a pipe; or perhaps the runner just doesn't spawn the script until it can make progress.

Secondly, because `safe_sleep.sh` spins needlessly, a full CPU core gets locked up spinning, while a CI run is potentially in progress, waiting for that run to finish. I haven't been able to understand the maze of C# logic which ultimately calls this code, but we've observed a half dozen non-deadlocked `safe_sleep.sh` processes on a CI machine in normal operation, so I'm fairly confident in saying that the script is spawned while CI runs are in progress. So on the machine I mentioned in a previous comment, not only were there 4 deadlocked `safe_sleep.sh` processes, there were also 6 which were being continuously respawned each time they terminated; re-rolling the dice for a deadlock, and causing a full 10 cores (1000% CPU usage) to be saturated on... waiting for timestamps.

Waiting for Docker Spawn
`run-helper.sh.template` uses `safe_sleep.sh` to poll `docker ps` to check whether Docker has spawned:

runner/src/Misc/layoutroot/run-helper.sh.template
Lines 21 to 33 in 0ebdf9e
This seems like it's just the wrong approach again; synchronization should be ensured rather than just polling a few times. However, I won't ponder this for too long, because it seems like this is dead code: `RUNNER_WAIT_FOR_DOCKER_IN_SECONDS` does not appear anywhere else in the repository, so presumably it will never be set.

Delay between Runner Restarts
`run-helper.sh.template` uses `safe_sleep.sh` to implement a 5-second delay before restarting a failed runner:

runner/src/Misc/layoutroot/run-helper.sh.template
Lines 45 to 49 in 0ebdf9e
I find this quite amusing: a big reason to use a "retry delay" like this is to avoid wasting system resources on repeatedly retrying something that keeps failing, but thanks to the implementation of `safe_sleep.sh`, we're replacing "locking up a full CPU core on retrying something that failed" with "locking up a full CPU core on doing literally nothing".

Waiting for Update Completion
`run-helper.sh.template` uses `safe_sleep.sh` to wait for up to 30 seconds while `update.sh.template` (that's the one we saw earlier!) runs an update:

runner/src/Misc/layoutroot/run-helper.sh.template
Lines 49 to 61 in 0ebdf9e
(Weirdly, this case is duplicated almost exactly directly below; apparently exit codes 3 and 4 both mean "exiting for update", but I guess the author of this code was a little too scared to write an OR (`||`) in their condition, preferring to duplicate the block.)

Again, using polling here is a clearly bad approach. Incidentally, this case lines up with the first one I looked at: as soon as an update is available, assuming none of the race conditions are hit, a full CPU core is locked up spinning on nothing right up until the update completes.
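Even keeping the flag-file design, the waste is avoidable: the ordinary `sleep` utility blocks in the kernel instead of spinning. A sketch of the same 30-second wait (the `update.finished` name comes from the discussion; everything else here is illustrative):

```shell
#!/bin/bash
# Sketch: wait up to 30 seconds for an "update finished" flag file to
# appear, sleeping between checks instead of busy-waiting. The flag file
# name mirrors the one discussed; the path here is a demo stand-in.
flagfile=$(mktemp -u)            # demo path; the real flag is update.finished

( sleep 2; touch "$flagfile" ) & # demo: simulate the updater finishing

elapsed=0
while [ "$elapsed" -lt 30 ] && [ ! -f "$flagfile" ]; do
    sleep 1                      # blocks in the kernel; ~0% CPU
    elapsed=$((elapsed + 1))
done

[ -f "$flagfile" ] && echo "update finished after ~${elapsed}s"
rm -f "$flagfile"
```

Still polling, but at a cost of one wakeup per second rather than a saturated core.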
There at least isn't a deadlock bug here in normal operation, and even if `update.finished` were never created, the loop is capped at 30 seconds. Perhaps that's a lazy workaround for some of the bugs we've covered here, I don't know.

Summary
Here's where we're at. The `safe_sleep.sh` script:
- busy-waits, needlessly burning a full CPU core every time it runs;
- can deadlock, spinning forever, if the shell isn't scheduled during the exact second it waits for;
- is less portable than both the POSIX `sleep` utility and the `safe_sleep` function it replaced.

This report of the common deadlock bug, which can turn self-hosted runners into expensive heaters, has been ignored for months. A valid fix for that deadlock was PR'd over a year ago and went completely ignored by maintainers, before a bot closed it as stale because the maintainers of this repository chose to ignore it. We are having to babysit our Actions runners to occasionally kill these deadlocked spinloops. Users who aren't aware of this bug (virtually all of them!) and who use self-hosted runners probably have cores spinning away in an infinite loop as we speak, for no coherent reason: wasting money, wasting electricity, and bottlenecking their workflows.
This situation is insanity. Can anyone at Microsoft look away from the electricity-to-slop converter that is Copilot for the 5 minutes necessary to write some vaguely competent code?
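For what it's worth, a "vaguely competent" version could be as simple as preferring the (effectively universal) `sleep` utility and falling back only when it's genuinely missing -- a sketch, not the project's code; the `ping` fallback mirrors what the thread says the original function and #3143 do:

```shell
#!/bin/bash
# Sketch of a genuinely safe sleep: prefer the sleep utility, fall back
# only if it's truly absent. An illustration, not the project's code.
safe_sleep() {
    secs=$1
    if command -v sleep >/dev/null 2>&1; then
        sleep "$secs"                       # blocks; no CPU burned
    elif command -v ping >/dev/null 2>&1; then
        # N pings at the default 1-second interval take roughly N-1
        # seconds (the approximation the original function relied on)
        ping -c "$((secs + 1))" 127.0.0.1 >/dev/null 2>&1
    else
        # bash-only last resort: read with a timeout blocks without
        # spinning; <> holds the FIFO open so read can't see EOF
        read -t "$secs" <> <(:) || :
    fi
}

safe_sleep 1
echo "slept without busy-waiting"
```

Every branch parks the process in the kernel instead of spinning, so none of them can saturate a core.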
mlugg commented on Aug 26, 2025
Ah, it looks like the maintainers have quietly re-opened and merged #3157, so the common deadlock is at least fixed. That was presumably thanks to the non-trivial amount of attention this issue has garnered.
Amusingly, that PR was merged without closing (or even responding to) this issue -- I can only assume out of embarrassment. Nonetheless, this issue is resolved now; OP can probably close it.
What isn't solved is the other potential deadlock I outlined above (related to PID reuse), and (perhaps more importantly) the fact that this script hammers the CPU with a busy-loop every time it is run. It actually turns out that the latter issue is already tracked by #2380 -- open for over 2 years, with a fix, #3143 (restoring the original behavior where `sleep` and `ping` are preferred), open and ignored for over 1 year. Hm, I'm getting déjà vu here...

mawkler commented on Aug 29, 2025
You may not like it but this is what peak vibe coding looks like
TjlHope commented on Sep 14, 2025
I wouldn't normally comment, but a) this is hilarious (as someone it hasn't bitten -- I thought some of my old shell scripts were bad!) and b) I've had to try to restore a remote server that was suffering process starvation before, so I have a bit of an interest in various coreutils as non-forking shell functions.
So, in case anyone is interested: `sleep` is actually fairly easy in both `bash` and `busybox sh` (though unfortunately not `dash`).

(You tell `read` to time out after `$1` seconds of trying to read its own stdout, which is always silent, then return true.)

It took me ~5 minutes to come up with that, and it only took that long because of the number of times "time" appears in the `bash` manpage before you hit the `read` builtin... So I just don't know why anyone would have ended up using spinning (or even `ping`) for this!?

Correction: I realised the `read -t` trick only works without stdio redirection. If you want it to work even when redirected, you need to force it to be a pipe (so unfortunately it does fork, but it's still bash-only):
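For reference, the trick being described looks roughly like this -- a reconstruction for illustration, not necessarily the commenter's exact code:

```shell
#!/bin/bash
# Reconstruction of the read -t trick described above; an illustration,
# not necessarily the commenter's exact code.

# Variant 1: read from our own stdout with a timeout. A terminal is
# read/write and never produces input on its own, so read just blocks
# until the timeout -- but this breaks if stdout is redirected to a
# pipe or regular file.
sleep_v1() { read -t "$1" <&1 || :; }

# Variant 2 (the correction): hand read a descriptor that can never
# produce input, by opening a process-substitution FIFO read-write (<>).
# We hold the write end ourselves, so read never sees EOF and simply
# blocks until the timeout. Bash-only, and the substitution does fork.
sleep_v2() { read -t "$1" <> <(:) || :; }

start=$SECONDS
sleep_v2 1
echo "waited $((SECONDS - start))s without spinning"
```

`read -t` returns a status greater than 128 on timeout, hence the `|| :` to make the function "succeed" like `sleep` would.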