Just now, I took a look at the HTTP logs on git.sr.ht. Of the past 100,000 HTTP requests received by git.sr.ht (representing about 2½ hours of logs), 4,774 have been requested by GoModuleProxy — 5% of all traffic. And their requests are not cheap: every one is a complete git clone. They come in bursts, so every few minutes we get a big spike from Go, along with a constant murmur of Go traffic.
This has been ongoing since around the release of Go 1.16, which came with some changes to how Go uses modules. Since this release, following a gradual ramp-up in traffic as the release was rolled out to users, git.sr.ht has had a constant floor of I/O and network load for which the majority can be attributed to Go.
I started to suspect that something strange was going on when our I/O alarms started going off in February 2021 (we eventually had to tune these alarms up above the floor of I/O noise generated by Go), correlated with lots of activity from a Go user agent. I was able to narrow it down with some effort, but to the credit of the Go team they did change their User-Agent to make more apparent what was going on. Ultimately, this proved to be the end of the Go team’s helpfulness in this matter.
I did narrow it down: it turns out that the Go Module Mirror runs some crawlers that periodically clone Git repositories with Go modules in them to check for updates. Once we had narrowed this down, I filed a second ticket to address the problem.
I came to understand that the design of this feature is questionable. For a start, I never really appreciated the fact that Go secretly calls home to Google to fetch modules through a proxy (you can set GOPROXY=direct to fix this). Even taking the utility at face value, however, the implementation leaves much to be desired. The service is distributed across many nodes which all crawl modules independently of one another, resulting in very redundant git traffic.
140 8a42ab2a4b4563222b9d12a1711696af7e06e4c1092a78e6d9f59be7cb1af275
57 9cc95b73f370133177820982b8b4e635fd208569a60ec07bd4bd798d4252eae7
44 9e730484bdf97915494b441fdd00648f4198be61976b0569338a4e6261cddd0a
44 80228634b72777eeeb3bc478c98a26044ec96375c872c47640569b4c8920c62c
44 5556d6b76c00cfc43882fceac52537b2fdaa7dff314edda7b4434a59e6843422
40 59a244b3afd28ee18d4ca7c4dd0a8bba4d22d9b2ae7712e02b1ba63785cc16b1
40 51f50605aee58c0b7568b3b7b3f936917712787f7ea899cc6fda8b36177a40c7
40 4f454b1baebe27f858e613f3a91dfafcdf73f68e7c9eba0919e51fe7eac5f31b
This is a sample from a larger set which shows the hashes of git repositories on the right (names were hashed for privacy reasons), and the number of times they were cloned over the course of an hour. The main culprit is the fact that the nodes all crawl independently and don’t communicate with each other, but the per-node stats are not great either: each IP address still clones the same repositories 8-10 times per hour. Another user hosting their own git repos noted a single module being downloaded over 500 times in a single day, generating 4 GiB of traffic.
The Go team holds that this service is not a crawler, and thus they do not obey robots.txt — if they did, I could use it to configure a more reasonable “Crawl-Delay” to control the pace of their crawling efforts. I also suggested keeping the repositories stored on-site and only doing a git fetch, rather than a fresh git clone every time, or using shallow clones. They could also just fetch fresh data when users request it, instead of pro-actively crawling the cache all of the time. All of these suggestions fell on deaf ears, the Go team has not prioritized it, and a year later I am still being DDoSed by Google as a matter of course.
I was banned from the Go issue tracker for mysterious reasons,1 so I cannot continue to nag them for a fix. I can’t blackhole their IP addresses, because that would make all Go modules hosted on git.sr.ht stop working for default Go configurations (i.e. without GOPROXY=direct). I tried to advocate for Linux distros to patch out GOPROXY by default, citing privacy reasons, but I was unsuccessful. I have no further recourse but to tolerate having our little-fish service DoS’d by a 1.38 trillion dollar company. But I will say that if I was in their position, and my service was mistakenly sending an excessive amount of traffic to someone else, I would make it my first priority to fix it. But I suppose no one will get promoted for prioritizing that at Google.
-
In violation of Go’s own Code of Conduct, by the way, which requires that participants are notified moderator actions against them and given the opportunity to appeal. I happen to be well versed in Go’s CoC given that I was banned once before without notice — a ban which was later overturned on the grounds that the moderator was wrong in the first place. Great community, guys. ↩︎
Have a comment on one of my posts? Start a discussion in my public inbox by sending an email to ~sircmpwn/public-inbox@lists.sr.ht [mailing list etiquette]