Why are my Go executable files so large? (cockroachlabs.com)
315 points by caiobegotti 7 hours ago | 207 comments





> prior to 1.2, the Go linker was emitting a compressed line table, and the program would decompress it upon initialization at run-time. in Go 1.2, a decision was made to pre-expand the line table in the executable file into its final format suitable for direct use at run-time, without an additional decompression step.

I think this was a good choice, and the author of the article missed the most important point: an uncompressed table can actually use less memory.

This sounds paradoxical, but if a table has to be expanded at runtime, then the whole expanded table has to be loaded into memory.

However if a table is part of the executable, the OS won't even load it into memory unless it is used and will only page the bits into memory that are used.

You see the same effect when you compress a binary with UPX (for example): the size on disk gets smaller, but because the entire executable is decompressed into RAM rather than demand-paged in, it uses more memory.


If you decompress it to an mmapped file, it'll be one of the first things written to disk under memory pressure anyway, and instantly available in normal situations.

With the ever-decreasing cost of flash and its ever-increasing speed relative to the CPU, though, compression isn't worth as much to startup times as it was 10 years ago.


Swapping is fine for workstations and home computers. But high-performance machines running in production environments will absolutely have swap disabled.

The performance difference between RAM and disk is not an acceptable tradeoff. RAM will be tightly provisioned and jobs will be killed rather than letting the machine OOM.


Are you arguing you prefer your production jobs to be killed rather than slowed?

Generally yes. Swapping changes the perf characteristics of that process (and often any other process on the same machine) in unpredictable ways. It's better to have predictable process termination -- with instrumentation describing what went wrong, so capacity planning and resource quotas can be updated. The process failure would generally be compensated-for at a higher level, anyway.

I don't buy the "never swap" school of thought.

Sure, swap is bad on steady-state loads. But for transient loads it works.

Unless you like playing "why did my service get killed this time" every once in a while.


Virtual memory works even without swap enabled. Since the mapped file is the binary, and code is never changed after loading, the OS can simply reclaim the pages backing that memory. When there is a page fault, the page will be brought back in from the binary on disk.

Is it just me, or should something like runtime.pclntab not be included in production builds at all?

I mean, it makes perfect sense while you're developing and testing, but it should be reasonably possible to strip it out of production binaries and instead put it in a separate file, so that if you do get a crash with a stack trace, some external script can transform the program counters to line numbers, rather than having the table embedded in every deployed binary.


The Go language literally requires that pclntab be included in release builds. I'm with you—it seems kind of crazy that this was designed into the language—but there you have it.

The reason is that Go's standard library provides functions that allow retrieving a backtrace and symbolicating that backtrace at runtime:

* https://golang.org/pkg/runtime/?m=all#Callers

* https://golang.org/pkg/runtime/?m=all#CallersFrames

Unlike in C or C++, where you can link in something like libbacktrace [0] to get a best-effort backtrace, those Go functions are guaranteed to be correct, even when functions have been inlined. This is no small feat, and indeed programs compiled with gccgo will often be incorrect because libbacktrace doesn't always get things right when functions have been inlined.

[0]: https://github.com/ianlancetaylor/libbacktrace
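
For illustration, this is roughly what that looks like in practice (a minimal sketch; printTrace is my own helper name, not a stdlib API):

  package main
  import (
      "fmt"
      "runtime"
  )
  // printTrace symbolicates its own call stack at run time, using the
  // line-number data the Go linker embeds in the executable.
  func printTrace() {
      pcs := make([]uintptr, 32)
      n := runtime.Callers(2, pcs) // skip runtime.Callers and printTrace itself
      frames := runtime.CallersFrames(pcs[:n])
      for {
          frame, more := frames.Next()
          fmt.Printf("%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
          if !more {
              break
          }
      }
  }
  func main() { printTrace() }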


Is it that common for these functions to be used? Perhaps transitively through some popular libraries? Just being in the standard library doesn't even necessarily mean that these functions (and the data they need) should get included by the linker.

Any program that uses a logging framework, including the stdlib log package, will wind up depending on runtime.Callers at least transitively. That’s probably most Go programs; certainly most of the programs large enough to be worrying about binary size.

Unlike in C, there are no macros like __FILE__ and __LINE__, so there is no alternative to runtime.Callers (short of preprocessing your Go source code).
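
Roughly, a logging call site recovers file and line like this (illustrative sketch, not the actual stdlib implementation):

  package main
  import (
      "fmt"
      "runtime"
  )
  // logWithCallSite prints its caller's file and line, the way log.Lshortfile
  // does; skip=1 means "report my caller, not me".
  func logWithCallSite(msg string) {
      _, file, line, ok := runtime.Caller(1)
      if !ok {
          file, line = "???", 0
      }
      fmt.Printf("%s:%d: %s\n", file, line, msg)
  }
  func main() { logWithCallSite("something happened") }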


You can still get a backtrace without symbols though.

Why couldn't the Go team introduce a flag that strips symbols, while making clear to people that they should only use it if they are okay with backtraces looking like

  #1  0x00007ffff7ddb899 in __GI_abort () at abort.c:79
  #2  0x0000555555555156 in ?? ()
  #3  0x0000555555555168 in ?? ()
  #4  0x000055555555517d in ?? ()
  #5  0x0000555555555192 in ?? ()
or similar

Because one of the mantras of Go is not telling users to have a "debug" build and a "release" build. The development build is the one that goes into production, with no difference in optimizations, symbols, and whatnot. This has pros and cons, like all tradeoffs.

Are you sure this is true? Doesn't delve, for example, build Go source with special flags (-gcflags=all='-N -l') to disable optimizations for debugging? I also remember having to build Go code with those flags for Stackdriver to get the correct debugging information without any optimisations.

This shouldn't be necessary anymore.


Thanks; I didn’t know that, but it makes sense. Not a design decision I agree with, but it’s coherent, at least.

Backtraces can be amazing at run time if you're going to put that info into logs. It makes finding "fringe" errors a lot less painful and more easily reproducible.

Not sure if it is just you, but normally you DO want this information in the production build. It is quite a bad situation to have a runtime exception in PROD and have no idea how it happened. Sure, there is defensive programming and checks and asserts, but most of the time you cannot foresee everything.

I get the point about external symbols and location database, but oftentimes time is precious and having fully laid out stack trace in the log will allow you to get to the root much faster.
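
For example, a typical Go pattern is to recover from a panic and dump the symbolicated stack straight into the log (quick sketch):

  package main
  import (
      "log"
      "runtime/debug"
  )
  func handle() {
      defer func() {
          if r := recover(); r != nil {
              // debug.Stack returns a readable, symbolicated trace at run time.
              log.Printf("panic: %v\n%s", r, debug.Stack())
          }
      }()
      var m map[string]int
      m["boom"] = 1 // panics: assignment to entry in nil map
  }
  func main() { handle() }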


> I get the point about external symbols and location database, but oftentimes time is precious and having fully laid out stack trace in the log will allow you to get to the root much faster.

You can also set up a service that automatically symbolicates everything in a log file as soon as it is generated, before a human ever even looks at it.

Granted, yes, this is slightly more complicated, but the point is that the toolchain should let the developers choose which strategy they want to use.


There is a big difference between choose and do.

I don't understand what you mean; can you elaborate?

Not a parent commenter but I interpreted that as: given choice between alternatives A (fast and good enough) and B (more complex and much better) you would want to choose B but end up doing A for various reasons (lack of time, unclear ROI etc)

Sure, but the toolchain designers can (relatively easily) offer both, so that people who really do have the resources and/or need to do B can do so.

So what would a solution be? How do other languages solve that problem?

In my experience you would strip the symbols out of the prod binary, and save them separately somewhere.

Then your production binary will give you stack traces like [0x12345, 0xabcde, ...], but you can use the separately-stored files to symbolicate them and get the source file/line info.

Not sure if this is possible on all platforms but it at least is for all combinations of {C, C++, Objective-C, Rust} and { Linux, macOS, iOS } .

And if that added operational complexity is not worth the size savings, you can freely choose not to do it, and things will work like they do in Go.


That is debug information. Just have it stored elsewhere (not on the binary you ship everywhere) and use that in conjunction with your core dump to debug.

Separable debuginfo which can be loaded at runtime. DWARF uses an efficient compression mechanism much smarter than a table for this sort of mapping. And of course things like coredumps and crash dumps being sent to automated processing where devtools have the full debug symbols, while production deployments do not.

Go's insistence on not just reinventing the wheel but actively ignoring core infrastructure improvements made in the last 20 years is bizarre.


A lot of them have symbol files separate from the binary. Unixy tooling doesn't do this by default but for example objcopy(1) in binutils can copy symbols to another file before you run strip(1), and on Mac my memory is rusty but I think it may be dsymutil(1) that lets you copy to a .dSYM bundle. Microsoft has its .pdb files and never even keeps debug info inside the binary proper.
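
The ELF flavour of that workflow looks roughly like this, if memory serves (sketch; file names are made up):

  objcopy --only-keep-debug myprog myprog.debug    # save the full debug info
  objcopy --strip-debug myprog                     # ship a lean binary
  objcopy --add-gnu-debuglink=myprog.debug myprog  # let gdb find the .debug file later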

The debug info is in a separate file. You only need that file when you’re inspecting a crash report, so it doesn’t need to be pushed out to the host device(s).

Because it's not a problem, so everybody does the same. And it's not about the programming language, it's about the programmer's choice: if the programmer wants debug info inside a production program, the language lets that happen. In today's age the size of your executable is a non-issue. The only issue should be your performance.

Here is an example from my past. As an embedded programmer I went and manually added a hundred lines of constants, which initially were just an array generated at startup, and increased the code size by about 5%. Why? Because I gained 5 ms in execution speed, and in the embedded world that's huge, especially when your code is executed on the lowest 10 ms cycle. The department head approved such a huge change because code size doesn't matter (you can always buy a bigger chip), but if your car doesn't start because the watchdog keeps resetting the on-board computer, then speed of code execution is everything.


> Because it's not a problem, so everybody does the same. And it's not about the programming language, it's about the programmer's choice: if the programmer wants debug info inside a production program, the language lets that happen. In today's age the size of your executable is a non-issue. The only issue should be your performance.

No.


I do not know much about go, but languages like C++ and Java give you the tools to make tradeoffs appropriate to your situation: externalizing or stripping symbols and/or debugging information.

There are very different production scenarios - in many of them no one will ever look (or even be able to look) at a stack trace if it crashes after it's shipped (at best you'll record bug reports from customers and attempt to reproduce them on your test hardware), so the debug information is literally useless there. And these are the same scenarios where an extra 50MB of disk and memory matters more than it does for some software running in a cloud environment.

I was pleasantly surprised how good Microsoft's tooling was around firing up a debugger to examine the final state in a crash dump using external symbols from that build. Everything seemed to work except you couldn't resume any thread. I agree symbols don't need to be embedded in every running binary, but having a warm copy somewhere can be pretty helpful.

This is where letting a large enterprise guide the development of a piece of widely-used software becomes questionable. At a FAANG the constraints are fundamentally different.

At work I routinely see CLI programs clocking in at a gigabyte, because it was simpler to statically link the entire shared dependency tree than to figure out the actual set of dependencies, and once your binary grows that big, running LTO adds too much time to your builds. And disk space is basically free at FAANG scale...


Disk space in general is pretty much free these days. 123MB for a whole database is really not that big of a deal, IMHO. For example, my local PostgreSQL docker image is 140MB plus the Alpine OS (5MB). And the Ruby on Rails application using that image clocks in at a little over 1GB, also using the Ruby Alpine image as a base (50MB).

With my company, the cost really started to become a burden with data transfer. But transferring images to and from the AWS container registry is so expensive that we actually build production images inside the Kubernetes cluster (plus the cluster has access to all the secrets and stuff), even though it was a bit harder to implement.

If you're FAANG and you can run your own stable and highly-available cloud, data transfer rates don't matter, so you can deploy your application the "right" way in a containerized world.


Yes but when running a long-lived application on a lot of data it's typically important to keep the executable small both so it can be "hot" and to leave more room for data. At scale this could be even more important, not less, than for a smaller operation.

Of course the real (i.e. explicitly stated by Pike) driver for Go was the assertion that inexperienced new hires write poor code, and so the harder it is for them to get into trouble the better, even at the cost of other issues.


> Yes but when running a long-lived application on a lot of data it's typically important to keep the executable small both so it can be "hot" and to leave more room for data.

Who really runs server applications these days where data/rss are not a large multiple of the actual code segments? People happily run JVM server processes these days, how much code/data do you think that pulls in just to start up?


Executable size on disk doesn't dictate effective size at higher levels of the memory hierarchy.

IOW, you're paying by the page/cache line. If the extra bloat (debug information) in your executable isn't interspersed within your actual code (it shouldn't be), you aren't paying for it in runtime efficiency.


Genuinely wondering how a CLI would be 1GB, even at Facebook. Never encountered anything _close_ at Amazon.

Not your usual use case, but the binary for running the Bloomberg terminal is well over 1GB (in fact, running into the 4GB executable boundary was a problem).

Oh, that's fascinating. I never checked the binary size but that makes sense.

I never saw a single tool that big at Amazon either, but the general concept held. The AWS service I worked on had ~1,000 jars in its dependency closure. All of those had to be uploaded to S3 after each build, and then downloaded onto each EC2 instance during deployment.

We're talking on the order of a terabyte of data transfer each time we deployed to a thousand instance fleet (ideally deploying weekly)


LTO is no magic bullet for binary size either. A binary that does nothing will still link in the whole C library. It doesn't end up decreasing large program sizes that much in my experience either.

Tree shaking?

Mmmm I don't know if I agree with that logic.

I'd argue the opposite -- as a startup, you can't afford to micro-optimize. Labor, time and opportunity cost dwarf all but the grossest resource waste. If you need to use 100GB/k8s node instead of 50GB, it will have 0 effect on the success of your venture.

At Google scale, it becomes worth it to optimize:

- You are delivering more product per engineer, and more product means more resources. Instead of a single customer instance which costs $1/month more, you have 100,000 customer instances, costing a significant amount more. It becomes worth trimming margins.

- You have economies of scale, and it might be worth it for an engineer to spend a month trimming 2% of the cost of a software deliverable.

The common refrain for startups is "do things that don't scale", and this is for good reason. Google has to actually worry about fixing things AFTER they are scaled.


There are plenty of efforts and tooling that reduce the number of dependencies and the deploy size of binaries at Google. The notion that we don't care about size isn't true.

But it's true that it's not worth optimising first, it's done by first evaluating the impact across the fleet and then prioritising the most effective changes.


I think you're both right.

A 125MB binary is relatively beefy, but still easily fits in RAM. The amount of disk that you're spending on a single executable (your database, in this instance) is tiny in comparison to the amount of data stored in that database.

It's definitely worth it for Google to trim 2% off of their storage requirements - but if your binary is 0.1% of your storage, it's barely even worth glancing at.


This is where go’s insistence on reinventing the wheel feels terribly misplaced. Every major debug format has a way to associate code locations with line numbers. Every major debug format also has a way to separate the debug data from the main executable (.dSYM, .dbg, .pdb). In other words, the problem that the massive pclntab table (over 25% of a stripped binary!) is trying to solve is already a well-trodden and solved problem. But go, being go, insists on doing things their own way. The same holds for their wacky calling convention (everything on the stack even when register calling convention is the platform default) and their zero-reliance on libc (to the point of rolling their own syscall code and inducing weird breakage).

Sure, the existing solutions might not be perfect, but reinventing the wheel gets tiresome after a while. Contrast this with Rust, which has made an overt effort to fit into existing tooling: symbols are mangled using the C++ mangler so that gdb and friends understand them, rust outputs nice normal DWARF stuff on Linux so gdb debugging just works, Rust uses platform calling convention as much as possible, etc. It means that a wealth of existing tooling just works.


I am not a fan of Go, and I also wish these things were true (and more[1], actually), but I find it hard to agree that its priorities are "terribly misplaced." Inside the context of Go's goals (e.g., "compile fast") and non-goals (e.g., "make it easy to attach debuggers to apps replicated a zillion times in Borg") these trade-offs make a lot of sense to me. Like: Go rewrote their linker, I think, 3 times, to increase the speed. If step 1 was to wade through the LLVM backend, I am not sure this would have happened. Am I missing something?

I love Rust, but Go is focused on a handful of very specific use cases. Rust is not. I don't know that I can fault Go for choosing implementation details that directly enable those use cases.

[1]: http://dtrace.org/blogs/wesolows/2014/12/29/golang-is-trash/


I'd check out the HN comments in response to the parent's [1]: https://news.ycombinator.com/item?id=8815778

Specifically the top reply there is by rsc (tech lead for Go)


> non-goals (e.g., "make it easy to attach debuggers to apps replicated a zillion times in Borg")

But wouldn't it still be nice to have a standardized way to analyze post-mortem dumps across languages?


Google's anointed production languages used to be five: C++, Java, JavaScript, Python, and Go. Not much to reasonably standardize across, especially if a standardized solution ends up with more compromises than a custom one.

But DWARF uses less space than Go's native format. So inventing a custom "linetab" format seems like the compromise, not using DWARF.

Insert standardization XKCD. It's been tried. And even so, you can still use the "standard" coredump tool to analyze a Go program's coredump with decent success.

I totally agree with the above - was never able to click with Go _but_ I totally understand how reinventing the wheel has worked well for them.

The days when the Go project fired up were different than the days when Rust started. Rust made different tradeoffs by relying on LLVM and it has advantages (free optimizations!) and disadvantages of their own.


The first releases of each were only 8 months apart, far from “different days”. The projects simply have different goals.

Well go also uses its own assembler, on top of that a kind of modified garbage version of real ones. You can only justify so many reinventions of the wheel, yet they redid everything.

Did they actually redo everything, or does it just look that way from starting from the Plan9 toolchain? Which could also be said to be re-doing everything, but from a much earlier starting point.

IIRC Go started out shipping with a port of the Plan9 C compiler and toolchain - it was bootstrapped by building the C compiler with your system C compiler, then building the Go compiler. Which, until re-written in Go circa-2013, was in Plan-9 style C. It all looks deeply idiosyncratic but it was a toolchain the initial implementors were highly familiar with.


Perhaps the other assemblers would not provide desired compilation speed?

Perhaps their IP requirements would not suffice for Google's lawyers?

Perhaps Go devs would rather have more control on the development of assembler by writing it from scratch to understand every design decision instead of inheriting thousands of unknown design decisions?

I don't know. Neither do others outside of the project.

I find these baseless micro-aggressions against Go misplaced and unfruitful.


> I don't know. Neither do others outside of the project.

> I find these baseless micro-aggressions against Go missplaced and unfruitful.

Huh? Ok then Go is perfect because it is developed in secret.

We are discussing here; I'm not "micro-aggressing" anyone. If I don't like a design / re-implementation decision, and I'm in the mood to share that opinion with this cyber-assembly, I do it. And I expect developers not to be offended by me having a technical opinion, and I expect third parties to be even less offended. And yes, it might be a bad opinion in some cases. I'm not even 100% sure that isn't the case here, because like you said they could have had some kind of justification. But I suspect it is extremely rare to have a good justification for rewriting an assembler, with really big quirks on top of that, at the time they did it.


> Ok then Go is perfect because it is developed in secret.

Didn't imply that at all. No need for a straw man.


Yes, you absolutely did, by stating (not just implying) that any criticism of the project that does not take its internal decision making into account is "baseless micro-agression".

I was referring specifically to this:

> You can only justify so many reinventions of the wheel, yet they redid everything.

Don't extrapolate what I write.


A lot of the insularity and weirdness comes from the Plan 9 heritage. Go's authors (Rob Pike, Ken Thompson, and Russ Cox) cannibalized/ported a bunch of their own Plan 9 stuff during initial development. For example, I believe the original compiler was basically a rewrite of the Inferno C compiler.

This is a large part of why Go is not based on GCC or LLVM, why it has its own linker, its own assembly language, its own syscall interface, its own debug format, its own runtime (forgoing libc), and so on. Clearly Go's designers were more than a little contrarian in their way of doing things, but that's not the whole answer.

Being able to repurpose existing code is an efficiency multiplier during the bootstrapping phase. But when bootstrapping is done, you have to weigh the ROI of going back and redoing some things against keeping a design that works pretty well. The Go team is undoubtedly aware of some of these issues, but probably doesn't consider them to be a priority.

In some cases the tools are a benefit. Go's compiler and linker are extremely fast, which I appreciate as a developer. A possible compromise would be to offer a slower build pipeline for production builds, which made use of LLVM and its many man-years of code optimizations.


Personally, I rather wish Rust would take this approach. Rust desperately needs a fast, developer-oriented compiler. The slow compile times are potentially Rust's biggest flaw, to the point where I find it keeps me off the language for anything non-trivial. Even better might be a Rust interpreter, so you'd get a REPL and fast development cycles.

This is why Cranelift is being worked on. There is also a Rust interpreter, miri.

I think starting with LLVM was the right decision (and one that I was primarily responsible for). Rust would lose most of its benefits if it didn't produce code with performance on par with C++. LLVM is not the fastest compiler in the world (though it's not like it's horribly slow either), but its optimization pipeline is unmatched. I don't see replicating LLVM's code quality as feasible without a large team and a decade of work. Middling code gen performance is an acceptable price to pay until we get Cranelift; the alternative, developing our own backend, would mean not being able to deploy Rust code at all in many scenarios.


Forgive me if this is ignorant, since I haven't done any benchmarks on this in a while, but doesn't GCC produce slightly faster code on average across a wide set of benchmarks compared to clang/LLVM?

Perhaps, but the advantages of a large third-party ecosystem around LLVM outweighed any performance differences between GCC and LLVM.

At least in these benchmarks that Phoronix runs from time to time (so they can at least be compared to their older selves), LLVM, in its Clang incarnation, is finally getting some parity in execution times with GCC

https://www.phoronix.com/scan.php?page=article&item=gcc-clan...

Of course, benchmarks, yada yada, but at least it's some sort of comparison axis where the improvement over the years is clear.


Thanks for the link. I was probably thinking about some older phoronix benchmarks when I made my post

Thanks for the pointer! I was unfamiliar with Cranelift and it seems like a promising tech. I'll keep an eye on it in hopes that once it is stable I'll be able to put together a development environment that allows for the fast turnaround I prefer.

I have not used Rust for anything very large, but using an editor that supports the Rust language server mitigates the compile-time problem. In VSCode it shows you the compiler warnings and errors as you are editing a file. There is a little lag in updating, but the workflow is faster than switching to a terminal to do a full compile.

Isn't Cranelift going to be usable for that?

If you need any other evidence for this, just look at GOPATH and similar. That was plan9 through and through; they wanted to delegate work to the filesystem. No need for a package manager or anything, just pull down URIs and they'll be where Go wants them to be.

What are you talking about? Plan 9 doesn't even use $path. At least not consistently -- binaries live in /bin.

It's derived from convention of /n/sources and /n/contrib and the like. Sources mounted from network fileservers from various places, etc.

The git support was added to make it a bit easier outside plan9.


Go has had to walk back on some of its choices recently; most notably on platforms without a stable syscall ABI and a very strong push for dynamic linking (…so macOS) they link against the system libraries.

The only popular platform with a stable syscall ABI is Linux. This is a product of the historical accident that Linux doesn't control a libc and ensuing drama.

Almost everyone else doesn't have a stable ABI below the (C) linker level.


I don't think Linux actually guarantees syscall-level compatibility, so no need to single it out, it's just like everyone else.

It does - the syscalls are part of the official userspace interface which the Linux kernel promises not to break. They can add new syscalls, options or flags, but can’t break existing ones.
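
Which is why a Go program can do this on Linux and reasonably expect it to keep working across kernel upgrades (Linux-only sketch, error handling kept minimal):

  package main
  import (
      "fmt"
      "syscall"
  )
  func main() {
      // Call the kernel directly, no libc involved; SYS_GETPID is part of
      // the stable Linux userspace ABI.
      pid, _, errno := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
      if errno != 0 {
          fmt.Println("getpid failed:", errno)
          return
      }
      fmt.Println("pid:", pid)
  }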

> platforms without a stable syscall ABI and a very strong push for dynamic linking (…so macOS)

That's an even better description of Windows. The macOS system call table isn't officially stable, but it's at least slow to change. The Windows equivalent has been known to change from service pack to service pack.


Small note, we don't use the C++ mangler (https://github.com/rust-lang/rfcs/pull/2603), and did the upstream work in GDB to get it to understand things. (There's also more work to do: https://github.com/rust-lang/rust/issues?q=is%3Aopen+is%3Ais... )

That being said, yes, we see integration into the parent platform as being an important design constraint for Rust. I think Go made reasonable choices for what they're trying to do, though. It's all tradeoffs.


Indeed, though it's worth mentioning that the Rust mangling scheme is based on that of the Itanium C++ ABI.

For those who don't know, it's also worth mentioning that while it's called the "Itanium" C++ ABI (because it was designed originally for the Itanium), it's nowadays used for every architecture on Linux.

> to the point of rolling their own syscall code

It makes the Go concurrency mechanism possible; this is not just a whim.

Most importantly, this allows the scheduler to hook syscalls in order to schedule another goroutine. But it also allows controlling what happens during a syscall, since libc tends to do more than just call the kernel in its syscall wrappers, which might not be thread-safe or might not play well with stack manipulations.

This has never been a problem in my experience.


The problem is that there is exactly one OS that maintains the system call ABI as a stable API: Linux. On other systems, trying to invoke the system calls manually and bypassing the C wrapper opens you up into undefined behavior, and this was particularly problematic on OS X, which occasionally made assumptions about the userspace calling code that weren't true for the Go wrapper shell, since it wasn't the expected system wrapper library.

The problem is that that one OS is right. Systems calls form an API and it needs to be stable and managed. We (developers) have been working on this issue for years and have at least attempted solutions (eg. semantic versioning) where most OS developers feel free to break them on a whim. It is a terrible practice that forces others to spend their time working around.

I would note that system calls only form an external API if the developer says they are an external API. Which it is for Linux but for other OSes the external API is a C library, kernel32, etc.

But right and wrong aside, there's the practical matter of reality. You can't simply pretend everything works how you want them to. At the end of the day you have to deal with how they actually work.


The Linux model is not the "right" one, it's a choice that they've made. Just like static linking isn't the "right" choice either, it's an option with its own drawbacks. Other OSes provide an approved, API stable layer to access the OS; it's just not the syscall layer.

The Linux approach reflects the social structure this project has been developed in.

The Linux project needs to be able to evolve independently of other projects, so they do just that.

This is a classic case of technical architectures following most of the time social structures.

> Systems calls form an API and it needs to be stable and managed.

In some cases (e.g. nearly all the other OSes) system calls form an internal API; they don't need to be stable, and they don't even need to be accessible except to intermediate layers provided in a coordinated way.


No one here is disagreeing on the need for the operating system to provide a stable interface to applications: the question is where that stable interface should lie. Linux takes the most restrictive approach, asserting that the actual hardware instruction effecting the user/kernel switch is the appropriate boundary. OS X and Windows instead take the approach that there are C functions you call that provide that system call layer (these are not necessarily the POSIX API). OpenBSD and FreeBSD have the most permissive approach, placing it at an API, not ABI level (so the function calls may become macros to allow extra arguments to be added).

My preference is that the Windows/OS X model is where the boundary should belong.


There's no "right" about it. You're arguing that _having_ a stable ABI is important, and nobody is denying that. There are other ways to get a stable ABI. All of the other non-Linux OS's have one, they just guarantee it in a different place (generally in a userspace library that manages the syscall interface)

> On other systems, trying to invoke the system calls manually and bypassing the C wrapper opens you up into undefined behavior

It opens you up to a bit more of behavior changing in the future, but just a tiny bit more. No need to make a big deal out of it. It's a very normal thing in software. Nobody is going to promise you a perfect stable interface to rely on forever, not even Linux. But syscalls are actually pretty easy to keep up with, they change slowly, and it's easy to detect kernel version and choose appropriate wrappers to use with very little extra code.

The OS X problem is its own thing. Apple making breaking changes is not a new thing. I use an Apple laptop super rarely and still got fed up with breaking changes, even though I'm not upgrading past 13.6 at the moment.


I agree with some of your points; but zero-reliance on libc is the reason why it's so easy to use Go in containers; and Docker is one of the primary reasons why Go is popular. It's what they have got right.

You could statically link in libc and get the same effect.

Statically linking libc is it’s own minefield. It can and is done but even if you statically link everything else you should almost always dynamically link against your platform’s libc.

Statically linking libc is harder than dynamically linking it, but certainly easier than rewriting it.

Except glibc does not really support static linking if you want network support.

You'd still have to get a static libc (not usually preinstalled), and possibly compile it for a different OS/architecture...

Says someone who has never actually tried to do that. You can statically link with Musl. But you can't really statically link with Glibc.

Technically speaking OP only suggested statically linking (a) libc, not glibc specifically, and musl is a libc.

> Docker is one of the primary reasons why Go is popular.

Based on what data?


I've deployed all sorts of things which dynamically link libc in containers. This just isn't an issue in practice.

Except that Docker was originally written in Java by the former team that actually started the project, and nowadays contains modules written in OCaml taken from the MirageOS project, for the macOS and Windows variants of Docker.

So how much they got right regarding Docker's success and Go is a bit debatable.


Docker was not written in Java. It was shell scripts, python, then go. Dotcloud was primarily a python shop.

Perhaps you are thinking of a different project?


Probably Kubernetes which was indeed Java in formative years.

Kubernetes was never publicly available (open source) in any other language than Go. Early internal prototypes may have been in Java, but those bear no more resemblance to current Kubernetes than Borg does.

You are right, I got that wrong, too late to edit now.

Indeed, I got that wrong.

I think the parent was referring to using Go in Docker containers, not using Go for implementing Docker itself.

That said, I agree that Docker was the first major project written in Go many people were exposed to and probably had some influence.


I'm just wondering, how is your line about the code taken from the MirageOS project relevant? Nobody uses the Windows and macOS variants of Docker in production.

Windows shops do use Docker in production, there are plenty of them.

It is relevant in the sense that Docker isn't 100% Go nowadays.


As an aside, do they use Windows containers in that context? Otherwise why?

Yes, for example in Azure deployments.

If you want static linking use musl..

You'd supposedly sacrifice performance though.

At least that's the common complaint for Alpine docker images... It's based on musl, and half of the community always complains about serious performance degradation.


Well if the performance difference matters, going to a GC language makes no sense.

Tell that to the go enthusiasts. They're all claiming it to be peak performance surpassing everything else.

though even java is faster in most benchmarks


Reinventing the wheel is sometimes a feature - using other people's stuff, you gain their features, but you inherit their bugs, their release timelines, whatever overhead they baked in which they thought was okay, etc. You lose the ability to customize and optimize because it's no longer your code...

It's all just tradeoffs in the end - I think golang is finding some success because they didn't make the same tradeoffs everyone else did.


Let's not forget their attempt at inventing yet another Asm syntax for x86, when there is already the horrible GNU/AT&T as well as the official syntax of the CPU documentation.

Go's assembler syntax is inherited from the Plan 9 project, which started in the late 1980s and was first released in 1992.

For context, gcc was first released in 1987 i.e. about the same time that Plan 9 started.

Go authors didn't attempt to re-invent asm syntax. They re-used the work they did over 30 years ago.

And at the time Plan 9 happened it was hardly re-inventing anything either. It was still the time of invention.

References:

* https://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs

* https://en.wikipedia.org/wiki/GNU_Compiler_Collection


And at the time Plan 9 happened it was hardly re-inventing anything either.

Intel's Asm syntax was defined in 1978 with the release of the 8086, and the 32-bit superset in 1985 with the 386. CP/M, DOS, and later Windows assemblers all used the official syntax.


Plan9 assembler syntax didn't start out on x86, and is kept the same across all platforms as much as possible.

The question remains, why not reuse the work that somebody else did even earlier, and that has a lot more adoption already?

The calling convention is a serious wtf. They're relying on store-load forwarding to make the stack as free as a register, but that's iffy at best and changes heavily between microarchitectures.

I'd assert the calling convention is strange by design: there is the underlying reality that, to support actual closures and lambdas, as Go does, in the Lisp sense, not the fake Java sense, one can't use the C calling conventions. In particular, it's not true that a called function can expect to find bindings for its variables on a call stack, because of the upward funargs issue: some bound variables for a called function in the presence of true lambdas and thus closures will necessarily NOT be found on the C call stack, because of the dissociation of scope with liveness in the presence of lambda (anonymous functions).
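
For concreteness, here's the kind of thing I mean: the captured variable below outlives the call that created it, so it cannot live in a C-style stack frame (toy example):

  package main
  import "fmt"
  // counter returns a closure; the captured n escapes to the heap because
  // it must survive after counter returns.
  func counter() func() int {
      n := 0
      return func() int {
          n++
          return n
      }
  }
  func main() {
      next := counter()
      fmt.Println(next(), next(), next()) // 1 2 3
  }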

What you describe is a non-problem: you can trivially spill upvars to the stack on-demand, as most compilers do, while keeping formal parameters in registers. Java needs upvars to be final because it doesn't have the concept of "reference to local variable", but that's just a limitation of the JVM, and one easily solved in other runtimes that very much can pass arguments in registers (e.g. .NET).

I'm not familiar with the issue: what makes Java's lambdas/closures fake? Is it that bound variables need to be effectively final?

I don’t know if they’ve done anything new, but as originally implemented, they were inner classes.

The inner class gets copies of the variables, so imperative code that wants to reassign them isn't allowed because it probably won't do what you expected.

The goal is not to GC stack frames. But I'm not sure why they didn't create an inner class to hold the closed-over variables in non-final fields (moving them from the stack to the heap) for both the function and all closures it creates.

(Obligatory "doctor, it hurts when I use mutable state!")


> Is it that bound variables need to be effectively final?

I believe this is it.


Even with store-load fw, you get a penalty (~3 cycle latency) over register accesses, no?

yeah, but it's cheaper than full L1 hit, which is where it would go if not for that.

I was trying to cite a typical full L1 hit latency... I thought store-load forwarding simply avoids having to flush the complete write buffer before the access is even possible, which risks taking far more than ~3 cycles. Now maybe it can be faster than an L1 hit in some cases, I don't know.

Edit: it seems that store-load forwarding is actually slightly slower than L1: https://www.agner.org/optimize/blog/read.php?i=854#854


I'm guessing that the reason was simply ease of porting 32-bit x86 assembly code to 64-bit.

I think by doing everything their own way, they are not shackled to all of these dependencies - especially to some rusty old C++ compiler. That way, among other benefits, they get some very nice compiler speeds.

I installed golang the other day to check it out for the first time. For whatever reason, I chose to input the 'Hello world' program from golang.org by typing it in manually. As with most C/C++ code I would typically write, I put the brackets on their own lines.

Welp, so much for Go.


Like all opinionated formatters, you adjust to it, or you don't. I don't hate gofmt, other than tabs. Sweet jesus.

I'm not sure why people are so worried about the size of the executable file here. If the runtime.pclntab table is never[1] used then it won't be paged into memory, and disk space is mostly free these days.

[1] Well, hardly _ever_! (Sorry not sorry for the obligatory Gilbert and Sullivan reference.)

If you're using the Go executable on a system without virtual memory support, yeah, that's going to suck, but it appears the Go runtime is horribly bloated and not really suited for super-tiny 16-bit processors in the micro-embedded space. But for something like Cockroachdb, why worry about the file size?


I used to think that, but now with containers it's annoying to have to wait for a big binary image to get copied to the node and loaded up.

> disk space is mostly free these days.

This is the only "argument" ever presented, and I don't think it is any good. I care about file sizes. I want to get the most out of my hardware. Not needing to buy another drive is always going to be cheaper for me and every other user.


> Not needing to buy another drive is always going to be cheaper for me and every other user.

128GB+ drives are standard on mid-range laptops. Even at 64GB are you really going to fill up disk space because of Go executables?

CockroachDB (a large software project) is only 123MB. I doubt most people even have 100 pieces of non-default software on their laptop or that executables are going to fill up storage and break anyone's bank these days.

If you're short on disk space, photos and videos are typically the things to target, not software.


Then don't use Go, you aren't their target audience in that case. And I don't mean this in a harsh way, just that Google is clearly opinionated in how they are building Go.

While it's true that disk space is virtually free, that is not true for bandwidth.

Bandwidth [to transfer big binaries around] is not free however.

If you’re using something like GCP Cloud Run to execute containers on demand, cold start time (which affects both new invocations and scaling events) is directly impacted by container size. As you said, not as much of a concern for a database, but extremely relevant for an HTTP server.

Also if you have multiple instances, I guess it is better to not allocate N versions of the same thing in anonymous memory.

Since Go is statically typed, the runtime data should be constant. Couldn’t a copy on write cache mean that the logical RAM redundancy doesn’t actually affect real memory?

If you have to decompress it at startup, you will typically do it into anonymous memory. You can attempt to be fancy at user level with silly tricks like putting it in shared memory, although I don't know the APIs of common ones well enough to know if that's even possible in practice, because of all the details to handle (refcounting the users, with automatic destruction when the last one closes, and making that atomic with the creation of exactly one instance when none exists, etc.).

Ideally, to get all the optimizations, you would want some compression support at the FS level, or even a specialized mapper in the kernel (or in cooperation with the kernel) for data coming from executable files, but this brings added complexity.

(Thinking more about it a solution involving a microkernel would be really cool, but I digress...)


I guess this would specifically be a benefit of fork/exec. Though would it need to be decompressed after 1.2? My assumption was that it trades speed for memory on the first launch, and that memory in subsequent launches would be virtual only.

There is a project called TinyGo [0] which brings Go to embedded systems, where binary and memory size matter even more.

[0] https://archive.fosdem.org/2019/schedule/event/go_on_microco...


The next time you need to make an HTML treemap like this, try my tool: https://github.com/evmar/webtreemap

It provides a command line app that accepts simple space-delimited data and outputs an HTML file. See the doc: https://github.com/evmar/webtreemap#command-line

(It also is available as a JS library for linking in web apps, but the command line app is the one that I end up using the most. I actually built it to visualize binary size exactly like this post and then later generalized it.)


Another option is to generate a text file in the format expected by flamegraph. Especially useful when the data is hierarchical.

https://github.com/brendangregg/FlameGraph

Example for Java:

https://github.com/pcdv/deps-flamegraph/blob/master/README.m...


Any plans to fix (or close if they are fixed) the existing issues?

The author guessed a few things wrong:

* fmt.Println pulling in 300KB isn't proof that Go's standard library isn't "well modularized". It's the wonders of Unicode and other code that is actually used.

* 900K for the runtime isn't surprising when you have complex garbage collection and goroutine scheduling among other things


> fmt.Println pulling in 300KB isn't proof that Go's standard library isn't "well modularized". It's the wonders of Unicode and other code that is actually used.

I would guess a large part is that it has to pull in the entire reflection library to support format verbs like %#v which renders the argument as a Go literal.
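
A toy example of what %#v does (and why it needs reflection):

  package main
  import "fmt"
  type user struct {
      Name string
      Age  int
  }
  func main() {
      u := user{Name: "gopher", Age: 10}
      // %#v walks the value via reflection and prints it as a Go literal:
      // main.user{Name:"gopher", Age:10}
      fmt.Printf("%#v\n", u)
  }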


Great writeup. I believe there is an open issue from Rob Pike from 2013 that this would fall under: https://github.com/golang/go/issues/6853

Except that this write up says this was a deliberate design decision to trade space for speed. That's not something to be fixed, unless you convince Go to make different trade offs or to provide more optimization options.

The linked issue is tagged "NeedsFix".

You can compress with upx (at the cost of increased startup time in the order of hundreds of ms, which is okay for servers) and/or not include all debug symbols. Doing both usually shaves >60% off a binary.
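
Something along these lines (sketch; -s drops the symbol table, -w drops DWARF, and upx then compresses whatever is left):

  go build -ldflags="-s -w" -o myserver .
  upx --best myserver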

UPX transforms demand-paged, reclaimable page cache memory into a blob of unreclaimable anonymous memory.

It makes no sense for most use cases where I’ve seen it. It adds runtime costs both in terms of start-up and memory usage.

Maybe it helps in terms of binary sizes for downloads — but those are often compressed anyways! E.g. Your docker images are compressed and UPX’ed binaries in a layer aren’t buying you anything (just adding runtime costs).


I have one system I target where, bizarrely, persistent storage is the bottleneck, not volatile memory or startup speed. In this one case, UPX makes a whole lot of sense.

For a lot of applications the increase in startup time and memory usage is negligible. The startup time is increased by dozens or hundreds of ms, which is not a lot for a server application. I tried to measure the increase in memory, but wasn't really able to, so either it's a very subtle difference or it's very small.

Not everything uses Docker.


UPXed binaries have their code pages mapped as dirty - this means the OS can't page them back to disk if it needs or wants to. In some cases, that's an acceptable cost to pay - for low-latency servers you might want to mlock all your executable pages so there's no risk of a page fault and disk read killing your tail latency. Of course if you're doing that then you have to either pay the full cost of your binary size in memory, or you have to have some warm-up phase and then hope that everything you need is loaded by then. In the first case you suddenly care a lot about binary size, because memory is quite a bit more expensive than disk.

But one valuable reason to use something like UPX is that you can attach a crappy and thus inexpensive disk to servers that you're not using for actual storage. Compression on disk lets you load from a slow disk faster, and if you weren't paging to disk anyway then UPX doesn't have much of a cost.

But if you're on a traditional desktop operating system, UPX will increase your effective memory footprint, and force writing to swap instead of merely dropping pages. On Android, which doesn't swap, you'll significantly increase your memory footprint.


UPX makes sense if you're trying to fit your executable onto fixed-size media from which it needs to execute (e.g. a floppy disk or USB drive), and almost nowhere else.

Reduced transfer times are good. Not everyone has a 100MB/s internet connection (in fact, most of the world doesn't).

People don’t often directly download and run software binaries. On Windows, they download .msi packages, or .exe installers with embedded MSIs. On Linux, people download .deb or .rpm packages. All of the above packaging formats are already compressed.

Also, even if you publish raw binaries without an installer package, HTTP protocol supports compression. Usually quite easy to implement, couple lines in a web server config.


Yes, there are many solutions. upx just gives you an easy, transparent, and convenient way to compress a binary without worrying about web server configurations and whatnot.

I used UPX on Windows some 15 years ago and liked it a lot in those days. My primary motivation was not network transfer speed; it was HDD storage bandwidth, and especially latency. It was faster to sequentially read the complete binary than to read individual memory pages as required.

Nowadays disk space is very cheap, disks are often solid state with ridiculously high IOPS, but antivirus software became much worse and likely to mark a UPX compressed binary as malware.


Okay, so transfer it using `rsync -z` to compress it over the wire, or gzip it first, and extract it on the other end.

gzip probably gives you better compression ratios too.


<deleted>

Edit: somehow read “doing both” as “doing either”. Just ignore this.


UPX compresses the binary and introduces a decompression step on startup, so you run exactly the input binary

Stripped means that the compiler does not include debug symbols.

They are completely different - you can use either or both.


That's already 30% off for free, now run that through upx [1]

1: https://upx.github.io/


Being an Electron and a Go developer, I'm not complaining too much about the size of Go executables. Electron-backed software on the other hand...

Have you tried this? I've played around with it in the past.

https://github.com/asticode/go-astilectron


Would it be possible to add a flag for the compiler to disable the line table pre-expansion?

I can’t exactly tell if we’re saying the same thing but my thought was a flag to switch between the 1.2 way for faster startup and the earlier approach for longer running processes. The trade off is added complexity in identifying your binary usage patterns and keeping both methods in the tooling.

These kinds of changes may not be breaking in a technical sense, but it's very unexpected behavior if you're one to notice patterns like file sizes changing in such a significant way over time. An answer of "stick with v1.1x indefinitely if you want the old behavior" only feels like a very temporary answer.


That was my question but I would imagine the problem is then that you can't debug production. It looks like a more common solution is dSYM, .dbg, .pdb or other things (read from another commenter)

70MB of source seems somewhat bloated for such a project, no?

For an ACID-compliant, resilient, consistent, distributed, auto-sharding, auto-tuning-for-low-latency, highly scalable SQL database? No.

70MB of source is such an extreme amount I don't know how it could be reasonably justified, there must be an enormous amount of waste. All of sqlite is 6MB.

When you have the features of sqlite, you can have the size of sqlite.

PostgreSQL is 36MB. [1] Granted, Go is much terser than C and has a larger standard library, but we're not at absurd levels.

[1]

     find src -name '*.c' -o -name '*.h' | xargs cat | wc -c

It also seemed large to me so I poked around. 36 MB are just the .go files.

In comparison, postgresql has about 38 MB of .c and .h files.

So, my mind is changed.
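
For reference, I measured it roughly the same way as the footnote above:

     find . -name '*.go' | xargs cat | wc -c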


There is a lot of copy/paste in Go.

As a Go developer, templates seem like a good idea more and more each day.

Have you seen idiomatic Go? The error handling alone is a verbose mess.

> runtime.pclntab

Ah yes, an example of Google's long and descriptive identifier names, a lament of which surfaced here recently. (https://news.ycombinator.com/item?id=21843180)

> I’m glad we now live in a futuristic utopia where keyboard farts like p, idxcrpm, and x3 are rare.


What's interesting is that Go encourages letter soup while the rest of Google styles encourage long descriptive names.

Sometimes I think it was intentionally done to troll the rest of Google that used long_descriptive_names_ with an 80 col limit (Java argued up to 100 but no one else did, not even JS which uses Java's long names).

I always thought it was funny walking around the office seeing everyone with their squished and wrapped code on a huge 32in monitor. Lots of the code in google3 looks like haiku squished to the right margin.


Excellently thought out. Very rational. Truly a language for the twenty-first century.

So what is the solution then? Will they just have to fork Go and compress the table again like before? It's completely insane that it would eventually surpass the size of the program itself.

Is the line table in Go executables absolutely essential? Shouldn't it be strippable, the way you can strip debug symbols from a C binary?

They are, and many people don't include them in production builds. This article is incomplete for not mentioning it.

I did a quick search and turned up nothing about stripping the pclntab (note: distinct from the DWARF line-number tables which can be stripped). A post on Google Groups suggests the opposite - pclntab cannot be stripped because the runtime needs the info for GC - https://groups.google.com/forum/m/#!topic/golang-nuts/hEdGYn....

You're right, but that wouldn't make the main problem go away though. I just built a simple plugin for Kubernetes's kubectl and it's about 32MB with go build's -ldflags -s -w where 16MB of that is still the pclntab mentioned in the article.

The problem with removing that completely is that you won't get any information on panics. I don't think this is what you really want, and the current behaviour strikes me as a reasonable middle ground.

32M is really large for a simple plugin, and to be honest I think that says just as much about Kubernetes as it says about Go.


Sidenote: currently trying to get up to date on the best way to get distributed ACID key-value storage these days. Is Cockroach the new standard? I tried to find benchmarks comparing it to things like Postgres for various use cases but only found articles that read like ads.

> the best way to get distributed ACID key-value storage

You will need to define "best way", "distributed" and "ACID" based on your requirements.

For most people, multi-master MySQL and Redis with vector clocks is a great combination.

> Is Cockroach the new standard

No database less than 5 - 10 years old works in production, so no, it's not a standard.

> I tried to find benchmarks comparing it to things like Postgres for various use cases

Well, without more detailed requirements, good luck. Also comparing pg to distributed databases is ... like comparing apples and oranges.

Source: experienced DBA.


Go is a hybrid language.

It's not a JVM, but its runtime has JVM-like features such as garbage collection and reflection, and also a thread-scheduling system for goroutines.

I love the fact that it is monolithic in nature. One exe is all you need no matter which platform you use. Everything is statically compiled into the binary.

No bundling the jvm and a load of jar files, or lib*.so dependencies.


Toolkits to precompile a Java app into a single native binary with the runtime and stdlib inside have been around for 20 years or so.

> there is about 70MB of source code currently in CockroachDB 19.1

That's what is insane here; way more so than Go executable size issues.


TLDR: They included debugging information when they did not intend to.

What's the fix?

Don’t use Go.

Because Go isn’t C?

I think the Go runtime libraries are linked into the binary, so there are no external dependencies for Go itself.




