> prior to 1.2, the Go linker was emitting a compressed line table, and the program would decompress it upon initialization at run-time. in Go 1.2, a decision was made to pre-expand the line table in the executable file into its final format suitable for direct use at run-time, without an additional decompression step.
This is a good choice, I think, and the author of the article missed the most important point: it uses less memory to have an uncompressed table.
This sounds paradoxical but if a table has to be expanded at runtime
then it has to be loaded into memory.
However if a table is part of the executable, the OS won't even load
it into memory unless it is used and will only page the bits into
memory that are used.
You see the same effect when you compress a binary with UPX (for
example) - the size on disk gets smaller, but because the entire
executable is decompressed into RAM rather than demand paged in, it uses more memory.
If you decompress it to an mmapped file, it'll be one of the first things written to disk under memory pressure anyway, and instantly available in normal situations.
With the ever-decreasing cost of flash and its ever-increasing speed relative to the CPU, though, compression is not really worth what it used to be to startup times 10 years ago.
Swapping is fine for workstations and home computers. But high-performance machines running in production environments will absolutely have swap disabled.
The performance difference between RAM and disk is not an acceptable tradeoff. RAM will be tightly provisioned and jobs will be killed rather than letting the machine OOM.
Generally yes. Swapping changes the perf characteristics of that process (and often any other process on the same machine) in unpredictable ways. It's better to have predictable process termination -- with instrumentation describing what went wrong, so capacity planning and resource quotas can be updated. The process failure would generally be compensated-for at a higher level, anyway.
Virtual memory works even without swap enabled. Since the mapped file is the binary, and code is never changed after loading, the OS can simply reclaim the pages backing that memory. When there is a page fault, the page will be brought back in.
Is it just me, or should something like runtime.pclntab not be included in production builds at all?
I mean, it makes sense while you're developing and testing, but it should be reasonably possible to strip it from production binaries and put it in a separate file instead, so that if you do get a crash with a stack trace, an external script can translate the program counters to line numbers, rather than having it embedded in every deployed binary.
The Go language literally requires that pclntab be included in release builds. I'm with you—it seems kind of crazy that this was designed into the language—but there you have it.
The reason is that Go's standard library provides functions that allow retrieving a backtrace and symbolicating that backtrace at runtime:
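For the curious, a minimal sketch of what that looks like, using runtime.Callers plus runtime.CallersFrames from the standard library (the symbolication draws on the pclntab data the article is about):

    package main

    import (
        "fmt"
        "runtime"
    )

    // printTrace captures and symbolicates the calling goroutine's stack
    // at run time, using only data embedded in the binary.
    func printTrace() {
        pc := make([]uintptr, 32)
        n := runtime.Callers(2, pc) // skip runtime.Callers and printTrace itself
        frames := runtime.CallersFrames(pc[:n])
        for {
            frame, more := frames.Next()
            fmt.Printf("%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
            if !more {
                break
            }
        }
    }

    func main() { printTrace() }

CallersFrames is also the piece that deals with inlined frames correctly.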
Unlike in C or C++, where you can link in something like libbacktrace [0] to get a best-effort backtrace, those Go functions are guaranteed to be correct, even when functions have been inlined. This is no small feat, and indeed programs compiled with gccgo will often be incorrect because libbacktrace doesn't always get things right when functions have been inlined.
Is it that common for these functions to be used? Perhaps transitively through some popular libraries? Just being in the standard library doesn't necessarily mean that these functions (and the data they need) have to be included by the linker.
Any program that uses a logging framework, including the stdlib log package, will wind up depending on runtime.Callers at least transitively. That’s probably most Go programs; certainly most of the programs large enough to be worrying about binary size.
Unlike in C, there are no macros like __FILE__ and __LINE__, so there is no alternative to runtime.Callers (short of preprocessing your Go source code).
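A small illustration of the difference (runtime.Caller is also what the stdlib log package calls when Lshortfile/Llongfile is set):

    package main

    import (
        "fmt"
        "log"
        "runtime"
    )

    func whereAmI() string {
        // runtime.Caller(0) reports the file and line of this call site --
        // the run-time equivalent of C's __FILE__ / __LINE__.
        _, file, line, ok := runtime.Caller(0)
        if !ok {
            return "unknown"
        }
        return fmt.Sprintf("%s:%d", file, line)
    }

    func main() {
        fmt.Println(whereAmI())

        // The stdlib logger obtains file:line the same way when asked to:
        log.SetFlags(log.LstdFlags | log.Lshortfile)
        log.Println("hello") // e.g. "2009/11/10 23:00:00 main.go:24: hello"
    }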
You can still get a backtrace without symbols though.
Why couldn't the Go team introduce a flag that strips symbols, while making clear to people that they should only use it if they are okay with backtraces looking like
#1 0x00007ffff7ddb899 in __GI_abort () at abort.c:79
#2 0x0000555555555156 in ?? ()
#3 0x0000555555555168 in ?? ()
#4 0x000055555555517d in ?? ()
#5 0x0000555555555192 in ?? ()
Because one of the mantras of Go is not telling users to have a "debug" build and a "release" build. The development build is the one that goes into production, with no difference in optimizations, symbols, and whatnot. This has pros and cons, like all tradeoffs.
Are you sure this is true? Doesn't delve, for example, build Go source with special flags (gcflags=all='-N -l') to generate debugging symbols? I also remember having to build Go code with those flags for Stackdriver to get the correct debugging information without any optimisations.
A backtrace can be amazing at run time if you're going to put that info into logs. It makes finding "fringe" errors a lot less painful or more easily reproducible.
Not sure if it is just you, but normally you DO want this information in the production build. It is quite a bad situation to have a runtime exception in PROD and have no idea how it happened. Sure, there is defensive programming and checks and asserts, but most of the time you cannot foresee everything.
I get the point about external symbols and location database, but oftentimes time is precious and having fully laid out stack trace in the log will allow you to get to the root much faster.
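To make that concrete, a common pattern (just a sketch, with made-up function names) is to capture and log the symbolicated trace at the point a panic is recovered:

    package main

    import (
        "log"
        "runtime/debug"
    )

    func handleRequest() {
        defer func() {
            if r := recover(); r != nil {
                // debug.Stack() returns the formatted stack trace of the
                // current goroutine, symbolicated from data in the binary.
                log.Printf("panic while handling request: %v\n%s", r, debug.Stack())
            }
        }()
        doWork()
    }

    func doWork() { panic("something went wrong") }

    func main() { handleRequest() }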
> I get the point about external symbols and location database, but oftentimes time is precious and having fully laid out stack trace in the log will allow you to get to the root much faster.
You can also set up a service that automatically symbolicates everything in a log file as soon as it is generated, before a human ever even looks at it.
Granted, yes, this is slightly more complicated, but the point is that the toolchain should let the developers choose which strategy they want to use.
Not a parent commenter but I interpreted that as: given choice between alternatives A (fast and good enough) and B (more complex and much better) you would want to choose B but end up doing A for various reasons (lack of time, unclear ROI etc)
In my experience you would strip the symbols out of the prod binary, and save them separately somewhere.
Then your production binary will give you stack traces like [0x12345, 0xabcde, ...], but you can use the separately-stored files to symbolicate them and get the source file/line info.
Not sure if this is possible on all platforms but it at least is for all combinations of {C, C++, Objective-C, Rust} and { Linux, macOS, iOS } .
And if that added operational complexity is not worth the size savings, you can freely choose not to do it, and things will work like they do in Go.
That is debug information. Just have it stored elsewhere (not in the binary you ship everywhere) and use it in conjunction with your core dump to debug.
Separable debuginfo which can be loaded at runtime.
DWARF uses an efficient compression mechanism much smarter than a table for this sort of mapping. And of course things like coredumps and crash dumps being sent to automated processing where devtools have the full debug symbols, while production deployments do not.
Go's insistence not just on reinventing the wheel but on actively ignoring core infrastructure improvements made in the last 20 years is bizarre.
A lot of them have symbol files separate from the binary. Unixy tooling doesn't do this by default but for example objcopy(1) in binutils can copy symbols to another file before you run strip(1), and on Mac my memory is rusty but I think it may be dsymutil(1) that lets you copy to a .dSYM bundle. Microsoft has its .pdb files and never even keeps debug info inside the binary proper.
The debug info is in a separate file. You only need that file when you’re inspecting a crash report, so it doesn’t need to be pushed out to the host device(s).
Because it's not a problem, so everybody does the same. And it's not about the programming language, it's about the programmer's choice: if they want debug info inside a production program, the language lets that happen. In today's age the size of your executable is a non-issue. The only issue should be your performance.
Here is an example from my past. As an embedded programmer I went and manually added a hundred lines of constants, which initially were just an array generated at startup, and increased the code size by about 5%. Why? Because I gained 5 ms in execution speed, and in the embedded world that's huge, especially when your code is executed on the lowest 10 ms cycle. So the department head approved such a huge change, because code size doesn't matter (you can always buy a bigger chip), but if your car doesn't start because the watchdog keeps resetting the on-board computer, then speed of code execution is everything.
> Because it's not a problem, so everybody does the same. And it's not about the programming language, it's about the programmer's choice: if they want debug info inside a production program, the language lets that happen. In today's age the size of your executable is a non-issue. The only issue should be your performance.
I do not know much about go, but languages like C++ and Java give you the tools to make tradeoffs appropriate to your situation: externalizing or stripping symbols and/or debugging information.
There are very different production scenarios. In many of them no one will ever look (or even be able to look) at a stack trace if it crashes after it's shipped (at best you'll record bug reports from customers and attempt to reproduce them on your test hardware), so the debug information is literally useless there. And these are the same scenarios where an extra 50MB of disk and memory matters more than for some software running in a cloud environment.
I was pleasantly surprised how good Microsoft's tooling was around firing up a debugger to examine the final state in a crash dump using external symbols from that build. Everything seemed to work except you couldn't resume any thread. I agree symbols don't need to be embedded in every running binary, but having a warm copy somewhere can be pretty helpful.
This is where letting a large enterprise guide the development of a piece of widely-used software becomes questionable. At a FAANG the constraints are fundamentally different.
At work I routinely see CLI programs clocking in at a gigabyte, because it was simpler to statically link the entire shared dependency tree than to figure out the actual set of dependencies, and once your binary grows that big, running LTO adds too much time to one's builds. And disk space is basically free at FAANG scale...
Disk space in general is pretty much free these days. 123MB for a whole database is really not that big of a deal, IMHO. For example, my local PostgreSQL docker image is 140MB plus the alpine OS (5MB). And the Ruby on Rails application using that image clocks in at a little over 1GB, also using the ruby alpine image as a base (50MB).
With my company, the cost really started to become a burden with data transfer. But transferring images to and from the AWS container registry is so expensive that we actually build production images inside the Kubernetes cluster (plus the cluster has access to all the secrets and stuff), even though it was a bit harder to implement.
If you're FAANG and you can run your own stable and highly-available cloud, data transfer rates don't matter, so you can deploy your application the "right" way in a containerized world.
Yes but when running a long-lived application on a lot of data it's typically important to keep the executable small both so it can be "hot" and to leave more room for data. At scale this could be even more important, not less, than for a smaller operation.
Of course the real (i.e. explicitly stated by Pike) driver for go was the assertion that inexperienced new hires write poor code and so the harder it is for them to get into trouble the better, even at the cost of other issues.
> Yes but when running a long-lived application on a lot of data it's typically important to keep the executable small both so it can be "hot" and to leave more room for data.
Who really runs server applications these days where data/rss are not a large multiple of the actual code segments? People happily run JVM server processes these days, how much code/data do you think that pulls in just to start up?
Executable size on disk doesn't dictate effective size at higher levels of the memory hierarchy.
IOW, you're paying by the page/cache line. If the extra bloat (debug information) in your executable isn't interspersed within your actual code (it shouldn't be), you aren't paying for it in runtime efficiency.
Not your usual use case, but the binary for running the Bloomberg terminal is well over 1GB (in fact, running into the 4GB executable boundary was a problem).
I never saw a single tool that big at Amazon either, but the general concept held. The AWS service I worked on had ~1,000 jars in its dependency closure. All of those had to be uploaded to S3 after each build, and then downloaded onto each EC2 instance during deployment.
We're talking on the order of a terabyte of data transfer each time we deployed to a thousand instance fleet (ideally deploying weekly)
LTO is no magic bullet for binary size either. A binary that does nothing will still link in the whole C library. It doesn't end up decreasing large program sizes that much in my experience either.
I'd argue the opposite -- as a startup, you can't afford to micro-optimize. Labor, time and opportunity cost dwarf all but the grossest resource waste. If you need to use 100GB/k8s node instead of 50GB, it will have 0 effect on the success of your venture.
At Google scale, it becomes worth it to optimize:
- You are delivering more product per engineer, and more product means more resources. Instead of a single customer instance which costs $1/month more, you have 100,000 customer instances, costing a significant amount more. It becomes worth trimming margins.
- You have economies of scale, and it might be worth it for an engineer to spend a month trimming 2% of the cost of a software deliverable.
The common refrain for startups is "do things that don't scale", and this is for good reason. Google has to actually worry about fixing things AFTER they are scaled.
There are plenty of efforts and tooling that reduce the number of dependencies and the deploy size of binaries at Google. The notion that we don't care about size isn't true.
But it's true that it's not worth optimising first, it's done by first evaluating the impact across the fleet and then prioritising the most effective changes.
A 125MB binary is relatively beefy, but still easily fits in RAM. The amount of disk that you're spending on a single executable (your database, in this instance) is tiny in comparison to the amount of data stored in that database.
It's definitely worth it for Google to trim 2% off of their storage requirements - but if your binary is 0.1% of your storage, it's barely even worth glancing at.
This is where go’s insistence on reinventing the wheel feels terribly misplaced. Every major debug format has a way to associate code locations with line numbers. Every major debug format also has a way to separate the debug data from the main executable (.dSYM, .dbg, .pdb). In other words, the problem that the massive pclntab table (over 25% of a stripped binary!) is trying to solve is already a well-trodden and solved problem. But go, being go, insists on doing things their own way. The same holds for their wacky calling convention (everything on the stack even when register calling convention is the platform default) and their zero-reliance on libc (to the point of rolling their own syscall code and inducing weird breakage).
Sure, the existing solutions might not be perfect, but reinventing the wheel gets tiresome after a while. Contrast this with Rust, which has made an overt effort to fit into existing tooling: symbols are mangled using the C++ mangler so that gdb and friends understand them, rust outputs nice normal DWARF stuff on Linux so gdb debugging just works, Rust uses platform calling convention as much as possible, etc. It means that a wealth of existing tooling just works.
I am not a fan of Go, and I also wish these things were true (and more[1], actually), but I find it hard to agree that its priorities are "terribly misplaced." Inside the context of Go's goals (e.g., "compile fast") and non-goals (e.g., "make it easy to attach debuggers to apps replicated a zillion times in Borg") these trade-offs make a lot of sense to me. Like: Go rewrote their linker, I think, 3 times, to increase the speed. If step 1 was to wade through the LLVM backend, I am not sure this would have happened. Am I missing something?
I love Rust, but Go is focused on a handful of very specific use cases. Rust is not. I don't know that I can fault Go for choosing implementation details that directly enable those use cases.
Google's anointed production languages used to be five: C++, Java, JavaScript, Python, and Go. Not much to reasonably standardize across, especially if a standardized solution ends up with more compromises than a custom one.
Insert standardization XKCD. It's been tried. And even so, you can still use the "standard" coredump tool to analyze a Go program's coredump with decent success.
I totally agree with the above - was never able to click with Go _but_ I totally understand how reinventing the wheel has worked well for them.
The days when the Go project fired up were different than the days when Rust started. Rust made different tradeoffs by relying on LLVM and it has advantages (free optimizations!) and disadvantages of their own.
Well go also uses its own assembler, on top of that a kind of modified garbage version of real ones. You can only justify so many reinventions of the wheel, yet they redid everything.
Did they actually redo everything, or does it just look that way from starting from the Plan9 toolchain? Which could also be said to be re-doing everything, but from a much earlier starting point.
IIRC Go started out shipping with a port of the Plan9 C compiler and toolchain - it was bootstrapped by building the C compiler with your system C compiler, then building the Go compiler. Which, until re-written in Go circa-2013, was in Plan-9 style C. It all looks deeply idiosyncratic but it was a toolchain the initial implementors were highly familiar with.
Perhaps the other assemblers would not provide desired compilation speed?
Perhaps their IP requirements would not satisfy Google's lawyers?
Perhaps Go devs would rather have more control on the development of assembler by writing it from scratch to understand every design decision instead of inheriting thousands of unknown design decisions?
I don't know. Neither do others outside of the project.
I find these baseless micro-aggressions against Go misplaced and unfruitful.
> I don't know. Neither do others outside of the project.
> I find these baseless micro-aggressions against Go misplaced and unfruitful.
Huh? OK, then Go is perfect because it is developed in secret.
We are discussing here; I'm not "micro-aggressing" anyone. If I don't like a design / re-implementation decision, and I'm in the mood to share that opinion with this cyber-assembly, I do it. And I expect developers not to be offended by me having a technical opinion, and I expect third parties to be even less offended. And yes, it might be a bad opinion in some cases. I'm not even 100% sure that's not the case here, because, like you said, they could have had some kind of justification to do that. But I suspect it is extremely rare to have a good justification to rewrite an assembler, with really big quirks on top of that, as they did.
Yes, you absolutely did, by stating (not just implying) that any criticism of the project that does not take its internal decision making into account is "baseless micro-aggression".
A lot of the insularity and weirdness comes from the Plan 9 heritage. Go's authors (Rob Pike, Ken Thompson, and Russ Cox) cannibalized/ported a bunch of their own Plan 9 stuff during initial development. For example, I believe the original compiler was basically a rewrite of the Inferno C compiler.
This is a large part of why Go is not based on GCC or LLVM, why it has its own linker, its own assembly language, its own syscall interface, its own debug format, its own runtime (forgoing libc), and so on. Clearly Go's designers were more than a little contrarian in their way of doing things, but that's not the whole answer.
Being able to repurpose existing code is an efficiency multiplier during the bootstrapping phase. But when bootstrapping is done, you have to consider the ROI of going back and redoing some things or keep a design that works pretty well. The Go team is undoubtedly aware of some of these issues, but probably don't consider them to be a priority.
In some cases the tools are a benefit. Go's compiler and linker are extremely fast, which I appreciate as a developer. A possible compromise would be to offer a slower build pipeline for production builds, which made use of LLVM and its many man-years of code optimizations.
Personally, I rather wish Rust would take this approach. Rust desperately needs a fast, developer-oriented compiler. The slow compile times are potentially Rust's biggest flaw, to the point where I find it keeps me off the language for anything non-trivial. Even better might be a Rust interpreter, so you'd get a REPL and fast development cycles.
This is why Cranelift is being worked on. There is also a Rust interpreter, miri.
I think starting with LLVM was the right decision (and one that I was primarily responsible for). Rust would lose most of its benefits if it didn't produce code with performance on par with C++. LLVM is not the fastest compiler in the world (though it's not like it's horribly slow either), but its optimization pipeline is unmatched. I don't see replicating LLVM's code quality as feasible without a large team and a decade of work. Middling code gen performance is an acceptable price to pay until we get Cranelift; the alternative, developing our own backend, would mean not being able to deploy Rust code at all in many scenarios.
Forgive me if this is ignorant, since I haven't done any benchmarks on this in a while, but doesn't GCC produce slightly faster code on average across a wide set of benchmarks compared to clang/LLVM?
At least in the benchmarks that Phoronix runs from time to time (so they can at least be compared to their older selves), LLVM, in its Clang incarnation, is finally getting some parity in execution times with GCC.
Thanks for the pointer! I was unfamiliar with Cranelift and it seems like a promising tech. I'll keep an eye on it in hopes that once it is stable I'll be able to put together a development environment that allows for the fast turnaround I prefer.
I have not used Rust for anything very large, but using an editor that supports the Rust language server mitigates the compile-time problem. In VSCode it shows you the compiler warnings and errors as you are editing a file. There is a little lag in updating, but the workflow is faster than switching to a terminal to do a full compile.
If you need any other evidence for this, just look at GOPATH and similar. That was plan9 through and through; they wanted to delegate work to the filesystem. No need for a package manager or anything, just pull down URIs and they'll be where Go wants them to be.
Go has had to walk back on some of its choices recently; most notably on platforms without a stable syscall ABI and a very strong push for dynamic linking (…so macOS) they link against the system libraries.
The only popular platform with a stable syscall ABI is Linux. This is a product of the historical accident that Linux doesn't control a libc and ensuing drama.
Almost everyone else doesn't have a stable ABI below the (C) linker level.
It does - the syscalls are part of the official userspace interface which the Linux kernel promises not to break. They can add new syscalls, options or flags, but can’t break existing ones.
> platforms without a stable syscall ABI and a very strong push for dynamic linking (…so macOS)
That's an even better description of Windows. The macOS system call table isn't officially stable, but it's at least slow to change. The Windows equivalent has been known to change from service pack to service pack.
That being said, yes, we see integration into the parent platform as being an important design constraint for Rust. I think Go made reasonable choices for what they're trying to do, though. It's all tradeoffs.
For those who don't know, it's also worth mentioning that while it's called the "Itanium" C++ ABI (because it was designed originally for the Itanium), it's nowadays used for every architecture on Linux.
It makes the Go concurrency mechanism possible; this is not just some kind of whim.
Most importantly, this allows the scheduler to hook syscalls in order to schedule another goroutine. But it also allows controlling what happens during a syscall, since libc tends to do more than just call the kernel in its syscall wrappers, which might not be thread safe or might not play well with stack manipulations.
The problem is that there is exactly one OS that maintains the system call ABI as a stable API: Linux. On other systems, trying to invoke the system calls manually and bypassing the C wrapper opens you up into undefined behavior, and this was particularly problematic on OS X, which occasionally made assumptions about the userspace calling code that weren't true for the Go wrapper shell, since it wasn't the expected system wrapper library.
The problem is that that one OS is right. System calls form an API and it needs to be stable and managed. We (developers) have been working on this issue for years and have at least attempted solutions (e.g. semantic versioning), while most OS developers feel free to break them on a whim. It is a terrible practice that forces others to spend their time working around it.
I would note that system calls only form an external API if the developer says they are an external API. Which it is for Linux but for other OSes the external API is a C library, kernel32, etc.
But right and wrong aside, there's the practical matter of reality. You can't simply pretend everything works how you want them to. At the end of the day you have to deal with how they actually work.
The Linux model is not the "right" one, it's a choice that they've made. Just like static linking isn't the "right" choice either, it's an option with its own drawbacks. Other OSes provide an approved, API stable layer to access the OS; it's just not the syscall layer.
The Linux approach reflects the social structure this project has been developed in.
The Linux project needs to be able to evolve independently of other projects, so they do just that.
This is a classic case of technical architectures following most of the time social structures.
> System calls form an API and it needs to be stable and managed.
In some cases (e.g., nearly all the other OSes...) system calls form an internal API, and they don't need to be stable; they actually don't even need to be accessible except to intermediate layers provided in a coordinated way.
No one here is disagreeing on the need for the operating system to provide a stable interface to applications: the question is where that stable interface should lie. Linux takes the most restrictive approach, asserting that the actual hardware instruction effecting the user/kernel switch is the appropriate boundary. OS X and Windows instead take the approach that there are C functions you call that provide that system call layer (these are not necessarily the POSIX API). OpenBSD and FreeBSD have the most permissive approach, placing it at an API, not ABI level (so the function calls may become macros to allow extra arguments to be added).
My preference is that the Windows/OS X model is where the boundary should belong.
There's no "right" about it. You're arguing that _having_ a stable ABI is important, and nobody is denying that. There are other ways to get a stable ABI. All of the other non-Linux OSes have one; they just guarantee it in a different place (generally in a userspace library that manages the syscall interface).
> On other systems, trying to invoke the system calls manually and bypassing the C wrapper opens you up into undefined behavior
It opens you up to a bit more of behavior changing in the future, but just a tiny bit more. No need to make a big deal out of it. It's a very normal thing in software. Nobody is going to promise you a perfect stable interface to rely on forever, not even Linux. But syscalls are actually pretty easy to keep up with, they change slowly, and it's easy to detect kernel version and choose appropriate wrappers to use with very little extra code.
OS X problem is its own thing. Apple making breaking changes is not a new thing. I use an Apple laptop super rarely and still got fed up with breaking changes, even not upgrading past 13.6 at the moment.
I agree with some of your points; but zero-reliance on libc is the reason why it's so easy to use Go in containers; and Docker is one of the primary reasons why Go is popular. It's what they have got right.
Statically linking libc is its own minefield. It can be and is done, but even if you statically link everything else, you should almost always dynamically link against your platform's libc.
Except that Docker was originally written in Java by the former team that actually started the project, and nowadays contains modules written in OCaml taken from the MirageOS project, for the macOS and Windows variants of Docker.
So how much they got right regarding Docker's success and Go is a bit debatable.
Kubernetes was never publicly available (open source) in any other language than Go. Early internal prototypes may have been in Java, but those bear no more resemblance to current Kubernetes than Borg does.
I'm just wondering, how is your line about the code taken from the MirageOS project relevant? Nobody uses the Windows and macOS variants of Docker in production.
At least that's the common complaint for Alpine docker images... It's based on musl, and half of the community always complains about serious performance degradation.
Reinventing the wheel is sometimes a feature - using other people's stuff, you gain their features, but you inherit their bugs, their release timelines, whatever overhead they baked in which they thought was okay, etc. You lose the ability to customize and optimize because it's no longer your code...
It's all just tradeoffs in the end - I think golang is finding some success because they didn't make the same tradeoffs everyone else did.
Let's not forget their attempt at inventing yet another Asm syntax for x86, when there is already the horrible GNU/AT&T as well as the official syntax of the CPU documentation.
And at the time Plan 9 happened it was hardly re-inventing anything either.
Intel's Asm syntax was defined in 1978 with the release of the 8086, and the 32-bit superset in 1985 with the 386. CP/M, DOS, and later Windows assemblers all used the official syntax.
The calling convention is a serious wtf. They're relying on store-load forwarding to make the stack as free as a register, but that's iffy at best and changes heavily between microarchitectures.
I'd assert the calling convention is strange by design: there is the underlying reality that, to support actual closures and lambdas as Go does (in the Lisp sense, not the fake Java sense), one can't use the C calling conventions. In particular, it's not true that a called function can expect to find bindings for its variables on a call stack, because of the upward-funargs issue: some bound variables for a called function, in the presence of true lambdas and thus closures, will necessarily NOT be found on the C call stack, because of the dissociation of scope from liveness in the presence of lambda (anonymous functions).
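A tiny sketch of the upward-funarg case being described: the returned closure keeps n alive after counter's frame is gone, so n cannot live on a C-style call stack; Go's escape analysis moves it to the heap.

    package main

    import "fmt"

    // counter returns a closure over n. The closure outlives counter's
    // stack frame, so n is heap-allocated rather than stack-allocated.
    func counter() func() int {
        n := 0
        return func() int {
            n++
            return n
        }
    }

    func main() {
        next := counter()
        fmt.Println(next(), next(), next()) // 1 2 3
    }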
What you describe is a non-problem: you can trivially spill upvars to the stack on-demand, as most compilers do, while keeping formal parameters in registers. Java needs upvars to be final because it doesn't have the concept of "reference to local variable", but that's just a limitation of the JVM, and one easily solved in other runtimes that very much can pass arguments in registers (e.g. .NET).
The inner class gets copies of the variables, so imperative code that wants to reassign them isn't allowed because it probably won't do what you expected.
The goal is not to GC stack frames. But I'm not sure why the didn't create an inner class to hold the closed-over variables in non-final fields (moving them from the stack to the heap) for both the function and all closures it creates.
(Obligatory "doctor, it hurts when I use mutable state!")
I was trying to cite a typical full L1 hit latency... I thought store-load forwarding simply avoids having to flush the complete write buffer before the access is even possible, which risks taking far more than ~3 cycles. Now maybe it can be faster in some cases than an L1 hit, I don't know.
I think by doing it everything their own way, they are not shackled to all of these dependencies - especially to some rusty old C++ compiler. That way, among other benefits, they get some very nice compiler speeds.
I installed golang the other day to check it out for the first time. For whatever reason, I chose to input the 'Hello world' program from golang.org by typing it in manually. As with most C/C++ code I would typically write, I put the brackets on their own lines.
I'm not sure why people are so worried about the size of the executable file here. If the runtime.pclntab table is never[1] used then it won't be paged into memory, and disk space is mostly free these days.
[1] Well, hardly _ever_! (Sorry not sorry for the obligatory Gilbert and Sullivan reference.)
If you're using the Go executable on a system without virtual memory support, yeah, that's going to suck, but it appears the Go runtime is horribly bloated and not really suited for super-tiny 16-bit processors in the micro-embedded space. But for something like Cockroachdb, why worry about the file size?
This is the only "argument" ever presented, and I don't think it is any good. I care about file sizes. I want to get the most out of my hardware. Not needing to buy another drive is always going to be cheaper for me and every other user.
> Not needing to buy another drive is always going to be cheaper for me and every other user.
128GB+ drives are standard on mid-range laptops. Even at 64GB are you really going to fill up disk space because of Go executables?
CockroachDB (a large software project) is only 123MB. I doubt most people even have 100 pieces of non-default software on their laptop or that executables are going to fill up storage and break anyone's bank these days.
If you're short on HD space, I'm typically targeting photos and videos, not software.
Then don't use Go, you aren't their target audience in that case. And I don't mean this in a harsh way, just that Google is clearly opinionated in how they are building Go.
If you’re using something like GCP Cloud Run to execute containers on demand, cold start time (which affects both new invocations and scaling events) is directly impacted by container size. As you said, not as much of a concern for a database, but extremely relevant for an HTTP server.
Since Go is statically typed, the runtime data should be constant. Couldn’t a copy on write cache mean that the logical RAM redundancy doesn’t actually affect real memory?
If you have to decompress it at startup, you will typically do it to anonymous memory. You can attempt to be fancy at user level with silly tricks like putting it in shared memory, although I don't know the common APIs well enough to know if that is even possible in practice, because of all the details to handle (reference counting the users, with automatic destruction when the last one closes, done atomically with the creation of exactly one when none exists, etc.).
Ideally, to get all the optimizations, you would want some compression support at the FS level, or even a specialized mapper in the kernel (or in cooperation with the kernel) for data coming from executable files, but this would add complexity.
(Thinking more about it a solution involving a microkernel would be really cool, but I digress...)
I guess this would specifically be a benefit of fork/exec. Though would it need to be decompressed after 1.2? My assumption was that it trades speed for memory on the first launch, and memory in subsequent launches would be virtual only.
(It also is available as a JS library for linking in web apps, but the command line app is the one that I end up using the most. I actually built it to visualize binary size exactly like this post and then later generalized it.)
* fmt.Println pulling in 300KB isn't proof that Go's standard library isn't "well modularized". It's the wonders of Unicode and other code that is actually used.
* 900K for the runtime isn't surprising when you have complex garbage collection and goroutine scheduling among other things
> fmt.Println pulling in 300KB isn't proof that Go's standard library isn't "well modularized". It's the wonders of Unicode and other code that is actually used.
I would guess a large part is that it has to pull in the entire reflection library to support format verbs like %#v which renders the argument as a Go literal.
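A small illustration of why: rendering a value as a Go literal means discovering its type and field names at run time, which fmt does through the reflect package.

    package main

    import "fmt"

    type point struct {
        X, Y int
    }

    func main() {
        p := point{X: 1, Y: 2}
        // %#v prints a Go-syntax representation; fmt walks the value with
        // reflection to find the type name and field names.
        fmt.Printf("%#v\n", p) // main.point{X:1, Y:2}
        fmt.Printf("%v\n", p)  // {1 2}
    }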
Except that this write up says this was a deliberate design decision to trade space for speed. That's not something to be fixed, unless you convince Go to make different trade offs or to provide more optimization options.
You can compress with upx (at the cost of increased startup time in the order of hundreds of ms, which is okay for servers) and/or not include all debug symbols. Doing both usually shaves >60% off a binary.
UPX transforms demand-paged, reclaimable page cache memory into a blob of unreclaimable anonymous memory.
It makes no sense for most use cases where I’ve seen it. It adds runtime costs both in terms of start-up and memory usage.
Maybe it helps in terms of binary sizes for downloads — but those are often compressed anyways! E.g. Your docker images are compressed and UPX’ed binaries in a layer aren’t buying you anything (just adding runtime costs).
I have one system I target where, bizarrely, persistent storage is the bottleneck, not volatile memory or startup speed. In this one case, UPX makes a whole lot of sense.
For a lot of applications the increase in startup time and memory usage is negligible. The startup time is increased by dozens or hundreds of ms, which is not a lot for a server application. I tried to measure the increase in memory, but wasn't really able to, so either it's a very subtle difference or it's very small.
UPXed binaries have their code pages mapped as dirty - this means the OS can't page them back to disk if it needs or wants to. In some cases, that's an acceptable cost to pay - for low-latency servers you might want to mlock all your executable pages so there's no risk of a page fault and disk read killing your tail latency. Of course, if you're doing that then you have to either pay the full cost of your binary size in memory, or have some warm-up phase and then hope that everything you need is loaded by then. In the first case you suddenly care a lot about binary size, because memory is quite a bit more expensive than disk.
But one valuable reason to use something like UPX is that you can attach a crappy and thus inexpensive disk to servers that you're not using for actual storage. Compression on disk lets you load from a slow disk faster, and if you weren't paging to disk anyway, then UPX doesn't have much of a cost.
But if you're on a traditional desktop operating system, UPX will increase your effective memory footprint, and force writing to swap instead of merely dropping pages. On Android, which doesn't swap, you'll significantly increase your memory footprint.
UPX makes sense if you're trying to fit your executable onto fixed-size media from which it needs to execute (e.g. a floppy disk or USB drive), and almost nowhere else.
People don’t often directly download and run software binaries. On Windows, they download .msi packages, or .exe installers with embedded MSIs. On Linux, people download .deb or .rpm packages. All of the above packaging formats are already compressed.
Also, even if you publish raw binaries without an installer package, the HTTP protocol supports compression. It's usually quite easy to implement: a couple of lines in a web server config.
Yes, there are many solutions. upx just gives you an easy, transparent, and convenient way to compress a binary without worrying about web server configurations and whatnot.
I used UPX on Windows some 15 years ago and liked it a lot back then. My primary motivation was not network transfer speed; it was HDD storage, bandwidth, and especially latency. It was faster to sequentially read the complete binary, compared to reading individual memory pages as required.
Nowadays disk space is very cheap, disks are often solid state with ridiculously high IOPS, but antivirus software became much worse and likely to mark a UPX compressed binary as malware.
I can’t exactly tell if we’re saying the same thing but my thought was a flag to switch between the 1.2 way for faster startup and the earlier approach for longer running processes. The trade off is added complexity in identifying your binary usage patterns and keeping both methods in the tooling.
These kind of changes may not be breaking in a technical sense but it’s very unexpected behavior if you’re one to notice patterns like file sizes changing in such a significant way over time. An answer of “stick with v1.1x indefinitely if you want the old behavior” only feels like a very temporary answer.
That was my question but I would imagine the problem is then that you can't debug production. It looks like a more common solution is dSYM, .dbg, .pdb or other things (read from another commenter)
70MB of source is such an extreme amount I don't know how it could be reasonably justified, there must be an enormous amount of waste. All of sqlite is 6MB.
What's interesting is that Go encourages letter soup while the rest of Google styles encourage long descriptive names.
Sometimes I think it was intentionally done to troll the rest of Google that used long_descriptive_names_ with an 80 col limit (Java argued up to 100 but no one else did, not even JS which uses Java's long names).
I always thought it was funny walking around the office seeing everyone with their squished and wrapped code on a huge 32in monitor. Lots of the code in google3 looks like haiku squished to the right margin.
So what is the solution then? Will they just have to fork Go and compress the table again like before? It's completely insane that it would eventually surpass the size of the program itself.
I did a quick search and turned up nothing about stripping the pclntab (note: distinct from the DWARF line-number tables which can be stripped). A post on Google Groups suggests the opposite - pclntab cannot be stripped because the runtime needs the info for GC - https://groups.google.com/forum/m/#!topic/golang-nuts/hEdGYn....
You're right, but that wouldn't make the main problem go away though. I just built a simple plugin for Kubernetes's kubectl and it's about 32MB with go build's -ldflags -s -w where 16MB of that is still the pclntab mentioned in the article.
The problem with removing that completely is that you won't get any information on panics. I don't think this is what you really want, and the current behaviour strikes me as a reasonable middle ground.
32M is really large for a simple plugin, and to be honest I think that says just as much about Kubernetes as it says about Go.
Sidenote: currently trying to get up to date with the best way to get distributed ACID key-value storage these days. Is CockroachDB the new standard? I tried to find benchmarks comparing it to things like Postgres for various use cases but only found articles that read like ads.
It's not a JVM, but its runtime has JVM-like features such as garbage collection and reflection, as well as a scheduling system for lightweight threads called goroutines.
I love the fact that it is monolithic in nature. One exe is all you need, no matter which platform you use. Everything is statically compiled into the binary.
No bundling the jvm and a load of jar files, or lib*.so dependencies.