Hey y'all. This is another minor patch release with a variety of little fixes we've been accumulating~

f0a37ace9 Fix `npm doctor` when hitting registries without `ping`. (@zkat)
64f0105e8 Fix invalid format error when setting cache-related headers. (@zkat)
d2969c80e Fix spurious `EINTEGRITY` issue. (@zkat)
800cb2b4e #17076 Use legacy `from` field to improve upgrade experience from legacy shrinkwraps and installs. (@zkat)
4100d47ea #17007 Restore loose semver parsing to match older npm behavior when running into invalid semver ranges in dependencies. (@zkat)
35316cce2 #17005 Emulate npm@4's behavior of simply marking the peerDep as invalid, instead of crashing. (@zkat)
e7e8ee5c5 #16937 Work around a separate bug where `requested` was somehow null. (@forivall)
2d9629bb2 Better logging output for git errors. (@zkat)
2235aea73 More scp-url fixes: parsing only worked correctly when a committish was present. (@zkat)
80c33cf5e Standardize package permissions on tarball extraction, instead of using perms from the tarball. This matches previous npm behavior and fixes a number of incompatibilities in the wild. (@zkat)
2b1e40efb Limit shallow cloning to hosts which are known to support it. (@zkat)

npm's documentation recommends that you use semantic versioning, which we also call semver,
but it doesn’t explain why you’d use SemVer in the first place.
This post is a quick overview of SemVer and why it’s a good idea.
At its most basic, SemVer is a contract between the producers and consumers of packages that establishes how risky an upgrade is — that is, how likely it is that an upgrade will break something. The different digits that comprise a SemVer version number each have meaning, which is where the “semantic” part comes from.
There’s a great deal of nuance to the full semver specification but it takes just a few seconds to review the core idea.
A simple semver version number looks like this: 1.5.4. These three numbers, left to right, are called the major, minor, and patch versions.
A more descriptive way to think of them is as the breaking, feature, and fix versions.
You release a major version when the new release will definitely break something in users’ code unless they change their code to adopt it. You release a minor version when you introduce a feature that adds functionality in a backwards-compatible way, i.e., you add a feature that doesn’t require users of the previous versions to change their code. You release a patch version when you make a backwards-compatible bug fix, like closing a security flaw or correcting the code to match documented behavior.
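If you publish with npm, the `npm version` command bumps the corresponding part of the number for you (the starting version below is just an example):

$ npm version patch   # 1.5.4 -> 1.5.5: backwards-compatible bug fix
$ npm version minor   # 1.5.4 -> 1.6.0: new feature, backwards-compatible
$ npm version major   # 1.5.4 -> 2.0.0: breaking change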
By separating releases by risk, SemVer allows the consumer of the software to set rules about how to automatically pull in new versions.
A pretty common set of rules when using a library in development is: automatically accept new feature and fix releases, but hold off on breaking (major) releases.
When using npm, you can express this set of rules by listing a package version as ^1.3.5 or 1.x. These are the default rules npm will apply to a package when you add it to your project's package.json. npm@5 ensures that this happens automatically when you run npm install.
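As a concrete sketch (the package name and version are hypothetical), adding a dependency with npm@5 records a caret range by default:

$ npm install demo-lib
$ grep demo-lib package.json
  "demo-lib": "^1.3.5"
# ^1.3.5 matches 1.3.6 and 1.4.0, but not 2.0.0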
However, you might not care about new features as long as there are no bugs: in that case, you want to automatically accept fix releases only, and hold off on both feature and breaking releases.
You would express those rules in npm using a range like ~1.3.5 or 1.3.x. You can make this the default behavior of npm using the save-prefix configuration option.
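If you want that stricter, fixes-only behavior by default, a minimal sketch (package name hypothetical):

$ npm config set save-prefix '~'
$ npm install demo-lib        # now saved as "demo-lib": "~1.3.5"
# ~1.3.5 matches 1.3.6, but not 1.4.0 or 2.0.0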
The best formulation of these rules isn't 100% clear: a fix version isn't guaranteed not to break your code; the author just doesn't think it will. But excluding fix versions entirely might leave you open to a known security problem, in which case your code is "broken" simply by staying as-is.
Many people accept feature and fix versions in development, but lock down the packages they depend on to exact, known-good versions once in production by using the package-lock.json feature of npm@5.
SemVer allows you — and npm — to automatically manage, and thus reduce, the risk of breaking your software by baking information about relative risk into the version number. The key word here is automatically.
Imagine if everybody used a single number for their version, which they incremented every time they made any kind of change. Every time a package changed, you would need to go to the project’s home page or changelog and find out what changed in the new version. It might not be immediately clear if that change would break existing code, so you would have to ask the author or install it and test the software to find out.
Imagine instead if everybody used a single number for their version and incremented it only when they’d added a bunch of new features that they were really proud of. This would be even worse. Not only would you not know if a change was going to break your code, but if an update did break your code, you’d have no way of specifying that you wanted a specific earlier version.
Either of these extreme alternatives would be painful for the consumers of a package, but even more painful for the author of a package, who would constantly be getting inquiries from users about how risky an upgrade was. A good author might put that information in a known place on their home page, but not everyone might be able to find it.
By making this form of communication automatic, SemVer and npm save everybody involved a great deal of time and energy. Authors and users alike can spend less time on emails, phone calls, and meetings about software, and more time writing software.
It’s common for a modern JavaScript project to depend on 700–1200 packages. When you’re using that many packages, any system that requires you to manually check for updates is totally unworkable, making SemVer critical — but SemVer is also why there are that many packages in the first place.
10 years ago, the JavaScript world was dominated by a handful of very large libraries, like YUI, Mootools, and jQuery. These “kitchen sink” libraries tried to cover every use case, so you would probably pick one and stick with it. Different libraries were not guaranteed to work well together, so you’d have to consider compatibility before adding a new one to your project.
Then Node.js came along, and server-side JavaScript developers began using npm to add new libraries with very little effort.
This “many small modules” pattern became hugely popular, and npm’s automatic use of SemVer allowed it to blossom without software being constantly broken by unexpected changes. In the last few years, tools like webpack and babel have unlocked the 500,000 packages of the npm ecosystem for use on the client side in browsers, and the pattern proved equally popular with front-end developers.
The evidence suggests that using a large number of smaller modules is a more popular pattern than a handful of large libraries. Why that’s the case is up for debate, but its popularity is undeniable, and SemVer and npm are a big part of what make it possible.
Q: Hi! Can you state your name and what you do?
A: Ahoy, I'm Alistair Brown and I'm a lead front-end engineer at ShopKeep, primarily focusing on our BackOffice app, which gives more than 23,000 merchants the ability to manage their business operations from anywhere. With ShopKeep's BackOffice, business owners handle everything from inventory management to accessing customized reporting specific to their business, so this is a vital component of the ShopKeep product.
How’s your day going?
It’s going pretty well — my team is mostly front-end focused, so we use npm many times every day. We’re currently prepping some dependency upgrades to make sure we’re ready to jump on the newest version of React (v16) when it’s released. It’s important to us that we stay up to date, getting the benefits of optimization, bug fixes, and new tools.
What is your history with npm?
I’ve used npm in a few jobs to manage dependencies as well as publish some personal projects as modules on the npm registry. A few years ago, I was given an iKettle for Christmas and spent much of that holiday creating an npm module so I could boil water remotely using JavaScript — not a very popular module, but a lot of fun to build! More recently, I’m excited about the release of npm5. We’ve just rolled it out across our developer machines and onto the CI servers, and we’re really seeing the benefits.
What problem did you have that npm Orgs helped you fix?
The main problem we wanted to solve was being able to share code between upcoming projects. The npm Organization setup allowed us to create our own modules and keep control over who could access them. Having private packages within the organization has allowed us the freedom to create a versioned module, but without the fanfare of opening it up to the world.
Can you tell us a story about a specific package you wanted to make that private packages really enabled you to do?
At Node.js Interactive Europe last year, I’d been inspired by a talk by Aria Stewart, called “Radical Modularity.” With the concept of “anything can be a package” in mind, we first started small with our brand colours (JSON, SASS, etc.) and configs. I explored pulling these components out of our code base into separate modules as part of a Code Smash (our internal hackathon). This allowed us to test the waters. As we mainly write in React and had created a number of generic components, there were lots of packages we wanted to extract. In the end, we started modularizing everything and have even extracted out our icon assets.
How's the day-to-day experience of using private packages and orgs?
It’s super easy. Day to day, there’s no difference from using any other package from npm. Once the code is out in a module, we get to treat it just like any other piece of third-party code. There had been a little bit of fear that the scope prefix would cause problems with existing tooling, but so far there have been no problems at all — a great feat!
Does your company do open source? How do you negotiate what you keep private and public?
We have several repositories of useful tools that we’ve open-sourced on GitHub, hoping these same tools could be useful for other developers. These range from shpkpr, a tool we use for managing applications on marathon and supporting zero-downtime deploys, to our air traffic controller slack bot, which helps us coordinate deployments to all of the different services we run. Open sourcing a project is an important undertaking and we always want to make sure that we have pride in what we release. Using private packages gives us that middle ground, allowing us to separate out reusable code but keep it internal until we’re ready to show it off.
To people who are unsure how they could use private packages, how would you explain the use case?
We started off wanting to get code reuse by sharing code as a package. Making private packages allowed us to be more confident about pulling the code out, knowing it wasn’t suddenly visible to the world. Our ESLint config is a nice example of a small reusable module we created, containing rules which enforce our internal code style. Splitting this out allowed us to apply the rules across multiple codebases by extending from this central config. Later, we added a new rule, and having immutable packages meant we could cut a new version and stagger the updates to dependent projects. Really, we get all the benefits that you’d expect from using a third-party package, while keeping control of updating and distribution.
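For instance, consuming a shared lint config from a private, scoped package might look like this (the scope and package name here are hypothetical, not ShopKeep's actual module):

$ npm install --save-dev @myorg/eslint-config
# then each project's .eslintrc just extends it:
#   { "extends": "@myorg/eslint-config" }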
How would you see the product improved or expanded in the future?
With the rapid development of the JavaScript ecosystem, it can be hard to keep up to date with new versions as they come out. The `outdated` command helps towards this, but anything that can be built to help developers stay on the latest and greatest would be really handy.
Would you recommend that other groups or companies use Orgs?
Definitely! It’s not just so you can use private packages, it’s also a great way to group your modules under a brand and avoid naming clashes. With the recent pricing change making organizations free, there really is no excuse for open source groups and companies not to publish their modules under an org.
What’s your favorite npm feature/hack?
I’m a huge fan of npm scripts. It’s allowed us to provide a single interface for useful commands and avoid forcing developers to install global dependencies. From building our application with gulp, upgrading tooling with a shell script, to publishing multiple modules with lerna, the developer experience stays the same by hiding the internals behind the simplicity of `npm run`.
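A sketch of that pattern (the script names and tools below are illustrative): the commands live in package.json, and everyone runs them the same way through npm.

# in package.json:
#   "scripts": {
#     "build": "gulp build",
#     "release": "lerna publish"
#   }
$ npm run build
$ npm run release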
What is the most important/interesting/relevant problem with the JavaScript package ecosystem right now? If you could magically solve it, how would you?
Building a package manager is a difficult problem to solve and it’s great to see so much engagement in this space. Deterministic installs is something that has been really important, so it’s good to see this in npm5 and yarn. I think the natural next step is a client-agnostic lock file. When there are multiple developers on a project, making sure that we can replicate a development environment across all dev machines and CI servers is very important — we use a shrinkwrap file (moving soon to a package-lock.json!), but those are npm-specific. Reducing that barrier between different packaging clients should allow for more experimentation on new approaches and optimisations.
Any cool npm stuff your company has done that you’d like to promote?
No — we’re just happy users!
This piece is a part of our Customer Convos series. We’re sharing stories of how people use npm at work. Want to share your thoughts? Drop us a line.
Q: Hi! Can you state your name and what you do?
A: Gregor, community manager at the open source project Hoodie, and co-founder at Neighbourhoodie, a consultancy; we do Greenkeeper.
How’s your day going?
I just arrived in Berlin and life is good!
What is your history with npm?
Love at first sight! I’m a big fan of all of what you do! npm is a big inspiration for how a company can self-sustain and be a vital part of the open source community at the same time.
What problems has npm helped you fix?
We love small modules. Because of the maintenance overhead that comes with small modules, we created semantic-release and eventually Greenkeeper. Here is an overview of all our modules.
The `@hoodie` scope allows us to signal that this is a module created by us, the core hoodie team, and that it’s part of the core architecture. I could imagine using the scope in the future for officially supported 3rd party plugins, too.
How’s it going? How’s the day to day experience?
Our release process is entirely automated via semantic-release so that we don’t use different npm accounts to release software. Technically, it’s all released using the https://www.npmjs.com/~hoodie account.
How would you see the product improved or expanded in the future?
Hmm I can’t think of anything… I’ll ask around.
How would you recommend that other groups or companies use npm?
I don’t see why companies would not use scopes. I think it’s a great way to signal an “official” package, in order to differentiate 3rd party modules from the community.
What’s your favorite npm feature/hack?
As a developer, I love that I can subscribe to the live changes of the entire npm registry. It allows us to build really cool tools, including Greenkeeper.
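The registry's replication endpoint exposes a CouchDB-style changes feed; at the time of this interview you could follow it with something like the command below (endpoint details may change over time):

$ curl 'https://replicate.npmjs.com/_changes?feed=continuous&since=now'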
What is the most important/interesting/relevant problem with the JavaScript package ecosystem right now? If you could magically solve it, how would you?
For me, a big challenge is fixing a bug in a library like `hoodie` that is caused by a dependency (or sub-dependency, or sub-sub-dependency…). It would be cool if there was a way to easily set up a development environment in which I could test the bug from the main module while working on the dependency, until I have it resolved. That would make it simple to release a new, fixed version of the dependency and update the package.json of the main module to request the fixed version.
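One existing tool that partly addresses this is `npm link`, which wires a local checkout of a dependency into the module you're debugging (the paths and names below are hypothetical):

$ cd ~/code/dep-b && npm link           # register the local copy globally
$ cd ~/code/module-a && npm link dep-b  # module A now resolves dep-b to that local checkout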
This is kind of related: let’s say I have a module A with a dependency of module B and B depends on module C, so it’s A=>B=>C. Now, if I fix C and release a new fix version, I cannot release a new version of A that enforces this new version, because it’s a sub-dependency. I’m not sure what the right approach to this problem is, but that’s one that’s bothering me related to npm.
Any cool npm stuff your company has done that you’d like to promote?
Here’s another patch release, soon after the other!
This particular release includes a slew of fixes to npm's git support, which was causing some issues for a chunk of people, especially those who were using self-hosted/Enterprise repos. All of those should be back in working condition now.
There's another shiny thing you might wanna know about: npm has a Canary release now! The npm5 experiment we did during our beta proved to be incredibly successful: users were able to have a tight feedback loop between reports and getting the bugfixes they needed, and the CLI team was able to roll out experimental patches and have the community try them out right away. So we want to keep doing that.
From now on, you'll be able to install the 'npm canary' with npm i -g npmc. This release will be a separate binary (npmc. Because canary. Get it?), which will update independently of the main CLI. Most of the time, this will track release-next or something close to it. We might occasionally toss experimental branches in there to see if our more adventurous users run into anything interesting with it. For example, the current canary (npmc@5.0.1-canary.6) includes an experimental multiproc branch that parallelizes tarball extraction across multiple processes.
If you find any issues while running the canary version, please report them and let us know it came from npmc! It would be tremendously helpful, and finding things early is a huge reason to have it there. Happy hacking!
Just a heads up: We’re preparing to do a massive cleanup of the issue tracker. It’s been a long time since it was something we could really keep up with, and we didn’t have a process for dealing with it that could actually be sustainable.
We’re still sussing the details out, and we’ll talk about it more when we’re about to do it, but the plan is essentially to close old, abandoned issues and start over. We will also add some automation around issue management so that things that we can’t keep up with don’t just stay around forever.
Stay tuned!
1f26e9567 `pacote@2.7.27`: Fixes installing committishes that look like semver, even though they're not using the required `#semver:` syntax. (@zkat)
85ea1e0b9 `npm-package-arg@5.1.1`: This includes the npa git-parsing patch to make it so non-hosted SCP-style identifiers are correctly handled. Previously, npa would mangle them (even though hosted-git-info is doing the right thing for them). (@zkat)

The new summary output has been really well received! One downside that reared its head as more people used it, though, is that it doesn't really tell you anything about the toplevel versions it installed. So, if you did npm i -g foo, it would just say "added 1 package". This patch by @rmg keeps things concise while still telling you what you got! So now, you'll see something like this:
$ npm i -g foo bar
+ foo@1.2.3
+ bar@3.2.1
added 234 packages in .005ms
362f9fd5b #16899 For every package that is given as an argument to install, print the name and version that was actually installed. (@rmg)
a47593a98 #16835 Fix a crash while installing with `--no-shrinkwrap`. (@jacknagel)
89e0cb816 #16818 Fixes a spelling error in the docs. Because the CLI team has trouble spelling "package", I guess. (@ankon)
c01fbc46e #16895 Remove `--save` from `npm init` instructions, since it's now the default. (@jhwohlgemuth)
80c42d218 Guard against cycles when inflating bundles, as symlinks are bundles now. (@iarna)
7fe7f8665 #16674 Write the builtin config for `npmc`, not just `npm`. This is hardcoded for npm self-installations and is needed for Canary to work right. (@zkat)
63df4fcdd #16894 `node-gyp@3.6.2`: Fixes an issue parsing SDK versions on Windows, among other things. (@refack)
5bb15c3c4 `read-package-tree@5.1.6`: Fixes some racyness while reading the tree. (@iarna)
a6f7a52e7 `aproba@1.1.2`: Remove nested function declaration for speed up. (@mikesherov)

npm loves everyone!
By popular demand, this year we’re making the npm team’s Pride shirts available to all, with help from our friends at Teespring. Select your favorite design and click through for types and sizes — or collect them all! — and 100% of proceeds will benefit The Trevor Project.
Are we missing a design? Reach out and let us know!
It’s here!
Starting today, if you type `npm install npm@latest -g`, you’ll be updated to npm version 5. In addition, npm@5 is bundled in all new installations of Node.js 8, which has replaced Node.js 7 in the Node Project’s current release line.
Over the last year and a half, we’ve been working to address a huge number of pain points, some of which had existed since the registry was created. Today’s release is the biggest ever improvement to npm’s speed, consistency, and user experience.
The definitive list of what’s new and what’s changed is in our release notes,
but here are some highlights:
We’ve reworked package metadata, package download, and package caching, and this has sped things up significantly. In general, expect performance improvements of 20–100%; we’ve also seen some installations and version bumps that run 5x faster.
(Installing the npm website on our own dev environments went from 99 seconds using npm@4 to 27 seconds with npm@5. Now we spend less time jousting.)
Since npm was originally designed, developers have changed how they use npm. Not only is the npm ecosystem exponentially larger, but the number of dependencies in the average npm package has increased 250% since 2014. More devs now install useful tools like Babel, Webpack, and Tap locally, instead of globally. It’s a best practice, but it means that `npm install` does much more work.
Given the size of our community, any speed bump adds up to massive savings for millions of users, not to mention all of our Orgs and npm Enterprise customers. Making npm@5 fast was an obvious goal with awesome rewards.
Default lockfiles
Shrinkwrap has been a part of npm for a long time, but npm@5 makes lockfiles the default, so all npm installs are now reproducible. The files you get when you install a given version of a package will be the same, every time you install it.
We've found that countless common and time-consuming problems can be tied to the "drift" that occurs when different developer environments use different package versions. With default lockfiles, this is no longer a problem: you won't lose time trying to figure out a bug only to learn that it came from people running different versions of a library.
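In practice, that means a plain install now produces the lockfile with no extra flags (a minimal sketch):

$ npm install                  # creates or updates package-lock.json automatically
$ git add package-lock.json    # commit it so every machine reproduces the same tree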
SHA-512 hashes
npm@5 adds support for any tarball hash function supported by Node.js, and it publishes with SHA-512 hashes. By checking all downloaded packages, you’re protected against data corruption and malicious attacks, and you can trust that the code you download from the registry is consistent and safe.
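Concretely, each entry in package-lock.json carries a subresource-integrity string such as "integrity": "sha512-…" (the digest is stored base64-encoded in SRI form), and you can compute the same kind of SHA-512 digest for a local tarball yourself (file name below is illustrative):

$ shasum -a 512 demo-lib-1.3.5.tgz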
Self-healing cache
Our new caching is wicked fast, but it’s also more resilient. Multiple npm processes won’t corrupt a shared cache, and npm@5 will check data on both insertion and extraction to prevent installing corrupted data. If a cache entry fails an integrity check, npm@5 will automatically remove it and re-fetch.
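You can also ask npm@5 to audit and garbage-collect the cache on demand:

$ npm cache verify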
With your feedback, we’ve improved the user experience with optimizations throughout npm@5. A big part of this is more informative and helpful output. The best example of this is that npm no longer shows you the entire tree on package install; instead, you’ll see a summary report of what was installed. We made this change because of the larger number of dependencies in the average package. A file-by-file readout turned out to be pretty unwieldy beyond a certain quantity.
npm@5 is a huge step forward for both npm and our awesome community, and today’s release is just the beginning. A series of improvements in the pipeline will make using npm as frictionless as possible and faster than ever before.
But: npm exists because of its users, and our goal remains being open and flexible to help people build amazing things, so we depend on your feedback.
What works for you? What should we improve next? How much faster are your installs? Let us know. Don’t hesitate to find us on Twitter, and, if you run into any trouble, be sure to drop us a note.
Since before the release of npm 2.0 in 2014, we have encouraged developers using our APIs to use token authentication instead of passing a username and password in a basic auth header. Over the next few weeks, we will be turning that recommendation into a requirement: basic HTTP authentication will no longer work for any of the npm registry endpoints that require authorization. Instead, you should use bearer tokens.
There are two exceptions:
The /login endpoint remains the endpoint to use to log into the npm registry & generate an auth token for later use.
The /whoami endpoint will continue to respond with the username for a successful login.
Both of these endpoints are monitored and rate-limited to detect abuse.
If you’re an npm user, this change will likely not affect you. Log in with the npm cli as you would normally:
npm login
A successful login will store an auth token in your .npmrc, which the client will use for all actions that require auth.
If you are using the npm cli to interact with registries other than npm’s, you should also not be affected. We have no plans to remove support for basic auth from the npm cli.
If you are a developer using npm’s API, make sure you’re using a bearer token when you need to authenticate with the registry. For more information about how to do this, please see the documentation for npm/npm-registry-client. This package is what the official command-line client uses to do this work.
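As a hedged sketch (the token value below is made up, and the path is the public registry's whoami route): after `npm login`, your .npmrc contains a line like `//registry.npmjs.org/:_authToken=00000000-0000-0000-0000-000000000000`, and API requests send that token as a bearer credential:

$ curl -H 'Authorization: Bearer 00000000-0000-0000-0000-000000000000' \
    https://registry.npmjs.org/-/whoami
{"username":"your-npm-username"}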
If you have any questions or requests for us, please contact npm support. We want to hear about how you’re using our APIs and how you’d like them to evolve to support your use cases.
Here’s how we deploy node services at npm:
cd ~/code/exciting-service
git push origin +master:deploy-production
That's it: git push and we've deployed.
Of course, a lot is triggered by that simple action. This blog post is all about the things that happen after we type a git command and press Return.
As we worked on our system, we were motivated by a few guiding principles: pushing out code and rolling it back should be low-friction, each step of a deploy should be invokable on its own, and every step should be repeatable.
Why? We want no barriers to pushing out code once it’s been tested and reviewed, and no barriers to rolling it back if something surprising happens — so any friction in the process should be present before code is merged into master, via a review process, not after we’ve decided it’s good. By separating the steps, we gain finer control over how things happen. Finally, making things repeatable means the system is more robust.
What happens when you do that force-push to the deploy-production branch? It starts at the moment an instance on AWS is configured for its role in life.
We use Terraform and Ansible to manage our deployed infrastructure. At the moment I’m typing, we have around 120 AWS instances of various sizes, in four different AWS regions. We use Packer to pre-bake an AMI based on Ubuntu Trusty with most of npm’s operational requirements, and push it out to each AWS region.
For example, we pre-install a recent LTS release of node as well as our monitoring system onto the AMI. This pre-bake greatly shortens the time it takes to provision a new instance. Terraform reads a configuration file describing the desired instance, creates it, adds it to any security groups needed and so on, then runs an Ansible playbook to configure it.
Ansible sets up which services a host is expected to run. It writes a rules file for the incoming webhooks listener, then populates the deployment scripts. It sets up a webhook on GitHub for each of the services this instance needs to run. Ansible then concludes its work by running all of the deploy scripts for the new instance once, to get its services started. After that, it can be added to the production rotation by pointing our CDN at it, or by pointing other processes to it through a configuration change.
This setup phase happens less often than you might think. We treat microservices instances as disposable, but most of them are quite long-lived.
So our new instance, configured to run its little suite of microservices, is now happily running. Suppose you then do some new development work on one of those microservices. You make a pull request to the repo in the usual way, which gets reviewed by your colleagues and tested on Travis. You’re ready to run it for real!
You do that force-push to deploy-staging, and this is what happens: a reference gets repointed on the GitHub remote. GitHub notifies a webhooks service listening on running instances. This webhooks service compares the incoming hook payload against its configured rules, decides it has a match, & runs a deploy script.
Our deploy scripts are written in bash, and we've separated each step of a deploy into a separate script that can be invoked on its own. We don't just invoke them through GitHub hooks! One of our Slack chatbots is set up to respond to commands to invoke these scripts on specific hosts. Broadly, they fetch the new code, install its dependencies, render configuration, and roll the service's processes; there's a sketch of that shape below.
Each step reports success or failure to our company Slack so we know if a deploy went wrong, and if so at which step. We emit metrics on each step as well, so we can annotate our dashboards with deploy events.
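A rough sketch of that shape (the script names, paths, and steps are illustrative, inferred from the description above; these are not npm's actual scripts):

#!/usr/bin/env bash
# deploy.sh <service> -- run every step of a deploy; each step is also runnable on its own
set -euo pipefail
service="$1"

./fetch-code.sh     "$service"   # update the checkout from the deploy-* branch
./install-deps.sh   "$service"   # npm install / build
./render-config.sh  "$service"   # pull values from etcd and write the upstart config
./roll-processes.sh "$service"   # restart the N copies one at a time behind haproxy
# each step reports success/failure to Slack and emits a metric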
We name our deploy branches deploy-foo, so we have, for instance, deploy-staging, deploy-canary, and deploy-production branches for each repo, representing each of our deployment environments. Staging is an internal development environment with a snapshot of production data but very light load and no redundancy. Canary hosts are hosts in the production line that only take a small percentage of production load, enough to shake out load-related problems. And production is, as you expect, the hosts that take production traffic.
Every host runs a haproxy, which does load balancing as well as TLS termination. We use TLS for most internal communication among services, even within a datacenter. Unless there’s a good reason for a microservice to be a singleton, there are N copies of everything running on each host, where N is usually 4.
When we roll services, we take them out of haproxy briefly using its API, restart, then wait until they come back up again. Every service has two monitoring hooks at conventional endpoints: a low-cost ping and a higher-cost status check. The ping is tested for response before we put the service back into haproxy. A failure to come back up before a timeout stops the whole roll on that host.
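A minimal sketch of one iteration of that roll (the backend, job, and endpoint names are hypothetical, and it assumes haproxy's admin socket is enabled):

$ echo 'disable server web/api0' | socat stdio /var/run/haproxy.sock
$ restart api0                                                      # upstart job for one copy of the service
$ until curl -fsS localhost:8080/ping >/dev/null; do sleep 1; done  # wait for the cheap ping check
$ echo 'enable server web/api0' | socat stdio /var/run/haproxy.sock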
You’ll notice that we don’t do any cross-host orchestration. If a deploy is plain bad and fails on every host, we’ll lose at most 1 process out of 4, so we’re still serving requests (though at diminished capacity). Our Slack operational incidents channel gets a warning message when this happens, so the person who did the deploy can act immediately. This level of orchestration has been good enough thus far when combined with monitoring and reporting in Slack.
You’ll also notice that we’re not doing any auto-scaling or managing clusters of containers using, e.g., Kubernetes or CoreOS. We haven’t had any problems that needed to be solved with that kind of complexity yet, and in fact my major pushes over the last year have been to simplify the system rather than add more moving parts. Right now, we are more likely to add copies of services for redundancy reasons than for scaling reasons.
Configuration is a perennial pain. Our current config situation is best described as “less painful than it used to be.”
We store all service configuration in an etcd cluster. Engineers write to it with a command-line tool, then a second tool pulls from it and writes configuration at deploy time. This means config is frozen at the moment of deploy, in the upstart config. If a process crashes & restarts, it comes up with the same configuration as its peers. We do not have plans to read config on the fly. (Since node processes are so fast to restart, I prefer killing a process & restarting with known state to trying to manage all state in a long-lived process.)
Each service has a configuration template file that requests the config data it requires. This file is in TOML format for human readability. At deploy time, the script runs and requests keys from etcd, namespaced by the config value, the service requesting the config, and the configuration group of the host. This lets us separate hosts by region or by cluster, so we can, for example, point a service at a Redis in the same AWS data center.
Here’s an example:
> furthermore get /slack_token/
slack_token matches:
/slack_token == xoxb-deadbeef
/slack_token.LOUDBOT == XOXB-0DDBA11
/slack_token.hermione == xoxb-5ca1ab1e
/slack_token.korzybski == xoxb-ca11ab1e
/slack_token.slouchbot == xoxb-cafed00d
Each of our chatbots has a different Slack API token stored in the config database, but in their config templates they need only say they require a variable named slack_token [1].
These config variables are converted into environment variable specifications or command-line options in an upstart file, controlled by the configuration template. All config is baked into the upstart file and an inspection of that file tells you everything you need to know.
Here’s LOUDBOT’s config template:
app = "LOUDBOT"
description = "YELL AND THEN YELL SOME MORE"
start = "node REAL_TIME_LOUDIE.js"
processes = 1
[environment]
SERVICE_NAME = "LOUDBOT"
SLACK_TOKEN = "{{slack_token}}"
And the generated upstart file:
# LOUDBOT node 0
description "YELL AND THEN YELL SOME MORE"
start on started network-services
stop on stopping network-services
respawn
setuid ubuntu
setgid ubuntu
limit nofile 1000000 1000000
script
cd /mnt/deploys/LOUDBOT
SERVICE_NAME="LOUDBOT" \
SLACK_TOKEN="XOXB-0DDBA11" \
node REAL_TIME_LOUDIE.js \
>> logs/LOUDBOT0.log 2>&1
end script
This situation is vulnerable to the usual mess-ups: somebody forgets to override a config option for a cluster, or to add a new config value to the production etcd as well as to the staging etcd. That said, it’s at least easily inspectable, both in the db and via the results of a config run.
The system I describe above is sui generis, and it's not clear that any of the components would be useful to anybody else. But our habit as an engineering organization is to open-source all our tools by default, so everything except the bash scripts is available if you'd find it useful. In particular, furthermore is handy if you work with etcd a lot.
[1] The tokens in this post aren’t real. And, yes, LOUDBOT’s are always all-caps.
Earlier today, July 6, 2016, the npm registry experienced a read outage for 0.5% of all package tarballs for all network regions. Not all packages and versions were affected, but the ones that were affected were completely unavailable during the outage for any region of our CDN.
The unavailable tarballs were offline for about 16 hours, from mid-afternoon PDT on July 5 to early morning July 6. All tarballs should now be available for read.
Here’s the outage timeline:
Over the next hour, 502 rates fell back to their normal level of zero.
We’re adding an alert on all 500-class status codes, not just 503s. This alert will catch the category of errors, not simply this specific problem.
We’re also revising our operational playbook to encourage examination of our CDN logs more frequently; we could have caught the problem very soon after introducing it if we had carefully verified that our guess about the source of 502s had resulted in making them vanish from our CDN logging. We can also do better with tools for examining the patterns of errors across POPs, which would have made it clearer to us immediately that the error was not specific to the US East coast and was therefore unlikely to have been caused by an outage in our CDN.
Read on if you would like the details of the bug.
The root cause for this outage was an interesting interaction of file modification time, nginx’s method of generating etags, and cache headers.
We recently examined our CDN caching strategies and learned that we were not caching as effectively as we might, because of a property of nginx. Nginx’s etags are generated using the file modification time as well as its size, roughly as mtime + '-' + the file size in bytes
. This meant that if mtimes for package tarballs varied across our nginx instances, our CDN would treat the files from each server as distinct, and cache them separately. Getting the most from our CDN’s caches and from our users’ local tarball caches is key to good performance on npm installs, so we took steps to make the etags match across all our services.
Our chosen scheme was to set each tarball's file modification time to the first 32-bit big-endian integer of its md5 hash. This was entirely arbitrary but looked sufficient after testing in our staging environment: we produced consistent etags. Unfortunately, the script that applied this change to our production environment failed to clamp the resulting integer, resulting in negative numbers for timestamps. Ordinarily, this would result in the infamous Dec 31, 1969 date one sees for timestamps before the Unix epoch.
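A sketch of the intended scheme, including the clamp the production script was missing (the file name is illustrative, and this is a reconstruction rather than the script we actually ran):

$ hash=$(md5sum demo-lib-1.3.5.tgz | cut -c1-8)   # first 4 bytes of the md5, as hex
$ mtime=$(( 0x$hash & 0x7fffffff ))               # clamp so a signed 32-bit value stays non-negative
$ touch -d "@$mtime" demo-lib-1.3.5.tgz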
Unfortunately, negative mtimes triggered an nginx bug. Nginx will serve the first request for a file in this state and deliver the negative etag. However, if there is a negative etag in the if-none-match header, nginx attempts to serve a 304 but never completes the request. This resulted in the bad gateway message returned by our CDN to users attempting to fetch a tarball with the bad mtime.
You can observe this behavior yourself with nginx and curl:
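(The original commands aren't reproduced here, but a reconstruction along these lines shows the same behavior; the path, port, and exact ETag value are illustrative, and the bug is specific to the nginx versions we were running at the time.)

$ touch -d '1969-12-31 23:59:00' /srv/static/demo.tgz          # pre-epoch mtime
$ curl -sI http://localhost:8080/demo.tgz | grep -i etag       # first request works; note the negative etag
$ curl -i -H 'If-None-Match: "-3c-4f2b"' http://localhost:8080/demo.tgz   # hangs: 304 status, request never finishes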
The final request never completes even though nginx has correctly given it a 304 status.
Because this only affected a small subset of tarballs, not including the tarball fetched by our smoketest alert, all servers remained in the pool. We have an alert on above-normal 503 error rates served by our CDN, but this error state produced 502s and was not caught.
All the tarballs that were producing a 502 gateway timeout error turned out to have negative timestamps in their file mtimes. The fix was to touch them all so their times were inconsistent across our servers but valid, thus both busting our CDN’s cache and dodging the nginx behavior.
The logs from our CDN are invaluable, because they tell us what quality of service our users are truly experiencing. Sometimes everything looks green on our own monitoring, but it’s not green from our users’ perspective. The logs are how we know.