At Banzai Cloud we are building an application-centric platform for containers - Pipeline - running on Kubernetes, to allow developers to go from commit to scale in minutes. We support multiple development languages and frameworks with one common goal: all Pipeline deployments get integrated CI/CD, centralized logging, monitoring, enterprise-grade security, autoscaling, and spot price support automatically, out of the box. In most cases we accomplish this in a non-intrusive way (i.e. no code changes are required), or we generate and pre-package the boilerplate code needed to enable all of these must-have features when going to production.
One of the most popular development languages we support is Node.js. We recently published a Node service tools npm library that provides essential features (graceful error handling and shutdown, structured JSON logging, various HTTP middleware, health checks, metrics and more) to make your Node.js application truly ready for production on Kubernetes.
Graceful error handling
In Node.js you can register error handlers for uncaught exceptions and unhandled Promise rejections. We are all human: errors can slip into our code, and there are edge cases we never prepared for. What happens when an unexpected error is thrown that we don't catch? Our program exits right away, possibly leaking resources, losing session data and leaving in-flight requests unhandled. We really want to avoid that!
This is one way of doing it, using our library:
const { catchErrors } = require('@banzaicloud/service-tools')
// this should be called very early in our application to register the error handlers
catchErrors([
// calls the cleanup handlers one after the other on error
stopServer,
closeDatabase,
closeMessageQueue,
// ...
])
Once an unexpected error happens in our running application, we can check the logs and implement the proper error handling (the missing try-catch or Promise chain .catch) for that particular code block.
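Under the hood, a helper like this boils down to registering listeners for the uncaughtException and unhandledRejection process events. The following is a minimal sketch of that idea, not the library's actual implementation; the handler names are placeholders:

```javascript
// Minimal sketch of an error-catching helper: on a fatal, uncaught error,
// log it, run the cleanup handlers sequentially, then exit with failure.
function catchErrors(cleanupHandlers) {
  async function handle(err) {
    console.error(err)
    for (const handler of cleanupHandlers) {
      try {
        // run cleanup handlers one after the other, ignoring their failures
        await handler()
      } catch (cleanupError) {
        console.error(cleanupError)
      }
    }
    process.exit(1)
  }
  process.on('uncaughtException', handle)
  process.on('unhandledRejection', handle)
}
```

Exiting with a non-zero status code matters: it signals the process manager (or Kubernetes) that the instance died abnormally and should be restarted.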
Graceful shutdown
When rolling out a new deployment, the old instances of your application get stopped. The running process receives a SIGTERM termination signal from the process manager. This is how the application gets notified of the termination intent, so it can start cleaning up before exiting.
Let's look at a possible scenario for a graceful web server shutdown:
- App gets the signal to stop (SIGTERM)
- App lets the load balancer know that it's not ready to handle new requests (by returning 503 on the health check endpoint)
- App stops listening on the service port
- App serves all the ongoing requests
- App releases all of its resources correctly: databases, queues, open files, etc.
- App exits with a "success" status code (process.exit())
const { gracefulShutdown } = require('@banzaicloud/service-tools')
// ...
// register event listener for the `SIGTERM` signal
gracefulShutdown([
// calls the cleanup handlers one after the other on SIGTERM
stopServer,
closeDatabase,
closeMessageQueue,
// ...
])
It is also important to mention that if the application is started via npm start, the process will not receive the kill signals. You should always start your application with node directly.
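The shutdown steps above can be sketched in plain Node.js. This is an illustrative simplification, not the library's implementation; the timeout value and handler names are assumptions:

```javascript
// Sketch of a graceful shutdown: on SIGTERM, run the cleanup handlers
// sequentially, exit 0 on success, and fail hard if cleanup hangs too long.
function gracefulShutdown(cleanupHandlers, { timeout = 30000 } = {}) {
  process.once('SIGTERM', async () => {
    // safety net: force-exit if cleanup does not finish within the timeout
    const timer = setTimeout(() => process.exit(1), timeout)
    try {
      for (const handler of cleanupHandlers) {
        await handler()
      }
      clearTimeout(timer)
      process.exit(0)
    } catch (err) {
      console.error(err)
      process.exit(1)
    }
  })
}
```

The timeout safety net is worth having in any real implementation: Kubernetes itself only waits for a grace period (30 seconds by default) before sending SIGKILL, which cannot be caught.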
Structured JSON logger
Application logs come in very handy for developers when something goes wrong, or when we need to follow the state of the application. With structured logging, the format of a log line is well defined and easier to consume (think about filtering and searching for certain entries).
The library provides a configured pino instance as a logger. It also has a utility function to intercept all console.log calls and turn them into JSON logs.
const { logger } = require('@banzaicloud/service-tools')
logger.info('log message')
// > {"level":30,"time":<ts>,"msg":"log message","pid":0,"hostname":"local","v":1}
console.log('log message')
// > log message
logger.interceptConsole()
console.log('log message')
// > {"level":30,"time":<ts>,"msg":"log message","pid":0,"hostname":"local","v":1}
One of the most important features of any logger is the ability to distinguish logs based on their importance. The following levels are available:
fatal: The system is unusable and a person must take action immediately. The system is in distress, customers are probably being affected (or soon will be). Examples:
- failed to start the server on the given port
- failed to start the server due to missing environment variables or bad configuration
- runtime errors or unexpected conditions
- the server can't handle the load
- the database is unreachable
error: An unexpected technical or business event happened; error events are likely to cause problems. Examples:
- a 5xx internal server error and its cause
- most of the errors in try/catch blocks
warn: Warning events that might cause problems. Examples:
- use of deprecated APIs
- poor use of an API
info: Routine information, such as ongoing status or performance; important information we want to see at high volume in case we need to analyze an issue. Examples:
- system life cycle events (server is listening on a port, received a kill signal, etc.)
- session life cycle events (login, logout)
- significant boundary events (database calls, remote API calls, etc.)
debug: Not very important, but potentially helpful events for tracking the flow through the system and isolating issues. It should only be turned on when developing locally, or for a short period of time. Examples:
- HTTP requests and responses
- messages received on a queue
trace: Similar to debug, except these are usually high-volume, frequent events. It should only be turned on when developing locally, or for a very short period of time. Examples:
- states of some data being processed
- entry/exit of non-trivial methods and decision points
The library also lets you configure which severity to log in different environments. The minimum logging level can be set via the LOGGER_LEVEL environment variable.
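Level-based filtering typically works by mapping each level name to a numeric weight and dropping any message below the configured minimum. The sketch below uses pino's default level numbers; the LOGGER_LEVEL handling and function name are illustrative, not the library's code:

```javascript
// Numeric log levels (pino's defaults) enable cheap minimum-level filtering.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 }
const minLevel = LEVELS[process.env.LOGGER_LEVEL] || LEVELS.info

// Returns the serialized log line, or null when the message is filtered out.
function log(level, msg) {
  if (LEVELS[level] < minLevel) return null
  return JSON.stringify({ level: LEVELS[level], time: Date.now(), msg })
}

log('debug', 'dropped when the minimum level is info')
log('error', 'serialized at any minimum level up to error')
```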
The logger also redacts certain fields, based on their key names, so as not to expose secrets in the logs. The default fields are password, pass, authorization, auth and cookie, but this list can be configured as well.
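The idea behind key-based redaction can be shown in a few lines of plain JavaScript. This is a simplified stand-in for what the logger does (pino has its own, faster redaction mechanism); the function name and placeholder value are illustrative:

```javascript
// Keys whose values must never appear in log output.
const REDACTED_KEYS = new Set(['password', 'pass', 'authorization', 'auth', 'cookie'])

// Returns a copy of the object with sensitive values masked,
// recursing into nested plain objects.
function redact(obj) {
  const out = {}
  for (const [key, value] of Object.entries(obj)) {
    if (REDACTED_KEYS.has(key.toLowerCase())) {
      out[key] = '[REDACTED]'
    } else if (value && typeof value === 'object' && !Array.isArray(value)) {
      out[key] = redact(value)
    } else {
      out[key] = value
    }
  }
  return out
}

console.log(JSON.stringify(redact({ user: 'jane', password: 'hunter2' })))
// → {"user":"jane","password":"[REDACTED]"}
```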
We have put great effort into collecting and moving the logs of distributed applications deployed to Kubernetes to a centralized location - and have automated the whole logging experience.
Health check
Health checks are used by the load balancer or the application manager to determine the health of a running application. When healthy, the application is ready to accept requests or handle other kinds of load. If your application fails, the system has to detect it automatically and try to fix it (for example by restarting the misbehaving instances).
In Kubernetes there are two kinds of health checks:
- liveness:
Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations.
- readiness:
Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup. In such cases, you don’t want to kill the application, but you don’t want to send it requests either.
If your application does not expose this kind of information, the system has no way to tell whether it is working correctly or not. It is extremely important to define these checks.
The library currently supports the Koa and Express web frameworks. The checks are executed sequentially, one after another, and the endpoint returns 200 when all of them pass and 500 when any of them fails.
const express = require('express')
const { express: middleware } = require('@banzaicloud/service-tools').middleware
// ...
const app = express()
// each check returns a Promise
app.get('/health/liveness', middleware.healthCheck([
checkDB
]))
app.get('/health/readiness', middleware.healthCheck([
initFinished,
checkDB,
canAcceptMoreRequests
]))
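The check functions passed to the middleware (checkDB, initFinished, etc. above) are not part of the library; you write them for your own dependencies. A sketch of what they might look like, where db.ping() stands in for whatever connectivity probe your database client exposes:

```javascript
// Placeholder database client; replace with your real driver.
const db = { ping: async () => true }

// A check passes when its Promise resolves and fails when it rejects,
// which makes the health endpoint return 500.
async function checkDB() {
  await db.ping()
}

// A readiness check can also gate on application state, e.g. a startup flag
// that is flipped once initialization (config, caches, migrations) is done.
let started = false
async function initFinished() {
  if (!started) throw new Error('still initializing')
}
```

Keeping liveness checks cheap and dependency-free while putting the expensive dependency checks on the readiness endpoint is a common pattern: it avoids restarting an otherwise healthy instance just because a downstream database is briefly unreachable.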
In Kubernetes you can define the checks in the Pod specification:
# ...
ports:
- name: http
containerPort: 8080
livenessProbe:
httpGet:
path: /health/liveness
port: http
readinessProbe:
httpGet:
path: /health/readiness
port: http
Application metrics
Metrics are a very important source of insight into the state, load and stability of your running application. They give us the ability to observe the application's state, so we can act on issues quickly. Pipeline uses Prometheus to collect metrics and, as usual, we have automated the whole monitoring experience for all Pipeline deployments.
The library builds on top of prom-client, so you can easily extend the exported metrics and fine-tune them for your needs.
const express = require('express')
const { express: middleware } = require('@banzaicloud/service-tools').middleware
// ...
const app = express()
app.get('/metrics', middleware.prometheusMetrics())
The application is only responsible for exposing these metrics. Consuming them is done by Prometheus itself, by calling the metrics endpoint periodically.
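What Prometheus scrapes from that endpoint is plain text in the Prometheus exposition format. The sketch below renders a single counter in that format just to show what's on the wire; the metric name, help text and helper function are illustrative (prom-client generates this for you):

```javascript
// Render one counter in the Prometheus text exposition format:
// a HELP line, a TYPE line, then the sample with optional labels.
function renderCounter(name, help, value, labels = {}) {
  const labelStr = Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(',')
  return [
    `# HELP ${name} ${help}`,
    `# TYPE ${name} counter`,
    labelStr ? `${name}{${labelStr}} ${value}` : `${name} ${value}`,
  ].join('\n')
}

console.log(renderCounter('http_requests_total', 'Total HTTP requests', 42, { method: 'GET' }))
// → # HELP http_requests_total Total HTTP requests
// → # TYPE http_requests_total counter
// → http_requests_total{method="GET"} 42
```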
Learn from the code
As always, the easiest way to learn is by reading the code. We have collected some examples to kick-start your Node.js experience on Kubernetes!
If you are interested in our technology and open source projects, follow us on GitHub, LinkedIn or Twitter: