Service Mesh and Cookpad
This article is a translation of the original article, which was published at the beginning of May. Cookpad is a mid-size technology company with 200+ product developers, 10+ teams, and 90 million monthly average users.
Hello, this is Taiki from the developer productivity team. In this article, I would like to share what we have learned from building and operating a service mesh at Cookpad.
For background on service meshes in general, check out the following resources:
- https://buoyant.io/2017/04/25/whats-a-service-mesh-and-why-do-i-need-one/
- https://blog.envoyproxy.io/service-mesh-data-plane-vs-control-plane-2774e720f7fc
- https://istio.io/docs/setup/kubernetes/quick-start.html
- https://www.youtube.com/playlist?list=PLj6h78yzYM2P-3-xqvmWaZbbI1sW-ulZb
Our goals
We introduced a service mesh mainly to solve operational problems such as troubleshooting, capacity planning, and system reliability. In particular:
- Reducing the management cost of services
- Improving observability
- Building a better fault isolation mechanism
For the first challenge, as our services expanded it became increasingly difficult to figure out which services were communicating with each other and to find the root cause of a failure in a given service. We thought this problem should be solved by centrally managing information about which services connect to which.
For the second challenge, as we explored the first one further, we realized that part of the problem was that we could not easily see the state of communication between two services. For example, we did not have good visibility into metrics such as RPS, response time, the number of successful and failed responses, timeouts, and circuit breaker status. When two or more services called the same backend service, the metrics from that backend's proxy or load balancer were not granular enough, because they were not tagged by the service the request originated from.
For the third challenge, we found that we did not always configure the fault isolation mechanisms properly. At the time, we used a library in each application to implement fault isolation: setting timeouts, retries, and circuit breaking behavior. The main disadvantage is that we had to implement these features for each language. Another downside of the library approach is that configuration tends to be tightly coupled to the application itself. To learn the current settings, we had to dig into both the application code and its configuration. It was also difficult to improve the settings continuously.
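To make the library approach concrete, here is a minimal Python sketch of the kind of in-application fault isolation code each service had to carry: retries wrapped around a simple circuit breaker. It is an illustration only, not Cookpad's actual library. The names `CircuitBreaker` and `call_with_retry` and the threshold values are hypothetical, and timeouts are left out for brevity.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_sec=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_sec = reset_after_sec
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_sec:
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retry(breaker, func, max_attempts=3):
    """Retry func through the breaker; stop at once if the circuit is open."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise
        except Exception as error:
            last_error = error
    raise last_error
```

Every value here (threshold, cool-down, attempt count) lives inside the application, which is the coupling the paragraph above describes: changing or even inspecting a setting means reading the code of each service in each language.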
To solve more advanced problems, we also have the following in scope: building gRPC infrastructure, delegating the processing around distributed tracing, diversifying deployment methods through traffic control, and an authentication and authorization gateway. These areas are discussed later.
Current status
The service mesh at Cookpad uses Envoy as the data-plane, and we created our own control-plane. We initially considered installing Istio, which is an existing service mesh implementation, but nearly all applications at Cookpad run on a container management service called AWS ECS, so the benefit of Istio’s integration with Kubernetes is limited for us. Considering what we wanted to achieve and the complexity of Istio itself, we chose to build our own control-plane, which let us start more incrementally.
The control-plane of our service mesh consists of several components. Here are the role of each component and the flow between them:
- A central repository manages the configuration of the service mesh.
- The gem named kumonos generates the Envoy xDS API response JSON.
- The generated response JSON is placed on Amazon S3, and Envoy uses it as an xDS API.
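The flow above (central definition, generated JSON, static hosting) can be sketched in a few lines of Python. This is not kumonos and does not use its schema: the input format, the function names, and the host `user.example.internal` are hypothetical, and the output is only a simplified approximation of a v1-style cluster discovery response. Uploading the result to S3 is left out.

```python
import json


def build_cds_response(dependencies):
    """Turn a hypothetical list of upstream definitions into a
    simplified, v1-style cluster discovery response."""
    clusters = []
    for dep in dependencies:
        clusters.append({
            "name": dep["name"],
            "type": "strict_dns",
            "lb_type": "round_robin",
            "connect_timeout_ms": dep.get("connect_timeout_ms", 250),
            "hosts": [{"url": "tcp://{}:{}".format(dep["host"], dep["port"])}],
        })
    return {"clusters": clusters}


def render(dependencies):
    """Serialize the response as stable JSON, ready to be placed in object storage."""
    return json.dumps(build_cds_response(dependencies), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical definition of one upstream service.
    print(render([{"name": "user", "host": "user.example.internal", "port": 8080}]))
```

Because the output is a plain file, any static store can serve it, and every change to the mesh goes through an ordinary commit in the central repository.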
We manage these configurations in a central repository for a few reasons:
- we wanted to record the change history together with the reason for each change, so that we can trace it later
- we wanted changes to settings to be reviewable across teams, such as by the SRE team
We started by using an internal ELB for load balancing. As we adopted gRPC, we migrated to client-side load balancing with the SDS (service discovery service) API. We deploy a side-car container in each ECS task that performs health checks against the app container and registers connection destination information in the SDS API. With this mechanism, we were able to run and connect our gRPC applications quickly in a production environment.
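The side-car registration mechanism can be pictured with a small in-memory Python sketch. It is a simplification under stated assumptions: the `Registry` class, the `sidecar_tick` function, the TTL value, and the response shape are all hypothetical stand-ins for a real SDS API, which would sit behind HTTP and persistent storage.

```python
import time


class Registry:
    """In-memory stand-in for an SDS API: hosts expire without heartbeats."""

    def __init__(self, ttl_sec=30.0, clock=time.monotonic):
        self.ttl_sec = ttl_sec
        self.clock = clock
        self._hosts = {}  # (service, ip, port) -> time of last registration

    def register(self, service, ip, port):
        self._hosts[(service, ip, port)] = self.clock()

    def hosts(self, service):
        """Return the hosts of a service whose registration is still fresh."""
        now = self.clock()
        live = [
            {"ip_address": ip, "port": port}
            for (name, ip, port), seen in sorted(self._hosts.items())
            if name == service and now - seen <= self.ttl_sec
        ]
        return {"hosts": live}


def sidecar_tick(registry, service, ip, port, is_healthy):
    """One iteration of the side-car loop: register only while the app is healthy."""
    if is_healthy():
        registry.register(service, ip, port)
        return True
    return False
```

A side-car would call `sidecar_tick` periodically. An app container that stops passing its health check simply stops refreshing its entry, and drops out of the host list once the TTL elapses.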
The configuration around the metrics is as follows:
- Store all metrics in Prometheus
- Send tagged metrics to statsd_exporter running on the ECS container host instance using the dog_statsd sink
- All metrics include the application id via fixed-string tags to identify each node
- Prometheus pulls metrics using EC2 SD
- To manage ports for Prometheus, we use exporter_proxy between statsd_exporter and Prometheus
- Visualize metrics with Grafana and Vizceral
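To show why the tags matter in this pipeline, here is a small Python sketch that parses DogStatsD-style lines (`name:value|type|#tag:value,...`) and totals a counter per originating application. The metric name, the `app` tag key, and the application names in the usage note are illustrative assumptions, not the exact names Envoy emits in Cookpad's setup.

```python
from collections import defaultdict


def parse_dogstatsd(line):
    """Parse one DogStatsD datagram line into name, value, type, and tags."""
    name_value, *sections = line.strip().split("|")
    name, _, value = name_value.partition(":")
    metric_type = sections[0]
    tags = {}
    for section in sections[1:]:
        if section.startswith("#"):  # tag section; other sections (e.g. sample rate) are ignored
            for tag in section[1:].split(","):
                key, _, tag_value = tag.partition(":")
                tags[key] = tag_value
    return {"name": name, "value": float(value), "type": metric_type, "tags": tags}


def sum_by_origin(lines, origin_tag="app"):
    """Total metric values per (metric name, originating app).
    Summing is only meaningful for counter-type metrics."""
    totals = defaultdict(float)
    for line in lines:
        metric = parse_dogstatsd(line)
        origin = metric["tags"].get(origin_tag, "unknown")
        totals[(metric["name"], origin)] += metric["value"]
    return dict(totals)
```

For example, feeding it two lines tagged `#app:recipe-api` and one tagged `#app:search` for the same counter yields separate totals per caller, which is the per-origin breakdown that backend-side load balancer metrics could not provide.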
When the application process runs directly on an EC2 instance without ECS or Docker, the Envoy process runs as a daemon directly on the instance, but the architecture is almost the same as the ECS one. We do not have Prometheus pull directly from Envoy because we still cannot extract histogram metrics from Envoy’s Prometheus-compatible endpoint (#1947). Once this is improved, we plan to eliminate statsd_exporter.
In Grafana, we provide a dashboard for each service, showing metrics such as upstream RPS and timeout occurrences, as well as a dashboard covering Envoy as a whole. We also plan to prepare dashboards along the service x service dimension.
Per service dashboard:
For example, circuit breaker related metrics when the upstream is down:
Dashboard for the Envoy proxies:
The service configuration diagram is visualized using Vizceral, developed by Netflix. For the implementation, we developed forks of promviz and promviz-front to fit our use case. Since we have introduced the mesh for only some services so far, the number of nodes currently displayed is small, but we provide the following dashboards.
Service configuration diagram for each region, RPS, error rate:
Downstream / upstream of a specific service:
As a subsystem of the service mesh, we deploy a gateway for accessing gRPC server applications in the staging environment from developer machines in our offices. It is built by combining the SDS API and Envoy with hako-console, our software for managing internal applications (JP article).
- Gateway app (Envoy) sends xDS API requests to the gateway controller
- The Gateway controller obtains the list of gRPC applications in the staging environment from hako-console app and returns the Route Discovery Service / Cluster Discovery Service API response based on it
- The Gateway app gets the actual connection destination from the SDS API based on the response
- Developer machines connect through an AWS ELB Network Load Balancer, and the gateway app performs the routing
Results
The most remarkable result of introducing the service mesh was that it suppressed the impact of transient failures. There are multiple high-traffic integration points between services, and until now more than 200 minor network-related errors occurred constantly every hour (a very small number compared to the traffic). With proper retry and timeout settings in the service mesh, this decreased to one or fewer per week.
From the viewpoint of monitoring and debugging, various metrics have become visible, but because we have introduced the mesh for only some services and only recently, we have not yet reached full-scale use; we expect to make more use of them in the future. In terms of management, our system became much easier to understand once the connections between services were visible, so we would like to introduce the mesh to all services to prevent oversights and missed considerations.
Future plan
Migrate to v2 API, transition to Istio
We have been using the v1 xDS API because of the circumstances of the initial design and the requirement to use S3 as the delivery back end, but since the v1 API is deprecated, we plan to move to v2. At the same time, we are considering moving the control-plane to Istio. Alternatively, if we keep building our own control-plane, we plan to build the LDS/RDS/CDS/EDS APIs using go-control-plane.
Replacing Reverse proxy
Until now, Cookpad has used NGINX as its reverse proxy, but we are considering replacing NGINX with Envoy as both reverse proxy and edge proxy, given the differences in our knowledge of the internal implementation, gRPC support, and the metrics we can obtain.
Traffic Control
As we move to client-side load balancing and replace the reverse proxy, we will be able to shift traffic freely by controlling Envoy, which will let us implement canary deployments, traffic shifting, and request shadowing.
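Canary deployment and traffic shifting both come down to weighted selection among upstream clusters. In a mesh this decision is made inside the proxy through routing configuration; the Python sketch below only illustrates the idea. The cluster names `stable` and `canary`, the weights, and the function names are hypothetical.

```python
import random


def pick_cluster(weights, rng=random):
    """Choose an upstream cluster name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]


def simulate(weights, requests, seed=0):
    """Count how many of `requests` simulated requests land on each cluster."""
    rng = random.Random(seed)
    counts = {name: 0 for name in weights}
    for _ in range(requests):
        counts[pick_cluster(weights, rng)] += 1
    return counts


if __name__ == "__main__":
    # Send roughly 10% of traffic to the canary; raise the weight step by step to shift.
    print(simulate({"stable": 90, "canary": 10}, requests=10000))
```

Shifting traffic is then a matter of changing the weights over time, for example from 90/10 to 50/50 to 0/100, without redeploying the calling applications.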
Fault injection platform
This is a mechanism that deliberately injects delays and failures in a properly managed production environment and tests whether the actual group of services behaves correctly. Envoy provides various features for this.
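As a rough illustration of what injecting delays and failures means at the request level, the Python sketch below applies a fault policy to a fraction of requests before they reach the upstream. It is not Envoy's implementation or configuration; `FaultPolicy`, `plan_faults`, `handle`, and all of the percentages are hypothetical.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class FaultPolicy:
    delay_percent: int = 0   # share of requests that receive an added delay
    delay_ms: int = 0        # length of the injected delay
    abort_percent: int = 0   # share of requests answered with an error
    abort_status: int = 503  # status returned for aborted requests


def plan_faults(policy, rng=random):
    """Decide, for one request, the delay to add and the abort status (or None)."""
    delay_ms = policy.delay_ms if rng.randrange(100) < policy.delay_percent else 0
    abort = policy.abort_status if rng.randrange(100) < policy.abort_percent else None
    return delay_ms, abort


def handle(policy, upstream, rng=random, sleep=time.sleep):
    """Apply the planned faults, then call the real upstream if not aborted."""
    delay_ms, abort = plan_faults(policy, rng)
    if delay_ms:
        sleep(delay_ms / 1000.0)
    if abort is not None:
        return abort, "fault injected"
    return upstream()
```

With `FaultPolicy(abort_percent=5)`, about one request in twenty would fail before reaching the upstream, which lets callers' timeout, retry, and circuit breaker settings be exercised under controlled conditions.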
Perform distributed tracing on the data-plane layer
At Cookpad, we use AWS X-Ray as our distributed tracing system (JP article). Currently we implement distributed tracing as a library, but we plan to move it to the data-plane and provide it at the service mesh layer.
Authentication Authorization Gateway
The idea is to perform authentication and authorization only at the front-most server that receives the user’s request, and have the subsequent servers reuse the result. Previously this was incompletely implemented as a library, but by shifting it to the data-plane, we can receive the advantages of the out-of-process model.
Wrapping up
We have introduced the current state and future plans of the service mesh at Cookpad. Many features can already be realized easily, and as more can be done at the service mesh layer in the future, we highly recommend it for every microservices system.