CloudRun Canary Releases with Terraform
Intro
Canary deployment, the gradual rollout of a new feature under continuous monitoring, has long been an established practice among engineers. In the cloud world, most providers support canary releases for their managed services. In the same world, gone are the times of manual infrastructure changes, with IaC (infrastructure as code) taking over.
In this how-to post, I’ll illustrate an implementation of the idea for a GCP CloudRun service managed with terraform.
Implementation
While CloudRun itself comes with the functionality[1][2] to do a canary-style deployment, using it can be technically tricky when CloudRun resources are managed exclusively with terraform, that is, when no other tool, not even gcloud, is allowed to change cloud resources bypassing the terraform/terragrunt config files. Which is not a bad state of IaC things to have at your org.
The implementation is based on the CloudRun (CR) ability to split the traffic coming to a service between multiple (two, in the canary case) versions of the service configuration, known as revisions.
From the terraform side, the google_cloud_run_service resource is the only way[3][4] to control a CR service. Traffic served by a service is controlled with a traffic block, so the terraform configuration of a service in its normal state looks like this:
# for illustration purposes the rest of the code is omitted
traffic {
  percent         = 100
  latest_revision = true
}
autogenerate_revision_name = true
The canary state with two different revisions serving different traffic percentage is represented by a config like this:
traffic {
  percent       = 90
  revision_name = live_revision_name
}
traffic {
  percent       = 10
  revision_name = canary_revision_name
}
Simple as this static configuration may appear, its dynamic equivalent comes with complications that arise from the fact that the multiple revisions feature is not a first-class citizen in the google terraform provider:
- Autogenerated revision names can no longer be used
- The google_cloud_run_service resource operates on a single revision per deployment: how does one manage and reference two at the same time?
- A mechanism to handle the two states is required:
  - Control the current state
  - Control the traffic split percentage between revisions in the canary state
To address revision name-related problems, let’s take a look at how a revision name is set within the resource:
resource "google_cloud_run_service" "service" {
  name = my_service_name
  template {
    metadata {
      name = revision_name
    }
  }
}
With the following considerations in mind:
- A revision name must be prefixed with the name of the service it belongs to, and
- There should be a distinction between live and canary versions, and
- A revision name can’t be reused to deploy a new application version, so it has to be unique,
revisions can be named following a pattern like this:
service_name + changing_part (commit, application version, etc.) + live/canary identifier, e.g.:
locals {
  # interpolated names must be quoted strings; parts are joined with hyphens,
  # as in the plan output further down
  rev_name_live   = "my-service-name-${var.app_version_live}-live"
  rev_name_canary = "my-service-name-${var.app_version_canary}-canary"
}
resource "google_cloud_run_service" "service" {
  name = "my-service-name"
  template {
    metadata {
      name = local.rev_name_live
    }
  }
  traffic {
    percent       = 100
    revision_name = local.rev_name_live
  }
}
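The changing part can come from anywhere, as long as it differs between application versions. As a minimal sketch, assuming the full commit SHA is available to terraform, the short form seen in the plan output later in this post could be derived like this. The variables service_name and commit_sha are hypothetical and not part of the configuration above:

```hcl
# Hypothetical inputs: neither variable appears in the original configuration.
variable "service_name" {
  description = "Cloud Run service name, also used as the revision name prefix"
  type        = string
}

variable "commit_sha" {
  description = "Full commit SHA of the application version being deployed"
  type        = string
}

locals {
  # substr(string, offset, length): keep the first 7 characters of the SHA
  # (assumes the SHA is at least 7 characters long)
  short_sha = substr(var.commit_sha, 0, 7)

  # service name + changing part + live/canary identifier, joined with hyphens
  rev_name_live = "${var.service_name}-${local.short_sha}-live"
}
```

Deriving the name from the commit keeps it in step with the image tag, so a new application version always produces a new revision name.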
With revision naming figured out, how can we actually manage and use multiple revisions at the same time? Knowing that:
- We need a new revision name for canary (it can be formed anytime, as the naming pattern is now known), and
- The revision itself should physically exist (it needs to be deployed first),
We introduce a canary switch, e.g. a bool variable canary_enabled that controls the state, and a dynamic traffic block that is conditionally created when the service is in the canary state:
variable "canary_enabled" {
  description = "Canary switch"
  type        = bool
}

locals {
  rev_name_live   = "my-service-name-${var.app_version_live}-live"
  rev_name_canary = "my-service-name-${var.app_version_canary}-canary"
}

resource "google_cloud_run_service" "service" {
  name = "my-service-name"

  template {
    metadata {
      # if canary is enabled, deploy a canary revision
      name = var.canary_enabled ? local.rev_name_canary : local.rev_name_live
    }
    spec {
      containers {
        # if canary is enabled, deploy a canary image
        image = var.canary_enabled ? var.canary_image_name : var.live_image_name
      }
    }
  }

  traffic {
    # live serves 100% by default. If canary is enabled, this traffic block controls canary
    percent = var.canary_enabled ? var.canary_percent : 100
    # revision is named live by default. When canary is enabled, a new revision named canary is deployed
    revision_name = var.canary_enabled ? local.rev_name_canary : local.rev_name_live
  }

  dynamic "traffic" {
    # if canary is enabled, add another traffic block
    for_each = var.canary_enabled ? ["canary"] : []
    content {
      # current live's traffic is now controlled here; the block only exists
      # in the canary state, so no further condition is needed
      percent       = 100 - var.canary_percent
      revision_name = local.rev_name_live
    }
  }
}
With the switch off, the configuration has only one traffic block that controls the current live revision. Once switched on, and here’s the trick to manage two revisions at the same time, the current live traffic is controlled in the dynamic block, while the original traffic block manages the canary revision.
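Since the live share is computed as 100 minus the canary share, a value outside the 0 to 100 range would produce an invalid split. A minimal sketch of a guard, assuming a terraform version that supports variable validation blocks; the default of 10 is an assumption made for illustration:

```hcl
variable "canary_percent" {
  description = "Percent of traffic the canary revision will get"
  type        = number
  default     = 10 # assumed default, not taken from the original configuration

  validation {
    # whole number between 0 and 100, so that 100 - canary_percent stays valid
    condition = (
      var.canary_percent >= 0 &&
      var.canary_percent <= 100 &&
      floor(var.canary_percent) == var.canary_percent
    )
    error_message = "The canary_percent value must be a whole number between 0 and 100."
  }
}
```

With this in place, a typo such as 110 fails at plan time instead of reaching the CloudRun API.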
Let’s toggle the switch:
# module.example.google_cloud_run_service.service will be updated in-place
~ resource "google_cloud_run_service" "service" {
id = "locations/us-west2/namespaces/project/services/example-cr-service"
name = "example-cr-service"
~ template {
~ metadata {
~ name = "example-cr-service-9da0c6d-live" -> "example-cr-service-06dccff-canary"
}
~ spec {
~ containers {
~ image = "us.gcr.io/project/repo:9da0c6d1cddb26b913859e48e2ccdeba3c0ee596" -> "us.gcr.io/project/repo:06dccffbcd0fe15807feed60c51bd03dc12eef18"
# (2 unchanged attributes hidden)
}
}
}
~ traffic {
~ percent = 100 -> 10
~ revision_name = "example-cr-service-9da0c6d-live" -> "example-cr-service-06dccff-canary"
# (1 unchanged attribute hidden)
}
+ traffic {
+ percent = 90
+ revision_name = "example-cr-service-9da0c6d-live"
}
}
Note the additional variable canary_percent, which we can now use to control the traffic split between the two revisions:
# module.example.google_cloud_run_service.service will be updated in-place
~ resource "google_cloud_run_service" "service" {
id = "locations/us-west2/namespaces/project/services/example-cr-service"
name = "example-cr-service"
~ traffic {
~ percent = 10 -> 20
# (2 unchanged attributes hidden)
}
~ traffic {
~ percent = 90 -> 80
# (2 unchanged attributes hidden)
}
}
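The change behind this plan is a one-line edit of the inputs. A sketch, assuming the variables are set in a tfvars file (the file name is hypothetical; terragrunt inputs would work the same way):

```hcl
# terraform.tfvars (hypothetical file name)
canary_enabled = true
canary_percent = 20 # was 10; the live revision gets the remaining 80
```

Repeating this step with growing values ramps the canary up gradually while both revisions keep their names.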
The original problems are now tackled: a canary revision can be conditionally deployed to serve an adjustable portion of the service’s traffic alongside the currently running live version, with both controlled by the same terraform resource. A revision’s name contains a changing part, which allows new revision names to be generated.
Let’s now promote the canary version to live by updating the live_image_name variable and toggling the canary switch off:
# module.example.google_cloud_run_service.service will be updated in-place
~ resource "google_cloud_run_service" "service" {
id = "locations/us-west2/namespaces/project/services/example-cr-service"
name = "example-cr-service"
~ template {
~ metadata {
~ name = "example-cr-service-06dccff-canary" -> "example-cr-service-06dccff-live"
# (3 unchanged attributes hidden)
}
}
~ traffic {
~ percent = 20 -> 100
~ revision_name = "example-cr-service-06dccff-canary" -> "example-cr-service-06dccff-live"
# (1 unchanged attribute hidden)
}
- traffic {
- latest_revision = false -> null
- percent = 80 -> null
- revision_name = "example-cr-service-9da0c6d-live" -> null
}
}
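The inputs behind this promotion might look as follows. This is a sketch that reuses the values from the plan output above and assumes app_version_live is set separately from the image name, so both have to move to the canary’s version:

```hcl
# Before: canary 06dccff serves part of the traffic next to live 9da0c6d
# canary_enabled   = true
# app_version_live = "9da0c6d"
# live_image_name  = "us.gcr.io/project/repo:9da0c6d1cddb26b913859e48e2ccdeba3c0ee596"

# After: the canary version becomes the live one
canary_enabled   = false
app_version_live = "06dccff"
live_image_name  = "us.gcr.io/project/repo:06dccffbcd0fe15807feed60c51bd03dc12eef18"
```

If the version were derived from the image tag instead, updating live_image_name alone would be enough.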
Random postfix
There are often cases when we want to update the configuration of an existing version without changing the application version itself, e.g. to change a service property like max scale, or even to deploy an old version again. The current setup does not allow for that: the revision name stays the same and the deploy fails, as revision names can’t be reused. A process to regenerate revision names whenever there is a change to the service properties is therefore required. The naming convention can be slightly updated to include a random part that is regenerated each time a service property changes (if you don’t like the idea of having an identifier in a revision name, you can use the random part alone). The postfix should be separate for canary and live revision names, so that changes to a live revision do not trigger changes to a canary one, and vice versa:
variable "live_image_name" {
  description = "Live image name"
  type        = string
}

variable "max_scale" {
  description = "Cloud Run service max number of instances"
  type        = number
}

variable "canary_image_name" {
  description = "Canary image name"
  type        = string
}

variable "canary_enabled" {
  description = "Canary switch"
  type        = bool
}

variable "canary_percent" {
  description = "Percent of traffic canary revision will get"
  type        = number
}

variable "force_new_revision" {
  description = "Dummy variable to trigger a new revision name"
  type        = bool
}

resource "random_string" "rev_name_postfix_live" {
  # regenerated on changes to the following 'keepers' - properties of a service
  keepers = {
    image_name         = var.live_image_name
    max_scale          = var.max_scale
    force_new_revision = var.force_new_revision
  }
  length  = 2
  special = false
  upper   = false
}

resource "random_string" "rev_name_postfix_canary" {
  keepers = {
    canary_enabled    = var.canary_enabled
    canary_image_name = var.canary_image_name
  }
  length  = 2
  special = false
  upper   = false
}

locals {
  # each name references the result of its own random_string resource
  rev_name_live   = "my-service-name-${var.app_version_live}-live-${random_string.rev_name_postfix_live.result}"
  rev_name_canary = "my-service-name-${var.app_version_canary}-canary-${random_string.rev_name_postfix_canary.result}"
}
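Because the postfix is random, the final revision names are only known after apply. A small sketch of outputs that surface them, e.g. for use in dashboards or alerts; the output names are hypothetical:

```hcl
output "live_revision_name" {
  description = "Name of the revision currently treated as live"
  value       = local.rev_name_live
}

output "canary_revision_name" {
  description = "Name of the canary revision, or null when the canary switch is off"
  value       = var.canary_enabled ? local.rev_name_canary : null
}
```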
Upon toggling the canary_enabled switch in the config file, we’ll see that the random part of the name is regenerated: the rev_name_postfix_canary resource is recreated in response to a change in its keepers, which in turn triggers the creation of a new CR revision:
# module.example.google_cloud_run_service.service will be updated in-place
~ resource "google_cloud_run_service" "service" {
id = "locations/us-west2/namespaces/project/services/example-cr-service"
name = "example-cr-service"
# (4 unchanged attributes hidden)
~ template {
~ metadata {
~ name = "example-cr-service-8f09270-live-dw" -> (known after apply)
# (3 unchanged attributes hidden)
}
}
~ traffic {
~ percent = 100 -> 10
~ revision_name = "example-cr-service-8f09270-live-dw" -> (known after apply)
# (1 unchanged attribute hidden)
}
+ traffic {
+ percent = 90
+ revision_name = "example-cr-service-8f09270-live-dw"
}
# (2 unchanged blocks hidden)
}
# module.example.random_string.rev_name_postfix_canary must be replaced
-/+ resource "random_string" "rev_name_postfix_canary" {
~ id = "vw" -> (known after apply)
~ keepers = { # forces replacement
~ "canary_enabled" = "false" -> "true"
# (1 unchanged element hidden)
}
~ result = "vw" -> (known after apply)
# (9 unchanged attributes hidden)
}
Additionally, the special keeper force_new_revision allows for triggering the creation of a new revision when there are no changes to the service properties, which may be useful in some cases.
Canary identifier
The beauty of a canary deployment is only in the eyes of the engineer who monitors it, and to monitor a canary version, it’s essential to be able to easily identify it in your monitoring systems. Though a revision name can be handy, we can help the systems distinguish between revisions with a special identifier, such as an environment variable set dynamically on the canary revision, which can then be used in the application code:
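# canary identifier env var, placed inside the containers block
dynamic "env" {
  for_each = var.canary_enabled ? { "CANARY" = "1" } : {}
  content {
    name  = env.key
    value = env.value
  }
}
The plan then shows the variable being added to the canary revision: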
~ spec {
~ containers {
# (3 unchanged attributes hidden)
+ env {
+ name = "CANARY"
+ value = "1"
}
}
}
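For orientation, here is a sketch of where the dynamic env block sits within the resource. The service name and location are taken from the plan output in this post; the traffic blocks are left out, and the variables are the ones declared earlier:

```hcl
resource "google_cloud_run_service" "service" {
  name     = "example-cr-service"
  location = "us-west2"

  template {
    metadata {
      name = var.canary_enabled ? local.rev_name_canary : local.rev_name_live
    }
    spec {
      containers {
        image = var.canary_enabled ? var.canary_image_name : var.live_image_name

        # Generated only while the canary switch is on, so live revisions
        # never carry the CANARY variable
        dynamic "env" {
          for_each = var.canary_enabled ? { "CANARY" = "1" } : {}
          content {
            name  = env.key
            value = env.value
          }
        }
      }
    }
  }

  # traffic blocks omitted for brevity
}
```

Because env is part of the revision template, the variable belongs to the canary revision only: the live revision keeps serving without it.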
Rollback
If your monitoring systems reveal that the version is not good enough to be fully released, it can be rolled back in a blink by toggling the switch off, which shifts traffic back to the previous live revision:
# module.example.google_cloud_run_service.service will be updated in-place
~ resource "google_cloud_run_service" "service" {
id = "locations/us-west2/namespaces/project/services/example-cr-service"
name = "example-cr-service"
~ template {
~ metadata {
~ name = "example-cr-service-06dccff-canary-rr" -> "example-cr-service-7b2b51c-live-8g"
}
}
~ traffic {
~ percent = 10 -> 100
~ revision_name = "example-cr-service-06dccff-canary-zh" -> "example-cr-service-7b2b51c-live-8g"
# (1 unchanged attribute hidden)
}
- traffic {
- latest_revision = false -> null
- percent = 90 -> null
- revision_name = "example-cr-service-7b2b51c-live-8g" -> null
}
# (2 unchanged blocks hidden)
}
# module.example.random_string.rev_name_postfix_canary must be replaced
-/+ resource "random_string" "rev_name_postfix_canary" {
~ id = "zh" -> (known after apply)
~ keepers = { # forces replacement
~ "canary_enabled" = "true" -> "false"
# (1 unchanged element hidden)
}
~ result = "zh" -> (known after apply)
}
Conclusion
Useful as an idea may sound in theory, its practical implementation in a particular context may be complicated by numerous different things.
I hope the post gives you some practical knowledge on how a canary deployment can be implemented for a GCP CloudRun service fully managed with terraform.
Happy canarying.
Notes
[1] Check out the official guide. It also uses tags, which can easily be added to the implementation if needed: https://cloud.google.com/architecture/implementing-cloud-run-canary-deployments-git-branches-cloud-build
[2] See this post on CloudRun Release Manager. I did not use it for a number of reasons, but it’s worth checking out and may fit your particular case: https://medium.com/google-cloud/automatic-release-propagation-for-canary-releases-with-cloud-run-1ccc2ec74c7f
[3] There is also this v2 resource, which does not seem to address the issue: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloud_run_v2_service
[4] It wouldn’t hurt to have a separate resource for revisions. Thumbs up: https://github.com/hashicorp/terraform-provider-google/issues/10095