Google Cloud Platform

Unleashing App Engine Scalability

How to leverage App Engine to create scalable and cost-effective applications.

Building a Scalable Website

Imagine you were asked to build a website that tens of thousands, or even millions of people will be accessing—sometimes all at the same time. You would be challenged to provide a fast, reliable service, and it would be essential for the website to scale appropriately to demand.

Does your team have sufficient experience to build a scalable and reliable cluster of web servers? Can you easily and accurately estimate the cost of developing, operating, and maintaining your servers? Do you have domain expertise to hire the best people for the job?

Google App Engine offers excellent solutions for these challenges. In this document, we will look at how App Engine handles incoming user requests and how it scales your application as traffic increases and decreases. You will learn how to configure App Engine's scaling behavior to find the optimal balance between performance and cost, so that you can successfully harness the power and flexibility of Google App Engine for your projects.

Life of a Request in the App Engine Architecture

To begin, you need to understand how user requests are handled inside App Engine, how they are delivered to your application instance, and how the response is returned to the user. Understanding this overall flow will help you determine how to optimize your application. Figure 1 shows how service requests and responses flow through App Engine’s internal architecture.

Figure 1: How user requests are routed to application instances

A user request is routed to the geographically closest Google data center, which may be in the United States, the European Union, or the Asia-Pacific region. In that data center, an HTTP server known as the Google front end receives the request and routes it through the Google-owned fiber backbone to an App Engine data center that runs your application.

The App Engine architecture includes the following components:

  • App Engine front end servers are responsible for load balancing and failover of App Engine applications. As shown in Figure 1, the App Engine front end receives user requests from the Google front end and dispatches each request either to an app server for dynamic content or to a static server for static content.
  • App servers are containers for App Engine applications. An app server creates application instances and distributes requests based on traffic load. The application instances contain your code—the request handlers that implement your application.

    The app server runtime environment includes APIs to access the full suite of App Engine services, allowing you to easily build scalable and highly available web applications. (A brief example of calling one of these APIs follows this list.)

  • Static servers are dedicated to serving static files for App Engine applications; they are optimized to provide the content rapidly with minimal latency. (To further reduce latency, Google may move high-volume public content to an edge cache.)
  • The app master is the conductor of the whole App Engine orchestra. When you deploy a new version of an App Engine application, the app master uploads your program code to app servers and your static content to static servers.
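As an illustration of those runtime APIs, here is a minimal sketch (assuming the Python runtime) of read-through caching with the Memcache service; the cache key and the compute_value() helper are hypothetical names used only for this example:

```python
# A minimal sketch, assuming the Python runtime: read-through caching
# with the App Engine Memcache service. 'greeting' and compute_value()
# are hypothetical names for illustration.
from google.appengine.api import memcache

def compute_value():
    # Stand-in for an expensive operation (e.g., a Datastore query).
    return 'Hello, App Engine!'

def get_greeting():
    # Serve from Memcache when possible; recompute on a cache miss.
    value = memcache.get('greeting')
    if value is None:
        value = compute_value()
        memcache.set('greeting', value, time=60)  # cache for 60 seconds
    return value
```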

The App Engine front end and the app servers work together, tracking the available application instances as your application uses more or fewer instances.

App Engine is a Container Technology

App Engine is a container-based platform as a service (PaaS) offering. An application instance, running in an app server, is the container that isolates resources between applications. Each application instance is guaranteed to receive dedicated server resources, such as CPU time and memory, as well as strict security isolation.

Application instances are created and managed like Linux processes, so they can be started quickly and consume minimal system resources. In contrast, the unit of allocation and management in a typical infrastructure as a service (IaaS) offering is a hypervisor-based virtual machine (VM). Figure 2 illustrates the differences in the amount of resources that need to be initialized for a new VM instance compared to a new App Engine instance.

Figure 2: Comparing VMs to application instances

The most significant difference between VMs and App Engine is in how the operating system overhead is managed. Each VM deployment hosts its own image of an operating system, which requires memory resources in the host machine. The developer is typically responsible for creating, storing, and managing that image. This may include applying security patches, installing device drivers, and performing administrative and maintenance tasks.

It can take tens of seconds to boot up an operating system each time a new VM is added to handle more requests. In contrast, an application instance can be created in a few seconds.

Application instances operate against high-level APIs, eliminating the need to communicate through layers of code with virtual device drivers.

In summary, application instances are the computing units that App Engine uses to scale your application. Compared to VMs, application instances are fast to initialize, lightweight, memory efficient, and cost effective.

Overall, App Engine scales easily because its lightweight-container architecture can spin up application instances far more quickly than VMs can be started.

The next section describes how you can design your application to take advantage of that architecture.

Optimizing your App Engine Application

When you design and deploy your App Engine application, strive to minimize the following two related factors:

  • The response time to user requests
  • The overall cost of application instances running inside the app servers

To reduce the response time, App Engine automatically increases the number of instances based on the current load and on configuration parameters specified by the developer. However, additional instances cost more. To minimize the additional cost, you need to understand how and when App Engine creates and deletes instances in response to changes in the traffic load.

The following sections explain what an instance is and how to configure its parameters to balance responsiveness and cost effectiveness for your application.

Best Practices for Optimizing Scalability

Web applications are typically partitioned into two types of processing: immediate, real-time, interactive processing for a user request, and longer-term processing, such as complex database updates, batch processing, or integration with other, slower systems.

App Engine recognizes this distinction by providing frontend instance classes (F1, F2, F4) for low-latency interactive responses and backend instance classes (B1, B2, B4, B8) for high-latency background processing.

Code modules deployed to a frontend instance receive requests from clients and process them quickly, typically in the range of tens of milliseconds up to a few seconds. App Engine requires that frontend instances respond to each request within 60 seconds. If your application logic cannot fully process a request in that time, you can use task queues to defer the processing.
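For example, here is a minimal sketch (assuming the Python runtime and the webapp2 framework) of deferring slow work with the Task Queue API; the /submit and /process URLs and the item_id parameter are hypothetical names:

```python
# A minimal sketch, assuming the Python runtime and webapp2: respond to
# the user quickly and defer the slow work to a task queue. The URLs and
# the 'item_id' parameter are hypothetical.
import webapp2
from google.appengine.api import taskqueue

class SubmitHandler(webapp2.RequestHandler):
    def post(self):
        # Enqueue the slow work; a worker handler mapped to '/process'
        # (typically on a backend module) executes it later.
        taskqueue.add(url='/process',
                      params={'item_id': self.request.get('item_id')})
        self.response.write('Accepted')  # return well within the 60-second limit

app = webapp2.WSGIApplication([('/submit', SubmitHandler)])
```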

Code modules deployed to a backend instance do not have a limit on processing time and can be used to process tasks from a task queue, do long-running computations, run MapReduce jobs, or implement other data-processing pipelines.

There are three types of instance scaling available. For modules on backend instance classes, you use either:

  • Manual scaling—App Engine creates the number of instances that you specify in a configuration file. The number of instances you configure depends on the desired throughput, the speed of the instances, and the size of your dataset balanced against cost considerations. Manually scaled instances run continuously, so complex initializations and other in-memory data is preserved across requests.
  • Basic scaling—App Engine creates instances to handle requests and releases them when they become idle. Basic scaling is ideal and cost effective if your workload is intermittent or driven by user activity.

For modules on frontend instance classes, you use automatic scaling: App Engine adjusts the number of instances based on the request rate, response latencies, and other application metrics. You can control the scaling of your instances to meet your performance requirements.
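To make this concrete, the following sketch shows how the three scaling types might be declared in per-module configuration files (assuming the Python runtime and App Engine Modules); the module names, instance classes, and values are assumptions chosen for illustration:

```yaml
# A minimal sketch of per-module scaling configuration (Python runtime).
# Module names, instance classes, and numeric values are illustrative.

# worker.yaml -- backend module, manual scaling
module: worker
instance_class: B4
manual_scaling:
  instances: 5        # fixed number of continuously running instances

# batch.yaml -- backend module, basic scaling
module: batch
instance_class: B2
basic_scaling:
  max_instances: 10   # upper bound on instances created on demand
  idle_timeout: 5m    # release an instance after 5 idle minutes

# app.yaml -- default frontend module, automatic scaling
module: default
instance_class: F2
automatic_scaling:
  min_idle_instances: 3
  max_pending_latency: 500ms
```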

Overall, the dynamic scalability and cost effectiveness of an App Engine application is primarily controlled by the design and configuration of the frontend instances. Follow these important best practices to optimize their scalability:

  1. Design for reduced latency and more queries per second (QPS).
  2. Optimize idle instances and pending latency.

The rest of this paper will focus on applying these practices to automatic-scaling frontend instance classes. Refer to the App Engine Modules documentation (Java, Python, Go, PHP) for information on managing backend instance classes.

Less Latency, More QPS

Queries per second (QPS) characterizes the capacity of an instance. QPS is defined as the number of HTTP requests one instance can process in one second.[1] For example, peak traffic in the Open for Questions application was about 700 QPS, and when App Engine hosted the Royal Wedding website, the worldwide media coverage generated around 32,000 QPS at its peak.

To handle more QPS, add more instances, as expressed by this formula:

Total QPS = Average QPS × Number of Instances
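For example, if each instance handles an average of 10 QPS, ten instances provide a total capacity of roughly 100 QPS.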

You can find the total number of instances and average QPS for your application on the Instances page of the App Engine Admin Console, as shown in Figure 3.

Figure 3: Instances page of the Admin Console

In this example, the average QPS is 10.479. Because there are seven instances running for this application, the site is processing 73.353 QPS in total.[2]

As a best practice, strive to minimize the time required to process each request. You can check on the total processing time for requests on the Logs page of the Admin Console (Figure 4).

Figure 4: Latency shown in the log

The example request in Figure 4 shows a latency of 402 ms. This time is important for two reasons:

  • A slow response directly impacts the user experience.
  • Your instance is busy during that time, so App Engine may create another instance to handle additional requests. If you can optimize your code to execute in 200 ms, the user experience is improved, and your QPS may be doubled without running extra instances.

Appstats is a powerful tool you can use to understand, optimize, and improve your application’s QPS. As shown in Figure 5, it shows the number of RPC calls that are invoked inside each request, the duration of each RPC call (such as Datastore or Memcache access), and the contribution of each RPC call to the overall latency of the request. This information gives you hints for finding bottlenecks in your application.

Figure 5: Example timeline from Appstats
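If you use the Python runtime, a minimal sketch of enabling Appstats is to wrap your WSGI application with the recording middleware in appengine_config.py:

```python
# A minimal sketch, assuming the Python runtime: enable Appstats by
# installing its recording middleware in appengine_config.py.
from google.appengine.ext.appstats import recording

def webapp_add_wsgi_middleware(app):
    # Record every RPC made during a request so it appears in the
    # Appstats timeline (served at /_ah/stats by default).
    return recording.appstats_wsgi_middleware(app)
```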

If the Appstats graphs indicate that your application’s bottleneck is in CPU-intensive tasks, rather than waiting for RPC calls to return, you could try a higher CPU class for the frontend instances to reduce the latency. While this increases the CPU cost of each instance, the number of instances required to support the load will decrease, and the user experience improves without a major shift in the total cost.

You can also increase the QPS by letting App Engine assign multiple requests to each instance simultaneously. By default, one instance can run only one thread to prevent unexpected behavior or errors caused by concurrent processing. If your application code is thread-safe and implements proper concurrency control, you can increase the QPS of each instance without additional cost by specifying the threadsafe element in the configuration file.
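In the Python runtime, for example, this is a one-line setting in app.yaml; a minimal sketch, where the application ID is a placeholder:

```yaml
# A minimal app.yaml sketch (Python 2.7 runtime); 'your-app-id' is a placeholder.
application: your-app-id
version: 1
runtime: python27
api_version: 1
threadsafe: true   # allow one instance to serve concurrent requests
```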

Optimize Idle Instances and Pending Latency

While QPS represents the total throughput of your application, other parameters, such as idle instances and pending latency, determine the elasticity of your application scalability. Configure these parameters on the Application Settings page of the Admin Console[3] (Figure 6).

Figure 6: Idle instances and pending latency parameters

Idle Instances

Idle instances help your site handle a sudden influx of requests. Usually, requests are handled by existing, active, available application instances. If a request arrives and there are no available application instances, App Engine may need to activate an application instance to handle that request (called a loading request). A loading request takes longer to respond, because it must wait while the new instance is initialized.

Idle instances (also called resident instances) represent the number of instances that App Engine keeps loaded and initialized, even when the application is not receiving any requests. The default is to have zero idle instances, which means requests will be delayed every time your application scales up to more instances.

You can adjust the minimum and maximum number of idle instances independently with sliders in the Admin Console.

We recommend that you maintain idle instances if you do not want requests to wait for instance creation and initialization. For example, if you specify a minimum of ten idle instances, your application will be able to service a burst of requests immediately on those ten instances. We recommend that you allocate idle instances carefully because they will always be resident and incur some cost.

You can also set an upper limit to the number of idle instances. This parameter is designed to control how gradually App Engine reduces the number of idle instances as load levels return to normal after a spike. This helps your application maintain steady performance through fluctuations in request load, but it also raises the number of idle instances (and consequently running cost) during periods of heavy load. Lowering the maximum number of idle instances can reduce cost.

Pending Latency

Pending latency is the time that a request spends in a pending queue for an app server. You can set minimum and maximum values for this parameter.

When an App Engine front end receives a request from a user and no instance is available to service that request, the request is added to a pending queue until an instance becomes available. App Engine tracks how long requests are held in this queue. If requests are held for too long, App Engine creates another instance to distribute the load. Figure 7 shows how instances are added or deleted based on traffic volume.

Figure 7: Busy instances, pending queue and pending latency

The minimum pending latency is the expected and acceptable latency for the pending queue. App Engine will always wait the specified minimum pending latency for an instance to become available. Once the minimum is reached, App Engine applies heuristics to determine whether and when to start an additional instance.[4] (Waiting for an existing instance to become available may be faster, and it is certainly cheaper, than starting a new one.)

The maximum pending latency is the threshold of unacceptable latency. If a request is still pending when the specified maximum latency is reached, App Engine immediately starts a new instance to serve it. For example, if you set the maximum pending latency to one second, App Engine will create a new instance if a request has been waiting in the pending queue for more than one second. Adding more instances results in increased throughput and incurs more cost.

Note: If you have specified a minimum number of idle instances, the pending latency parameters will have little or no effect (unless there is a sustained traffic spike that grows to exhaust the idle instances faster than they can be initialized).

Best Practices and Anti-Patterns

Table 1 describes what it means to set the minimum and maximum values on idle instances and pending latency. Based on this matrix, you can optimize these parameters for your requirements.

Table 1: Semantics of idle instances and pending latency
Idle Instances Minimum [5]
  Specifies: the minimum number of resident instances.
  Low setting: fewer instances before a spike; lower cost.
  High setting: more instances before a spike; higher cost.

Idle Instances Maximum
  Specifies: the maximum number of resident instances.
  Low setting: fewer instances after a spike; lower cost.
  High setting: more instances after a spike; higher cost.

Pending Latency Minimum
  Specifies: the time to hold requests in the pending queue.
  Low setting: more instance creation; higher cost.
  High setting: slower responses; lower cost.

Pending Latency Maximum [6]
  Specifies: the time to wait before creating new instances.
  Low setting: more instance creation; higher cost.
  High setting: slower responses; lower cost.

For example, if you expect high traffic to your site because you have scheduled an event or expect major media coverage related to a product release, you could increase the minimum number of idle instances and decrease the maximum pending latency shortly before and during the event to smoothly handle traffic spikes.

Known anti-patterns are to set the minimum and maximum numbers of idle instances too close to each other, or to leave only a very small gap between the minimum and maximum pending latency. Either of these may cause unexpected scaling behavior in your application.

We recommend the following configurations (sketched in the example after this list):

  • Best performance—Increase the value for the minimum number of idle instances and lower the maximum pending latency while leaving the other settings on automatic.
  • Lowest cost—Keep the number of maximum idle instances low and increase the minimum pending latency while leaving the other settings on automatic.
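If your application uses modules, these two profiles might look like the following sketch in a module's configuration file; the numeric values are assumptions for illustration, not prescriptions:

```yaml
# A minimal sketch of the two recommended profiles for an automatic-scaling
# module; the numeric values are illustrative only.

# Best performance: keep warm instances and scale out aggressively.
automatic_scaling:
  min_idle_instances: 10
  max_pending_latency: 100ms
  # other settings left on automatic

# Lowest cost: few resident instances, wait longer before scaling out.
# (Alternative stanza -- use one or the other in a given file.)
# automatic_scaling:
#   max_idle_instances: 1
#   min_pending_latency: 500ms
```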

We also recommend that you conduct a load test of your application before trying out the recommended settings. This will help you choose the best values for idle instances and pending latency.

Minimizing Loading Request Time

It is also important to reduce the time it takes for a loading request to complete. Because a loading request must wait for instance initialization, it results in a poor experience for the users who happen to trigger it, and in extreme cases the request may time out.

Do the following to minimize the time required for loading requests:

  • Load only the minimum amount of code required for startup (see the lazy-import sketch after this list).
  • Access the disk as little as possible.
  • Load code from a zip or jar file, which is faster than loading from many separate files.
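For example, a minimal Python sketch of the first point is to defer a heavy import until first use; the reports module name is hypothetical:

```python
# A minimal sketch: defer a heavy import until it is first needed, so the
# module loads faster during a loading request. 'reports' is a hypothetical
# heavy dependency.
_reports = None

def get_reports():
    global _reports
    if _reports is None:
        import reports  # hypothetical module; imported lazily on first use
        _reports = reports
    return _reports
```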

If you cannot decrease the time required for a loading request to complete, you may need to have more idle instances to ensure responsiveness when the load increases. Reducing the loading request time increases the elasticity of your application and lowers the cost.

Conclusion

One of the biggest advantages of Google App Engine is that lightweight application instances can be added within a few seconds. This enables highly elastic scaling which adapts to sudden increases in traffic volume. To benefit from this power, you have to understand how requests are distributed to application instances, how to maximize the QPS of your application by increasing the throughput per instance, and how to control elasticity. By following best practices, you can build web applications that scale smoothly when traffic increases rapidly. In addition, following best practices helps you tune your application for an optimal balance of cost and performance.

Notes

  1. The term QPS is Google’s terminology to express requests per second. It includes all HTTP requests to the servers and is not restricted to search queries.
  2. For the mathematically precise: QPS is computed over the past 60 seconds. The seven instances handled 4401 requests in 60 seconds, for 4401 / 60 = 73.35 QPS, so the average is 73.35 / 7 = 10.479 QPS. For the first instance: 13.133 QPS implies that the instance processed 13.133 × 60 ≈ 788 requests.
  3. If you convert your application to use modules, this graphical interface is replaced by parameter settings in the per-module configuration files.
  4. App Engine knows what requests are outstanding, how long those requests are likely to take (from past statistics), and how loaded the various app servers are. This means it can predict whether an instance will be available in time to service a request before the maximum pending latency is reached.
  5. The minimum idle instances setting can be adjusted in the Console only for a paid app.
  6. The maximum pending latency setting can be adjusted in the Console only for a paid app.
