Play! Framework Thread Tuning
Posted by Mayumi on July 15, 2014
What exactly are threads and thread-pools?
At a high level, threads are best described as bookmarks for code in execution. The processor is the only part of the machine that can actually do work, and a typical machine has far more work than processors. We therefore use these bookmarks to let a processor work on different tasks little by little, switching between them and recording in each bookmark where it left off. This creates the illusion that work is done simultaneously; the switching itself is known as context-switching.
In the web context, every request is effectively new code in execution. An ExecutionContext, or thread-pool, is simply a set of threads pre-initialized in memory, waiting to service requests. The first thought that comes to mind when you learn about thread-pools is: if a thread is just a bookmark, why can't we create and destroy threads on an as-needed basis? That approach seems much cleaner! In a perfect world, that is exactly what we would do, in my opinion. On the JVM, however, it is expensive to create and destroy threads on demand. Creation costs a lot of resources and time, and destruction is even more expensive: we don't control when a particular resource gets cleaned up, because the JVM triggers cleanup, known as garbage collection, when its memory space fills up. One kind of garbage collection is the stop-the-world GC, which halts the entire application while cleaning takes place. By now you should get the idea: creating and destroying threads per task is not optimal because of this overhead.
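To make the bookmark idea concrete, here is a minimal, self-contained sketch, my own illustration rather than anything from Play, of a fixed pool of pre-initialized threads servicing many small tasks. The pool size and task count are arbitrary:

import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContextExecutorService, ExecutionContext, Future}

object PoolDemo extends App {
  // Four pre-initialized threads, reused for every task instead of
  // paying thread creation/destruction costs per task.
  implicit val pool: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  // Forty tasks share the four threads; the pool switches among them.
  val tasks = (1 to 40).map { i =>
    Future(println(s"task $i on ${Thread.currentThread.getName}"))
  }

  Await.ready(Future.sequence(tasks), 10.seconds)
  pool.shutdown()
}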
Different types of servers need different numbers of threads in their pools, and this directly affects the responsiveness of the application. Tuning the number of threads to maximize application performance is therefore vital.
Blocking
The most important step in thread tuning is to know when your services block. Blocking services include synchronous calls to a database, network IO, or socket connections. While processing such a service, the thread sits idle, waiting for the task to complete.
Motivation for Thread Tuning
Typical web servers are synchronous: for every request, a thread is completely occupied until the request completes. In this kind of environment, a large thread-pool is the way to achieve concurrency, via context-switching among threads. Play's documentation recommends around 300 threads in the thread-pool when the application is composed of blocking services.
Recently, we revamped our website and converted the majority of its services to be asynchronous, also known as non-blocking. A non-blocking service supports the notion of Futures. A Future is simply a contract ensuring that a result will become available at some future time.
In a typical asynchronous scenario, when a request is received, a Future is created and queued up. This Future is checked periodically by threads in the pool until it either completes or times out, so technically no thread is ever blocked. Play recommends its default threading configuration for this scenario: a minimal number of threads, namely the number of cores + 1. The idea is that we never need more threads than available processors, because tasks never block. In fact, tasks would take longer to complete with more threads in the pool, because the extra context-switching among them slows the tasks down unnecessarily.
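As a quick worked example of that sizing rule (a trivial illustration, not Play's actual source):

val cores = Runtime.getRuntime.availableProcessors // e.g. 4 on a quad-core machine
val defaultPoolSize = cores + 1                    // e.g. 5 threads in the default pool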
Available Types of Thread-Pools
I described threads as bookmarks in the introduction. A bookmark doesn't seem to do much on its own, but the context that governs the creation and behaviour of these bookmarks is very smart. This context is known as an ExecutorService, an interface that abstracts the asynchronous execution of tasks. Play Framework typically uses ThreadPoolExecutor or ForkJoinPool, both implementations of ExecutorService.
These ExecutorServices are similar in that both have an inbound queue in which tasks are queued up, and each thread takes tasks from this queue in a synchronized manner. Threads in a ThreadPoolExecutor always take work from this shared inbound queue, which adds synchronization overhead. Threads in a ForkJoinPool, however, also have their own local inbound queues. Once its own queue is empty, a thread will scan other threads' local queues and help complete their tasks; this is known as work stealing. Since threads mostly take work off their own queues, the synchronization overhead on the shared inbound queue is smaller compared to the ThreadPoolExecutor.
ForkJoinPool is the default ExecutorService chosen by Play Framework, and this is the executor we’re using for all our thread-pools.
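For reference, both executors can be constructed directly in plain Scala. This side-by-side sketch is my own illustration, not code from Play:

import java.util.concurrent.{Executors, ForkJoinPool}
import scala.concurrent.ExecutionContext

// ThreadPoolExecutor under the hood: all eight workers contend
// on a single shared inbound queue.
val sharedQueuePool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

// ForkJoinPool: each of the eight workers owns a local queue and
// steals work from other workers' queues when its own runs dry.
val workStealingPool = ExecutionContext.fromExecutorService(new ForkJoinPool(8))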
Testing Environment
To put the different thread configurations to the test, I used the Gatling stress-testing tool against localhost, monitoring latency and throughput. The test was performed with 3000 concurrent users and a ramp-up time of 20 seconds, which gives 150 new users per second, with a 4-second pause before the start to give the server a little time to warm up. These numbers were based on the limitations of my Mac OS X machine.
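For reference, a simulation approximating this setup might look like the sketch below, assuming Gatling 2's Scala DSL (current at the time of writing); the endpoint path /sync is hypothetical:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class ThreadTuningSimulation extends Simulation {
  val httpConf = http.baseURL("http://localhost:9000")

  val scn = scenario("thread-tuning")
    .pause(4.seconds)                   // warm-up pause before requests start
    .exec(http("request").get("/sync")) // hypothetical endpoint under test

  // 3000 users ramped over 20 seconds = 150 new users per second.
  setUp(scn.inject(rampUsers(3000) over (20.seconds))).protocols(httpConf)
}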
First Step in Thread Tuning: Know the Nature of Your Application
The first step in tuning threads is to learn and understand the nature of your application: specifically, whether the services within the application are asynchronous (non-blocking) or synchronous (blocking). A service is asynchronous, or non-blocking, if it returns a Future; otherwise it is blocking.
Configuring Thread-Pools in Play
To define a new thread-pool in Play, we simply add a nested, JSON-like entry in HOCON format ("Human-Optimized Config Object Notation") to the application.conf file, as shown below. Every application has a default context called default-dispatcher, which is initialized with one thread per core, capped at 24. To change the configuration of the default context, we simply override it. Any other pool defined in the file is a custom thread-pool.
//Overriding the default-dispatcher
play {
  akka {
    actor {
      default-dispatcher = {
        fork-join-executor {
          parallelism-factor = 1.0 //ceil(available processors * factor)
          parallelism-max = 24     //upper cap
        }
      }
    }
  }
}

//A custom thread-pool
play {
  blocking-pool = {
    fork-join-executor {
      parallelism-min = 100 //lower cap, with parallelism-factor of 1
      parallelism-max = 100 //upper cap, with parallelism-factor of 1
    }
  }
}
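To use a custom pool from application code, it has to be looked up as an ExecutionContext. The examples below import Context.blockingPool, so a definition along the following lines is assumed; the object itself is not shown in this post, and this sketch simply follows the lookup pattern from Play's thread-pool documentation:

import play.api.Play.current
import play.api.libs.concurrent.Akka
import scala.concurrent.ExecutionContext

object Context {
  // Looks up the pool defined under play.blocking-pool in application.conf.
  implicit val blockingPool: ExecutionContext =
    Akka.system.dispatchers.lookup("play.blocking-pool")
}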
Handling Blocking (Synchronous) Services
Any service that does not return a Future is considered blocking. Blocking simply means that the thread processing the request is occupied, and unusable, for the entire span of the request. If you have long-running blocking services, they can easily eat up all the threads in the pool and cause starvation: the thread-pool is blocked up, and no new requests can be serviced until some threads become available. This directly affects the responsiveness of the application.
Wrapping blocking services with Future{ }
Any code can be wrapped with Future{ }. What exactly does it mean to wrap code in a Future? In simple terms, the block of code is executed by a worker thread taken from the thread-pool in scope, instead of by the thread that initiated the execution.
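A tiny way to see this for yourself (my own illustration, using the global pool) is to print the thread name outside and inside the wrapped block:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object WhichThread extends App {
  println(s"caller thread: ${Thread.currentThread.getName}")
  Future {
    // Runs on a worker thread drawn from the implicit pool,
    // not on the thread that created the Future.
    println(s"worker thread: ${Thread.currentThread.getName}")
  }
  Thread.sleep(100) // give the worker a moment to print before the JVM exits
}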
A typical Action looks like the one below. This action is always executed by the default-dispatcher, even if a dedicated thread-pool is specified in the global scope.
def sync = Action {
  Thread.sleep(100)
  Ok("Done")
}
By wrapping the blocking code in a Future and bringing a dedicated thread-pool into scope, we instruct Play to execute the wrapped code in the blockingPool instead of the default-dispatcher. Note: the initial request is still always received by the default-dispatcher, even when the dedicated thread-pool is in the global scope.
import Context.blockingPool
import scala.concurrent.Future

def async = Action.async {
  Future {
    Thread.sleep(100)
    Ok("Done")
  }
}
Using Future with blocking{ }
There is also a way to tell the thread-pool that a block of code is going to block: wrapping the code inside Future{ } with blocking{ } (provided by scala.concurrent), as you can see below. Note that this only works on a worker thread, which means blocking{ } without an outer Future{ } has no effect.
Essentially, this signals that the underlying code is going to block and could cause thread-pool starvation, which simply means blocking up the entire pool. To avoid the starvation, the ForkJoinPool spawns a new temporary thread and performs the blocking operation there; the temporary thread is destroyed once the task completes. We observe a temporary spike in the thread-pool size beyond parallelism-max, but the pool eventually shrinks back to its limit. The downside of this method is the possibility of an out-of-memory error if requests to the long-running blocking operation become concentrated. Stress testing showed that the temporary threads are constantly destroyed by minor GC, which does not add much overhead to the running application. However, it is hard to judge the implications for GC without observing this scenario long-term in a production environment. The uncertain potential for uncontrollable GC, or for filling up the heap, is reason enough to avoid this solution if possible.
import Context.blockingPool
import scala.concurrent.{Future, blocking}

def async = Action.async {
  Future {
    blocking {
      Thread.sleep(100)
      Ok("Done")
    }
  }
}
Using a Dedicated Thread-Pool for Blocking Services
The final suggestion is to create a dedicated thread-pool for the blocking services.
The pool size is usually set to the maximum number of concurrent connections the blocking service in question allows. For example, some databases only permit a finite number of concurrent connections; it then makes sense to set the pool size to that connection limit, since connections beyond it would be rejected anyway (see the sketch below). Compared with the two methods above, this solution performed identically to the blocking{ } solution in both latency and throughput.
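As a concrete sketch: if the database accepts at most 50 concurrent connections, the pool can be capped to match. The pool name database-pool and the limit of 50 are hypothetical, following the configuration format shown earlier:

//Hypothetical: a pool capped at a database's 50-connection limit
play {
  database-pool = {
    fork-join-executor {
      parallelism-min = 50 //lower cap
      parallelism-max = 50 //upper cap
    }
  }
}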
Results of Gatling stress test
Code | ExecutionContext | Throughput (requests/s) | Latency (95th percentile)
---|---|---|---
Action { sleep(100) } | default (max 24) | 77 | 23 s
Action.async { Future { sleep(100) } } | default (max 24) | 79 | 19 s
Action.async { Future { sleep(100) } } | custom (max 150) | 149 | 105 ms
Action.async { Future { blocking { sleep(100) } } } | default (max 24), exceeded | 149 | 103 ms
Action.async { Future { blocking { sleep(100) } } } | custom (max 50), exceeded | 149 | 103 ms
In short, blocking code is better handled either in a dedicated thread-pool or with blocking{ }. To avoid the potential for out-of-memory errors and unwanted GC, handling blocking code with a dedicated thread-pool is recommended.
Handling Non-blocking (Asynchronous) Services
Non-blocking services return a Future. When a request arrives, the task is queued and only executed once all of its inputs are ready, so each worker thread is either executing a task at full CPU or sitting entirely idle, waiting for a ready task to appear in its queue. It follows that if there are always ready tasks in each worker thread's queue, a worker thread never needs to be context-switched by the OS, since it never idles. In other words, a large thread-pool could actually slow down fast tasks through rapid switching.
import play.api.libs.concurrent.Execution.Implicits._
import play.api.libs.ws.WS

def async = Action.async {
  WS.url("http://yahoo.jp").get.map { x => Ok(x.body) }
}
Results of Gatling stress test
Code | ExecutionContext | Throughput (requests/s) | Latency (95th percentile) | Exceptions
---|---|---|---|---
Action.async { WS.url("…") } | default (max 24) | 105 | 7 s | 0
Action.async { WS.url("…") } | custom (max 200) | 45 | 38 s | 2 (timeout)
Conclusion
As the observations above show, there is no single magical tuning method that will save every application out there. The first step is to learn and understand the nature of your application; then tune the thread model gradually while monitoring and observing the behaviour of its components. Play Framework makes this process very easy.
Happy Tuning ^_^