Assume It Worked and Fix Later
How to Make Your App Faster and More Reliable
During account signup, a web server will make an HTTP request to send an email. Not only are synchronous requests slow, but if the remote host is unresponsive, the application can become unresponsive.
A simple way to improve performance is to use a library like async to concurrently make requests while doing other computations. However, if you need the result of an outgoing network request, you will still have reliability issues if the remote host goes down.
Often, when a handler makes outgoing requests, the client does not need the response. Signup emails can meet this criterion, but time-sensitive notifications are an even better example, since they are usually a best-effort service anyway.
If your email service is down, it can be beneficial to have the signup succeed regardless. By decoupling the success of an email request from the success of account signup, you can improve the reliability of your application. That is the “assume it worked” part of the title. You will still need to persist a record of which messages were sent, plus a periodic job to send the ones that were not, which is the “fix it later” part. Depending on your requirements, there might not be a “fix it” phase at all.
In an ideal world, you would have a durable queue service like Kafka, co-located with your server, with low or sub-millisecond latency. This magical Kafka is a better and simpler solution than the ones I will present. However, you might not find yourself in such a blessed circumstance.
I’ll walk through an example of making emails non-blocking, using the Amazon Simple Email Service and the corresponding amazonka package, amazonka-ses.
Synchronous Baseline
The simplest method is to make a call to AWS SES inline to send an email.
    post "/user" $ do
      input <- Scotty.body
      email <- maybe missingEmailError return
             $ input ^? key "email" . _String
      resp <- liftIO
            $ runResourceT
            $ runAWS env
            $ AWS.send
            $ makeEmail email
      logFailedRequest resp
      -- Imagine there is code here for
      -- inserting a user into the database
      json $ object ["id" .= email]
Attempt 1: Fork a Thread
An easy way to achieve non-blocking asynchronous behavior is to fork a thread every time one needs to send an email.
    liftIO $ forkIO $ handle logExcept $ do
      resp <- liftIO
            $ runResourceT
            $ runAWS env
            $ AWS.send
            $ makeEmail email
      logFailedRequest resp
If AWS becomes slow, or is timing out, my threads will queue up. The threads will start to eat resources and, if things get bad enough, my app could become unresponsive and crash.
Forking another thread has solved the performance problem in the typical case, but I have increased systemic risk if AWS SES goes down. An email service outage can now cause my whole app to crash. Before, only the account creation requests would fail.
Solution 1: Add a Thread with a Timeout
To limit the number of threads that can build up, we can add a timeout:
    liftIO $ forkIO $ handle logExcept $
      logTimeout <=< timeout (60 * 1000000) $ do
        resp <- liftIO
              $ runResourceT
              $ runAWS env
              $ AWS.send
              $ makeEmail email
        logFailedRequest resp
As long as the rate of signups is below our maximum number of concurrent requests, a problematic email service will not take down our site.
The downside is that it is a little unclear whether we have prevented catastrophic failure. For one, we need to estimate our maximum number of concurrent signups. If our rate of signups were 10,000 a minute just as the email service went down, we could be in trouble … but we would probably be in trouble even if the email service was up. That’s a lot of signups. Also, we picked an arbitrary time of one minute for the timeout. It is possible that this value is too small and we are timing out email requests that would have succeeded.
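To make the timeout's behavior concrete, here is a minimal sketch that uses only base. The names `slowSend`, `sendWithTimeout`, and `demoTimeout` are hypothetical stand-ins for the SES call and are not part of amazonka. The point is only that `timeout` returns `Nothing` when the limit expires and `Just` the result otherwise.

```haskell
import Control.Concurrent (threadDelay)
import System.Timeout (timeout)

-- Hypothetical stand-in for the SES request: sleeps for the given number
-- of microseconds, then reports success.
slowSend :: Int -> IO String
slowSend micros = do
  threadDelay micros
  return "sent"

-- Run a send with a time limit (both arguments are in microseconds).
-- 'timeout' yields Nothing if the limit expires first, Just result otherwise.
sendWithTimeout :: Int -> Int -> IO (Maybe String)
sendWithTimeout limitMicros sendMicros =
  timeout limitMicros (slowSend sendMicros)

-- One send that finishes well inside its limit and one that overruns it.
demoTimeout :: IO (Maybe String, Maybe String)
demoTimeout = do
  fast <- sendWithTimeout 500000 1000     -- 1 ms of work, 500 ms limit
  slow <- sendWithTimeout 50000 2000000   -- 2 s of work, 50 ms limit
  return (fast, slow)
```

Evaluating `demoTimeout` in GHCi gives `(Just "sent",Nothing)`. The second request is cancelled after 50 ms, which is exactly the risk described above: a limit that is too tight cancels requests that would eventually have succeeded.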
We also don’t have any way to limit the number of simultaneous requests, or to ensure that all of the threads have finished before we shut down. This change could be a solution, but it leaves room for improvement.
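One way to cap simultaneous requests while still forking a thread per email is a semaphore. This is a sketch under assumptions, not what the example project does: `withSlot`, `fakeSend`, and `peakConcurrency` are hypothetical names, and `fakeSend` only sleeps instead of calling SES. It uses `QSem` from base.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar
  (MVar, modifyMVar_, newEmptyMVar, newMVar, putMVar, readMVar, takeMVar)
import Control.Concurrent.QSem (QSem, newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_, finally)
import Control.Monad (forM, forM_)

-- Run an action while holding one of the semaphore's slots.
-- bracket_ releases the slot even if the action throws.
withSlot :: QSem -> IO a -> IO a
withSlot sem = bracket_ (waitQSem sem) (signalQSem sem)

-- Hypothetical stand-in for a send. It records how many sends are in
-- flight as a (current, peak) pair, then sleeps for 10 ms.
fakeSend :: MVar (Int, Int) -> IO ()
fakeSend inFlight = do
  modifyMVar_ inFlight $ \(cur, peak) -> return (cur + 1, max peak (cur + 1))
  threadDelay 10000
  modifyMVar_ inFlight $ \(cur, peak) -> return (cur - 1, peak)

-- Fork `jobs` fake sends, allowing at most `limit` in flight at once.
-- Returns the highest number of sends observed running simultaneously.
peakConcurrency :: Int -> Int -> IO Int
peakConcurrency limit jobs = do
  sem      <- newQSem limit
  inFlight <- newMVar (0, 0)
  dones <- forM [1 .. jobs] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO $
      withSlot sem (fakeSend inFlight) `finally` putMVar done ()
    return done
  forM_ dones takeMVar        -- wait for every forked thread to finish
  snd <$> readMVar inFlight
```

`peakConcurrency 3 20` never reports more than 3. Note what this does not fix: the threads waiting for a slot still pile up, so memory use is still unbounded when the email service stalls.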
Solution 2: Bounded Queue
Instead of forking a thread for every request, we need a way to quickly queue notification requests. The queue should be bounded, have non-blocking writes (we will just log failures) and blocking reads, so TBMQueue will suffice.
First we create our queue and worker thread during server startup:
    worker :: Env -> TBMQueue SendEmail -> IO ()
    worker env queue = do
      -- Make a loop that closes over the env and queue vars.
      let go = do
            -- Block waiting for a new email to send
            mpayload <- liftIO $ atomically $ readTBMQueue queue
            case mpayload of
              -- Nothing means the queue is closed and empty.
              -- Stop the loop, ending the thread.
              Nothing -> return ()
              Just payload -> do
                resp <- AWS.send payload
                logFailedRequest resp
                -- Start the loop again
                go
      handle logExcept $ runResourceT $ runAWS env go

    main = do
      env <- newEnv Discover
      queue <- newTBMQueueIO 100000
      threadId <- forkIO $ worker env queue
      scotty ...
We write a simple helper function for enqueueing:
    enqueueEmail :: TBMQueue SendEmail -> Text -> IO ()
    enqueueEmail queue email = do
      msuccess <- atomically
                $ tryWriteTBMQueue queue
                $ makeEmail email
      case msuccess of
        Nothing -> putStrLn "Wat!! The email queue is closed?"
        Just success -> unless success
                      $ putStrLn "Failed to enqueue email!"
We can then use the queue in the handler:
    post "/user" $ do
      input <- Scotty.body
      email <- maybe missingEmailError return
             $ input ^? key "email" . _String
      liftIO $ enqueueEmail queue email
      json $ object ["id" .= email]
We’re done. No matter how slow the email service gets, our app will use, at most, the memory in our bounded queue (which is small) and only the resources needed for our worker thread. Our worst-case situation is that we will fail to send some emails, but our app will stay stable and the performance will be good.
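To see the three outcomes of a non-blocking write in isolation, here is a small sketch that needs the stm and stm-chans packages. The `Enqueued` type and the `offer` and `demoOffer` names are hypothetical, introduced only for illustration. The queue holds plain strings instead of `SendEmail` values, and the capacity of 2 is chosen so that it fills up immediately.

```haskell
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TBMQueue
  (TBMQueue, closeTBMQueue, newTBMQueueIO, tryWriteTBMQueue)

-- The three possible outcomes of a non-blocking write.
data Enqueued = Accepted | QueueFull | QueueClosed
  deriving (Eq, Show)

-- Non-blocking enqueue: never waits for space to free up.
offer :: TBMQueue a -> a -> IO Enqueued
offer queue x = do
  result <- atomically $ tryWriteTBMQueue queue x
  return $ case result of
    Nothing    -> QueueClosed  -- the queue was closed
    Just True  -> Accepted     -- the item was written
    Just False -> QueueFull    -- no free slot, item dropped

demoOffer :: IO [Enqueued]
demoOffer = do
  queue <- newTBMQueueIO 2 :: IO (TBMQueue String)
  a <- offer queue "a@example.com"
  b <- offer queue "b@example.com"
  c <- offer queue "c@example.com"  -- queue already holds two items
  atomically $ closeTBMQueue queue
  d <- offer queue "d@example.com"  -- queue is closed
  return [a, b, c, d]
```

`demoOffer` returns `[Accepted,Accepted,QueueFull,QueueClosed]`. The `QueueFull` case is the one that keeps memory bounded: when the worker falls behind, new emails are dropped (and logged) instead of accumulating.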
Making It Real
Okay, so we’re not done. We have to handle gracefully draining the queue on shutdown. We also have to restart the thread after exceptions.
To help us with shutdowns and restarts, we’ll use a library called immortal. immortal provides threads that restart after exceptions and that we can wait on for completion. It also uses proper exception-masking hygiene to set up an exception handler on the newly spawned thread, something I have elided in the examples above (the example project handles it better, and it doesn’t really matter for these examples).
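To illustrate the restart idea on its own, here is a deliberately simplified sketch using only base. It is not how immortal is implemented: `superviseUntilDone`, `flakyWorker`, and `demoSupervise` are hypothetical names, there is no exception masking, and the loop stops as soon as the action returns normally, whereas the worker below stops its thread explicitly with `stop`.

```haskell
import Control.Exception (SomeException, try)
import Data.IORef (IORef, atomicModifyIORef', newIORef)

-- Run `action` repeatedly until it returns normally, restarting it each
-- time it throws. Returns how many restarts were needed.
-- Caveat: SomeException also covers asynchronous exceptions such as
-- ThreadKilled, so a real supervisor needs more care than this.
superviseUntilDone :: IO () -> IO Int
superviseUntilDone action = go 0
  where
    go restarts = do
      result <- try action :: IO (Either SomeException ())
      case result of
        Right () -> return restarts
        Left _   -> go (restarts + 1)

-- Hypothetical flaky worker: throws on its first two runs, then succeeds.
flakyWorker :: IORef Int -> IO ()
flakyWorker attempts = do
  n <- atomicModifyIORef' attempts (\k -> (k + 1, k + 1))
  if n <= 2
    then ioError (userError ("send failed on attempt " ++ show n))
    else return ()

demoSupervise :: IO Int
demoSupervise = do
  attempts <- newIORef 0
  superviseUntilDone (flakyWorker attempts)
```

`demoSupervise` returns `2`: the worker fails twice, is restarted twice, and finishes on its third run.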
Our new worker function will look like:
    worker :: Thread -> Env -> TBMQueue SendEmail -> IO ()
    worker thread env queue = do
      -- Make a loop enclosing the thread, env, and queue vars.
      let go :: AWS ()
          go = do
            -- Block waiting for a new email to send
            mpayload <- liftIO $ atomically $ readTBMQueue queue
            case mpayload of
              -- Nothing means the queue is closed and empty.
              -- Stop the loop and kill the thread.
              Nothing -> liftIO $ stop thread
              Just payload -> do
                resp <- AWS.send payload
                logFailedRequest resp
                -- Start the loop again
                go
      handle logExcept $ runResourceT $ runAWS env go
The only thing that changed is that we now take in a Thread and stop the Thread when the queue is empty and closed with:
    Nothing -> liftIO $ stop thread
To create the thread in our main function, we write:
    thread <- create $ \thread -> worker thread env queue
and right before main finishes, we add:
    atomically $ closeTBMQueue queue
    wait thread
which will close the queue and prevent the program from exiting until the queue has been drained. I extended this to multiple workers in the example project.
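The close-then-wait sequence can be tried without immortal or AWS. In this sketch (which needs stm and stm-chans), an `MVar` stands in for immortal's `wait`, and `drain` and `demoDrain` are hypothetical names. The "email" is just an `Int` appended to a list.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TBMQueue
  (TBMQueue, closeTBMQueue, newTBMQueueIO, readTBMQueue, writeTBMQueue)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Worker loop: handle items until the queue is closed AND empty.
drain :: TBMQueue a -> (a -> IO ()) -> IO ()
drain queue handleItem = do
  next <- atomically $ readTBMQueue queue
  case next of
    Nothing -> return ()
    Just x  -> handleItem x >> drain queue handleItem

demoDrain :: IO [Int]
demoDrain = do
  queue <- newTBMQueueIO 10
  sent  <- newIORef []
  done  <- newEmptyMVar
  _ <- forkIO $ do
    drain queue (\x -> modifyIORef' sent (x :))
    putMVar done ()                   -- signal that the worker has finished
  mapM_ (atomically . writeTBMQueue queue) [1, 2, 3]
  atomically $ closeTBMQueue queue    -- no new items are accepted from here on
  takeMVar done                       -- block until the remaining items are handled
  reverse <$> readIORef sent
```

`demoDrain` returns `[1,2,3]`: closing the queue does not discard items that are already in it, so the worker still processes them before it sees `Nothing` and exits.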
Further Considerations
In our simple example, we are merely logging issues, but a real system might want to backfill the missing emails; having a method for storing which customers have been sent an email could be helpful.
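As a rough sketch of what that bookkeeping could look like, the following keeps one record per email and picks out the ones a periodic job should retry. Everything here is an assumption for illustration: the `EmailRecord` shape, the integer timestamps, the grace period, and the `send` callback are hypothetical, and a real system would keep these records in the database, not in a list.

```haskell
data Status = Pending | Sent
  deriving (Eq, Show)

data EmailRecord = EmailRecord
  { recipient :: String
  , queuedAt  :: Int      -- seconds since some epoch
  , status    :: Status
  } deriving (Eq, Show)

-- True for records that are still Pending after the grace period.
overdue :: Int -> Int -> EmailRecord -> Bool
overdue now graceSeconds r =
  status r == Pending && now - queuedAt r >= graceSeconds

-- The records a backfill job should retry.
needsBackfill :: Int -> Int -> [EmailRecord] -> [EmailRecord]
needsBackfill now graceSeconds = filter (overdue now graceSeconds)

-- One pass of the periodic job: resend each overdue record and mark the
-- successes as Sent. `send` is a hypothetical action returning True on success.
backfillPass :: (String -> IO Bool) -> Int -> Int -> [EmailRecord] -> IO [EmailRecord]
backfillPass send now graceSeconds = mapM step
  where
    step r
      | overdue now graceSeconds r = do
          ok <- send (recipient r)
          return (if ok then r { status = Sent } else r)
      | otherwise = return r

sampleRecords :: [EmailRecord]
sampleRecords =
  [ EmailRecord "a@example.com" 100 Pending  -- old and unsent: retry
  , EmailRecord "b@example.com" 100 Sent     -- already delivered
  , EmailRecord "c@example.com" 900 Pending  -- still inside the grace period
  ]
```

With `now = 1000` and a grace period of 300 seconds, `map recipient (needsBackfill 1000 300 sampleRecords)` is `["a@example.com"]`. The grace period gives the in-memory worker a chance to send the email before the backfill job retries it.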
Additionally, the loop should be extended to emit events useful for monitoring, such as sending the queue size to a metrics server. I’ll cover this in a future blog post, which, as a bonus, will include a trick for testing imperative code.
Conclusion
The steps here were presented in order of increasing complexity and developer effort. It doesn’t take much to write the final version, but it’s fine to scale up based on your needs and experience (I would skip attempt one and just add the timeout); just make sure you understand the tradeoffs. Ultimately, having a durable queue like Kafka is probably best, but it is always good to have options.
The “fix it later” portion will require polling the database for unsent emails. I haven’t covered how to do that. By using an in-memory queue, you reduce how often you need to poll to deliver emails promptly. Additionally, in cases other than a signup email, such as time-sensitive real-time notifications, you might not need to backfill missing notifications at all. amazonka is just really easy to use, so I chose it as an example.
If you want to play with the examples above, take a look at this demo web server project, which highlights them: https://github.com/jfischoff/asynchronous-email-example