Let’s get things started with a coarse-grained overview of the physical makeup of the Evernote service. I won’t go into a lot of detail on each component here; we’ll aim to talk about the interesting bits in separate posts later.
Starting at the top-left corner of the diagram, all stats as of May 17th, 2011 …
Networking: Virtually all traffic to and from Evernote comes to www.evernote.com via HTTPS port 443. This includes all “web” activity, but also all client synchronization via our Thrift-based service API. Altogether, that produces up to 150 million HTTPS requests per day, with peak traffic around 250 Mbps. (Unfortunately for our semi-nocturnal Operations team, this daily peak tends to arrive around 6:30 am, Pacific time.)
We use BGP to direct traffic through fully independent network feeds from our primary (NTT) and secondary (Level 3) providers. This is filtered through Vyatta on the way to the new A10 load balancers that we deployed in January when we hit the limit of SSL performance for our old balancers. We’re comfortably handling existing traffic using one AX 2500 plus a failover box, and we’re planning to test their N+1 configuration in our staging cluster to prepare for future growth.
Shards: The core of the Evernote service is a farm of servers that we call “shards.” Each shard handles all data and all traffic (web and API) for a cohort of 100,000 registered Evernote users. Since we have more than 9 million users, this translates into around 90 shards.
Physically, shards are deployed as a pair of SuperMicro boxes with two quad-core Intel processors, a ton of RAM, and a full chassis of Seagate enterprise drives in mirrored RAID configurations. On top of each box, we run a base Debian host that manages two Xen virtual machines. The primary VM on a box runs our core application stack: Debian + Java 6 + Tomcat + Hibernate + Ehcache + Stripes + GWT + MySQL (for metadata) + hierarchical local file systems (for file data).
All user data on the primary VM on one box is kept synchronously replicated to a secondary VM on a different box using DRBD. This means that each byte of user data is on at least four different enterprise drives across two different physical servers, plus nightly backups. If we have any problems with a server, we can fail the services from its primary VM over to the secondary on another box with minimal downtime via Heartbeat.
Since each user’s data is completely localized to one (virtual) shard host, we can run each shard as an independent island with virtually no crosstalk or dependencies. This means that issues on one shard don’t snowball to other shards.
To connect users with their shard, we push most of the work into the load balancers, which have a cascade of rules to find the shard in the URL and/or cookies.
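For illustration, here’s a rough sketch of what that kind of lookup could look like if you expressed it in application code instead of load-balancer rules. The “/shard/s12/…” path convention and the cookie fallback here are made-up stand-ins, not our actual routing rules:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative shard lookup only: the real routing happens in the A10
 * load balancers, and the URL/cookie formats below are assumptions.
 */
public class ShardRouter {

    // Hypothetical convention: shard-bound URLs look like /shard/s12/...
    private static final Pattern SHARD_IN_URL = Pattern.compile("^/shard/(s\\d+)/");

    public String resolveShard(String requestPath, String shardCookie) {
        Matcher m = SHARD_IN_URL.matcher(requestPath);
        if (m.find()) {
            return m.group(1);      // e.g. "s12" found directly in the URL
        }
        if (shardCookie != null && shardCookie.length() > 0) {
            return shardCookie;     // fall back to a sticky shard cookie
        }
        return null;                // unknown: the account database would be consulted
    }
}
```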
UserStore: While the vast majority of all data is stored within the single-tier NoteStore shards, they all share a single master “UserStore” account database (also MySQL) with a small amount of information about each account, such as the username, MD5-hashed password, and the user’s shard ID. This database is small enough to fit in RAM, but we maintain high redundancy with the same combination of RAID mirroring, DRBD replication to a secondary, and nightly backups.
AIR processors: In order to allow you to search for words found within images in your notes, we maintain a pool of 28 servers that spend each day using their 8 cores to process new images. On a busy day, this translates into 1.3 or 1.4 million separate images. Currently, these use a mix of Linux and Windows, but we plan to convert them all to Debian by the end of the month now that we’ve removed some pesky legacy dependencies.
These servers run a pipeline of “Advanced Imaging and Recognition” (AIR) software developed by our R&D team. This software cleans up each image, identifies word-shaped regions, and then attempts to compile a weighted list of possible matches for each word using a set of “recognition engines” that each contribute a set of guesses. This includes engines developed by our own team which specialize in, for example, handwriting recognition, as well as licensed technologies from best-of-breed commercial partners.
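To give a flavor of the “weighted list” idea, here’s a hypothetical sketch of how guesses from several engines could be merged into one ranked candidate list. The Guess type and the simple score-summing are illustrative stand-ins, not our actual AIR code:

```java
import java.util.*;

/** Hypothetical sketch: combine per-engine guesses into one weighted candidate list. */
public class GuessMerger {

    /** A single engine's guess for one word-shaped region. */
    public static class Guess {
        final String text;
        final double confidence;    // 0.0 .. 1.0, engine-specific
        Guess(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    /** Sum the confidence that each candidate string receives across all engines. */
    public static List<Map.Entry<String, Double>> merge(List<List<Guess>> perEngineGuesses) {
        Map<String, Double> scores = new HashMap<String, Double>();
        for (List<Guess> engine : perEngineGuesses) {
            for (Guess g : engine) {
                Double prev = scores.get(g.text);
                scores.put(g.text, (prev == null ? 0.0 : prev) + g.confidence);
            }
        }
        List<Map.Entry<String, Double>> ranked =
                new ArrayList<Map.Entry<String, Double>>(scores.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return Double.compare(b.getValue(), a.getValue());   // highest score first
            }
        });
        return ranked;
    }
}
```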
Other services: All of these servers are racked in a pair of dedicated cages at our data center in Santa Clara, California. In addition to the hardware that provides our core service, we also have smaller groups of servers for lighter-weight tasks that only require one or two boxes or Xen virtual machines. For example, our “incoming email” SMTP gateway is a pair of Debian servers with Postfix and a custom Java mail processor built on top of Dwarf. Our @myen Twitter gateway is a simple in-house daemon using twitter4j.
Our corporate web site is Apache, our blogs are WordPress, most of our fully redundant internal switching topology is from HP, we use Puppet for configuration management, and we monitor with Zabbix, Opsview, and AlertSite. We run nightly backups with a combination of different software that migrates data over a dedicated 1Gbps link to a secondary data center.
Wait, but why? I realize this post leaves lots of obvious questions about why we’ve chosen to do X instead of Y in a number of different places. Why run our own servers instead of using a cloud provider? Why such stuffy old software (Java, SQL, local storage, etc.) instead of hot new magic bullets? …
We’ll try to get into more details to answer these questions in the next few months.
(*) UPDATE, June 29, 2011: The title of this post was changed from “Architectural Digest” at the request of Conde Nast.
May 18, 2011 at 6:06 pm | link
Hold on, there really aren’t any elephants? Damn….
May 18, 2011 at 6:09 pm | link
Minor typo in your graphic. It says “serves (28+)”. I know you serve at least 28 people but I think you meant “servers (28+)”.
BTW I appreciate your starting a technical blog like this.
May 18, 2011 at 6:30 pm | link
Thanks! I did the diagram with the web-based (flash-based) tool “Gliffy” to try it out. It’s a nice tool, but I found that it couldn’t keep up with my typing at times, so keystrokes would occasionally get dropped…
May 18, 2011 at 6:11 pm | link
I’m a bit concerned that you use MD5 as the hashing algorithm for user passwords. MD5 is flawed. Do you at least salt them before hashing?
May 18, 2011 at 6:34 pm | link
Yes, we salt the passwords with a large random value, but the MD5 flaws aren’t really relevant to internal password storage. MD5’s vulnerabilities are mostly relevant for things like digital signatures, where you already have access to a particular MD5 and want to come up with a sequence of bits that will hash to the same value. In the case of internal password storage, you don’t have access to the MD5 hashes, and don’t really have the ability to feed arbitrary bits into the input function.
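For illustration only, here’s roughly what salted MD5 hashing looks like with the standard JDK classes. The 16-byte salt and UTF-8 encoding are just assumptions for the sketch, not a description of our actual scheme:

```java
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

/** Illustrative salted-MD5 sketch only; salt length and encoding are assumptions. */
public class SaltedMd5 {

    public static byte[] newSalt() {
        byte[] salt = new byte[16];          // hypothetical 16-byte random salt
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    public static byte[] hash(String password, byte[] salt) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(salt);                                          // mix the salt in first
        md5.update(password.getBytes(Charset.forName("UTF-8")));   // then the password bytes
        return md5.digest();
    }
}
```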
May 19, 2011 at 6:31 am | link
As far as I understand it, MD5 isn’t just vulnerable when you have the final hash, it’s also vulnerable to dictionary attacks, on account of it being so fast (http://codahale.com/how-to-safely-store-a-password/).
May 23, 2011 at 5:41 pm | link
Good article, Dave. I look forward to reading about the details. About the passwords, why not SHA-1 (or even better, SHA-2)? I don’t think it will affect the performance, right?
May 23, 2011 at 6:55 pm | link
Since the hashed password is never exposed outside of our data center, we don’t think that the differences between MD5 and SHA-1 are relevant. I.e. the risks for MD5 are about producing two inputs that match the same output. In the case of a purely back-end MD5 hash, any hypothetical attacker doesn’t have access to either the output (the MD5 hash) or the original input (the user’s password and our salt), so there really isn’t any productive attack based on MD5 vulnerabilities.
Of course, if someone has your original password and our salt, they might be able to come up with a SECOND synthetic password that hashes to the same value. (Since we constrain password characters that we accept, that might not be possible, but let’s assume it is…) The attacker could theoretically use this second password to access your account. But that same attacker could just use your original password, so I don’t see a real-world attack that would be improved with SHA1.
(Before Evernote, I spent five years building high-end cryptographic systems for government customers [e.g. http://www.isto.org/ose-site/file-fix/Public/corestreet-secure-access-control-govt.pdf], so I get to make use of my old crypto knowledge from time to time…)
May 23, 2011 at 8:29 pm | link
Dave, the other concern with MD5 has nothing to do with Evernote security, per se. Fairly comprehensive rainbow tables exist for MD5 hashes. If someone manages to snag the Evernote user table with hashed passwords, it may be quite feasible to reverse a large number of hashes into plaintext passwords. Since most people re-use passwords, this could expose people to collateral damage on other systems where the same email address and password were used.
The success of such an attack depends heavily on the salting you’re using, but it deserves acknowledgment — it’s more of a good citizenship issue in not exposing your users to collateral damage if you should have a security breach.
May 23, 2011 at 9:05 pm | link
Our password salting algorithm is significant enough that I’m not particularly worried about the risk of someone:
(A) getting a copy of our entire User database and then
(B) doing matches against another canned password database to get exact matches.
Or, to phrase it more precisely: We need to provide a lot of protection and attention against “(A)”, because someone who does this has penetrated through several layers of security and accessed a key database. That’s something we spend a lot of time thinking about. Protecting against “(A)” involves lots of boring work for things like: software patches, access control policies, physical security, etc.
While it’s worth mitigating against “(B)”, as we have, I think that this is really far down on the list of real-world security risks. But crypto stuff is always more fun to talk about than the boring stuff, and therefore tends to garner more attention from geeks like myself. That’s probably why there have been 3 comments about the “MD5” algorithm and none about “how do you keep bad guys from logging into your servers?”
Schneier has written some good pieces on the topic of over-focusing on cryptographic algorithms in security planning:
http://www.schneier.com/essay-021.html
http://www.schneier.com/essay-366.html
May 18, 2011 at 6:19 pm | link
The second paragraph under the heading “networking” discusses directing traffic, feeds, and providers, which must be the part regarding the elephants. Whoever provides these networks of elephants must be directing someone to feed them. Hope this is not too technical of an explanation.
May 18, 2011 at 6:49 pm | link
Interesting that you didn’t separate data storage and its processing in your shards to better utilize resources and lower TCO – as an Evernote user I assume that you store a huge amount of data, but require very minimal processing of it (except the image tasks done by the AIR cluster).
Was that done intentionally or am I missing something?
May 18, 2011 at 7:11 pm | link
Our current single-tier architecture gives a pretty good price/performance ratio while eliminating any single points of failure. E.g. we’d be comparing the performance of a direct-attached RAID against what we’d get with remote storage. Something like an NFS-based NAS is going to really hit our database performance, but you could achieve similar performance with an industrial-grade SAN.
The same level of redundancy would require redundant hardware with replication in the SAN tier, which starts to get a little bit pricey.
We’ve tried a few times to come up with an architecture with a storage tier that maintains 100% availability without increasing TCO, and haven’t been able to do so. Ultimately, stuffing drives into the primary servers and replicating with DRBD doesn’t add much overhead to the costs of each server. Any secondary storage tier adds a bunch of extra stuff onto the costs of the drives.
So we’re receptive to this in the future, but haven’t been able to come up with something cheaper that doesn’t hurt either availability or performance.
But we’ll talk about this more in a later post.
May 18, 2011 at 7:31 pm | link
Dave,
This is a really interesting glimpse into how you’ve created a working, scalable architecture – and kept it reasonably simple (the architecture) too. Many thanks for sharing this – I’m impressed with the way you guys work.
May 18, 2011 at 8:22 pm | link
Great article. Really nice of you guys to open it up.
If you could share some more details on the software layer in the future, that would be great. I’ve been working with a similar architecture (core application) and we have some downsides with Hibernate + GWT when we need to send data to the client (serialization).
Where does Stripes come in?
Once more, really nice to share this info.
Evernote is a wonderful service.
Regards,
Patrick
May 18, 2011 at 8:38 pm | link
Thanks!
Our GWT pipeline is a little bit crazy. We define our high-level API in Thrift (http://www.evernote.com/about/developer/api/) and then generate Java structures and methods using the Thrift compiler. Then we massage those so that they can be used by the GWT compiler, which lets one set of high-level object definitions work in all of our “clients”. This is completely independent of our Hibernate layer, so we have conversion code to go between Hibernate structures and higher-level Thrift structs.
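As a made-up example of that conversion layer (NoteEntity and TNote are stand-in names, not our real classes):

```java
/**
 * Hypothetical sketch of the kind of conversion code described above.
 * NoteEntity (Hibernate-mapped) and TNote (Thrift-generated) are stand-ins.
 */
public class NoteConverter {

    /** Hibernate-mapped entity (stand-in; mapping annotations omitted). */
    public static class NoteEntity {
        private String guid;
        private String title;
        private long updated;
        public String getGuid() { return guid; }
        public String getTitle() { return title; }
        public long getUpdated() { return updated; }
    }

    /** Thrift-generated struct (stand-in), shared with GWT and the other clients. */
    public static class TNote {
        public String guid;
        public String title;
        public long updated;
    }

    /** Copy the persistent fields into the wire-level struct. */
    public static TNote toThrift(NoteEntity entity) {
        TNote t = new TNote();
        t.guid = entity.getGuid();
        t.title = entity.getTitle();
        t.updated = entity.getUpdated();
        return t;
    }
}
```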
Stripes is used for most of our Java-based web site, e.g. the Settings page on your web account. Everything except the one /Home.action view of your own notes.
May 18, 2011 at 9:17 pm | link
Fantastic. Thanks
May 18, 2011 at 8:23 pm | link
This is pretty cool info Dave, and it’s awesome that you’re actually sharing it with the public. Kudos for that!
June 12, 2011 at 2:18 am | link
I feel so much happier now I understand all this. Thanks!
May 18, 2011 at 8:49 pm | link
I’m kind of a newbie to Java; I’m awed and even more inquisitive about how all of this Evernote stuff works.
Though most of the things you have written here are a bit incomprehensible to me right now, I have enjoyed reading this one. Keep it coming.
Kudos to the guy who came up with the idea to start a tech blog. People like me will learn a lot from the insights provided in this blog.
May 19, 2011 at 3:31 am | link
Just curious – why would the daily peak traffic occur at 6:30 AM on Friday?
May 19, 2011 at 3:47 am | link
Actually, the daily peak is between 6am and 7am every day. I.e. not just Friday. I think our weekly peak is Monday 6:30am.
Why? That’s when the end of the day in Japan overlaps with the end of the work day in Europe and the start of the day on the east coast of the US. The majority (62%, I think) of our usage is outside of the US at this point, so the overlap of time zones is the key.
May 22, 2011 at 8:23 pm | link
I had a customer in the UK that serviced New York and Japan. You can imagine that – far from the double-hump workload – their machines were continually full. No time to run the batch window work.
May 19, 2011 at 6:51 am | link
Do you plan on distributing an “on-premise” package? For some companies a cloud solution is a no-go.
May 19, 2011 at 2:58 pm | link
Unfortunately, this would be very difficult, since Evernote isn’t just a simple web service. People who love Evernote tend to use it because it runs on their Mac or their PC or their iPad or their Blackberry, etc…
All of this software talks to our service, which includes a lot of complicated components (like advanced OCR technology). The combination of a huge service and over a dozen clients makes it really hard to imagine Evernote being deployed as “shrink wrapped” enterprise software, and the sort of company that sells enterprise software tends to look very different from Evernote.
May 19, 2011 at 12:04 pm | link
Where are all the b-splines for curve fitting pix?? This is an outrage
May 19, 2011 at 2:46 pm | link
I’m very interested to see the new figures for total member accounts; it seems to be getting bigger and bigger.
thanks
May 19, 2011 at 3:00 pm | link
Phil likes to say that the “total members” stat is the only one guaranteed to always increase, so it looks nice on a PowerPoint presentation. I think that Andrew and Phil should have some updates on stats in the next month or so.
May 22, 2011 at 8:24 pm | link
A real time stats page would be pretty cool. As you share subscriber and note counts I doubt that it’s TMI from your perspective.
May 27, 2011 at 5:44 pm | link
Something along the lines of the Firefox 4 Download Stats page:
http://glow.mozilla.org/
May 19, 2011 at 6:51 pm | link
Dave, thanks to you and the rest of the Evernote team for doing this! What a great philosophy and a wonderful resource this will become!
My concern is, however, that if you can write these articles for the blog then you obviously don’t have enough to do. You should ask Phil for a new task.
It’s amazing to me the issues you guys have to deal with and overcome so well. Things like the peak time of day mentioned earlier… I’m sure the onslaught of activity from Japan was a surprise, but you never had the epic fail other companies would have experienced. That says a lot about you and the tech team and your ability to adapt and overcome.
Well done!
May 19, 2011 at 6:52 pm | link
Sshhh … Phil’s on vacation today, so I’m getting all sorts of real work done.
Thanks!
May 20, 2011 at 9:30 am | link
So… Doing a little math, are we to understand that a lot of your accounts are dormant? 150e6 daily HTTPS requests for 9e6 users isn’t a lot, assuming that I understood the paragraph correctly and it includes the syncs from the client software.
May 20, 2011 at 2:14 pm | link
No, our 30-day activity rate is about 3M unique users, so about 1/3 of everyone who created an account in the last 3 years used Evernote in the last month. But the nature of synchronizing clients means that this doesn’t translate into hundreds of “page views” per day for every user. If you use Evernote on your laptop, which is open for 8 hours a day, then we may only see 8-16 requests from you that day if you don’t make any changes in your account.
May 26, 2011 at 11:42 am | link
Right, that makes sense! I forget that I’m not a typical user, with my many updates a day and the constant use of the web interface.
Thanks for your reply!
May 20, 2011 at 11:40 am | link
FASCINATING!!! Lots of work, and all EN users thank you! We’ll have to “gussy up” our posts!
May 23, 2011 at 3:06 am | link
And “Why such stuffy old software (Java, SQL, local storage, etc.) instead of hot new magic bullets? …”
Really curious about this.
Thanks
May 24, 2011 at 12:08 am | link
I’m curious as to how you use Zabbix, Opsview, and AlertSite all for monitoring. I have been using Nagios for years and yearn for a better solution. Zabbix and Opsview seem like they have a lot of overlapping functionality; do you prefer one for a certain set of tasks?
May 24, 2011 at 3:47 am | link
We started with Zabbix internally but deployed Opsview when we switched over to HP switches, since it had good modules for managing those. We’re evaluating whether to replace Zabbix with Opsview completely, but we monitor a ton of different things on dozens of boxes, so a full migration would take some work.
AlertSite provides an external sanity check to confirm whether the entire site is up or down. It also hits a pass-through URL that confirms that Zabbix is up. So we get paged if the internal monitoring system dies (which happened once), or if all network traffic has stopped. (It also gives a nice “total uptime” number if we want to tell Phil how many “nines” we have…)
May 24, 2011 at 10:33 am | link
Impressive, even sans magic bullets! I hope that, in one of your followup articles, you touch on data synchronization. I’ve been surprised to discover how little common practice there is around this.
May 25, 2011 at 3:39 pm | link
Since you use open source software, have you given back anything to the community, like reporting bugs or adding configuration hints to, for example, the Debian Wiki? I’m sure you must have discovered one or another glitch doing such a setup and certainly found clever ways for certain configs.
May 25, 2011 at 3:44 pm | link
We’ve done a bit … e.g. the Thrift code generators for Obj-C/Cocoa and HTML came from us, and we try to submit helpful bugs like:
http://bugs.mysql.com/?id=61105
But we’d like to do more in the future.
May 25, 2011 at 6:47 pm | link
Why the need for Opsview and Zabbix? I’ve used Zabbix extensively and it’s a great product. I haven’t had a chance to play with Opsview, so I am simply wondering: what is the benefit of running both?
May 25, 2011 at 6:55 pm | link
Largely historical reasons. We started with Zabbix, then started to use Opsview to manage our HP networking equipment. Now we’re evaluating whether to migrate everything to Opsview. Both have their benefits, but Zabbix has been a little fussy to configure and scale so far.
May 28, 2011 at 12:10 pm | link
Hi, Dave.
I’ve translated this post for Japanese people.
http://d.hatena.ne.jp/nokuno/20110528/1306566831
Please tell me if you have any trouble with this translation.
Thank you for your great entry.
May 31, 2011 at 2:58 pm | link
First of all – thanks for the article, Dave!
I read it in Russian on Habrahabr, but was asked to come here to get my technical questions answered.
I am mostly interested in your caching strategy. According to the pic above, you don’t have a separate caching farm and use in-memory (?), per-shard EhCache instances (I suppose that’s why you need tons of RAM for these). Do you use EhCache as a Hibernate 2nd-level cache, or programmatically in front of the DAO layer, or some other way?
Also, I am interested in your Tomcat deployment settings – how much memory do you give each Tomcat, and how many Tomcats are running per shard server? This will show how you fight GC issues.
Surely, I have a lot of other questions, like file storage architecture, but I suppose these are separate big subjects to discuss.
May 31, 2011 at 4:46 pm | link
We use EhCache with Hibernate for application-level caching. This works pretty well, but Hibernate+EhCache can still be too expensive for some operations, since it still opens and closes a transaction to your database even if all of the results you need are in the EhCache.
So there are a small number of high-performance operations where we do our own custom caching in the application without touching Hibernate. This is a lot more work, because you need to keep your own cache in sync with the “real” database, but we found it necessary for certain operations.
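For reference, the plain Hibernate + EhCache part is mostly configuration. A minimal sketch looks something like this; the entity is illustrative, and the exact property names depend on the Hibernate/EhCache versions you’re running:

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Hypothetical hibernate.properties entries (version-dependent):
//   hibernate.cache.use_second_level_cache=true
//   hibernate.cache.region.factory_class=net.sf.ehcache.hibernate.EhCacheRegionFactory

/** Illustrative entity marked for the second-level cache; not one of our real classes. */
@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class UserPreference {

    @Id
    private Long id;

    private String value;

    public Long getId() { return id; }
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
}
```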
(MySQL also does its own caching, so we give it a lot of memory, but that’s mostly for fast indices, not for primary data.)
Memory settings are a constant battle. We run one Tomcat+Java instance, to simplify caching, with:
-Xmx5200m -XX:+UseSerialGC -XX:MaxPermSize=384m
The “UseSerialGC” option was required to fix some JVM crashes that we’d see every few hundred server-days. We plan to try tuning our GC again soon, but this is a constant battle between stability and performance.
June 3, 2011 at 10:13 am | link
Thank you for sharing; I’ve translated this post into Chinese:
EverNote的系统架构 (EverNote’s System Architecture)
June 4, 2011 at 5:06 pm | link
First off great article, thank you.
The part that caught my attention was in the application stack. While our stack also includes Tomcat, Hibernate and MySQL, your choice of Stripes and GWT as a “controller” layer made me curious. I assume you use them for the web app version of Evernote. I took a look at Stripes some time ago and thought it had some good things going for it, minimal configuration being one of the best parts about it. I haven’t heard much about Stripes since then and thought it wasn’t getting a lot of traction. My question is: now that you have been using it for a while, would you still go with that framework? Are you and/or the engineers that work with it happy with the choice?
June 4, 2011 at 5:17 pm | link
GWT is basically just for the one “page” where you’re viewing your own account contents on the web. (i.e. https://www.evernote.com/Home.action)
Stripes handles everything else in our web application UI. It’s heavily influenced by other “convention over configuration” frameworks like Ruby’s Rails. I.e. if you make classes with certain conventions in the naming of the class and properties, then you don’t need to do a lot of manual wiring to connect web forms to Java back-end controller code.
We’re mostly happy with it, but I haven’t surveyed the Java options in a few years, so couldn’t really say whether it’s better than any alternatives. Like many of those other frameworks, it makes the simple things really simple, but it’s not always clear how best to handle more complicated things like separate data-rich forms on the same web page, etc.
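For anyone who hasn’t seen it, a minimal Stripes action bean looks roughly like this. The URL binding, form field, and JSP path are made up for illustration, not taken from our code:

```java
import net.sourceforge.stripes.action.ActionBean;
import net.sourceforge.stripes.action.ActionBeanContext;
import net.sourceforge.stripes.action.DefaultHandler;
import net.sourceforge.stripes.action.ForwardResolution;
import net.sourceforge.stripes.action.Resolution;
import net.sourceforge.stripes.action.UrlBinding;

/** Illustrative Stripes ActionBean; the binding and JSP path are hypothetical. */
@UrlBinding("/Settings.action")
public class SettingsActionBean implements ActionBean {

    private ActionBeanContext context;
    private String displayName;   // bound automatically from a form field named "displayName"

    public ActionBeanContext getContext() { return context; }
    public void setContext(ActionBeanContext context) { this.context = context; }

    public String getDisplayName() { return displayName; }
    public void setDisplayName(String displayName) { this.displayName = displayName; }

    @DefaultHandler
    public Resolution view() {
        // By convention, no XML wiring is needed to reach this handler.
        return new ForwardResolution("/WEB-INF/jsp/settings.jsp");
    }
}
```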
June 13, 2011 at 12:32 pm | link
I’m curious as to how you use Zabbix, Opsview, and AlertSite all for monitoring. I have been using Nagios for years and yearn for a better solution. Zabbix and Opsview seem like..
June 13, 2011 at 3:45 pm | link
We started with Zabbix, and found it to be pretty powerful, but a bit complicated to set up and scale.
We added Opsview later because it had good support for our HP switches, and we’re trying to decide whether to switch everything over to Opsview.
AlertSite is just for external “black box” monitoring … if our entire data center goes dark, Zabbix can’t notify us of any problems, but AlertSite will page us. We pay AlertSite to ping two different URLs every few minutes: one that hits our application, and another that (indirectly) hits Zabbix, so we know if Zabbix itself dies. (Which it has done twice.)
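The pass-through check itself can be as simple as a tiny servlet. This is just a hypothetical sketch, not our actual code; the internal probe is a stand-in:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Hypothetical pass-through health check: an external monitor polls this URL,
 * and we return 200 only if the internal checks pass.
 */
public class HealthCheckServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        if (internalMonitoringIsAlive()) {
            resp.setStatus(HttpServletResponse.SC_OK);
            resp.getWriter().write("OK");
        } else {
            // A 503 tells the external monitor to page someone.
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "internal monitoring down");
        }
    }

    /** Stand-in for whatever probe confirms the internal monitoring system is up. */
    private boolean internalMonitoringIsAlive() {
        return true;   // placeholder; a real check might hit a heartbeat URL or file
    }
}
```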
June 14, 2011 at 11:16 am | link
Great article, and I love the simple architecture with which you have been able to achieve massive scale.
Any thoughts about storage, i.e. local file system vs. NoSQL implementations like MongoDB (as it may be easier to perform search operations)?
June 14, 2011 at 4:55 pm | link
We’re currently storing a billion files for our users. The metadata is in MySQL (one DB per shard), the search text is in Lucene (one index per user), and the files are on various hierarchical file systems with pointers in the DB.
You could definitely build something like Evernote and use a “nosql” storage system for files instead of managing your own file system(s). If we were hosting Evernote within Amazon, we would obviously use S3 to store the raw files. Since we have good data locality (your files only need to be accessed from your shard), we don’t have to solve the “global big data” problem. So some of the big data nosql solutions are more than we need.
Since we’re building Evernote to store your memories for many years, I tend to favor fairly mature/stable/common technologies over the coolest new gadgets. This helps ensure that the software will still work as expected in 5 or 10 years. Files on a file system are pretty simple to work with, and simple to debug if something goes wrong. When we looked at options 4 years ago, incremental backups were also problematic with the NoSQL software. I.e. we only want to back up the 2 million new files today rather than the 1 billion old files.
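As a hypothetical sketch of the “files on a file system with pointers in the DB” pattern: fan files out into subdirectories derived from their identifier so no single directory gets huge, and store the resulting relative path in the database row. The two-level fan-out and the “.dat” suffix here are made up for illustration:

```java
import java.io.File;

/** Hypothetical layout: fan files out by the first characters of their GUID. */
public class FileStoreLayout {

    private final File root;

    public FileStoreLayout(File root) {
        this.root = root;
    }

    /**
     * e.g. guid "a1b2c3d4..." -> <root>/a1/b2/a1b2c3d4....dat
     * The database row for the file would store this path (the "pointer").
     */
    public File pathFor(String guid) {
        File level1 = new File(root, guid.substring(0, 2));
        File level2 = new File(level1, guid.substring(2, 4));
        return new File(level2, guid + ".dat");
    }
}
```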
February 19, 2013 at 3:38 am | link
Dave,
Thanks for the refreshing clarity. I use Evernote every day and rely on your team’s efforts.
Some of the blog readers may like to know, as I do:
1. How did you map the service level commitments to infrastructure design early on?
2. Will there be a quantum leap needed, or are you tracking along the curve you expected? And what is the overall shape of the aggregate load on the system over the years so far?
Sean