Here's how we serve millions of requests each month on a single, standard VM on Google Cloud. First, I'll provide a concise Table of Contents. Then we'll go in depth on each point and how it applies concretely to the practice of software engineering.
This part of the post is probably the most boring. You can skip it if you already know what it's about. The point here is simple: save a ton of time and headache by using technologies that have already been battle-tested in the most extreme rivers of internet traffic.
You are highly unlikely to encounter legitimate bugs in these pieces of software. If you do stumble upon an error message or confusing behaviour, chances are thousands of other developers have faced the same thing and posted about it online. Getting stuck is virtually impossible with this set of tools. And if you ever reach the point where you do need to look under the hood and do some real engineering work, you are already serving truly Herculean amounts of traffic and your company should have more than enough revenue to cover those engineering costs.
Our stack is composed almost entirely of tried-and-true technologies: Nginx, Redis, PostgreSQL, Debian, and Flask. Only two components in our stack break this theme: the web servers behind Nginx are Rust Rocket servers, and all the HTML is produced with the server-side rendering framework Maud. We only use Flask for internal services.
The first time anyone formally learns about performance measurement and optimization, they are almost always told that input/output (I/O) is, generally speaking, the most important thing to consider. This is true. While you should always take an empirical approach and measure first, optimize second, "I/O is the bottleneck" is a very useful guess to have in your back pocket.
But what exactly do we mean by I/O in this context? Starting at the highest, most zoomed-out level, consider the flow of data in your application. Maybe it starts as a request on your user's mobile device, makes its way to your server, then your server hits a database somewhere to authenticate the user, and then your server fires off a request to an AWS or GCP bucket to fetch some data and return it to the user. What's the I/O here? At the highest level, it's the network requests. This is intensive input and output: data leaves some computer, be it the user's mobile phone or your server, and travels over the internet to another server somewhere, either yours or the black box of AWS or GCP. These network requests take time. The fastest ones can be quick, maybe 15 milliseconds, which is essentially unnoticeable to users. But the slowest ones can take 500 or more milliseconds, easily noticeable by the majority of people.
So what can we do here? We can use aggressive caching to avoid making certain network requests altogether. Cloud buckets can be slow, especially depending on their configuration. Most of the time, the data stored in these buckets is static and immutable — it doesn't change. This means that if we fetch the data from the bucket once, we can store it in memory on our server and read it from memory on later requests instead of going over the network again. We just saved 500+ milliseconds in a lot of cases! Fantastic!
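Here's a minimal sketch of that idea in Rust. The cache type and the `fetch_from_bucket` function are made up for illustration; the real call to your cloud storage client would go where the stand-in is.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Clone, Default)]
struct BucketCache {
    // Object path -> cached bytes. Entries are assumed to be immutable.
    inner: Arc<RwLock<HashMap<String, Arc<Vec<u8>>>>>,
}

impl BucketCache {
    fn get_or_fetch(&self, key: &str) -> Arc<Vec<u8>> {
        // Fast path: serve straight from memory, no network round trip.
        if let Some(bytes) = self.inner.read().unwrap().get(key) {
            return Arc::clone(bytes);
        }
        // Slow path: pay the 500+ ms network cost, then remember the result.
        // (Two threads racing here may both fetch once; fine for a sketch.)
        let bytes = Arc::new(fetch_from_bucket(key));
        self.inner
            .write()
            .unwrap()
            .insert(key.to_string(), Arc::clone(&bytes));
        bytes
    }
}

// Stand-in for the real cloud-storage client call.
fn fetch_from_bucket(key: &str) -> Vec<u8> {
    // ... slow network request to the bucket goes here ...
    format!("contents of {key}").into_bytes()
}
```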
There are two final points I want to make here. First, programmers should be aware of cache invalidation bugs, which are notoriously destructive and easy to let infest your programs. Remember this quote:
There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.
— Phil Karlton
Second, programmers should be aware of the really cool fact that this idea applies to all levels of I/O, not just network requests. Most classical computers are composed of CPUs, RAM, and hard disks. Those pieces of hardware were listed in increasing order of latency — it takes longer to hit RAM than the registers of your CPU, and it takes longer to hit a hard disk than RAM. That's a hierarchy of access latencies to which we can apply a hierarchy of caching. Furthermore, each of those pieces of hardware has a hierarchy of caches within it! Isn't that cool?!
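If you want to feel that hierarchy directly, here's a small, self-contained Rust experiment (the matrix size is arbitrary, and you'll want to build with `--release`). It sums the same flat buffer twice: once walking memory in order, once jumping around. The cache-friendly traversal comes out noticeably faster on typical hardware.

```rust
use std::time::Instant;

fn main() {
    const N: usize = 4096;
    let data = vec![1u64; N * N]; // one flat, contiguous buffer

    // Row-major traversal: walks memory sequentially, so reads mostly hit
    // the CPU's caches.
    let t = Instant::now();
    let mut sum = 0u64;
    for row in 0..N {
        for col in 0..N {
            sum += data[row * N + col];
        }
    }
    println!("row-major:    sum={sum}, took {:?}", t.elapsed());

    // Column-major traversal: same arithmetic, but each access jumps N * 8
    // bytes, so it misses the caches far more often and runs slower.
    let t = Instant::now();
    let mut sum = 0u64;
    for col in 0..N {
        for row in 0..N {
            sum += data[row * N + col];
        }
    }
    println!("column-major: sum={sum}, took {:?}", t.elapsed());
}
```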
I follow the general rule of thumb that users should not have to wait more than 500 milliseconds (half a second) for a response from a web server. Sometimes, though, things take longer than that. Even when they do, the rule still applies. So what's the solution? Tell the user that their request is being worked on, and deliver that acknowledgement within the same 500 millisecond budget.
To implement this in practice, you need to use some form of asynchronous programming. What we do is push things into queues. For example, we generate a lot of spritesheets:
When a user generates an animation from a sprite and a prompt, they can expect to see that their animation has started generating within 500 ms. In practice, this number is usually as low as 50 ms, but it depends on their network connection, of course. We just push the job into a Redis queue and then, on the other side of the queue, a separate worker binary pulls out the job and begins working on it. When that job is done, we update the database to mark the job as done, and the next polling request updates the web page for the user!
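Here's a rough sketch of that producer/worker split, assuming the `redis` crate. The queue name, the job payload, and the function names are invented for illustration; the real job would carry whatever the worker needs to generate the animation.

```rust
// Get a connection with something like:
//   let mut con = redis::Client::open("redis://127.0.0.1/")?.get_connection()?;

/// Called from the web handler. It returns to the user as soon as the LPUSH
/// succeeds, which keeps the response well inside the 500 ms budget.
fn enqueue_job(con: &mut redis::Connection, job_id: u64) -> redis::RedisResult<()> {
    redis::cmd("LPUSH")
        .arg("animation_jobs")
        .arg(job_id.to_string())
        .query::<i64>(con)?;
    Ok(())
}

/// Runs in a separate worker binary on the other side of the queue.
fn worker_loop(con: &mut redis::Connection) -> redis::RedisResult<()> {
    loop {
        // BRPOP blocks until a job arrives; a timeout of 0 means wait forever.
        let (_queue, job_id): (String, String) = redis::cmd("BRPOP")
            .arg("animation_jobs")
            .arg(0)
            .query(con)?;
        // ... do the slow generation work, then mark the job done in the
        // database so the user's next polling request sees the result ...
        println!("finished job {job_id}");
    }
}
```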
How you model your data affects your engineering speed as much as it affects your application’s speed. Keeping your data model simple, flat, and reproducible means you can write new code, and debug and fix old code, at a faster pace. It also often translates to performance gains for numerous reasons. Your database queries are less complicated. Your code lends itself more to free hardware gains: you can take advantage of caching boosts enabled by memory alignment, and more of your code can implicitly avoid copies and sometimes even leverage “zero copy” — structs can be read directly from memory (for example, from a network buffer, a file, or some IPC buffer) instead of being copied into the format that your program uses, because the two are one and the same.
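To make the “zero copy” point concrete, here's a toy sketch using the `bytemuck` crate (with its "derive" feature enabled). The `SpriteFrame` struct and the buffer contents are invented; the point is that a flat, fixed-layout record can be viewed directly in a byte buffer, with no parsing or field-by-field copying step.

```rust
use bytemuck::{Pod, Zeroable};

// A flat, fixed-layout record: four u32 fields, no nesting, no padding.
#[repr(C)]
#[derive(Clone, Copy, Pod, Zeroable)]
struct SpriteFrame {
    x: u32,
    y: u32,
    width: u32,
    height: u32,
}

fn main() {
    // Pretend these bytes just arrived in a network or file buffer.
    // (Built from u32s here so the toy buffer satisfies the struct's
    // alignment; bytemuck checks size and alignment before casting.)
    let raw: [u32; 4] = [0, 16, 32, 32];
    let buf: &[u8] = bytemuck::bytes_of(&raw);

    // Reinterpret the bytes in place: no allocation, no copy into a new
    // in-program representation -- the buffer *is* the struct.
    let frame: &SpriteFrame = bytemuck::from_bytes(buf);
    println!("{}x{} at ({}, {})", frame.width, frame.height, frame.x, frame.y);
}
```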
This is where my most controversial opinion enters the article: ORMs are harmful and should be avoided nowadays. The main purpose an ORM serves is to let you define all of your types in your program and avoid writing any database queries yourself. This can help the pace of development because it enforces a single source of truth for your data models. But it can lead to disaster because it obfuscates complexity that ought to be looked at by a competent programmer. You can easily write code that an ORM will translate into hideously complex, sometimes bug-ridden database queries. Often this code doesn’t have a correctness bug, but it does have a performance bug. You then have to write some other incantation to reveal the hideous monster of a query whose performance is not satisfactory and manually intervene. Nowadays, we have the technology to check your SQL queries at compile time. We also have the technology to instantly write all the boilerplate involved in adding a new column to a database table: the new field on the corresponding struct and the updates to all the database calls in your program. Because we have compile-time guarantees about your database queries, even if the boilerplate-writer screws up, that’s okay; your code won’t ship if it’s broken. For an example of such a setup, see the sqlx Rust crate.
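Here's roughly what that looks like with sqlx. The table and column names are invented; the point is that the query string is checked against your real schema when the program compiles (via `DATABASE_URL` or sqlx's offline metadata), so a query that no longer matches the schema fails the build instead of failing in production.

```rust
use sqlx::PgPool;

// The struct sqlx populates from each returned row.
struct Job {
    id: i64,
    status: String,
}

// Fetch all jobs that haven't been picked up by a worker yet.
async fn pending_jobs(pool: &PgPool) -> Result<Vec<Job>, sqlx::Error> {
    // query_as! verifies this SQL against the database schema at compile
    // time: a wrong column name, a wrong type, or a renamed table becomes
    // a build error rather than a runtime surprise.
    sqlx::query_as!(
        Job,
        "SELECT id, status FROM jobs WHERE status = 'pending'"
    )
    .fetch_all(pool)
    .await
}
```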
Flat data models categorically avoid mistakes that cost engineering time and program wall time. They also implicitly take advantage of decades of hardware boosts engineered by PhDs and leading members of industry. They enable us to serve millions of requests each month on a single, standard VM.
These four standard techniques—battle-tested tech, aggressive caching, asynchronous work for anything >500 ms, and flat data models—let us serve tens of millions of requests per month on a single, standard VM without breaking a sweat.
— Tom, Creator of GameTorch