Learn System Design
A bi-weekly podcast hosted by senior engineer Ben Kitchell that takes a deep dive into technical system design by learning together. Each episode we will explore the inner workings of what makes these systems so complex and fascinating while building on our knowledge of how they came together.
All music written and performed by the mysterious Aimless Orbiter. You can find more info about him and his music at https://soundcloud.com/aimlessorbitermusic
1. Databases Decoded: Charting Scalability and Overcoming Latency Battles (Part 1)
Embark on a journey with me, Benny Kitchell, as we chart the course through the complex waters of system design, zeroing in on the pivotal role of scalability. Imagine the chaos of a concert ticket site crashing just as sales go live; it’s a scenario I’ve lived through, and one that illustrates the high stakes of scaling. Within this episode, we dissect the anatomy of scalability, providing essential insights into how understanding both the product and user behavior is crucial for ensuring a system can handle fluctuating demands. As a veteran of the tech trenches, I share war stories and lessons learned, revealing how misguided scaling can be just as disastrous as stagnation, and how financial implications like ballooning AWS fees can catch you off guard if you’re not prepared.
Then, we shift gears and plunge into the 'War on Latency', where I illuminate the strategic deployment of caching and CDNs as our primary weapons. Not only do we unravel the technical threads of these systems, but we also tune in to the melodies of Aimless Orbiter, punctuating our discussion with an auditory experience that transcends the typical tech talk. From the intricacies of the CAP theorem to the harmonious balance proposed by its successor, the PACELC theorem, you’ll come away from this session armed with the knowledge that’s as practical as it is profound. Prepare to be enlightened by the symbiosis of robust system design discourse and the soul-stirring tunes that underscore our technological odyssey.
Dedicated to the memory of Crystal Rose.
Email me at LearnSystemDesignPod@gmail.com
Join the free Discord
Consider supporting us on Patreon
Special thanks to Aimless Orbiter for the wonderful music.
Please consider giving us a rating on iTunes or wherever you listen to new episodes.
Hello everyone, welcome to the inaugural episode of the Learn System Design podcast with me, your host, Benny Kitchell. I'm a senior engineer. I've been writing code for around a decade, and I've been a senior engineer for about half that time. One of the things I've always struggled a bit with is system design and learning how to build a system from scratch. I get to add new features to existing systems, but building something from the ground up is not something I've had the pleasure of doing. So I'm very excited to explore the world of system design with you, learn alongside you, do the research and talk about it, and hopefully, by the time this journey is over, you will have learned something that helps you pass an interview, get a promotion, or even just become a better engineer. In terms of the places I've worked, I cut my teeth at a bootstrapped startup, writing web apps in jQuery and supporting Internet Explorer 9, you know, the real fun stuff. After that, I helped create a new gameplay engine for a AAA mobile gaming company. And honestly, above it all, the main reason I wanted to start this podcast, and the real thing I learned from pretty much all of my jobs, is that I love the people I work with and I love growing and helping them. I guess I just love helping people level up, especially because it helps me level up as well. So hopefully, by doing this podcast, I help someone out there, even if it's just passing an interview or learning something you didn't know already. Hopefully we can level up together. So, without wasting too much more time, today I just want to talk about scalability, a high-level look at what it is and what it means, and over the next few episodes we'll dig more into it and touch on more specific things. The subject itself is extremely broad, and in this episode we won't be covering every minute detail. Instead, I just want to lay the groundwork so that whenever you come into this, you know what to expect and have a good idea of the base framework of what I'm saying and what I mean. I'll be using a lot of high-level language, and as such, I'll include in the show notes brief descriptions of any terms I think people might not know, or, if anyone messages me, I can add those to the show notes as well. That being said, if anything is missing or just doesn't seem to stick, definitely feel free to reach out to me at LearnSystemDesignPod@gmail.com.
Benny Kitchell: So yeah, what is scalability? Scalability, in my eyes, is defined by being able to meet a change in demand without sacrificing a lot in other areas. So, if you've ever tried to get tickets to a concert at midnight and the website won't load because it crashed, or if you and your friends have tried playing a new online multiplayer game and keep getting disconnected or kicked from the lobby, then you've probably been a victim of a poorly scaled system or, in my opinion, a poorly planned system. But in all honesty, scalability is much more than just consumer-side incidents or problems, right? Sometimes you scaled correctly and your product is working as intended, but at the end of the month, a thousand times more users means a thousand times bigger AWS bill, right?
Benny Kitchell: The worst kind of scaling is properly scaling in an area that no one cares about. By that I mean, imagine building a social media app and not paying attention to what the users are doing on your product. You just assume that they're posting pictures of their pets or videos of themselves dancing, so you add a CDN, you host all your content on AWS S3, everything is horizontally scaling, everything is perfect. You've even made sure that if you scale too quickly, all of your data has replicas and everything is backed up; you've really gone the extra mile. Then you wake up the next day, you have a ton of emails, your social media DMs are overflowing, and everyone is complaining about how the chat messages aren't loading properly or are loading out of order. In that scenario, it may not seem obvious what happened, but your users weren't doing what you expected them to do. You didn't care about having consistency in the order of your messages because you were so busy worrying about whether the video upload speed was quick enough. So, at the end of the day, it's about knowing the data, knowing the product, knowing what needs the attention and what things you can sacrifice on. Another example we can think about is not knowing whether your application is read or write heavy.
Benny Kitchell: One of the biggest things we've learned about consumers over the years is that they have more patience for uploading something than they do for downloading something. So uploading a picture can take a little bit longer, but if a feed on TikTok or Instagram takes too long to load, consumers will give up and go find something else to do. If you're building something like Instagram, the number of pictures and videos people are viewing is somewhere in the neighborhood of 50 to 100 times the number they're uploading, right? So if you focus all of your scaling on handling the throughput of uploading pictures and videos, and your scaling still fails, it's because you didn't understand the product, you didn't understand the data. And that's something I wanted to make sure to get out of the way at the beginning.
Benny Kitchell: In episode one, I'm not just going to be talking about, hey, here's Kafka and here's what Kafka does. Instead, I want to talk about why we would want to use Kafka, what the reason is, how our data normally looks, and what scenarios would make us want to implement Kafka. And this leads me into probably the biggest piece of advice I have. If you're listening to this episode and it's the last one you ever listen to, I still want to give you this advice, because I've made this mistake. Chances are, everyone listening to this has made this mistake or will make this mistake, and it's just this.
Benny Kitchell: Never assume you know what the ask for the system is. If you're going into an interview, or if you're creating a new system, if you're building something up, never assume that you know what the system is going to be and what it looks like. Instead, take the time to make sure you know what the product is, what the data looks like, and what the actual ask is. Reinforce that by asking questions and really digging in. At the end of the day, the soul of the system is made of the data, right? So understanding exactly what the data is and how we use it is going to be instrumental in how we build the system.
Benny Kitchell: So, when scaling a system, one of the first things you will run into is a scalability limit, which, in layman's terms, basically just means your current setup is not cutting it. You have a thousand hungry people and one chef making food, I guess. When you hit a scalability limit, what are the things you want to consider? The first is to analyze your data, as I've said before, and understand what your needs are. Much like I mentioned, knowing where your system is failing and the data you're working with is going to be the groundwork for building out your system. The big one that usually comes next is figuring out whether you want to scale vertically or horizontally. Vertical scaling loosely means upgrading a system with better hardware, although sometimes it could mean buying a whole new system with better hardware; the end game is the same. For example, if you are at a small to mid-sized company and you're seeing a modest increase in your traffic, generally you can get away with just increasing your RAM, increasing your disk space, or even upgrading your network as a whole. But to understand why upgrading these things is important, you need to understand what role each of them plays in your server.
Benny Kitchell: Take, for instance, RAM, or Random Access Memory. It's your first line of defense against latency. In short, latency is how long a request cycle takes, but we'll touch on that in a bit. If you've ever clicked on something and it's taken forever for the action to happen, that's latency. Basically, all you really need to know for now is that it's measured in time, and the higher the latency, the worse the user experience. So when an application stores anything during runtime, generally speaking, storing it in RAM is your best choice. In reality, RAM and its speed could be its own separate episode, so let's just say the higher the amount of RAM and the higher the frequency, the better it is for us. When an object is too large to store in RAM, we have to use disk storage, and if you've ever had to pull an image or a file locally, you know what it's like to pull anything off the disk and the headaches that can come with that, especially when it comes to speed. So when we compare speeds, let's compare something like a solid state drive, which is even faster than a hard disk drive, versus RAM. Typically speaking, you're going to see somewhere between five and ten times slower speeds on an SSD than your RAM, and SSDs themselves are three to four times faster than a traditional disk drive.
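To make that RAM-versus-disk gap concrete, here is a tiny sketch that prints some commonly cited, order-of-magnitude access times. These figures are assumptions for illustration, in the spirit of the well-known "latency numbers every programmer should know", not measurements of any particular machine, and the gap in raw throughput is narrower than the gap in random-access latency shown here.

```python
# Rough, order-of-magnitude access times (assumed ballpark figures, not measurements).
ACCESS_TIME_NS = {
    "RAM (main memory reference)": 100,          # ~100 nanoseconds
    "SSD (random read)":           100_000,      # ~100 microseconds
    "HDD (disk seek)":             10_000_000,   # ~10 milliseconds
}

ram_ns = ACCESS_TIME_NS["RAM (main memory reference)"]
for name, ns in ACCESS_TIME_NS.items():
    print(f"{name}: ~{ns:>12,} ns  ({ns / ram_ns:>9,.0f}x RAM)")
```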
Benny Kitchell: So, circling back to what I was saying before, think about your data and what your application is doing. Are there a lot of small objects and state changes? Are you working with large files? Are you working with images or videos? These types of choices can make or break your scaling, right?
Benny Kitchell: At the end of the day, vertical scaling, while more straightforward, can be more expensive, and sometimes, even when you've scaled completely, it's still not enough. The reason is that vertical scaling has a theoretical limit. You can't put an infinite amount of RAM or an infinite amount of storage in a single machine. So, at the end of the day, vertical scaling is great for small, incremental growth, but when it comes time to scale in larger numbers, we have to explore a different option, which, of course, is horizontal scaling. Horizontal scaling, on the other hand, just means that instead of one large system, you have a multitude of low- to medium-sized systems. If vertical scaling means hiring a world-renowned expert chef, horizontal scaling is the equivalent of hiring a lot of junior to mid-level chefs. One is not better than the other; they serve different needs. Horizontal scaling is best utilized when you need your scaling to be more dynamic.
Benny Kitchell: From a cost perspective, imagine you only have large amounts of traffic on certain days. For a lot of companies that's Black Friday or Boxing Day. If your system doesn't see that amount of requests every other day of the year, why would you want to vertically scale and put all that money into handling the extra throughput for one day out of the year? It's just not cost effective, right? Instead, you can scale horizontally.
Benny Kitchell:Server instances be dynamic and increase and decrease as they're needed. But, of course, because there's positive and negatives to everything, just as there are with vertically scaling, there's problems with horizontally scaling. The issues that come with horizontally scaling are abundant, and one of the biggest ones is the increase of complexity. For instance, imagine you have a system that adds new servers to handle requests every time the request increases. The two main issues you will immediately face is one how do I route traffic through these instances in a dynamic and balanced way? How do I tell my load balancer to go to server A versus server B? When server B is down, how do I tell it don't go to server B? These sorts of things that you don't really think about until it's happening and causing an incident.
Benny Kitchell: The second thing is: how do I make sure that every person who visits my page sees the same thing? That's the consistency factor. If I have a sale and my new skateboards are $45 instead of $60, I want everyone to see the same price and have the exact same experience. Furthermore, what if you need to keep track of state? If a person adds something to a cart, comes back an hour later, and doesn't hit the same server, they're not going to have a good experience. They're going to expect to see those things in their cart consistently. How do we ensure that these sorts of things are working in tandem and working correctly? I will touch on this later, but these are just things to think about.
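One common way to handle that cart problem is to keep session state out of any single server's memory and in a shared store that every instance can reach. The sketch below is a simplified, assumed example: a plain dictionary stands in for something like Redis or a database, and the point is only that any server can look the cart up by user ID.

```python
# A shared store every app server can reach (in production this would be
# something like Redis, Memcached, or a database rather than a Python dict).
SHARED_CART_STORE: dict[str, list[str]] = {}

def add_to_cart(user_id: str, item: str) -> None:
    """Any server instance can call this; the state lives in the shared store."""
    SHARED_CART_STORE.setdefault(user_id, []).append(item)

def get_cart(user_id: str) -> list[str]:
    """Whichever server handles the request sees the same cart."""
    return SHARED_CART_STORE.get(user_id, [])

# "Server A" handles the add, "server B" handles the later page load:
add_to_cart("user-123", "skateboard")
print(get_cart("user-123"))   # ['skateboard'] regardless of which server answers
```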
Benny Kitchell: I want you to think about horizontal versus vertical scaling, when one is better than the other, and what solutions they really provide. So, dear listener, I'm sure the answer is coming to you now. When do you scale horizontally? When do you scale vertically? The answer, as I alluded to before, is generally: it depends. That said, at the end of the day, most companies end up scaling horizontally in one way or another. The reason, honestly, is that the increase in complexity is hard but not terrible, especially when the other side of the coin brings a whole slew of better features. You have more modularity, more flexibility, and, one of the biggest ones, decreased latency in different locations around the world. Being able to spin up a server in Europe pretty much immediately is way better than having to buy a server rack in Germany and just hope you get it built in time for people to care.
Benny Kitchell: I'll expand on these and a lot more topics throughout the series in future episodes; rest assured, it won't be the last time you hear about them. On this episode, again, I just want to keep things high level and talk about the little tips and tricks I'll be coming back to along the way throughout this entire program. So we come back to what I touched on before, latency, and let's dig a little deeper into why it's important to keep it low and why it's something you want to pay attention to. Instead of trying to create some cool little metaphor for why latency is so important, I'm just going to use findings from Amazon themselves. In 2006, Amazon actually did a study. They found that for every 100 milliseconds of latency in their system, they lost around 1% in sales. So you might be saying, 100 milliseconds, who could even tell how fast 100 milliseconds is? Well, I honestly can, and I can tell you that a snap of your fingers takes around 150 milliseconds, so it's faster than that. To put that in perspective, in 2006, 100 milliseconds of latency cost Amazon around $100 million, and today, if we just applied that straight across the board, 100 milliseconds would look more like multiple billions of dollars. Can you imagine losing multiple billions of dollars every time you blink your eyes, every time you snap your fingers? This is why latency is important. And now that I've given you that horror story, let's talk about some of the things Amazon and, honestly, a lot of other companies have done to keep their latency low.
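For the curious, here is the back-of-the-envelope math behind that claim. The revenue figures are rough public ballpark numbers assumed for illustration, not figures quoted in the episode.

```python
# Back-of-the-envelope: ~1% of annual sales lost per 100 ms of added latency.
REVENUE_2006 = 10.7e9    # Amazon net sales in 2006, roughly (assumed ballpark)
REVENUE_TODAY = 575e9    # recent annual net sales, roughly (assumed ballpark)

print(f"2006:  ~${0.01 * REVENUE_2006 / 1e6:.0f} million lost per 100 ms")
print(f"today: ~${0.01 * REVENUE_TODAY / 1e9:.1f} billion lost per 100 ms")
```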
Benny Kitchell: Your first main weapon in the war on latency is caching. Caching is just the art of saving the responses of common tasks to save time in the future. Let's say you're asked to build a system that shortens the length of a URL. When someone requests a specific URL, you'll need to go fetch that value from the database, and depending on the size of your database, that could be a large chunk of time if you haven't optimized it with indexes and things like that. Now imagine that URL belongs to someone like a celebrity on TikTok. You can scale your system to meet the demand of 10,000 people requesting the URL, but you're still paying however long that fetch takes, times 10,000, every time you have to fetch that value. So instead we would utilize a cache that sits in front of our database layer, and much like we talked about earlier with the RAM versus hard disk discussion, the cache is exponentially faster at fetching. So while the first person might have to wait a few hundred milliseconds on a terrible, terrible day, everyone after them gets it in a timeframe that's a fraction of that.
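To make that concrete, here is a minimal cache-aside sketch for the URL shortener example, with a plain dictionary standing in for the cache and a fake, slow database lookup; the names and timings are made up for illustration.

```python
import time

# Hypothetical "database": slow lookup of short code -> full URL.
FAKE_DB = {"abc123": "https://example.com/some/very/long/url"}

def db_lookup(short_code: str) -> str:
    time.sleep(0.2)                  # pretend this round trip costs ~200 ms
    return FAKE_DB[short_code]

CACHE: dict[str, str] = {}           # stand-in for Redis/Memcached

def resolve(short_code: str) -> str:
    """Cache-aside: check the cache first, fall back to the database, then cache it."""
    if short_code in CACHE:
        return CACHE[short_code]     # cache hit: no database round trip
    url = db_lookup(short_code)      # cache miss: pay the slow lookup once
    CACHE[short_code] = url
    return url

start = time.perf_counter()
resolve("abc123")
print(f"first request:  {time.perf_counter() - start:.3f}s")   # roughly 0.2s

start = time.perf_counter()
resolve("abc123")
print(f"second request: {time.perf_counter() - start:.6f}s")   # cache hit, microseconds
```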
Benny Kitchell: Caching can and will be a future episode. I'll explore it, I'll talk about Redis, I'll talk about Memcached, the differences between them, and why and when you want to use one versus the other. But I also want to explore the different strategies, right? When we talk about caches, there are different ways of caching: there's cache-aside, write-back, read-through, and so many more special little things you can do to improve on caching, especially when caching databases. And you're in for a little treat here, because the next few episodes will be about databases, the different kinds, and, honestly, why a lot of the time they're your biggest bottleneck. Not always, but for the most part they can be a real pain in the neck. So definitely make sure to stay tuned for those episodes. I'm going to try to release one episode every other week, and hopefully you enjoy it.
Benny Kitchell: The next tool I want to talk about is the content delivery network, more commonly known as a CDN. CDNs are basically proxy servers deployed in different geographical regions, and their main purpose is to decrease the physical distance between your data and the end user who wants to consume it. Despite the fact that data can travel somewhere around 200,000 kilometers a second, the real slowness comes from how our internet is connected. Since there's no direct connection between the data and the user, we end up with bottlenecks in places like our routers and our internet service providers. But that's where CDNs can save us. By hosting important content on CDNs, you can shorten that physical distance and improve the overall speed of your site or your app or your game. Another great benefit of CDNs, besides speed, is the ability to have redundancy. If you're not familiar with what redundancy is, it's basically just the act of having one, two, or any number of copies of the same thing to keep it highly available. If, for some reason, your server in Western Europe crashes, you can serve the exact same content from the western United States. Yeah, it'll be slower, but it will still be available.
Benny Kitchell: Try to instill in your brain this concept of availability, because it's honestly going to be one of the most important things you take into consideration when scaling a system, the other two, of course, being consistency and partition tolerance, or CAP for short. The CAP theorem, also known as Brewer's theorem, is to computer science and scaling a system as a wheel is to a car, which is to say, if you don't apply it, you aren't getting anywhere, right? In its purest form, the CAP theorem states that in a distributed system you can only have two of the three guarantees: high consistency, high availability, or strong partition tolerance. You can never have a system that has all three. Further expanding on this idea, in recent years theoretical computer scientists have, of course, grown and leveled up, and now we have the PACELC theorem. I'm so sorry if I'm pronouncing that wrong, but chances are, if you wrote that theorem, you're not listening to this podcast. Like the CAP theorem, this theorem theorizes about consistency, availability and partition tolerance, but it also expands on the idea by stating, and I quote: in case of network partitioning (P) in a distributed system, one has to choose between availability (A) and consistency (C), as per the CAP theorem, but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and consistency (C). For now, let's stick with the CAP theorem; we can bring in ideas from the PACELC theorem as we need them, but for our purposes the CAP theorem will be more than enough.
Benny Kitchell: As alluded to before, the C in the CAP theorem stands for consistency, or, more specifically, data consistency. Building on a similar example from before, imagine selecting seats for a concert. You're purchasing a ticket, and someone else might be trying to buy that exact same ticket, but once someone buys it, it should be invalid for every other user, right? Otherwise you're going to have a lot of angry people fighting over seats, sending you death threats and all kinds of crazy things. Honestly, that might seem like an extreme example, but it's not. I've seen poorly designed websites where consistency is off. I've seen websites go down, and honestly, the worst part is when you have to decide between consistency, availability and partition tolerance. This is one of those examples where, if I'm buying a ticket, I want it to be consistent and I want it to be available, so we will have to sacrifice on partition tolerance, and, for better or for worse, that is how the cookie crumbles.
Benny Kitchell: The A in the CAP theorem, as I've mentioned multiple times, is availability. While it may seem obvious that every site should be available, right, like everything should always work, there are different levels of availability needed for different systems. For example, if you've ever heard of Amazon Web Services or cloud computing, then you know just how much of the entire internet relies on these systems. If you don't know, the stat on AWS is that, on average, one in every three sites a user visits relies on AWS. It relies on its infrastructure, it relies on pretty much everything AWS has to offer, and it offers a lot. So to say that AWS needs to be readily available is a huge understatement, and that's why they have a system they call the nines. The nines is basically a description of the number of nines in a guaranteed uptime percentage, and I'll give you an example. Their single-region systems offer around two nines worth of availability, or 99% uptime. That's great, right, 99%? But then there's their storage solution, something like S3, if you've ever heard of it. If you haven't, S3 is a simple storage service where people like to store large files like videos and pictures, and then you can have those distributed all across the world and served up via AWS and, again, as we've talked about before, reduce that latency and also not have to host all those pictures and videos on physical hardware that you're sitting on all over the world. But anyway, sorry, S3 itself offers an astounding 99.999999999%, or eleven nines, of durability for your data.
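To put those nines in perspective, here is a quick sketch converting an availability percentage into allowed downtime per year; the tiers listed are just illustrative examples.

```python
# How much downtime per year each availability tier allows (365-day year, rounded).
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [
    ("99%     (two nines)",   0.99),
    ("99.9%   (three nines)", 0.999),
    ("99.99%  (four nines)",  0.9999),
    ("99.999% (five nines)",  0.99999),
]:
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label}: ~{downtime:,.0f} minutes of downtime per year")
```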
Benny Kitchell: AWS and its offerings are a great choice when building a system for availability, but, like everything, it has its drawbacks, and one of the biggest ones is the cost. I've seen a lot of companies nowadays saying, I don't want to pay this big, giant AWS bill anymore, let's see if we can use only some of it, and we'll host this and we'll host that ourselves. Like with everything, there's always going to be some pushback. But if you are designing a system and you need it to be quick and you need it to be reliable, then AWS or Azure or Google Cloud or any of those big-name platforms are great choices, and they have a lot to offer in terms of building these systems. So finally, we come to the P in the CAP theorem.
Benny Kitchell: Arguably, it's the hardest one to wrap your mind around: partition tolerance. Partition tolerance roughly means that when one node goes down, your entire system doesn't fail catastrophically. The rough idea I can give you is this: if we're scaling horizontally and we have three or four servers up, as I mentioned before, you don't want your load balancer to send traffic to server B while server B is down for maintenance, or to server C if it crashed for some reason, so you don't want to serve anyone from those. But most of all, if one of those servers goes down, you don't want it to break your entire system. You don't want one server going down to take down any of your other servers or anything like that.
Benny Kitchell: The reason this is important is because, inevitably, nodes will go down, or the network will fail, or any number of other things will happen, and it's up to you to decide whether it's important for your system to have tolerance for these sorts of breaks. To bring it back to the example we had earlier about a social media site: how angry would you be if direct messaging being broken meant you couldn't see your feed? That's why partition tolerance is important. It's this sort of separation of concerns, this decoupling, this modularity. It means one thing breaking doesn't affect another, and that's extremely important in some systems. And again, as I alluded to before, you have to choose which ones apply in which situation.
Benny Kitchell: There are so many examples online of the various combinations of the CAP theorem, but I'm going to try to give you a few tips for what to prioritize when designing a system and what to sacrifice. And honestly, the dirty little secret is that you will almost always be sacrificing consistency or availability. In today's world, not having partition tolerance is almost a deal breaker, right? I want to be clear: this does not apply to databases. There is still a need for CA databases. What I'm saying is that a system as a whole will almost never be both consistent and available.
Benny Kitchell: So then the question is: when do you prioritize availability, and when do you prioritize consistency? Here is my rule of thumb. Prioritize consistency for anything where, if it weren't consistent, my mom would be upset. I mean, if my mom's bank account balance was unavailable, she would be a little annoyed, right? But if my mom's bank account showed a different balance than she had, or expected to have, she would lose her mind, right? Like anyone would.
Benny Kitchell: Prioritize availability for anything where, if it wasn't online, I would be bored or less likely to spend money. Things like your retail sites and your social media sites. I don't care if my Instagram feed is in an inconsistent order (I mean, it hasn't been in order in a long time anyway), but if it was down altogether, then I would be bored out of my mind. There aren't always perfect analogies to apply to every system, but this should give you a good idea of how to trust yourself: would I rather a system show me inconsistent information, or just not always be available to me? In the end, when it comes to scalability, there is a seemingly endless number of topics to cover. However, if you are listening to this and you're asking, how do I use a cache effectively? What's with all these random databases? How do I keep things in sync when I'm trying to scale? Then go ahead and hit that subscribe button, because these are exactly the topics we'll cover in the upcoming episodes.
Benny Kitchell: If you have enjoyed this episode, please rate us five stars. It means a lot to us, and every rating helps us reach more people and helps keep this podcast free. If there's anything you didn't quite understand, or anything I didn't cover, or anything to that effect, let me know. I want to make this podcast the best available resource for free system design knowledge, and I can't do that without your help. So please email me at LearnSystemDesignPod@gmail.com. One more time: LearnSystemDesignPod@gmail.com. And please remember to include your name if you want a special shout-out; I would love to do it. I appreciate anyone who wants to support the show, even if it's just to send a message saying, hey, you got this wrong. We all improve when one of us improves. If you would like to support this podcast, help me pay my bills, and help this podcast sound better, please jump over to our Patreon and consider becoming a member. Right now there's just one tier, and it's just a dollar. Whenever I have more time to record more podcasts and more personal things, I'll add more tiers, but for now it's just a dollar, just something you can chip in to show that you support the podcast.
Benny Kitchell: All of these podcasts are inspired by Crystal Rose. All music is written and performed by the wonderful Aimless Orbiter. You can check out more of his music at soundcloud.com/aimlessorbitermusic. But with all that being said, this has been Benny Kitchell. Scale on.