Learn System Design

Mastering System Design Interview: Capacity Estimates, Scaling Challenges, and Strategic Insights

Ben Kitchell Season 2 Episode 2


Master the art of system design as Ben Kitchell guides us through the essential skills every senior tech candidate needs to excel, starting with capacity estimates. By the end of this episode, you'll be able to navigate the complexities of bandwidth and data size without getting bogged down in unnecessary arithmetic. We explore how to think like industry leaders at Netflix, Google, and Instagram, focusing on rough estimates, worst-case scenarios, and the use of metric prefixes to simplify calculations. This episode is not just about numbers; it's about understanding the larger picture and harnessing the power of strategic thinking.

Our discussion doesn't stop at capacity. Join us as we tackle the challenges of large-scale systems, offering insights into handling billions of users and managing enormous data streams. Learn to focus on the core components of a system, such as video content for a streaming giant, and balance cost with hardware efficiency. Plus, get a sneak peek at our upcoming special episodes and discover ways to support and engage with our community, from sending feedback to joining us on Patreon. This isn't just a lesson in system design—it's a call to action for aspiring tech leaders to think big and design even bigger.


Dedicated to the memory of Crystal Rose.
Email me at LearnSystemDesignPod@gmail.com
Join the free Discord
Consider supporting us on Patreon
Special thanks to Aimless Orbiter for the wonderful music.
Please consider giving us a rating on iTunes or wherever you listen to new episodes.


Speaker 1:

Hello everyone, welcome to Season 2, Episode 2 of the Learn System Design Podcast, with your host, me, Ben Kitchell. This week we are going to continue down the primer of building a system. These episodes are geared a little more towards the system design interview that you'll find at most tech companies, especially if you're going in as a more senior candidate, but I also want to make clear that these are the exact same steps and considerations that I try to take into account when designing a new system, whether that's a fresh product or a rebuild from the ground up. If you haven't yet, I definitely recommend listening to the last episode, episode number eight, before listening to this one, as we will be talking a lot retrospectively about the functional and non-functional requirements, not just in this episode but in the next couple of episodes, because all of these topics are so tightly coupled together. They're very important, and they reinforce what you're doing and why you're doing it. It's also easy to get lost in the weeds on specific topics, and in my personal opinion, today's topic is not only the easiest to get caught up in, it's also the easiest way to completely wreck an interview. That topic, of course, is capacity estimates. It is almost a rite of passage as an engineer to find yourself in a 45-minute system design interview and spend about 40 of those minutes trying to do math and make broad-stroke assumptions about how much load you might have. Do I have 1.6 million daily active users? Do I have 1.625 million daily active users? These sorts of things. And it's nothing to be ashamed of; it's a part of this interview, and it's a part of the process of learning how to handle these sorts of ideas and what it actually means to scale.
But honestly, here's the little secret I've learned: the answer to "how many people are going to be using the product?" or "how much data throughput do you need to handle?" is always going to be a lot. Not 1.265, not 1.5; it's just going to be a giant number, and a giant number on its own doesn't actually help you that much. Here's the catch: the specifics don't really matter, because the answer will always be a number so big it makes you feel like you need to focus on it. So instead of spending all your time on the arithmetic, today I'm going to teach you how to take the size into consideration in a more streamlined fashion, so that you don't get caught up on it.

Speaker 1:

Let's think about writing an algorithm, for instance, something you've probably done a lot, whether in school, at work, or in an interview. When you do that, you don't try to estimate the number of people that are going to be using your system, and you don't try to estimate how many times a certain piece of data will go through a loop. You think about the worst-case scenario and keep that number relative, as in: this loop will take Big O of n time complexity, right?

Speaker 1:

Then why, when designing a system, are we so mathematically precise, trying to find exactly how long it takes to read 2,435 gigabytes of video from sequential storage on a spinning disk? These sorts of things are important, but the specifics aren't. Instead, let's focus on the crux of the problem: our estimations and our capacity, not the specifics and the arithmetic. So what, then, do we estimate? The amount of read and write throughput in your system, for instance, and the amount of storage your system will need to hold.

Speaker 1:

When dealing with the numbers, just keep them in factors of a thousand, right? Because the more you round, the easier it is, and the bigger the number, the less the specifics matter. If I say I'm going to give you $110, you might just tell everyone, oh yeah, Ben gave me $100, right? The 10 doesn't really matter, just like when you're doing an algorithm: is it Big O of n plus 2? Well then, it's just Big O of n; the plus 2 doesn't actually matter. And what I mean when I say stick to factors of a thousand, simply put, is that it's the difference between, say, 153,670 people and 154,000 people. Those extra few hundred people are not going to bankrupt your company, they're not going to break your system, and it's a lot easier to do that sort of back-of-napkin math with nice round numbers.

Speaker 1:

So when, then, is it important to do those quick calculations I speak of? Well, we'll get to that a little later. For now, let's make sure we have a few things memorized. These are the important pieces of information you should bring into any interview, or any calculation, when you want to consider your capacity. For some of you it might feel like a refresher or common knowledge, but for others it may be the first time considering it, so I want to cover it regardless. That way, we're all on the same page going into how to do this back-of-napkin math and what's really important about it.

Speaker 1:

So the first thing is to always remember your scales of 1,000 and how they relate to data sizes. In this case I'll be using bytes, but these factors can technically be applied to anything that uses metric prefixes. In tech, when we think of data sizes, we usually describe them as bits and bytes, right? But it is important to understand the levels of these sizes relative to one another. For every increment of a thousand, we use a different prefix. So for every thousand bytes, we have kilo, as in a kilobyte. For every 1 million bytes, which, you might note, is 1,000 squared, we use mega, as in megabyte, and so on and so on. I'll get to the rest in a minute. The important thing to remember is that 1,000 or less is just the base unit.

Speaker 1:

For example, 560 bytes of data is just bytes. And if you think about it, a thousand raised to zero is one, right? A thousand raised to one is a thousand, so that would be a kilobyte. A thousand raised to two would be a million. A thousand raised to three would be a billion. A thousand raised to four would be a trillion, right? And so the designations for those, in order, would be just a byte; a kilobyte, which is our thousand; a megabyte, which is a million; a gigabyte, which is a billion; and then a terabyte, which is a trillion.
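To keep that prefix ladder straight, here is a minimal sketch in Python. It assumes decimal (SI) units of 1,000, as in the episode, rather than the binary 1,024 convention, and the `humanize` helper is just an illustrative name:

```python
# Decimal (SI) data-size prefixes: each step up the ladder is a factor of 1,000.
PREFIXES = ["B", "KB", "MB", "GB", "TB"]  # 1000^0 .. 1000^4

def humanize(num_bytes: float) -> str:
    """Rewrite a raw byte count as a small number with the right prefix."""
    power = 0
    while num_bytes >= 1000 and power < len(PREFIXES) - 1:
        num_bytes /= 1000
        power += 1
    return f"{num_bytes:.2f} {PREFIXES[power]}"

print(humanize(560))                 # 560.00 B
print(humanize(1_350_000_000_000))   # 1.35 TB
```

Notice that the loop counter is exactly the exponent from the paragraph above: 1,000 raised to zero is bytes, raised to four is terabytes.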

Speaker 1:

It's a lot easier to do arithmetic with a thousand raised to three than with a number carrying nine zeros, because you can factor the exponent out, ignore it, perform your arithmetic, and then add the cube back in at the end. You get a good idea of the scale without a lot of very complicated arithmetic going on. In an actual interview, your capacity estimates shouldn't take you longer than five to ten minutes. Honestly, it's sometimes possible to just say: hey, interviewer, I'm going to skip over this for now; I know it'll be a large number, so I'll call it 1.5 based on past examples. And sometimes the interviewer will just say: okay, yeah, no problem. I want to know how you think, I want to know how you would approach the problem; I don't care whether or not you can add a bunch of large numbers. But we can go even further than that, right? So let's say you have 1.35 trillion bytes.

Speaker 1:

It's a lot easier to drop all those zeros and just have it be 1.35, and then say: well, if we want to 3x scale this, we can multiply 1.35 times 3, rather than some obscure number like 1,356,234 times 3. One of those is going to take you significantly less time to parse through when that back-of-napkin math is necessary. The important part is that you can take those numbers and say, with reasonable confidence, this is 1.35 gigabytes or terabytes, and know the difference in scale between those numbers. And if the interviewer presses you on what that actually is, you can tell them: oh, that's 1.35 trillion, or 1.35 quadrillion, or what have you. It shows that you know what you're talking about and that you know the numbers; you're still doing the calculations, but you're making it a lot easier for yourself. You're working smarter.
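That "drop the zeros, do the easy math, put the zeros back" trick can be sketched in a few lines. The `split_scale` helper is my own name for it, not anything standard:

```python
def split_scale(num: float) -> tuple[float, int]:
    """Factor num into (mantissa, power) so that num ~= mantissa * 1000**power."""
    power = 0
    while num >= 1000:
        num /= 1000
        power += 1
    return num, power

# 1.35 trillion bytes -> mantissa 1.35, power 4 (i.e. terabytes).
mantissa, power = split_scale(1_350_000_000_000)

# To 3x the system, multiply only the small number...
scaled = mantissa * 3                 # about 4.05
# ...then reattach the factor of 1000^power at the very end.
total_bytes = scaled * 1000 ** power  # about 4.05 terabytes
```

The interview version of this is purely mental: "1.35, times 3, is about 4; still terabytes."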

Speaker 1:

And speaking of working smarter: sometimes these tests come with specific constraints, and sometimes you need to think about a budget. That brings me to my next important factor, which we need for understanding the dynamics of latency across different constraints on our system. You will remember latency from episodes one and three, and from across our core episodes, as well as from the non-functional requirements in the last episode. The comparisons I'm about to give you link directly to those non-functional requirements. So, if you're keeping track, we are currently on step three in this whole process, and we are already calling back to step two. By considering the latency and making call-outs about specific hardware, you're already checking this non-functional requirement off the list.

Speaker 1:

So let's talk about the hard numbers, right? To read one megabyte from memory, the entire process will take around a quarter of a millisecond, which is pretty fast. But, as you may or may not know, memory is temporary, so you can't just hold everything in it. You need some sort of long-term storage, and the next fastest thing for our purposes is the solid-state drive. Fetching the exact same one megabyte of data from an SSD, compared to the quarter of a millisecond from memory, is actually a 4x slowdown. Yeah, that's right: fetching one megabyte of data from an SSD takes an entire millisecond. But of course, solid-state drives are more expensive than a more traditional spinning disk drive. How much more expensive? Not that long ago, it would have been $40 per gigabyte versus $0.05 per gigabyte, but thanks to innovation and beautifully minded, wonderful people, today it's more along the lines of $2 per gigabyte versus 5 cents per gigabyte. That may not seem like a lot, but when you get into petabytes' worth of data, it means a big, big bill. Why, then, do most engineers design with SSDs in mind when we don't have these specific cost constraints? Because fetching one megabyte of data from a spinning disk hard drive takes a whopping 20 milliseconds: 20 times slower than an SSD, 80 times slower than fetching it from memory.
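Those three numbers are worth keeping at your fingertips. Here is a small sketch using the episode's ballpark figures; these are rough rules of thumb, not measurements from any particular hardware:

```python
# Approximate time to read 1 MB sequentially, in milliseconds (ballpark figures).
READ_1MB_MS = {
    "memory": 0.25,  # about a quarter of a millisecond
    "ssd": 1.0,      # ~4x slower than memory
    "hdd": 20.0,     # ~20x slower than an SSD, ~80x slower than memory
}

def read_time_ms(megabytes: float, medium: str) -> float:
    """Back-of-napkin sequential read time for a given storage medium."""
    return megabytes * READ_1MB_MS[medium]

# Reading a ~2 GB movie (2,000 MB) from each medium:
for medium in READ_1MB_MS:
    print(f"{medium}: {read_time_ms(2000, medium):,.0f} ms")
```

Running this makes the trade-off concrete: the same movie that streams out of memory in half a second takes two seconds from an SSD and forty from a spinning disk.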

Speaker 1:

Hard disks still have their place, though; their continued use lies in the ability to utilize them as cold storage. If your data is not being accessed a lot, or it's not super important that it comes up immediately, maybe you can load it in the background, and storing that data on spinning disks is perfectly fine, often even encouraged, to save some money. Honestly, one clever way I have seen cold storage implemented is with user data, which might seem a little strange, but follow me. Think about a video game that's gone viral.

Speaker 1:

Everyone is playing it on day one; they're super excited, and everyone's logging in constantly. They've created their login, modified their character, and then it's been a month and they're over it. Some people might stick around, and when they log in, they want that process to be quick; you want to get them in game as quickly as possible. Again, see the Amazon reference from episode one about how fast you lose money when things are slow. But if you're someone who hasn't logged on in a while, maybe over a year, then you're a little more patient with logging in. You have no point of reference for how long it should take, so you have a little more flexibility in that sense. And so what you can do on the back end is have something like a cron job that checks the last login time for each user, and if it's been over, say, a month, move that user's data from the SSD database to the hard drive database. If, for whatever reason, they log back on, you just move that data back to the SSD; and if they never log on again, no worries, you aren't being charged a ton of money to store it, and the data is always there.
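That background job can be sketched in a few lines. Everything here is hypothetical, the dict-backed stores, the function names, the one-month cutoff; it's just to show the shape of the idea, with plain dictionaries standing in for the SSD and HDD databases:

```python
from datetime import datetime, timedelta

INACTIVITY_CUTOFF = timedelta(days=30)  # the hypothetical "one month" threshold

def demote_stale_users(users, hot_store, cold_store, now=None):
    """Nightly cron body: move users who haven't logged in for a month
    from the SSD-backed store to the cheaper HDD-backed store."""
    now = now or datetime.utcnow()
    for user in users:
        if now - user["last_login"] > INACTIVITY_CUTOFF:
            cold_store[user["id"]] = hot_store.pop(user["id"])

def promote_on_login(user_id, hot_store, cold_store):
    """On login, pull the user's data back into the fast store if needed."""
    if user_id in cold_store:
        hot_store[user_id] = cold_store.pop(user_id)
```

In a real system the two stores would be separate databases or storage tiers, and the move would need to be transactional, but the access pattern is the same: cheap storage for the patient users, fast storage for the active ones.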

Speaker 1:

The next piece of information I want you to memorize is the rough size of data in terms of storage capacity. Have you ever thought about how much space the things you interact with on a daily basis take up? Let's talk about a company like Netflix. They get roughly 100 million videos streamed a day. That in itself is a gigantic number, even if you were just talking about pictures, but Netflix of course works with video, and the rough size of a two-hour movie is, on average, about one to two gigabytes, and that's not 4K. If we're talking about 4K high-res movies, we're looking somewhere in the range of 10 to 20 gigabytes apiece. And again, as we talked about before, the exact number here doesn't technically matter; Netflix simply deals with a lot of data. So having these rough estimates is handy when trying to think intelligently about the amount of data you're dealing with. If a two-hour movie is one to two gigabytes, then a small book's worth of text or a high-res photo is more around a megabyte, whereas a medium-resolution photo can be as small as around 100 kilobytes. So it's safe to say that if you're building something like Netflix versus something like Instagram, you're going to approach it in a different way.
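Those rough sizes fit in a small reference table. The figures below are the episode's ballpark numbers, not exact measurements, and the Netflix-scale multiplication at the end shows why the exact digits don't matter, only the prefix does:

```python
KB = 1000
MB = 1000 ** 2
GB = 1000 ** 3

# Ballpark payload sizes (decimal units), per the rules of thumb above.
ROUGH_SIZES = {
    "medium_res_photo": 100 * KB,  # ~100 KB
    "high_res_photo": 1 * MB,      # ~1 MB, about a small book's worth of text
    "two_hour_movie": 2 * GB,      # ~1-2 GB at standard resolution
    "two_hour_movie_4k": 20 * GB,  # ~10-20 GB
}

# Netflix-scale: ~100 million streams a day at ~2 GB each.
daily_bytes = 100_000_000 * ROUGH_SIZES["two_hour_movie"]
print(daily_bytes / 1000 ** 5, "petabytes per day")  # 200.0 petabytes per day
```

Whether the true figure is 180 or 220 petabytes is beside the point; what matters is knowing you are in "hundreds of petabytes a day" territory, not gigabytes.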

Speaker 1:

And for our final piece of information, I want to talk about the rough sizes of companies' operations. This is to help you memorize the scale at which you'll need to think about the data being processed, that sort of throughput. For instance, if you're designing a system that will handle the same load as a social media network, you're looking at around a billion daily active users. We talked about Netflix and how they stream 100 million videos a day; that's very important as well. Google fields around 100,000 queries per second, and building an app like Wikipedia means storing data somewhere in the neighborhood of 100 gigabytes if it's uncompressed.

Speaker 1:

So try to remember these numbers, so that if you go into an interview and they say design Netflix, design Wikipedia, design Google, these very common questions, you already know roughly what you're dealing with. For something like Instagram, not only do I need to handle a billion daily active users, but those users will be uploading maybe a megabyte's worth of photos each, and they're also going to be reading a megabyte's worth of photos, or more: if they have a feed, it's full of their friends, and each one of those friends has a one-megabyte photo, so now you're sending a billion daily active users photos, times the number of friends they have on average. But again, this is where the quick math helps. It's a lot easier to say: well, one billion, okay, that's just one. And if a photo is one megabyte, then roughly one megabyte times a billion people puts us in the neighborhood of a petabyte's worth of data flooding through our system. And okay, is it a read-heavy system or a write-heavy system? Well, more people are reading and looking at pictures on Instagram than are uploading. So I need to focus on making sure I can handle the throughput of a petabyte's worth of data going out and being read from my system each day. And finally, I want to give you a reminder about the common mistakes to avoid when approaching this step in designing a system.
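That feed calculation is exactly where the drop-the-zeros habit pays off: one times one, with the exponents added at the end. A sketch, where the one-photo-per-user figure and the 200-friend average are illustrative assumptions, not real Instagram numbers:

```python
MB = 1000 ** 2

daily_active_users = 1_000_000_000  # "a billion" -> mentally, just "1" times 1000^3
photo_size = 1 * MB                 # assume ~1 MB per photo

# 1 MB x 1 billion users: mantissas 1 x 1, exponents 2 + 3 = 5,
# so 1000^5 bytes -- about a petabyte of reads per day.
daily_read_bytes = daily_active_users * photo_size
assert daily_read_bytes == 1000 ** 5  # one petabyte

# With, say, 200 friends' photos in an average feed (hypothetical figure),
# the read side dominates the write side by orders of magnitude.
avg_feed_photos = 200
feed_read_bytes = daily_read_bytes * avg_feed_photos  # ~200 PB/day read
```

The uploads, by contrast, stay near one petabyte a day, which is why the design conversation should center on read throughput.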

Speaker 1:

If you're taking an interview, you might be able to brush past this step a bit. As I said, saying something like "this system will share similarities with Netflix, so I want to consider around 100 million videos at two gigabytes apiece" is sometimes enough for the interviewer. But they might push for a little extra math, and then you want to give it some thought and do those calculations like I just did for a social media app like Instagram. Avoid, however, trying to calculate things like how many hard drives it will take to hold the data, or getting too low-level and too specific with the numbers you're using.

Speaker 1:

On the other hand, if you are designing a system in a real-world scenario, your capacity considerations are important. You should consider the cost of hardware. You should consider using an SSD versus an HDD, or both; sometimes using a spinning disk drive is a lot cheaper. Sometimes it's easier to host your own servers than to use the cloud, and sometimes it's a lot more expensive. It just depends on your situation.

Speaker 1:

Regardless, remember to focus on the crux of the problem. If this is primarily a video-based service like Netflix, you don't need to worry about calculating the size of the text for descriptions, or the size of the user's avatar and what that would mean, right? Focus on the big thing: the videos, the users consuming those videos, how they're consuming them, and work from there. At the end of the day, the elements of capacity estimates that are always good to remember are your core facts. Think about your numbers in factors of a thousand, keep things high-level, and focus on what your system should be doing, not necessarily how many specific times it should be doing it. It will almost always be impossible to take every little thing into consideration, especially in an interview, so just try to focus on the crux of the problem, not all the little things that might pop up. It is perfectly fine to make small mistakes; no one is judging you on your ability to multiply a couple of numbers together. Instead, I want to know that you have a good idea of scale and of the rough size of the data involved, and that you can take that into consideration. From there, we can start talking about how to scale the system, how to work with it, et cetera. Next episode, we're going to focus on steps four and five: DB and API design. These are very important things; they help you flesh out your models, understand what an API will look like, and see how the data will flow through your system.

Speaker 1:

I want to give a special thank you to everyone who reached out. Of course, Antonio Lettieri, you had a great call-out on our load balancers episode, which I greatly appreciate. Gamesply and BeerX, you guys have been killing it on the Discord, making everyone feel welcome, and I greatly appreciate that. We also got a couple of pieces of fan mail. Unfortunately, I can't reply and I can't see your name, so to the wonderful person in Del Mar, California, and the other wonderful person in the United Kingdom: thank you so, so much for the feedback and the fan mail. If you want a more specific shout-out, feel free to send me an email. And finally, the biggest thank-yous to everyone on the Patreon: Jake Mooney, Charles Cazals, Eduardo Muth-Martinez. Thank you so, so much for supporting us on Patreon.

Speaker 1:

Later this month, I'm hoping to release a special episode just for everyone on Patreon. You guys are still getting the episodes a week early, but I also want to do a special episode just for you, focusing on authentication. We did have a couple of people vote for an authentication episode on the poll, so I want to do that special episode there. Eventually I will release it on the main channel, but that might not be for a month or so, so Patreon supporters are getting a special thing over there. Very much appreciated, all of you. I will also be posting a new poll on Patreon very soon, probably around the time this goes out, asking which specific interviews you guys want me to start tackling.

Speaker 1:

This has all just been a primer, but I actually want to talk about specific interviews, how I would approach them, do some research on the best ways to tackle them, and flesh that out for you guys. So definitely go to the Patreon and become a supporter if you have the means. If you just enjoy listening, it would mean the world to me if you provided some feedback, sent an email, told a friend, anything like that. It all means the world to me, and I very much appreciate it. If you would like to suggest specific topics for me to jump on, feel free to drop me an email at learnsystemdesignpod@gmail.com, and remember to include your name if you'd like a shout-out. If you would like to help support the podcast and help me pay my bills, again, please consider the Patreon, and for more music from Aimless Orbiter, jump over to soundcloud.com slash aimless orbiter music. With all that being said, this has been the Learn System Design Podcast. Thank you.
