Learn System Design

Mastering System Design Interviews: Building Scalable Web Crawlers

Ben Kitchell Season 2 Episode 5


Web Crawler Designs

Can a simple idea like building a web crawler teach you the intricacies of system design? Join me, Ben Kitchell, as we uncover this fascinating intersection. Returning from a brief pause, I'm eager to guide you through the essential building blocks of a web crawler, from queuing seed URLs to parsing new links autonomously. These basic functionalities are your gateway to creating a minimum viable product or acing that system design interview. You’ll gain insights into potential extensions like scheduled crawling and page prioritization, ensuring a strong foundation for tackling real-world challenges.

Managing a billion URLs a month is no small feat, and scaling such a system requires meticulous planning. We’ll break down the daunting numbers into digestible pieces, exploring how to efficiently store six petabytes of data annually. By examining different database models, you’ll learn how to handle URLs, track visit timestamps, and keep data searchable. The focus is on creating a robust system that not only scales but does so in a way that meets evolving demands without compromising on performance.

Navigating the complexities of designing a web crawler means making critical decisions about data storage and system architecture. We’ll weigh the benefits of using cloud storage solutions like AWS S3 and Azure Blob Storage against maintaining dedicated servers. Discover the role of REST APIs in seamless user and service interactions, and explore search functionalities using Cassandra, Amazon Athena, or Google’s BigQuery. Flexibility and foresight are key as we build systems that adapt to future needs. Thank you for your continued support—let’s keep learning and growing on this exciting system design journey together.

Support the show

Dedicated to the memory of Crystal Rose.
Email me at LearnSystemDesignPod@gmail.com
Join the free Discord
Consider supporting us on Patreon
Special thanks to Aimless Orbiter for the wonderful music.
Please consider giving us a rating on iTunes or wherever you listen to new episodes.


Speaker 1:

Hello everyone, welcome to episode 12 of the Learn System Design podcast with your host, me, Ben Kitchell. Right off the bat, I want to take a moment to thank everyone for their patience in between episodes. Of course, we're at the holidays, which can make episode timing a bit hectic, but I also spent the last week in the hospital with my sister, who was battling pneumonia and a pulmonary embolism. I am happy to share that she is in good spirits, she's on her road to recovery, and she is now out of the hospital. With all that out of the way, I am back, and I want to let you know that this episode was supposed to focus on drawing diagrams and that design step of the interview process. Instead, we will be jumping straight into a practical example with every step, including the steps from the previous episodes, and then also talking about the design and diagram portions as well. Why do this? Well, for one, you guys have been patient and I appreciate it, but I also feel that speaking on everything together will give better context to this final section and help give you a better grasp of where everything fits in and how the whole process, from A to Z, actually looks. With that being said, today is the day I'm sure a lot of you have been waiting for. We're going to look at practical examples of designing a system. We will use everything we have learned along the way to build different types of systems, and today's episode specifically was voted on by our Patreon members: we will be designing a web crawler. If you've listened to the last few episodes, you will remember me bringing up these words constantly, but it is time for our ever-important first step of defining our functional requirements. In case you forgot, that basically means defining what the system needs to accomplish.

Speaker 1:

If you are new to what a web crawler is, it simply crawls, or scours, or ventures across the internet starting from certain URLs and extracts important information. It then stores the data it gets from those pages and allows a user to search the data that's been found across the internet. While search engines like Google, Bing or DuckDuckGo aren't technically web crawlers, they do rely on web crawlers to help them index the entirety of the internet. So what are our functional requirements, and, furthermore, what are the things that could be part of the system but might be out of scope for today's project? Well, the first one's probably obvious: we want the web crawler to crawl, right? Web crawlers themselves should be automatic. They will start off with a list of URLs, but they need to parse new links from those URLs to visit new pages. If I give it google.com, Google is going to have a bunch of different links. Our web crawler needs to be able to take the links on that page, visit each one of them, get the data from those pages, find URLs in that data, and continue the whole process of finding URLs, visiting the page, taking the data, finding URLs in that data, and so on and so forth. Let's take an example. If I wanted to set up a web crawler to find the best price for a new graphics card, I might put in a few seed URLs like Best Buy, Newegg and maybe even Google.
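
Since the whole crawl loop is just queue, fetch, extract, repeat, here is a minimal sketch of it in Python. This is only an illustration of the idea, not the episode's design: `requests` and `BeautifulSoup` are assumed to be available, `store_page` is a hypothetical stand-in for whatever storage you use, and a real crawler would also respect robots.txt, rate limits and retries.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                     # assumed available: any HTTP client would do
from bs4 import BeautifulSoup       # assumed available: used only to pull out <a href> links


def store_page(url, html):
    # Placeholder for "save the raw HTML" -- in the real design this goes to object storage.
    print(f"stored {len(html)} bytes from {url}")


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: visit a page, save its HTML, enqueue the links it contains."""
    queue = deque(seed_urls)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                # skip pages that fail to load
        store_page(url, html)
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if urlparse(absolute).scheme in ("http", "https"):
                queue.append(absolute)


crawl(["https://www.google.com"], max_pages=5)
```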

Speaker 1:

For the first two, we could write a script that searches for GPUs, grabs the information and then returns it to a database. Then I can query that database and see that Best Buy has this card for this price and that card for that price, and Newegg has the same thing for cheaper or more expensive. But again, what about the Google example? Google doesn't sell graphics cards. Instead, it might have some useful links to other websites that have a better price. So therein lies our first functional requirement: we need a way for our system to queue up seed URLs as it parses new links. Even when it goes to Best Buy, clicking on an item is still a new URL. It still has to visit that page, parse that information and save the relevant pieces from there. The next functional requirement we've sort of already hinted at: we need a way for the user to fetch the relevant data, to search it whenever they want. We've got tons of data about GPUs and their prices, but if we don't have a good way of actually laying out that data or fetching it, then it's kind of useless, right?

Speaker 1:

We could also add more things, depending on the ask at the interview or if the product spec wants more verbosity. These extras might include something like a schedule for crawling specific pages, such as crawling a social media site which changes frequently versus something more static like a blog that doesn't have a lot of changes during the day. Another good idea might be implementing duplication checks. This is helpful when we'd otherwise index a single page multiple times; I don't want the same page's information stored, and me paying for that storage, hundreds of thousands of times. Finally, we could also consider page prioritization, where the most popular pages show up first based on how many times they appear, how old the link is, or how closely they match the search. If I'm searching for GPUs, maybe I don't want to see CPU data. Things like that.

Speaker 1:

For now, though, let's just talk about our basic functional requirements. One: the crawler should accept a list of seed URLs at the start. Two: the crawler should enqueue new URLs it comes across. And finally, the user should be able to query the data parsed by the web crawler. As mentioned before, whether you're designing an MVP or working through an interview, we only need two to three functional requirements. If we need more in the future, that's something to tackle at another time. Don't overwhelm or bog yourself down with a lot of functional requirements that aren't needed, and the same rule of thumb applies to our non-functional requirements.

Speaker 1:

If you recall from episode number eight of this exact podcast, there's an easy pool to choose from when it comes to non-functional requirements. The complexity comes in when deciding which of these are the most important for your system. I'll recap that pool at a high level here, though if you don't remember it fully, I recommend going back and listening to that episode. The rough pool is your CAP theorem, environment constraints, scalability, latency, durability, security, fault tolerance and compliance. The big three from this list, in my opinion, are scalability, availability (which is part of the CAP theorem) and fault tolerance. Others you might consider are security, because you'll be crawling unknown URLs autonomously, so you want to make sure the data you're storing is secure. Another one to consider is compliance. If you're scanning data that could contain personal identifying information, or PII, it is important to obfuscate this data and not expose it to individuals who shouldn't be seeing other people's personal information. For now, though, again, we'll stick with our two to three: the system should be highly scalable, the system should be highly available and, finally, the system should be fault tolerant.

Speaker 1:

When it comes to estimations and back-of-the-napkin math, I've talked in previous episodes about my general disdain for capacity estimates. I want to make it clear before I jump into this that it's not always necessary. It's up to you and the interviewer, or your product leader, to decide how in-depth you really want to drill down into this. In the end, regardless of calculations, you're going to end up with a giant number of users, a giant amount of data and, surprise, a giant number for your cost. In reality, 99.99% of people listening to this will not need to design systems with petabytes of data and billions of users. However, I would be doing you all a disservice if I did not address this section.

Speaker 1:

So let's dive into some raw numbers and estimates. Because the nature of web crawling can grow exponentially, we can assume that the number of pages we will be saving is probably going to be huge. If you remember my advice from previous episodes about how to deal with large numbers: instead of saying something like 986 million pages to sound smart and more accurate, we can instead just say it'll be roughly 1 billion a month. And why 1 billion versus 986 million? Because getting your queries-per-second threshold is a lot easier when you divide 1 billion by 30, by 24, by 3,600, rather than 986 million divided by 30, divided by 24, divided by 3,600. Yes, the 986 million might be more accurate, but you're not going to get exactly 986 million. Every month that number can change; it can get bigger, it can get smaller, and you will never have the exact number. So giving a number that makes it harder for you to do this sort of back-of-the-napkin math is going to hurt you more than help. Doing that quick calculation, we end up with somewhere around 390 queries per second.
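
If you want to sanity-check that number, the arithmetic is just the monthly total divided by the seconds in a 30-day month; here is that back-of-the-napkin math spelled out in Python:

```python
PAGES_PER_MONTH = 1_000_000_000        # rounded up from ~986 million
SECONDS_PER_MONTH = 30 * 24 * 3600     # 2,592,000 seconds in a 30-day month

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH
print(round(qps))                      # ~386, which we round to roughly 400 per second
```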

Speaker 1:

So again, let's round that up and say 400 queries per second. But here's a little secret: because we've done that math now, we don't have to do it again in the future. What do I mean by that? Well, when you're doing any calculations in the future, you know that 1 billion of anything a month, doesn't matter what it is, is the equivalent of 400 of that same thing per second. You can go into an interview or product discussion and not have to write all this out. You can just say, well, I know that 1 billion requests a month means 400 requests a second.

Speaker 1:

Taking this a step further, we can also apply it to smaller amounts. 100 million is a tenth of a billion, so what's a tenth of 400 each second? Therefore, 100 million of anything a month is just 40 of that thing a second, and of course 10 million will equate to 4 a second. You see, once you get this logic and you remember this math, you can just apply it without having to spend extra time working through these minute calculations. 400 a second means 1 billion a month. 40 a second means 100 million a month. 4 a second means 10 million a month. And, it probably goes without saying, but 1 a second is roughly 2.5 million a month.

Speaker 1:

So let's talk about the other side of the coin: the data size and your storage estimates. We know we have roughly 1 billion URLs a month. So what is the rough size of any of these web pages? You can just choose a reasonable number, or look it up if it's easier, but for the sake of this discussion we're going to say it's 500 kilobytes. We know that 500 kilobytes times 1 billion pages is 500 billion kilobytes.

Speaker 1:

We can consult the powers-of-a-thousand chart that we talked about in previous episodes. 1 billion kilobytes is a terabyte, so 500 billion kilobytes is 500 terabytes. So we have 500 terabytes' worth of URLs a month, or 6,000 terabytes a year, which is the equivalent of six petabytes' worth of URLs that we are storing per year. From here it's pretty easy math. How long do we want to store the data for? Conventional wisdom says five years, but maybe we want to store it for three or seven. It's as simple as multiplying six times three, six times five or six times seven, so roughly we'll have somewhere between 18 and 42 petabytes' worth of data at any given time. And so, to summarize: remember what I said before, capacity estimates don't bring a lot to the conversation besides knowing that we have a lot of data and a lot of requests, and none of it actually changes or informs how we design our system. If the number ends up being small, in the thousands, sure, that might tell you, hey, we can cut back on a lot of these things and have a smaller, more uniform design. But most of the time in these interviews they're looking to see if you can scale to the level of billions of users and petabytes of data, and so extracting those numbers just means that you know how to do basic math, that you can add, subtract, divide and multiply, which is not very helpful for actually designing the system as a whole.
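
The storage math follows the same pattern; here it is written out in Python under the same assumptions (500 kilobytes per page, 1 billion pages a month):

```python
PAGE_SIZE_KB = 500                              # assumed average page size
PAGES_PER_MONTH = 1_000_000_000

kb_per_month = PAGE_SIZE_KB * PAGES_PER_MONTH   # 500 billion kilobytes
tb_per_month = kb_per_month / 1_000_000_000     # 1 billion KB = 1 TB  ->  500 TB per month
pb_per_year = tb_per_month * 12 / 1_000         # 6,000 TB per year = 6 PB per year

for years in (3, 5, 7):
    print(f"{years}-year retention ≈ {pb_per_year * years:.0f} PB")   # 18, 30, 42 PB
```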

Speaker 1:

Next, we'll talk about the database models, and luckily for us, the database models for this particular design are pretty straightforward. But let's look at our functional requirements and break them down. The crawler should accept a list of URLs, these URLs are enqueued, and then the data that we get should be searchable. For the first two, we simply need a data model that has a string property for the URL and, to help ensure we're not bombarding these websites with requests, a last-visit timestamp. This will allow us to check that if it's been less than a certain timeframe, less than a second, less than a minute, we don't visit the URL again. Finally, we would need an S3 URL, which is just a URL that points to a bucket in S3 where the majority of our parsed HTML data lives. We could also have some extras for metadata, like when the record was created or how many times we came across this URL. The world's your oyster, you can add as much as you want, but for this exercise let's just stick with our URL, our last-visit timestamp and our S3 URL. And that leaves the final functional requirement: the user should be able to query that parsed data.
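
As a rough sketch, that record could look something like the dataclass below. The field names are my own placeholders, not a prescribed schema, but they map one-to-one to the URL, last-visit timestamp and S3 URL we just talked about; the 60-second revisit window is also just an assumed default.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CrawledUrl:
    url: str                              # the page we crawled
    last_visited_at: Optional[datetime]   # used to throttle re-visits
    s3_url: Optional[str]                 # points at the raw HTML in object storage

    def due_for_revisit(self, min_interval_seconds: int = 60) -> bool:
        """True if we haven't visited this URL recently enough to skip it."""
        if self.last_visited_at is None:
            return True
        age = (datetime.utcnow() - self.last_visited_at).total_seconds()
        return age >= min_interval_seconds
```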

Speaker 1:

Much like every part of this that we've covered in this episode and in other episodes, I've been saying from the beginning that there are multiple ways of tackling this. I would argue that this is one of the most important decisions we make here: how we store the data and make it efficient to search. If we were dealing with a small number of pages, like in the tens of thousands, we could get away with just parsing the data, keeping it in the database and doing a fuzzy search. If we had millions of pages, we might be able to get away with storing it in something like Cassandra and just keeping our indexes clean and really focusing on that level of storage. But as you remember from just a moment ago, we have billions of pages, and as you probably guessed a second ago when I said S3 URL, you can see where I'm going with this: when we have billions of pages, or billions of anything, we should instead try to keep our costs a bit lower and store it in a large object storage, something like AWS's S3 or Azure's Blob Storage. This might be a bit more costly than running our own servers and keeping everything within our own domain, but running our own servers is going to be more of a pain in the neck. It's going to require more engineers, more people keeping an eye on the data, handling backups, all of these things, whereas something like S3 or Azure Blob Storage will hold all of it, have the backups, have the redundancy, and we don't have to worry about the power going out at our server farm and no one being able to access the data. That's part of our availability non-functional requirement, right off the bat.

Speaker 1:

As far as the API design goes, the application will have two avenues of communication: externally to the user, and internally for other applications, like if a search engine, for instance, wanted to use our web crawler. Depending on the situation, it can do either one or both at the same time. For the user-facing portion, we will need a way for the client to talk with the application, and a pretty quick and easy way to do this, of course, is using an API. In this case, a REST API will be more than enough. For the internal communication, we could use a REST API or we could use something like RPC, but for this use case let's just use a REST API across the board.

Speaker 1:

To satisfy our first functional requirement of allowing a client to pass in some seed URLs, we would have something like a POST request to /v1/seed, with a JSON body that has an array of URLs we would begin to work off of. For our next functional requirement of enqueuing new URLs as we come across them, there's nothing really we need in the public API; that's all internal communication. We could have an internal API call from one service to another that adds those URLs to the queue, but we don't really have to lay that out in our API design. For the third functional requirement, we will have a GET request to /v1/search that accepts query parameters containing the keywords we want to search the data for. At this point, what I like to do is take a second to talk about how the data will flow through our system, because in the next steps we're going to be drawing the diagrams and applying this. And don't worry, since this is an audio experience, I will be including rough sketches of diagrams and doing call-outs for the next portion; you can find a link to those below in the show notes.
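
To make those two endpoints a bit more tangible, here is a hedged sketch using FastAPI; the routes match what we just described, but the request and response shapes are assumptions of mine, not the episode's spec.

```python
from fastapi import FastAPI, Query
from pydantic import BaseModel

app = FastAPI()


class SeedRequest(BaseModel):
    urls: list[str]   # the seed URLs the crawler starts from


@app.post("/v1/seed")
def seed(request: SeedRequest):
    # In the real system this would push each URL onto the crawl queue.
    return {"enqueued": len(request.urls)}


@app.get("/v1/search")
def search(q: str = Query(..., description="keywords to search the parsed data for")):
    # In the real system this would hit the keyword index (Cassandra, Athena or BigQuery).
    return {"query": q, "results": []}
```

Run under uvicorn, a POST to /v1/seed with a body like {"urls": ["https://www.bestbuy.com"]} stands in for the seed call, and GET /v1/search?q=gpu stands in for the search call.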

Speaker 1:

For now, though, let's focus on the data flow. The initial step is taking the URLs from the initial seed request and putting them into our queue. From here, the queue will make a request to the URL server to get the initial HTML for that web page. We'll then store this data in an S3 bucket, store that S3 URL property in our database record (if you remember) and, of course, update our last-visit timestamp in the database. Next, we'll send that raw data along to a URL parsing service that will look for new URLs, and then it can put those back into the queue. Not too bad, right? I've only introduced two new services that should make a lot of sense: an HTML parser and a URL parser. One parses the HTML for the data; one parses the data for any new URLs. On the other side is our external user flow for searching. A user will send a GET request with a query parameter containing a number of keywords, the things they're looking for, and here we find ourselves in an important situation once again.
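
Here is that internal flow condensed into one function, purely as a sketch; every helper here (fetch, store_html_in_s3, update_url_record, extract_urls) is a stub standing in for a separate service, not a real API.

```python
import hashlib
import queue


# --- stand-in helpers; in the real design each of these is its own service ---
def fetch(url):                             # the "URL server"
    return f"<html>fake page for {url}</html>"

def store_html_in_s3(url, html):            # object storage write
    return f"s3://crawl-bucket/{hashlib.sha256(url.encode()).hexdigest()}.html"

def update_url_record(db, url, s3_url):     # DB write: S3 URL + last-visit timestamp
    db[url] = s3_url

def extract_urls(html):                     # the "URL parsing service"
    return []                               # stub: a real parser would return found links


def crawl_step(work_queue, db):
    """One pass of the flow: dequeue, fetch, store, record, re-enqueue new links."""
    url = work_queue.get()
    html = fetch(url)
    s3_url = store_html_in_s3(url, html)
    update_url_record(db, url, s3_url)
    for new_url in extract_urls(html):
        work_queue.put(new_url)


work_queue, db = queue.Queue(), {}
work_queue.put("https://example.com")
crawl_step(work_queue, db)
print(db)
```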

Speaker 1:

You may already be asking yourself: well, how do we search for specific keywords if the data is stored in S3? And there are, again, as I say all the time, many solutions to this. The most obvious solution is to add a new entity to your database model. You can go back and modify that model; you have not started building out your design yet, you're just planning, so it's totally okay to go back and add things that maybe you didn't think about before.

Speaker 1:

This could be a nice place for Cassandra, since there will be so many rows of data where you have just a key-value pair in which S3 URLs are associated with certain keywords. For instance, a keyword like shoes might have a value with the S3 URLs of pages that mention Nike, Adidas, etc., and when someone queries shoes, they would get the data back from those different sites that have been parsed. Another solution would be foregoing this entirely and using a tool like Amazon Athena or Google's BigQuery. These allow you to search your bucket's contents for that specific information. Those types of queries can be more costly, but they can also save you development time, so it's important to bring up the discussion here and explain why you would choose one of these over the other. I might implement the database with Cassandra to save on cost, or I might use Amazon Athena or Google's BigQuery to save on time; the important thing is to know why you're doing something and to make a call-out to say it. Let's now jump into the design step. This is what most people think of when they think of a system design interview. It's also the thing that most people rush to, and then they find themselves building the road while they're driving the car.
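
Before we get into the diagrams, here is a toy version of the Cassandra-style keyword index described above: a keyword maps to the set of S3 URLs whose parsed pages mention it. This is only illustrative; a real deployment would put this in a Cassandra table keyed on the keyword, or skip the index entirely and query the bucket with Athena or BigQuery.

```python
from collections import defaultdict

# keyword -> set of S3 URLs whose parsed HTML mentions that keyword
keyword_index: dict[str, set[str]] = defaultdict(set)


def index_page(s3_url: str, parsed_text: str) -> None:
    """Associate every word on the page with the page's S3 location."""
    for word in parsed_text.lower().split():
        keyword_index[word].add(s3_url)


def search(keyword: str) -> set[str]:
    """Return the S3 URLs of pages that contain the keyword."""
    return keyword_index[keyword.lower()]


index_page("s3://crawl-bucket/example-shoes.html", "Nike Adidas shoes on sale")
print(search("shoes"))   # {'s3://crawl-bucket/example-shoes.html'}
```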

Speaker 1:

In the show notes of this episode, you will see links to pictures, with annotations, for each of the steps I'll be going through. I will call out specific sections as I walk through the process, and feel free to save these images as needed. We've already been through the first few steps so far, and along the way we should be documenting our findings. You can see that in our first image, where we have laid out our current state, showing our functional requirements, non-functional requirements, capacity estimates, DB models and, of course, our API design. Your first action in the design step will actually feel very familiar if you've ever written an algorithm for the first time.

Speaker 1:

You want to get to a minimum viable, working design, so you want to start off very basic and then build up accordingly. Don't worry about scaling or anything else except the bare minimum of completing your functional requirements. The reason we do this is because we want to show that we can design a system for just a thousand users, but also that we can scale that design up responsibly and easily if our user base grows. As you can see in my basic design, in images two through five, I've taken the time to highlight how each part of my system maps to each of my functional requirements. Next, you want to begin building out your system to satisfy your non-functional requirements. In images six and seven you can see how I chose to scale the system and how I chose to implement strong availability. Some of the things I didn't show, but that should be touched on when you talk through this, are how splitting that one back-end service into multiple services helps us scale horizontally. Scaling horizontally helps us have higher availability, and it helps us with fault tolerance: if one of our services breaks, there are plenty of others that can take its place.

Speaker 1:

Try to pay attention to the changes from one image to the next, and always feel free to reach out to me if something doesn't make sense. I'm always here. Drop me a line on Discord, email, wherever is easiest for you. The important thing is to see how the design phase evolves. Your system doesn't need to look exactly like my system, it just needs to evolve in the same way. There are many routes we could have touched on past this point.

Speaker 1:

If you're interviewing for a higher-level position, you'll be expected to identify these things on your own. You want to make call-outs before the interviewer actually asks you about them or tries to help you along. Don't cut them off, of course, but when you're done with your design, look at it and say, well, here are the things I didn't address, but they could be implemented by doing X or Y. For someone with less experience, it's okay for the interviewer to bring these things up and give you some tips, and then you can talk about how you would address them.

Speaker 1:

A good example of things to consider for our web crawler project today would be something like a deduplication service. If we're fetching URLs over and over and we keep fetching the same one, we don't want that URL to have all of its data stored a thousand times in S3. It's redundant, it's not helpful and it just costs us money. Another thing might be web crawler traps. This is a bit more advanced, but it shows that you're familiar with what a web crawler is. Some sites design circular link structures that are basically a trap for web crawlers, leaving them stuck in an infinite loop they can never escape. So how do you defend against these? Try to think about all of these things once your design is done. Think about how you personally could break it, then bring that up and talk about how you would defend against it.
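
As one possible sketch of those two call-outs (not the episode's prescribed answer): hash each page's content so repeat fetches get dropped, and cap crawl depth so a circular link structure can't hold the crawler forever. The threshold values here are arbitrary placeholders.

```python
import hashlib

seen_content_hashes: set[str] = set()
MAX_DEPTH = 10   # assumed cap; tune for your use case


def should_store(html: str) -> bool:
    """Deduplication: skip pages whose content we've already stored."""
    digest = hashlib.sha256(html.encode()).hexdigest()
    if digest in seen_content_hashes:
        return False
    seen_content_hashes.add(digest)
    return True


def should_follow(url: str, depth: int, visited: set[str]) -> bool:
    """Trap defense: don't revisit URLs and don't crawl infinitely deep."""
    return url not in visited and depth <= MAX_DEPTH
```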

Speaker 1:

All in all, I hope this episode has been helpful in understanding not just how to design a web crawler, but also how the process of designing a system in a systematic way, especially during an interview, actually feels. I will be following the same structure going forward in new episodes, and I'll be including images and things like that. Next episode, we will be diving into designing a URL shortener. A URL shortener is one of the most common problems in system design, so it should be a lot of fun. Again, just another call-out.

Speaker 1:

Thank you everyone for your patience while I've been getting these episodes out. It is the holiday season, so if they're a day or two late, that could be what's going on. But yeah, I want to give a huge thank you to Devil's Soul on Discord. They had a fantastic question about where they can read transcripts: if you visit the podcast page, learnsystemdesign.buzzsprout.com, and click on any of the episodes, you can find a full transcript of the entire episode. Also, feel free to just email me; anything I can do to help, I'm always happy to do. I also wanted to give a big thank you to Martin K on the Discord, who had a lot of wonderful feedback. I'll be working on implementing that as much as I can in the future. And finally, I want to give the biggest of thanks to everyone on our Patreon. So a very special thank you to Pranav Leah, Whitelaw, Lucas, Anderson, Kalyan Desika, Sergey Bugarra, Anil Dugar, Jake Mooney, Charles Gazals, Eduardo Muth-Martinez, Wahoo Soares, Adrian Couric, Gabriel Jesper Klotzel, Nathaniel Sutton, Florian Breider and Klavian S for being a part of our Patreon and helping support me.

Speaker 1:

If you're listening to this and you've been enjoying the episodes, it would mean a lot to me if you shared it with a colleague or friend, or just tweeted about it. Anything works, I just appreciate it. I appreciate all of you. You all are my Christmas gift. If you'd like to suggest new topics or even be a guest on the podcast, feel free to drop me an email at LearnSystemDesignPod@gmail.com. Remember to include your name if you'd like a shout-out. If you'd like to support the podcast and help me pay my bills, please jump over to our Patreon and consider becoming a member. You get the episodes a week early, and you also get to vote on the different topics that I'll be covering. All podcasts are inspired by Crystal Rose. All music is written and performed by the wonderful Aimless Orbiter. You can check out more of his music at soundcloud.com slash Aimless Orbiter Music. And with all that being said, this has been the Learn System Design podcast, and I'm scaling down. Thank you.
