Learn System Design

Mastering System Design Interview: From Concept to Scale Building Efficient URL Shorteners

Ben Kitchell Season 2 Episode 6


URL Shortener Designs

Unlock the secrets to designing a high-performing URL shortener system in our latest episode with Ben at the helm! Get ready to master the essentials that transform a simple idea into a robust tool, essential for platforms with strict character limits like Twitter. We'll walk you through the core elements of creating effective short URLs, from ensuring seamless redirects to setting expiration dates tailored for various industries like marketing firms. Discover the importance of high availability in read-heavy systems, and learn how to craft a service that not only meets but anticipates user demands.

Dive into the architectural complexities of building a URL shortener that can scale to billions of requests. Ben breaks down the nitty-gritty of data model structuring and the strategic benefits of non-relational databases like Cassandra for horizontal scaling. Learn to harness the power of character hashes and explore innovative ways to keep your URLs unique and efficient. We'll reveal the architectural tactics like adding database replicas and load balancers to maintain system availability and performance. Tune in for a wealth of strategies and insights that promise to elevate your system design skills to new heights!


Support the show

Dedicated to the memory of Crystal Rose.
Email me at LearnSystemDesignPod@gmail.com
Join the free Discord
Consider supporting us on Patreon
Special thanks to Aimless Orbiter for the wonderful music.
Please consider giving us a rating on iTunes or wherever you listen to new episodes.


Speaker 1:

Hello everyone, welcome to episode 13 of the Learn System Design podcast with your host, me, Ben. As we jump right off into the new year, I want to thank all of you. Surprisingly, I actually found out that we are in the top 25% of podcasts in 2024, which I could not have done without each and every one of you listening, reviewing and sharing. That being said, you know, it's not like I'm exploding with ad opportunities or anything. So, genuinely, honestly, jokes aside, whether this is the first time you're hearing my voice or the hundredth time, I appreciate you so much for just listening and, you know, allowing me to take up your time. In 2025, going forward, I'm going to try and focus more on making the podcast a more upgraded experience, so be on the lookout for things like new title screens, an improved website and ways to boost the community. I'll talk more about this in future episodes, but for now, let's just jump into it.

Speaker 1:

Today's subject is building a URL shortener. If you have used something like Bitly or TinyURL, then you're probably familiar with the concept. It's simple: basically, it takes a long URL and turns it into something a lot shorter and, a lot of times, a lot more readable. A simple example: instead of sending someone the URL learnsystemdesign.buzzsprout.com, you could simply give someone a shortened link like bit.ly/lsd. That's not a real one, at least not for my uses, so buyer beware. But aside from the readability, URL shorteners exploded in popularity due to the rise of apps like Twitter, which restricted the number of characters you can use. So when sharing things like an image or a link to a news article, whose URLs can sometimes be very long and very cumbersome, we needed a way to actually share those things and then also share our thoughts. In comes the shortened URL. Now users can just share a shortened URL, which forwards readers to the corresponding long-form URL, and you can have all of your thoughts in that same tweet. So now that we know what a URL shortener is, let's talk about how to build it.

Speaker 1:

As always, we will start with our two to three core functional requirements. If you're preparing for an interview, you can maybe take the time, pause the podcast and write out a few functional requirements, or even challenge yourself more: try to flesh out the entire system, then come back and see how things line up. And it's totally okay if they don't perfectly match up. Everyone builds systems differently. The core takeaway should be: how do ours differ, and why do they differ? Do you have a good reason for using this tool while I use that tool? What I say isn't always the best option, right? Instead, you should be able to build something out and feel confident in what you're building. That's the goal.

Speaker 1:

Our first functional requirement is a pretty easy one: when given any URL, the system should generate a shortened version that is unique and has a one-to-one mapping. We don't want to accidentally generate the same short URL for two different long URLs. That would be a disaster, right? The next functional requirement is that when a user clicks on the shortened URL, they should be returned the longer URL.

Speaker 1:

A shortened URL means nothing if it doesn't actually redirect to the place we want it to go, right? And we could stop there. After all, these are the basic core components of a URL shortener, but I want to round us off to an even three. So for our final functional requirement, we will choose to give the URL an expiration date. There are other choices we could have used instead, or even added on top of the expiration date: something like custom URLs, where the user gives us a specific, you know, five-, six- or seven-character word that they want to use as their hash. There's also bulk URL shortening, where you can shorten a bunch of different URLs at once with the same theme. And finally, of course, there's things like URL metrics: where did a user click from, how many times has this URL been clicked, that sort of stuff. It's worth tailoring your system towards the company you're building the system for. For instance, if it's a marketing firm, the analytics or the custom URLs would be a great addition, right? And if you're working at a startup, looking to keep your costs in check, having an expiration date means not having to hold on to those URLs forever, and it's costly to hold on to data for a long time.

Speaker 1:

For now, though, as I said, the functional requirements will be: taking a long URL and giving back a short URL, redirecting a user to the long URL when the short URL is given, and allowing the user to set an expiration date for when that URL will be deleted. Technically it's not a non-functional requirement, but there is an important thing to note here as we start to define our non-functional requirements: the system is going to be extremely read-heavy. We talked about how to deal with read-heavy systems in the first few episodes of the podcast, and we'll be bringing up a lot of those techniques during the design phase. I just want to call it out here so that we can keep it in mind as we go forward. And I will spare you the time of going over the key non-functional requirements I like to go through. If you're curious about those, you can listen to the last episode. You know, it's just a primer, right? It's just what I go through, and it's something I recommend to others if you need some way of narrowing down your non-functional requirements. Instead, I will get straight to giving you what I would decide to be the strongest arguments for each.

Speaker 1:

The first non-functional requirement comes straight from the CAP theorem, and that is high availability. We talked in episode one about the different levels of nines. For this system, I would say, you know, 99.99% uptime is great, or four nines' worth of uptime. This in itself is honestly not an easy task, but it's also just a speck compared to something like AWS, where S3 advertises an astounding eleven nines (though that figure is for durability rather than uptime). But we're building a URL shortener; we're not building cloud infrastructure for the entire internet here. The next non-functional requirement would be low latency, because the last thing you want to do as a service is be a bottleneck for your end consumer. We talked briefly about how much money Amazon loses per millisecond of latency in the first few episodes.
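
To put those nines in perspective, here's a quick Python sketch of the allowed downtime per year at each level; the arithmetic is mine, not from the episode.

```python
# Allowed downtime per year for each "level of nines".
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes

for nines in (2, 3, 4):
    allowed = MINUTES_PER_YEAR * 10 ** -nines
    print(f"{nines} nines: ~{allowed:,.2f} minutes of downtime a year")
# 2 nines: ~5,256.00  (about 3.7 days)
# 3 nines: ~525.60    (about 8.8 hours)
# 4 nines: ~52.56     (under an hour)
```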

Speaker 1:

The same can be said for URL shorteners, albeit on a much smaller scale. If you take too long to redirect to the longer URL, then your end user will have a bad experience. And finally, you want to make sure that your URL generation is secure, because having a system that generates predictable URLs makes it a lot easier for people like hackers to wreak havoc. If someone knows how your system transforms the URLs, they can craft a long URL that is malicious in nature and get it shortened to look like a safe, common site. Imagine if a long, malicious URL came back with a shortened URL like bit.ly/facebook or bit.ly/tiktok. They seem safe, but in the wrong hands they would lead to horrible things for everyone, including you, who built the system. And so, to recap our non-functional requirements: we need something that is highly available, has low latency and is secure, especially with URL generation.

Speaker 1:

For our capacity estimates, we can expect a large number of requests a month. As we talked about last episode, 1 billion requests a month is around 400 requests a second. For something like a URL shortener, we can get a pretty good idea of the amount of requests we would get based on a rough estimate of the tweets that Twitter gets in around a month. Now, obviously, not every tweet has a shortened URL, but even if 10% of them do, we're looking at around 2 billion shortened URLs a month. I'm rounding here, because the real number I landed on was around 1.7 billion, but, as I've mentioned in previous episodes, nice round numbers make these capacity estimates a lot easier to understand, a lot easier to write down and a lot easier to work with. So, 2 billion reads a month is around 800 reads a second. If we say that we have a thousand reads for every write, then we have around 2 million writes a month.

Speaker 1:

We need to decide the maximum amount of time that we want to actually hold on to these URLs. We do, after all, have an expiration date that needs to go on the URL when we put it in the system, so having a good idea of how long we want to hold it at a maximum helps us nail down these capacity estimates. For this project, we can say something like five years. If 2 million a month is 24 million a year, that means at peak load we have around 120 million URLs to keep track of at a time, and that's just the writes. At 2 billion reads a month, we're looking at somewhere around 120 billion URL reads every five years. That number is colossal, but it's over five years, so let's not freak ourselves out too much. The most important thing about the reads is that it's about 800 a second.
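
If you want to sanity-check those numbers yourself, here's the back-of-the-envelope arithmetic as a short Python sketch; the inputs are just the estimates stated above.

```python
# Capacity estimates for the URL shortener, as discussed above.
SECONDS_PER_MONTH = 30 * 24 * 3600            # ~2.6 million seconds

reads_per_month = 2_000_000_000               # ~2 billion clicks a month
print(reads_per_month / SECONDS_PER_MONTH)    # ~770 reads/sec, call it 800

writes_per_month = reads_per_month / 1_000    # 1,000 reads for every write
print(writes_per_month)                       # ~2 million writes a month

urls_held = writes_per_month * 12 * 5         # five-year retention window
print(urls_held)                              # ~120 million URLs at peak

reads_over_5_years = reads_per_month * 12 * 5
print(reads_over_5_years)                     # ~120 billion reads
```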

Speaker 1:

Our next focus will be how we structure our data models. We know from before that this is a read-heavy system, which means we will need a way of scaling our databases as quickly as possible. If you recall from the first few episodes, NoSQL, or non-relational, databases are great at scaling horizontally without a lot of extra fluff, but they don't offer the ACID guarantees of a more traditional relational database. So in this case we should weigh the pros and cons of each, think about how our data is actually used and then decide the type of database we want to set up.

Speaker 1:

So our first table would be a simple users table. This allows us to keep an idea of how many URLs are being created by a particular user, and also gives us a way of giving a user the list of URLs that they have shortened, if that's something we want in the future. Overall, having a users table with anything that is public facing is a good idea. The users table will almost always have the following properties: a user ID, which will be some sort of number type; an email property and a password hash property, both strings; and finally, a created-at property that'll be some sort of timestamp. In some cases you might have a username property, if the user gets to create their own username or something, but for now we're just going to keep it to the base properties. The second table will be our URLs table. Our first property, of course, will be our URL ID, a number that's a unique identifier. It's common for anything that isn't a mapping table of some kind to have an ID that's specific to that table. You also want to keep track of both the long and short URLs, so we'll need a property for each of those, of the text type. We also need a way to keep track of the user who created the URL, and we can signify this with a user ID that acts as a way of tying it to the user ID in the users table, sort of like a foreign key if we were using a relational database. And to round it off, we will have two more properties: a created-at timestamp and an expires-at timestamp. This helps us keep track of when the shortened URL hash was created and when it should be removed, but we'll talk more about that later.
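
As a rough sketch, here are those two tables written out as Python dataclasses; the field names follow what's described above, and the exact types are my illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class User:
    user_id: int            # unique identifier, some number type
    email: str
    password_hash: str      # never store the raw password
    created_at: datetime

@dataclass
class Url:
    url_id: int             # unique identifier specific to this table
    long_url: str
    short_url: str          # the generated hash, e.g. "aZ3x9Qp"
    user_id: int            # ties back to User, like a foreign key
    created_at: datetime
    expires_at: datetime    # the cleanup job removes rows past this time
```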

Speaker 1:

With our data models all set up, we can now take into account the kind of database we're going to use. We don't have to make a choice yet; we can always do it in the design round. But while we're speaking on the data, I want to make sure to try and make a decision for the podcast. As we talked about before, the system is read-heavy, which means either using a non-relational database for its ability to scale horizontally quickly or, if the data is highly relational, setting up read replicas and maybe even a strong sharding technique to keep our data intact. For our case, we do actually have a relation between the two tables in terms of the user ID, right? But think of how the data flows when a user clicks on the shortened URL and we fetch that longer URL for the redirect. Will we need the user ID? No, right? So while the data is technically relational in a way, the way we use it most often isn't.

Speaker 1:

Now, I know that's a confusing thing to say, so I'm going to try and relate it a little more to a real-world example to make it easier to digest. If I buy a really expensive camera, does that make me a photographer? What if I use it, say, three times a month? Still no, right? But if I walk everywhere with the camera, if I'm taking pictures of everything and you see me with my camera more often than not, then I think we can make a strong argument that I am a photographer. And it's the same way with the data. More times than not, we're not using the user ID; it doesn't actually matter for the biggest use case of our system. And if that ever flips, we can then change the kind of database we use. But until that becomes the case, I think sticking with a non-relational database is the way to go here. In terms of which non-relational database, again, it's up to you, but for this high amount of data and this high amount of throughput, Cassandra seems to be one of the biggest choices I've seen.

Speaker 1:

API design for this system should be pretty straightforward. We can look back at our functional requirements and make sure we cover our bases. The first and third functional requirements can be tied into the same route, and the second can have its own. And now that we have them sort of logically grouped, we should consider the kind of API technology we want to utilize here. Normally I wouldn't do this in an interview or anything like that, but since the API design is fairly simple, I'm going to take some extra time on the API technologies, why I would rule them out and which one I would choose personally. Again, this is not always necessary, especially for this specific design, but I feel like we have some extra time and it would be a good teaching moment.

Speaker 1:

Our first choice would be something like SOAP, or Simple Object Access Protocol. It relies mostly on XML and is generally very secure. It is both language and protocol agnostic, meaning it can be used with HTTP, like REST, or even TCP and others. The main drawbacks, though, are that its complexity and its rigidness make things like caching very difficult. Because the system isn't in need of that extra security, we can actually just skip SOAP for now.

Speaker 1:

Next is REST, or Representational State Transfer. That's probably the most common technology, and the one most of you are familiar with. It doesn't inherently have any security protocols, which means you have to do more lifting on the side of the request to make things secure. However, its stateless nature lends itself very well to caching on the client side, making things more speedy. This is the technology we'll be reaching for most of the time, and its lack of a learning curve makes it the easiest to reach for. Even though we've already chosen REST, I still think it's worth, out of fairness, looking at the other technologies as well, because just because it's the first choice doesn't mean it's the best choice.

Speaker 1:

Take GraphQL. It's a technology that acts as a query engine at the API level. It allows you to have strong, tight schemas, so that you never overfetch or underfetch data; it returns exactly what you ask for. Unfortunately, that means a higher level of complexity and rigidness, and if you don't need specific data to be queried, you just run the risk of overengineering your design. Since we have two basic tables that we query very little, we're actually hitting that overengineering point here. So I think we'll save GraphQL for something that's a bit more complex later.

Speaker 1:

Lastly, we have gRPC, a remote procedure call framework created by Google that relies on protobufs as its interface language. gRPC is really cool. It has a lot of extra features, like built-in authentication, load balancing and even code generation for front-end and back-end communication. It is used a lot in microservices to help them communicate with each other. But again, for our relatively simple design, it would be a classic case of overengineering.

Speaker 1:

So now that we've settled on using REST as our API technology, we can start to define our API routes. The first and third functional requirements, again, will be covered by a simple route that allows the user to create a short URL based on a long URL, with an expiration date. It would be a route that accepts a POST request to /v1/urls, and the parameters in the body would be the long URL, the user ID and the optional expiration date. Now, if the user doesn't give us an expiration date, we can probably just set it for one month in the future; that timeline just depends on the use case, right? But the route will eventually return the shortened URL, with the main part of that URL being the URL hash. The second route, which covers the second functional requirement, is for when someone sends a GET request to /v1/urls. They'll pass a query parameter that's the URL hash, of course, and from there we'll just return the long URL to which they need to be redirected. Pretty simple, right? Either one passes in a long URL and gets back a short URL, or passes in a short URL and we send back the long URL, and that in itself is the main data flow of our system.
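
To make the two routes concrete, here's a minimal sketch using FastAPI. The framework is purely my choice for illustration (the episode only commits to REST), the domain is a placeholder, and the helper functions are stubs.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def generate_hash(long_url: str) -> str:
    ...  # hash generation is covered in the next section; stubbed here

def lookup_long_url(url_hash: str) -> Optional[str]:
    ...  # database lookup, stubbed for the sketch

class CreateUrlRequest(BaseModel):
    long_url: str
    user_id: int
    expires_at: Optional[datetime] = None   # optional expiration date

@app.post("/v1/urls")
def create_short_url(req: CreateUrlRequest):
    # Default the expiration to one month out when the user omits it.
    expires = req.expires_at or datetime.now(timezone.utc) + timedelta(days=30)
    url_hash = generate_hash(req.long_url)
    # ... persist (long_url, url_hash, user_id, expires) to the URLs table ...
    return {"short_url": f"https://short.example/{url_hash}"}

@app.get("/v1/urls")
def resolve(url_hash: str):
    # Look up the long URL for this hash and hand it back for the redirect.
    return {"long_url": lookup_long_url(url_hash)}
```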

Speaker 1:

But there's still a big lingering question: how do we make our URL hashes unique? What is our flow for ensuring that our URLs are not predictable? Obviously, we will need to encode the URLs in some way. To find a unique way of tracking them, we can use an MD5 hash and then simply encode the result. The typical encoding for URLs is base62, which is easy enough, but we also need to decide the length of the URL hash that we are producing.
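
Base62 here just means mapping a number onto the 62 URL-safe alphanumerics: 0 to 9, a to z and A to Z. A minimal encoder looks something like this sketch:

```python
# Minimal base62 encoder: maps a non-negative integer onto 0-9a-zA-Z.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(num: int) -> str:
    if num == 0:
        return ALPHABET[0]
    chars = []
    while num > 0:
        num, rem = divmod(num, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))
```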

Speaker 1:

I think it would be absurd for me to ask you to know offhand that 62 raised to the 5th is 916 million. Instead, just try to remember that 62 raised to the 5th is around a billion, so you can base your rough estimates on that fact: is the amount of URLs we're going to be handling more than a billion or less than a billion? To give you a peek behind the curtain, 62 is just the number of possible values each character can take, and 5 is the length of the hash, right? In our case, we might be better off using something like 62 to the 7th, since we might have, you know, 120 million URLs at any given time, but we want to leave room to make sure that we can handle a lot of growth, in the case where we want to handle 12 billion or 120 billion or what have you, because you won't always have the luxury of being able to do the math.
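
Here's that keyspace math worked out, so you can see how quickly the headroom grows with each extra character:

```python
# 62 possible characters per position, so the keyspace is 62 ** length.
for length in (5, 6, 7, 10):
    print(f"length {length}: {62 ** length:,}")
# length 5:  916,132,832               (~a billion)
# length 6:  56,800,235,584
# length 7:  3,521,614,606,208         (trillions of headroom)
# length 10: 839,299,365,868,340,224
```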

Speaker 1:

Sometimes just picking a large number is best, like 62 to the 10th. For now we'll use 7 as our length, but it's perfectly okay to use a larger number. Again, just have a good argument: hey, I'm using 62 to the 10th because I know 62 to the 5th is around a billion and we need a ton more than that, so I just want to make sure we cover our bases for scaling. And at this point you're probably saying: well, Benny, that's all well and good, but if we pass in the same long URL, we're going to hash the same thing, we're going to encode the same thing and, eventually, we're going to return the same hash. And you know what? You're right.

Speaker 1:

We still need a way to make sure the same long URL doesn't return the same short URL when creating a new one. You could do the classic technique of, before you hash the URL, appending the current time in milliseconds to the end of it, and this is a great way to keep the uniqueness, since the time will never repeat, right? However, it could become a problem if two users happen to create the same URL at exactly the same time. Is it likely? No. Is it possible? Yes. Another easy option is to simply increment a number over and over again and append it to the end, like the time, and handle it in the same way. Since the number would be auto-incremented, you wouldn't get repeats. And in this case, just to keep it simple, we're going to go with the auto-incrementing number. There are ways around the timestamp collision problem, obviously, but for now just incrementing a number is easy enough for us.
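
Putting the pieces together, one illustrative flow looks like the sketch below. One caveat that's mine, not the episode's: truncating the digest to 7 characters means collisions are still theoretically possible, so production code should check the table before committing the row.

```python
import hashlib

def generate_hash(long_url: str, counter: int, length: int = 7) -> str:
    # The auto-incremented counter makes identical long URLs hash
    # differently; MD5 the combination, then base62-encode and truncate.
    digest = hashlib.md5(f"{long_url}:{counter}".encode()).hexdigest()
    number = int(digest, 16)
    return base62_encode(number)[:length]  # base62_encode sketched earlier
```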

Speaker 1:

As always for the design portion, you can find in the show notes my rough sketches of each step of this, starting with part one, which covers everything all the way up to the design, and then parts two through, I think, five now, which show each step in the scaling. For the design portion, again, the main focus is to start with the bare minimum: a simple front-end, back-end and database setup. This covers your base case for one user a month, or even, you know, 10,000 users a month, and it's a great proof of concept that you can scale as you go; you can build to scale. The basic design we have just described actually captures all of our functional requirements when we are supporting a small number of requests. Our main focus for scaling up our system will be satisfying our non-functional requirements. And so, to scale.

Speaker 1:

Our first step is to add in some database replicas and a cache, and this helps support that read-heavy system that I've been talking about so much. These new additions alone are a great way to support a good amount of growth in our system. From there, we want to start thinking about our non-functional requirements, the first being a system that is highly available. In order to support this, we'll add a couple of load balancers and an API gateway, and allow for horizontally scaling our core services. Having horizontally scaling core services not only separates the services from each other and adds that partition tolerance, but it also creates redundancy: if one of the services fails, then we just point to the next one, right? And the same thing applies to scaling out our databases and having multiple databases. The last thing you want is a single point of failure, and that's why we use horizontal scaling techniques so much. In what's basically a win-win, these changes also help with our second non-functional requirement, low latency.
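
For the read path, the cache usually sits in front of the replicas in a cache-aside pattern. Here's a rough sketch assuming Redis as the cache (my assumption; the episode just says "a cache"), with the database call stubbed:

```python
import redis

cache = redis.Redis(host="localhost", port=6379)

def db_lookup_long_url(url_hash: str):
    ...  # hits the primary or a read replica; stubbed for the sketch

def resolve_hash(url_hash: str):
    cached = cache.get(url_hash)
    if cached is not None:
        return cached.decode()                  # cache hit: no database touched
    long_url = db_lookup_long_url(url_hash)
    if long_url is not None:
        cache.set(url_hash, long_url, ex=3600)  # keep hot hashes warm for an hour
    return long_url
```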

Speaker 1:

But we can also go further. We can separate our concerns within the system, in this case separating the back-end service into a user service, a URL service and a cleanup service. Here, the cleanup service is just a cron job that runs daily and removes all those URLs that are outdated. Then we should also separate the databases: have a URL database and a user database, and again, have the cleanup service pointed just at the URL database, so that whenever a URL has an expiration date that's in the past, it just removes it to help clean up your storage. And the last non-functional requirement we cover with our data flow, right? We talked about how we can make things more secure by having that auto-increment, then hash, then encode, and with that we are effectively removing a lot of the predictability from our URL hashing.
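
A sketch of what that daily cleanup job might look like, written against a hypothetical db helper rather than any particular driver. Worth noting: if we stick with Cassandra, per-row TTLs (USING TTL on insert) can expire rows natively, which is often cheaper than a sweep like this.

```python
from datetime import datetime, timezone

def cleanup_expired_urls(db):
    """Daily cron entry point: remove every URL past its expiration date."""
    now = datetime.now(timezone.utc)
    expired = db.query(
        "SELECT url_id FROM urls WHERE expires_at < :now", now=now
    )  # db.query / db.execute are hypothetical helpers for the sketch
    for row in expired:
        db.execute("DELETE FROM urls WHERE url_id = :id", id=row.url_id)
```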

Speaker 1:

And there are many, many more things we could do to scale out the system, but again, there's always going to be more you can do. The point is that we have built a system here that we can scale upon. If we want to add a new service, if we want to take that auto-increment, pull it out completely and have it be a single service that just returns an auto-incremented number, we can do that, no problem. Honestly, go crazy: try different tools, try different techniques. And, as always, if your system turned out different than mine, whether in terms of how the design looks or the tools you use, that's completely okay. The main focus is to teach you how to design a system on your own, and then eventually give you a point of reference for how someone else would do it. And if you have a good reason why you are doing something differently than I am, then you are doing it perfectly correctly.

Speaker 1:

I mentioned before in the episode that positive changes are coming to the podcast: a new title screen, a new website. I'm going to start putting links to the transcripts and the show notes for each episode. I will also be trying to grow the Discord into a place that's not just for system design. I want to grow it into a hub for discussing the interview process, including posting open jobs and resume feedback. And if you're someone who's listening to this and you have some experience hiring or giving interviews, come join us. Help build out the team, right?

Speaker 1:

Be a voice in the community, because the last thing I want is to just be the loudest voice, speaking only from my own experience. I want to help everyone, anyone who needs help. I want to be there to give you a hand, but it would be great if other people who have different experiences were there to also lend their hands. And, speaking of Discord, I want to thank everyone on it for their questions, their answers, their feedback, everything. That includes everyone, but is not limited to Martin K, Shrikant, Uber, Vincent and Fosterer. You all rock, and I appreciate you all just keeping the Discord alive and giving me ideas and feedback. I also, as always, want to give the biggest of thank-yous to everyone on the Patreon.

Speaker 1:

It means a lot, and I'm hoping in the near future to set up a better system where you can get, like, an RSS feed of all the early episodes instead of having to listen to them on Patreon. But you know, you guys are awesome for just, you know, helping me at all, and that includes Antoniox-Lettiere, Shridevi Koluri, Durov, Chondok, Pranav, Leah Whitelaw, Lucas Anderson, Kalyan Disika, Sergei Bugarra, Anil Dugar, Jake Mooney, Charles Gazals, Edward Muth-Martinez, Wahoo Soares, Adrian Couric, Gabriel (just Gabriel), Jesper Klotzl, Nathaniel Sutton, Florian Breider and Klavian S. You all rock. Durov, I actually saw your message, and I hope I'm saying your name correctly; please send me a message if I'm not. I saw your message on Patreon, and I really appreciate the kind words. It means so much to me. I'm glad I can help provide some sort of value and help fill in those gaps a bit. Never hesitate to reach out to me directly if you have any specific points you'd like me to cover. And, speaking of things to cover, it looks like next episode we will be talking about another common system design interview: the parking lot registration system. I look forward to working on this with all of you, and if you, the listener, have been enjoying these episodes, please share with your colleagues or your friends.

Speaker 1:

Special thanks to those of you who have written reviews this week. I want to call out the people on Apple Podcasts: reviewers Gateway Bof and Edshe994, you both had very kind things to say. Thank you, everyone who's given the podcast a review, on Spotify, Apple Podcasts and anywhere else you might listen. If there are written reviews or comments that I'm not seeing, please feel free to email me. If you'd like to suggest new topics or be a guest on the podcast, feel free to drop me an email at LearnSystemDesignPod@gmail.com. Remember to include your name if you'd like a shout-out. If you'd like to support this podcast and help me pay my bills, please jump over to Patreon and consider becoming a member. All podcasts are inspired by Crystal Rose. All music is written and performed by the wonderful Aimless Orbiter; you can check out more of his music at soundcloud.com/aimlessorbitermusic. And, with all that being said, this has been the Learn System Design podcast, and I'm scaling down.
