Learn System Design

4. Databases Decoded: Elevating Data Systems with Smart Replication and Sharding Insights (part 4)

Ben Kitchell Season 1 Episode 4


Unlock the full potential of your database management with our deep dive into scalability strategies that can revolutionize how you handle data growth and system performance. Jessica Ho, a sharp-minded listener, brought forth questions that led us to explore the intricate dance of read-through versus cache-aside caching. We break down when to use each technique for the utmost data consistency across varying applications. Not stopping there, we also shed light on Redis and its dual capabilities as a powerhouse in-memory data structure store, adept at enhancing your caching solution both locally and remotely.

As we navigate through the labyrinth of database replication, you'll gain an understanding of the follower-leader model and its pivotal role in read-heavy applications—think TikTok or URL shorteners. We don't shy away from discussing the risks of single points of failure and the solutions like automatic failover that keep databases humming along. The conversation gets even more exciting as we delve into the advanced territory of multi-leader replication, a strategy that ups the ante on fault tolerance and caters to a global user base, reducing latency and the dread of write losses.

The episode wraps with an exploration of the various sharding methods, each with its own set of benefits and hurdles. Whether it's key-based sharding, range-based sharding, directory-based sharding, or geo-based sharding, we help you navigate these techniques to find the best fit for your specific needs. And as a bonus, we tease what's on the horizon for our tech-savvy listeners: the enthralling world of messaging queues. Be sure to hit subscribe for this and other forthcoming topics that will arm you with the know-how to stay ahead in the tech game. Join us on this journey, and let's conquer the scalability challenge together!


Dedicated to the memory of Crystal Rose.
Email me at LearnSystemDesignPod@gmail.com
Join the free Discord
Consider supporting us on Patreon
Special thanks to Aimless Orbiter for the wonderful music.
Please consider giving us a rating on iTunes or wherever you listen to new episodes.


Speaker 1:

Hello everyone, welcome to part four of our four-part series diving deep into databases and how to scale them. I want to include a disclaimer at the beginning of this episode to clarify terminology that can be seen as offensive. As we are covering the subject of replication, I want to be clear that I will not be using terms such as master or slave. Instead, for the former I will be using terms such as main, parent or leader, and for the latter I will be using terminology such as worker, child or follower. This terminology has been established by Python, and what's good enough for one of the most popular programming languages is good enough for us. I also want to give a special shout out to listener Jessica Ho, who had a few questions about our last episode. After talking with her, I realized this information is probably useful for others as well, so I want to answer these questions in a public space. The first question: when would you use read-through cache over cache-aside? With read-through cache, data inconsistency is not very likely, since every request goes through the cache first before it touches the database. However, the upside in this case is also the downside, depending on how your data is structured, because a read-through cache caches everything that passes through it. The problem with that is, if you aren't actually accessing most of that data very often, you end up caching data that you're never going to use. It's great for a high-read application, especially one that requires data consistency, anything where you need to make sure you have the correct information cached and ready at all times. That's when read-through cache is very helpful.
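
To make that concrete, here is a minimal read-through sketch in Python. It's illustrative only: `db_fetch` is a hypothetical loader you would supply, and a real read-through cache or caching library does this loading for you, invisibly to the application.

```python
class ReadThroughCache:
    def __init__(self, db_fetch):
        self._store = {}            # the in-memory cache
        self._db_fetch = db_fetch   # loader invoked on a cache miss

    def get(self, key):
        if key not in self._store:
            # The cache itself, not the application, talks to the database,
            # so every piece of data that comes through ends up cached.
            self._store[key] = self._db_fetch(key)
        return self._store[key]
```

The key property: the application only ever talks to the cache, which is why consistency is easier to reason about.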

Speaker 1:

A banking system might be a good example, right? You always want the most up-to-date information, and chances are, if someone just deposited money into their account, they're going to check that account to make sure it's in there, and you don't want that information to be wrong. Whereas with something like cache-aside, you can possibly have data inconsistencies, but since the application itself is responsible for what does and doesn't get cached, it's far more likely that the data you're fetching is in the cache. So what's a good example of this? I would say something like a user profile on a social media site. The reason being, this is a read-heavy system, but it's data that doesn't get changed a lot, and it can be eventually consistent. What I mean by that is, if I update my profile picture, no one is going to be upset that they don't see it immediately. You may have to refresh once or twice to let that cache catch up, and that's what I mean by eventual consistency.
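
Here is a minimal cache-aside sketch in Python, assuming a redis-py style client; `load_profile_from_db` is a hypothetical loader, and the key format and TTL are illustrative choices.

```python
import json

def get_user_profile(user_id, cache, load_profile_from_db, ttl_seconds=300):
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no database call
    profile = load_profile_from_db(user_id)  # cache miss: the app loads it
    # The application, not the cache, decides what gets stored; the TTL is
    # how a stale profile picture eventually ages out (eventual consistency).
    cache.set(key, json.dumps(profile), ex=ttl_seconds)
    return profile
```

Notice the difference from read-through: the database call lives in application code, which is exactly why the application gets to be picky about what it caches.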

Speaker 1:

Her next question, and this is a great one, thank you again, Jessica, for reaching out: you classified Redis as a remote cache, but it's renowned for being fast since it's in-memory. Is that because Redis can be both a remote and a local caching option? And the answer is yes, that's exactly it. Redis can be run on your local machine right next to an application, it can be deployed on the same node as your application, or it can be deployed onto its own node, and you can even have multiple instances on multiple nodes spread out across a microservice architecture. So I hope that makes sense and clarifies some things for someone out there.
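
As a quick sketch of that point, only the host you connect to changes; the hostnames below are placeholders, and the client is the redis-py package.

```python
import redis  # pip install redis

# Redis on the same machine or node as the application:
local_cache = redis.Redis(host="localhost", port=6379)

# The same client pointed at a dedicated cache node instead:
remote_cache = redis.Redis(host="cache.internal.example.com", port=6379)

local_cache.set("greeting", "hello")
print(local_cache.get("greeting"))  # b'hello'
```

Either way the data lives in memory on whichever node Redis runs on, which is where the speed comes from.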

Speaker 1:

So, without further ado, let's jump into this next episode. To briefly recap the last one: we talked about how to scale our databases when our queries were starting to have a negative impact on our customer experience. The main topics covered were checking your indexes and adding new ones around the data that's accessed more than other data in your database. Following that change, adding some sort of cache, usually an external tool like Redis or Memcached, is helpful. But what happens when your scale more than doubles in size? Or, alternatively, when you are worried about having a central point of failure? A cache only stores a certain amount of information for a short period of time, so in the event of a network outage, you risk losing new data that your system should be saving. Or, honestly even worse, consider a server failure: if the machine hosting your database has a disk failure, you don't just risk losing the new data coming in, you could potentially lose everything you had saved. We did talk about one way document databases can help address this issue, because they scale horizontally really well. To refresh your memory, scaling horizontally just means taking a node (in this case, a machine with your entire database) and making a copy of it onto another node. But what if you want the structure of a relational database? Maybe your database was set up years ago and moving to a document database is just not an option, or maybe your data doesn't lend itself well to pulling an entire document every time you try to fetch something. Well then, we have to figure out how to scale a relational database, and there are two main ways to do it. The first is replication, the second is sharding. I do want to note that these strategies aren't just for scaling a relational database. You can scale a document database the same way, and they're also a great way to scale your entire system and provide fault tolerance in basically any situation.

Speaker 1:

When talking about replication, the most common model is the leader-follower replication model. In the leader-follower model you only have two types of nodes. The leader node's sole responsibility is new write queries: any new information will always be passed to the leader, in every circumstance. The second type of node is the follower. Follower nodes are never responsible for writing new information; instead, they are used only for responding to read queries. I want to make an important distinction here: while followers never accept write queries, a leader can handle both reads and writes if necessary. The other important distinction is that there is only a single leader node, while you can have any number of follower nodes. The leader accepts all write queries, and after each transaction is complete, it is responsible for applying those new changes to each follower node over some amount of time, depending on your system.
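
A minimal sketch of that routing rule, with the node connections left as hypothetical objects exposing an `execute` method:

```python
import random

class ReplicatedDatabase:
    def __init__(self, leader, followers):
        self.leader = leader        # exactly one node accepts writes
        self.followers = followers  # any number of read-only replicas

    def write(self, query):
        return self.leader.execute(query)    # every write hits the leader

    def read(self, query):
        # Spread reads across the followers; the leader could also serve
        # reads if it has capacity.
        node = random.choice(self.followers) if self.followers else self.leader
        return node.execute(query)
```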

Speaker 1:

Leader-follower replication is really effective for any system that will be read-heavy. In other words, any system where we read a lot more data from the database than we write to it is considered read-heavy. TikTok or a URL shortener are prime examples of read-heavy applications because, despite the amount of writing they take on for new content, there will always be far more people consuming that content at any given time. Whereas anything whose throughput is dominated by writing to the database is considered write-heavy. Imagine, for instance, a web scraper, a little bot that goes around scraping data from all over the internet. That sort of system will rarely need to read from its database.

Speaker 1:

The question is probably coming to you now, though: isn't this just a single point of failure with extra steps? If the leader goes down, then nothing can accept writes, and we end up with the same failure scenario as before. And with that I actually offer a very cool solution that is the backbone of this entire concept. If, for some reason, the leader node fails, you can simply assign a follower node to be the new leader and continue on with minimal intervention, using an automatic failover strategy. The only thing to consider is whether the follower node was completely in sync with the leader that just went down. In the event where a transaction had just happened, the leader was mid-way through updating the followers, and you promote a follower that hadn't received it yet, you may not have consistency right away. So then how does the leader update the follower nodes with new data? If you recall from the first episode, databases keep a sequential log of all of the commits before they even acknowledge them. So when the leader applies a change, it logs the change, and then the followers take these changes from that log and apply them accordingly.
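
Here's a toy sketch of that log-based flow, with each follower tracking how far through the leader's commit log it has applied; all the classes are illustrative stand-ins.

```python
class Leader:
    def __init__(self):
        self.log = []                # the sequential commit log

    def commit(self, change):
        self.log.append(change)      # log the change before acknowledging it

class Follower:
    def __init__(self):
        self.position = 0            # how much of the leader's log we've applied
        self.data = []

    def catch_up(self, leader):
        # Apply, in order, every log entry we haven't seen yet.
        for change in leader.log[self.position:]:
            self.data.append(change)
        self.position = len(leader.log)

leader, follower = Leader(), Follower()
leader.commit("INSERT user 1")
follower.catch_up(leader)            # the follower now mirrors the leader
```

During failover, that `position` counter is also how you'd pick the most up-to-date follower to promote.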

Speaker 1:

What happens next truly depends on another design choice you have to make when building out a leader-follower system: do we apply these changes asynchronously or synchronously? The difference is pretty simple. When a leader gets a change and logs it, it can either wait for every follower to acknowledge that the change has been applied before writing back a success response, which is the synchronous strategy, or it can apply the change, log it, and immediately return a success, with all the followers getting updated eventually. That is eventual consistency, and it's the asynchronous strategy. Both strategies have pros and cons. With synchronous replication, one of the big ones you probably already caught is the extra time, which, depending on how slow your network is, can be very relevant. But the bigger issue, also having to do with the network, is a network failure or a server crash while the change is being applied to one of the followers. If even one of those followers fails, your entire write is considered a failure and the change isn't applied. The benefit is that you can guarantee the data is always in sync; there is no eventual consistency. The biggest drawback of the asynchronous strategy, though, is pretty straightforward considering what we just mentioned: if the leader node fails in an asynchronous setup, there is no guarantee that the follower replacing it has all the data up to date.
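
Side by side, the two strategies look something like this sketch; the followers are assumed to expose an `apply` method, and a real system would use a replication process rather than threads.

```python
import threading

def write_synchronous(log, followers, change):
    log.append(change)              # the leader logs the commit first
    for follower in followers:
        follower.apply(change)      # block on every follower; if any one
                                    # fails, the whole write fails
    return "success"                # only returned once all replicas agree

def write_asynchronous(log, followers, change):
    log.append(change)
    # Fan the change out in the background and answer immediately;
    # the followers become consistent eventually.
    for follower in followers:
        threading.Thread(target=follower.apply, args=(change,)).start()
    return "success"
```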

Speaker 1:

Despite this, asynchronous strategies are used fairly commonly and are helpful on slower networks or with nodes separated by a large distance. Say you have a leader database somewhere like America, and you're replicating to a follower database somewhere like India. That large physical distance, as we talked about in the previous episode, causes a slowdown, and you don't really have the time to wait for those changes to actually be applied before you return a success response. The second common type of replication is what's called multi-leader replication. In a multi-leader replication strategy, you have multiple nodes that can accept reads or writes, and they all update each other via logs or events in the same way they would update a follower in the leader-follower strategy we just talked about. The cool thing here is that you can actually attach followers to each one of these leaders, updated in the traditional way, and before you know it, you end up with a distributed database of sorts. However, multi-leader helps in nuanced areas that, honestly, come with a whole new list of complexity issues.

Speaker 1:

For now, let's start with when a multi-leader strategy is needed most. One of the most obvious cases is extra fault tolerance. Instead of having to figure out which follower should become the new leader, you can simply shift the write event to one of the other available leaders. The other situation where a multi-leader design is beneficial, and one that often goes overlooked, is serving other geographical regions. If you imagine using the leader-follower strategy like we talked about before, you have a single leader that is most likely very close to your application, and then, if you want to serve users around the world, you set up a follower somewhere close to them and have the leader write to that follower when new updates come in.

Speaker 1:

I'm sure this is sounding familiar, right? But what happens when the user in India, or anywhere else, needs to write something to the database? They now have to wait for that write to travel halfway around the world just to update the leader, and then wait again for the leader to send that change back to the follower closest to them. We've talked before about why this is terrible. To touch on it briefly again: while data itself travels really fast, our infrastructure, the ISPs, routers, all of this stuff, acts as a series of very big speed bumps in the road. So generally, you want to keep people's data as close to them as possible for the best possible experience.

Speaker 1:

Now let's use what we learned with the multi-leader strategy to come up with something better. You could put a leader on each continent, at the very least, and that already severely cuts down on latency. The final benefit I want to touch on is the fault tolerance of this sort of design. When you have multiple leaders, the risk of losing a write shrinks by a lot, especially if the nodes are situated far apart geographically. If there's something bad happening in America, chances are the weather is nice somewhere else, so you don't lose that write just because one server goes down.
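
A toy sketch of that idea: each user writes to the nearest leader, which then forwards the change to its peer leaders (asynchronously, in a real system). The regions and the class are illustrative.

```python
class Leader:
    def __init__(self, region):
        self.region = region
        self.peers = []                    # the other leaders

    def write(self, change):
        self.apply_locally(change)         # fast: the user hit a nearby node
        for peer in self.peers:            # fanned out asynchronously in reality
            peer.apply_locally(change)

    def apply_locally(self, change):
        print(f"{self.region}: applied {change!r}")

us, eu, india = Leader("us-east"), Leader("eu-west"), Leader("ap-south")
us.peers, eu.peers, india.peers = [eu, india], [us, india], [us, eu]

india.write("INSERT ...")  # success without a round trip across the world
```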

Speaker 1:

And as amazing as all of this sounds, if there's one thing you should take away from this podcast, and honestly it probably should have been the name of the show, it's that nothing's ever perfect. Nothing has zero downsides when designing a system, and much like everything else, multi-leader is not immune to this fact. These strategies can and should be applied to document databases, but the fact of the matter is that for scaling relational databases, these strategies are almost always necessary. The problem is, when using something like a multi-leader strategy, we lose one of the biggest arguments for a relational database in the first place. If you recall the first episode, you'll probably remember the ACID properties: atomicity, consistency, isolation and durability.

Speaker 1:

And in this case, the main two I want you to think about are isolation and consistency. Isolation, if you recall, basically means that every read or write happens in a vacuum and will not affect any other read or write, and when a transaction is applied, isolation determines how long it takes for the data to become available for reading. The isolation of data becomes a huge issue when it comes to multi-leader, because what happens if a change is applied to the same data on two different leaders and then they have to update each other? You get what's commonly known as a data conflict. Consistency, of course, is the other side of that coin: when a change is made, it should be seen the same way by the person who made it and by anyone else who reads the data.
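
Here's a small sketch of that data conflict and the common (and lossy) last-write-wins way of resolving it; the rows and timestamps are made up for illustration.

```python
# The same row is updated on two different leaders at nearly the same time.
update_on_leader_a = {"row_id": 42, "email": "alice@new.com",  "ts": 1700000010}
update_on_leader_b = {"row_id": 42, "email": "alice@work.com", "ts": 1700000012}

def resolve_last_write_wins(a, b):
    # The later timestamp wins; the other write is silently discarded,
    # which is exactly the consistency risk being described here.
    return a if a["ts"] >= b["ts"] else b

winner = resolve_last_write_wins(update_on_leader_a, update_on_leader_b)
print(winner["email"])  # alice@work.com -- leader A's write is simply lost
```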

Speaker 1:

We talked about how database caching can affect consistency and how we can improve upon that. However, there's not a great way to keep consistency in a multi-leader design: the more leader nodes you have, the more likely you are to end up with inconsistent data. And despite all the bad things I just said, replication is hugely beneficial. It can be incredibly complex, but it can help your system scale and perform better, and most modern database solutions have at least one of these strategies already built in, making these concepts even easier to apply. As with everything, it all comes down to the specific use case: how much complexity can we handle, do we need to serve users in other geographical areas, and so on.

Speaker 1:

So we've talked about replication and some of the most common designs that go along with it. We talked about how replication can help relational databases become more available, and help any kind of database become more fault-tolerant. However, there is another way, which we briefly mentioned before, to scale horizontally. It applies to all kinds of databases and is a far more performant way of scaling horizontally than simply making a copy of the data over and over again. That strategy is called database sharding, which in itself could probably be an entire episode if we're being honest, but my goal here is to explain it in a way that's very accessible, while also being detailed enough that you could discuss it at a high level in an interview or a strategy meeting with other people who understand these concepts. Database sharding, at a high level, is not a hard concept to wrap your mind around. Instead of storing all the data on a single node, you scale horizontally by spreading chunks of that data across multiple nodes. The schema or database design is the same across all of the nodes, but the data that lives in each one of them is completely unique.

Speaker 1:

If it's easier, think of it as the difference between your in-person friends and your friends online. Your in-person friends are generally very close physically; you can pass messages to them very quickly and work efficiently in the same space. Your online friends are just as amazing, but communicating with them requires a network of some kind, with some latency before they actually get the message. Even if someone's online, they're not going to see my message immediately. They may have to wait a couple of milliseconds, or, if they're on vacation, they might see it in a week. That's the pattern: they're distributed.

Speaker 1:

There are a few different types of sharding architectures we'll talk about today, and they do run the gamut when it comes to techniques, though I'm sure there are infinitely more ways of doing it. These are the core concepts you should know behind sharding when trying to scale your system. To help you remember them, I want you to picture the four bases: key, range, directory and geographic. Those are all the things you need to know when it comes to sharding: key, range, directory and geographic. So let's talk about the first, key-based sharding.

Speaker 1:

It's very similar to a hash table from traditional computer science, hence the reason key-based sharding is also sometimes called hash-based or algorithmic sharding. The core concept is that some subset of the data is run through a hash function, which returns a specific shard key. This shard key runs from a minimum value up to a maximum value, and the maximum value should always match the number of shards available. The data is then assigned to the corresponding shard. In easier terms: you have a certain number of boxes, we'll say five. You take a piece of your data, preferably something unique like an ID or a last-login timestamp, and run it through some sort of hash function, which returns a number from 1 through 5. That number tells you which box to put the entire record in.

Speaker 1:

Then, in the future, when you need to access this data, you just use that same key (the ID or the timestamp we mentioned before) to regenerate the shard key, and fetch the data from the box you know it's in, because that's the shard key you got back. The big advantage of key-based sharding is that, because you're spreading the data algorithmically, you have a lower risk of one node handling most, if not all, of the database queries, which we call a database hotspot.
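
A minimal sketch of that mechanism in Python. One assumption worth flagging: the boxes are numbered from zero here, and md5 is used only because it's a stable hash (Python's built-in hash() is randomized per process).

```python
import hashlib

NUM_SHARDS = 5

def shard_for(shard_key: str) -> int:
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS    # a box number, 0 through 4

print(shard_for("user-12345"))  # the same key always lands in the same box
```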

Speaker 1:

Another advantage is that, because all of this is done on data that already exists in the database, there's no need for any external mapping of each row to its specific shard. What I mean by that is, when you perform the actual sharding and get the shard key back from the hash function, you're not changing any of the data or adding anything extra. You just need that hash function in your application, and it knows where to put things. On the other hand, one of the biggest disadvantages of key-based sharding is adding new nodes to handle a new shard group. Take our scenario from before: we had five nodes and a hash function that assigns a number, one through five, based on specific columns. Now we're finding nodes one and three are becoming hotspots; they're being overwhelmed and handling too many of the queries, so we add one, two or even five new nodes to help spread out the load. This is great, right? We just need to update the function to return a number from one through ten. However, the problem comes when we need to access data that was assigned before we added those five new nodes. With the new limit of ten, we may be told that legacy data lives in box number seven when it's actually sitting in box number three, because when the hash function changes, the place where the data goes also changes. You see the problem: the keys map differently depending on the limit. So now, to fix it, we have to re-shard every single key all over again and move the data into its new shard group accordingly. If you have huge swaths of data and you're adding a new node, that is not a trivial process; it will require a lot of time and effort. So what data lends itself best to hash-based sharding architectures? The simple answer is any data you aren't usually fetching or writing in large groups. For example, a user's login data could be sharded very efficiently using a hash of the user ID, the email or the last login time, because you will only ever fetch or write a single item at a time per transaction.
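
You can see the resharding pain in a few lines. Continuing the hypothetical `shard_for` sketch from above, going from five boxes to ten silently relocates a large share of the existing keys:

```python
import hashlib

def shard_for(shard_key: str, num_shards: int) -> int:
    return int(hashlib.md5(shard_key.encode()).hexdigest(), 16) % num_shards

keys = [f"user-{i}" for i in range(1_000)]
moved = sum(1 for k in keys if shard_for(k, 5) != shard_for(k, 10))
print(f"{moved} of {len(keys)} keys now live in a different box")
# With naive modulo hashing, a large fraction of keys move every time the
# shard count changes; consistent hashing is the usual trick to shrink
# that migration.
```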

Speaker 1:

Next, we are going to discuss range-based sharding. Range-based sharding is the simplest of all four to implement and, as we will discuss, it can also be the most prone to uneven distribution issues. The core concept behind range-based sharding is simple: every shard group is assigned based on a range of values. So, for example, if the shard key is between 0 and 10k, the data goes in node 1; if it's between 10k and 20k, it goes in node 2.

Speaker 1:

When writing a new record, you simply add the mapping from the shard key to the specific shard group in a lookup table, which consists of two columns: your shard key range and the node on which the data lives. The thing I want to point out as different here is that we actually have a lookup table, new data that sits beside our existing data and is basically a one-for-one match. So what are the benefits of range-based sharding? Well, it's very easy to adjust and extend, and fairly easy to implement, since adding a node is as simple as adding a new range to the lookup table. No changes are needed for nodes that already exist, so no more rewriting the entire database with new shard keys. It's also really beneficial for data that is frequently paired together when reading and writing, since related rows are most likely to be on the same node when fetching.
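
A minimal range lookup table in Python; the boundaries and node names are illustrative, with each range's upper bound treated as exclusive.

```python
import bisect

# (upper bound, node), kept sorted by upper bound.
RANGES = [
    (10_000, "node-1"),   # keys 0 .. 9,999
    (20_000, "node-2"),   # keys 10,000 .. 19,999
    (30_000, "node-3"),   # keys 20,000 .. 29,999
]

def node_for(shard_key: int) -> str:
    # Binary-search for the first range whose upper bound exceeds the key.
    idx = bisect.bisect_right([upper for upper, _ in RANGES], shard_key)
    if idx >= len(RANGES):
        raise KeyError(f"no range covers key {shard_key}")
    return RANGES[idx][1]

print(node_for(12_345))   # node-2
# Adding capacity is just appending (40_000, "node-4") -- no re-sharding.
```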

Speaker 1:

The drawbacks, however, are fairly abundant. Data tends to be more unevenly distributed, especially when you factor in older accounts, which may have been deleted or probably don't get accessed as often. So you have an entire node full of things that just don't get touched, whereas that new node you just added has all of your fresh users and is getting beat to hell with all these new queries. This, as we talked about, is a database hotspot, and it's very bad. You could increase the size of your ranges to help with this uneven balance, but then you risk making the pieces of your distributed system too large, and you end up negatively impacting your scaling. What I mean by that is, if you make your shard groups way too big, you just end up with ten giant databases instead of one giant database. So next I want to talk about directory-based sharding, the third category in our discussion.

Speaker 1:

Directory-based sharding is part of the dynamic sharding concept, where we have a lookup table, like in the range-based sharding we just talked about, that lives outside of the database and outside of the application code. This lookup table behaves how you would expect: it's a simple mapping from specific data to the shard that data lives on. For a simple explanation, imagine you are a delivery company and you have specific delivery zones in the city. You could create a lookup table that maps zone one to shard one, zone two to shard two, and so on. Then, when you are either creating a new delivery entry or listing all the deliveries in a certain part of town, your application would simply look up the shard associated with that zone and query that shard. The big wins from directory-based sharding come when you want minute control over how your data is stored, flexibility in where it's stored and, honestly, the ease of scaling by adding new zones and shards. Directory-based sharding has drawbacks too, though, because every read or write to your database has to perform a lookup first. That adds overhead to every transaction, and a poorly designed directory can still lead to database hotspots. So the lesson here: if you're going to use directory-based sharding, make that lookup table bulletproof, and really focus on what shard should be mapped to what in your data.
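
A minimal directory sketch for that delivery-zone example; in practice the directory would live in its own small, highly available store rather than an in-process dict.

```python
# The external lookup table: zone -> shard. The mappings are illustrative,
# and nothing forces them to be uniform -- that flexibility is the point.
DIRECTORY = {
    "zone-1": "shard-1",
    "zone-2": "shard-2",
    "zone-3": "shard-2",   # two light zones can share one shard
}

def shard_for_zone(zone: str) -> str:
    try:
        return DIRECTORY[zone]
    except KeyError:
        raise KeyError(f"zone {zone!r} has no shard mapping") from None

# Every read and write pays this extra hop -- the overhead mentioned above.
print(shard_for_zone("zone-2"))   # shard-2
```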

Speaker 1:

Lastly, I want to talk about geographic-based sharding or, as some people call it, geo-based sharding. Geographic-based sharding is exactly what it sounds like: each shard group is based on a specific geographic region. Depending on the business needs, this could be as small as zones in a city, much like we discussed with directory-based sharding, or each shard could belong to a specific continent. If the user base is spread out globally, then having a different shard on each continent might be helpful, sort of like we talked about before. It's a very similar concept to the one described above, where we have a lookup table based on a column, and that column could be anything geographic: a continent, country, city, town, zone, what have you.

Speaker 1:

The main benefit of geographic-based sharding is the same benefit as a CDN, which we touched on in episode one: it keeps the data physically close to the users who will be accessing it. This helps decrease latency, helps decrease data loss and, honestly, is just a no-brainer when you have a global company. The drawbacks depend on the data itself, much like we discussed before. If most of your user base is in a single area, you end up with database hotspots in the shard representing that area.
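
A tiny sketch tying it together; the regions and shard names are invented for illustration.

```python
# The lookup column is geographic: region -> shard.
GEO_SHARDS = {
    "north-america": "shard-us-east",
    "europe": "shard-eu-west",
    "asia": "shard-ap-south",
}

def shard_for_region(user_region: str) -> str:
    # If most users sit in one region, that single shard becomes the
    # hotspot described above -- geography decides the load balance.
    return GEO_SHARDS.get(user_region, "shard-us-east")  # illustrative default

print(shard_for_region("europe"))   # shard-eu-west, physically near the user
```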

Speaker 1:

And with that, I just want to say I'm very thankful to each and every one of you for joining me on this journey to decode the mysteries of databases. This has been an extremely rewarding process, and I hope you learned something. Next episode, I'm very excited to announce, we'll be starting our journey into messaging queues. So if you haven't already, go ahead and subscribe to the podcast, because I have many more episodes coming down the pipeline. They won't all be multi-parters like this one, but some will be, depending on how big the topic is. I really enjoyed everyone's feedback on the last episode, and I hope to keep making these episodes better and better.

Speaker 1:

If you would like to suggest new topics, be a guest on the podcast, or just give me feedback and ask questions, much like Jessica did, feel free to drop me an email at learnsystemdesignpod@gmail.com. Remember to include your name if you'd like a shout out. If you'd like to support this podcast and help me pay my bills, jump over to patreon.com/learnsystemdesignpod and consider becoming a member. All podcasts are inspired by Crystal Rose, and all music is written and performed by the wonderful Aimless Orbiter; you can check out more of his music on SoundCloud under Aimless Orbiter Music. And with all that being said, this is Spinny, scaling down. Bye.
