Learn System Design
A bi-weekly podcast hosted by senior engineer Ben Kitchell that takes a deep dive into technical system design, learning together along the way. Each episode we explore the inner workings of what makes these systems so complex and fascinating while building on our knowledge of how they came together.
All music written and performed by the mysterious Aimless Orbiter. You can find more info about him and his music at https://soundcloud.com/aimlessorbitermusic
3. Databases Decoded: Optimizing Your Data's Potential with Fine-Tuned Scaling and Indexing (Part 3)
The conversation doesn't stop there: we're also lifting the lid on the sophisticated indexing that powers giants like Google, transforming the landscape of full-text keyword searches. If you've ever wondered how your favorite search engine seems to read your mind, you'll find answers here. Then, brace yourself for a deep dive into the world of database caching strategies. From the cache-aside pattern, ideal for read-heavy systems, to the intricacies of read-through, write-back, and write-around methods, I dissect their impacts on data consistency and potential for data loss. Whether you're a seasoned database architect or just getting a grip on system design, this episode is crammed with critical insights into making your database system not just robust, but ruthlessly efficient. Join me as we navigate the complexities of database scaling and optimization; it's a journey you won't want to miss.
Dedicated to the memory of Crystal Rose.
Email me at LearnSystemDesignPod@gmail.com
Join the free Discord
Consider supporting us on Patreon
Special thanks to Aimless Orbiter for the wonderful music.
Please consider giving us a rating on iTunes or wherever you listen to new episodes.
Hello everyone, welcome to episode three of the Learn System Design podcast, with your host, me, Ben Kitchell. At the top here I want to give a big thank you and a big shout out to everyone who's sent feedback, been supportive, or even given the podcast a rating. I really appreciate it. It means a lot to me. I would assume it was just my friends, but frankly I don't have that many friends and they don't take pity on me. So, yeah, I am excited to keep doing this and I'm excited to jump into today's episode. If you haven't listened to the last two, this is part three of four of our deep dive into databases, how they work, and how they scale, so I recommend you listen to this after you've listened to those. Without further ado, let's go ahead and jump into it. If you've listened to the last few episodes, then you'll know the main reason I feel databases are usually the bottleneck of a system when trying to scale: databases are built to be resilient and to perform lots of transactions with lots of moving parts. The inevitable flip side is that databases tend to become a bottleneck when they haven't been scaled correctly. Today we're going to talk about how to scale them correctly and, just as important, when to scale them, because, make no mistake about it, spending all of your time setting up a sharded database with two primaries and four replicas means nothing if your throughput could be handled by a single database and some decently written queries. So let's answer the last question first: when do we scale a database? When do we know it's time? The answer is relatively simple: when you notice slowdown in your application. What I mean by that is the first thing you'll want to do when you notice slowdowns is check your queries. Sometimes a little extra love on your queries is all you really need to win some performance back, and honestly, you don't need any downtime to improve those queries. However, if your queries are all in order, then your next step is probably to check your indices, and we'll talk about what indices are in a bit. But if your queries are in order and your indices are in order, then it's probably time to scale.
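If it helps to see that first step concretely, here's a minimal sketch using Python's built-in sqlite3 module; the table and query are invented for illustration. Asking the database for its query plan is the quickest way to spot a slow, full-table-scan query:

```python
import sqlite3

# Toy table for illustration; every serious database has an EXPLAIN facility.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

# Ask SQLite how it plans to run the query before optimizing anything.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.com",)
).fetchall()
print(plan)  # reports 'SCAN users': a full table scan, the classic slow query
```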
Speaker 1:Indices, or indexes if you prefer, are simply pointers that live in the database and point to a specific record of data. If you've ever flipped to the back of a textbook and looked up a certain word, you could see that the word occurs on page 72, 74, 106, and so on; you get all the pages where that word appears. That's the same concept as indices in a database. If you haven't, then maybe pick up a book. But seriously, in the back of a textbook you can find what's called an index. It's basically an alphanumeric sorting of important keywords, and beside each of these words you can find every page that they're mentioned on. That same concept applies to a database. I'll spare you the finer details of how database records are stored on a disk, but the important part is to know that all of this data sits in blocks on the disk. When you access a specific record, you have to search for the specific block you need. The database can then utilize a hash function to locate the index quickly and return the record. Simple as that.
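Continuing the same invented sqlite3 example, here's what giving the database that textbook-style lookup structure looks like; the plan flips from a scan to an index search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

# The index is the "back of the textbook" for the email column.
conn.execute("CREATE INDEX idx_users_email ON users (email)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.com",)
).fetchall()
print(plan)  # now: 'SEARCH users USING INDEX idx_users_email (email=?)'
```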
Speaker 1:When digging into indices, the first one you're most likely to come across is the primary index. Primary indexing, or primary key indexing, is a type of clustered index that sorts data where the primary key is the main index for a table. This indexing strategy is the default for most relational databases and is honestly very efficient. Since primary keys have to be unique, that makes them great for creating indexes. And I want to make clear that primary keys are different from primary key indexes: a table can only have one primary key, but it can have multiple indexes. The primary key index turns searching for a specific record from around big O of B, where B is the number of blocks in your table, to somewhere around big O of log B. If you aren't familiar with the concept of big O notation, it's basically the worst-case scenario for how long something should take. So big O of B means looking through the table block by block and not finding the record until you're at the very end of the blocks, hence big O of B. I do recommend having some fundamental knowledge of time and space complexity if you're interested in this series from a practical standpoint, but again, I want to make sure listeners who don't have as much experience can follow along, so I do owe you that explanation.
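As a toy, pure-Python illustration of that gap, compare a linear scan against a binary search over sorted keys, which is roughly what a B-tree-backed primary index buys you:

```python
# A "table" of a million sorted primary keys, standing in for blocks on disk.
keys = list(range(1_000_000))

def linear_scan(target):
    # O(B): touch every block until the record turns up (or doesn't).
    steps = 0
    for key in keys:
        steps += 1
        if key == target:
            break
    return steps

def binary_search(target):
    # O(log B): halve the search space each step, like a B-tree traversal.
    lo, hi, steps = 0, len(keys) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if keys[mid] == target:
            break
        elif keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

print(linear_scan(999_999))    # 1000000 steps: the worst case, big O of B
print(binary_search(999_999))  # about 20 steps: big O of log B
```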
Speaker 1:I briefly mentioned before that a primary key index is a type of clustered index, so what did I mean by that? Well, when we mention indices, we usually have two main types: the first is a clustered index, and the second is the very cleverly named non-clustered index. The difference is that clustered indices actually change the physical order of a table based on the index, and they require a unique identifier to determine the order of rows, like how we used the primary key before, which is unique by default. Non-clustered indices, on the other hand, create a completely separate structure to store the index information, and we'll touch on that. But the main thing to consider right now when deciding between clustered and non-clustered indices is how often you'll be changing the specific column that is indexed.
Speaker 1:Since clustered indices physically order your data, every time something is added or removed on that index, the entire table has to be reordered. This reordering in itself can be a problem for large-volume databases, but honestly it goes a step further: since these changes are happening on a physical disk, every reorder can cause your data to become fragmented, which can lead to performance issues. So be sure to keep that in mind. The main way to counteract this is doing things like index rebuilds, where you rebuild the index to defragment it and win back some performance. But regardless, if you have a clustered index, you're going to be reordering every time you add or remove on that specific column that holds the index. I alluded to this earlier, but non-clustered indices do not modify the actual table. Instead, they create a separate data structure with only the indexed columns being ordered, plus a reference to each corresponding row. This implementation, while making lookups faster due to its B-tree-like structure, will take up more physical space than a clustered index. You have an extra data structure that keeps the index ordered, and that makes reads quicker, but you're sacrificing space complexity, and that's the downside of non-clustered indexes. So honestly, there's not one that's better or worse; there's only one that's better for you at this time.
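For reference, here's roughly what those declarations look like in SQL Server's T-SQL dialect, shown as Python strings you would hand to a driver such as pyodbc; the table and index names are invented, and other engines differ (Postgres, for instance, treats all regular indexes as non-clustered):

```python
# Hypothetical T-SQL (SQL Server dialect); table and index names are made up.
# Pass these strings to any driver, e.g. a pyodbc cursor's execute().

CREATE_CLUSTERED = """
CREATE CLUSTERED INDEX ix_orders_order_date
ON orders (order_date);
-- Physically sorts the table's rows by order_date; only one per table.
"""

CREATE_NONCLUSTERED = """
CREATE NONCLUSTERED INDEX ix_orders_customer_id
ON orders (customer_id);
-- A separate B-tree pointing back at the rows: extra space, no table reorder.
"""

REBUILD = """
ALTER INDEX ix_orders_order_date ON orders REBUILD;
-- The defragmentation step mentioned above for a fragmented clustered index.
"""
```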
Speaker 1:Along with the standard forms of indexes we discussed before, there's also a large swath of special indices that should be used at specific times. The first that comes to mind is the bitmap index. If you aren't familiar with bitmaps, sometimes more commonly called bit arrays, they're basically, well, let's let Wikipedia tell us: essentially, a mapping of something to a group of bits. An easy way to explain a bitmap index is to imagine a column in a database that holds some sort of binary value, i.e. the value is either going to be one thing or another; think on/off, true/false, things like that. Now, when you set a bitmap up on a column, in the background the database converts the information in the column to ones and zeros. Then, when you try to read or filter based on this information, instead of scanning the whole table it will utilize that index instead. That's the base level of a bitmap index; they can also be utilized on a larger selection of values, more than just one or zero.
Speaker 1:Let's talk through a more complex example. Let's say you have a few departments in a company. We rarely add a new department, so that column is rarely updated in the database, right, and that makes it a great candidate for bitmapping; a frequently updated column would make the bitmap sluggish. When you use a bitmap index, each of these departments is assigned a number, and when you search on anything that has to do with a department, the database breaks it down into zero, one, two, three, whatever the number is for each department. This makes your queries more performant, makes them quicker, and makes the overall speed of your system faster.
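As a sketch of that departments example: the CREATE BITMAP INDEX syntax below is Oracle-specific (Postgres, by contrast, builds transient bitmaps during query execution rather than storing them), the names are invented, and the comments show the conceptual bit arrays:

```python
# Hypothetical Oracle DDL; table and column names are made up.
CREATE_BITMAP = """
CREATE BITMAP INDEX ix_employees_department
ON employees (department);
"""

# Conceptually, the database keeps one bit array per distinct value, with one
# bit per row:
#   department = 'sales':       1 0 0 1 0 ...
#   department = 'engineering': 0 1 0 0 1 ...
# A filter like WHERE department = 'sales' reads a single bit array, and
# combined filters become fast bitwise AND/OR operations across arrays.
```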
Speaker 1:So next let's talk about hash indexing. If you're familiar with the concept of hash tables or hash maps in code, then hash indexing will probably come very easy to you; the core concept is the same. It creates buckets of hashes that map to an indexed field. Hash indexes are incredibly important when it comes to things like database sharding, which we'll be touching on in the future; the next episode is going to be littered with sharding.
Speaker 1:The biggest thing to consider with a hash index is whether the field you are indexing monotonically increases. So what does that mean? Well, MongoDB's own docs mention that having a hash index on a field that doesn't monotonically increase will actually increase the risk of collisions and thus give worse performance. Monotonically increasing basically just means it will never decrease, right? It's a number that keeps going up, or a set of numbers that keeps growing. If it doesn't always increase, then you end up having collisions. Collisions, if you're not familiar, are basically when you have the same hash for two different values, and the problem with that is when you have something that is calculated to go in a spot and that spot is already taken, you have to then recalculate everything. It's like trying to park a car in a spot where there's already a car: now you have to figure out where you went wrong, recalculate, and move all the cars, and yeah, at a high level, it's a bad time. Maybe it will help to just have some examples of fields that monotonically increase. Mongo's built-in ObjectId actually monotonically increases, as do things like timestamps, so in pretty much any database management system there are going to be some solid choices. These are values that always increase and never go backwards, right? Time always goes forward, no matter what, and that's why it makes such a good field to index on, especially with hash indexes.
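Here's a hedged sketch of both flavors: a Postgres hash index declaration, and a MongoDB hashed index created through pymongo. The connection string, table, and collection names are all placeholders:

```python
from pymongo import MongoClient, HASHED

# PostgreSQL: pass this to any driver, e.g. a psycopg2 cursor.
# Table and column names are invented.
CREATE_HASH = """
CREATE INDEX ix_sessions_token ON sessions USING HASH (token);
"""

# MongoDB: a hashed index, often used as a hashed shard key so that a
# monotonically increasing field like the built-in ObjectId spreads its
# writes evenly across shards.
client = MongoClient("mongodb://localhost:27017")  # placeholder URI
client.mydb.events.create_index([("_id", HASHED)])
```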
Speaker 1:The last special index we're going to talk about is inverted indexing, which is a lot like hash indexing, but it's best used for searching full text for certain keywords. In fact, search engines like Google rely heavily on these types of indices to make their searches far more optimal. The way it works is actually really cool: it tokenizes each word in a sentence and then, like the hash index above, creates a mapping to where each word occurs in the database. A simple way to picture it: every word you read in a book gets placed into a specific box, and when you need to find a specific word, you just go to whichever box has that word on it; that's basically your mapping. So how do search engines use inverted indices? Well, imagine you're on a website that sells shoes. Each word on that website would get tokenized and placed in the database in a specific area. Then, when a user searches "comfortable shoes for running", the engine would query those words against the inverted index, which would then return the specific rows matching those criteria.
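Here's a tiny, self-contained Python version of that idea, assuming naive whitespace tokenization and no stemming; the shoe documents are invented:

```python
# A toy inverted index: map each token to the set of documents containing it.
from collections import defaultdict

docs = {
    1: "comfortable running shoes for marathon training",
    2: "leather dress shoes on sale",
    3: "comfortable walking shoes for travel",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)   # the "box" for that word

def search(query):
    # Return docs containing every query term: intersect the postings sets.
    postings = [index[token] for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("comfortable shoes"))  # {1, 3}
```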
Speaker 1:Maybe something to keep in mind the next time you're asked to build a search engine in that system design interview: if you're building something that requires excellent search, an inverted index is probably the first thing to reach for. There are many, many more cool indexes that do a lot of great things; I could spend an entire episode covering them. I could talk about all the unique ones offered by Oracle or Postgres or MongoDB. If that's something that interests you, definitely drop me a line, send me an email at learnsystemdesignpod at gmail.com, but for now we're going to move on to database caching. As the great Phil Karlton was known to say, there are only two hard things in computer science: cache invalidation and naming things. That first one takes on an even bigger meaning when it comes to database consistency with caching. We talked briefly about database caching in the first episode, but today I actually want to expand on it more.
Speaker 1:Database caching is a strategy where you utilize temporary memory to hold frequently requested data and avoid hitting your database directly. There are a few different ways to do this, so let's talk through some of them to help conceptualize how this works. The first concept always reminds me of that teenage girl from Dr Phil: it's the cache-aside concept. It's a hugely common pattern for database caching, especially in read-heavy systems. This pattern basically sits the cache beside the database in terms of importance.
Speaker 1:When an application queries for data, it will actually hit that cache first. If the data exists in the cache, which is called a cache hit, then the data is returned. If the data is not in the cache, which we call a cache miss, then the data is fetched from the database, and the application will then add that data to the cache to hopefully help with future queries. The main allure of cache-aside is that if cache failures occur, then besides a degradation of performance there are no real drawbacks, since the data can just be fetched from the database. The downside, of course, is that because the cache is not directly tied to the database, it's possible to have inconsistent data. Imagine someone hits the database for a specific record and it gets cached, and then someone else comes along, goes directly to the database, and updates that record. The cache isn't updated, right? So whenever the application pulls that record from the cache, before cache invalidation happens, you're going to end up getting data that is just not consistent with what's in the database, which, depending on your system, can be a real bad time.
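A minimal cache-aside sketch, assuming a local Redis reachable via redis-py; fetch_user_from_db is a hypothetical stand-in for your real database query:

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def fetch_user_from_db(user_id):
    # Placeholder for a real database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)              # 1. try the cache first
    if cached is not None:               # cache hit
        return json.loads(cached)
    user = fetch_user_from_db(user_id)   # 2. cache miss: hit the database
    cache.setex(key, 300, json.dumps(user))  # 3. populate, with a 5-min TTL
    return user
```

Note the application itself does all the cache bookkeeping here, which is exactly the responsibility the next strategies try to take off its plate.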
Speaker 1:The next logical step, to avoid having your application be responsible for keeping the cache up to date, is to move the cache in between the application and the database. This is called the read-through cache strategy, and its main difference from cache-aside is that it relies on a cache provider to update the cache instead of the application. Unfortunately, much like cache-aside, read-through still has issues with data inconsistency. Another strategy is called the write-back cache strategy, in which the cache still sits between the application and the database, but anything the application writes is first written to the cache and then updated in the database. The main advantage of this is probably obvious: it's the exact opposite situation from read-through, in that the data, when fetched, is never stale. But it also means you have to wait for two writes to happen, to the cache and to the database respectively. The big disadvantage of this strategy is the potential for data loss if your cache crashes before writing to your database. The other is that your cache churn is extremely high: you're writing everything to the cache, but not everything you write is important. Most cache providers do offer a time-to-live value that helps with this, but it's still something that should be thought about when going with the write-back strategy.
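Here's a toy, synchronous write-back sketch; save_user_to_db is hypothetical, and a real implementation would flush asynchronously with retries:

```python
# Toy write-back: writes land in the cache immediately and are persisted later.
users_cache = {}
dirty_keys = set()

def save_user_to_db(user_id, user):
    print(f"persisting {user_id} -> {user}")  # placeholder for a real write

def write_user(user_id, user):
    users_cache[user_id] = user   # 1. write to the cache first
    dirty_keys.add(user_id)       # 2. remember what still needs persisting

def flush():
    # 3. drain pending writes to the database. If the process dies before
    # this runs, those writes are gone: the data-loss risk mentioned above.
    while dirty_keys:
        user_id = dirty_keys.pop()
        save_user_to_db(user_id, users_cache[user_id])

write_user(42, {"name": "ada"})
flush()
```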
Speaker 1:The last common strategy is one I don't find myself reaching for a lot, but I have to acknowledge that it is the most performant when utilized with the right kind of data. This, of course, is the write-around strategy, which is to say, when a write happens from an application, it goes directly to the database, going around the cache. When a read happens, the application checks the cache first and, if it misses, then checks the database and updates the cache. This keeps us from piling infrequently accessed data into our cache only for it to be evicted, benefiting no one. Because of this, you can utilize either the cache-aside or read-through strategy and implement a sort of lazy loading. Lazy loading, if you're not familiar, is the process of updating the cache only when something is needed or called. But the main drawback to this process, besides the extra layer of complexity, is that because the cached data is only refreshed on reads while writes go straight to the database, if we don't handle our cache invalidation, then we will frequently be accessing older, stale versions of our data.
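And a toy write-around sketch, with plain dicts standing in for the database and cache; note the invalidation on write, which is exactly the step that prevents the staleness problem just mentioned:

```python
# Toy write-around: writes skip the cache; reads lazily populate it on a miss.
db = {}
cache = {}

def write_product(product_id, product):
    db[product_id] = product          # the write goes around the cache
    cache.pop(product_id, None)       # invalidate any stale cached copy

def read_product(product_id):
    if product_id in cache:           # cache hit
        return cache[product_id]
    product = db.get(product_id)      # miss: fall through to the database
    if product is not None:
        cache[product_id] = product   # lazy-load into the cache
    return product

write_product(1, {"name": "trail shoe"})
print(read_product(1))  # miss, then cached
print(read_product(1))  # hit
```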
Speaker 1:So when it comes time to actually implement the database cache, there are three different ways you can build this out: use the cache that's built into the database, run a cache service locally, or, if latency permits, use a remote cache option. From the AWS docs on Amazon Aurora, quote: some databases, such as Amazon Aurora, offer an integrated cache that is managed within the database engine and has built-in write-through capabilities. When the underlying data changes on the database table, the database updates its cache automatically, which is great; there's nothing within the application tier required to leverage this cache. Where integrated caches fall short is in their size and their capabilities. Integrated caches are typically limited to the available memory allocated to the cache by that DB instance and cannot be leveraged for other purposes, such as sharing data with other instances. End quote. That's Amazon's reasoning for using the built-in cache offering in their database systems.
Speaker 1:On to the other options. The next one is the local cache. Local caches are exactly what they sound like: cache instances that exist locally to your application, usually stored on the same node. The benefit in this case is that it cuts out all the problems of network latency, but it also means that if that node goes down, the cache has to be completely rehydrated. So, on the one hand, you have very minute control over what's happening with your cache, you control everything that's happening within it, and you cut down on the latency issues. On the other hand, it's more volatile: if your cache goes down, you are responsible for completely rehydrating it, and what I mean by that is adding back, by hand or through your application, any of the data that you want to exist in the cache.
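For a feel of what an in-process local cache looks like, here's a minimal sketch using only Python's standard library; get_product is a hypothetical stand-in for a real database query:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def get_product(product_id):
    # Placeholder for a real database query; results are memoized in memory.
    return {"id": product_id, "name": "example"}

get_product(7)   # miss: runs the "query" and caches the result
get_product(7)   # hit: served from process memory, no network hop

# If the process or node restarts, the cache is empty and must be rehydrated:
get_product.cache_clear()  # simulates losing the node's cache
```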
Speaker 1:The final option, and the one I think most of us see in systems these days, is the remote cache option, utilizing one of the more popular choices, which, at the time of recording, are Redis and Memcached, both extremely popular in-memory data stores. They're used to cache databases effectively and have been for a long time. I could probably squeeze an entire episode out of the differences between these two offerings, and the passionate people behind them and the passionate people that use them could probably fill another. But what I will say here is that Memcached's main selling point is that it does one thing, and it does that one thing really, really well: it is a key-value store which, unlike Redis, is actually multi-threaded. Despite this, Redis has grown increasingly popular over the years because, despite being single-threaded, it offers more than just DB caching. It can be used for things like pub/sub messaging, geospatial support, and replication, which allows you to scale horizontally and offer more highly available clusters.
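As a small taste of Redis beyond plain caching, here's a pub/sub sketch with redis-py; it assumes a Redis server on localhost, and the channel name is invented:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

pubsub = r.pubsub()
pubsub.subscribe("orders")                 # listen on a channel

r.publish("orders", "order 1234 shipped")  # any client can publish

for message in pubsub.listen():            # first message is the subscribe ack
    if message["type"] == "message":
        print(message["data"])             # b'order 1234 shipped'
        break
```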
Speaker 1:We'll talk more about scaling in the next episode; half the episode is going to be dedicated to replication and the other half to sharding, so I hope you're looking forward to that. And with that, we wrap up part three of four of our deep dive into databases. Today we mostly focused on two topics, indexing and database caching, and I cannot stress enough how important these concepts are, both when designing a system for an interview and in understanding how you can improve your current system. I'm going to follow up next episode with the final chapter of this deep dive, focusing on replication and sharding, two very, very important subjects. So, if you haven't already, go ahead and subscribe to the podcast, because I have many, many more episodes coming down the pipeline covering other subjects.
Speaker 1:Thank you again for all the feedback on the last few episodes. I hope to keep making these better and better. Don't be shy, please. If you would like to provide any feedback or even just some support, drop me a line at LearnSystemDesignPod at gmail.com, and remember to include your name if you want a special shout out. If you'd like to support this podcast, help me pay my bills, and help me make this podcast better, please jump over to our Patreon and consider becoming a member. All podcasts are inspired by Crystal Rose. All music is written and performed by the wonderful Aimless Orbiter; you can check out more of his music at soundcloud.com/aimlessorbitermusic. And with all that being said, this has been the Learn System Design podcast. Thank you.
Speaker 1:You.