Learn System Design

2. Databases Decoded: A Journey from Clay Tablets to Graph Paradigms (part 2)

Ben Kitchell Season 1 Episode 2

Send us a text

Embark on a time-traveling adventure with me, your host Benny Kitchell, as we chart the monumental shifts in database technology. From ancient civilizations etching their records into clay to the sophisticated graph databases of the present, this episode uncovers the pivotal moments that have shaped how we manage data. We'll delve into Edgar F. Codd's relational model revolution, understand the origins of the Hierarchical Model System, and discuss the impact of MongoDB's entrance into the field, all while drawing fascinating parallels to the systems we rely on today. As Oracle continues to represent the resiliency of relational databases, I'll also share a personal nod to Postgres for its robustness in the relational scene.

Prepare to look under the hood of database management as we unravel the intricacies of locking mechanisms and their critical role in maintaining data integrity. I'll guide you through the maze of optimistic and pessimistic locking techniques, clarifying their distinct applications and benefits. We'll also confront the concept of durability and the essential practice of write-ahead logging, ensuring our data withstands the unexpected. The rise of NoSQL and document databases throws a welcome curveball into our discussion, challenging the traditional database dominance with their specialized use cases.

Choosing the correct database type can be as complex as the data itself, but we simplify the decision-making process by breaking down the merits of document databases, especially when faced with rapidly evolving data structures. Your datas layout offers clues on whether to harness the power of relational, non-relational, or graph databases, and we'll explore when and why you might choose one over the others. To wrap up, it's a shout out to our listeners, whose insights and feedback shape our podcast's direction. If system design is the symphony, then your voices are the harmonies that enrich our collective journey into the world of database technology.

Show Notes:
A very special shout out to Sumaiya Asif!
Edgar F. Codd's paper

Support the show

Dedicated to the memory of Crystal Rose.
Email me at LearnSystemDesignPod@gmail.com
Join the free Discord
Consider supporting us on Patreon
Special thanks to Aimless Orbiter for the wonderful music.
Please consider giving us a rating on ITunes or wherever you listen to new episodes.


Benny Kitchell:

Hello everyone, welcome to episode number two of the Learn System Design podcast with me, your host, benny Kitchell. I am very excited to explore the world of system design with you. Just a preface here if you haven't listened to the last episode, I recommend you do so. It gives a nice foundation of what we're expanding on today and it helps you learn cool little lingo like Cap. Anyway, today's episode is digging deeper into databases. It is more specifically why, when you tend to find a bottleneck, it tends to be databases, especially when they're not well indexed or you know, they're not well scaled. Well. And yeah, over the next few episodes we'll dig deeper into it, but for now I sort of want to start from the beginning where databases came from and how they work.

Benny Kitchell:

So, without further ado, let's get into it. So around 5000 years ago, a Sumerian or Egyptian, depending on who you ask thought that it might be useful to begin taking record of what was happening around them. They didn't realize it at the time, but putting these symbols into clay tablets would one day lead to a world where we had over 65 Zettabytes worth of data. Next, around 3000 years later, we have the invention of paper in China, which of course led to books and, with that, led to the burning of books. I'm looking at you, caesar. And around 1876, we got one of the most popular forms of what eventually would lay the groundwork for the modern database, which, if you've ever been to a library, you will know as the Dewey Decimal System. Briefly, touching on the Dewey Decimal System, it's a fascinating concept and very, very important, but for its importance for databases, the two main things that it helped introduce were relative location and, more importantly, relative indexing. However, the first concept of a database as we would recognize it today actually came around 1970. It was from an IBM researcher named Edgar F Cod, and Cod published a paper that was simply titled A Relational Model of Data for Large Shared Databanks. While it's a wordy title, it actually introduced some of the fundamental concepts we use in databases to this day. It talked about relational modeling, normal form, index, dependency and a lot more other things, and I will include a link to that paper in the show notes, as I just find it fascinating that something that was written well over 50 years ago can still hold so much relevance and knowledge on the things we work on today.

Benny Kitchell:

And, probably unsurprisingly, the most popular database to come out of the 1970s was the Information Management System from IBM. It was mostly used to handle business data. It was fairly straightforward. It had a data model system that basically just structured its data like a tree. It was called the Hierarchal Model System. Ironically enough, it has a lot of similarities to today's more chase-on structured document databases that have been coming out in the last 10 or so years and decades later.

Benny Kitchell:

Sometime in the 90s, when consumer electronics really started to boom, the throughput of data became so massive that people began trying to make new types of databases to check off other boxes for this new age of the internet and, as such, in 1998, carlo Strozzi coined the term NoSQL when naming his still relational database, strozzi NoSQL. While the term NoSQL wouldn't be technically reborn until like 2009, it was in 2007 when MongoDB, one of the most popular NoSQL databases, was founded and when I feel like most people recognize the alternative. Mongodb was actually created by Dwight Merriman, elliot Horowitz and Kevin Ryan the team behind DoubleClick and they developed it mostly because they needed to make custom data stores to work around the shortcomings of existing databases. It's a fascinating story and probably an episode for another time, but finally we have the newest member of databases. While some would consider them NoSQL databases, I feel as though they deserve their own recognition. That, of course, is graph databases. Graph databases are a fairly newcomer in terms of market share and have yet to agree on a central graph query language. You know, like how relational databases use SQL graph query languages there's really no GQL yet but their presence and importance should not be undervalued.

Benny Kitchell:

How much relevance do relational databases still hold? Well, the first commercial relational database that was actually built on Cod's original concept was released by Oracle yes, that Oracle in 1979, just nine years after the paper was published. If, for some reason, you're not familiar with Oracle or its database system, it is the market leader for database systems as of recording this episode. So to put that in perspective, the single most popular relational database that someone can use to build out their system was created 40 years ago. Of course, it was improved since then, but the core code, the core concept, still was built 40 years ago. And if you don't remember what computers look like in the 1980s, I recommend pausing this podcast and searching it.

Benny Kitchell:

And while I'm sure there will be a lot of people that disagree with this episode's premise, this to me is one of the biggest factors of why relational databases tend to be the first bottleneck you hit when building out a system. While they're incredible, they simply aren't built to scale in a horizontal fashion, that's not to say relational databases can't handle throughput. The amount of work these teams put into their database management system is nothing short of remarkable. Postgres, which is written in C and released in the mid-80s, is still, to this day, one of the most remarkable pieces of technology I've used. It is still my go-to when I build a system that requires a relational database.

Benny Kitchell:

You might be wondering what about the query language? What about SQL? Sql was created in 1974. It was initially developed, of course, at IBM by Donald D Chamberlain and Raymond F Boyce after they learned about the relational model from Cod. It is still, to this day, the main language used by relational databases to interface with data, despite the fact that it's very simple to learn, it's very simple to read, it can be very verbose and it's turned into a very mature language. So at this point you might be asking if they're old and they don't scale, why do we use them? Well, the answer is it's simply as it is hard to swallow. If you need something that holds true to ACID principles, then there is nothing better on the table.

Benny Kitchell:

Despite the fact that this podcast initials are LSD and I just dropped ACID on you, pun intended ACID actually stands for Atomicity, consistency, isolation and Durability. It is the backbone concept of the relational database. It's the one thing that alternatives like NoSQL and GraphQL can't really offer. They offer different things that are better in their own way, but we'll address that later. For now, let's dig into each of these concepts. What it means and when. It's important when developing a system In atomic transaction, at least in terms of databases, is the concept that when a series of operations are performed on data, either all of them occur or none of them occur.

Benny Kitchell:

In other words, there will never be a transaction that only does half of its orders. A prime example of this would be if you're updating a large number of records in the data with a new value and in the middle of the transaction, your entire system crashes. Maybe your network goes down, maybe the disk fails. Instead of having a record that is half complete, the atomicity of relational database simply rolls it back to the last stable entry and allows you to pick things back up when everything's back online With consistency.

Benny Kitchell:

There are two main types of constraints that the data must follow where the transaction will fill. The first and the most common is an integrity constraint. Integrity constraints can be broken into four categories. The first of these is the domain constraint. Domain constraints are based on the type of an attribute, ie a column for height would be set as a number type. If a transaction tried to add a value that was a string, that transaction would fail as before. The state would be rolled back to the stable state before and the error would be thrown. The second integrity constraint is the key to the relational database system and is based on the foreign key constraint.

Benny Kitchell:

If you're unfamiliar with what a foreign key is, you can pause this episode and you can look it up real quick and come back. They are one of the most important concepts of relational databases. But, simply put, a foreign key relation is basically a way of mapping one column in a table to another. An example is you might have a user table with things like a name, age, height and then a foreign key relation where the address column in the user table points to another table with the same address ID and what this does is keep your data decoupled. It keeps your columns smaller and it keeps the data separated. So what's a foreign key constraint? Well, it's exactly what it sounds like. It is a guarantee that the reference you are making in table one exists in table two. It's nothing more complicated than that.

Benny Kitchell:

The third and fourth integrity constraints are pretty close in nature. The first is the primary key constraint, which basically says that all keys in a column must be unique. You cannot have a duplicate key in the primary key column of a table. And then, of course, the final one is very similar. It basically just states that a primary key cannot be known. It is known as the entire integrity constraint. So those are the big four categories when it comes to the integrity constraint. The other main constraint that has to do with consistency is much more flexible and it's called the declarative constraint. A declarative constraint is any constraint created by someone to reject specific transaction changes. One of the most common declarative constraint is the non-negative constraint. It's used a lot in baking and credit unions and it basically stops the transaction from completing if the amount of money being pulled out of a user's account is more than the amount of money that is recorded in the database. Pretty easy, right? So just to recap, for consistency. We have the integrity constraint and the declarative constraint, and the integrity constraint has four small categories within it that we touched about earlier.

Benny Kitchell:

So let's talk about the I in asset. I is for isolation. It's pretty self-explanatory at a high level. It basically means every read and write will happen in a vacuum and won't affect other read and writes. An important distinction here is it's not that transactions can't happen at the same time. Instead, it means that reads and writes cannot happen at the same time if there's a possible conflict.

Benny Kitchell:

Take, for example, a database with people's information. It just has their first name, their last name, their age and the last time they locked in. Say person A and person B have just been married, you can update both of their last names at the same time with different transactions and have zero problems. But if you were to update person A's last login time with two different transactions at the same time, then we have a conflict, right. So which one? You'd have to do some logic to basically say which one is later and then update the last login time with that one.

Benny Kitchell:

But again, this is where isolation comes in handy. If you're trying to update the exact same thing at the exact same time the database will throw an error and keep the last known stable time in that spot. So, as you can probably see, isolation is an important concept, especially when trying to keep data consistent, keep it secure. But the major drawback is it can be one of your biggest bottlenecks for transactions. If you have a specific entry that's just updated, a lot, each of those transactions have to wait in a queue before actually being applied. Depending on how long your schema is locked, this could even mean it's slower to handle other transactions as well.

Benny Kitchell:

And so what do I mean when I say locking your schema? Well, there's two main ways to do it. The first way of locking your schema is what's called optimistic locking. Sometimes it's also known as optimistic concurrency, but the logic is the same. Optimistic locking is a way of reading from the database and verifying that no one else has modified the data before sending it back to the user. How this is done is basically with a series of checks. These checks can verify timestamps, check sums or even like version numbers, and this all happens right when the initial read is fired and then also once again before sending it back to the user. If the check changes in between these two actions, then the user will get back a failure and then they can retry the read.

Benny Kitchell:

Optimistic locking is mostly used either for high throughput systems that can't handle having locks for long periods of times, or in systems where there's a lot of reading and not a lot of writing on the database. And where we have the optimistic locking system we of course have the opposite, our second type of locking for isolation, known as the pessimistic locking system. Pessimistic locking forces the user to have an exclusive use to the specific data until it's done. So if you think about like a lock in an operating system, it's the same basic concept that this data cannot be accessed or changed until this transaction is complete. Common way of actually implementing this is either by connecting directly to the database or even using like a specific transaction ID. And there's a classic example of the pessimistic locking system.

Benny Kitchell:

That is the bank analogy right, and I'm going to read directly from Wikipedia for this one, and I quote two bank clerks attempting to update the same bank account for two different transactions. Clerks wanted to both retrieve the account's records. So I want to pause here for a second because the way Wikipedia words this can get kind of wordy and kind of confusing what I'm going to say is clerk one and clerk two currently have a copy of the original record From here. Just to make sure things are clarifying, I'm going to be paraphrasing Wikipedia a bit, but I'm going to be using three different ways of describing this data the original record, copy one and copy two for clerk one and clerk two's copy respectively. So now clerk one applies and saves a transaction and sends it back to the database. Now the record on the database reflects clerk one's copy. Clerk two applies a different transaction to their saved copy which again, if you remember, is the original and saves the results back to the database as well. What this does is it overwrites the the transaction by clerk one and the record no longer reflects that first transaction, as if it had never taken place to begin with.

Benny Kitchell:

All of this could have been afforded by using a pessimistic locking system, by simply locking the file when a transaction is being performed. This never occurs and it always has the best data integrity between the two locking options. At the end of the day, the choice to use optimistic or pessimistic totally depends on the system and the data that is being processed. If something gets accessed a lot and doesn't really care about, like what happens in between or things being overwritten think like a gamer's inventory or social media fee then it probably would be better at looking at optimistic locking Whereas something that needs to be more secure, locked down and actually handle every single transaction, no matter what. You will want to go with more of a pessimistic lock.

Benny Kitchell:

So before I go into the D of ACID, I just want to reiterate that the ACID concept mostly applies to relational databases. Non-relational databases have their own way of doing things and while neither one is better all the time, one is better for a scenario and the other is better for a different scenario, and rest assured I will be digging into that. So great. So on to durability, which is the D of ACID. I feel like it's one of the most helpful features of the relational database, but it just depends on the person. I suppose.

Benny Kitchell:

Durability protects your data in the event of catastrophic failure, think network outages or system crashes. In short, when a transaction is written to the system, it is permanent. The only way to get rid of it is to destroy the disk entirely and then also not have backups of that disk. It's a whole thing protecting your data being secure, but in terms of your system crashing or a power outage or a network outage. If the transaction has been applied, it will be saved and will be able to be recovered when things come back. The way this is achieved is usually just by keeping a sequential log of all the commits before they even acknowledge them. So as soon as you are about to do something, the database will make a record of it in a log. Then try and apply that transaction.

Benny Kitchell:

This technique, which is used in many places all over the place, is basically called write-ahead logging, and by buffering changes to the disk before they are synchronized from main memory, they actually protect themselves from system crashing and network outages. So again, write-ahead logging is not a new concept for databases. It's not a new concept in software engineering at all, but it is one of the most stable parts of relational databases and, as I mentioned before, no solution is perfect. In the event of a disk failure or someone breaking in and destroying your drives, then if the data is not backed up, it is of course, lost. The only way to protect your data is to have it backed up accordingly. This is why it's so important to use protocols like having replicas that are just backups of data in other places. A lot of these techniques I will cover in the following episodes and they not only help with protecting your data, but they can also be used to help make relational databases more performant and better to scale.

Benny Kitchell:

And so inevitably because there is never truly one answer to a question and where there are people who love one thing, there is an equally passionate group of people who love another With this comes the birth of a new type of database, one that is not better or worse, instead just does things in a different way. I am, of course, talking about no SQL or document databases, which, depending on who you ask it, either stands for not only SQL or non-SQL. But either camp you are in, you can agree that, while relational databases are strict and by the book, document databases tend to be more loosey-goosey and allow the developer more freedom in how they handle their data. And so if you look at a graph about around the time where people are complaining about wanting to lower their cost of storage and around the time of the rise of no SQL databases, you will actually see a point where they crossed, and that point is around 2009-ish.

Benny Kitchell:

This was birth of what would be some of the biggest names of document databases, including powerhouses like MongoDB, but in the last decade alone, we have seen this area of computing really grow and mature to a large number of databases, offering things that run the gambit of not just being an alternative to relational databases but being a document database that handles large data or other things like that. Some of these new types of databases include document databases like MongoDB, which basically just stores your data in a JSON-like configuration, but it also has things like a wide column stores If you think about Google's Bigtable, which can be thought of like a 2D key value store or a relational database without a strict schema on the columns. There are also databases that are more bare bones, like RocksDB, which is basically a giant hash table, and then, of course, lastly but not leastly, we have graph databases, things like Neo4j, which uses node and edges to just create a relationship between all of your data. If you've ever worked with data structures, then you know how graphs work and how they are connected. Imagine that, but for a database. If you'd like a more detailed, deep dive into graph databases, definitely drop me an email at learnsystemdesignpodcom and let me know. Honestly, these things are truly fascinating and expansive and it just keeps getting bigger and bigger, but for now, I'm just going to keep things at a high level for when it comes to graph databases and just use them as they are for the sidelines. For my experience they're great, but they're not mature enough to have their own query language, so I don't want to spend too much time on them.

Benny Kitchell:

So why, then, is it so prevalent, especially with startups and smaller tech companies, to use document databases? Because, honestly, it holds a few golden tickets that are generally too appealing to pass up on. The first, and one we've touched on before, is how easy it is to scale a document database horizontally. Making a copy of your data and then being able to scale with your users is impeccable, it's fantastic and it helps you serve a large volume. The ability to replicate the data across a bunch of nodes very quickly is its main selling point, especially if you are leaning into a microservice architecture, where the more people you serve just means serving up more instances on the front end, the back end and the database very quickly. But the second one is the flexibility of the schema, and this positions itself well against companies that are starting out and may need to pivot quickly, despite what they'd like to tell you, most tech companies didn't start off knowing who their audience was, what they were going to sell them, much less what the data would look like and how to store it properly.

Benny Kitchell:

So being able to turn on a dime and use a whole new set of data in a whole new way, without any large DB changes or migrations, is incredibly crucial. And to put that in perspective, if you are starting off and you know what your data looks like, and then, two years down the road, you want to offer this new feature and now your data is related to each other in a different way your relational database you're going to have to drop all those tables, make new tables, make new connections, make new foreign keys, whereas with a document database, you're still pulling every row when you query, so you're still pulling all of the data anyway, and so by you really have no big change when your data changes, your needs for the data changes. And finally, the final ticket that I don't think a lot of companies will want to admit to is the low cost for infrastructure. Document databases generally require less hardware and less infrastructure to run, and what this means at the end of the day is a cheaper bill for the engineering team. While it may seem trivial at first, this sometimes can be a make or break decision when deciding what kind of database to use for early stages company, and let's say you are one of those companies. Maybe you're asking, ben, if no SQL is newer and there's more diversity, they have to be better than relational databases, right? And, as I've alluded to many times, yes, sometimes Document databases offer a variety of perks from more traditional databases and generally have more flexible schemas.

Benny Kitchell:

This could be a safe haven for when you have data that is constantly changing or maybe not always consistent. Of course, this means when trying to clean up or query that data. A lot of the times this means having properties that are inconsistent or even missing. In relational databases they have a much more rigid schema, it's easier to query, but they're often less efficient in terms of storage because, at the end of the day, when you create a row, regardless if a column is filled with data or not, that amount of memory is still occupied in the database. So, to put that in perspective, if you have, let's say again, a person database that has a first name, last name, age and last log in time, right, like we talked about before, if this data is being host on a relational database and you have a column for each one of those, and maybe you don't have the last name for some people. That column for that row is still taking up space. It still takes up memory no matter what, whereas something like a document database only takes up the amount of space needed for a row because the schema is flexible it can shrink and grow as it needs to.

Benny Kitchell:

The caveat to all of this is that more than likely on large, complex systems, you won't be just using one or the other. I don't think I've ever worked on a large system that didn't utilize both relational and non-relational databases. Sometimes you might not even realize it. If maybe you're working on the front end all day, you should shoot a Slack message to your data team. Ask them how they ingest all their data. It might surprise you or it might not if you've been listening to the show for very long, but my best rule of thumb when designing a new system is try to understand the data you are working with, and if you have to group it into different piles and take care of those piles separately, then sometimes that's for the best. Let's take, for instance you're designing a social network. A graph database is gonna be crucial for all those relationships between all those users. But what about your login? Authentication Seems like a waste of time to have such structured data sitting in such a flexible state, making it even harder to query.

Benny Kitchell:

When do you use a document model? My best advice on this is to consider that how your data is laid out as if it's a large graph where you connect all the relationships, then maybe it's time to use a graph database. On the other hand, maybe you form the data and it starts to look like a tree. There's a lot of one to many relationships. Then it's probably a good idea to break out the document database. And then, of course, when you map out all your data and there are lots of many to many relationships, then it's probably best to use more of a relational style. Again, it just depends on the data and, above all else, just remember that it's okay to use one, two or even all three at the same time. That's it for today's part two on databases. I really hope you enjoyed hearing a little bit about the history and a lot of bit about the niche differences between the different types of databases.

Benny Kitchell:

If you have enjoyed this episode. Please rate us five stars. It means a lot to us and every rating helps us reach more people and helps keep the podcast free. If there's anything you didn't quite understand or anything I didn't cover, or even just something you'd like to hear me talk about more, please reach out to me at LearnSystemDesignPod a gmail. com. A very special shout out to a Asif for sending us an email and being a wonderful person and supporting us. If you want to reach out, definitely send me an email at LearnSystemDesignPod a gmail. com and, just Sumaiya , remember to include your name for a special shout out.

Benny Kitchell:

If you'd like to help support this podcast, help me pay my bills, please jump over to Patreon and consider becoming a member Right now. There's only one tier, it's a dollar and you get to help me decide what I talk about next, after databases, whether it's encoding or messaging queues or really anything. All podcasts are inspired by Crystal Rose. All music is written and performed by the wonderful aimless orbiter. You can check out more of his music at . Soundcloud. o s... om. om slash aimless orbiter music. And, with all of that being said, this is Benny slash LearnSystemDesign. Patreoncom slash LearnSystemDesign.

People on this episode