With a solid understanding of data concepts and the open-source ecosystem, we now turn our attention to the first and most fundamental category of data storage systems: the relational database. For over half a century, relational databases have been the bedrock of data management, providing a reliable, structured, and consistent foundation for countless applications. From banking systems that process trillions of dollars to e-commerce sites that handle millions of orders, the relational model has proven to be remarkably resilient and adaptable. As a data engineer, a deep understanding of relational databases is not optional; it is a prerequisite. You will work with them every day, whether as a primary data source for your pipelines, a target for your transformed data, or a metadata store for your data platform.
This chapter is dedicated to providing a deep, practical understanding of relational databases. We will start by exploring the theoretical foundations of the relational model, understanding the principles that make it so powerful. We will then take a comprehensive look at the two most popular open-source relational databases in the world: PostgreSQL and MySQL. We will go beyond a superficial comparison and dive into their architectures, their unique features, and their ideal use cases. Finally, we will get our hands dirty with a practical exercise that will give you the confidence to start using these systems effectively. By the end of this chapter, you will have a firm grasp of the most important and ubiquitous data storage technology in the industry.
4.1 The Relational Model: A Foundation of Logic and Order¶
To truly master relational databases, we must first understand the elegant theory upon which they are built. The relational model was first proposed by Edgar F. Codd, a researcher at IBM, in his seminal 1970 paper, “A Relational Model of Data for Large Shared Data Banks.” Codd’s work was a breakthrough in computer science, providing a formal mathematical foundation for data management. The model is based on the principles of set theory and first-order predicate logic, which is a fancy way of saying that it provides a simple, logical, and consistent way to represent and manipulate data.
The Core Concepts
At its heart, the relational model has a few simple but powerful concepts:
Relation (or Table): The fundamental data structure is the relation, which is a set of tuples. In more common terms, a relation is a table.
Tuple (or Row): A tuple is a set of attribute values. In more common terms, a tuple is a row in a table.
Attribute (or Column): An attribute is a named column of a relation.
Domain: A domain is the set of all possible values for an attribute. For example, the domain for a
genderattribute might be {Male,Female,Non-binary}.Key: A key is one or more attributes that uniquely identify a tuple in a relation. The most common type of key is the primary key.
Codd’s 12 Rules
Codd went on to define a set of 12 rules (actually 13, numbered 0 to 12) that a database must adhere to in order to be considered truly relational. While no database perfectly implements all 12 rules, they serve as a guiding philosophy for relational database design. Some of the most important rules include:
The Information Rule: All information in the database must be represented in exactly one way—as values in a table.
The Guaranteed Access Rule: Every piece of data must be logically accessible by using a combination of the table name, the primary key value, and the column name.
Systematic Treatment of Null Values: The database must have a systematic way of handling null values (i.e., missing or inapplicable data), distinct from empty strings or zeros.
The Relational-Oriented Language Rule: A relational database must support a comprehensive, data-oriented language that can be used for data definition, data manipulation, and transaction control. This language is SQL.
The Power of SQL (Structured Query Language)
The relational model gave us a powerful tool for interacting with data: SQL. SQL is a declarative language, which means you tell the database what you want, not how to get it. You declare the result you want, and the database’s query optimizer figures out the most efficient way to retrieve it. This is a major advantage over imperative languages, where you have to write the step-by-step procedure for accessing the data.
For example, to find all the employees in the engineering department with a salary over $100,000, you would write a simple SQL query:
SELECT name, salary
FROM employees
WHERE department = 'Engineering' AND salary > 100000;You don’t have to tell the database whether to use an index, how to join tables, or in what order to filter the data. The database figures that out for you. This separation of concerns is one of the key reasons for the enduring success of the relational model.
ACID Transactions: The Guarantee of Reliability
Another cornerstone of relational databases is their support for ACID transactions. ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. It is a set of properties that guarantee that database transactions are processed reliably.
Atomicity: A transaction is an “all or nothing” proposition. It either completes in its entirety, or it is rolled back as if it never happened. You can’t have a situation where only half of a transaction is committed.
Consistency: A transaction must bring the database from one valid state to another. It cannot violate the integrity constraints of the database.
Isolation: Transactions that run concurrently must be isolated from each other. The result of running multiple transactions at the same time must be the same as if they were run sequentially. This prevents issues like “dirty reads” where one transaction reads the uncommitted changes of another.
Durability: Once a transaction has been committed, it will remain committed, even in the event of a power failure, crash, or other system error. This is typically achieved by writing the changes to a durable storage medium, such as a hard drive, in a write-ahead log.
ACID compliance is what makes relational databases the right choice for mission-critical applications where data integrity is paramount, such as financial systems, e-commerce platforms, and booking systems.