Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Chapter 4: Relational Databases: The Bedrock of Data Management

With a solid understanding of data concepts and the open-source ecosystem, we now turn our attention to the first and most fundamental category of data storage systems: the relational database. For over half a century, relational databases have been the bedrock of data management, providing a reliable, structured, and consistent foundation for countless applications. From banking systems that process trillions of dollars to e-commerce sites that handle millions of orders, the relational model has proven to be remarkably resilient and adaptable. As a data engineer, a deep understanding of relational databases is not optional; it is a prerequisite. You will work with them every day, whether as a primary data source for your pipelines, a target for your transformed data, or a metadata store for your data platform.

This chapter is dedicated to providing a deep, practical understanding of relational databases. We will start by exploring the theoretical foundations of the relational model, understanding the principles that make it so powerful. We will then take a comprehensive look at the two most popular open-source relational databases in the world: PostgreSQL and MySQL. We will go beyond a superficial comparison and dive into their architectures, their unique features, and their ideal use cases. Finally, we will get our hands dirty with a practical exercise that will give you the confidence to start using these systems effectively. By the end of this chapter, you will have a firm grasp of the most important and ubiquitous data storage technology in the industry.

4.1 The Relational Model: A Foundation of Logic and Order

To truly master relational databases, we must first understand the elegant theory upon which they are built. The relational model was first proposed by Edgar F. Codd, a researcher at IBM, in his seminal 1970 paper, “A Relational Model of Data for Large Shared Data Banks.” Codd’s work was a breakthrough in computer science, providing a formal mathematical foundation for data management. The model is based on the principles of set theory and first-order predicate logic, which is a fancy way of saying that it provides a simple, logical, and consistent way to represent and manipulate data.

The Core Concepts

At its heart, the relational model has a few simple but powerful concepts:

Codd’s 12 Rules

Codd went on to define a set of 12 rules (actually 13, numbered 0 to 12) that a database must adhere to in order to be considered truly relational. While no database perfectly implements all 12 rules, they serve as a guiding philosophy for relational database design. Some of the most important rules include:

The Power of SQL (Structured Query Language)

The relational model gave us a powerful tool for interacting with data: SQL. SQL is a declarative language, which means you tell the database what you want, not how to get it. You declare the result you want, and the database’s query optimizer figures out the most efficient way to retrieve it. This is a major advantage over imperative languages, where you have to write the step-by-step procedure for accessing the data.

For example, to find all the employees in the engineering department with a salary over $100,000, you would write a simple SQL query:

SELECT name, salary
FROM employees
WHERE department = 'Engineering' AND salary > 100000;

You don’t have to tell the database whether to use an index, how to join tables, or in what order to filter the data. The database figures that out for you. This separation of concerns is one of the key reasons for the enduring success of the relational model.

ACID Transactions: The Guarantee of Reliability

Another cornerstone of relational databases is their support for ACID transactions. ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. It is a set of properties that guarantee that database transactions are processed reliably.

ACID compliance is what makes relational databases the right choice for mission-critical applications where data integrity is paramount, such as financial systems, e-commerce platforms, and booking systems.