Chapter 4: Relational Databases with PostgreSQL and MySQL

Reliable data products begin with reliable records. Before a lakehouse can model customer behavior, before a streaming platform can publish events, and before an analytics dashboard can explain revenue, someone must decide how the business records are created, validated, queried, changed, backed up, and recovered. Relational databases remain the standard foundation for this work because they combine a precise data model, a mature query language, transactional guarantees, and decades of operational practice.

This chapter moves from the foundations of Part I into the storage systems of Part II. We will use PostgreSQL and MySQL not as competing brands, but as two production-grade implementations of the relational idea. By the end of the chapter, you will be able to design a small operational schema, reason about keys and constraints, read simple query plans, choose indexes with discipline, perform a controlled migration, and explain why backup/restore testing is part of data engineering rather than an afterthought.

Opening Scenario: TuranMart’s Checkout Problem¶

TuranMart, our fictional regional e-commerce company, has reached the point where spreadsheets and application logs are no longer enough. The mobile application accepts orders from Tashkent, Samarkand, Bukhara, and Fergana. Customers expect a checkout confirmation immediately after payment. Finance expects daily revenue by region. Operations expects a list of orders that are pending, completed, shipped, or cancelled. Marketing wants to know which registration channels drive the highest-value customers.

The team’s first instinct is to stream every click into the analytics platform. That will be useful later, but it does not solve the immediate business risk. TuranMart first needs a system of record. The checkout service must know whether a customer exists, whether the products are active, whether an order contains at least one item, whether the total is non-negative, whether a payment was captured, and whether the database can be restored after a failure. These are not only analytics questions; they are operational integrity questions.

Figure 1:Chapter 4 connects relational modeling, transactions, indexing, migrations, and recovery into one operational workflow for TuranMart’s order system.

The architecture lesson is simple but important. A data lake stores historical facts, a warehouse serves analytical queries, and a stream processor reacts to events, but the relational database often remains the place where the business first decides what is true. If the order database accepts impossible data, every downstream system inherits that confusion.

Learning Objectives¶

After completing this chapter, you should be able to explain why relational databases remain central to production data platforms and how their guarantees differ from file-based storage, document stores, and analytical engines.

Objective	What you should be able to do in practice
Model relational data	Translate business entities into tables, primary keys, foreign keys, constraints, and relationship patterns.
Compare PostgreSQL and MySQL	Explain the operational strengths of PostgreSQL and MySQL without reducing the decision to popularity or habit.
Use transactions correctly	Describe atomicity, consistency, isolation, and durability, and identify where transaction boundaries belong in an application workflow.
Optimize responsibly	Read basic `EXPLAIN` output, choose useful indexes, and avoid the common mistake of indexing every column.
Manage schema change	Apply a small migration in a controlled way and understand why migrations require review, testing, and rollback planning.
Practice recovery	Connect logical backups, restore checks, and recovery objectives to business continuity.

4.1 Why Relational Databases Still Matter¶

Relational databases organize data as relations, commonly implemented as tables. Each table contains rows, each row represents a fact about an entity or event, and each column has a defined meaning and type. PostgreSQL’s own tutorial introduces the system through relational concepts and the SQL language, which is precisely why it remains a useful teaching platform for data engineers as well as application developers.^[1]

The relational approach is durable because it solves a recurring organizational problem. Businesses do not only need to store data; they need to agree on rules. A customer should have one identity. An order should belong to an existing customer. A product price should not be negative. A completed payment should not disappear if the server restarts. These requirements are easier to enforce when the data model, constraints, transactions, and recovery mechanisms live close to the records they protect.

Definition. In this book, a relational database is an operational data system that stores structured records in tables, exposes them through SQL, enforces integrity through keys and constraints, and protects state changes through transactions, logging, and recovery.

The relational database is not the only storage system in a modern data platform. It is, however, usually the most disciplined place to model operational truth. The data engineer should therefore understand relational databases even when the day-to-day job focuses on Airflow pipelines, lakehouse tables, Kafka topics, or BI semantic layers.

Platform role	Typical relational responsibility	Downstream dependency
Checkout and order capture	Validate orders, customers, products, payments, and statuses.	Event streams, finance marts, fulfillment dashboards.
Customer master data	Preserve customer identity, consent, contact fields, and account state.	Marketing segmentation, personalization, compliance exports.
Inventory and catalog	Maintain product identity, availability, price, and category rules.	Search indexes, recommendation systems, procurement reports.
Finance operations	Record invoices, payments, refunds, and reconciliations.	Accounting reports, fraud monitoring, revenue analytics.

A popular misunderstanding says that relational databases are old while big data systems are modern. A better mental model is that relational databases and analytical platforms answer different questions. Relational systems are optimized for correct state transitions in operational workflows. Analytical systems are optimized for large-scale reading, aggregation, and historical analysis. A mature data architecture uses both.

4.2 The Relational Model in Practice¶

A relational schema is a contract between the business and the database. The contract names the entities, defines the attributes, and states which combinations of records are allowed. PostgreSQL’s constraints documentation emphasizes that data types alone are often too coarse; SQL constraints give designers more precise control and raise errors when a proposed row violates the rule.^[2]

For TuranMart, the first operational schema has four core tables: customers, products, orders, and order_items. This is intentionally small, but it already contains the ideas that appear in much larger production systems.

Figure 2:TuranMart’s OLTP schema separates customers, products, orders, and line items so that identity, validity, and relationships can be enforced close to the data.

4.2.1 Tables, Keys, and Constraints¶

A table should represent one kind of thing or event. A primary key identifies each row. A foreign key links rows across tables. A check constraint limits valid values. A not-null constraint says that missing values are not acceptable for a specific attribute. These definitions may sound theoretical, but they become concrete when an application tries to insert bad data at 02:00 during a sales campaign.

Design element	Example in TuranMart	Engineering purpose
Primary key	`customers.customer_id`	Gives every customer a stable identity.
Foreign key	`orders.customer_id` references `customers.customer_id`	Prevents an order from belonging to a non-existent customer.
Composite primary key	`order_items(order_id, product_id)`	Prevents duplicate product lines for the same order in the starter schema.
Check constraint	`status IN ('pending', 'completed', 'shipped', 'cancelled')`	Prevents inconsistent status values such as `done`, `complete`, or `closed`.
Not-null constraint	`orders.order_date NOT NULL`	Makes required business facts explicit.

Foreign keys are especially important for data engineering because they document lineage inside the operational model. MySQL’s documentation describes foreign keys as a way to cross-reference related data and keep distributed table data consistent; InnoDB also requires foreign-key columns to be indexed, which links integrity design to physical access paths.^[3]

4.2.2 Relationship Patterns¶

Most relational schemas use a small set of relationship patterns. A one-to-one relationship is useful when a row has an optional or sensitive extension. A one-to-many relationship is the everyday pattern: one customer can have many orders, and one order can have many order items. A many-to-many relationship is represented through a junction table, such as student_courses, product_promotions, or user_roles.

Figure 3:Relational schemas express one-to-one, one-to-many, and many-to-many business rules through primary keys, foreign keys, uniqueness constraints, and junction tables.

The TuranMart schema uses one-to-many relationships from customers to orders and from orders to order items. It also uses order_items as a line-item table that connects orders to products. In a later version, the team may allow multiple discounts per item, split shipments, refunds, or partial captures. The relational model can support those changes, but only if each change is introduced deliberately.

4.2.3 Normalization and the Cost of Duplication¶

Normalization is the process of organizing data so that each fact is stored in the right place, with minimal unnecessary duplication. If the product name is copied into every order item, a later product rename creates ambiguity. If the customer region is copied into both the customer and order tables, the team must decide whether the order should record the region at checkout time or the customer’s current region. That decision is a business rule, not merely a technical detail.

Figure 4:Relational design moves from normalized correctness toward selective performance optimizations only when query evidence justifies the trade-off.

A normalized schema is not automatically fast, and a denormalized schema is not automatically wrong. The discipline is to begin with correctness, measure real access patterns, and then optimize where the workload proves the need. In operational systems, premature denormalization often creates long-term data quality problems that are more expensive than the performance issue it was meant to solve.

4.3 Transactions and ACID Guarantees¶

A transaction bundles multiple database operations into a single unit of work. PostgreSQL’s transaction tutorial states the central idea clearly: a transaction groups multiple steps into an all-or-nothing operation, hides intermediate states from concurrent transactions, and cancels partial changes if the transaction cannot complete.^[4]

For TuranMart, checkout should not create an order without its order items, capture a payment without an order, or mark an order as completed while the payment failed. The application should define a transaction boundary around the state change that must succeed or fail together.

ACID property	Meaning for TuranMart checkout	Example failure it prevents
Atomicity	Related writes commit together or not at all.	Order header exists but line items are missing.
Consistency	Constraints preserve valid database state.	Order status is set to an unsupported value.
Isolation	Concurrent transactions do not see unsafe intermediate state.	Finance reads half-finished checkout data.
Durability	Committed changes survive crashes according to the database’s recovery model.	Confirmed order disappears after restart.

MySQL’s documentation describes the ACID model as a set of reliability principles for business data and mission-critical applications, with InnoDB providing features that closely adhere to those principles.^[5] PostgreSQL implements transaction isolation through a multiversion model. Its MVCC documentation explains that each SQL statement sees a snapshot of the database and that readers do not block writers while writers do not block readers under normal MVCC behavior.^[6]

The engineering implication is that transaction design belongs in architecture reviews. Teams should ask which state changes belong in one transaction, which operations can be eventually consistent, and where idempotency is required because a message, job, or API request may be retried.

4.4 PostgreSQL Architecture for Data Engineers¶

PostgreSQL is often chosen when teams want rich SQL, strong standards orientation, advanced indexing, extensibility, and mature transactional behavior. It is widely used as an operational database, but it also appears in data engineering workflows as a metadata store, orchestration database, small warehouse, reverse-ETL target, and source for change data capture.

Figure 5:A PostgreSQL deployment combines client sessions, SQL planning and execution, shared memory, background processes, write-ahead logging, and persistent data files.

From a data engineering perspective, three PostgreSQL ideas matter immediately. First, the query planner chooses an execution plan based on query structure and statistics. PostgreSQL’s EXPLAIN documentation notes that the system devises a plan for each query and that choosing the right plan is critical for performance.^[7] Second, MVCC gives concurrent readers and writers a practical way to share the system without simple read/write blocking.^[6] Third, backup choices include SQL dumps, file-system-level backups, and continuous archiving, each with different trade-offs.^[8]

PostgreSQL capability	Why a data engineer should care
Rich SQL and extensions	Enables sophisticated transformations, validation queries, and operational analytics near the source.
MVCC	Helps explain snapshot behavior, replica reads, long-running queries, and vacuum-related operational issues.
`EXPLAIN` and planner statistics	Makes query performance a measurable engineering problem rather than guesswork.
Logical dump and continuous archiving options	Connects database operations to reproducible environments and recovery planning.
Strong constraint support	Keeps malformed operational records from contaminating downstream pipelines.

PostgreSQL is not magic. It needs vacuum maintenance, statistics, connection management, disk planning, and backup testing. The point is not that a data engineer must become a full-time database administrator. The point is that data pipelines depend on database behavior, and pipeline reliability improves when the engineer understands that behavior.

4.5 MySQL Architecture for Data Engineers¶

MySQL is common in web applications, commerce systems, content platforms, and SaaS products. Its default transactional engine, InnoDB, is the part most data engineers encounter when extracting operational data for analytics. The MySQL documentation for InnoDB covers ACID behavior, multiversioning, locking, in-memory structures, on-disk structures, backup/recovery, and replication.^[9]

Figure 6:A MySQL deployment with InnoDB separates SQL parsing and optimization from the storage engine that manages buffer pools, logs, indexes, locking, and persistent pages.

MySQL is especially important in data engineering because many organizations run application databases on MySQL while analytics happens elsewhere. A pipeline may read MySQL binlogs, run incremental extracts, replicate tables into a warehouse, or reconcile MySQL transactional data against a lakehouse fact table. Understanding InnoDB indexes, transactions, and backups helps the data engineer avoid extracts that overload the source.

MySQL/InnoDB capability	Why a data engineer should care
InnoDB transactional storage	Most production MySQL application data depends on InnoDB consistency and recovery semantics.
Binary logging and replication ecosystem	Many CDC and read-scaling patterns originate from MySQL replication concepts.
Index optimization guidance	MySQL explicitly warns that unnecessary indexes waste space and increase write costs.^[10]
Backup and recovery options	MySQL documentation emphasizes backups for crashes, hardware failures, accidental deletion, upgrades, transfers, and replicas.^[11]
Foreign-key support in InnoDB	Referential integrity may exist in the database or may be absent because the application enforces it instead; the data engineer must verify.

A practical data engineer does not ask whether PostgreSQL or MySQL is universally better. The better question is which system fits the application workload, operational team, ecosystem requirements, and long-term data architecture.

Figure 7:PostgreSQL and MySQL decisions should consider workload shape, ecosystem fit, operational expertise, extensibility needs, and existing platform standards.

4.6 Indexes, Query Plans, and Performance Discipline¶

Indexes are access structures that help the database find rows without scanning every row in a table. PostgreSQL’s documentation describes indexes as a common way to improve performance, while also warning that they add overhead and should be used sensibly.^[12] MySQL’s index optimization documentation makes the same engineering trade-off explicit: indexes can speed SELECT operations, but unnecessary indexes waste space and add cost to inserts, updates, and deletes.^[10]

The beginner’s mistake is to add indexes wherever a query feels slow. The professional workflow is different. First, identify the business query. Second, measure the query plan. Third, add the smallest useful index. Fourth, measure again. Fifth, watch write performance and storage growth.

Figure 8:Performance tuning is an evidence loop: observe the workload, inspect the plan, change one design element, measure again, and keep or revert the change.

For TuranMart, two obvious access patterns appear early. Customer support often looks up a customer’s recent orders, and finance groups completed revenue by region. The lab therefore creates indexes such as idx_orders_customer_date and idx_orders_status_region. These indexes are not random decorations. They reflect actual questions the application and finance team ask repeatedly.

Query pattern	Candidate index	Why it helps
Recent orders for one customer	`orders(customer_id, order_date DESC)`	Filters by customer and returns records in useful date order.
Completed revenue by region	`orders(status, region)`	Filters by status and groups completed orders by region.
Product-level order analysis	`order_items(product_id)`	Helps join or filter line items by product.

EXPLAIN does not simply say whether a query is good or bad. It shows the plan the optimizer selected, including scan types, row estimates, joins, sorts, and cost estimates. PostgreSQL’s documentation describes the plan as a tree of plan nodes, where lower nodes retrieve rows and upper nodes join, aggregate, or sort them.^[7] The data engineer should learn to recognize sequential scans, index scans, filters, joins, and aggregation nodes, then connect those observations to workload evidence.

4.7 Schema Migration as a Production Discipline¶

No relational schema survives first contact with a growing business. TuranMart’s first schema records orders, but the finance team soon asks for payment reconciliation. The engineering team must add a payments table, connect it to orders, and create a query that compares captured payments with completed order totals.

A migration is more than an ALTER TABLE statement. It is a controlled change to a shared contract. Application code, analytics jobs, dashboards, reverse-ETL syncs, data quality checks, and documentation may all depend on the schema. A safe migration therefore includes a reason, a reviewed SQL change, a test dataset, a rollback or forward-fix plan, and a post-deploy validation query.

Migration concern	Practical question
Compatibility	Can old and new application versions operate during deployment?
Backfill	Do existing rows need new values, and can the update run safely?
Locking	Will the change block writes or reads on a large table?
Validation	What query proves the new table or column is correct?
Rollback	Is it safer to revert, disable usage, or apply a forward fix?

In the Chapter 4 lab, the migration file 04_migration_add_payments.sql starts a transaction, creates a payments table, creates an index on (order_id, payment_status), inserts sample captured payments, and runs a reconciliation query. This small migration models a large production habit: database changes should be auditable and testable.

4.8 Backup, Restore, and Recovery Objectives¶

Backups are only useful if they can be restored. PostgreSQL’s backup documentation states that databases containing valuable data should be backed up regularly and identifies three broad approaches: SQL dump, file-system-level backup, and continuous archiving.^[8] MySQL’s backup documentation makes the operational motivation explicit: backups support recovery from crashes, hardware failures, accidental deletion, upgrades, transfers, replica setup, scheduling, compression, encryption, and point-in-time recovery.^[11]

Data engineers should know two recovery metrics. Recovery Point Objective (RPO) describes how much data the business can afford to lose. Recovery Time Objective (RTO) describes how long the business can wait before service is restored. These objectives should shape backup frequency, retention, replication, restore automation, and incident drills.

Recovery concept	TuranMart example	Architecture implication
RPO	“We can lose at most five minutes of order data.”	Continuous archiving, replication, or CDC may be required.
RTO	“Checkout must be restored within thirty minutes.”	Restore runbooks and tested automation are required.
Retention	“Finance may request month-end snapshots.”	Backup lifecycle policy must match audit and reporting needs.
Restore test	“A backup is restored into staging every week.”	Confidence comes from rehearsal, not from backup files alone.

A relational database backup strategy is part of the data platform. If the operational database is lost, downstream analytics can report history but cannot safely recreate the exact current state unless all changes were captured and replayable. For that reason, backup/restore practice belongs in the same conversation as ingestion, orchestration, and lineage.

4.9 Production Design Pattern: OLTP Source Feeding Analytics¶

The most common data engineering pattern begins with an OLTP database and ends with an analytical model. The application writes normalized operational records into PostgreSQL or MySQL. A data pipeline extracts changes, lands them in object storage or a warehouse, transforms them into facts and dimensions, and serves business metrics.

The danger is to let analytical convenience leak backward into the operational schema too early. For example, finance may want one wide order_report table with customer, product, order, payment, and shipment fields. That shape may be useful in a warehouse, but it is risky as the primary checkout schema because it duplicates facts and weakens constraints. The operational schema should protect state changes. The analytical model should optimize reading.

Layer	Data shape	Main optimization goal
OLTP database	Normalized tables with constraints and transactions.	Correct writes and reliable operational state.
Raw ingestion zone	Source-shaped tables or files with metadata.	Completeness, replayability, lineage, and auditability.
Curated warehouse/lakehouse	Facts, dimensions, aggregates, or data products.	Fast analytical queries and business usability.
BI and reverse ETL	Metrics, segments, extracts, and operational syncs.	Decision support and activation.

This pattern also clarifies team responsibilities. Application engineers own transaction boundaries and user-facing correctness. Database administrators or platform engineers own database operations. Data engineers own extraction, validation, transformation, and downstream contracts. The boundaries overlap, so communication is part of the architecture.

Guided Lab: Build and Validate TuranMart’s Relational Order Database¶

In this lab, you will work with the Chapter 4 relational database assets. The lab is intentionally small enough to run locally, but it reflects production activities: creating a schema, seeding data, inspecting query plans, adding indexes, applying a migration, reconciling payments, and reviewing backup/restore practice.

Lab element	Location
Lab README	`shared/labs/ch04_relational_databases/README.md`
Docker Compose services	`shared/labs/ch04_relational_databases/docker-compose.yml`
Schema SQL	`shared/labs/ch04_relational_databases/sql/01_schema.sql`
Seed SQL	`shared/labs/ch04_relational_databases/sql/02_seed.sql`
Indexes and query plans	`shared/labs/ch04_relational_databases/sql/03_indexes_and_queries.sql`
Payments migration	`shared/labs/ch04_relational_databases/sql/04_migration_add_payments.sql`
Backup/restore checklist	`shared/labs/ch04_relational_databases/05_backup_restore.md`
Expected output	`shared/labs/ch04_relational_databases/expected_output/revenue_by_region.csv`
Validator	`shared/labs/ch04_relational_databases/tests/validate_lab_assets.py`

Lab Setup¶

Start from the repository root and inspect the lab package before running commands. The Compose file provides PostgreSQL 16 and MySQL 8.4 services so that you can compare operational behavior while keeping the chapter’s primary SQL path focused on PostgreSQL.

cd shared/labs/ch04_relational_databases
python3 tests/validate_lab_assets.py
docker compose up -d

The validator should confirm that required SQL, data, expected output, and supporting files are present. If Docker is not available in your environment, read the SQL files in order and complete the design questions manually. The conceptual learning still applies.

Step 1: Create the Schema¶

Apply the schema file to PostgreSQL. The schema creates customers, products, orders, and order_items with primary keys, foreign keys, not-null rules, and check constraints.

psql postgresql://postgres:postgres@localhost:5432/turanmart \
  -f sql/01_schema.sql

After applying the schema, inspect the table definitions. Confirm that orders.customer_id references customers.customer_id, that order status values are constrained, and that order_items uses a composite primary key.

Step 2: Seed Operational Data¶

Load the deterministic sample data.

psql postgresql://postgres:postgres@localhost:5432/turanmart \
  -f sql/02_seed.sql

This dataset is small on purpose. A small dataset lets you reason carefully about each record before scaling the same pattern to millions of rows.

Step 3: Inspect Query Plans and Add Indexes¶

Run the query-plan script. It shows the revenue and lookup queries before and after index creation.

psql postgresql://postgres:postgres@localhost:5432/turanmart \
  -f sql/03_indexes_and_queries.sql

Do not treat the index statements as magic. For each index, write one sentence explaining which query pattern it supports and which write operation may become slightly more expensive because the index must also be maintained.

Step 4: Verify the Revenue Output¶

Run the final revenue query and compare it with the expected output.

region,completed_revenue
Samarkand,420.00
Tashkent,900.00

The expected result is stored at expected_output/revenue_by_region.csv. A mismatch means either the seed data changed, the filter on completed orders changed, or the aggregation query no longer matches the lab contract.

Step 5: Apply the Payments Migration¶

Apply the migration file that adds payments and performs a reconciliation query.

psql postgresql://postgres:postgres@localhost:5432/turanmart \
  -f sql/04_migration_add_payments.sql

Study the transaction boundary in the migration. The goal is not only to create a table; it is to change the operational contract safely. In a production setting, you would also review lock behavior, backfill strategy, rollout order, monitoring, and rollback options.

Step 6: Review Backup and Restore¶

Read 05_backup_restore.md and explain the difference between creating a backup and proving a backup can be restored. Then write a short RPO/RTO statement for TuranMart’s checkout database.

A strong answer connects business impact to engineering design. For example, “TuranMart can lose at most five minutes of confirmed order data and must restore checkout within thirty minutes; therefore, daily logical dumps alone are insufficient for the production checkout database.”

Common Pitfalls and Operational Lessons¶

Relational databases are mature, but teams still fail with them in predictable ways. Most failures come from treating the database as a passive storage bucket rather than an active consistency boundary.

Pitfall	Why it hurts	Better practice
Missing constraints	Bad records enter the source and spread downstream.	Encode required business rules with keys, checks, not-null rules, and reference data.
Indexing every column	Writes slow down and storage grows without evidence of query benefit.	Index measured access patterns and review unused indexes.
Ignoring transaction boundaries	Partial state becomes visible or durable.	Group operations that must succeed or fail together.
Running migrations casually	Application code, pipelines, and dashboards break unexpectedly.	Version migrations, test them, and communicate schema contracts.
Backing up without restoring	Teams discover too late that backup files are incomplete or unusable.	Schedule restore drills and document recovery steps.
Extracting too aggressively	Analytics jobs overload the operational source.	Use replicas, CDC, incremental extraction, and workload-aware scheduling.

The most important operational lesson is that a relational database is a shared dependency. Application teams, data teams, finance teams, and customer support may all rely on the same records. Changes to schema, indexes, retention, or extraction schedules should therefore be treated as product changes with users, not as private implementation details.

Mini-Capstone: Design a Relational Source for a Regional Marketplace¶

Design a relational source schema for a marketplace that supports customers, sellers, products, orders, order items, payments, and refunds. Your design should include primary keys, foreign keys, at least three check constraints, and at least two indexes tied to named query patterns.

Your submission should include a one-page explanation of transaction boundaries for checkout and refund workflows, a migration plan for adding refunds after the initial launch, and a backup/restore policy with RPO and RTO assumptions. The design will be considered complete when another engineer can explain which facts belong in the OLTP database and which facts should be modeled later in the analytical warehouse.

Acceptance criterion	Evidence required
Correct relational model	Entity list, table definitions, primary keys, foreign keys, and relationship explanation.
Integrity rules	Check constraints and not-null rules tied to business meaning.
Transaction design	Clear checkout and refund transaction boundaries.
Performance reasoning	Two indexes justified by query patterns and write-cost trade-offs.
Migration safety	Step-by-step migration plan with validation and rollback or forward-fix approach.
Recovery thinking	RPO, RTO, backup type, retention period, and restore test cadence.

Exercises¶

Extend the TuranMart schema with a shipments table. Define the primary key, foreign key to orders, allowed shipment statuses, and one index that supports operations lookup.
Rewrite the completed revenue query so that it reports revenue by registration_channel instead of region. Explain which join is required and why.
Add a refunds table on paper. Decide whether refunds reference orders, payments, or both, and justify the choice.
Choose one query from the lab and explain how the query plan changes before and after indexing. Use plain English rather than copying the entire plan output.
Write a short incident scenario in which a backup exists but recovery still fails. Identify the missing operational practice.
Compare PostgreSQL and MySQL for a small SaaS product whose team already has strong MySQL expertise but needs advanced analytical SQL in the operational database. Recommend a decision and explain the trade-off.

Review Questions¶

Question	What a strong answer should include
Why do relational databases remain important in modern data platforms?	They protect operational truth through schemas, constraints, transactions, SQL, and recovery, while downstream systems optimize analytics and scale-out processing.
What is the difference between a primary key and a foreign key?	A primary key identifies rows in its own table; a foreign key references another table’s key to enforce relationships.
Why should indexes be added carefully?	They speed selected read patterns but consume storage and add maintenance cost to writes.
What does `EXPLAIN` help you understand?	The query plan selected by the optimizer, including scans, joins, filters, estimates, and costs.
Why is a migration a production discipline?	It changes a shared schema contract and can affect applications, pipelines, dashboards, locks, and recovery.
Why must backups be restored during tests?	Backup files alone do not prove recoverability; restore drills validate completeness, procedures, permissions, timing, and operational readiness.

Summary and Next Step¶

Relational databases are not merely legacy systems that sit behind applications. They are active guardians of operational truth. PostgreSQL and MySQL show how the relational model becomes production infrastructure through SQL, constraints, transactions, query planning, indexing, migrations, and backup/recovery practice.

In this chapter, TuranMart’s checkout problem gave us a practical reason to design a normalized OLTP schema, enforce business rules, read query plans, add measured indexes, introduce a payments migration, and reason about recovery. The larger lesson is that data engineering starts before data reaches the warehouse. It starts when the first operational record is created correctly.

The next chapter expands the storage conversation from relational databases to NoSQL and semi-structured data. You will learn why document, key-value, column-family, and graph systems exist, what trade-offs they make, and how to decide when relational structure is a strength versus when flexible or distributed models better fit the workload.

References¶

Footnotes¶

PostgreSQL Global Development Group, “PostgreSQL Tutorial,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/tutorial.html
↩
PostgreSQL Global Development Group, “Constraints,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/ddl-constraints.html
↩
Oracle, “FOREIGN KEY Constraints,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/constraint-foreign-key.html
↩
PostgreSQL Global Development Group, “Transactions,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/tutorial-transactions.html
↩
Oracle, “MySQL and the ACID Model,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/mysql-acid.html
↩
PostgreSQL Global Development Group, “Introduction to MVCC,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/mvcc-intro.html
↩↩
PostgreSQL Global Development Group, “Using EXPLAIN,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/using-explain.html
↩↩
PostgreSQL Global Development Group, “Backup and Restore,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/backup.html
↩↩
Oracle, “Introduction to InnoDB,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/innodb-storage-engine.html
↩
Oracle, “Optimization and Indexes,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/optimization-indexes.html
↩↩
Oracle, “Backup and Recovery,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/backup-and-recovery.html
↩↩
PostgreSQL Global Development Group, “Indexes,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/indexes.html
↩