Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Chapter 4: Relational Databases with PostgreSQL and MySQL

Reliable data products begin with reliable records. Before a lakehouse can model customer behavior, before a streaming platform can publish events, and before an analytics dashboard can explain revenue, someone must decide how the business records are created, validated, queried, changed, backed up, and recovered. Relational databases remain the standard foundation for this work because they combine a precise data model, a mature query language, transactional guarantees, and decades of operational practice.

This chapter moves from the foundations of Part I into the storage systems of Part II. We will use PostgreSQL and MySQL not as competing brands, but as two production-grade implementations of the relational idea. By the end of the chapter, you will be able to design a small operational schema, reason about keys and constraints, read simple query plans, choose indexes with discipline, perform a controlled migration, and explain why backup/restore testing is part of data engineering rather than an afterthought.

Opening Scenario: TuranMart’s Checkout Problem

TuranMart, our fictional regional e-commerce company, has reached the point where spreadsheets and application logs are no longer enough. The mobile application accepts orders from Tashkent, Samarkand, Bukhara, and Fergana. Customers expect a checkout confirmation immediately after payment. Finance expects daily revenue by region. Operations expects a list of orders that are pending, completed, shipped, or cancelled. Marketing wants to know which registration channels drive the highest-value customers.

The team’s first instinct is to stream every click into the analytics platform. That will be useful later, but it does not solve the immediate business risk. TuranMart first needs a system of record. The checkout service must know whether a customer exists, whether the products are active, whether an order contains at least one item, whether the total is non-negative, whether a payment was captured, and whether the database can be restored after a failure. These are not only analytics questions; they are operational integrity questions.

Chapter 4 connects relational modeling, transactions, indexing, migrations, and recovery into one operational workflow for TuranMart’s order system.

Figure 1:Chapter 4 connects relational modeling, transactions, indexing, migrations, and recovery into one operational workflow for TuranMart’s order system.

The architecture lesson is simple but important. A data lake stores historical facts, a warehouse serves analytical queries, and a stream processor reacts to events, but the relational database often remains the place where the business first decides what is true. If the order database accepts impossible data, every downstream system inherits that confusion.

Learning Objectives

After completing this chapter, you should be able to explain why relational databases remain central to production data platforms and how their guarantees differ from file-based storage, document stores, and analytical engines.

ObjectiveWhat you should be able to do in practice
Model relational dataTranslate business entities into tables, primary keys, foreign keys, constraints, and relationship patterns.
Compare PostgreSQL and MySQLExplain the operational strengths of PostgreSQL and MySQL without reducing the decision to popularity or habit.
Use transactions correctlyDescribe atomicity, consistency, isolation, and durability, and identify where transaction boundaries belong in an application workflow.
Optimize responsiblyRead basic EXPLAIN output, choose useful indexes, and avoid the common mistake of indexing every column.
Manage schema changeApply a small migration in a controlled way and understand why migrations require review, testing, and rollback planning.
Practice recoveryConnect logical backups, restore checks, and recovery objectives to business continuity.

4.1 Why Relational Databases Still Matter

Relational databases organize data as relations, commonly implemented as tables. Each table contains rows, each row represents a fact about an entity or event, and each column has a defined meaning and type. PostgreSQL’s own tutorial introduces the system through relational concepts and the SQL language, which is precisely why it remains a useful teaching platform for data engineers as well as application developers.[1]

The relational approach is durable because it solves a recurring organizational problem. Businesses do not only need to store data; they need to agree on rules. A customer should have one identity. An order should belong to an existing customer. A product price should not be negative. A completed payment should not disappear if the server restarts. These requirements are easier to enforce when the data model, constraints, transactions, and recovery mechanisms live close to the records they protect.

Definition. In this book, a relational database is an operational data system that stores structured records in tables, exposes them through SQL, enforces integrity through keys and constraints, and protects state changes through transactions, logging, and recovery.

The relational database is not the only storage system in a modern data platform. It is, however, usually the most disciplined place to model operational truth. The data engineer should therefore understand relational databases even when the day-to-day job focuses on Airflow pipelines, lakehouse tables, Kafka topics, or BI semantic layers.

Platform roleTypical relational responsibilityDownstream dependency
Checkout and order captureValidate orders, customers, products, payments, and statuses.Event streams, finance marts, fulfillment dashboards.
Customer master dataPreserve customer identity, consent, contact fields, and account state.Marketing segmentation, personalization, compliance exports.
Inventory and catalogMaintain product identity, availability, price, and category rules.Search indexes, recommendation systems, procurement reports.
Finance operationsRecord invoices, payments, refunds, and reconciliations.Accounting reports, fraud monitoring, revenue analytics.

A popular misunderstanding says that relational databases are old while big data systems are modern. A better mental model is that relational databases and analytical platforms answer different questions. Relational systems are optimized for correct state transitions in operational workflows. Analytical systems are optimized for large-scale reading, aggregation, and historical analysis. A mature data architecture uses both.

4.2 The Relational Model in Practice

A relational schema is a contract between the business and the database. The contract names the entities, defines the attributes, and states which combinations of records are allowed. PostgreSQL’s constraints documentation emphasizes that data types alone are often too coarse; SQL constraints give designers more precise control and raise errors when a proposed row violates the rule.[2]

For TuranMart, the first operational schema has four core tables: customers, products, orders, and order_items. This is intentionally small, but it already contains the ideas that appear in much larger production systems.

TuranMart’s OLTP schema separates customers, products, orders, and line items so that identity, validity, and relationships can be enforced close to the data.

Figure 2:TuranMart’s OLTP schema separates customers, products, orders, and line items so that identity, validity, and relationships can be enforced close to the data.

4.2.1 Tables, Keys, and Constraints

A table should represent one kind of thing or event. A primary key identifies each row. A foreign key links rows across tables. A check constraint limits valid values. A not-null constraint says that missing values are not acceptable for a specific attribute. These definitions may sound theoretical, but they become concrete when an application tries to insert bad data at 02:00 during a sales campaign.

Design elementExample in TuranMartEngineering purpose
Primary keycustomers.customer_idGives every customer a stable identity.
Foreign keyorders.customer_id references customers.customer_idPrevents an order from belonging to a non-existent customer.
Composite primary keyorder_items(order_id, product_id)Prevents duplicate product lines for the same order in the starter schema.
Check constraintstatus IN ('pending', 'completed', 'shipped', 'cancelled')Prevents inconsistent status values such as done, complete, or closed.
Not-null constraintorders.order_date NOT NULLMakes required business facts explicit.

Foreign keys are especially important for data engineering because they document lineage inside the operational model. MySQL’s documentation describes foreign keys as a way to cross-reference related data and keep distributed table data consistent; InnoDB also requires foreign-key columns to be indexed, which links integrity design to physical access paths.[3]

4.2.2 Relationship Patterns

Most relational schemas use a small set of relationship patterns. A one-to-one relationship is useful when a row has an optional or sensitive extension. A one-to-many relationship is the everyday pattern: one customer can have many orders, and one order can have many order items. A many-to-many relationship is represented through a junction table, such as student_courses, product_promotions, or user_roles.

Relational schemas express one-to-one, one-to-many, and many-to-many business rules through primary keys, foreign keys, uniqueness constraints, and junction tables.

Figure 3:Relational schemas express one-to-one, one-to-many, and many-to-many business rules through primary keys, foreign keys, uniqueness constraints, and junction tables.

The TuranMart schema uses one-to-many relationships from customers to orders and from orders to order items. It also uses order_items as a line-item table that connects orders to products. In a later version, the team may allow multiple discounts per item, split shipments, refunds, or partial captures. The relational model can support those changes, but only if each change is introduced deliberately.

4.2.3 Normalization and the Cost of Duplication

Normalization is the process of organizing data so that each fact is stored in the right place, with minimal unnecessary duplication. If the product name is copied into every order item, a later product rename creates ambiguity. If the customer region is copied into both the customer and order tables, the team must decide whether the order should record the region at checkout time or the customer’s current region. That decision is a business rule, not merely a technical detail.

Relational design moves from normalized correctness toward selective performance optimizations only when query evidence justifies the trade-off.

Figure 4:Relational design moves from normalized correctness toward selective performance optimizations only when query evidence justifies the trade-off.

A normalized schema is not automatically fast, and a denormalized schema is not automatically wrong. The discipline is to begin with correctness, measure real access patterns, and then optimize where the workload proves the need. In operational systems, premature denormalization often creates long-term data quality problems that are more expensive than the performance issue it was meant to solve.

4.3 Transactions and ACID Guarantees

A transaction bundles multiple database operations into a single unit of work. PostgreSQL’s transaction tutorial states the central idea clearly: a transaction groups multiple steps into an all-or-nothing operation, hides intermediate states from concurrent transactions, and cancels partial changes if the transaction cannot complete.[4]

For TuranMart, checkout should not create an order without its order items, capture a payment without an order, or mark an order as completed while the payment failed. The application should define a transaction boundary around the state change that must succeed or fail together.

ACID propertyMeaning for TuranMart checkoutExample failure it prevents
AtomicityRelated writes commit together or not at all.Order header exists but line items are missing.
ConsistencyConstraints preserve valid database state.Order status is set to an unsupported value.
IsolationConcurrent transactions do not see unsafe intermediate state.Finance reads half-finished checkout data.
DurabilityCommitted changes survive crashes according to the database’s recovery model.Confirmed order disappears after restart.

MySQL’s documentation describes the ACID model as a set of reliability principles for business data and mission-critical applications, with InnoDB providing features that closely adhere to those principles.[5] PostgreSQL implements transaction isolation through a multiversion model. Its MVCC documentation explains that each SQL statement sees a snapshot of the database and that readers do not block writers while writers do not block readers under normal MVCC behavior.[6]

The engineering implication is that transaction design belongs in architecture reviews. Teams should ask which state changes belong in one transaction, which operations can be eventually consistent, and where idempotency is required because a message, job, or API request may be retried.

4.4 PostgreSQL Architecture for Data Engineers

PostgreSQL is often chosen when teams want rich SQL, strong standards orientation, advanced indexing, extensibility, and mature transactional behavior. It is widely used as an operational database, but it also appears in data engineering workflows as a metadata store, orchestration database, small warehouse, reverse-ETL target, and source for change data capture.

A PostgreSQL deployment combines client sessions, SQL planning and execution, shared memory, background processes, write-ahead logging, and persistent data files.

Figure 5:A PostgreSQL deployment combines client sessions, SQL planning and execution, shared memory, background processes, write-ahead logging, and persistent data files.

From a data engineering perspective, three PostgreSQL ideas matter immediately. First, the query planner chooses an execution plan based on query structure and statistics. PostgreSQL’s EXPLAIN documentation notes that the system devises a plan for each query and that choosing the right plan is critical for performance.[7] Second, MVCC gives concurrent readers and writers a practical way to share the system without simple read/write blocking.[6] Third, backup choices include SQL dumps, file-system-level backups, and continuous archiving, each with different trade-offs.[8]

PostgreSQL capabilityWhy a data engineer should care
Rich SQL and extensionsEnables sophisticated transformations, validation queries, and operational analytics near the source.
MVCCHelps explain snapshot behavior, replica reads, long-running queries, and vacuum-related operational issues.
EXPLAIN and planner statisticsMakes query performance a measurable engineering problem rather than guesswork.
Logical dump and continuous archiving optionsConnects database operations to reproducible environments and recovery planning.
Strong constraint supportKeeps malformed operational records from contaminating downstream pipelines.

PostgreSQL is not magic. It needs vacuum maintenance, statistics, connection management, disk planning, and backup testing. The point is not that a data engineer must become a full-time database administrator. The point is that data pipelines depend on database behavior, and pipeline reliability improves when the engineer understands that behavior.

4.5 MySQL Architecture for Data Engineers

MySQL is common in web applications, commerce systems, content platforms, and SaaS products. Its default transactional engine, InnoDB, is the part most data engineers encounter when extracting operational data for analytics. The MySQL documentation for InnoDB covers ACID behavior, multiversioning, locking, in-memory structures, on-disk structures, backup/recovery, and replication.[9]

A MySQL deployment with InnoDB separates SQL parsing and optimization from the storage engine that manages buffer pools, logs, indexes, locking, and persistent pages.

Figure 6:A MySQL deployment with InnoDB separates SQL parsing and optimization from the storage engine that manages buffer pools, logs, indexes, locking, and persistent pages.

MySQL is especially important in data engineering because many organizations run application databases on MySQL while analytics happens elsewhere. A pipeline may read MySQL binlogs, run incremental extracts, replicate tables into a warehouse, or reconcile MySQL transactional data against a lakehouse fact table. Understanding InnoDB indexes, transactions, and backups helps the data engineer avoid extracts that overload the source.

MySQL/InnoDB capabilityWhy a data engineer should care
InnoDB transactional storageMost production MySQL application data depends on InnoDB consistency and recovery semantics.
Binary logging and replication ecosystemMany CDC and read-scaling patterns originate from MySQL replication concepts.
Index optimization guidanceMySQL explicitly warns that unnecessary indexes waste space and increase write costs.[10]
Backup and recovery optionsMySQL documentation emphasizes backups for crashes, hardware failures, accidental deletion, upgrades, transfers, and replicas.[11]
Foreign-key support in InnoDBReferential integrity may exist in the database or may be absent because the application enforces it instead; the data engineer must verify.

A practical data engineer does not ask whether PostgreSQL or MySQL is universally better. The better question is which system fits the application workload, operational team, ecosystem requirements, and long-term data architecture.

PostgreSQL and MySQL decisions should consider workload shape, ecosystem fit, operational expertise, extensibility needs, and existing platform standards.

Figure 7:PostgreSQL and MySQL decisions should consider workload shape, ecosystem fit, operational expertise, extensibility needs, and existing platform standards.

4.6 Indexes, Query Plans, and Performance Discipline

Indexes are access structures that help the database find rows without scanning every row in a table. PostgreSQL’s documentation describes indexes as a common way to improve performance, while also warning that they add overhead and should be used sensibly.[12] MySQL’s index optimization documentation makes the same engineering trade-off explicit: indexes can speed SELECT operations, but unnecessary indexes waste space and add cost to inserts, updates, and deletes.[10]

The beginner’s mistake is to add indexes wherever a query feels slow. The professional workflow is different. First, identify the business query. Second, measure the query plan. Third, add the smallest useful index. Fourth, measure again. Fifth, watch write performance and storage growth.

Performance tuning is an evidence loop: observe the workload, inspect the plan, change one design element, measure again, and keep or revert the change.

Figure 8:Performance tuning is an evidence loop: observe the workload, inspect the plan, change one design element, measure again, and keep or revert the change.

For TuranMart, two obvious access patterns appear early. Customer support often looks up a customer’s recent orders, and finance groups completed revenue by region. The lab therefore creates indexes such as idx_orders_customer_date and idx_orders_status_region. These indexes are not random decorations. They reflect actual questions the application and finance team ask repeatedly.

Query patternCandidate indexWhy it helps
Recent orders for one customerorders(customer_id, order_date DESC)Filters by customer and returns records in useful date order.
Completed revenue by regionorders(status, region)Filters by status and groups completed orders by region.
Product-level order analysisorder_items(product_id)Helps join or filter line items by product.

EXPLAIN does not simply say whether a query is good or bad. It shows the plan the optimizer selected, including scan types, row estimates, joins, sorts, and cost estimates. PostgreSQL’s documentation describes the plan as a tree of plan nodes, where lower nodes retrieve rows and upper nodes join, aggregate, or sort them.[7] The data engineer should learn to recognize sequential scans, index scans, filters, joins, and aggregation nodes, then connect those observations to workload evidence.

4.7 Schema Migration as a Production Discipline

No relational schema survives first contact with a growing business. TuranMart’s first schema records orders, but the finance team soon asks for payment reconciliation. The engineering team must add a payments table, connect it to orders, and create a query that compares captured payments with completed order totals.

A migration is more than an ALTER TABLE statement. It is a controlled change to a shared contract. Application code, analytics jobs, dashboards, reverse-ETL syncs, data quality checks, and documentation may all depend on the schema. A safe migration therefore includes a reason, a reviewed SQL change, a test dataset, a rollback or forward-fix plan, and a post-deploy validation query.

Migration concernPractical question
CompatibilityCan old and new application versions operate during deployment?
BackfillDo existing rows need new values, and can the update run safely?
LockingWill the change block writes or reads on a large table?
ValidationWhat query proves the new table or column is correct?
RollbackIs it safer to revert, disable usage, or apply a forward fix?

In the Chapter 4 lab, the migration file 04_migration_add_payments.sql starts a transaction, creates a payments table, creates an index on (order_id, payment_status), inserts sample captured payments, and runs a reconciliation query. This small migration models a large production habit: database changes should be auditable and testable.

4.8 Backup, Restore, and Recovery Objectives

Backups are only useful if they can be restored. PostgreSQL’s backup documentation states that databases containing valuable data should be backed up regularly and identifies three broad approaches: SQL dump, file-system-level backup, and continuous archiving.[8] MySQL’s backup documentation makes the operational motivation explicit: backups support recovery from crashes, hardware failures, accidental deletion, upgrades, transfers, replica setup, scheduling, compression, encryption, and point-in-time recovery.[11]

Data engineers should know two recovery metrics. Recovery Point Objective (RPO) describes how much data the business can afford to lose. Recovery Time Objective (RTO) describes how long the business can wait before service is restored. These objectives should shape backup frequency, retention, replication, restore automation, and incident drills.

Recovery conceptTuranMart exampleArchitecture implication
RPO“We can lose at most five minutes of order data.”Continuous archiving, replication, or CDC may be required.
RTO“Checkout must be restored within thirty minutes.”Restore runbooks and tested automation are required.
Retention“Finance may request month-end snapshots.”Backup lifecycle policy must match audit and reporting needs.
Restore test“A backup is restored into staging every week.”Confidence comes from rehearsal, not from backup files alone.

A relational database backup strategy is part of the data platform. If the operational database is lost, downstream analytics can report history but cannot safely recreate the exact current state unless all changes were captured and replayable. For that reason, backup/restore practice belongs in the same conversation as ingestion, orchestration, and lineage.

4.9 Production Design Pattern: OLTP Source Feeding Analytics

The most common data engineering pattern begins with an OLTP database and ends with an analytical model. The application writes normalized operational records into PostgreSQL or MySQL. A data pipeline extracts changes, lands them in object storage or a warehouse, transforms them into facts and dimensions, and serves business metrics.

The danger is to let analytical convenience leak backward into the operational schema too early. For example, finance may want one wide order_report table with customer, product, order, payment, and shipment fields. That shape may be useful in a warehouse, but it is risky as the primary checkout schema because it duplicates facts and weakens constraints. The operational schema should protect state changes. The analytical model should optimize reading.

LayerData shapeMain optimization goal
OLTP databaseNormalized tables with constraints and transactions.Correct writes and reliable operational state.
Raw ingestion zoneSource-shaped tables or files with metadata.Completeness, replayability, lineage, and auditability.
Curated warehouse/lakehouseFacts, dimensions, aggregates, or data products.Fast analytical queries and business usability.
BI and reverse ETLMetrics, segments, extracts, and operational syncs.Decision support and activation.

This pattern also clarifies team responsibilities. Application engineers own transaction boundaries and user-facing correctness. Database administrators or platform engineers own database operations. Data engineers own extraction, validation, transformation, and downstream contracts. The boundaries overlap, so communication is part of the architecture.

Guided Lab: Build and Validate TuranMart’s Relational Order Database

In this lab, you will work with the Chapter 4 relational database assets. The lab is intentionally small enough to run locally, but it reflects production activities: creating a schema, seeding data, inspecting query plans, adding indexes, applying a migration, reconciling payments, and reviewing backup/restore practice.

Lab elementLocation
Lab READMEshared/labs/ch04_relational_databases/README.md
Docker Compose servicesshared/labs/ch04_relational_databases/docker-compose.yml
Schema SQLshared/labs/ch04_relational_databases/sql/01_schema.sql
Seed SQLshared/labs/ch04_relational_databases/sql/02_seed.sql
Indexes and query plansshared/labs/ch04_relational_databases/sql/03_indexes_and_queries.sql
Payments migrationshared/labs/ch04_relational_databases/sql/04_migration_add_payments.sql
Backup/restore checklistshared/labs/ch04_relational_databases/05_backup_restore.md
Expected outputshared/labs/ch04_relational_databases/expected_output/revenue_by_region.csv
Validatorshared/labs/ch04_relational_databases/tests/validate_lab_assets.py

Lab Setup

Start from the repository root and inspect the lab package before running commands. The Compose file provides PostgreSQL 16 and MySQL 8.4 services so that you can compare operational behavior while keeping the chapter’s primary SQL path focused on PostgreSQL.

cd shared/labs/ch04_relational_databases
python3 tests/validate_lab_assets.py
docker compose up -d

The validator should confirm that required SQL, data, expected output, and supporting files are present. If Docker is not available in your environment, read the SQL files in order and complete the design questions manually. The conceptual learning still applies.

Step 1: Create the Schema

Apply the schema file to PostgreSQL. The schema creates customers, products, orders, and order_items with primary keys, foreign keys, not-null rules, and check constraints.

psql postgresql://postgres:postgres@localhost:5432/turanmart \
  -f sql/01_schema.sql

After applying the schema, inspect the table definitions. Confirm that orders.customer_id references customers.customer_id, that order status values are constrained, and that order_items uses a composite primary key.

Step 2: Seed Operational Data

Load the deterministic sample data.

psql postgresql://postgres:postgres@localhost:5432/turanmart \
  -f sql/02_seed.sql

This dataset is small on purpose. A small dataset lets you reason carefully about each record before scaling the same pattern to millions of rows.

Step 3: Inspect Query Plans and Add Indexes

Run the query-plan script. It shows the revenue and lookup queries before and after index creation.

psql postgresql://postgres:postgres@localhost:5432/turanmart \
  -f sql/03_indexes_and_queries.sql

Do not treat the index statements as magic. For each index, write one sentence explaining which query pattern it supports and which write operation may become slightly more expensive because the index must also be maintained.

Step 4: Verify the Revenue Output

Run the final revenue query and compare it with the expected output.

region,completed_revenue
Samarkand,420.00
Tashkent,900.00

The expected result is stored at expected_output/revenue_by_region.csv. A mismatch means either the seed data changed, the filter on completed orders changed, or the aggregation query no longer matches the lab contract.

Step 5: Apply the Payments Migration

Apply the migration file that adds payments and performs a reconciliation query.

psql postgresql://postgres:postgres@localhost:5432/turanmart \
  -f sql/04_migration_add_payments.sql

Study the transaction boundary in the migration. The goal is not only to create a table; it is to change the operational contract safely. In a production setting, you would also review lock behavior, backfill strategy, rollout order, monitoring, and rollback options.

Step 6: Review Backup and Restore

Read 05_backup_restore.md and explain the difference between creating a backup and proving a backup can be restored. Then write a short RPO/RTO statement for TuranMart’s checkout database.

A strong answer connects business impact to engineering design. For example, “TuranMart can lose at most five minutes of confirmed order data and must restore checkout within thirty minutes; therefore, daily logical dumps alone are insufficient for the production checkout database.”

Common Pitfalls and Operational Lessons

Relational databases are mature, but teams still fail with them in predictable ways. Most failures come from treating the database as a passive storage bucket rather than an active consistency boundary.

PitfallWhy it hurtsBetter practice
Missing constraintsBad records enter the source and spread downstream.Encode required business rules with keys, checks, not-null rules, and reference data.
Indexing every columnWrites slow down and storage grows without evidence of query benefit.Index measured access patterns and review unused indexes.
Ignoring transaction boundariesPartial state becomes visible or durable.Group operations that must succeed or fail together.
Running migrations casuallyApplication code, pipelines, and dashboards break unexpectedly.Version migrations, test them, and communicate schema contracts.
Backing up without restoringTeams discover too late that backup files are incomplete or unusable.Schedule restore drills and document recovery steps.
Extracting too aggressivelyAnalytics jobs overload the operational source.Use replicas, CDC, incremental extraction, and workload-aware scheduling.

The most important operational lesson is that a relational database is a shared dependency. Application teams, data teams, finance teams, and customer support may all rely on the same records. Changes to schema, indexes, retention, or extraction schedules should therefore be treated as product changes with users, not as private implementation details.

Mini-Capstone: Design a Relational Source for a Regional Marketplace

Design a relational source schema for a marketplace that supports customers, sellers, products, orders, order items, payments, and refunds. Your design should include primary keys, foreign keys, at least three check constraints, and at least two indexes tied to named query patterns.

Your submission should include a one-page explanation of transaction boundaries for checkout and refund workflows, a migration plan for adding refunds after the initial launch, and a backup/restore policy with RPO and RTO assumptions. The design will be considered complete when another engineer can explain which facts belong in the OLTP database and which facts should be modeled later in the analytical warehouse.

Acceptance criterionEvidence required
Correct relational modelEntity list, table definitions, primary keys, foreign keys, and relationship explanation.
Integrity rulesCheck constraints and not-null rules tied to business meaning.
Transaction designClear checkout and refund transaction boundaries.
Performance reasoningTwo indexes justified by query patterns and write-cost trade-offs.
Migration safetyStep-by-step migration plan with validation and rollback or forward-fix approach.
Recovery thinkingRPO, RTO, backup type, retention period, and restore test cadence.

Exercises

  1. Extend the TuranMart schema with a shipments table. Define the primary key, foreign key to orders, allowed shipment statuses, and one index that supports operations lookup.

  2. Rewrite the completed revenue query so that it reports revenue by registration_channel instead of region. Explain which join is required and why.

  3. Add a refunds table on paper. Decide whether refunds reference orders, payments, or both, and justify the choice.

  4. Choose one query from the lab and explain how the query plan changes before and after indexing. Use plain English rather than copying the entire plan output.

  5. Write a short incident scenario in which a backup exists but recovery still fails. Identify the missing operational practice.

  6. Compare PostgreSQL and MySQL for a small SaaS product whose team already has strong MySQL expertise but needs advanced analytical SQL in the operational database. Recommend a decision and explain the trade-off.

Review Questions

QuestionWhat a strong answer should include
Why do relational databases remain important in modern data platforms?They protect operational truth through schemas, constraints, transactions, SQL, and recovery, while downstream systems optimize analytics and scale-out processing.
What is the difference between a primary key and a foreign key?A primary key identifies rows in its own table; a foreign key references another table’s key to enforce relationships.
Why should indexes be added carefully?They speed selected read patterns but consume storage and add maintenance cost to writes.
What does EXPLAIN help you understand?The query plan selected by the optimizer, including scans, joins, filters, estimates, and costs.
Why is a migration a production discipline?It changes a shared schema contract and can affect applications, pipelines, dashboards, locks, and recovery.
Why must backups be restored during tests?Backup files alone do not prove recoverability; restore drills validate completeness, procedures, permissions, timing, and operational readiness.

Summary and Next Step

Relational databases are not merely legacy systems that sit behind applications. They are active guardians of operational truth. PostgreSQL and MySQL show how the relational model becomes production infrastructure through SQL, constraints, transactions, query planning, indexing, migrations, and backup/recovery practice.

In this chapter, TuranMart’s checkout problem gave us a practical reason to design a normalized OLTP schema, enforce business rules, read query plans, add measured indexes, introduce a payments migration, and reason about recovery. The larger lesson is that data engineering starts before data reaches the warehouse. It starts when the first operational record is created correctly.

The next chapter expands the storage conversation from relational databases to NoSQL and semi-structured data. You will learn why document, key-value, column-family, and graph systems exist, what trade-offs they make, and how to decide when relational structure is a strength versus when flexible or distributed models better fit the workload.

References

Footnotes
  1. PostgreSQL Global Development Group, “PostgreSQL Tutorial,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/tutorial.html

  2. PostgreSQL Global Development Group, “Constraints,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/ddl-constraints.html

  3. Oracle, “FOREIGN KEY Constraints,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/constraint-foreign-key.html

  4. PostgreSQL Global Development Group, “Transactions,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/tutorial-transactions.html

  5. Oracle, “MySQL and the ACID Model,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/mysql-acid.html

  6. PostgreSQL Global Development Group, “Introduction to MVCC,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/mvcc-intro.html

  7. PostgreSQL Global Development Group, “Using EXPLAIN,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/using-explain.html

  8. PostgreSQL Global Development Group, “Backup and Restore,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/backup.html

  9. Oracle, “Introduction to InnoDB,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/innodb-storage-engine.html

  10. Oracle, “Optimization and Indexes,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/optimization-indexes.html

  11. Oracle, “Backup and Recovery,” MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/backup-and-recovery.html

  12. PostgreSQL Global Development Group, “Indexes,” PostgreSQL Documentation. https://www.postgresql.org/docs/current/indexes.html