In Chapter 1, you learned that data engineering is not simply the movement of files from one place to another. It is the practice of building reliable data products that preserve meaning as data travels from operational systems to analytics, automation, and AI. This chapter turns that mindset into the first concrete design skill: deciding how data should be represented, stored, modeled, and tested.
The practical outcome of this chapter is a small but realistic TuranMart data design package. You will inspect e-commerce order data, design an operational entity-relationship model, reshape the same business process into a dimensional model for analytics, and run a notebook that converts CSV and JSON source files into Parquet while applying quality checks. The goal is not to memorize every possible file format or modeling technique. The goal is to learn how professional data engineers translate business reality into structures that are easy to trust, query, and evolve.
Figure 1:Chapter overview covering data types, modeling, storage formats, quality checks, and the guided lab artifact.
Opening Scenario: TuranMart’s Revenue Problem¶
TuranMart’s marketing team has launched a regional campaign. The website shows thousands of new orders, the payment provider reports a different revenue number, and the logistics dashboard says many of those orders have not shipped. Finance asks for a reliable daily revenue table by region, product category, and campaign. Marketing asks for conversion metrics. Logistics asks which cities are creating fulfillment delays. The CEO asks a simple question: “What happened yesterday?”
That question becomes difficult because each source system describes the business differently. The checkout database stores normalized tables such as customers, orders, order_items, payments, and shipments. The web application emits JSON events where each event may contain nested campaign metadata. The finance team sometimes receives CSV exports from partners. The analytics team wants a fast table that can be queried by date, region, customer segment, product category, and campaign. The same business event must therefore survive several transformations without losing its meaning.
| Stakeholder | Pain point | Desired outcome | Data engineering design question |
|---|---|---|---|
| Finance analyst | Revenue differs between checkout, payment, and spreadsheet exports. | Audited daily revenue by region and currency. | Which source is authoritative for payment status, refunds, and reconciliation? |
| Marketing manager | Campaign performance changes depending on event interpretation. | Conversion metrics by campaign, channel, and product category. | How do clickstream events connect to orders and customers? |
| Logistics operator | Late shipments are visible only after customers complain. | Shipment-delay indicators by city and carrier. | How do shipments relate to orders, addresses, warehouses, and service-level targets? |
| Data analyst | Dashboards require repeated joins and repeated metric logic. | Queryable, documented tables for self-service analysis. | Should the analytics model be normalized, dimensional, or both? |
| Platform engineer | Files arrive in different shapes and are hard to validate consistently. | Repeatable conversion into typed, testable storage. | Which formats preserve schema, reduce cost, and support validation? |
The central lesson is that modeling, format selection, and quality checks are one design conversation. A Parquet file with the wrong schema is not a good data product. A beautiful star schema built from unvalidated source records is not trustworthy. A normalized database without clear analytical outputs creates friction for every reporting question. This chapter teaches you to connect these choices deliberately.
Learning Objectives¶
By the end of this chapter, you should be able to explain the difference between structured, semi-structured, and unstructured data and choose appropriate formats for each. You should be able to design a normalized e-commerce entity-relationship model for operational integrity and a dimensional model for analytical performance. You should also be able to run a small conversion workflow that reads CSV and JSON data, validates basic quality rules, writes Parquet, and queries the result with DuckDB.
| Objective | What you should be able to produce | Evidence in the guided lab |
|---|---|---|
| Classify data by structure | A source inventory that distinguishes relational tables, API payloads, logs, documents, and media. | A short explanation of why orders.csv and events.jsonl need different handling. |
| Choose file formats | A reasoned comparison of CSV, JSON, Avro, Parquet, and ORC for ingestion, exchange, streaming, and analytics. | Parquet outputs generated from raw TuranMart inputs. |
| Model operational data | An ERD with primary keys, foreign keys, cardinalities, and integrity constraints for e-commerce orders. | SQL DDL aligned with the operational model. |
| Model analytical data | A star schema with facts, dimensions, grains, and measures for revenue analysis. | A revenue query by date and region. |
| Apply quality checks | Tests for schema validity, required fields, uniqueness, referential integrity, and reconciliation totals. | A passing quality report and validation script. |
| Explain trade-offs | A clear argument for why the same business process may need different models and formats at different layers. | Written answers to exercises and review questions. |
Conceptual Foundation: Data Has Shape Before It Has Scale¶
A common beginner mistake is to think about scale before thinking about shape. Large data systems do need scalable storage and processing engines, but the first design question is more basic: what kind of data is this, and how much structure does it already have? Data engineers usually work with three broad categories.
Structured data has a predefined schema. A relational table, spreadsheet, or controlled CSV export usually contains rows and columns where each column has an expected meaning and type. Structured data is easiest to validate with constraints such as primary keys, foreign keys, uniqueness rules, and type checks. It is also the natural starting point for SQL analysis.
Semi-structured data has internal organization but does not always fit a fixed table. JSON and XML documents often contain nested objects, optional fields, arrays, and changing attributes. This makes semi-structured data useful for APIs, logs, clickstream events, and document databases. The flexibility is valuable, but it moves more responsibility to the data engineer: nested fields must be interpreted, optional fields must be handled, and schema drift must be detected.
Unstructured data has no obvious row-and-column structure. Product images, support-call audio, PDF contracts, chat transcripts, and long-form documents belong here. In modern data platforms, unstructured data often becomes useful after metadata extraction, text parsing, embedding generation, speech recognition, or computer vision. Later chapters on AI and ML data engineering return to this point, because retrieval-augmented generation and semantic search depend heavily on preparing unstructured content correctly.
| Data type | TuranMart example | Typical storage | Engineering risk |
|---|---|---|---|
| Structured | orders, payments, customers tables | PostgreSQL, MySQL, warehouse tables, Parquet datasets | Incorrect constraints or mismatched business definitions. |
| Semi-structured | Web events, API responses, product attributes | JSON Lines, document databases, raw data lake zones | Schema drift, missing optional fields, deeply nested attributes. |
| Unstructured | Reviews, support tickets, images, invoices | Object storage, document stores, vector indexes after processing | Hard-to-measure quality, privacy exposure, expensive extraction. |
The practical rule is simple: store raw evidence faithfully, then create curated structures for use cases. A data lake may keep the original JSON event because it is useful for debugging. A warehouse may expose a clean fact_order_line table because analysts need stable revenue metrics. Both can be correct if each layer has a clear purpose.
2.1 File Formats Are Engineering Contracts¶
A file format is not only a storage detail. It is a contract between producers, pipelines, query engines, and downstream users. The format influences how schemas are stored, how efficiently data can be scanned, whether compression works well, and how difficult it is to debug failures.
CSV is popular because it is simple and universal. Almost every tool can open it. That advantage is also its weakness. CSV does not reliably carry schema information, so types must be inferred or specified externally. A column that looks numeric may contain empty strings, leading zeros, local currency separators, or text codes. CSV is therefore useful for small exports and interoperability, but it is a fragile long-term storage format for analytical datasets.
JSON is the natural language of many APIs and web applications. It represents nested structures better than CSV and is easy for humans to inspect. However, JSON is verbose because field names are repeated in every record. Large JSON collections are commonly stored as JSON Lines, where each line is one JSON object, because this makes streaming and line-by-line processing easier. JSON is excellent for raw ingestion and event capture, but analytical systems usually convert it into a typed tabular format.
Apache Avro is a compact binary serialization format designed around schemas. The Avro documentation describes Avro as a data serialization system that provides rich data structures, a compact binary data format, container files, remote procedure calls, and schema-based processing.[1] Avro is common in streaming pipelines because events can carry or reference schemas, and producers and consumers can evolve in controlled ways.
Apache Parquet is a columnar file format designed for analytical workloads. The Apache Parquet project provides the public file-format specification, and DuckDB describes Parquet files as compressed columnar files that are efficient to load and process.[2] [3] Columnar storage matters because analytical queries often read a few columns from many rows. If a query only needs order_date, region, and gross_revenue, a columnar file can avoid scanning unrelated columns such as customer email or shipment tracking number. DuckDB also supports reading, writing, schema inspection, filter pushdown, and projection pushdown for Parquet files.[3]
ORC is another columnar format that is strong in parts of the Hadoop and Hive ecosystem. In new open-source lakehouse projects, Parquet is often more common because it is widely supported across Spark, DuckDB, Trino, pandas, Polars, Delta Lake, Iceberg, and many cloud query engines. The important principle is not that one format wins forever. The principle is to align the format with the workload.
| Format | Best use | Strength | Weakness | Typical layer |
|---|---|---|---|---|
| CSV | Small exports and human exchange | Universal and easy to inspect | Weak schema and fragile parsing | Raw exchange, ad hoc export |
| JSON / JSON Lines | APIs, logs, events, nested payloads | Flexible and expressive | Verbose and slower for analytics | Raw bronze layer |
| Avro | Streaming events and schema evolution | Compact binary records with schema support | Less convenient for ad hoc SQL analytics | Ingestion and event pipelines |
| Parquet | Analytical data lakes and lakehouses | Columnar reads, compression, embedded schema | Not human-readable | Silver and gold analytical layers |
| ORC | Hive-oriented analytics | Columnar performance and compression | Less universal outside some ecosystems | Warehouse and Hadoop analytics |
Design rule: Use human-readable formats when humans must inspect or exchange small files. Use schema-aware binary formats when machines must process large or repeated datasets reliably.
2.2 Row-Oriented and Column-Oriented Thinking¶
The difference between row-oriented and column-oriented storage is one of the most useful mental models in data engineering. In row-oriented storage, the values for one record are stored together. This is efficient when an application reads or updates a complete business object, such as one customer account or one order. Operational databases usually prefer this pattern because transactions touch small numbers of complete records.
In column-oriented storage, values from the same column are stored together. This is efficient when an analytical query scans many rows but only a few columns. Column-oriented layouts also compress well because neighboring values often share the same type and similar patterns. A column containing region values, for example, may repeat the same few city or region names many times.
| Workload | Main question | Preferred layout | Example query |
|---|---|---|---|
| OLTP | “Can we safely create or update this order now?” | Row-oriented tables with indexes and constraints | Insert an order and its items in one transaction. |
| OLAP | “What was revenue by campaign and region last month?” | Column-oriented files or analytical tables | Scan date, region, campaign, and revenue columns across millions of rows. |
| Streaming | “Can each event be serialized and consumed safely?” | Record-oriented event format with schema control | Publish one OrderPlaced event to Kafka. |
| Data science | “Can features be loaded repeatedly and reproducibly?” | Columnar datasets with metadata | Load selected feature columns for model training. |
A professional platform often uses several layouts at once. TuranMart may use PostgreSQL for order transactions, JSON for web events, Avro for streaming events, Parquet for the lakehouse, and a dimensional schema for BI. That is not inconsistency; it is purposeful specialization.
2.3 Operational Modeling: The Entity-Relationship View¶
Operational systems need integrity. When a customer places an order, the system must ensure that the order belongs to a real customer, that each order item refers to a real product, that payment status is recorded consistently, and that shipment updates do not create contradictory states. This is the world of entity-relationship modeling.
An entity is a business object or concept that deserves its own identity. A relationship describes how entities connect. A cardinality describes how many records can participate in the relationship. In TuranMart, one customer can place many orders, one order can contain many order items, one product can appear in many order items, and one order may have one current shipment. The ERD in Figure 2 shows a simplified operational model.
Figure 2:A normalized TuranMart e-commerce ERD for customers, orders, items, products, payments, and shipments.
Good ER modeling is not decoration. It protects production systems from ambiguity. If orders.customer_id is a foreign key, an order cannot silently refer to a non-existent customer. If products.sku is unique, the business avoids two product rows with the same stock-keeping unit. If order_items is separated from orders, one order can contain many products without repeating order-level information.
| ERD element | Meaning | TuranMart example | Quality implication |
|---|---|---|---|
| Primary key | Stable identifier for one entity record | customer_id, order_id, product_id | Enables uniqueness checks and joins. |
| Foreign key | Reference to another entity | orders.customer_id | Enables referential integrity checks. |
| Cardinality | Allowed number of related records | One order has many order items | Prevents incorrect assumptions about grain. |
| Constraint | Rule enforced by the database or pipeline | email UNIQUE, amount >= 0 | Stops invalid data early. |
| Audit column | Operational trace of change | created_at, updated_at, status_changed_at | Supports debugging and lineage. |
Normalization is the discipline of reducing unnecessary duplication in operational models. A normalized design stores each fact in one logical place. Customer email belongs in customers, not copied into every order. Product category belongs in products or categories, not repeated manually in every order item. For most operational systems, a design close to third normal form is a good default because it reduces update anomalies and protects consistency.
Normalization is not free. Analytical queries over highly normalized schemas can require many joins. That is why the same business process often needs a second model for analytics.
2.4 Analytical Modeling: Facts, Dimensions, and Grain¶
Analytical systems optimize for questions, not transactions. Finance does not want to reconstruct revenue by joining ten operational tables every morning. Analysts want a stable model where business measures and descriptive attributes are easy to find. Dimensional modeling solves this problem by organizing data into fact tables and dimension tables.
The Kimball Group describes dimensional modeling as a discipline that divides the world into measurements and context. Measurements become facts, while the surrounding descriptive context becomes dimensions.[4] A fact table stores measurements at a declared grain. The grain is the most important sentence in a dimensional model because it defines what one row means. For TuranMart, a useful grain is one row per order item after payment authorization. Measures at this grain include quantity, gross revenue, discount amount, net revenue, and fulfillment delay. Dimension tables provide descriptive context such as date, customer segment, product category, region, and campaign.
Figure 3:A TuranMart star schema that reshapes operational order data into analytics-ready facts and dimensions.
The star schema deliberately duplicates some descriptive attributes. For example, dim_product may include category and brand so analysts can group revenue without joining the operational products, categories, and brands tables. This is denormalization, but it is controlled denormalization. The pipeline owns the logic that copies and standardizes those attributes.
| Key concept | Working definition | TuranMart example |
|---|---|---|
| Fact | A measurable business event at a declared grain. | One paid order item. |
| Dimension | Descriptive context used to filter, group, or label facts. | Customer segment, product category, date, campaign, region. |
| Grain | The business meaning of one row in a fact table. | “One row per order item after payment authorization.” |
| Measure | Numeric value that can be aggregated or compared. | Quantity, gross revenue, discount amount, net revenue. |
| Surrogate key | Warehouse-generated key used to connect facts and dimensions. | customer_key in dim_customer. |
| Degenerate dimension | Operational identifier kept in the fact table without a separate dimension. | order_id in fact_order_line. |
| Modeling choice | Operational ER model | Analytical dimensional model |
|---|---|---|
| Primary goal | Correct transactions and integrity | Fast, understandable analysis |
| Typical grain | One entity record per table | One measurable business event per fact row |
| Structure | Normalized tables | Fact tables surrounded by dimensions |
| Redundancy | Minimized | Accepted when it improves usability |
| Main users | Applications and service teams | Analysts, BI tools, data scientists |
| Failure mode | Inconsistent transactions | Confusing metrics or slow queries |
A good data engineer can explain both models to stakeholders. The operational model answers, “How does the business process work?” The analytical model answers, “How should people measure and compare the process?” When those models disagree, the disagreement should be explicit and documented, not hidden in a dashboard query.
2.5 Quality: Trust Is Built Before the Dashboard¶
Data quality is the difference between a table that exists and a table people trust. Quality is not a single property. It is a set of expectations that depend on the business use case. A support-ticket sentiment model may tolerate approximate labels but must protect privacy. A finance revenue table may tolerate slightly slower delivery but must reconcile exactly to payment records.
Common quality dimensions include validity, completeness, uniqueness, consistency, accuracy, and timeliness. Validity asks whether values match the expected schema and business rules. Completeness asks whether required fields are present. Uniqueness asks whether records are duplicated. Consistency asks whether related systems agree. Accuracy asks whether data represents the real-world event. Timeliness asks whether data arrived when users needed it. ISO/IEC 25012:2008 defines a general data quality model for structured data that can be used to establish requirements, measures, and evaluations.[5] Great Expectations describes GX Core as a Python library and workflow for creating data validation workflows with sample data, which reflects the broader industry practice of treating expectations as executable checks.[6]
| Quality dimension | Example rule for TuranMart | Where to check it |
|---|---|---|
| Validity | order_status must be one of created, paid, shipped, cancelled, or refunded. | Source ingestion and silver transformation. |
| Completeness | order_id, customer_id, order_ts, and payment_status must not be null. | Raw-to-curated conversion. |
| Uniqueness | order_id is unique in the orders table. | Operational database and pipeline tests. |
| Consistency | Sum of paid order items reconciles to authorized payment totals. | Finance validation job. |
| Referential integrity | Every order item refers to an existing order and product. | Database constraints and batch tests. |
| Timeliness | Yesterday’s orders are available before 08:00 local time. | Orchestration and observability. |
Quality checks should live near the pipeline, not only in human review. A notebook can demonstrate the idea, but a production implementation should run tests automatically during ingestion, transformation, deployment, and scheduled operations. Later chapters will connect these checks to dbt-style tests, orchestration, observability, contracts, and incident response.
Production Design Pattern: Bronze, Silver, and Gold Data Contracts¶
A simple way to organize data lake work is the bronze-silver-gold pattern. The bronze layer keeps raw source evidence with minimal changes. The silver layer standardizes schemas, types, and quality checks. The gold layer exposes business-ready tables for dashboards, applications, and models. This pattern is useful because it separates preservation, correction, and consumption.
| Layer | Purpose | TuranMart example | Recommended format | Quality gate |
|---|---|---|---|---|
| Bronze | Preserve source evidence | Raw CSV partner exports and JSON web events | Original format, JSON Lines, or Avro | Arrival, file integrity, source metadata. |
| Silver | Clean, type, deduplicate, and conform | Typed orders, customers, products, payments | Parquet with documented schema | Required fields, type checks, uniqueness, referential integrity. |
| Gold | Serve business use cases | fact_order_line, revenue marts, campaign metrics | Parquet, warehouse tables, or lakehouse tables | Reconciliation, metric definitions, freshness, access rules. |
Figure 4:The Chapter 2 lab workflow for converting raw CSV and JSON source data into validated Parquet and checking the result with DuckDB.
The lab for this chapter implements a small version of the pattern. You will keep raw CSV and JSON sample files, parse them into typed pandas DataFrames, run quality checks, write Parquet, and query the result with DuckDB. This is intentionally small, but it represents the same control points used in larger production pipelines.
Production habit: Never treat a format conversion as a mechanical copy. Treat it as the moment when the platform states what the data means, which fields are required, which assumptions are testable, and which outputs are safe for consumption.
Guided Lab: Convert Raw Orders to Validated Parquet¶
The lab files are stored in shared/labs/ch02_data_models_formats_quality/, and the executable notebook is stored in shared/notebooks/ch02_formats_quality_lab.ipynb. This chapter’s lab is intentionally lightweight: it does not require Docker Compose because the purpose is to focus on modeling, format conversion, local validation, and SQL inspection before heavier services appear in later chapters.
| Lab material | Link | Purpose |
|---|---|---|
| Lab README | shared | Step-by-step lab instructions, cleanup notes, and troubleshooting guidance. |
| Starter notebook | shared | Executable conversion workflow from CSV and JSON Lines to Parquet. |
| Sample data | orders.csv, order_items.csv, events.jsonl | Raw structured and semi-structured TuranMart inputs. |
| SQL models | operational DDL and star-schema DDL | Reference schemas corresponding to the ERD and star-schema ideas. |
| Expected output | revenue | Expected DuckDB revenue result after conversion. |
| Validation script | tests | Lightweight test that compares generated Parquet results to expected output. |
| Exercises | exercises/README.md | Independent practice tasks with increasing difficulty. |
| Solution guide | shared | Reference answers separated from starter materials. |
Start from the repository root and install dependencies if your environment is not ready:
python -m pip install -r requirements.txt
jupyter lab shared/notebooks/ch02_formats_quality_lab.ipynbThe notebook performs five steps. First, it loads orders.csv, order_items.csv, and events.jsonl. Second, it casts identifiers, timestamps, integers, and decimal-like fields into predictable types. Third, it checks required fields, uniqueness, valid statuses, positive quantities, and referential integrity between orders and order items. Fourth, it writes Parquet files to shared/labs/ch02_data_models_formats_quality/output/parquet/. Fifth, it uses DuckDB SQL to validate row counts and revenue totals from the Parquet files.
After the notebook has generated the Parquet files, run the repository validation script:
python shared/labs/ch02_data_models_formats_quality/tests/validate_lab_outputs.pyThe script compares the DuckDB revenue query against expected_output/revenue_by_date_region.csv. This is a small example of how production teams turn “the output looks right” into an executable regression check.
| Lab step | Evidence of success |
|---|---|
| Load raw files | The notebook prints row counts for orders, order items, and events. |
| Apply types | Timestamp columns parse correctly and numeric columns can be aggregated. |
| Run quality checks | The quality_report table shows all checks as PASS. |
| Write Parquet | The output directory contains orders.parquet, order_items.parquet, and events.parquet. |
| Query with DuckDB | The final SQL query returns revenue by region and order date. |
| Validate outputs | The validation script prints a pass message after comparing actual and expected results. |
A simplified version of the quality logic looks like this:
checks = {
"orders_have_unique_ids": orders["order_id"].is_unique,
"orders_have_customers": orders["customer_id"].notna().all(),
"items_have_positive_quantity": (order_items["quantity"] > 0).all(),
"items_reference_orders": order_items["order_id"].isin(orders["order_id"]).all(),
}
quality_report = pd.DataFrame(
[{"check": name, "status": "PASS" if ok else "FAIL"} for name, ok in checks.items()]
)This example is deliberately transparent. In production, you may implement the same idea with Great Expectations, dbt tests, custom SQL checks, data contracts, or orchestration-level assertions. The tool matters less than the habit: make assumptions executable.
Common Pitfalls and Operational Lessons¶
Many data problems begin as small modeling shortcuts. A team stores JSON blobs without documenting which fields are required, then discovers that half the events do not contain campaign identifiers. Another team exposes raw normalized operational tables directly to analysts, then every dashboard implements revenue differently. A third team converts everything to Parquet but never validates uniqueness, so duplicate orders become trusted facts. These failures are avoidable when modeling, format selection, and quality checks are designed together.
| Pitfall | Why it happens | Professional response |
|---|---|---|
| Treating CSV as a database | CSV is easy to create and exchange. | Use CSV for exchange, but convert important recurring datasets to typed storage. |
| Ignoring grain | Fact tables are created without defining what one row means. | Write the grain sentence before designing measures. |
| Over-normalizing analytics | Operational schemas are copied directly into BI. | Build dimensional or semantic models for consumption. |
| Under-validating raw data | Pipelines assume source systems never change. | Add schema, null, uniqueness, referential-integrity, and reconciliation checks. |
| Losing lineage | Transformations overwrite evidence. | Preserve raw data and record how curated outputs were produced. |
| Choosing formats by fashion | Engineers copy a popular architecture. | Match format, model, and storage to workload and user needs. |
| Mixing business meanings | Similar fields are joined as if they mean the same thing. | Document definitions and reconcile source-of-truth decisions with stakeholders. |
The operational lesson is that the earliest modeling decisions become the foundation for later storage, processing, orchestration, observability, governance, and AI work. A weak model creates weak pipelines. A strong model with no tests still creates risk. A tested model in the wrong format creates performance and cost problems. Professional data engineering requires all three: meaning, structure, and proof.
Exercises¶
The following exercises extend the guided lab. They are designed to make you practice engineering judgment rather than copy a single correct answer.
| Level | Exercise | Task | Expected artifact |
|---|---|---|---|
| Basic | Add a new order status | Add a status called returned and update the validation logic. | A passing quality report and a note explaining whether returned belongs before or after payment. |
| Basic | Force a quality failure | Add a duplicate order row intentionally and confirm that the quality check fails. | Notebook output or terminal output showing the failed check. |
| Intermediate | Add campaign context | Add a dim_campaign table from JSON event metadata and connect it to the star schema. | A short schema description and one SQL query by campaign channel. |
| Advanced | Model customer history | Design a slowly changing dim_customer table that keeps customer segment history. | A table design with surrogate key, natural key, effective dates, and current flag. |
| Advanced | Compare CSV and Parquet | Add 10,000 synthetic rows and compare file size and query time. | A small benchmark table with observations. |
| Team | Review the model | Split into finance, marketing, logistics, and platform roles and review the same design. | A short decision record documenting trade-offs and unresolved questions. |
Review Questions¶
These questions are meant for self-assessment, classroom discussion, or instructor review. A strong answer should explain the reasoning, not only name a tool.
| Question | What a strong answer should include |
|---|---|
| Why should a data engineer classify data shape before choosing a storage engine? | The answer should connect structure, validation, schema drift, and workload fit. |
| When is CSV appropriate, and when should it be converted to Parquet or another typed format? | The answer should distinguish human exchange from recurring analytical processing. |
| Why might TuranMart need both an operational ERD and an analytical star schema for the same order process? | The answer should compare transaction integrity with analytical usability. |
What does the grain of fact_order_line mean, and why is it dangerous to skip the grain sentence? | The answer should explain row meaning, aggregation safety, and metric consistency. |
| Which quality checks would you run before allowing finance to use a daily revenue table? | The answer should mention required fields, uniqueness, valid statuses, payment reconciliation, and freshness. |
| How does the bronze-silver-gold pattern reduce operational risk? | The answer should explain raw preservation, standardized correction, business-ready consumption, and lineage. |
| What evidence proves that the Chapter 2 lab ran correctly? | The answer should mention generated Parquet files, passing quality checks, DuckDB query output, and validation against expected results. |
Chapter Summary¶
This chapter introduced the first design vocabulary of data engineering. You learned that data has structure before it has scale, and that structured, semi-structured, and unstructured data require different handling. You compared CSV, JSON, Avro, Parquet, and ORC as engineering contracts rather than isolated file extensions. You designed a normalized ERD for operational integrity and a dimensional star schema for analytical usability. Finally, you connected those ideas to quality checks and a guided lab that converts raw files into validated Parquet.
The next chapter moves from data design to the broader open-source ecosystem. You will learn how databases, object stores, processing engines, orchestration tools, quality frameworks, catalogs, and notebooks fit together into a practical platform. Chapter 2 gives you the core question to carry forward: what meaning must the data preserve, and how will the system prove that it preserved it?