Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Chapter 2: Data Modeling and Storage Paradigms

In Chapter 1, you learned that data engineering is not simply the movement of files from one place to another. It is the practice of building reliable data products that preserve meaning as data travels from operational systems to analytics, automation, and AI. This chapter turns that mindset into the first concrete design skill: deciding how data should be represented, stored, modeled, and tested.

The practical outcome of this chapter is a small but realistic TuranMart data design package. You will inspect e-commerce order data, design an operational entity-relationship model, reshape the same business process into a dimensional model for analytics, and run a notebook that converts CSV and JSON source files into Parquet while applying quality checks. The goal is not to memorize every possible file format or modeling technique. The goal is to learn how professional data engineers translate business reality into structures that are easy to trust, query, and evolve.

Chapter overview covering data types, modeling, storage formats, quality checks, and the guided lab artifact.

Figure 1:Chapter overview covering data types, modeling, storage formats, quality checks, and the guided lab artifact.

Opening Scenario: TuranMart’s Revenue Problem

TuranMart’s marketing team has launched a regional campaign. The website shows thousands of new orders, the payment provider reports a different revenue number, and the logistics dashboard says many of those orders have not shipped. Finance asks for a reliable daily revenue table by region, product category, and campaign. Marketing asks for conversion metrics. Logistics asks which cities are creating fulfillment delays. The CEO asks a simple question: “What happened yesterday?”

That question becomes difficult because each source system describes the business differently. The checkout database stores normalized tables such as customers, orders, order_items, payments, and shipments. The web application emits JSON events where each event may contain nested campaign metadata. The finance team sometimes receives CSV exports from partners. The analytics team wants a fast table that can be queried by date, region, customer segment, product category, and campaign. The same business event must therefore survive several transformations without losing its meaning.

StakeholderPain pointDesired outcomeData engineering design question
Finance analystRevenue differs between checkout, payment, and spreadsheet exports.Audited daily revenue by region and currency.Which source is authoritative for payment status, refunds, and reconciliation?
Marketing managerCampaign performance changes depending on event interpretation.Conversion metrics by campaign, channel, and product category.How do clickstream events connect to orders and customers?
Logistics operatorLate shipments are visible only after customers complain.Shipment-delay indicators by city and carrier.How do shipments relate to orders, addresses, warehouses, and service-level targets?
Data analystDashboards require repeated joins and repeated metric logic.Queryable, documented tables for self-service analysis.Should the analytics model be normalized, dimensional, or both?
Platform engineerFiles arrive in different shapes and are hard to validate consistently.Repeatable conversion into typed, testable storage.Which formats preserve schema, reduce cost, and support validation?

The central lesson is that modeling, format selection, and quality checks are one design conversation. A Parquet file with the wrong schema is not a good data product. A beautiful star schema built from unvalidated source records is not trustworthy. A normalized database without clear analytical outputs creates friction for every reporting question. This chapter teaches you to connect these choices deliberately.

Learning Objectives

By the end of this chapter, you should be able to explain the difference between structured, semi-structured, and unstructured data and choose appropriate formats for each. You should be able to design a normalized e-commerce entity-relationship model for operational integrity and a dimensional model for analytical performance. You should also be able to run a small conversion workflow that reads CSV and JSON data, validates basic quality rules, writes Parquet, and queries the result with DuckDB.

ObjectiveWhat you should be able to produceEvidence in the guided lab
Classify data by structureA source inventory that distinguishes relational tables, API payloads, logs, documents, and media.A short explanation of why orders.csv and events.jsonl need different handling.
Choose file formatsA reasoned comparison of CSV, JSON, Avro, Parquet, and ORC for ingestion, exchange, streaming, and analytics.Parquet outputs generated from raw TuranMart inputs.
Model operational dataAn ERD with primary keys, foreign keys, cardinalities, and integrity constraints for e-commerce orders.SQL DDL aligned with the operational model.
Model analytical dataA star schema with facts, dimensions, grains, and measures for revenue analysis.A revenue query by date and region.
Apply quality checksTests for schema validity, required fields, uniqueness, referential integrity, and reconciliation totals.A passing quality report and validation script.
Explain trade-offsA clear argument for why the same business process may need different models and formats at different layers.Written answers to exercises and review questions.

Conceptual Foundation: Data Has Shape Before It Has Scale

A common beginner mistake is to think about scale before thinking about shape. Large data systems do need scalable storage and processing engines, but the first design question is more basic: what kind of data is this, and how much structure does it already have? Data engineers usually work with three broad categories.

Structured data has a predefined schema. A relational table, spreadsheet, or controlled CSV export usually contains rows and columns where each column has an expected meaning and type. Structured data is easiest to validate with constraints such as primary keys, foreign keys, uniqueness rules, and type checks. It is also the natural starting point for SQL analysis.

Semi-structured data has internal organization but does not always fit a fixed table. JSON and XML documents often contain nested objects, optional fields, arrays, and changing attributes. This makes semi-structured data useful for APIs, logs, clickstream events, and document databases. The flexibility is valuable, but it moves more responsibility to the data engineer: nested fields must be interpreted, optional fields must be handled, and schema drift must be detected.

Unstructured data has no obvious row-and-column structure. Product images, support-call audio, PDF contracts, chat transcripts, and long-form documents belong here. In modern data platforms, unstructured data often becomes useful after metadata extraction, text parsing, embedding generation, speech recognition, or computer vision. Later chapters on AI and ML data engineering return to this point, because retrieval-augmented generation and semantic search depend heavily on preparing unstructured content correctly.

Data typeTuranMart exampleTypical storageEngineering risk
Structuredorders, payments, customers tablesPostgreSQL, MySQL, warehouse tables, Parquet datasetsIncorrect constraints or mismatched business definitions.
Semi-structuredWeb events, API responses, product attributesJSON Lines, document databases, raw data lake zonesSchema drift, missing optional fields, deeply nested attributes.
UnstructuredReviews, support tickets, images, invoicesObject storage, document stores, vector indexes after processingHard-to-measure quality, privacy exposure, expensive extraction.

The practical rule is simple: store raw evidence faithfully, then create curated structures for use cases. A data lake may keep the original JSON event because it is useful for debugging. A warehouse may expose a clean fact_order_line table because analysts need stable revenue metrics. Both can be correct if each layer has a clear purpose.

2.1 File Formats Are Engineering Contracts

A file format is not only a storage detail. It is a contract between producers, pipelines, query engines, and downstream users. The format influences how schemas are stored, how efficiently data can be scanned, whether compression works well, and how difficult it is to debug failures.

CSV is popular because it is simple and universal. Almost every tool can open it. That advantage is also its weakness. CSV does not reliably carry schema information, so types must be inferred or specified externally. A column that looks numeric may contain empty strings, leading zeros, local currency separators, or text codes. CSV is therefore useful for small exports and interoperability, but it is a fragile long-term storage format for analytical datasets.

JSON is the natural language of many APIs and web applications. It represents nested structures better than CSV and is easy for humans to inspect. However, JSON is verbose because field names are repeated in every record. Large JSON collections are commonly stored as JSON Lines, where each line is one JSON object, because this makes streaming and line-by-line processing easier. JSON is excellent for raw ingestion and event capture, but analytical systems usually convert it into a typed tabular format.

Apache Avro is a compact binary serialization format designed around schemas. The Avro documentation describes Avro as a data serialization system that provides rich data structures, a compact binary data format, container files, remote procedure calls, and schema-based processing.[1] Avro is common in streaming pipelines because events can carry or reference schemas, and producers and consumers can evolve in controlled ways.

Apache Parquet is a columnar file format designed for analytical workloads. The Apache Parquet project provides the public file-format specification, and DuckDB describes Parquet files as compressed columnar files that are efficient to load and process.[2] [3] Columnar storage matters because analytical queries often read a few columns from many rows. If a query only needs order_date, region, and gross_revenue, a columnar file can avoid scanning unrelated columns such as customer email or shipment tracking number. DuckDB also supports reading, writing, schema inspection, filter pushdown, and projection pushdown for Parquet files.[3]

ORC is another columnar format that is strong in parts of the Hadoop and Hive ecosystem. In new open-source lakehouse projects, Parquet is often more common because it is widely supported across Spark, DuckDB, Trino, pandas, Polars, Delta Lake, Iceberg, and many cloud query engines. The important principle is not that one format wins forever. The principle is to align the format with the workload.

FormatBest useStrengthWeaknessTypical layer
CSVSmall exports and human exchangeUniversal and easy to inspectWeak schema and fragile parsingRaw exchange, ad hoc export
JSON / JSON LinesAPIs, logs, events, nested payloadsFlexible and expressiveVerbose and slower for analyticsRaw bronze layer
AvroStreaming events and schema evolutionCompact binary records with schema supportLess convenient for ad hoc SQL analyticsIngestion and event pipelines
ParquetAnalytical data lakes and lakehousesColumnar reads, compression, embedded schemaNot human-readableSilver and gold analytical layers
ORCHive-oriented analyticsColumnar performance and compressionLess universal outside some ecosystemsWarehouse and Hadoop analytics

Design rule: Use human-readable formats when humans must inspect or exchange small files. Use schema-aware binary formats when machines must process large or repeated datasets reliably.

2.2 Row-Oriented and Column-Oriented Thinking

The difference between row-oriented and column-oriented storage is one of the most useful mental models in data engineering. In row-oriented storage, the values for one record are stored together. This is efficient when an application reads or updates a complete business object, such as one customer account or one order. Operational databases usually prefer this pattern because transactions touch small numbers of complete records.

In column-oriented storage, values from the same column are stored together. This is efficient when an analytical query scans many rows but only a few columns. Column-oriented layouts also compress well because neighboring values often share the same type and similar patterns. A column containing region values, for example, may repeat the same few city or region names many times.

WorkloadMain questionPreferred layoutExample query
OLTP“Can we safely create or update this order now?”Row-oriented tables with indexes and constraintsInsert an order and its items in one transaction.
OLAP“What was revenue by campaign and region last month?”Column-oriented files or analytical tablesScan date, region, campaign, and revenue columns across millions of rows.
Streaming“Can each event be serialized and consumed safely?”Record-oriented event format with schema controlPublish one OrderPlaced event to Kafka.
Data science“Can features be loaded repeatedly and reproducibly?”Columnar datasets with metadataLoad selected feature columns for model training.

A professional platform often uses several layouts at once. TuranMart may use PostgreSQL for order transactions, JSON for web events, Avro for streaming events, Parquet for the lakehouse, and a dimensional schema for BI. That is not inconsistency; it is purposeful specialization.

2.3 Operational Modeling: The Entity-Relationship View

Operational systems need integrity. When a customer places an order, the system must ensure that the order belongs to a real customer, that each order item refers to a real product, that payment status is recorded consistently, and that shipment updates do not create contradictory states. This is the world of entity-relationship modeling.

An entity is a business object or concept that deserves its own identity. A relationship describes how entities connect. A cardinality describes how many records can participate in the relationship. In TuranMart, one customer can place many orders, one order can contain many order items, one product can appear in many order items, and one order may have one current shipment. The ERD in Figure 2 shows a simplified operational model.

A normalized TuranMart e-commerce ERD for customers, orders, items, products, payments, and shipments.

Figure 2:A normalized TuranMart e-commerce ERD for customers, orders, items, products, payments, and shipments.

Good ER modeling is not decoration. It protects production systems from ambiguity. If orders.customer_id is a foreign key, an order cannot silently refer to a non-existent customer. If products.sku is unique, the business avoids two product rows with the same stock-keeping unit. If order_items is separated from orders, one order can contain many products without repeating order-level information.

ERD elementMeaningTuranMart exampleQuality implication
Primary keyStable identifier for one entity recordcustomer_id, order_id, product_idEnables uniqueness checks and joins.
Foreign keyReference to another entityorders.customer_idEnables referential integrity checks.
CardinalityAllowed number of related recordsOne order has many order itemsPrevents incorrect assumptions about grain.
ConstraintRule enforced by the database or pipelineemail UNIQUE, amount >= 0Stops invalid data early.
Audit columnOperational trace of changecreated_at, updated_at, status_changed_atSupports debugging and lineage.

Normalization is the discipline of reducing unnecessary duplication in operational models. A normalized design stores each fact in one logical place. Customer email belongs in customers, not copied into every order. Product category belongs in products or categories, not repeated manually in every order item. For most operational systems, a design close to third normal form is a good default because it reduces update anomalies and protects consistency.

Normalization is not free. Analytical queries over highly normalized schemas can require many joins. That is why the same business process often needs a second model for analytics.

2.4 Analytical Modeling: Facts, Dimensions, and Grain

Analytical systems optimize for questions, not transactions. Finance does not want to reconstruct revenue by joining ten operational tables every morning. Analysts want a stable model where business measures and descriptive attributes are easy to find. Dimensional modeling solves this problem by organizing data into fact tables and dimension tables.

The Kimball Group describes dimensional modeling as a discipline that divides the world into measurements and context. Measurements become facts, while the surrounding descriptive context becomes dimensions.[4] A fact table stores measurements at a declared grain. The grain is the most important sentence in a dimensional model because it defines what one row means. For TuranMart, a useful grain is one row per order item after payment authorization. Measures at this grain include quantity, gross revenue, discount amount, net revenue, and fulfillment delay. Dimension tables provide descriptive context such as date, customer segment, product category, region, and campaign.

A TuranMart star schema that reshapes operational order data into analytics-ready facts and dimensions.

Figure 3:A TuranMart star schema that reshapes operational order data into analytics-ready facts and dimensions.

The star schema deliberately duplicates some descriptive attributes. For example, dim_product may include category and brand so analysts can group revenue without joining the operational products, categories, and brands tables. This is denormalization, but it is controlled denormalization. The pipeline owns the logic that copies and standardizes those attributes.

Key conceptWorking definitionTuranMart example
FactA measurable business event at a declared grain.One paid order item.
DimensionDescriptive context used to filter, group, or label facts.Customer segment, product category, date, campaign, region.
GrainThe business meaning of one row in a fact table.“One row per order item after payment authorization.”
MeasureNumeric value that can be aggregated or compared.Quantity, gross revenue, discount amount, net revenue.
Surrogate keyWarehouse-generated key used to connect facts and dimensions.customer_key in dim_customer.
Degenerate dimensionOperational identifier kept in the fact table without a separate dimension.order_id in fact_order_line.
Modeling choiceOperational ER modelAnalytical dimensional model
Primary goalCorrect transactions and integrityFast, understandable analysis
Typical grainOne entity record per tableOne measurable business event per fact row
StructureNormalized tablesFact tables surrounded by dimensions
RedundancyMinimizedAccepted when it improves usability
Main usersApplications and service teamsAnalysts, BI tools, data scientists
Failure modeInconsistent transactionsConfusing metrics or slow queries

A good data engineer can explain both models to stakeholders. The operational model answers, “How does the business process work?” The analytical model answers, “How should people measure and compare the process?” When those models disagree, the disagreement should be explicit and documented, not hidden in a dashboard query.

2.5 Quality: Trust Is Built Before the Dashboard

Data quality is the difference between a table that exists and a table people trust. Quality is not a single property. It is a set of expectations that depend on the business use case. A support-ticket sentiment model may tolerate approximate labels but must protect privacy. A finance revenue table may tolerate slightly slower delivery but must reconcile exactly to payment records.

Common quality dimensions include validity, completeness, uniqueness, consistency, accuracy, and timeliness. Validity asks whether values match the expected schema and business rules. Completeness asks whether required fields are present. Uniqueness asks whether records are duplicated. Consistency asks whether related systems agree. Accuracy asks whether data represents the real-world event. Timeliness asks whether data arrived when users needed it. ISO/IEC 25012:2008 defines a general data quality model for structured data that can be used to establish requirements, measures, and evaluations.[5] Great Expectations describes GX Core as a Python library and workflow for creating data validation workflows with sample data, which reflects the broader industry practice of treating expectations as executable checks.[6]

Quality dimensionExample rule for TuranMartWhere to check it
Validityorder_status must be one of created, paid, shipped, cancelled, or refunded.Source ingestion and silver transformation.
Completenessorder_id, customer_id, order_ts, and payment_status must not be null.Raw-to-curated conversion.
Uniquenessorder_id is unique in the orders table.Operational database and pipeline tests.
ConsistencySum of paid order items reconciles to authorized payment totals.Finance validation job.
Referential integrityEvery order item refers to an existing order and product.Database constraints and batch tests.
TimelinessYesterday’s orders are available before 08:00 local time.Orchestration and observability.

Quality checks should live near the pipeline, not only in human review. A notebook can demonstrate the idea, but a production implementation should run tests automatically during ingestion, transformation, deployment, and scheduled operations. Later chapters will connect these checks to dbt-style tests, orchestration, observability, contracts, and incident response.

Production Design Pattern: Bronze, Silver, and Gold Data Contracts

A simple way to organize data lake work is the bronze-silver-gold pattern. The bronze layer keeps raw source evidence with minimal changes. The silver layer standardizes schemas, types, and quality checks. The gold layer exposes business-ready tables for dashboards, applications, and models. This pattern is useful because it separates preservation, correction, and consumption.

LayerPurposeTuranMart exampleRecommended formatQuality gate
BronzePreserve source evidenceRaw CSV partner exports and JSON web eventsOriginal format, JSON Lines, or AvroArrival, file integrity, source metadata.
SilverClean, type, deduplicate, and conformTyped orders, customers, products, paymentsParquet with documented schemaRequired fields, type checks, uniqueness, referential integrity.
GoldServe business use casesfact_order_line, revenue marts, campaign metricsParquet, warehouse tables, or lakehouse tablesReconciliation, metric definitions, freshness, access rules.
The Chapter 2 lab workflow for converting raw CSV and JSON source data into validated Parquet and checking the result with DuckDB.

Figure 4:The Chapter 2 lab workflow for converting raw CSV and JSON source data into validated Parquet and checking the result with DuckDB.

The lab for this chapter implements a small version of the pattern. You will keep raw CSV and JSON sample files, parse them into typed pandas DataFrames, run quality checks, write Parquet, and query the result with DuckDB. This is intentionally small, but it represents the same control points used in larger production pipelines.

Production habit: Never treat a format conversion as a mechanical copy. Treat it as the moment when the platform states what the data means, which fields are required, which assumptions are testable, and which outputs are safe for consumption.

Guided Lab: Convert Raw Orders to Validated Parquet

The lab files are stored in shared/labs/ch02_data_models_formats_quality/, and the executable notebook is stored in shared/notebooks/ch02_formats_quality_lab.ipynb. This chapter’s lab is intentionally lightweight: it does not require Docker Compose because the purpose is to focus on modeling, format conversion, local validation, and SQL inspection before heavier services appear in later chapters.

Lab materialLinkPurpose
Lab READMEshared/labs/ch02_data_models_formats_quality/README.mdStep-by-step lab instructions, cleanup notes, and troubleshooting guidance.
Starter notebookshared/notebooks/ch02_formats_quality_lab.ipynbExecutable conversion workflow from CSV and JSON Lines to Parquet.
Sample dataorders.csv, order_items.csv, events.jsonlRaw structured and semi-structured TuranMart inputs.
SQL modelsoperational DDL and star-schema DDLReference schemas corresponding to the ERD and star-schema ideas.
Expected outputrevenue_by_date_region.csvExpected DuckDB revenue result after conversion.
Validation scripttests/validate_lab_outputs.pyLightweight test that compares generated Parquet results to expected output.
Exercisesexercises/README.mdIndependent practice tasks with increasing difficulty.
Solution guideshared/solutions/ch02_data_models_formats_quality/solution.mdReference answers separated from starter materials.

Start from the repository root and install dependencies if your environment is not ready:

python -m pip install -r requirements.txt
jupyter lab shared/notebooks/ch02_formats_quality_lab.ipynb

The notebook performs five steps. First, it loads orders.csv, order_items.csv, and events.jsonl. Second, it casts identifiers, timestamps, integers, and decimal-like fields into predictable types. Third, it checks required fields, uniqueness, valid statuses, positive quantities, and referential integrity between orders and order items. Fourth, it writes Parquet files to shared/labs/ch02_data_models_formats_quality/output/parquet/. Fifth, it uses DuckDB SQL to validate row counts and revenue totals from the Parquet files.

After the notebook has generated the Parquet files, run the repository validation script:

python shared/labs/ch02_data_models_formats_quality/tests/validate_lab_outputs.py

The script compares the DuckDB revenue query against expected_output/revenue_by_date_region.csv. This is a small example of how production teams turn “the output looks right” into an executable regression check.

Lab stepEvidence of success
Load raw filesThe notebook prints row counts for orders, order items, and events.
Apply typesTimestamp columns parse correctly and numeric columns can be aggregated.
Run quality checksThe quality_report table shows all checks as PASS.
Write ParquetThe output directory contains orders.parquet, order_items.parquet, and events.parquet.
Query with DuckDBThe final SQL query returns revenue by region and order date.
Validate outputsThe validation script prints a pass message after comparing actual and expected results.

A simplified version of the quality logic looks like this:

checks = {
    "orders_have_unique_ids": orders["order_id"].is_unique,
    "orders_have_customers": orders["customer_id"].notna().all(),
    "items_have_positive_quantity": (order_items["quantity"] > 0).all(),
    "items_reference_orders": order_items["order_id"].isin(orders["order_id"]).all(),
}
quality_report = pd.DataFrame(
    [{"check": name, "status": "PASS" if ok else "FAIL"} for name, ok in checks.items()]
)

This example is deliberately transparent. In production, you may implement the same idea with Great Expectations, dbt tests, custom SQL checks, data contracts, or orchestration-level assertions. The tool matters less than the habit: make assumptions executable.

Common Pitfalls and Operational Lessons

Many data problems begin as small modeling shortcuts. A team stores JSON blobs without documenting which fields are required, then discovers that half the events do not contain campaign identifiers. Another team exposes raw normalized operational tables directly to analysts, then every dashboard implements revenue differently. A third team converts everything to Parquet but never validates uniqueness, so duplicate orders become trusted facts. These failures are avoidable when modeling, format selection, and quality checks are designed together.

PitfallWhy it happensProfessional response
Treating CSV as a databaseCSV is easy to create and exchange.Use CSV for exchange, but convert important recurring datasets to typed storage.
Ignoring grainFact tables are created without defining what one row means.Write the grain sentence before designing measures.
Over-normalizing analyticsOperational schemas are copied directly into BI.Build dimensional or semantic models for consumption.
Under-validating raw dataPipelines assume source systems never change.Add schema, null, uniqueness, referential-integrity, and reconciliation checks.
Losing lineageTransformations overwrite evidence.Preserve raw data and record how curated outputs were produced.
Choosing formats by fashionEngineers copy a popular architecture.Match format, model, and storage to workload and user needs.
Mixing business meaningsSimilar fields are joined as if they mean the same thing.Document definitions and reconcile source-of-truth decisions with stakeholders.

The operational lesson is that the earliest modeling decisions become the foundation for later storage, processing, orchestration, observability, governance, and AI work. A weak model creates weak pipelines. A strong model with no tests still creates risk. A tested model in the wrong format creates performance and cost problems. Professional data engineering requires all three: meaning, structure, and proof.

Exercises

The following exercises extend the guided lab. They are designed to make you practice engineering judgment rather than copy a single correct answer.

LevelExerciseTaskExpected artifact
BasicAdd a new order statusAdd a status called returned and update the validation logic.A passing quality report and a note explaining whether returned belongs before or after payment.
BasicForce a quality failureAdd a duplicate order row intentionally and confirm that the quality check fails.Notebook output or terminal output showing the failed check.
IntermediateAdd campaign contextAdd a dim_campaign table from JSON event metadata and connect it to the star schema.A short schema description and one SQL query by campaign channel.
AdvancedModel customer historyDesign a slowly changing dim_customer table that keeps customer segment history.A table design with surrogate key, natural key, effective dates, and current flag.
AdvancedCompare CSV and ParquetAdd 10,000 synthetic rows and compare file size and query time.A small benchmark table with observations.
TeamReview the modelSplit into finance, marketing, logistics, and platform roles and review the same design.A short decision record documenting trade-offs and unresolved questions.

Review Questions

These questions are meant for self-assessment, classroom discussion, or instructor review. A strong answer should explain the reasoning, not only name a tool.

QuestionWhat a strong answer should include
Why should a data engineer classify data shape before choosing a storage engine?The answer should connect structure, validation, schema drift, and workload fit.
When is CSV appropriate, and when should it be converted to Parquet or another typed format?The answer should distinguish human exchange from recurring analytical processing.
Why might TuranMart need both an operational ERD and an analytical star schema for the same order process?The answer should compare transaction integrity with analytical usability.
What does the grain of fact_order_line mean, and why is it dangerous to skip the grain sentence?The answer should explain row meaning, aggregation safety, and metric consistency.
Which quality checks would you run before allowing finance to use a daily revenue table?The answer should mention required fields, uniqueness, valid statuses, payment reconciliation, and freshness.
How does the bronze-silver-gold pattern reduce operational risk?The answer should explain raw preservation, standardized correction, business-ready consumption, and lineage.
What evidence proves that the Chapter 2 lab ran correctly?The answer should mention generated Parquet files, passing quality checks, DuckDB query output, and validation against expected results.

Chapter Summary

This chapter introduced the first design vocabulary of data engineering. You learned that data has structure before it has scale, and that structured, semi-structured, and unstructured data require different handling. You compared CSV, JSON, Avro, Parquet, and ORC as engineering contracts rather than isolated file extensions. You designed a normalized ERD for operational integrity and a dimensional star schema for analytical usability. Finally, you connected those ideas to quality checks and a guided lab that converts raw files into validated Parquet.

The next chapter moves from data design to the broader open-source ecosystem. You will learn how databases, object stores, processing engines, orchestration tools, quality frameworks, catalogs, and notebooks fit together into a practical platform. Chapter 2 gives you the core question to carry forward: what meaning must the data preserve, and how will the system prove that it preserved it?

References

Footnotes