
Chapter 2: Core Concepts of Data

In the world of data engineering, data is our raw material. To build robust, efficient, and scalable data systems, we must have a deep understanding of this raw material in all its forms. This chapter is dedicated to the fundamental concepts of data that every data engineer must master. We will move beyond a superficial understanding and dive deep into the nuances of how data is structured, formatted, and modeled. We will explore the trade-offs between different data formats, the principles of effective data modeling, and the critical importance of data quality and governance. A solid grasp of these core concepts is not just academic; it is the foundation upon which all practical data engineering work is built. Without it, you will be like a chef who does not understand their ingredients.

2.1 The Spectrum of Data: From Structured to Unstructured

Data is not a monolith. It exists on a spectrum of structure, from the highly organized and predictable to the completely chaotic and unstructured. As a data engineer, you will be expected to work with data from all points on this spectrum. Understanding the characteristics, challenges, and opportunities of each type of data is the first step in designing effective data pipelines.

Structured Data: The World of Rows and Columns

What is it?

Structured data is the most traditional and well-understood form of data. It is highly organized and conforms to a rigid, predefined schema. Think of a table in a relational database, an Excel spreadsheet, or a CSV file. In structured data, we have clear, consistent columns (attributes) and rows (records). Each column has a specific data type (e.g., integer, string, date), and every row must adhere to this structure. The schema acts as a contract, guaranteeing that the data will be in a predictable format.

Examples:

Advantages:

Disadvantages:

Semi-Structured Data: The Best of Both Worlds?

What is it?

Semi-structured data is the middle ground between the rigid world of structured data and the chaotic world of unstructured data. It does not conform to a formal data model like a relational database table, but it does contain tags, keys, or other markers that create a self-describing, hierarchical structure. This makes it more flexible than structured data but still provides a degree of organization that is missing in unstructured data.
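For illustration, here is a minimal sketch in plain Python (the event records are hypothetical) of what makes semi-structured data self-describing: each record carries its own field names, and records do not have to share an identical shape.

```python
import json

# Two event records: both are valid JSON, but they do not share the same fields.
raw_events = [
    '{"user_id": 42, "event": "page_view", "url": "/pricing"}',
    '{"user_id": 7, "event": "purchase", "items": [{"sku": "A-1", "qty": 2}], "coupon": null}',
]

for line in raw_events:
    record = json.loads(line)   # the keys themselves describe the structure
    print(record["event"], sorted(record.keys()))
```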

Examples:

Advantages:

Disadvantages:

Unstructured Data: The Wild West

What is it?

Unstructured data is data that has no predefined data model or is not organized in a predefined manner. It is the vast, messy, and chaotic majority of the world’s data. It is estimated that over 80% of all enterprise data is unstructured. This data is often qualitative rather than quantitative and does not fit neatly into rows and columns.

Examples:

Advantages:

Disadvantages:

As a data engineer, your job is to build pipelines that can handle this entire spectrum of data. You might build a pipeline that extracts structured data from a relational database, combines it with semi-structured log data from a web server, and uses a machine learning model to extract sentiment from unstructured customer reviews. The ability to work with all three types of data is a key skill for the modern data engineer.

2.2 Data Formats: The Language of Data Storage

Once you understand the structure of your data, the next step is to decide how to store it on disk. The choice of data format is one of the most important decisions a data engineer can make. It has a profound impact on storage costs, query performance, and compatibility with other tools in the ecosystem. A poor choice of format can lead to slow pipelines and high cloud bills, while a good choice can make your systems fast, efficient, and cost-effective. In this section, we will take a deep dive into the most common data formats, exploring their internal structure, their strengths and weaknesses, and when to use them.

Row-Based vs. Columnar Formats: A Fundamental Divide

Before we look at specific formats, we need to understand a fundamental distinction in how data can be laid out on disk: row-based vs. columnar.

Row-Based Formats

In a row-based format, all the data for a single row is stored together on disk. This is the traditional way of storing data, used by most relational databases and simple formats like CSV. It is efficient when you read or write whole records at a time, which is exactly what transactional workloads do.

Columnar Formats

In a columnar format, all the data for a single column is stored together on disk. This lets analytical engines read only the columns a query touches and compress each column very effectively, which is why columnar formats dominate analytical workloads.
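A toy illustration of the difference in plain Python, not tied to any particular file format: the same three records laid out row-wise and column-wise. Summing one attribute only needs a single contiguous list in the columnar layout, whereas the row layout forces you to touch every record.

```python
# Row-based layout: each record's values are stored together.
rows = [
    {"order_id": 1, "customer": "ana",  "amount": 30.0},
    {"order_id": 2, "customer": "bob",  "amount": 12.5},
    {"order_id": 3, "customer": "cara", "amount": 99.9},
]

# Columnar layout: each attribute's values are stored together.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["ana", "bob", "cara"],
    "amount":   [30.0, 12.5, 99.9],
}

# Analytical query: total amount. The columnar layout reads one list;
# the row layout must visit every record even though it needs one field.
total_from_rows = sum(r["amount"] for r in rows)
total_from_columns = sum(columns["amount"])
assert total_from_rows == total_from_columns
```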

A Deep Dive into Common Data Formats

Now, let’s explore the specific data formats that you will encounter as a data engineer.

CSV (Comma-Separated Values): The Lingua Franca

CSV is the simplest and most ubiquitous data format. It is a text-based, row-oriented format where each line is a data record and each record consists of one or more fields, separated by commas.

Strengths:

Weaknesses:

When to use it: CSV is a good choice for small datasets, for exporting data to be used by non-technical users in tools like Excel, and for simple data exchange between systems. It should generally be avoided for large-scale data storage and analytics in a data lake or data warehouse.
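As a small example using only Python's standard library, this shows both CSV's simplicity and one of its classic pitfalls: fields containing commas must be quoted, and every value comes back as an untyped string.

```python
import csv
import io

rows = [
    {"product": "laptop", "description": "13-inch, 16 GB RAM", "price": "999.00"},
    {"product": "mouse",  "description": "wireless",           "price": "19.99"},
]

# Write: the csv module quotes the field that contains a comma.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["product", "description", "price"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())

# Read back: CSV carries no type information, so casting is up to the consumer.
buf.seek(0)
for record in csv.DictReader(buf):
    print(record["product"], float(record["price"]))
```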

JSON (JavaScript Object Notation): The Language of the Web

JSON is a text-based, semi-structured format that has become the standard for data exchange on the web. It is the language of modern APIs.

Strengths:

Weaknesses:

When to use it: JSON is the perfect choice for API responses, configuration files, and for storing semi-structured data in NoSQL document databases like MongoDB. It is also a common format for raw data ingested into a data lake before it is transformed into a more efficient format.
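One common engineering task is turning nested JSON, such as an API response, into a flat table before analysis. A minimal sketch using pandas, with a hypothetical response shape:

```python
import pandas as pd

# Hypothetical API response: a list of records containing nested objects.
response = {
    "orders": [
        {"id": 1, "customer": {"name": "Ana", "country": "PT"}, "total": 30.0},
        {"id": 2, "customer": {"name": "Bob", "country": "US"}, "total": 12.5},
    ]
}

# json_normalize flattens nested keys into dotted column names.
df = pd.json_normalize(response["orders"])
print(df[["id", "customer.name", "customer.country", "total"]])
```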

Apache Parquet: The King of Columnar

Parquet is a binary, columnar storage format that was designed from the ground up for efficient data storage and analytics. It is the most common data format used in modern data lakes and is a cornerstone of the big data ecosystem.

Strengths:

Weaknesses:

When to use it: Parquet is the recommended format for storing large datasets in a data lake for analytical purposes. If you are building a data platform on Spark, Presto, or any other modern query engine, you should be using Parquet.
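A brief sketch with pandas (which delegates to pyarrow or fastparquet under the hood) showing the columnar benefit: a reader can ask for just the columns it needs, and the wide text column is never loaded from disk.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "order_id": range(1_000),
        "customer": ["c%d" % (i % 50) for i in range(1_000)],
        "amount": [float(i) for i in range(1_000)],
        "comment": ["free text " * 5] * 1_000,
    }
)

# Write a Parquet file (requires pyarrow or fastparquet to be installed).
df.to_parquet("orders.parquet")

# Read only the columns an analytical query needs.
slim = pd.read_parquet("orders.parquet", columns=["order_id", "amount"])
print(slim.head())
```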

Apache Avro: The Champion of Schema Evolution

Avro is a binary, row-based format that was designed with a strong focus on schema evolution and data serialization. It is a popular choice for streaming data pipelines and for data exchange between services.

Strengths:

Weaknesses:

When to use it: Avro is the ideal choice for serializing events in a streaming data pipeline, especially when using a schema registry with Apache Kafka. It is also a good choice for storing the raw, ingested data in a data lake before it is transformed into Parquet for analytics.
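A minimal serialization sketch, assuming the fastavro package is available; the schema is a hypothetical click event, defined explicitly and embedded alongside the data.

```python
from fastavro import parse_schema, reader, writer

schema = parse_schema(
    {
        "type": "record",
        "name": "ClickEvent",
        "fields": [
            {"name": "user_id", "type": "long"},
            {"name": "url", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    }
)

events = [{"user_id": 42, "url": "/pricing", "ts": 1_700_000_000}]

# Write records with the schema stored in the file header.
with open("events.avro", "wb") as out:
    writer(out, schema, events)

# Read them back; the reader recovers the schema from the file itself.
with open("events.avro", "rb") as src:
    for event in reader(src):
        print(event)
```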

Apache ORC (Optimized Row Columnar): The Other Columnar Format

ORC is another binary, columnar storage format that originated in the Hadoop ecosystem. It is very similar to Parquet in its goals and features.

Strengths:

Weaknesses:

When to use it: ORC is a great choice if you are working heavily within the Hadoop and Hive ecosystem. For new projects outside that ecosystem, Parquet is usually the default choice.

Choosing the Right Format: A Practical Guide

| Format  | Type              | Best Use Case                            | Compression | Schema Evolution | Ecosystem                |
|---------|-------------------|------------------------------------------|-------------|------------------|--------------------------|
| CSV     | Row-based, Text   | Small datasets, human-readable exports   | Poor        | None             | Universal                |
| JSON    | Row-based, Text   | APIs, semi-structured data               | Moderate    | Flexible         | Universal                |
| Avro    | Row-based, Binary | Streaming data (Kafka), schema evolution | Good        | Excellent        | Strong in Kafka/Hadoop   |
| Parquet | Columnar, Binary  | Data lakes, analytical queries (Spark)   | Excellent   | Good             | Strong in Spark/Big Data |
| ORC     | Columnar, Binary  | Data lakes, analytical queries (Hive)    | Excellent   | Good             | Strong in Hive/Hadoop    |

A Common Data Lake Strategy:

A common and effective strategy for a data lake is to use a multi-stage approach with different data formats at each stage:

  1. Bronze Layer (Raw Data): Ingest raw data from source systems and store it in its original format or in a flexible format like Avro. This layer is all about capturing the data as-is, with a focus on schema evolution.

  2. Silver Layer (Cleaned and Conformed Data): Clean, deduplicate, and conform the raw data, and store it in a highly efficient columnar format like Parquet. This layer provides a clean, queryable source of truth for the entire organization.

  3. Gold Layer (Aggregated and Business-Ready Data): Create aggregated data marts and business-level tables from the silver layer, also stored in Parquet. This layer is designed for specific business use cases and is what data analysts and BI tools will query.
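To make the bronze-to-silver step concrete, here is a minimal pandas sketch under assumed paths (a hypothetical lake/bronze/ prefix holding raw JSON-lines files, and lake/silver/ receiving cleaned Parquet); a production pipeline would typically use Spark or a similar engine for this work.

```python
from pathlib import Path

import pandas as pd

bronze_dir = Path("lake/bronze/orders")   # raw JSON-lines files, as ingested
silver_dir = Path("lake/silver/orders")   # cleaned, deduplicated Parquet
silver_dir.mkdir(parents=True, exist_ok=True)

# Read every raw file, keeping the data exactly as it arrived.
raw = pd.concat(
    (pd.read_json(path, lines=True) for path in bronze_dir.glob("*.json")),
    ignore_index=True,
)

# Clean and conform: drop duplicates, enforce types, standardize column names.
clean = (
    raw.drop_duplicates(subset=["order_id"])
       .assign(order_ts=lambda d: pd.to_datetime(d["order_ts"]))
       .rename(columns=str.lower)
)

clean.to_parquet(silver_dir / "orders.parquet", index=False)
```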

By understanding the deep trade-offs between these different data formats, you can design a data platform that is not only functional but also highly performant and cost-effective.

2.3 Data Modeling: The Blueprint for Your Data

Data modeling is the process of creating a conceptual representation of data objects, the associations between them, and the rules that govern them. It is one of the most critical skills for a data engineer. A well-designed data model is like a good blueprint for a house: it ensures that the structure is sound, that everything has its proper place, and that it can be easily extended in the future. A poorly designed data model, on the other hand, can lead to a host of problems, including poor query performance, data integrity issues, and difficulty in adapting to new business requirements. In this section, we will explore the key concepts and techniques of data modeling, from the traditional relational world to the modern analytical world.

Entity-Relationship (ER) Modeling: The Foundation of Relational Design

Entity-Relationship (ER) modeling is the cornerstone of relational database design. It is a top-down approach to database design that starts by identifying the key business entities and the relationships between them. The output of this process is an Entity-Relationship Diagram (ERD), which is a visual representation of the data model.

Key Concepts of ER Modeling:

Figure 2.1. An entity relationship diagram showing the core entities and their relationships.
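To show how entities and relationships translate into tables, here is a minimal sketch using Python's built-in sqlite3 module with hypothetical Customer and Order entities (the figure's actual entities may differ); the foreign key expresses a one-to-many relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Entity: Customer
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT UNIQUE
    );

    -- Entity: Order, related to Customer (one customer places many orders)
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
        order_date  TEXT NOT NULL,
        total       REAL NOT NULL
    );
    """
)
```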

Normalization: The Art of Reducing Redundancy

Normalization is the process of organizing the columns and tables of a relational database to minimize data redundancy. The goal is to ensure that each piece of data is stored in exactly one place. This has several benefits:

Normalization is achieved by following a series of rules called normal forms. The most common are:

  1. First Normal Form (1NF): every column holds atomic values and every row is uniquely identifiable.

  2. Second Normal Form (2NF): the table is in 1NF and every non-key column depends on the whole primary key, not just part of it.

  3. Third Normal Form (3NF): the table is in 2NF and non-key columns do not depend on other non-key columns.

For most OLTP systems, third normal form (3NF) is a sensible target: it offers a good balance between data integrity and query performance.

Denormalization: Trading Integrity for Speed

While normalization is crucial for OLTP systems, it can be detrimental to the performance of analytical queries. A highly normalized schema often requires many joins to retrieve data, which can be slow and complex. Denormalization is the process of deliberately introducing redundancy into a database design to improve query performance. This is a common practice in data warehousing and analytical databases.

For example, in a normalized e-commerce database, to get the name of the category for a product in an order, you might have to join order_items to products to categories. In a denormalized analytical database, you might add the category_name directly to the sales fact table, eliminating the need for a join.
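A small sketch of the two query shapes described above; the table and column names are illustrative.

```python
# Normalized schema: three joins to reach the category name.
normalized_query = """
    SELECT c.category_name, SUM(oi.quantity * oi.unit_price) AS revenue
    FROM order_items AS oi
    JOIN products    AS p ON p.product_id  = oi.product_id
    JOIN categories  AS c ON c.category_id = p.category_id
    GROUP BY c.category_name;
"""

# Denormalized fact table: the category name is stored on every sales row,
# so the same question becomes a single-table scan.
denormalized_query = """
    SELECT category_name, SUM(quantity * unit_price) AS revenue
    FROM fact_sales
    GROUP BY category_name;
"""
```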

The trade-offs of denormalization:

Dimensional Modeling: Designing for Analytics

Dimensional modeling is a data modeling technique specifically designed for building data warehouses and analytical databases. It is a highly denormalized approach that organizes data in a way that is intuitive for business users and optimized for analytical queries. The two key components of a dimensional model are facts and dimensions.

The most common type of dimensional model is the star schema.

The Star Schema

In a star schema, a central fact table is surrounded by its associated dimension tables, creating a structure that looks like a star. The dimension tables are highly denormalized and contain a wide range of attributes.
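A minimal star-schema sketch with sqlite3 and hypothetical table names: one fact table of sales measures surrounded by denormalized dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions: wide, denormalized descriptive attributes.
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT, brand TEXT);
    CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);

    -- Fact table: foreign keys to each dimension plus numeric measures.
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date (date_key),
        product_key INTEGER REFERENCES dim_product (product_key),
        store_key   INTEGER REFERENCES dim_store (store_key),
        quantity    INTEGER,
        revenue     REAL
    );
    """
)
```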

Advantages of the Star Schema:

The Snowflake Schema

A snowflake schema is a variation of the star schema where the dimension tables are normalized. For example, the Product dimension might be normalized into Product, Category, and Brand tables. This creates a more complex, snowflake-like structure.

Advantages of the Snowflake Schema:

Disadvantages of the Snowflake Schema:

In modern data warehousing, with the low cost of storage and the power of modern query engines, the star schema is almost always the preferred approach. The simplicity and performance benefits of the star schema usually outweigh the storage savings of the snowflake schema.

Data modeling is a deep and nuanced topic, but by mastering these fundamental concepts of ER modeling, normalization, and dimensional modeling, you will be well-equipped to design data systems that are both robust and performant, for both transactional and analytical workloads.

2.4 Data Storage Paradigms: Choosing the Right Home for Your Data

Beyond the logical design of your data model, a data engineer must also make critical decisions about the physical storage of the data. Different business problems require different storage solutions. A system designed to handle millions of small, fast transactions has very different requirements from a system designed to run complex analytical queries over petabytes of historical data. Understanding the different data storage paradigms is essential for building a data platform that is both performant and cost-effective.

OLTP (Online Transaction Processing): The Engine of Business

What is it?

OLTP systems are the workhorses of the digital economy. They are designed to power the day-to-day operational applications that run the business. Think of an e-commerce site processing orders, a banking system handling ATM transactions, or an airline reservation system booking flights. These systems are characterized by a large number of concurrent users performing a high volume of short, simple transactions (inserts, updates, and deletes).

Key Characteristics:

Technology Examples: PostgreSQL, MySQL, Oracle, SQL Server.

OLAP (Online Analytical Processing): The Engine of Insight

What is it?

While OLTP systems are about running the business, OLAP systems are about understanding the business. They are designed for complex analytical queries over large volumes of historical data. The goal is to support business intelligence, reporting, and data-driven decision-making. Instead of processing individual transactions, OLAP systems are designed to aggregate and analyze data from many different sources.

Key Characteristics:

Technology Examples: Amazon Redshift, Google BigQuery, Snowflake, Apache Druid.
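To make the contrast concrete, here are the two query shapes side by side, as illustrative SQL embedded in Python; the table names are hypothetical and echo the star schema sketched earlier.

```python
# OLTP: a short transaction touching a handful of rows by key.
oltp_statements = [
    "UPDATE accounts SET balance = balance - 50 WHERE account_id = 1234;",
    "UPDATE accounts SET balance = balance + 50 WHERE account_id = 5678;",
]

# OLAP: a long-running aggregation scanning years of history.
olap_query = """
    SELECT d.year, s.region, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_date   AS d USING (date_key)
    JOIN dim_store  AS s USING (store_key)
    GROUP BY d.year, s.region;
"""
```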

Figure 2.2. A comparison of OLTP and OLAP workloads.

Data Warehouse: The Single Source of Truth for Analytics

What is it?

A data warehouse is a centralized repository of integrated data from one or more disparate sources. It is a specialized type of OLAP system that is designed to be the single source of truth for an organization’s analytical data. The data in a data warehouse is cleaned, transformed, and structured for a specific purpose: to make it easy to query and analyze.

Key Characteristics:

Data Lake: The Repository for Raw Data

What is it?

A data lake is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. Unlike a data warehouse, which requires you to define a schema before you can load data (schema-on-write), a data lake allows you to store data in its raw, native format and apply a schema when you read the data (schema-on-read). This provides immense flexibility.

Key Characteristics:

However, the flexibility of data lakes can also be a drawback. Without proper governance and metadata management, a data lake can quickly turn into a “data swamp,” a messy and unusable repository of data that no one understands.

The Lakehouse: The Best of Both Worlds

What is it?

The lakehouse is a modern data architecture that combines the best features of data lakes and data warehouses. It aims to provide the low-cost, flexible storage of a data lake with the ACID transactions, data governance, and query performance of a data warehouse. This is achieved by adding a transactional metadata layer on top of the data lake.
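As one concrete, hedged illustration, the open-source deltalake package (the delta-rs bindings for Delta Lake, one of the technologies in this space) lets you write a transactional table directly on top of files on local disk or object storage; the exact API may differ between versions, so treat this as a sketch rather than a reference.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"order_id": [1, 2], "amount": [30.0, 12.5]})

# Each write is an ACID transaction recorded in the table's transaction log.
write_deltalake("lake/silver/orders_delta", df, mode="append")

# Readers see a consistent snapshot and can inspect the table's version history.
table = DeltaTable("lake/silver/orders_delta")
print(table.version())
print(table.to_pandas())
```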

Key Technologies:

Key Characteristics:

Figure 2.3. The main paradigms used to store and organize data.

2.5 Data Quality and Governance: The Foundation of Trust

Data is only valuable if it is trustworthy. If the business does not trust the data, they will not use it to make decisions, and all the effort spent on building data pipelines will be wasted. Data quality and data governance are the disciplines that ensure the data is accurate, consistent, and used in a compliant and secure manner. For a data engineer, these are not afterthoughts; they are a core part of the job.

The Dimensions of Data Quality

Data quality is a multi-faceted concept. It is often defined by a set of key dimensions:

Implementing a Data Quality Strategy

Ensuring data quality is not a one-time fix; it is a continuous process that must be embedded throughout the data engineering lifecycle.

A multi-layered approach:

  1. At the Source: Whenever possible, fix data quality issues at the source. This is the most effective way to prevent bad data from propagating through your systems.

  2. In the Pipeline: Build automated data quality checks and validation rules into your data pipelines. Tools like Great Expectations and dbt tests can be used to define and enforce data quality rules; a minimal hand-rolled example is sketched after this list.

  3. Monitoring and Alerting: Set up monitoring and alerting to detect data quality issues as soon as they occur. This allows you to quickly identify and fix problems before they impact downstream users.

  4. Data Quality Dashboards: Create dashboards to track data quality metrics over time. This provides visibility into the health of your data and helps to build trust with your stakeholders.
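Here is a minimal, hand-rolled sketch of in-pipeline checks with pandas (item 2 above); in practice you would more likely express these as Great Expectations suites or dbt tests, but the idea is the same: validate explicit expectations before a batch is published.

```python
import pandas as pd


def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of data quality failures; an empty list means the batch passes."""
    failures = []
    if df["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    if df["amount"].lt(0).any():
        failures.append("amount contains negative values")
    if df["customer_id"].isna().any():
        failures.append("customer_id has missing values")
    return failures


batch = pd.DataFrame(
    {"order_id": [1, 2, 2], "customer_id": [10, None, 12], "amount": [30.0, -1.0, 12.5]}
)

problems = validate_orders(batch)
if problems:
    raise ValueError("Data quality check failed: " + "; ".join(problems))
```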

Data Governance: Managing Data as a Strategic Asset

Data governance is a broader discipline that encompasses data quality but also includes policies and procedures for how data is managed, accessed, and used. It is about treating data as a strategic asset.

Key pillars of data governance:

Figure 2.4. Data lineage from source systems through downstream transformations and consumers.

Chapter Summary

In this chapter, we have taken a comprehensive tour of the core concepts of data that every data engineer must understand. We have explored the spectrum of data from structured to unstructured, and we have dived deep into the trade-offs of different data formats like CSV, JSON, Parquet, and Avro. We have also covered the critical skill of data modeling, learning about ER modeling and normalization for OLTP systems, and dimensional modeling and the star schema for OLAP systems. We have examined the different data storage paradigms, from the traditional data warehouse to the modern lakehouse. Finally, we have discussed the crucial importance of data quality and governance in building trustworthy data systems.

With this strong conceptual foundation, you are now ready to start exploring the specific tools and technologies that you will use to implement these concepts in the real world. In the next chapter, we will begin this practical journey by exploring the vibrant open-source ecosystem that powers modern data engineering.