Appendix E: Glossary of Data Engineering Terms

Introduction

This glossary provides clear, concise definitions of key terms and concepts used throughout this book. It serves as a quick reference to help you understand the language of data engineering.


A

ACID (Atomicity, Consistency, Isolation, Durability): A set of properties that guarantee reliable transaction processing in database systems. Atomicity ensures that transactions are all-or-nothing. Consistency ensures that a transaction brings the database from one valid state to another. Isolation ensures that concurrent transactions do not interfere with each other. Durability ensures that once a transaction is committed, it will remain so, even in the event of a system failure.
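
For example, atomicity can be demonstrated with Python's standard-library sqlite3 module; the table and the simulated failure below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# Alice still has 100: the partial update was rolled back, not half-applied.
print(conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone())
```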

Airflow (Apache Airflow): An open-source platform for programmatically authoring, scheduling, and monitoring workflows. It uses Directed Acyclic Graphs (DAGs) to define data pipelines.
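
A minimal DAG sketch, assuming Airflow 2.4+ (for the schedule argument); the DAG id, tasks, and schedule are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source system")

def load():
    print("writing data to the warehouse")

with DAG(dag_id="daily_sales", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```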

API (Application Programming Interface): A set of rules and protocols that allows one software application to interact with another. Data engineers frequently use APIs to extract data from external systems.
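
A sketch of pulling records over HTTP with the requests library; the endpoint, token, and parameters are hypothetical:

```python
import requests

response = requests.get(
    "https://api.example.com/v1/orders",          # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},  # placeholder credential
    params={"created_after": "2024-01-01", "page_size": 100},
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors
orders = response.json()     # most data APIs return JSON
print(len(orders), "orders fetched")
```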


B

Batch Processing: A method of processing data where a large volume of data is collected over a period of time and then processed all at once. This is in contrast to stream processing, where data is processed in real-time as it arrives.

Big Data: A term used to describe datasets that are so large or complex that traditional data processing applications are inadequate. Big Data is typically characterized by the “3 Vs”: Volume, Velocity, and Variety.


C

CAP Theorem: A fundamental theorem in distributed systems that states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: Consistency, Availability, and Partition Tolerance.

Cassandra (Apache Cassandra): A highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers with no single point of failure.

CDC (Change Data Capture): A technique used to identify and capture changes made to data in a database and then deliver those changes to a downstream process or system in real-time or near real-time.
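
Production CDC is usually log-based (tools like Debezium read the database's transaction log). As a simplified illustration, the query-based variant below polls an updated_at watermark column; the table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'alice', '2024-01-02'),"
             " (2, 'bob', '2024-01-05')")

def fetch_changes(conn, last_seen):
    # Pull only rows modified since the last watermark, oldest first.
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    return rows, (rows[-1][2] if rows else last_seen)

changes, watermark = fetch_changes(conn, "2024-01-01")
print(changes)  # both rows; the next poll starts from watermark '2024-01-05'
```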

Columnar Storage: A method of storing data where values from the same column are stored together, rather than storing entire rows together. This is highly efficient for analytical queries that only need to access a subset of columns. Parquet and ORC are examples of columnar file formats.
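
The benefit is easy to see with Parquet in pandas (assuming the pyarrow engine is installed): reading a single column never touches the bytes of the others.

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3],
                   "country": ["DE", "US", "JP"],
                   "revenue": [10.0, 25.5, 7.2]})
df.to_parquet("events.parquet")

# Only the 'revenue' column is read from disk; 'user_id' and 'country'
# are skipped entirely thanks to the columnar layout.
revenue_only = pd.read_parquet("events.parquet", columns=["revenue"])
print(revenue_only)
```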

Consistency: In the context of databases, consistency refers to the requirement that any transaction will bring the database from one valid state to another, maintaining all defined rules and constraints.


D

DAG (Directed Acyclic Graph): A graph structure where nodes are connected by directed edges, and there are no cycles (i.e., you cannot start at a node and follow the edges back to that same node). In data engineering, DAGs are commonly used to represent workflows in tools like Apache Airflow.
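
Python's standard library can order a DAG directly; the task names below are illustrative:

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# A pipeline DAG as a mapping of task -> set of upstream dependencies.
dag = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# static_order() yields a valid execution order; a cycle would raise CycleError.
print(list(TopologicalSorter(dag).static_order()))
# e.g. ['extract', 'clean', 'aggregate', 'report']
```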

Data Catalog: A centralized repository that provides metadata about an organization’s data assets, including information about data sources, schemas, lineage, and ownership.

Data Lake: A centralized repository that allows you to store all your structured and unstructured data at any scale. Data lakes typically use object storage (e.g., Amazon S3) and do not enforce a schema on write.

Data Lineage: The ability to track the flow of data from its origin through all transformations and movements to its final destination. This is crucial for debugging, auditing, and understanding data dependencies.

Data Mart: A subset of a data warehouse that is focused on a specific business line or team. Data marts are designed to provide quick access to relevant data for a particular group of users.

Data Pipeline: A series of data processing steps where the output of one step is the input to the next. Data pipelines are used to move and transform data from source systems to target systems.

Data Warehouse: A central repository of integrated data from one or more disparate sources. Data warehouses store current and historical data and are optimized for analytical queries rather than transactional operations.

Denormalization: The process of intentionally introducing redundancy into a database schema to improve read performance. This is the opposite of normalization and is common in OLAP systems.

Dimensional Modeling: A data modeling technique used in data warehousing that structures data into fact tables (containing measurable events) and dimension tables (containing descriptive attributes).

Distributed System: A system in which components located on networked computers communicate and coordinate their actions by passing messages. Many modern data engineering tools (e.g., Spark, Kafka) are distributed systems.


E

ELT (Extract, Load, Transform): A data integration process where data is first extracted from source systems, then loaded into a target system (usually a data warehouse or data lake), and finally transformed within that target system. This is in contrast to ETL, where transformation happens before loading.

ETL (Extract, Transform, Load): A traditional data integration process where data is extracted from source systems, transformed (cleaned, aggregated, etc.) in a staging area, and then loaded into a target system.
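
The two patterns differ mainly in where the transform step runs. A toy ETL flow in plain Python, assuming a local sales.csv with name and amount columns:

```python
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Cleaning happens *before* loading -- the defining trait of ETL.
    return [{"name": r["name"].strip().lower(), "amount": float(r["amount"])}
            for r in rows if r["amount"]]

def load(rows, conn):
    conn.executemany("INSERT INTO sales VALUES (:name, :amount)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
load(transform(extract("sales.csv")), conn)
```

In ELT, transform() would instead run as SQL inside the warehouse after the raw rows are loaded.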

Event Sourcing: A pattern where state changes are logged as a sequence of events. Instead of storing just the current state, the entire history of changes is stored, allowing you to reconstruct past states.
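
A minimal sketch with an illustrative account ledger: state is never stored directly, only derived by replaying events.

```python
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

def apply(balance, event):
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance

# Replaying a prefix of the log reconstructs any past state.
balance = 0
for event in events:
    balance = apply(balance, event)
print(balance)  # 75
```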


F

Fact Table: In dimensional modeling, a fact table contains the quantitative data (measures) for analysis, along with foreign keys to dimension tables. Examples of measures include sales revenue, quantity sold, and profit.

Flink (Apache Flink): An open-source framework for building distributed, high-performance, always-available, and accurate stream processing applications.


H

Hadoop (Apache Hadoop): An open-source framework for distributed storage and processing of large datasets using the MapReduce programming model. While its popularity has waned in favor of Spark, it remains an important part of the big data ecosystem.

Hive (Apache Hive): A data warehouse software built on top of Hadoop that provides SQL-like query capabilities for data stored in HDFS or other compatible file systems.

Hudi (Apache Hudi): An open-source data management framework that provides record-level insert, update, and delete capabilities on data lakes, enabling incremental data processing.


I

Iceberg (Apache Iceberg): An open table format for huge analytic datasets, designed to improve on the performance and reliability of traditional data lake formats.

Idempotency: A property of an operation where applying it multiple times has the same effect as applying it once. In data pipelines, idempotency is crucial for ensuring that re-running a failed job does not produce duplicate or incorrect data.
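
A common way to achieve this is a keyed upsert, sketched here with SQLite; the table is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (day TEXT PRIMARY KEY, amount REAL)")

def load_daily_total(conn, day, amount):
    # Re-running the same load overwrites the keyed row instead of duplicating it.
    conn.execute("INSERT OR REPLACE INTO totals VALUES (?, ?)", (day, amount))
    conn.commit()

load_daily_total(conn, "2024-01-01", 42.0)
load_daily_total(conn, "2024-01-01", 42.0)  # a retry after a failure
print(conn.execute("SELECT COUNT(*) FROM totals").fetchone())  # (1,) -- no duplicate
```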


K

Kafka (Apache Kafka): A distributed event streaming platform capable of handling trillions of events a day. Kafka is widely used for building real-time data pipelines and streaming applications.
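
A minimal producer/consumer sketch, assuming the kafka-python package and a broker on localhost:9092; the topic name is illustrative:

```python
from kafka import KafkaProducer, KafkaConsumer

# Produce one event to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", key=b"user-42", value=b'{"page": "/home"}')
producer.flush()

# Consume events from the beginning of the topic.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.key, message.value)
    break
```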

Kafka Connect: A framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems.

Kafka Streams: A client library for building applications and microservices where the input and output data are stored in Kafka clusters.


L

Lakehouse: A modern data architecture that combines the best features of data lakes and data warehouses. Lakehouses provide the low-cost, flexible storage of data lakes with the ACID transactions and schema enforcement of data warehouses. Delta Lake, Iceberg, and Hudi are examples of table formats that enable the lakehouse architecture.

Lambda Architecture: A data processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. It consists of three layers: batch, speed (stream), and serving.


M

MapReduce: A programming model for processing large datasets with a parallel, distributed algorithm on a cluster. It consists of two main functions: Map (which filters and sorts data) and Reduce (which performs a summary operation).
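
The classic word-count example, sketched in plain Python to show the two phases:

```python
from collections import Counter
from itertools import chain

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each line is independently turned into (word, 1) pairs,
# which is why this step can run in parallel across a cluster.
mapped = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# Shuffle + reduce phase: pairs are grouped by key and their counts summed.
counts = Counter()
for word, one in mapped:
    counts[word] += one

print(counts.most_common(2))  # [('the', 3), ('fox', 2)]
```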

Metadata: Data about data. In data engineering, metadata includes information such as table schemas, column names, data types, data lineage, and data quality metrics.

Milvus: An open-source vector database built for scalable similarity search and AI applications. It is designed to handle embedding vectors generated by machine learning models.


N

Normalization: The process of organizing data in a database to reduce redundancy and improve data integrity. This typically involves dividing large tables into smaller ones and defining relationships between them.

NoSQL: A category of database management systems that do not use the traditional relational (SQL) model. NoSQL databases are designed for specific use cases such as document storage (MongoDB), wide-column storage (Cassandra), key-value storage (Redis), and graph storage (Neo4j).


O

Object Storage: A storage architecture that manages data as objects, as opposed to file systems (which manage data as a file hierarchy) or block storage (which manages data as blocks within sectors). Amazon S3, Google Cloud Storage, and MinIO are examples of object storage systems.

OLAP (Online Analytical Processing): A category of software tools that provides analysis of data stored in a database. OLAP systems are optimized for complex queries and aggregations, typically used for business intelligence and reporting.

OLTP (Online Transaction Processing): A category of data processing that focuses on transaction-oriented tasks. OLTP systems are optimized for fast, concurrent read and write operations, typically used for operational applications.

ORC (Optimized Row Columnar): A columnar storage file format optimized for Hadoop workloads. It provides efficient compression and encoding schemes.


P

Parquet: A columnar storage file format that is widely used in the Hadoop ecosystem. It is designed to be efficient for both storage and processing, supporting complex nested data structures.

Partition: A division of a database table or dataset into smaller, more manageable pieces. Partitioning can improve query performance by allowing the database to scan only the relevant partitions.
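
For example, with pandas and the pyarrow engine, writing with partition_cols lays files out so that filters on the partition column skip whole directories:

```python
import pandas as pd

df = pd.DataFrame({"event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
                   "clicks": [10, 3, 7]})
df.to_parquet("events/", partition_cols=["event_date"])

# Produces events/event_date=2024-01-01/... and events/event_date=2024-01-02/...
# A query filtered on event_date reads only the matching partition.
```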

Pub/Sub (Publish-Subscribe): A messaging pattern where senders (publishers) send messages to a topic without knowledge of the receivers (subscribers), and subscribers receive messages from topics they are interested in.
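
A toy in-memory broker shows the decoupling: publishers and subscribers share only a topic name.

```python
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher never learns who, if anyone, is listening.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
broker.subscribe("orders", lambda m: print("billing saw:", m))
broker.subscribe("orders", lambda m: print("shipping saw:", m))
broker.publish("orders", {"order_id": 7})  # both subscribers receive it
```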


R

RAG (Retrieval-Augmented Generation): A technique in natural language processing that combines information retrieval with large language models to generate more accurate and contextually relevant responses.

Redis: An open-source, in-memory data structure store used as a database, cache, and message broker. It is known for its high performance and support for various data structures.

Replication: The process of copying and maintaining database objects (such as tables) in multiple databases that make up a distributed database system. Replication is used for high availability and disaster recovery.


S

Schema: The structure or blueprint of a database, defining how data is organized, including tables, columns, data types, and relationships.

Schema-on-Read: A data management approach where the schema is applied when the data is read, rather than when it is written. This is common in data lakes.

Schema-on-Write: A traditional data management approach where the schema is defined and enforced when data is written to the database. This is common in relational databases.

SCD (Slowly Changing Dimension): A dimension in a data warehouse that changes slowly over time, rather than changing on a regular schedule. There are several types of SCDs (Type 1, Type 2, Type 3) that define how historical changes are tracked.
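
A sketch of the Type 2 approach, which preserves history by versioning rows; the dimension fields are illustrative:

```python
from datetime import date

dim_customer = [
    {"customer_id": 1, "city": "Berlin",
     "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True},
]

def scd2_update(rows, customer_id, new_city, change_date):
    # Close the current version instead of overwriting it...
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"], row["is_current"] = change_date, False
    # ...then append a new current version.
    rows.append({"customer_id": customer_id, "city": new_city,
                 "valid_from": change_date, "valid_to": None, "is_current": True})

scd2_update(dim_customer, 1, "Munich", date(2024, 6, 1))
# Both rows survive, so "where did customer 1 live in 2023?" stays answerable.
```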

Sharding: A database architecture pattern where a large database is partitioned into smaller, more manageable pieces called shards, each of which is hosted on a separate server.
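
Routing is often a stable hash of the shard key; the shard hosts below are illustrative:

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(key: str) -> str:
    # A stable hash (unlike Python's built-in hash(), which varies per
    # process) keeps routing consistent across services and restarts.
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("user-42"))  # the same key always routes to the same shard
```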

Spark (Apache Spark): A unified analytics engine for large-scale data processing, providing high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
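
A minimal PySpark sketch, assuming the pyspark package; the input path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

sales = spark.read.parquet("s3://my-bucket/sales/")  # hypothetical path
daily = (sales
         .groupBy("sale_date")
         .agg(F.sum("amount").alias("total_amount")))
daily.show(5)  # the work is distributed across the cluster's executors
```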

SQL (Structured Query Language): The standard language for managing and querying relational databases.

Star Schema: A type of dimensional model where a central fact table is surrounded by dimension tables, forming a star-like structure. It is the simplest and most common dimensional model.

Streaming: The continuous processing of data in real-time or near real-time as it is generated, as opposed to batch processing.


T

Throughput: The amount of data that can be processed or transferred in a given amount of time. In data engineering, high throughput is often a key performance metric.

Transformation: The process of converting data from one format or structure to another. This is the “T” in ETL and ELT.


V

Vector Database: A specialized database designed to store and query high-dimensional vector embeddings, which are numerical representations of data used in machine learning and AI applications.
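
The core operation, nearest-neighbor search by cosine similarity, sketched naively with NumPy; real systems such as Milvus use approximate indexes to scale far beyond a brute-force scan:

```python
import numpy as np

vectors = np.random.default_rng(0).normal(size=(1000, 64))  # stored embeddings
query = np.random.default_rng(1).normal(size=64)            # query embedding

# Cosine similarity of the query against every stored vector.
norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
similarity = vectors @ query / norms

top_k = np.argsort(similarity)[-5:][::-1]  # indices of the 5 closest vectors
print(top_k)
```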

Velocity: One of the “3 Vs” of Big Data, referring to the speed at which data is generated and must be processed.

Volume: One of the “3 Vs” of Big Data, referring to the sheer amount of data that is generated and stored.


W

Workflow Orchestration: The automated coordination and management of data workflows, ensuring that tasks are executed in the correct order and handling dependencies, retries, and failures.


This glossary covers the most important terms you will encounter in your data engineering journey. Refer back to it whenever you need a quick refresher on a concept.