In the previous chapter, we laid the foundation of our modern data platform with the data lake, a highly scalable and cost-effective repository for all our raw and cleaned data. However, a data lake on its own is not enough. While it is great for storing data and for ad-hoc exploration by data scientists, it is not optimized for the fast, interactive, and user-friendly analytics that business users demand. For this, we need a data warehouse.
A data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence. It is a centralized repository of integrated data from one or more disparate sources, structured and optimized for analytical queries. For decades, the data warehouse has been the single source of truth for an organization’s analytical data. However, the rise of the data lake has challenged this paradigm, leading to a new, hybrid architecture that aims to combine the best of both worlds: the lakehouse.
This chapter is dedicated to the world of analytical data platforms. We will start by exploring the principles of modern cloud data warehousing, understanding the architectural innovations that have made them so powerful. We will then take a deep dive into the practical aspects of dimensional modeling, learning how to design the star schemas that are the heart of any data warehouse. Finally, we will explore the exciting new world of the lakehouse, understanding the key technologies like Delta Lake, Apache Iceberg, and Apache Hudi that are making this new architecture a reality. By the end of this chapter, you will have a comprehensive understanding of how to build a high-performance, scalable, and user-friendly analytics platform.
7.1 The Modern Cloud Data Warehouse¶
The traditional data warehouse, which dominated the enterprise for decades, was typically a monolithic, on-premises appliance from vendors like Teradata or Oracle. These systems were powerful but also incredibly expensive, inflexible, and difficult to scale. The rise of the cloud has led to a new generation of data warehouses that have completely changed the game.
Key Architectural Innovations of Cloud Data Warehouses
Separation of Compute and Storage: This is the most important innovation. In a traditional data warehouse, compute and storage were tightly coupled in the same physical appliance. If you needed more storage, you had to buy more compute, and vice versa. Modern cloud data warehouses, like Snowflake, Google BigQuery, and Amazon Redshift, separate these two layers. Storage is handled by a scalable, elastic object store (like S3 or GCS), and compute is handled by a cluster of virtual machines that can be spun up and down on demand. This has several profound benefits:
Independent Scaling: You can scale your storage and compute resources independently, paying only for what you need of each.
Elasticity: You can spin up a massive compute cluster to handle a peak workload and then shut it down when you are done, providing incredible flexibility and cost-effectiveness.
Concurrency: Different teams can run their queries on their own isolated compute clusters, all accessing the same central data store, without competing for resources.
Columnar Storage: As we discussed in Chapter 2, columnar storage is far more efficient for analytical queries than traditional row-based storage. All modern cloud data warehouses use a columnar storage format internally to provide high query performance.
Massively Parallel Processing (MPP): Cloud data warehouses are MPP systems, which means they distribute the data and the query processing across a large cluster of nodes. When you run a query, it is broken down into smaller pieces that are executed in parallel on all the nodes in the cluster, providing massive performance gains for large datasets.
Serverless and Managed: Many cloud data warehouses are offered as fully managed, serverless services. This means you don’t have to worry about provisioning servers, patching software, or managing infrastructure. You simply load your data and start querying.
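To make the serverless model concrete, here is a minimal sketch of querying BigQuery from Python. It assumes the google-cloud-bigquery package is installed and credentials are already configured; the project, dataset, and table names are placeholders rather than anything from this chapter's examples.

```python
from google.cloud import bigquery

# No servers to provision or clusters to size: you just create a client.
# Assumes application default credentials and a hypothetical project ID.
client = bigquery.Client(project="my-analytics-project")

# BigQuery allocates compute for this query on demand and releases it when
# the query finishes; there is no idle cluster to manage or pay for.
sql = """
    SELECT store_id, SUM(net_sales_amount) AS total_sales
    FROM `my-analytics-project.retail.fact_sales`
    GROUP BY store_id
    ORDER BY total_sales DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.store_id, row.total_sales)
```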
The Key Players¶
Snowflake: A leading cloud data warehouse that is known for its unique multi-cluster, shared data architecture, which provides excellent performance and concurrency. It is available on all three major cloud providers (AWS, Azure, and GCP).
Google BigQuery: Google’s serverless, highly scalable, and cost-effective data warehouse. Its fully serverless architecture is designed for fast, interactive analysis of massive datasets.
Amazon Redshift: Amazon’s cloud data warehouse, which is a popular choice for companies that are heavily invested in the AWS ecosystem.
Alibaba Cloud MaxCompute: A serverless, enterprise-grade data warehouse that is a core component of the Alibaba Cloud data platform.
7.2 Dimensional Modeling in Practice: Building the Star Schema¶
As we learned in Chapter 2, the dimensional model is the standard data modeling technique for data warehouses. It organizes data into fact tables, which contain the quantitative measures of the business, and dimension tables, which contain the descriptive context. The most common and effective type of dimensional model is the star schema.
Let’s walk through the practical steps of designing a star schema for a retail business.
Step 1: Choose the Business Process
The first step is to choose the business process you want to model. This could be sales, inventory, customer support, etc. Let’s choose retail sales.
Step 2: Identify the Grain
The grain defines what a single row in the fact table represents. This is the most important decision in dimensional modeling. A common grain for retail sales would be one row per product on a customer’s sales order.
Step 3: Identify the Dimensions
Next, we identify the dimensions that provide the context for the facts. For our retail sales process, the key dimensions would be:
Date: When did the sale happen?
Product: What product was sold?
Customer: Who bought the product?
Store: Where was the product sold?
Step 4: Identify the Facts
Finally, we identify the numerical measures that we want to analyze. For our retail sales process, the facts would be:
quantity_sold
unit_price
total_sales_amount
discount_amount
net_sales_amount
The Resulting Star Schema
This process results in a simple, elegant star schema with a central fact_sales table and four dimension tables: dim_date, dim_product, dim_customer, and dim_store.
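To make the design concrete, here is a minimal sketch of the fact table and one of the dimension tables expressed as Spark SQL DDL run from PySpark. The table and measure names follow the design above; the dimension attributes, data types, and Parquet storage format are illustrative assumptions, not a prescribed physical design.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-ddl").getOrCreate()

# Dimension table: one row per product, keyed by a surrogate key.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_product (
        product_key   BIGINT,
        product_id    STRING,
        product_name  STRING,
        category      STRING,
        brand         STRING
    ) USING PARQUET
""")

# Fact table: one row per product on a sales order (the grain chosen above).
# The foreign keys reference the surrogate keys of the four dimension tables.
spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        date_key            INT,
        product_key         BIGINT,
        customer_key        BIGINT,
        store_key           BIGINT,
        quantity_sold       INT,
        unit_price          DECIMAL(10, 2),
        total_sales_amount  DECIMAL(12, 2),
        discount_amount     DECIMAL(12, 2),
        net_sales_amount    DECIMAL(12, 2)
    ) USING PARQUET
""")
```

The same pattern repeats for dim_date, dim_customer, and dim_store: a narrow surrogate key in the fact table, and the descriptive attributes pushed out to the dimensions.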
Slowly Changing Dimensions (SCDs)¶
One of the key challenges in data warehousing is how to handle changes to dimension attributes over time. For example, what happens when a customer moves to a new address? This is the problem of Slowly Changing Dimensions (SCDs). There are several common techniques for handling this:
SCD Type 1: Overwrite. You simply overwrite the old value with the new value. This is the simplest approach, but it loses all historical information.
SCD Type 2: Add a New Row. This is the most common approach. When a change occurs, you add a new row to the dimension table for the new version of the dimension, and you keep the old row for historical analysis. The new row gets a new surrogate key, and you typically add columns like start_date, end_date, and is_current to track the history (see the sketch after this list).
SCD Type 3: Add a New Column. You add a new column to the dimension table to store the previous value of the attribute. This is a simple way to track limited history, but it is not very scalable.
Choosing the right SCD type depends on the business requirements for historical tracking.
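The sketch below illustrates an SCD Type 2 load in PySpark. It is a simplified illustration rather than a production pattern: it assumes a dim_customer DataFrame with customer_id, address, start_date, end_date, and is_current columns, uses an open-ended end_date of 9999-12-31 to mark the current version, and deliberately simplifies surrogate key generation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd-type-2").getOrCreate()

# Current state of the dimension (hypothetical sample row).
dim_customer = spark.createDataFrame(
    [(1, "C100", "12 Oak St", "2023-01-01", "9999-12-31", True)],
    ["customer_key", "customer_id", "address", "start_date", "end_date", "is_current"],
)

# Incoming changes from the source system.
updates = spark.createDataFrame([("C100", "78 Elm Ave")], ["customer_id", "address"])

today = F.current_date().cast("string")

# 1. Find current rows whose tracked attribute has changed.
changed = (
    dim_customer.filter("is_current").alias("d")
    .join(updates.alias("u"), "customer_id")
    .filter(F.col("d.address") != F.col("u.address"))
)

# 2. Expire the old versions: close the date range and clear the current flag.
expired = changed.select(
    "d.customer_key", "customer_id", "d.address", "d.start_date",
    today.alias("end_date"), F.lit(False).alias("is_current"),
)

# 3. Create new current versions with the updated address and a new surrogate
#    key (key generation is deliberately simplified here).
new_versions = changed.select(
    (F.col("d.customer_key") + 1000).alias("customer_key"), "customer_id",
    F.col("u.address").alias("address"), today.alias("start_date"),
    F.lit("9999-12-31").alias("end_date"), F.lit(True).alias("is_current"),
)

# 4. Keep untouched rows and union everything into the new dimension state.
untouched = dim_customer.join(changed.select("customer_key"), "customer_key", "left_anti")
dim_customer_v2 = untouched.unionByName(expired).unionByName(new_versions)
dim_customer_v2.show()
```

The history for the changed customer is now preserved: the old row remains with a closed date range, and any fact rows loaded while it was current continue to point at its surrogate key.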
7.3 The Lakehouse Revolution: Unifying the Lake and the Warehouse¶
For years, the standard big data architecture was a two-tiered system: a data lake for storing raw data and for use by data scientists, and a separate data warehouse for business intelligence and reporting. This architecture worked, but it had several problems:
Data Duplication: Data had to be copied from the lake to the warehouse, leading to redundant storage and increased costs.
Complexity: Maintaining two separate systems and the ETL pipelines between them was complex and brittle.
Data Staleness: The data in the warehouse was often out of date because of the batch ETL process.
The lakehouse is a new architecture that aims to solve these problems by building data warehousing capabilities directly on top of the low-cost, flexible storage of the data lake. It provides a single, unified platform for all your data, from raw ingestion to business intelligence.
How does it work?
The key innovation that enables the lakehouse is the addition of a transactional metadata layer on top of the open file formats (like Parquet) in the data lake. This metadata layer provides the features that were traditionally only available in a data warehouse, such as ACID transactions, schema enforcement, and time travel.
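Here is a minimal sketch of this idea using Delta Lake (introduced in the next section) from PySpark: the data lands as ordinary Parquet files, while a transaction log alongside them provides atomic writes and time travel. It assumes the delta-spark package is installed; the table path is a placeholder.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Assumes `pip install delta-spark`; configure_spark_with_delta_pip pulls in
# the matching Delta Lake jars for the session.
builder = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/lakehouse/orders"  # placeholder path on the data lake

# Version 0: initial load. The data files are plain Parquet, plus a
# _delta_log directory holding the transaction log.
spark.range(5).withColumnRenamed("id", "order_id") \
    .write.format("delta").save(path)

# Version 1: an atomic overwrite. Readers see either the old or the new
# snapshot, never a partially written state -- the ACID guarantee.
spark.range(10).withColumnRenamed("id", "order_id") \
    .write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as of an earlier version recorded in the log.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 5 rows, the state before the overwrite
```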
The Key Lakehouse Technologies¶
There are three major open-source projects that are leading the lakehouse revolution:
Delta Lake: Created by Databricks, Delta Lake is an open-source storage layer that brings reliability to data lakes. It runs on top of your existing data lake and is fully compatible with Apache Spark. Its core feature is the Delta Log, which is an ordered record of every transaction that has ever been performed on a Delta table. This log is the single source of truth and is what enables ACID transactions, time travel, and other reliability features.
Apache Iceberg: Created at Netflix, Apache Iceberg is an open table format for huge analytic datasets. Unlike Delta Lake, which is a storage layer, Iceberg is a table format specification. It manages the metadata of a table separately from the data files, which provides several advantages, including fast schema evolution, hidden partitioning, and better performance for point lookups.
Apache Hudi: (Hadoop Upserts Deletes and Incrementals) Created at Uber, Hudi is a streaming data lake platform. It provides low-latency ingestion and incremental processing capabilities, making it a good choice for building streaming data lakes.
Comparison of Lakehouse Formats
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Primary Ecosystem | Spark, Databricks | Spark, Flink, Presto | Spark, Flink |
| Key Feature | Simplicity, tight Spark integration | Robustness, schema evolution, hidden partitioning | Streaming ingestion, incremental processing |
| Transactions | ACID | ACID | ACID |
| Schema Evolution | Yes | Excellent | Yes |
| Time Travel | Yes | Yes | Yes |
While these technologies have different strengths, they all share the same goal: to bring the reliability and performance of a data warehouse to the flexibility and low cost of a data lake.
7.4 Building a Lakehouse on Alibaba Cloud¶
Alibaba Cloud provides a powerful set of tools for building a modern lakehouse architecture.
Alibaba Cloud OSS: The foundation of your lakehouse, providing the scalable and cost-effective storage layer.
Alibaba Cloud EMR: You can use EMR to run Spark jobs that read from and write to Delta Lake, Iceberg, or Hudi tables stored in OSS.
Alibaba Cloud Data Lake Formation (DLF): A service that helps you to build and manage your data lake, providing features like unified metadata management, access control, and data governance.
Alibaba Cloud Hologres: A real-time data warehouse that can directly query data in your data lake, providing a high-performance query engine for your lakehouse.
By combining these services, you can build a unified analytics platform that supports a wide range of use cases, from data science and machine learning on the raw data in the lake to interactive business intelligence on the curated data in the warehouse, all on a single copy of the data.
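As a sketch of how these pieces fit together, the snippet below shows a Spark job, as it might run on an EMR cluster, reading raw files from OSS and writing a curated Delta table back to OSS. The bucket name, paths, and column names are hypothetical, and it assumes the cluster is already configured with Delta Lake and the OSS connector, with DLF typically serving as the shared metastore.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes an EMR cluster where Delta Lake and the OSS connector are already
# configured; "my-lakehouse-bucket" is a hypothetical bucket name.
spark = SparkSession.builder.appName("oss-lakehouse").getOrCreate()

raw_path = "oss://my-lakehouse-bucket/raw/sales/"
curated_path = "oss://my-lakehouse-bucket/curated/fact_sales"

# Read raw JSON dropped into the lake by an ingestion pipeline
# (assumed to carry total_sales_amount and discount_amount fields).
raw = spark.read.json(raw_path)

# Apply light cleaning and write a curated Delta table back to OSS.
(raw
    .withColumn("net_sales_amount",
                F.col("total_sales_amount") - F.col("discount_amount"))
    .write.format("delta")
    .mode("overwrite")
    .save(curated_path))

# The same files can now be queried by other engines, such as Hologres or
# ad-hoc Spark jobs, without copying the data into a separate warehouse.
curated = spark.read.format("delta").load(curated_path)
curated.show(5)
```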
Chapter Summary¶
In this chapter, we have completed our tour of the world of data storage by exploring the critical systems used for analytics: the data warehouse and the emerging lakehouse. We have understood the architectural innovations that make modern cloud data warehouses so powerful, and we have learned the practical steps of dimensional modeling to build the star schemas that power business intelligence. We have also dived into the exciting new world of the lakehouse, understanding the key technologies like Delta Lake and Apache Iceberg that are unifying the data lake and the data warehouse into a single, powerful platform.
With a comprehensive understanding of how to store data for both transactional and analytical workloads, we are now ready to move on to the next major part of our data engineering journey: processing the data. In the next chapter, we will take a deep dive into the most important data processing frameworks in the industry, Apache Spark and Apache Flink.