
Chapter 7: Data Warehousing and Lakehouse Architectures

In the previous chapter, we laid the foundation of our modern data platform with the data lake, a highly scalable and cost-effective repository for all our raw and cleaned data. However, a data lake on its own is not enough. While it is great for storing data and for ad-hoc exploration by data scientists, it is not optimized for the fast, interactive, and user-friendly analytics that business users demand. For this, we need a data warehouse.

A data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence. It is a centralized repository of integrated data from one or more disparate sources, structured and optimized for analytical queries. For decades, the data warehouse has been the single source of truth for an organization’s analytical data. However, the rise of the data lake has challenged this paradigm, leading to a new, hybrid architecture that aims to combine the best of both worlds: the lakehouse.

This chapter is dedicated to the world of analytical data platforms. We will start by exploring the principles of modern cloud data warehousing, understanding the architectural innovations that have made them so powerful. We will then take a deep dive into the practical aspects of dimensional modeling, learning how to design the star schemas that are the heart of any data warehouse. Finally, we will explore the exciting new world of the lakehouse, understanding the key technologies like Delta Lake, Apache Iceberg, and Apache Hudi that are making this new architecture a reality. By the end of this chapter, you will have a comprehensive understanding of how to build a high-performance, scalable, and user-friendly analytics platform.

7.1 The Modern Cloud Data Warehouse

The traditional data warehouse, which dominated the enterprise for decades, was typically a monolithic, on-premises appliance from vendors like Teradata or Oracle. These systems were powerful but also incredibly expensive, inflexible, and difficult to scale. The rise of the cloud has led to a new generation of data warehouses that have completely changed the game.

Key Architectural Innovations of Cloud Data Warehouses

  1. Separation of Compute and Storage: This is the most important innovation. In a traditional data warehouse, compute and storage were tightly coupled in the same physical appliance. If you needed more storage, you had to buy more compute, and vice versa. Modern cloud data warehouses, like Snowflake, Google BigQuery, and Amazon Redshift, separate these two layers. Storage is handled by a scalable, elastic object store (like S3 or GCS), and compute is handled by a cluster of virtual machines that can be spun up and down on demand. This has several profound benefits:

    • Independent Scaling: You can scale your storage and compute resources independently, paying only for what you need of each.

    • Elasticity: You can spin up a massive compute cluster to handle a peak workload and then shut it down when you are done, providing incredible flexibility and cost-effectiveness.

    • Concurrency: Different teams can run their queries on their own isolated compute clusters, all accessing the same central data store, without competing for resources.

  2. Columnar Storage: As we discussed in Chapter 2, columnar storage is far more efficient for analytical queries than traditional row-based storage. All modern cloud data warehouses use a columnar storage format internally to provide high query performance.

  3. Massively Parallel Processing (MPP): Cloud data warehouses are MPP systems, which means they distribute the data and the query processing across a large cluster of nodes. When you run a query, it is broken down into smaller pieces that are executed in parallel on all the nodes in the cluster, providing massive performance gains for large datasets.

  4. Serverless and Managed: Many cloud data warehouses are offered as fully managed, serverless services. This means you don’t have to worry about provisioning servers, patching software, or managing infrastructure. You simply load your data and start querying.
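The benefit of columnar storage described above can be illustrated with a toy sketch in plain Python. This is an illustration of the layout idea only, not any vendor's actual storage format: the point is that an aggregation over one column touches only that column's values in a columnar layout, while a row layout forces the query to walk every full record.

```python
# Toy illustration of row vs. columnar layout (illustrative only).
# Row storage interleaves all fields of a record; columnar storage groups
# each field into its own contiguous array.

rows = [
    {"order_id": i, "region": "EU" if i % 2 else "US", "amount": float(i)}
    for i in range(1000)
]

# Row-oriented: summing `amount` requires visiting every full record.
total_row = sum(r["amount"] for r in rows)

# Column-oriented: each column is a separate contiguous array,
# so the aggregation reads only the `amount` column.
col_store = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}
total_col = sum(col_store["amount"])

assert total_row == total_col
```

Real columnar formats add compression and encoding schemes (run-length, dictionary) that work especially well on these per-column arrays, which is a large part of their performance advantage.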

The Key Players

The market for modern cloud data warehouses is led by three major players: Snowflake, Google BigQuery, and Amazon Redshift. Each embodies the architectural principles described above, with different trade-offs in pricing model, ecosystem integration, and operational overhead.

7.2 Dimensional Modeling in Practice: Building the Star Schema

As we learned in Chapter 2, the dimensional model is the standard data modeling technique for data warehouses. It organizes data into fact tables, which contain the quantitative measures of the business, and dimension tables, which contain the descriptive context. The most common and effective type of dimensional model is the star schema.

Let’s walk through the practical steps of designing a star schema for a retail business.

Step 1: Choose the Business Process

The first step is to choose the business process you want to model. This could be sales, inventory, customer support, etc. Let’s choose retail sales.

Step 2: Identify the Grain

The grain defines what a single row in the fact table represents. This is the most important decision in dimensional modeling. A common grain for retail sales would be one row per product on a customer’s sales order.

Step 3: Identify the Dimensions

Next, we identify the dimensions that provide the context for the facts. For our retail sales process, the key dimensions would be:

  • Date: when the sale occurred.

  • Product: what was sold.

  • Customer: who bought it.

  • Store: where the sale took place.

Step 4: Identify the Facts

Finally, we identify the numerical measures that we want to analyze. For our retail sales process, at the grain of one row per product on a sales order, the facts would be:

  • Quantity Sold: the number of units of the product on the order line.

  • Sales Amount: the total revenue for the order line.

The Resulting Star Schema

This process results in a simple, elegant star schema with a central fact_sales table and four dimension tables: dim_date, dim_product, dim_customer, and dim_store.
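To make this concrete, here is a minimal sketch of the star schema using Python's built-in sqlite3 module. The table names follow the schema described above; the specific columns and sample data are illustrative assumptions, not a prescribed design:

```python
import sqlite3

# Build the star schema: one fact table surrounded by four dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);
CREATE TABLE dim_store    (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    quantity     INTEGER,   -- grain: one row per product per sales order
    sales_amount REAL
);
""")

# Illustrative sample rows.
conn.execute("INSERT INTO dim_date VALUES (20250101, '2025-01-01', 2025, 1)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics')")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Alice', 'Hangzhou')")
conn.execute("INSERT INTO dim_store VALUES (1, 'Main Street', 'East')")
conn.execute("INSERT INTO fact_sales VALUES (20250101, 1, 1, 1, 2, 1999.98)")

# A typical star-join query: total sales by product category and month.
row = conn.execute("""
    SELECT p.category, d.month, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d    ON f.date_key = d.date_key
    GROUP BY p.category, d.month
""").fetchone()
```

Note the shape of the query: the fact table is joined to the dimensions by surrogate keys, filters and groupings come from dimension attributes, and aggregations come from the facts. This "star join" pattern is exactly what MPP warehouses are optimized to execute.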

Slowly Changing Dimensions (SCDs)

One of the key challenges in data warehousing is how to handle changes to dimension attributes over time. For example, what happens when a customer moves to a new address? This is the problem of Slowly Changing Dimensions (SCDs). There are several common techniques for handling this:

  • Type 1 (Overwrite): simply overwrite the old value with the new one. This is the simplest approach, but all history is lost.

  • Type 2 (Add a New Row): insert a new row for the changed record, with effective-date columns and a current-row flag to distinguish versions. This preserves full history and is the most widely used approach.

  • Type 3 (Add a New Column): keep the previous value in a dedicated column alongside the current one, preserving only a single level of history.

Choosing the right SCD type depends on the business requirements for historical tracking.
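One of the most widely used techniques, commonly known as SCD Type 2, preserves history by closing out the current dimension row and inserting a new one whenever a tracked attribute changes. Here is a minimal Python sketch of that logic; the column names (`valid_from`, `valid_to`, `is_current`) are conventional but illustrative assumptions:

```python
from datetime import date

# A tiny in-memory dim_customer table with SCD Type 2 bookkeeping columns.
dim_customer = [
    {"customer_key": 1, "customer_id": "C-100", "address": "Old Street",
     "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_address, change_date):
    """Close the current row for this customer and append a new current row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["address"] == new_address:
                return  # no change, nothing to do
            row["valid_to"] = change_date   # close out the old version
            row["is_current"] = False
    dim.append({
        "customer_key": max(r["customer_key"] for r in dim) + 1,
        "customer_id": customer_id, "address": new_address,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })

apply_scd2(dim_customer, "C-100", "New Avenue", date(2024, 6, 1))

# Both versions are retained: historical fact rows still join to the old
# surrogate key, while new fact rows pick up the new one.
assert len(dim_customer) == 2
assert not dim_customer[0]["is_current"] and dim_customer[1]["is_current"]
```

In a real warehouse this logic is usually expressed as a MERGE/upsert in SQL or an ETL tool, but the bookkeeping is the same: the surrogate key ties each fact row to the version of the dimension that was current when the fact occurred.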

7.3 The Lakehouse Revolution: Unifying the Lake and the Warehouse

For years, the standard big data architecture was a two-tiered system: a data lake for storing raw data and for use by data scientists, and a separate data warehouse for business intelligence and reporting. This architecture worked, but it had several problems:

  • Data Duplication: the same data is stored twice, once in the lake and once in the warehouse, with complex ETL pipelines needed to keep the two copies in sync.

  • Data Staleness: because data must be copied from the lake into the warehouse, the warehouse often lags behind the freshest data in the lake.

  • Divergent Views: BI users query the warehouse while data scientists work against the lake, so the two groups can end up with inconsistent versions of the truth.

  • Lock-In: data loaded into a proprietary warehouse format is difficult to access with open-source tools.

The lakehouse is a new architecture that aims to solve these problems by building data warehousing capabilities directly on top of the low-cost, flexible storage of the data lake. It provides a single, unified platform for all your data, from raw ingestion to business intelligence.

How does it work?

The key innovation that enables the lakehouse is the addition of a transactional metadata layer on top of the open file formats (like Parquet) in the data lake. This metadata layer provides the features that were traditionally only available in a data warehouse, such as ACID transactions, schema enforcement, and time travel.
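The transaction-log idea can be illustrated with a toy sketch: a table whose current state is defined not by "whatever files are in the directory" but by an ordered log of commit files, loosely inspired by (but far simpler than) Delta Lake's `_delta_log`. Everything below is illustrative; real formats also record removed files, schema metadata, statistics, and handle concurrent writers:

```python
import json
import os
import tempfile

# A table directory with a metadata log. Each commit is one JSON file
# listing the data files that commit adds to the table.
table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_log")
os.makedirs(log_dir)

def commit(version, added_files):
    """Atomically publish a new table version by writing one log entry."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"version": version, "add": added_files}, f)
    os.rename(tmp, path)  # readers never observe a half-written commit

def snapshot(as_of=None):
    """List the data files visible at a given version (time travel)."""
    files = []
    for name in sorted(os.listdir(log_dir)):  # zero-padded names sort by version
        with open(os.path.join(log_dir, name)) as f:
            entry = json.load(f)
        if as_of is not None and entry["version"] > as_of:
            break
        files.extend(entry["add"])
    return files

commit(0, ["part-000.parquet"])
commit(1, ["part-001.parquet"])

assert snapshot() == ["part-000.parquet", "part-001.parquet"]
assert snapshot(as_of=0) == ["part-000.parquet"]  # time travel to version 0
```

Even this toy version shows the core trick: because readers only trust files that appear in a committed log entry, a writer can stage new Parquet files freely and make them visible atomically, which is what gives the lakehouse its ACID semantics and time travel on top of plain object storage.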

The Key Lakehouse Technologies

There are three major open-source projects that are leading the lakehouse revolution:

  • Delta Lake: created by Databricks, with very tight integration into the Spark ecosystem.

  • Apache Iceberg: originally developed at Netflix, known for its robust schema evolution and hidden partitioning.

  • Apache Hudi: originally developed at Uber, with a focus on streaming ingestion and incremental processing.

Comparison of Lakehouse Formats

| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
| --- | --- | --- | --- |
| Primary Ecosystem | Spark, Databricks | Spark, Flink, Presto | Spark, Flink |
| Key Feature | Simplicity, tight Spark integration | Robustness, schema evolution, hidden partitioning | Streaming ingestion, incremental processing |
| Transactions | ACID | ACID | ACID |
| Schema Evolution | Yes | Excellent | Yes |
| Time Travel | Yes | Yes | Yes |

While these technologies have different strengths, they all share the same goal: to bring the reliability and performance of a data warehouse to the flexibility and low cost of a data lake.

7.4 Building a Lakehouse on Alibaba Cloud

Alibaba Cloud provides a powerful set of tools for building a modern lakehouse architecture. One reasonable mapping of its services onto the layers described above uses Object Storage Service (OSS) as the data lake storage layer, E-MapReduce (EMR) to run Spark and Flink against open table formats such as Delta Lake, Iceberg, and Hudi, MaxCompute as the large-scale data warehouse, and Hologres for interactive, real-time analytics on the curated data.

By combining these services, you can build a unified analytics platform that supports a wide range of use cases, from data science and machine learning on the raw data in the lake to interactive business intelligence on the curated data in the warehouse, all on a single copy of the data.

Chapter Summary

In this chapter, we have completed our tour of the world of data storage by exploring the critical systems used for analytics: the data warehouse and the emerging lakehouse. We have understood the architectural innovations that make modern cloud data warehouses so powerful, and we have learned the practical steps of dimensional modeling to build the star schemas that power business intelligence. We have also dived into the exciting new world of the lakehouse, understanding the key technologies like Delta Lake and Apache Iceberg that are unifying the data lake and the data warehouse into a single, powerful platform.

With a comprehensive understanding of how to store data for both transactional and analytical workloads, we are now ready to move on to the next major part of our data engineering journey: processing the data. In the next chapter, we will take a deep dive into the most important data processing frameworks in the industry, Apache Spark and Apache Flink.