Chapter 6: Object Storage and Data Lakes

So far in our journey through data storage, we have explored the structured world of relational databases and the more flexible world of NoSQL. Now we turn to the foundation of the modern, cloud-native data platform: object storage and the data lake. In the era of big data, a platform must be able to store and process massive volumes of structured, semi-structured, and unstructured data cost-effectively and at scale. This is the problem that data lakes, built on top of object storage, were designed to solve. A data lake is a centralized repository that allows you to store all your data at any scale, in its raw, native format.

This chapter is dedicated to providing a deep, practical understanding of these foundational technologies. We will start by exploring the principles of object storage, understanding what makes it different from traditional file systems and why it has become the de facto storage layer for the cloud. We will take a close look at Amazon S3, the industry standard, as well as open-source alternatives like MinIO. We will then dive into the concept of the data lake, learning how to design and build one effectively. We will cover key architectural patterns like the medallion architecture, best practices for organizing data, and the critical importance of data lake governance to avoid turning your data lake into a data swamp. By the end of this chapter, you will have the knowledge to build the scalable, cost-effective, and flexible storage foundation that is required for any modern data engineering initiative.

6.1 Object Storage: The Infinite Data Repository

Before we can build a data lake, we must first understand the technology that makes it possible: object storage. Object storage is a computer data storage architecture that manages data as objects, as opposed to the file hierarchy of a traditional file system or the blocks of a block storage system.

What is an Object?

An object is a self-contained bundle of data that consists of three components:

  1. Data: The actual data itself, which can be anything from a text file to an image, a video, or a massive Parquet file.

  2. Metadata: A set of descriptive attributes about the data. This can include system metadata (like the creation date and content type) and custom metadata (like a customer ID or a project name).

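  3. A Unique Identifier (Key): An ID that is used to access the object from anywhere in the system. It is often described as a globally unique identifier (GUID); in practice, platforms such as Amazon S3 address an object by its bucket name plus a key that is unique within that bucket.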
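The three components above can be sketched as a small Python data structure. This is an illustrative model only, not the API of any real object store: the names `StoredObject` and `make_object`, and the choice of metadata fields such as `checksum_sha256`, are assumptions made for the example.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical in-memory model of one object."""
    key: str                       # identifier used to retrieve the object
    data: bytes                    # the payload, opaque to the store
    system_metadata: dict = field(default_factory=dict)
    custom_metadata: dict = field(default_factory=dict)


def make_object(key: str, data: bytes, content_type: str, **custom: str) -> StoredObject:
    # System metadata is derived by the store; custom metadata is user-supplied.
    system = {
        "content_type": content_type,
        "content_length": len(data),
        "checksum_sha256": hashlib.sha256(data).hexdigest(),
    }
    return StoredObject(key, data, system, dict(custom))


obj = make_object(
    "raw/sales/2025/11/08/data.csv",
    b"order_id,amount\n1,9.99\n",
    "text/csv",
    customer_id="c-42",   # example custom metadata
)
print(obj.key, obj.system_metadata["content_length"], obj.custom_metadata)
```

Keeping system and custom metadata in separate dictionaries mirrors the distinction drawn in the list: the first is produced by the store, the second is whatever the writer chooses to attach.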

Unlike a file in a file system, an object is not stored in a hierarchical directory structure. Instead, all objects are stored in a flat address space, in a container called a bucket. You can create a pseudo-hierarchical structure by using prefixes in your object names (e.g., raw/sales/2025/11/08/data.csv in a bucket named my-bucket), but this is just a convention for organization; the underlying storage is still flat.

Key Characteristics of Object Storage

Object Storage vs. File Storage vs. Block Storage

| Feature | Object Storage | File Storage (NAS) | Block Storage (SAN) |
| --- | --- | --- | --- |
| Data Model | Objects in a flat address space | Files in a hierarchical directory | Blocks of raw storage |
| Access Protocol | HTTP (REST API) | SMB, NFS | Fibre Channel, iSCSI |
| Scalability | Virtually unlimited | Limited by the file system | Limited by the volume size |
| Performance | High throughput, higher latency | Lower latency for file operations | Lowest latency, high IOPS |
| Cost | Lowest | Moderate | Highest |
| Use Case | Data lakes, backups, archives, cloud-native applications | Shared file storage, home directories | Databases, virtual machine disks |

Key Object Storage Platforms

6.2 The Data Lake: A Centralized Repository for All Your Data

A data lake is a design pattern, not a specific product. It is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. The core idea is to have a single place where you can store all your data in its raw, native format, without having to first structure it or define a schema.
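Storing data without defining a schema up front is often called schema-on-read: the raw records are kept as they arrived, and a schema is applied only when someone reads them. The sketch below illustrates the idea with JSON lines. The sample records, the `SALES_SCHEMA` mapping, and the `read_with_schema` helper are all invented for illustration.

```python
import json

# Raw records are stored exactly as they arrived; fields vary per record.
RAW_EVENTS = [
    '{"order_id": "1", "amount": "9.99", "country": "DE"}',
    '{"order_id": "2", "amount": "15.00"}',
    '{"order_id": "3", "amount": "n/a", "coupon": "WELCOME"}',
]

# The schema belongs to the reader, not to the storage layer.
SALES_SCHEMA = {"order_id": int, "amount": float}


def read_with_schema(lines, schema):
    """Project each raw record onto the schema; collect records that do not fit."""
    rows, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            rejected.append(line)
    return rows, rejected


rows, rejected = read_with_schema(RAW_EVENTS, SALES_SCHEMA)
print(rows)
print(rejected)
```

A different consumer could read the same raw lines with a different schema (for example, one that keeps `country`), which is the flexibility the raw, native format buys at the cost of deferring validation to read time.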

Why Build a Data Lake?

The Medallion Architecture: Structuring Your Data Lake

One of the biggest challenges with data lakes is that they can easily turn into “data swamps”—a messy, disorganized, and untrustworthy repository of data that no one knows how to use. To avoid this, it is crucial to have a well-defined structure and a set of data quality processes. The medallion architecture is a popular and effective design pattern for structuring a data lake. It organizes the data into three distinct layers or zones: Bronze, Silver, and Gold.

Bronze Layer (Raw Data)

Silver Layer (Cleaned and Conformed Data)

Gold Layer (Aggregated and Business-Ready Data)

This multi-layered approach provides a clear separation of concerns and a progressive improvement in data quality as data moves from Bronze to Silver to Gold. It provides a solid foundation for building a reliable and trustworthy data platform.
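As a rough illustration of that progression, the following sketch moves a handful of records through the three layers using plain Python lists. The sample data, the cleaning rules (drop duplicates and rows without an amount, normalize the region), and the function names `to_silver` and `to_gold` are assumptions chosen for the example; a real pipeline would typically use a processing engine and write each layer back to object storage.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested, including duplicates and bad rows.
bronze = [
    {"order_id": "1", "amount": "9.99", "region": " emea "},
    {"order_id": "1", "amount": "9.99", "region": " emea "},  # duplicate delivery
    {"order_id": "2", "amount": "15.00", "region": "APAC"},
    {"order_id": "3", "amount": "", "region": "EMEA"},        # missing amount
]


def to_silver(rows):
    """Clean and conform: deduplicate, drop incomplete rows, fix types."""
    seen, clean = set(), []
    for row in rows:
        if not row["amount"] or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "region": row["region"].strip().upper(),
        })
    return clean


def to_gold(rows):
    """Aggregate into a business-ready view: revenue per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Because Bronze is never modified, the Silver and Gold layers can be rebuilt from it at any time if the cleaning or aggregation rules change.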

Best Practices for Data Lake Design

6.3 Data Lake Governance: Taming the Swamp

As we have mentioned, the biggest risk of a data lake is that it can become a data swamp. Data lake governance is the set of policies, processes, and tools that you put in place to ensure that your data lake is well-managed, secure, and trustworthy.

Key Pillars of Data Lake Governance:

6.4 Building a Data Lake on Alibaba Cloud

Alibaba Cloud provides a rich set of services for building a modern data lake.

Chapter Summary

In this chapter, we have explored the foundational storage layer of the modern data platform: object storage and the data lake. We have understood the principles of object storage and why its scalability, durability, and cost-effectiveness make it the ideal choice for storing massive amounts of data. We have taken a deep dive into the concept of the data lake, learning how to design and structure one using the medallion architecture. We have also discussed the critical importance of data lake governance in ensuring that your data lake remains a valuable asset and does not devolve into a data swamp. Finally, we have seen how you can build a data lake using the services available on Alibaba Cloud.

With a solid understanding of how to store data, from relational databases to NoSQL databases to data lakes, we are now ready to move on to the next logical step in our journey: processing the data. In the next chapter, we will explore the world of data warehousing and the emerging lakehouse architecture, which combines the best of data lakes and data warehouses.