
Chapter 11: Data Engineering on Alibaba Cloud

Throughout this book, we have explored a wide range of open-source data engineering tools and technologies. In practice, however, you will rarely run these tools on your own hardware; most teams run them in the cloud, which provides a scalable, elastic, and cost-effective platform for building modern data applications. While there are several major cloud providers, each with its own set of data services, this chapter focuses on Alibaba Cloud, one of the leading cloud providers in the world, particularly in the Asia-Pacific region.

Alibaba Cloud offers a broad suite of data services that covers the entire data engineering lifecycle, from data ingestion and storage to processing, analytics, and machine learning. Knowing how to use these services effectively is a critical skill for any data engineer working in the Alibaba Cloud ecosystem. In this chapter, we will tour the Alibaba Cloud data platform. We will explore the key services for data engineering, including MaxCompute, DataWorks, OSS, E-MapReduce, and AnalyticDB. We will then look at a reference architecture for building a modern data platform on Alibaba Cloud, take a deeper dive into two of the core services, MaxCompute and DataWorks, and discuss hybrid and multi-cloud strategies. By the end of this chapter, you will have a clear understanding of the Alibaba Cloud data ecosystem and be ready to start building your own data pipelines on it.

11.1 An Overview of the Alibaba Cloud Data Platform

Alibaba Cloud provides a dizzying array of data services. Let’s break down the most important ones for data engineering.

11.2 A Reference Architecture for a Modern Data Platform on Alibaba Cloud

By combining these services, you can build a powerful and scalable modern data platform. Here is a typical reference architecture:

  1. Data Ingestion:

    • Batch Ingestion: Use DataWorks’s data integration capabilities to ingest data from a variety of sources (e.g., relational databases, log files) into your OSS data lake.

    • Real-time Ingestion: Use DataHub to ingest streaming data from sources like application logs, clickstreams, and IoT devices.

  2. Data Storage:

    • Data Lake: Use OSS to store all your raw and semi-structured data in a cost-effective and scalable way. Implement a medallion architecture (Bronze, Silver, Gold) to organize your data lake.

    • Data Warehouse: Use MaxCompute to store your structured, transformed, and aggregated data for business intelligence and reporting.

  3. Data Processing:

    • Batch Processing: Use MaxCompute’s SQL engine or a Spark job on E-MapReduce (EMR) to perform large-scale batch ETL/ELT on the data in your data lake and data warehouse.

    • Real-time Processing: Use Realtime Compute for Apache Flink to process the streaming data from DataHub in real time.

  4. Data Serving and Analytics:

    • Business Intelligence: Use MaxCompute for large-scale, complex queries and reporting. Connect a BI tool like Quick BI to MaxCompute to build dashboards and visualizations.

    • Interactive Analytics: Use AnalyticDB or Hologres for low-latency, interactive queries on your analytical data.

    • Data Science: Data scientists can use Spark on EMR to explore the data in the data lake and build machine learning models.

  5. Orchestration and Governance:

    • Use DataWorks to orchestrate your entire data pipeline, from ingestion to processing to serving. Use DataWorks’s data governance features (Data Map, Data Quality) to manage and govern your data assets.
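To make the medallion layout from the Data Storage step more concrete, here is a minimal sketch of one way to derive object-key prefixes for a data lake. The naming scheme (`layer/source/table/dt=YYYY-MM-DD/`), the `DatasetPartition` class, and the sample source and table names are all assumptions made for illustration; they are not an OSS requirement or an Alibaba Cloud convention, and the code does not call any cloud API.

```python
from dataclasses import dataclass
from datetime import date

# Layer names follow the medallion convention (Bronze, Silver, Gold).
LAYERS = ("bronze", "silver", "gold")


@dataclass(frozen=True)
class DatasetPartition:
    """One daily partition of a dataset in the lake (hypothetical model)."""

    layer: str   # "bronze", "silver", or "gold"
    source: str  # originating system, e.g. "orders_db" (made-up name)
    table: str   # logical table name
    day: date    # partition date

    def prefix(self) -> str:
        """Return the object-key prefix under which this partition is stored."""
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer!r}")
        return f"{self.layer}/{self.source}/{self.table}/dt={self.day.isoformat()}/"

    def promote(self) -> "DatasetPartition":
        """Return the same partition one layer up (bronze -> silver -> gold)."""
        idx = LAYERS.index(self.layer)
        if idx == len(LAYERS) - 1:
            raise ValueError("gold is the final layer")
        return DatasetPartition(LAYERS[idx + 1], self.source, self.table, self.day)


raw = DatasetPartition("bronze", "orders_db", "orders", date(2024, 1, 15))
print(raw.prefix())            # bronze/orders_db/orders/dt=2024-01-15/
print(raw.promote().prefix())  # silver/orders_db/orders/dt=2024-01-15/
```

Keeping the layer as the first path segment means each layer can be given its own access policy or lifecycle rule by prefix, while the `dt=` segment lets batch jobs read or rewrite a single day at a time.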

11.3 A Deeper Dive into Key Services

MaxCompute: The Petabyte-Scale Data Warehouse

MaxCompute is a fully managed, multi-tenant data warehouse service. It is designed for processing massive datasets and is used extensively within Alibaba Group to power its e-commerce and logistics businesses.

Key Features:

DataWorks: The All-in-One Data Platform

DataWorks is the control plane for your data platform on Alibaba Cloud. It is a powerful and comprehensive tool that can be a bit overwhelming at first, but it provides a huge amount of value by integrating many different data engineering tasks into a single platform.
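The core idea behind orchestration is that a pipeline is a directed acyclic graph (DAG) of tasks, and a scheduler runs each task only after its upstream tasks have finished. The sketch below illustrates that idea with Python's standard `graphlib` module. It is not the DataWorks API: the task names are hypothetical labels for the stages of the reference architecture in Section 11.2, and a real DataWorks workflow would be defined in the platform itself.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
# Task names are hypothetical labels, not DataWorks node types.
PIPELINE = {
    "ingest_batch_to_oss": set(),
    "ingest_stream_to_datahub": set(),
    "batch_etl_maxcompute": {"ingest_batch_to_oss"},
    "stream_processing_flink": {"ingest_stream_to_datahub"},
    "load_serving_layer": {"batch_etl_maxcompute", "stream_processing_flink"},
    "refresh_dashboards": {"load_serving_layer"},
}


def execution_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid run order in which every task follows its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(PIPELINE)
for step, task in enumerate(order, start=1):
    print(f"{step}. {task}")
```

Tasks with no dependency between them, such as the two ingestion tasks, may appear in either order; an orchestrator is free to run them in parallel. A cyclic definition is rejected with `CycleError` before anything runs, which is the same class of error a workflow tool reports when a dependency loop is configured.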

Key Components:

11.4 Hybrid and Multi-Cloud Strategies

While Alibaba Cloud provides a rich set of data services, many organizations have a hybrid or multi-cloud strategy. They may have some data on-premises and some data in other clouds like AWS or Azure. In this scenario, you need to think about how to integrate your Alibaba Cloud data platform with your other environments.

Key Considerations:

Chapter Summary

In this chapter, we have toured the data engineering ecosystem on Alibaba Cloud. We have explored the key services for data storage, processing, and analytics, and we have looked at a reference architecture for building a modern data platform with them. We have also taken a deeper dive into two core services, MaxCompute and DataWorks. You should now have a solid understanding of how to use Alibaba Cloud to build your own scalable, reliable, and secure data applications.

This chapter concludes our tour of the foundational technologies of data engineering. We have covered storage, processing, orchestration, governance, and cloud platforms. In the next part of the book, we will move on to the exciting and rapidly evolving world of data engineering for AI and machine learning.