Data Engineering in Action - Data Engineering in Action

Building Scalable Pipelines with Open Source Tools and Cloud Platforms¶

About the Author¶

This book was written by a university professor with extensive experience teaching data engineering subjects. The primary goal is to make data engineering as practical and helpful as possible for students entering the workforce as data engineers.

Preface¶

Welcome to Data Engineering in Action. This book is designed to be your comprehensive, practical guide to the world of modern data engineering. Whether you are a student just starting your journey, a software engineer looking to transition into data, or a data analyst wanting to deepen your technical skills, this book will provide you with the knowledge and hands-on experience you need to succeed.

In the 21st century, the phrase “data is the new oil” has become a ubiquitous cliché, but like many clichés, it holds a profound truth. Data, in its raw form, is a crude, unrefined resource. It is a torrent of information flowing from every corner of our digital world: every click on a website, every transaction in a store, every sensor reading from a smart device, every post on social media. This raw data, much like crude oil, is full of potential, but it is not immediately useful. It is messy, inconsistent, and often overwhelming. To unlock its value, it must be discovered, collected, cleaned, processed, and transformed into a reliable, usable, and accessible product. This is the work of data engineering.

Data engineering is one of the most critical and in-demand fields in technology today. Every modern organization, from startups to Fortune 500 companies, is drowning in data. They have data from their websites, their mobile apps, their IoT devices, their business systems, and countless other sources. But raw data, by itself, is not valuable. It needs to be collected, cleaned, transformed, and organized before it can be used to power analytics, machine learning, and data-driven decision-making. This is the job of the data engineer.

This book is different from other data engineering books in several important ways. First, it is intensely practical. Every chapter includes hands-on exercises with real code that you can run on your own machine. Second, it covers the entire modern data stack, from traditional data warehousing to cutting-edge AI and machine learning infrastructure. Third, it provides deep coverage of both open-source tools and cloud platforms, with a particular focus on Alibaba Cloud, one of the world’s leading cloud providers.

Who This Book Is For¶

This book is designed for:

Students studying computer science, data science, or information systems who want to build a career in data engineering
Software engineers who want to transition into data engineering or expand their skill set
Data analysts who want to move beyond SQL and learn how to build production data pipelines
Data scientists who want to understand the infrastructure that powers their models
Anyone who is curious about how modern data systems are built and wants to learn by doing

What You Will Learn¶

By the end of this book, you will be able to:

Understand the role of a data engineer and how it fits into the broader data team
Design and implement data models for both OLTP and OLAP systems
Build batch and streaming data pipelines using Apache Spark and Apache Flink
Orchestrate complex workflows with Apache Airflow and modern alternatives
Implement data governance, quality, and security best practices
Build data platforms on cloud infrastructure (with a focus on Alibaba Cloud)
Create data pipelines for AI and machine learning, including RAG applications
Implement feature stores and model serving infrastructure
Work with vector databases and embeddings
Apply all these skills to real-world business problems through detailed case studies

How This Book Is Organized¶

The book is organized into six parts:

Part 1: Foundations of Data Engineering introduces you to the field, covering the key roles, the data landscape, and core concepts like data modeling and storage paradigms.

Part 2: Data Storage Solutions provides deep dives into the major categories of data storage systems, from relational and NoSQL databases to data lakes and lakehouses.

Part 3: Data Processing and Orchestration covers the tools and techniques for processing data at scale and managing complex workflows.

Part 4: Data Governance, Security, and Cloud Platforms explores the critical topics of data quality, security, and compliance, and shows you how to build data platforms on Alibaba Cloud.

Part 5: Data Engineering for AI and ML is where we dive into the exciting world of data engineering for artificial intelligence, covering RAG, ML pipelines, feature stores, and vector databases.

Part 6: Business Applications and Case Studies brings it all together with detailed, end-to-end case studies from different industries.

Prerequisites¶

To get the most out of this book, you should have:

Basic programming knowledge (preferably in Python)
Basic understanding of SQL and relational databases
Familiarity with the command line
A willingness to learn and experiment

If you are not comfortable with Python or SQL, don’t worry. Appendix B provides a quick primer on both.

Setting Up Your Environment¶

All the code examples in this book can be run on your local machine using Docker and open-source tools. Appendix A provides detailed instructions for setting up your development environment. Additionally, a complete GitHub repository with all the code examples, sample data, and hands-on exercises is available at:

https://github.com/yourusername/data-engineering-in-action

A Note on Cloud Platforms¶

While this book covers open-source tools that can run anywhere, it also provides extensive coverage of Alibaba Cloud, one of the world’s leading cloud providers. If you don’t have access to Alibaba Cloud, don’t worry—the concepts and architectures discussed are applicable to any cloud platform, and you can adapt the examples to AWS, Azure, or Google Cloud.

Let’s Get Started¶

Data engineering is a challenging but incredibly rewarding field. The problems are complex, the tools are powerful, and the impact is immense. Every data-driven application, every machine learning model, and every business intelligence dashboard is built on the foundation of robust data pipelines created by data engineers.

I hope this book will be a valuable companion on your journey to becoming a skilled data engineer. Let’s dive in!

Dr. Kushnazarov Farruh
University Tashkent University of Information Technologies named after Muhammad al-Khwarizmi
2026 May 1st

The students who have taken my data engineering courses over the years and provided invaluable feedback
The open-source communities behind Apache Spark, Apache Flink, Apache Airflow, and countless other projects
The Alibaba Cloud team for their support and documentation
My colleagues and fellow educators in the data community
My family for their patience and support during the writing process

Thank you all.

Table of Contents¶

Part 1: Foundations of Data Engineering¶

Chapter 1: Introduction to Data Engineering

The Role of a Data Engineer
Data Engineering vs. Data Science vs. Data Analytics
The Data Landscape
The Importance of Open-Source
Data Engineering Architecture

Chapter 2: Data Modeling and Storage Paradigms

Data Types and Structures
Data Formats
Data Modeling Fundamentals
Data Storage Paradigms
Data Quality and Data Lineage

Chapter 3: The Open-Source Ecosystem

Why Open-Source Matters
Navigating the Landscape
Understanding Open-Source Licenses
Choosing and Evaluating Projects
How to Contribute

Part 2: Data Storage Solutions¶

Chapter 4: Relational Databases

Introduction to Relational Databases
PostgreSQL: The World’s Most Advanced Open Source Database
MySQL: The World’s Most Popular Open Source Database
PostgreSQL vs. MySQL
Hands-On Exercise

Chapter 5: NoSQL Databases

The NoSQL Movement
Document Databases: MongoDB
Wide-Column Stores: Cassandra
Key-Value Stores: Redis
Choosing the Right NoSQL Database

Chapter 6: Object Storage and Data Lakes

The Rise of Object Storage
MinIO: Open-Source Object Storage
Building a Data Lake
The Medallion Architecture
Data Lake Best Practices

Chapter 7: Data Warehousing and Lakehouse Architectures

Traditional Data Warehousing
The Lakehouse Paradigm
Delta Lake
Apache Iceberg
Apache Hudi
Choosing Your Lakehouse

Part 3: Data Processing and Orchestration¶

Chapter 8: Data Processing Frameworks: Spark and Flink

The Evolution of Big Data Processing
Apache Spark: The Unified Analytics Engine
Apache Flink: The True Stream Processor
Practical Development

Chapter 9: Streaming Data with Kafka and Flink

Placeholder aligned to BOOK_PLAN.md

Chapter 10: Transformations, Testing, and Analytics Engineering

Placeholder aligned to BOOK_PLAN.md

Chapter 11: Data Orchestration and Workflow Management

The Need for Orchestration
Apache Airflow: The Open-Source Standard
Modern Alternatives: Prefect and Dagster
Data Orchestration on Alibaba Cloud

Part 4: Analytics, Governance, Security, and Cloud Platforms¶

Chapter 12: Data Observability and Pipeline Reliability

Placeholder aligned to BOOK_PLAN.md

Chapter 13: Data Governance and Security

The Pillars of Data Governance
Implementing Data Quality at Scale
Data Security Best Practices
Compliance and Regulations
Data Catalogs and Discovery

Chapter 14: Data Engineering on Alibaba Cloud

An Overview of the Alibaba Cloud Data Platform
A Reference Architecture
A Deeper Dive into Key Services
Hybrid and Multi-Cloud Strategies

Chapter 15: Cost, Performance, and Scalability Engineering

Placeholder aligned to BOOK_PLAN.md

Chapter 16: Solution Selection Framework

The Technology Evaluation Process
Build vs. Buy vs. Open-Source
Choosing a Database
Choosing a Processing Framework
Choosing a Cloud Provider
Common Pitfalls to Avoid

Part 5: Data Engineering for AI and ML¶

Chapter 17: Data Engineering for RAG Applications

Understanding RAG
Building the Data Pipeline for RAG
Vector Storage and Retrieval
Production RAG Systems
RAG on Alibaba Cloud

Chapter 18: ML Pipeline Engineering

The ML Lifecycle
Building Training Data Pipelines
Model Training and Experiment Tracking
Model Deployment
Monitoring ML Systems in Production
ML Pipelines on Alibaba Cloud

Chapter 19: Feature Stores and Model Serving

The Feature Store
Feast: The Leading Open-Source Feature Store
Model Serving
Feature Stores and Model Serving on Alibaba Cloud

Chapter 20: Vector Databases and Embeddings

Embeddings: The Lingua Franca of AI
The Challenge of Vector Search
The Vector Database Landscape
Managing Embedding Pipelines in Production

Part 6: Business Applications and Case Studies¶

Chapter 21: Case Study: Building a Real-time Customer 360 Platform

Business Goals and Requirements
Data Sources
Architecture and Technology Choices
Implementation Details and Challenges

Chapter 22: Case Study: Fraud Detection in Financial Services

Business Goals and Requirements
Challenges of Fraud Detection: Imbalance, Drift, and Delayed Labels
Real-Time Fraud Detection Reference Architecture
Feature Stores, Decisioning, Feedback Loops, and Governance
Guided Lab: Running the Shared SecurePay Fraud-Detection Project

Appendices¶

Appendix A: Setting Up Your Development Environment

Appendix B: A SQL and Python Primer

Appendix C: The Data Engineering Career Path

Appendix D: Further Reading and Resources

Appendix E: Glossary of Terms

How to Use This Book¶

This book is designed to be read sequentially, as each chapter builds on the concepts introduced in previous chapters. However, if you are already familiar with certain topics, you can skip ahead to the chapters that interest you most.

Each chapter follows a similar structure:

Introduction: An overview of what you will learn
Concepts: Detailed explanations of the key concepts
Hands-On Examples: Practical code examples you can run yourself
Best Practices: Production-ready advice from real-world experience
Chapter Summary: A recap of the key takeaways
Exercises: Practice problems to reinforce your learning

Make sure to work through the hands-on exercises and examples. Data engineering is a practical skill that is best learned by doing.

Let’s begin your journey into the world of data engineering!

Acknowledgments¶

This book would not have been possible without the support and contributions of many people: