Building Scalable Pipelines with Open Source Tools and Cloud Platforms¶
About the Author¶
This book was written by a university professor with extensive experience teaching data engineering subjects. The primary goal is to make data engineering as practical and helpful as possible for students entering the workforce as data engineers.
Preface¶
Welcome to Data Engineering in Action. This book is designed to be your comprehensive, practical guide to the world of modern data engineering. Whether you are a student just starting your journey, a software engineer looking to transition into data, or a data analyst wanting to deepen your technical skills, this book will provide you with the knowledge and hands-on experience you need to succeed.

Figure 1:Data is the new oil.
In the 21st century, the phrase “data is the new oil” has become a ubiquitous cliché, but like many clichés, it holds a profound truth. Data, in its raw form, is a crude, unrefined resource. It is a torrent of information flowing from every corner of our digital world: every click on a website, every transaction in a store, every sensor reading from a smart device, every post on social media. This raw data, much like crude oil, is full of potential, but it is not immediately useful. It is messy, inconsistent, and often overwhelming. To unlock its value, it must be discovered, collected, cleaned, processed, and transformed into a reliable, usable, and accessible product. This is the work of data engineering.
Data engineering is one of the most critical and in-demand fields in technology today. Every modern organization, from startups to Fortune 500 companies, is drowning in data. They have data from their websites, their mobile apps, their IoT devices, their business systems, and countless other sources. But raw data, by itself, is not valuable. It needs to be collected, cleaned, transformed, and organized before it can be used to power analytics, machine learning, and data-driven decision-making. This is the job of the data engineer.
This book is different from other data engineering books in several important ways. First, it is intensely practical. Every chapter includes hands-on exercises with real code that you can run on your own machine. Second, it covers the entire modern data stack, from traditional data warehousing to cutting-edge AI and machine learning infrastructure. Third, it provides deep coverage of both open-source tools and cloud platforms, with a particular focus on Alibaba Cloud, one of the world’s leading cloud providers.
Who This Book Is For¶
This book is designed for:
Students studying computer science, data science, or information systems who want to build a career in data engineering
Software engineers who want to transition into data engineering or expand their skill set
Data analysts who want to move beyond SQL and learn how to build production data pipelines
Data scientists who want to understand the infrastructure that powers their models
Anyone who is curious about how modern data systems are built and wants to learn by doing
What You Will Learn¶
By the end of this book, you will be able to:
Understand the role of a data engineer and how it fits into the broader data team
Design and implement data models for both OLTP and OLAP systems
Build batch and streaming data pipelines using Apache Spark and Apache Flink
Orchestrate complex workflows with Apache Airflow and modern alternatives
Implement data governance, quality, and security best practices
Build data platforms on cloud infrastructure (with a focus on Alibaba Cloud)
Create data pipelines for AI and machine learning, including RAG applications
Implement feature stores and model serving infrastructure
Work with vector databases and embeddings
Apply all these skills to real-world business problems through detailed case studies
How This Book Is Organized¶
The book is organized into six parts:
Part 1: Foundations of Data Engineering introduces you to the field, covering the key roles, the data landscape, and core concepts like data modeling and storage paradigms.
Part 2: Data Storage Solutions provides deep dives into the major categories of data storage systems, from relational and NoSQL databases to data lakes and lakehouses.
Part 3: Data Processing and Orchestration covers the tools and techniques for processing data at scale and managing complex workflows.
Part 4: Data Governance, Security, and Cloud Platforms explores the critical topics of data quality, security, and compliance, and shows you how to build data platforms on Alibaba Cloud.
Part 5: Data Engineering for AI and ML is where we dive into the exciting world of data engineering for artificial intelligence, covering RAG, ML pipelines, feature stores, and vector databases.
Part 6: Business Applications and Case Studies brings it all together with detailed, end-to-end case studies from different industries.
Prerequisites¶
To get the most out of this book, you should have:
Basic programming knowledge (preferably in Python)
Basic understanding of SQL and relational databases
Familiarity with the command line
A willingness to learn and experiment
If you are not comfortable with Python or SQL, don’t worry. Appendix B provides a quick primer on both.
Setting Up Your Environment¶
All the code examples in this book can be run on your local machine using Docker and open-source tools. Appendix A provides detailed instructions for setting up your development environment. Additionally, a complete GitHub repository with all the code examples, sample data, and hands-on exercises is available at:
https://
A Note on Cloud Platforms¶
While this book covers open-source tools that can run anywhere, it also provides extensive coverage of Alibaba Cloud, one of the world’s leading cloud providers. If you don’t have access to Alibaba Cloud, don’t worry—the concepts and architectures discussed are applicable to any cloud platform, and you can adapt the examples to AWS, Azure, or Google Cloud.
Let’s Get Started¶
Data engineering is a challenging but incredibly rewarding field. The problems are complex, the tools are powerful, and the impact is immense. Every data-driven application, every machine learning model, and every business intelligence dashboard is built on the foundation of robust data pipelines created by data engineers.
I hope this book will be a valuable companion on your journey to becoming a skilled data engineer. Let’s dive in!
Dr. Kushnazarov Farruh
University Tashkent University of Information Technologies named after Muhammad al-Khwarizmi
2026 May 1st
The students who have taken my data engineering courses over the years and provided invaluable feedback
The open-source communities behind Apache Spark, Apache Flink, Apache Airflow, and countless other projects
The Alibaba Cloud team for their support and documentation
My colleagues and fellow educators in the data community
My family for their patience and support during the writing process
Thank you all.
Table of Contents¶
Part 1: Foundations of Data Engineering¶
Chapter 1: Introduction to Data Engineering
The Role of a Data Engineer
Data Engineering vs. Data Science vs. Data Analytics
The Data Landscape
The Importance of Open-Source
Data Engineering Architecture
Chapter 2: Data Modeling and Storage Paradigms
Data Types and Structures
Data Formats
Data Modeling Fundamentals
Data Storage Paradigms
Data Quality and Data Lineage
Chapter 3: The Open-Source Ecosystem
Why Open-Source Matters
Navigating the Landscape
Understanding Open-Source Licenses
Choosing and Evaluating Projects
How to Contribute
Part 2: Data Storage Solutions¶
Chapter 4: Relational Databases
Introduction to Relational Databases
PostgreSQL: The World’s Most Advanced Open Source Database
MySQL: The World’s Most Popular Open Source Database
PostgreSQL vs. MySQL
Hands-On Exercise
Chapter 5: NoSQL Databases
The NoSQL Movement
Document Databases: MongoDB
Wide-Column Stores: Cassandra
Key-Value Stores: Redis
Choosing the Right NoSQL Database
Chapter 6: Object Storage and Data Lakes
The Rise of Object Storage
MinIO: Open-Source Object Storage
Building a Data Lake
The Medallion Architecture
Data Lake Best Practices
Chapter 7: Data Warehousing and Lakehouse Architectures
Traditional Data Warehousing
The Lakehouse Paradigm
Delta Lake
Apache Iceberg
Apache Hudi
Choosing Your Lakehouse
Part 3: Data Processing and Orchestration¶
Chapter 8: Data Processing Frameworks: Spark and Flink
The Evolution of Big Data Processing
Apache Spark: The Unified Analytics Engine
Apache Flink: The True Stream Processor
Practical Development
Chapter 9: Streaming Data with Kafka and Flink
Placeholder aligned to
BOOK_PLAN.md
Chapter 10: Transformations, Testing, and Analytics Engineering
Placeholder aligned to
BOOK_PLAN.md
Chapter 11: Data Orchestration and Workflow Management
The Need for Orchestration
Apache Airflow: The Open-Source Standard
Modern Alternatives: Prefect and Dagster
Data Orchestration on Alibaba Cloud
Part 4: Analytics, Governance, Security, and Cloud Platforms¶
Chapter 12: Data Observability and Pipeline Reliability
Placeholder aligned to
BOOK_PLAN.md
Chapter 13: Data Governance and Security
The Pillars of Data Governance
Implementing Data Quality at Scale
Data Security Best Practices
Compliance and Regulations
Data Catalogs and Discovery
Chapter 14: Data Engineering on Alibaba Cloud
An Overview of the Alibaba Cloud Data Platform
A Reference Architecture
A Deeper Dive into Key Services
Hybrid and Multi-Cloud Strategies
Chapter 15: Cost, Performance, and Scalability Engineering
Placeholder aligned to
BOOK_PLAN.md
Chapter 16: Solution Selection Framework
The Technology Evaluation Process
Build vs. Buy vs. Open-Source
Choosing a Database
Choosing a Processing Framework
Choosing a Cloud Provider
Common Pitfalls to Avoid
Part 5: Data Engineering for AI and ML¶
Chapter 17: Data Engineering for RAG Applications
Understanding RAG
Building the Data Pipeline for RAG
Vector Storage and Retrieval
Production RAG Systems
RAG on Alibaba Cloud
Chapter 18: ML Pipeline Engineering
The ML Lifecycle
Building Training Data Pipelines
Model Training and Experiment Tracking
Model Deployment
Monitoring ML Systems in Production
ML Pipelines on Alibaba Cloud
Chapter 19: Feature Stores and Model Serving
The Feature Store
Feast: The Leading Open-Source Feature Store
Model Serving
Feature Stores and Model Serving on Alibaba Cloud
Chapter 20: Vector Databases and Embeddings
Embeddings: The Lingua Franca of AI
The Challenge of Vector Search
The Vector Database Landscape
Managing Embedding Pipelines in Production
Part 6: Business Applications and Case Studies¶
Chapter 21: Case Study: Building a Real-time Customer 360 Platform
Business Goals and Requirements
Data Sources
Architecture and Technology Choices
Implementation Details and Challenges
Chapter 22: Case Study: Fraud Detection in Financial Services
Business Goals and Requirements
Challenges of Fraud Detection: Imbalance, Drift, and Delayed Labels
Real-Time Fraud Detection Reference Architecture
Feature Stores, Decisioning, Feedback Loops, and Governance
Guided Lab: Running the Shared SecurePay Fraud-Detection Project
Appendices¶
Appendix A: Setting Up Your Development Environment
Appendix B: A SQL and Python Primer
Appendix C: The Data Engineering Career Path
Appendix D: Further Reading and Resources
Appendix E: Glossary of Terms
How to Use This Book¶
This book is designed to be read sequentially, as each chapter builds on the concepts introduced in previous chapters. However, if you are already familiar with certain topics, you can skip ahead to the chapters that interest you most.
Each chapter follows a similar structure:
Introduction: An overview of what you will learn
Concepts: Detailed explanations of the key concepts
Hands-On Examples: Practical code examples you can run yourself
Best Practices: Production-ready advice from real-world experience
Chapter Summary: A recap of the key takeaways
Exercises: Practice problems to reinforce your learning
Make sure to work through the hands-on exercises and examples. Data engineering is a practical skill that is best learned by doing.
Let’s begin your journey into the world of data engineering!
Acknowledgments¶
This book would not have been possible without the support and contributions of many people: