Data Engineering in Action

Building Scalable Pipelines with Open Source Tools and Cloud Platforms


About the Author

This book was written by a university professor with extensive experience teaching data engineering courses. Its primary goal is to make data engineering as practical and helpful as possible for students entering the workforce as data engineers.


Preface

Welcome to Data Engineering in Action. This book is designed to be your comprehensive, practical guide to the world of modern data engineering. Whether you are a student just starting your journey, a software engineer looking to transition into data, or a data analyst wanting to deepen your technical skills, this book will provide you with the knowledge and hands-on experience you need to succeed.

Figure 1: Data is the new oil.

In the 21st century, the phrase “data is the new oil” has become a cliché, but like many clichés, it holds a real truth. Data, in its raw form, is a crude, unrefined resource. It flows from every corner of our digital world: every click on a website, every transaction in a store, every sensor reading from a smart device, every post on social media. This raw data, much like crude oil, is full of potential, but it is not immediately useful. It is messy, inconsistent, and often overwhelming. To unlock its value, it must be discovered, collected, cleaned, processed, and transformed into a reliable, usable, and accessible product. This is the work of data engineering.

Data engineering is one of the most critical and in-demand fields in technology today. Every modern organization, from startups to Fortune 500 companies, generates more data than it can easily use: data from websites, mobile apps, IoT devices, business systems, and countless other sources. Until that data has been collected, cleaned, transformed, and organized, it cannot power analytics, machine learning, or data-driven decision-making. Doing that work is the job of the data engineer.

This book is different from other data engineering books in several important ways. First, it is intensely practical. Every chapter includes hands-on exercises with real code that you can run on your own machine. Second, it covers the entire modern data stack, from traditional data warehousing to cutting-edge AI and machine learning infrastructure. Third, it provides deep coverage of both open-source tools and cloud platforms, with a particular focus on Alibaba Cloud, one of the world’s leading cloud providers.

Who This Book Is For

This book is designed for:

What You Will Learn

By the end of this book, you will be able to:

How This Book Is Organized

The book is organized into six parts:

Part 1: Foundations of Data Engineering introduces you to the field, covering the key roles, the data landscape, and core concepts like data modeling and storage paradigms.

Part 2: Data Storage Solutions provides deep dives into the major categories of data storage systems, from relational and NoSQL databases to data lakes and lakehouses.

Part 3: Data Processing and Orchestration covers the tools and techniques for processing data at scale and managing complex workflows.

Part 4: Data Governance, Security, and Cloud Platforms explores the critical topics of data quality, security, and compliance, and shows you how to build data platforms on Alibaba Cloud.

Part 5: Data Engineering for AI and ML is where we dive into the exciting world of data engineering for artificial intelligence, covering RAG, ML pipelines, feature stores, and vector databases.

Part 6: Business Applications and Case Studies brings it all together with detailed, end-to-end case studies from different industries.

Prerequisites

To get the most out of this book, you should have:

If you are not comfortable with Python or SQL, don’t worry. Appendix B provides a quick primer on both.

Setting Up Your Environment

All the code examples in this book can be run on your local machine using Docker and open-source tools. Appendix A provides detailed instructions for setting up your development environment. Additionally, a complete GitHub repository with all the code examples, sample data, and hands-on exercises is available at:

https://github.com/yourusername/data-engineering-in-action

A Note on Cloud Platforms

While this book covers open-source tools that can run anywhere, it also provides extensive coverage of Alibaba Cloud. If you don’t have access to Alibaba Cloud, don’t worry: the concepts and architectures discussed apply to any cloud platform, and you can adapt the examples to AWS, Azure, or Google Cloud.

Let’s Get Started

Data engineering is a challenging but incredibly rewarding field. The problems are complex, the tools are powerful, and the impact is immense. Every data-driven application, every machine learning model, and every business intelligence dashboard is built on the foundation of robust data pipelines created by data engineers.

I hope this book will be a valuable companion on your journey to becoming a skilled data engineer. Let’s dive in!


Dr. Kushnazarov Farruh
Tashkent University of Information Technologies named after Muhammad al-Khwarizmi
May 1, 2026


Thank you all.


Table of Contents

Part 1: Foundations of Data Engineering

Chapter 1: Introduction to Data Engineering

Chapter 2: Data Modeling and Storage Paradigms

Chapter 3: The Open-Source Ecosystem

Part 2: Data Storage Solutions

Chapter 4: Relational Databases

Chapter 5: NoSQL Databases

Chapter 6: Object Storage and Data Lakes

Chapter 7: Data Warehousing and Lakehouse Architectures

Part 3: Data Processing and Orchestration

Chapter 8: Data Processing Frameworks: Spark and Flink

Chapter 9: Streaming Data with Kafka and Flink

Chapter 10: Transformations, Testing, and Analytics Engineering

Chapter 11: Data Orchestration and Workflow Management

Part 4: Analytics, Governance, Security, and Cloud Platforms

Chapter 12: Data Observability and Pipeline Reliability

Chapter 13: Data Governance and Security

Chapter 14: Data Engineering on Alibaba Cloud

Chapter 15: Cost, Performance, and Scalability Engineering

Chapter 16: Solution Selection Framework

Part 5: Data Engineering for AI and ML

Chapter 17: Data Engineering for RAG Applications

Chapter 18: ML Pipeline Engineering

Chapter 19: Feature Stores and Model Serving

Chapter 20: Vector Databases and Embeddings

Part 6: Business Applications and Case Studies

Chapter 21: Case Study: Building a Real-time Customer 360 Platform

Chapter 22: Case Study: Fraud Detection in Financial Services

Appendices

Appendix A: Setting Up Your Development Environment

Appendix B: A SQL and Python Primer

Appendix C: The Data Engineering Career Path

Appendix D: Further Reading and Resources

Appendix E: Glossary of Terms


How to Use This Book

This book is designed to be read sequentially, as each chapter builds on the concepts introduced in previous chapters. However, if you are already familiar with certain topics, you can skip ahead to the chapters that interest you most.

Each chapter follows a similar structure:

  1. Introduction: An overview of what you will learn

  2. Concepts: Detailed explanations of the key concepts

  3. Hands-On Examples: Practical code examples you can run yourself

  4. Best Practices: Production-ready advice from real-world experience

  5. Chapter Summary: A recap of the key takeaways

  6. Exercises: Practice problems to reinforce your learning

Make sure to work through the hands-on exercises and examples. Data engineering is a practical skill that is best learned by doing.


Let’s begin your journey into the world of data engineering!

Acknowledgments

This book would not have been possible without the support and contributions of many people: