
Chapter 17: Case Study: Building a Real-time Customer 360 Platform

Throughout this book, we have explored the theories, tools, and technologies that form the foundation of modern data engineering. Now it is time to put it all together and see how these concepts are applied to solve real-world business problems. In this final part of the book, we will walk through a series of detailed case studies, each one showcasing a different application of data engineering in a different industry. We will start with one of the most common and high-impact data engineering projects: building a Real-time Customer 360 platform.

In today’s competitive digital landscape, understanding your customers is more important than ever. A Customer 360 platform is a system that consolidates all the data about your customers from a variety of different sources to create a single, unified view of each customer. This unified view can then be used to power a wide range of business applications, from personalized marketing and targeted advertising to improved customer service and proactive churn prevention. While a traditional Customer 360 platform is built on a batch-oriented data warehouse, a modern, real-time platform is built on a streaming architecture that can provide an up-to-the-second view of the customer.

This chapter will walk you through the process of designing and building a real-time Customer 360 platform for a fictional e-commerce company. We will start by defining the business goals and the data sources. We will then design the end-to-end architecture, from real-time data ingestion and processing to the storage and serving of the unified customer profile. We will discuss the technology choices at each stage of the pipeline and the trade-offs involved. By the end of this chapter, you will have a practical, step-by-step guide to building one of the most valuable data assets in any modern business.

17.1 Business Goals and Requirements

Let’s imagine we are the data engineering team at “ShopSmart,” a fast-growing e-commerce company. The business has several key goals that it wants to achieve with a real-time Customer 360 platform:

  • Personalize the website experience in real time, based on what a customer is doing in their current session.

  • Give customer-facing teams full context on each customer, including whether they have an open support ticket.

  • Enable the marketing team to build precise customer segments for targeted campaigns.

  • Predict and proactively prevent customer churn.

From these business goals, we can derive a set of technical requirements for our platform:

  • Ingest both streaming data (clickstream, support events) and batch data (transactions, email marketing) into a single platform.

  • Compute and store both historical and real-time features for every customer.

  • Serve a unified customer profile with low-latency point lookups.

  • Resolve the different identifiers a customer has across systems into a single identity.

17.2 Data Sources

ShopSmart has a variety of data sources that we will need to integrate into our platform:

  • Website clickstream: a real-time stream of page views and events, published to Kafka.

  • Transactional database: a MySQL database containing orders and customer records.

  • Customer service system: Zendesk, which emits support ticket events via a webhook.

  • Email marketing system: Mailchimp, which holds campaign and engagement data.

17.3 Architecture and Technology Choices

Now let’s design the end-to-end architecture for our real-time Customer 360 platform. We will use a combination of open-source tools and services on Alibaba Cloud.

The Lambda Architecture: A Hybrid Approach

We will use a Lambda Architecture, a popular pattern for building big data pipelines that need to handle both real-time and batch processing. The Lambda Architecture consists of three layers:

  1. The Batch Layer: This layer pre-computes a comprehensive view of the data from the entire historical dataset. It is optimized for accuracy and completeness.

  2. The Speed Layer (or Streaming Layer): This layer processes data in real time and provides a low-latency, up-to-the-second view of the most recent data. It is optimized for speed.

  3. The Serving Layer: This layer combines the results from the batch layer and the speed layer to provide a complete and up-to-date view of the data to querying applications.
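The serving layer's combination of the two views is conceptually simple. As a minimal sketch in Python, assuming each layer exposes its features as a dictionary (the feature names are illustrative):

```python
def merge_profiles(batch_view: dict, speed_view: dict) -> dict:
    """Combine a customer's batch features with real-time features.

    Real-time keys take precedence, since they reflect recent events
    not yet folded into the batch view.
    """
    profile = dict(batch_view)
    profile.update(speed_view)
    return profile

batch = {"total_lifetime_spend": 1240.50, "number_of_orders": 8}
speed = {"current_session_page_views": 5, "has_open_support_ticket": True}

print(merge_profiles(batch, speed))
```

The precedence rule matters: when the same feature appears in both views, the speed layer's value is fresher and should win.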

Our Architecture

Here is the architecture for our real-time Customer 360 platform:

  1. Data Ingestion:

    • Real-time: We will use DataHub to ingest the website clickstream from Kafka and the customer service events from Zendesk (via a webhook).

    • Batch: We will use DataWorks to perform a daily batch ingest of the data from the transactional MySQL database and the email marketing system (Mailchimp) into our data lake on OSS.

  2. The Batch Layer:

    • We will use MaxCompute to build our batch-level customer profiles. A daily Spark job will run on MaxCompute to read the raw data from OSS, join it together, and compute a set of historical features for each customer (e.g., total_lifetime_spend, number_of_orders, last_purchase_date).

    • The output of this job will be a comprehensive customer profile table that is stored in MaxCompute.

  3. The Speed Layer:

    • We will use Realtime Compute for Apache Flink to process the real-time data from DataHub. A Flink job will consume the clickstream and customer service events, and it will compute a set of real-time features for each customer (e.g., current_session_page_views, time_since_last_event, has_open_support_ticket).

    • The Flink job will maintain the state of these real-time features in its own state backend.

  4. The Serving Layer:

    • This is the heart of our Customer 360 platform. We need a database that can combine the batch and real-time views and serve the unified profile with low latency. Hologres is a natural fit here: it is a real-time data warehouse that supports fast point lookups and can also directly query data stored in MaxCompute.

    • We will maintain two profile tables in Hologres, each keyed by customer ID:

      • A batch profile table that is updated daily from MaxCompute.

      • A real-time profile table that is updated in real time by our Flink job.

    • When an application needs to get the complete profile for a customer, it will query a view in Hologres that joins these two tables together.

  5. Data Consumption:

    • The e-commerce website will call a microservice that queries the Hologres view to get the real-time customer profile for personalization.

    • The marketing team will use a BI tool (like Quick BI) to query the batch profile table in MaxCompute to build customer segments for their campaigns.

    • The data science team will use the historical feature data in MaxCompute to train a churn prediction model.
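The batch-layer feature computation in step 2 boils down to a group-by over the raw order history. As a pure-Python stand-in for the daily job (in production this would run as Spark or SQL on MaxCompute; the record layout here is an assumption):

```python
from collections import defaultdict

def compute_batch_features(orders):
    """orders: iterable of dicts with customer_id, amount, order_date."""
    profiles = defaultdict(lambda: {
        "total_lifetime_spend": 0.0,
        "number_of_orders": 0,
        "last_purchase_date": None,
    })
    for order in orders:
        p = profiles[order["customer_id"]]
        p["total_lifetime_spend"] += order["amount"]
        p["number_of_orders"] += 1
        # ISO dates compare correctly as strings.
        if p["last_purchase_date"] is None or order["order_date"] > p["last_purchase_date"]:
            p["last_purchase_date"] = order["order_date"]
    return dict(profiles)

orders = [
    {"customer_id": "c42", "amount": 40.0, "order_date": "2024-05-01"},
    {"customer_id": "c42", "amount": 60.0, "order_date": "2024-06-15"},
]
print(compute_batch_features(orders)["c42"])
```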
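The speed layer's per-customer state (step 3) can be illustrated with a small stand-alone class. In the real pipeline this bookkeeping would live in Flink's keyed state rather than an in-memory dict, and the event shape below is a simplified assumption:

```python
class RealtimeFeatureTracker:
    """Toy stand-in for Flink keyed state: tracks per-customer
    real-time features from a stream of events."""

    def __init__(self):
        self._state = {}  # customer_id -> feature dict

    def on_event(self, customer_id: str, event_type: str, ts: float) -> dict:
        features = self._state.setdefault(customer_id, {
            "current_session_page_views": 0,
            "has_open_support_ticket": False,
            "last_event_ts": ts,
        })
        if event_type == "page_view":
            features["current_session_page_views"] += 1
        elif event_type == "ticket_opened":
            features["has_open_support_ticket"] = True
        elif event_type == "ticket_closed":
            features["has_open_support_ticket"] = False
        features["time_since_last_event"] = ts - features["last_event_ts"]
        features["last_event_ts"] = ts
        return dict(features)

tracker = RealtimeFeatureTracker()
tracker.on_event("c42", "page_view", ts=100.0)
snapshot = tracker.on_event("c42", "ticket_opened", ts=130.0)
print(snapshot["current_session_page_views"], snapshot["has_open_support_ticket"])
```

A production job would also need session-expiry logic (e.g., resetting page views after inactivity), which Flink handles with timers.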
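The serving-layer view in step 4 can be mimicked with SQLite for illustration; Hologres speaks the PostgreSQL protocol, so the real view would look similar, though every table and column name here is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batch_profile (
        customer_id TEXT PRIMARY KEY,
        total_lifetime_spend REAL,
        number_of_orders INTEGER
    );
    CREATE TABLE realtime_profile (
        customer_id TEXT PRIMARY KEY,
        current_session_page_views INTEGER,
        has_open_support_ticket INTEGER
    );
    -- The unified view: batch features left-joined with real-time ones,
    -- so customers with no activity today still resolve.
    CREATE VIEW customer_360 AS
    SELECT b.customer_id,
           b.total_lifetime_spend,
           b.number_of_orders,
           COALESCE(r.current_session_page_views, 0) AS current_session_page_views,
           COALESCE(r.has_open_support_ticket, 0)    AS has_open_support_ticket
    FROM batch_profile b
    LEFT JOIN realtime_profile r USING (customer_id);
""")
conn.execute("INSERT INTO batch_profile VALUES ('c42', 1240.5, 8)")
conn.execute("INSERT INTO realtime_profile VALUES ('c42', 5, 1)")
row = conn.execute(
    "SELECT * FROM customer_360 WHERE customer_id = 'c42'").fetchone()
print(row)  # ('c42', 1240.5, 8, 5, 1)
```

The LEFT JOIN is the important design choice: a customer who has been inactive all day still has a batch profile, so the view must not drop them just because the real-time table has no row.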

17.4 Implementation Details and Challenges

Identity Resolution: The Stitching Problem

One of the biggest challenges in building a Customer 360 platform is identity resolution. A single customer might be represented by different IDs in different systems (e.g., a customer_id in the database, a cookie_id in the clickstream, an email_address in Zendesk). You need to have a robust process for stitching these different identities together to create a single, unified customer profile. This often involves a combination of deterministic rules (e.g., joining on email address) and probabilistic matching.
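A deterministic version of this stitching can be expressed as a union-find over identifier pairs. The sketch below assumes each source record carries (system, id) identifiers known to belong to the same person — a simplification that leaves out the probabilistic-matching half of the problem:

```python
class IdentityGraph:
    """Union-find over (system, id) pairs: any two identifiers seen
    together in one record are stitched into the same customer."""

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        self._parent.setdefault(key, key)
        while self._parent[key] != key:
            # Path halving keeps lookups near-constant time.
            self._parent[key] = self._parent[self._parent[key]]
            key = self._parent[key]
        return key

    def link(self, a, b):
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_customer(self, a, b) -> bool:
        return self._find(a) == self._find(b)

graph = IdentityGraph()
# An order row links a customer_id to an email address...
graph.link(("mysql", "customer_id:42"), ("email", "ada@example.com"))
# ...and a Zendesk ticket links the same email to a cookie.
graph.link(("email", "ada@example.com"), ("clickstream", "cookie:abc123"))
print(graph.same_customer(("mysql", "customer_id:42"),
                          ("clickstream", "cookie:abc123")))  # True
```

Note the transitivity: the database ID and the cookie are never seen in the same record, yet they are stitched together through the shared email address. This is exactly what makes stitching powerful — and what makes a single bad deterministic rule (e.g., a shared household email) dangerous.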

Data Modeling for the Unified Profile

You need to carefully design the data model for your unified customer profile. This will likely be a wide table with hundreds or even thousands of columns, including:

  • Identity attributes: the resolved customer ID and the linked identifiers from each source system.

  • Historical (batch) features: e.g., total_lifetime_spend, number_of_orders, last_purchase_date.

  • Real-time features: e.g., current_session_page_views, time_since_last_event, has_open_support_ticket.

Managing the Complexity

This is a complex architecture with many moving parts. You will need to have a robust data orchestration and monitoring strategy. DataWorks can be used to orchestrate the entire pipeline, from the batch ingest to the daily MaxCompute job. You will also need to have detailed monitoring and alerting in place for your Flink job and your serving layer to ensure that the system is reliable and performant.

Chapter Summary

In this chapter, we have walked through a detailed, real-world case study of building a real-time Customer 360 platform. We have seen how to combine a variety of data engineering tools and technologies—from Kafka and Flink for real-time processing to MaxCompute and Hologres for batch processing and serving—to build a powerful, high-impact data product. We have also discussed some of the key challenges, such as identity resolution and data modeling.

This case study demonstrates how the foundational concepts we have learned throughout this book can be applied to solve a concrete business problem. In the next chapter, we will look at another real-world case study from a different industry: building a fraud detection system in financial services.