A real-time Customer 360 platform answers a deceptively simple question: what do we know about this customer right now, and what action is safe, useful, and compliant? In this capstone, you will combine the core patterns from the book into an end-to-end customer profile platform for TuranMart, the fictional e-commerce and logistics company used throughout the earlier chapters. The practical outcome is a runnable project that generates source data, builds bronze, silver, and gold Customer 360 products, and validates the result with operating checks.
This chapter is not a tour of vendor products. It is a design and implementation lesson about source contracts, change capture, event streams, identity resolution, medallion data products, consent-aware activation, and operational observability. Customer 360 programs often fail when teams treat the profile as a wide table; they succeed when the profile is managed as a governed data product with clear owners, service levels, consumers, and failure modes.
Figure 1: Chapter 21 capstone overview: business goals, source systems, streaming and batch processing, identity resolution, serving, activation, and operating controls.
21.1 Opening Scenario: TuranMart’s Fragmented Customer Experience
TuranMart’s growth team has a familiar problem. The web storefront can see which products a visitor clicked during the current session. The commerce database knows what they bought last week. The support desk knows whether the same customer has an unresolved complaint. The campaign platform knows which promotion they opened. The consent system knows whether marketing and personalization are currently allowed. Yet the application that must decide the next best action sees only one fragment at a time.
The pain is visible in daily operations. A loyal customer who has just opened a delivery complaint receives a cheerful upsell email. A returning visitor sees a discount for a product they already purchased. A support agent spends the first two minutes asking questions that TuranMart already knows how to answer. The retention team receives a churn-risk list only after the customer has stopped visiting the site. Everyone has data, but no one has a trusted, current, and governed customer profile.
The Chief Customer Officer asks for a platform that supports four high-value decisions. The website should personalize the next session. The support desk should show recent orders, complaints, value, and consent. The retention team should prioritize urgent customers. The marketing team should build compliant audiences. The success criterion is not merely that a pipeline runs; it is that every known customer receives exactly one usable profile, the profile is fresh enough for its consumers, and activation respects current consent.
| Use case | Pain point | Solution required |
|---|---|---|
| Website personalization | The storefront reacts to the current page view but does not know recent purchases or service issues. | A low-latency profile lookup that combines recent behavior, value, and consent. |
| Support-agent context | Agents ask customers to repeat information that exists in other systems. | A support-ready customer view with orders, tickets, value tier, and risk indicators. |
| Retention workflow | Churn risk is discovered after engagement has already dropped. | A gold profile with recency, engagement, support, and campaign features. |
| Marketing segmentation | Campaign lists are assembled from inconsistent extracts and may ignore consent changes. | A governed audience source that carries current consent and suppression attributes. |
These requirements force TuranMart to integrate systems with different latency, ownership, and data-shape characteristics. Orders arrive through a transactional database. Clickstream events arrive continuously. Support tickets and campaign events arrive from SaaS platforms. Consent changes must reach activation logic quickly enough to prevent misuse. The Customer 360 platform must therefore combine historical completeness with fresh behavioral state.
Learning Objectives
By the end of this chapter, you should be able to design, build, validate, and explain a production-style Customer 360 platform. The runnable project is intentionally lightweight, but the reasoning mirrors a real data engineering review.
| Learning outcome | What you will do in the capstone | Evidence of mastery |
|---|---|---|
| Translate business goals into requirements | Convert personalization, support, retention, and marketing needs into freshness, quality, consent, and serving requirements. | You can state which consumers need seconds, minutes, hours, or daily freshness and why. |
| Design the reference architecture | Combine CDC, event streams, stream processing, lakehouse layers, serving stores, and observability. | You can explain why the architecture separates ingestion, processing, identity, serving, and control planes. |
| Model the unified customer profile | Build identity, consent, value, engagement, service, segment, and risk fields. | You can distinguish a silver customer entity from a gold serving profile. |
| Validate operational readiness | Check profile coverage, duplicate profiles, critical nulls, freshness, and consent readiness. | You can interpret the generated quality report and decide whether the profile is safe to use. |
| Extend the project safely | Propose changes such as late-event handling, activation exports, stronger identity resolution, and serving contracts. | You can modify the project without breaking reproducibility or governance assumptions. |
21.2 Conceptual Foundation: Customer 360 as a Governed Data Product
A Customer 360 platform is a governed data product that unifies customer identity, behavior, transactions, service history, consent, and derived signals for downstream decisions. AWS describes Customer 360 programs around the pillars of data collection, unification, analytics, activation, and governance, which is a useful reminder that the profile must support both analytical and operational use cases.[1] Salesforce’s architecture guidance similarly emphasizes trusted, unified, actionable, and real-time customer data for operational, analytical, AI, and agentic workflows.[2]
The phrase real-time should be used carefully. A website personalization decision may need a profile lookup in milliseconds and feature updates within seconds or minutes. A support dashboard may tolerate a few minutes. A campaign segment may be safe with hourly or daily refresh. Real-time architecture is therefore not a single latency target; it is a set of service-level promises tied to consumer decisions.
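The idea that freshness is a per-consumer promise can be written down as data. The following is a minimal sketch, not part of the capstone project: the consumer names, the thresholds in `FRESHNESS_SLO`, and the `is_fresh` helper are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness promises per consumer; the numbers are illustrative only.
FRESHNESS_SLO = {
    "website_personalization": timedelta(minutes=1),
    "support_dashboard": timedelta(minutes=5),
    "campaign_segment": timedelta(hours=24),
}


def is_fresh(consumer: str, profile_updated_at: datetime, now: datetime) -> bool:
    """Return True when the profile's age is within the consumer's promise."""
    return now - profile_updated_at <= FRESHNESS_SLO[consumer]


# The same three-minute-old profile is checked against every consumer's promise.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
updated = now - timedelta(minutes=3)
for consumer in FRESHNESS_SLO:
    print(consumer, is_fresh(consumer, updated, now))
```

With these assumed thresholds, the same profile is stale for website personalization but fresh enough for the support dashboard and the campaign segment. This is why a single platform-wide latency target is rarely meaningful.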
| Concept | Definition | Why it matters |
|---|---|---|
| Customer 360 | A governed profile that combines identity, transactions, behavior, service, consent, and derived features. | It prevents each team from inventing a different customer truth. |
| Source contract | A documented agreement about owner, schema, key, event time, update pattern, lateness, and privacy classification. | It makes ingestion and downstream expectations explicit. |
| Change data capture | The pattern of converting database inserts, updates, and deletes into a stream of change events. | It keeps operational history fresh without repeatedly scanning source databases. |
| Event stream | A continuously appended sequence of events that represent facts or state changes. | It allows processing systems to react to behavior as it happens. |
| Identity resolution | The process of mapping identifiers such as loyalty ID, email hash, phone hash, session ID, and CRM contact ID to a stable customer key. | It prevents fragmented or over-merged profiles. |
| Medallion architecture | A bronze, silver, and gold layering pattern for progressively refined data products. | It separates raw evidence, trusted entities, and serving-ready products. |
| Serving profile | A compact profile optimized for online or operational consumers. | It avoids exposing a wide analytical table directly to latency-sensitive applications. |
| Operating checks | Automated measurements for coverage, uniqueness, nulls, freshness, schema, and consent readiness. | They convert data trust into observable service levels. |
Event streaming provides the backbone for fresh profiles. Confluent describes event streaming as the processing of continuous flows of events as changes happen, allowing applications to react without waiting for a batch window.[3] Apache Flink documentation distinguishes bounded from unbounded streams and explains why event time and stateful processing are central to consistent stream applications, especially when events arrive out of order.[4]
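To make the out-of-order problem concrete, here is a small sketch of event-time windowing with a watermark. It is plain Python, not Flink's API, and it simplifies heavily. The window size, the allowed lateness, and the event shape (an `event_time` in epoch seconds) are assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60            # assumed tumbling-window size
ALLOWED_LATENESS_SECONDS = 30  # assumed tolerance for out-of-order events


def window_start(event_time: int) -> int:
    """Map an event time to the start of its tumbling window."""
    return event_time - event_time % WINDOW_SECONDS


def count_by_event_time(events):
    """Count events per event-time window; route too-late events aside."""
    counts = defaultdict(int)
    late = []
    watermark = float("-inf")
    for event in events:  # iteration order represents arrival order
        ts = event["event_time"]
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, ts - ALLOWED_LATENESS_SECONDS)
        # A window [start, start + size) is closed once the watermark passes its end.
        if window_start(ts) + WINDOW_SECONDS <= watermark:
            late.append(event)
        else:
            counts[window_start(ts)] += 1
    return dict(counts), late


arrivals = [{"event_time": t} for t in (10, 70, 50, 130, 20)]
counts, late = count_by_event_time(arrivals)
print(counts)
print(late)
```

In this run the event at time 50 arrives after the event at time 70 but is still counted in its original window, because the watermark has not yet passed that window's end. The event at time 20 arrives after the watermark has moved on and is set aside, which is the point where a production system needs an explicit late-data policy.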
Change data capture is the bridge from systems of record to streams. Debezium documents a pattern in which connectors capture row-level changes from databases and stream those change events to Kafka topics.[5] For Customer 360, CDC is especially useful for customers, orders, addresses, account state, and consent records because those entities are often stored in transactional systems that should not be overloaded by analytical joins.
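A short sketch may help show what consuming change events looks like downstream. The event shape below is modeled loosely on the Debezium envelope (`op`, `before`, `after`), reduced to the fields needed here. The `apply_change` function and the `customer_id` key are assumptions for illustration and are not taken from the capstone code.

```python
from typing import Any


def apply_change(state: dict, event: dict, key: str = "customer_id") -> None:
    """Apply one row-level change event to an in-memory table keyed by `key`."""
    op = event["op"]
    if op in ("c", "u", "r"):  # create, update, or snapshot read
        row: dict[str, Any] = event["after"]
        state[row[key]] = row
    elif op == "d":  # delete: only the "before" image carries the key
        state.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported operation: {op!r}")


customers: dict = {}
changes = [
    {"op": "c", "before": None, "after": {"customer_id": "c1", "tier": "silver"}},
    {"op": "u", "before": {"customer_id": "c1", "tier": "silver"},
     "after": {"customer_id": "c1", "tier": "gold"}},
    {"op": "c", "before": None, "after": {"customer_id": "c2", "tier": "bronze"}},
    {"op": "d", "before": {"customer_id": "c2", "tier": "bronze"}, "after": None},
]
for change in changes:
    apply_change(customers, change)
print(customers)
```

After the four events only `c1` remains, with its latest tier. The source database was never queried to reach that state, which is the property that makes CDC attractive for profile pipelines.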
The lakehouse pattern then gives the platform durable, replayable, and auditable data products. Databricks describes medallion architecture as a sequence of bronze, silver, and gold layers, where bronze keeps raw ingested data, silver performs cleaning and validation, and gold provides business-level aggregates and serving-ready datasets.[6] In Customer 360, this separation is more than a naming convention. It defines where raw facts are preserved, where identity is resolved, where quality is enforced, and where application contracts are served.
21.3 Production Design Pattern: Reference Architecture
The production design pattern separates five planes: ingestion, processing, identity, serving, and control. This separation keeps the architecture understandable even when the underlying tools vary by organization. TuranMart could implement the pattern with open-source components such as Debezium, Kafka, Flink, Iceberg, PostgreSQL, Redis, and Superset. A cloud-native implementation could use managed CDC, managed streaming, lakehouse tables, serverless SQL, online feature stores, and observability tools. The durable lesson is not the logo on each box; it is the contract between the boxes.
Figure 2: Reference architecture for a real-time Customer 360 platform using source systems, CDC and events, stream processing, a medallion lakehouse, serving stores, activation services, and observability controls.
| Architecture plane | Production responsibility | Typical technology choices | Capstone equivalent |
|---|---|---|---|
| Ingestion | Capture database changes, clickstream events, support updates, campaign events, and consent changes with event-time metadata. | CDC connectors, Kafka Connect, managed streaming, REST collectors, object storage landing zones. | scripts/generate_data.py writes deterministic source CSV files. |
| Processing | Clean, deduplicate, aggregate, join, and maintain state for customer profiles. | Flink, Spark Structured Streaming, SQL transformations, dbt, orchestration systems. | scripts/build_customer360.py builds bronze, silver, and gold outputs. |
| Identity | Resolve identifiers to a stable customer key with auditability and confidence. | MDM, identity graph, deterministic rules, probabilistic matching, stewardship tools. | data/silver/customer_identity.csv. |
| Serving | Provide profiles to websites, support tools, marketing, analytics, and ML systems. | Redis, PostgreSQL, Hologres, Elasticsearch/OpenSearch, online feature stores, APIs. | data/gold/customer_360.csv and summary outputs. |
| Control | Monitor schema, freshness, quality, consent, access, lineage, and rollback readiness. | Data observability, lineage tools, CI checks, alerting, access policies. | scripts/validate_outputs.py and reports/customer360_quality_report.md. |
The reference architecture should begin with source contracts rather than technology selection. A source contract names the owner, grain, primary key, event-time field, update pattern, expected lateness, privacy classification, and replay strategy. Without that contract, a downstream team may discover too late that order updates overwrite history, clickstream events can arrive late, support tickets have inconsistent customer IDs, or consent records are not current enough for activation.
| Source domain | Example capstone file | Grain | Event-time field | Production concern |
|---|---|---|---|---|
| Customer master | data/raw/customers.csv | One row per customer. | created_at | Customer identifiers must be stable and privacy-protected. |
| Orders | data/raw/orders.csv | One row per order event. | event_time | Returns and cancellations must not inflate value metrics. |
| Clickstream | data/raw/clickstream.csv | One row per behavioral event. | event_time | Anonymous sessions and late events require careful state handling. |
| Support tickets | data/raw/support_tickets.csv | One row per ticket. | opened_at | Complaint signals should influence retention and support context. |
| Campaign events | data/raw/campaign_events.csv | One row per campaign contact. | sent_at | Opens and clicks must be interpreted with channel and consent rules. |
| Consent events | data/raw/consent_events.csv | One row per consent change. | event_time | Current consent must constrain activation even when commercial value is high. |
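A source contract becomes more useful when it is machine-readable. The sketch below shows one possible shape, assuming a `SourceContract` dataclass that mirrors the attributes named above. The owner, primary key column, update pattern, lateness, privacy class, and replay strategy shown for orders are illustrative assumptions, not values from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceContract:
    """One source contract; fields mirror the attributes a contract should name."""
    domain: str
    path: str
    owner: str
    grain: str
    primary_key: str
    event_time_field: str
    update_pattern: str
    expected_lateness_minutes: int
    privacy_classification: str
    replay_strategy: str


# Values marked "assumed" are illustrative and not taken from the project.
ORDERS_CONTRACT = SourceContract(
    domain="orders",
    path="data/raw/orders.csv",
    owner="commerce-platform-team",              # assumed
    grain="one row per order event",
    primary_key="order_id",                      # assumed column name
    event_time_field="event_time",
    update_pattern="append-only change events",  # assumed
    expected_lateness_minutes=15,                # assumed
    privacy_classification="internal",           # assumed
    replay_strategy="re-read the landing file",  # assumed
)


def missing_contract_columns(contract: SourceContract, header: list[str]) -> list[str]:
    """Return contract columns that a file header does not provide."""
    required = [contract.primary_key, contract.event_time_field]
    return [column for column in required if column not in header]


print(missing_contract_columns(ORDERS_CONTRACT, ["order_id", "customer_id", "amount_usd"]))
```

The example header lacks `event_time`, so the check reports it. Running a check like this at ingestion is one way to surface a contract violation before it reaches the silver layer.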
The design must also distinguish analytical profiles from online serving profiles. An analytical profile may include hundreds of columns and support ad hoc exploration. An online serving profile should be compact, versioned, and designed for predictable lookup latency. Exposing a wide warehouse table directly to an application is usually a mistake because it creates unstable contracts, high payload sizes, and unclear access controls.
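One way to picture the difference is a projection from a wide analytical row to a compact, versioned payload. This is a hedged sketch: the field names come from the gold profile described in this chapter, while `SERVING_SCHEMA_VERSION`, `SERVING_FIELDS`, and `to_serving_payload` are hypothetical names introduced here.

```python
SERVING_SCHEMA_VERSION = "v1"  # hypothetical version label
SERVING_FIELDS = (
    "customer_id",
    "segment",
    "retention_priority",
    "marketing_consent",
    "personalization_consent",
)


def to_serving_payload(wide_row: dict) -> dict:
    """Project a wide analytical row onto the compact serving contract."""
    missing = [field for field in SERVING_FIELDS if field not in wide_row]
    if missing:
        # Fail loudly instead of serving a partial profile.
        raise KeyError(f"serving contract fields missing: {missing}")
    payload = {field: wide_row[field] for field in SERVING_FIELDS}
    payload["schema_version"] = SERVING_SCHEMA_VERSION
    return payload


wide_row = {
    "customer_id": "c1",
    "segment": "loyal",
    "retention_priority": "normal",
    "marketing_consent": "true",
    "personalization_consent": "false",
    "click_events_45d": 12,   # analytical detail the online consumer does not need
    "avg_csat": 4.5,
}
print(to_serving_payload(wide_row))
```

The application sees six stable keys regardless of how many analytical columns are added later. A change to the contract itself then becomes an explicit version bump.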
21.4 Identity Resolution and the Unified Profile Model
Identity resolution is the central modeling challenge in Customer 360. A customer may appear as a session cookie before login, a loyalty ID after purchase, an email hash in marketing, and a CRM contact ID in support. If TuranMart does not stitch these identifiers together, it creates partial profiles. If it stitches too aggressively, it may merge two real people and cause privacy, support, or compliance mistakes.
A robust identity model starts with deterministic rules. The same verified loyalty ID should map to the same resolved customer key. A verified email hash can be a strong signal if the business process guarantees uniqueness. Phone hashes may be useful but require caution when families share accounts. Probabilistic identity methods can improve coverage, but they require confidence scores, review workflows, unmerge procedures, and audit history.
Figure 3: Identity resolution graph showing how loyalty IDs, hashed email, hashed phone, sessions, orders, support tickets, and consent events are stitched into one resolved customer profile.
The capstone project deliberately keeps identity resolution conservative. It writes data/silver/customer_identity.csv, carries hashed contact identifiers, and assigns a stable resolved key. That implementation is not meant to solve enterprise identity in full; it is meant to make the profile pipeline auditable and reproducible. In a larger platform, the identity table would be versioned, governed by a data owner, and connected to privacy workflows such as deletion, suppression, and unmerge.
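The conservative, deterministic style of resolution described above can be sketched in a few lines. This is not the project's implementation. `DeterministicResolver`, the `rck_` key format, and the choice to treat only loyalty ID and email hash as strong identifiers are assumptions made for illustration.

```python
import hashlib
from typing import Optional


def hash_identifier(value: str) -> str:
    """Normalize and hash a contact identifier so the raw value is not stored.

    Plain SHA-256 is used for brevity; a production system would likely
    prefer a keyed hash so values cannot be recovered by guessing.
    """
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


class DeterministicResolver:
    """Map strong identifiers to one stable key; never merge on conflict."""

    def __init__(self) -> None:
        self._index: dict[tuple[str, str], str] = {}
        self._counter = 1

    def resolve(self, loyalty_id: Optional[str] = None,
                email_hash: Optional[str] = None) -> str:
        identifiers = [("loyalty_id", loyalty_id), ("email_hash", email_hash)]
        strong = [(kind, value) for kind, value in identifiers if value]
        if not strong:
            raise ValueError("no strong identifier; keep the record unresolved")
        known = {self._index[i] for i in strong if i in self._index}
        if len(known) > 1:
            # Two existing customers would be merged: route to human review.
            raise ValueError("identifiers point to different customers")
        if known:
            key = next(iter(known))
        else:
            key = f"rck_{self._counter:06d}"
            self._counter += 1
        for identifier in strong:
            self._index[identifier] = key
        return key


resolver = DeterministicResolver()
email = hash_identifier(" Ana@Example.com ")
print(resolver.resolve(loyalty_id="L-1", email_hash=email))
print(resolver.resolve(email_hash=email))  # same customer, seen by email only
```

The design choice worth noticing is the conflict branch. When two identifiers already belong to different keys, the sketch refuses to merge automatically, which trades some coverage for protection against merging two real people.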
| Profile field group | Example gold fields | Primary consumer | Modeling rule |
|---|---|---|---|
| Identity and governance | customer_id, marketing_consent, personalization_consent. | All downstream consumers. | Consent must be current and easy to filter before activation. |
| Commercial value | order_count, total_revenue_usd, avg_order_value_usd, days_since_last_order. | Retention, executive reporting, personalization. | Use paid orders unless the business explicitly defines net revenue with returns. |
| Behavioral engagement | click_events_45d, sessions_45d, open_rate, click_rate. | Personalization and marketing. | Window definitions should be documented and stable. |
| Service experience | support_tickets_120d, avg_csat, avg_resolution_hours. | Support and retention. | Service frustration should influence customer treatment. |
| Decision features | customer_value_score, churn_risk_score, segment, retention_priority. | Retention workflows and analytics. | Scores must be explainable before they automate customer actions. |
21.5 Data Products, KPIs, and Operating Metrics
A Customer 360 platform should not be judged by the number of systems connected to it. It should be judged by whether its products are complete, trusted, timely, governed, and useful. The Chapter 21 project produces three gold outputs: customer_360.csv, segment_summary.csv, and kpi_summary.csv. It also produces a validation report that behaves like a miniature control plane.
Figure 4: Customer 360 data products and KPIs: bronze evidence, silver identities and profiles, gold serving tables, segment summaries, operating checks, and business activation metrics.
The deterministic sample dataset contains 120 customers, 424 paid orders, $28,142.12 in gross revenue, a 1.0000 profile coverage rate, a 0.8083 marketing consent rate, and 3 urgent retention customers. These numbers are intentionally small enough to inspect manually while still showing the structure of a production operating review.
| Metric from the capstone project | Value | Interpretation |
|---|---|---|
| Customers | 120 | The generated source population used by the capstone. |
| Paid orders | 424 | Orders included in commercial value calculations. |
| Gross revenue | $28,142.12 | Sum of paid order amounts in the deterministic sample. |
| Profile coverage rate | 1.0000 | Every source customer appears in the gold profile table. |
| Marketing consent rate | 0.8083 | About 19% of customers lack marketing consent and must be filtered out before marketing activation. |
| Urgent retention customers | 3 | A small high-priority group receives immediate attention. |
The quality report turns assumptions into observable controls. A green orchestration run does not prove that Customer 360 is safe. The platform must prove that it produced one profile per customer, avoided critical nulls, met freshness thresholds, and carried consent fields. A production implementation should emit these checks to observability systems, connect severe failures to release gates, and make warnings visible to data product owners.
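The checks named above can be expressed as small, testable functions. The sketch below is an illustration under stated assumptions. The metric names match the validation report, but the exact definitions used by `scripts/validate_outputs.py` may differ, and `profile_checks` is a hypothetical helper.

```python
def profile_checks(source_ids, profiles, critical_fields):
    """Compute coverage, duplicate, and critical-null metrics for a profile table."""
    profile_ids = [profile["customer_id"] for profile in profiles]
    distinct_sources = set(source_ids)
    covered = distinct_sources & set(profile_ids)
    coverage = len(covered) / len(distinct_sources) if distinct_sources else 0.0
    # Every extra row beyond one per customer counts as a duplicate.
    duplicates = len(profile_ids) - len(set(profile_ids))
    cells = len(profiles) * len(critical_fields)
    nulls = sum(
        1
        for profile in profiles
        for field in critical_fields
        if profile.get(field) in (None, "")
    )
    return {
        "profile_coverage_rate": round(coverage, 4),
        "duplicate_profile_count": duplicates,
        "critical_null_rate": round(nulls / cells, 4) if cells else 0.0,
    }


sample_profiles = [
    {"customer_id": "c1", "segment": "loyal", "marketing_consent": "true"},
    {"customer_id": "c2", "segment": "", "marketing_consent": "false"},
    {"customer_id": "c2", "segment": "at_risk", "marketing_consent": "true"},
]
report = profile_checks(["c1", "c2", "c3"], sample_profiles, ["segment", "marketing_consent"])
print(report)
```

The sample deliberately contains a missing customer, a duplicate profile, and an empty critical field, so all three metrics move away from their passing values. A pipeline that produced this table could still finish with a green status, which is the gap these checks are meant to close.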
21.6 Guided Lab: Run the Shared Real-Time Customer 360 Project
The runnable project is located in shared/projects/ch21_realtime_customer360/. It does not require Docker, Kafka, Flink, or external Python packages. That simplicity is intentional: the lab focuses on data-product thinking, contract validation, and layer responsibilities before introducing heavy infrastructure. The directory structure mirrors a production platform even though the implementation uses local files.
Figure 5: Architecture of the shared Chapter 21 project: deterministic source generation, bronze copies, silver identity and profile tables, gold Customer 360 outputs, and a validation report.
| Lab material | Link |
|---|---|
| Project README | README.md |
| Source generator | generate_data.py |
| Customer 360 builder | build |
| Validator | validate_outputs.py |
| Sample raw data folder | |
| Expected gold output folder | |
| Validation report | customer360 |
| Extension exercises | README.md |
| Instructor solution | solution.md |
From the repository root, run the project with the following commands.

    cd shared/projects/ch21_realtime_customer360
    python3 scripts/generate_data.py
    python3 scripts/build_customer360.py
    python3 scripts/validate_outputs.py

A successful run should produce output similar to the following.

    Generated source data in .../shared/projects/ch21_realtime_customer360/data/raw
    Built Customer 360 outputs in .../shared/projects/ch21_realtime_customer360/data
    Wrote validation report to .../shared/projects/ch21_realtime_customer360/reports

After the scripts complete, inspect the generated files.

    find data reports -maxdepth 2 -type f | sort

The most important outputs are the gold serving profile and the quality report. The gold table is the profile data product. The report is the operating evidence that the profile is safe enough to inspect and discuss.
| Output file | Layer | Purpose |
|---|---|---|
| data/raw/*.csv | Source landing | Deterministic customer, order, clickstream, ticket, campaign, and consent inputs. |
| data/bronze/*_bronze.csv | Bronze | Source-shaped evidence copied into a durable landing layer. |
| data/silver/customer_identity.csv | Silver | Resolved customer keys and identity attributes. |
| data/silver/customer_profile_base.csv | Silver | Cleaned customer profile base with selected measures. |
| data/gold/customer_360.csv | Gold | Serving-ready unified profile used by applications and analysts. |
| data/gold/segment_summary.csv | Gold | Segment-level revenue, risk, and value aggregates. |
| data/gold/kpi_summary.csv | Gold | Executive and operating metrics. |
| reports/customer360_quality_report.md | Control plane | Validation checks for coverage, uniqueness, nulls, freshness, and consent readiness. |
The validation report should include rows similar to the following.

    profile_coverage_rate 1.0 pass
    duplicate_profile_count 0 pass
    critical_null_rate 0.0 pass
    orders_freshness_lag_hours 5.13 pass
    clickstream_freshness_lag_hours 0.41 pass
    marketing_consent_rate 0.8083 info

These values are not universal production thresholds. They are a reproducible sample that teaches how to express service levels. A real website personalization service may require sub-minute feature freshness, while a weekly executive segment report may tolerate a daily refresh. The important lesson is that every consumer should know which profile version it is using and what freshness promise it receives.
If the build script fails with a missing-file error, run python3 scripts/generate_data.py first. If validation reports duplicate profiles, inspect data/silver/customer_identity.csv and verify that every customer has exactly one resolved key. If freshness warnings appear after changing sample dates, update the deterministic clock in the scripts or document why the sample intentionally represents stale data. If consent-based activation produces surprising counts, inspect data/raw/consent_events.csv and confirm that the latest consent event is used.
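When debugging consent counts, it helps to have a reference for what "the latest consent event is used" means. The sketch below is an assumption-laden illustration. It presumes each consent event carries `customer_id`, `event_time` as an ISO-8601 string in one timezone, and `marketing_consent`. Only `event_time` is confirmed by the source contract table, so check the actual CSV header before reusing this.

```python
def latest_consent(events):
    """Keep, per customer, the consent event with the greatest event_time.

    ISO-8601 strings in a single timezone sort correctly as text, so no
    parsing is needed. On an exact tie the first event seen is kept.
    """
    latest = {}
    for event in events:
        customer_id = event["customer_id"]
        current = latest.get(customer_id)
        if current is None or event["event_time"] > current["event_time"]:
            latest[customer_id] = event
    return latest


consent_events = [
    {"customer_id": "c1", "event_time": "2025-01-01T10:00:00Z", "marketing_consent": "true"},
    {"customer_id": "c1", "event_time": "2025-01-03T09:00:00Z", "marketing_consent": "false"},
    {"customer_id": "c2", "event_time": "2025-01-02T08:00:00Z", "marketing_consent": "true"},
    # Arrives last in the file but is older than the withdrawal above.
    {"customer_id": "c1", "event_time": "2025-01-02T12:00:00Z", "marketing_consent": "true"},
]
current = latest_consent(consent_events)
print({cid: event["marketing_consent"] for cid, event in current.items()})
```

Customer `c1` ends with consent withdrawn even though a later row in the file says otherwise, because selection is by event time and not by file order. Taking the last row per customer instead is a common source of the surprising counts mentioned above.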
21.7 Implementation Details and Design Trade-offs
The capstone calculates customer value and churn risk with transparent rules instead of a black-box model. Recency, frequency, and monetary value become customer_value_score. Churn risk combines days since last order, recent engagement, support ticket volume, and campaign engagement. The resulting segment and retention_priority fields are easy to explain to a product manager, support lead, or reviewer.
This rule-based approach is not a production-grade churn model. A production model would need training labels, validation, drift monitoring, feature lineage, controlled experiments, fairness review, and rollback. The rule-based approach is still useful because it creates a baseline feature contract. A data science team can later replace the scoring logic while preserving the serving interface.
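To show what a transparent scoring rule can look like, here is a minimal sketch. The weights, caps, and thresholds are invented for illustration and are not the values used by `scripts/build_customer360.py`. Only the input and output field names follow the gold profile described in this chapter.

```python
def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Limit a value to the range [low, high]."""
    return max(low, min(high, value))


def customer_value_score(days_since_last_order, order_count, total_revenue_usd):
    """Recency, frequency, and monetary value on a 0-100 scale (assumed weights)."""
    recency = clamp(1 - days_since_last_order / 180)
    frequency = clamp(order_count / 10)
    monetary = clamp(total_revenue_usd / 500)
    return round(40 * recency + 30 * frequency + 30 * monetary, 1)


def churn_risk_score(days_since_last_order, click_events_45d,
                     support_tickets_120d, open_rate):
    """Inactivity, disengagement, friction, and campaign silence (assumed weights)."""
    inactivity = clamp(days_since_last_order / 120)
    disengagement = 1 - clamp(click_events_45d / 20)
    friction = clamp(support_tickets_120d / 3)
    campaign_silence = 1 - clamp(open_rate)
    return round(40 * inactivity + 25 * disengagement
                 + 20 * friction + 15 * campaign_silence, 1)


def retention_priority(value_score: float, churn_score: float) -> str:
    """Hypothetical thresholds: valuable customers at high risk come first."""
    if churn_score >= 70 and value_score >= 60:
        return "urgent"
    if churn_score >= 70:
        return "monitor"
    return "normal"


value = customer_value_score(days_since_last_order=90, order_count=8, total_revenue_usd=600)
risk = churn_risk_score(days_since_last_order=90, click_events_45d=2,
                        support_tickets_120d=3, open_rate=0.1)
print(value, risk, retention_priority(value, risk))
```

Because every term is a bounded, named ratio, a reviewer can trace any score back to its inputs. A learned model could later replace these functions behind the same three output fields.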
| Design decision | Capstone choice | Production alternative | Trade-off |
|---|---|---|---|
| Batch versus streaming | File-based deterministic simulation. | Kafka, Debezium, Flink, Spark Structured Streaming. | The lab is easy to run; production streams support fresher state. |
| Storage model | CSV files organized as bronze, silver, and gold. | Delta Lake, Iceberg, Hudi, MaxCompute, BigQuery, Snowflake. | CSV is transparent; lakehouse table formats and managed warehouses add transactions, schema evolution, compaction, and query optimization. |
| Serving model | Gold CSV as the serving artifact. | Redis, PostgreSQL, Hologres, online feature stores, profile APIs. | CSV proves the contract; online stores provide predictable lookup latency. |
| Identity resolution | Deterministic key assignment. | Identity graph with confidence scoring and stewardship. | Deterministic rules are safe for teaching; graph methods improve coverage with higher governance cost. |
| Scoring | Explainable rules. | ML churn, next-best-action, uplift, or recommendation models. | Rules are auditable; ML can improve performance but requires monitoring and experimentation. |
| Governance | Consent fields carried in the profile. | Central policy engine, attribute-based access control, privacy workflows. | The lab teaches the minimum safe pattern; enterprise systems need stronger controls. |
A common architectural debate is Lambda Architecture versus Kappa Architecture. Lambda runs separate batch and speed layers and merges their results at serving time; Kappa keeps a single streaming path, treats the event log as the source of truth, and recomputes state by replaying events. TuranMart’s capstone is closer to a hybrid medallion pattern: historical tables provide completeness, while streams provide freshness. This is often practical because not every source emits clean events and not every consumer needs second-level latency.
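The replay idea behind Kappa can be illustrated with a fold over an event log. This is a toy sketch under assumptions. The event shape (`customer_id`, `type`, `amount_usd`, `event_time`, and a log `offset` used to break ties) is hypothetical, and real logs would be read from a broker, not a list.

```python
def replay(events):
    """Rebuild per-customer order state from scratch by folding over the log."""
    state = {}
    # A fixed ordering makes the rebuilt state independent of read order.
    for event in sorted(events, key=lambda e: (e["event_time"], e["offset"])):
        profile = state.setdefault(
            event["customer_id"], {"order_count": 0, "total_revenue_usd": 0.0}
        )
        if event["type"] == "order_paid":
            profile["order_count"] += 1
            profile["total_revenue_usd"] += event["amount_usd"]
    return state


log = [
    {"offset": 0, "event_time": 100, "customer_id": "c1", "type": "order_paid", "amount_usd": 20.0},
    {"offset": 1, "event_time": 105, "customer_id": "c2", "type": "page_view", "amount_usd": 0.0},
    {"offset": 2, "event_time": 110, "customer_id": "c1", "type": "order_paid", "amount_usd": 15.5},
]
print(replay(log))
print(replay(list(reversed(log))) == replay(log))
```

Because the state is a pure function of the log, a corrected scoring rule or a fixed bug can be rolled out by replaying the log, with no separate batch correction job. The cost is that the log must be retained long enough and the replay must finish within an acceptable window.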
21.8 Common Pitfalls and Operational Lessons
The first pitfall is building a Customer 360 table without a product owner. Many teams want the profile, but one accountable owner must define columns, quality thresholds, access rules, service levels, and the roadmap. Without ownership, every consumer adds fields, definitions drift, and no one knows which table should be trusted.
The second pitfall is confusing data integration with identity resolution. Joining everything on email or customer_id may work in a demo, but real customers change emails, share devices, use guest checkout, contact support through different channels, and request deletion. Identity resolution is a governed subsystem, not an accidental join.
The third pitfall is treating consent as a campaign-layer filter. Consent and suppression attributes should travel with the profile and should be included in quality checks. Downstream teams should not have to remember to join a separate consent table before activation.
The fourth pitfall is serving a profile that is too wide. A warehouse table may contain hundreds of columns, but an operational application needs a compact payload with stable names, versioned semantics, and predictable latency. Use separate analytical and online-serving views when necessary.
The fifth pitfall is measuring pipeline success but not product success. A scheduled job can be green while the profile is stale, duplicated, missing consent, or unhelpful to consumers. Customer 360 operations should measure coverage, uniqueness, nulls, schema changes, event-time lag, processing-time lag, segment distribution, activation outcomes, and business impact.
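Event-time lag and processing-time lag answer different questions, and a short sketch can make the distinction concrete. The definitions used here are one reasonable reading, stated as assumptions: event-time lag is the gap between the check time and the newest event reflected in the product, and processing-time lag is the gap between ingestion and the gold write. The timestamps are invented.

```python
from datetime import datetime, timezone


def lag_hours(later: datetime, earlier: datetime) -> float:
    """Return the gap between two timestamps in hours, rounded to two places."""
    return round((later - earlier).total_seconds() / 3600, 2)


def utc(hour: int, minute: int) -> datetime:
    """Build an illustrative UTC timestamp on a fixed sample day."""
    return datetime(2025, 1, 1, hour, minute, tzinfo=timezone.utc)


newest_event_time = utc(9, 30)   # latest event_time present in the gold table
ingested_at = utc(9, 45)         # when that event landed in bronze
gold_written_at = utc(10, 15)    # when the gold table was rebuilt
checked_at = utc(12, 0)          # when the validator ran

event_time_lag = lag_hours(checked_at, newest_event_time)
processing_time_lag = lag_hours(gold_written_at, ingested_at)
print(event_time_lag, processing_time_lag)
```

In this example the pipeline itself is quick, at half an hour from ingestion to gold, yet the profile is two and a half hours behind reality when checked. Measuring only the first number would hide the staleness that consumers actually experience.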
21.9 Exercises
The exercises below increase in difficulty. They should be completed after the base project runs successfully and the validation report is generated.
| Difficulty | Exercise | Expected evidence |
|---|---|---|
| Easy | Add a returns.csv source and compute net_revenue_usd so returned orders reduce value. | Updated source data, builder logic, and a short explanation of how returns affect segment assignment. |
| Easy | Add a serving-contract check that verifies required columns exist in customer_360.csv. | Validator fails if customer_id, consent fields, scores, segment, or retention_priority are missing. |
| Medium | Create data/gold/activation_candidates.csv containing only customers with both marketing and personalization consent. | Export file plus proof that high-value customers without consent are excluded. |
| Medium | Add a late-arriving order scenario and explain whether the pipeline processes by event time, ingestion time, or deterministic replay snapshot. | Modified data plus a short note about watermarks and replay in production. |
| Challenge | Replace the deterministic identity rule with a small identity graph that maps hashed identifiers to resolved_customer_key. | New identity artifact plus one example where automatic merging should be rejected. |
| Team | Write a production migration plan for replacing local CSVs with CDC, event streams, stream processing, lakehouse tables, serving APIs, and monitoring. | One-page design with a freshness SLO, correctness SLO, rollback plan, and ownership model. |
21.10 Review Questions
1. Why is Customer 360 better described as a governed data product than as a single table?
2. Which fields in the capstone profile should be considered governance-critical, and why?
3. How do CDC and event streaming complement each other in a real-time profile platform?
4. What is the difference between a silver customer entity and a gold serving profile?
5. Why can profile coverage be high while activation is still unsafe?
6. How should a production team handle late-arriving clickstream or order events?
7. What risks appear when identity resolution is too conservative or too aggressive?
8. Which validation checks would you add before connecting the profile to a website personalization API?
Chapter Summary and Next Step
This chapter integrated the book’s core data engineering patterns into a realistic Customer 360 capstone. You translated TuranMart’s business problem into source contracts, architecture planes, identity rules, lakehouse layers, serving outputs, and validation checks. You also ran a deterministic project that generated raw source data, bronze evidence, silver identity and profile artifacts, gold Customer 360 products, and an operating report.
The main lesson is that real-time Customer 360 is not just an ingestion problem. It is an ownership, governance, modeling, serving, and observability problem. Fresh events matter, but so do identity correctness, consent, replayability, and consumer-specific service levels. In the next chapter, you will apply the same end-to-end thinking to another high-stakes system: fraud detection combined with AI-powered search.