# Instructor Solution — Chapter 21 Real-Time Customer 360 Capstone

This instructor solution explains the intended design of `shared/projects/ch21_realtime_customer360` and how to interpret its validation output. The project is deliberately implemented with local CSV files and standard-library Python, but the conceptual target is a production Customer 360 data product built from CDC streams, behavioral events, support activity, consent changes, and lakehouse serving tables.

## Expected implementation flow

A correct run executes the three scripts in order from the project directory.

```bash
python3 scripts/generate_data.py
python3 scripts/build_customer360.py
python3 scripts/validate_outputs.py
```

The first script produces deterministic source-shaped data under `data/raw/`. The second script copies raw records to bronze, builds silver identity and profile-base artifacts, and writes gold serving marts. The third script evaluates whether the gold profile is complete, free of duplicate profiles, within the critical-null threshold, and fresh enough for the sample scenario.

## Production interpretation

The capstone should be assessed as a data-product design exercise, not merely as a CSV transformation exercise. The important outcome is that readers can explain how local artifacts map to production building blocks.

| Project artifact | Intended production analogue | Instructor notes |
|---|---|---|
| `data/raw/customers.csv` and `data/raw/orders.csv` | Operational database snapshots or CDC topics | In production, these records should be captured as ordered changes with primary keys, operation type, transaction metadata, and source timestamps. |
| `data/raw/clickstream.csv` | Behavioral event topic | The clickstream is append-only and should be processed with event-time semantics. |
| `data/raw/consent_events.csv` | Consent and preference event stream | Consent must be treated as a governance input, not as an optional marketing attribute. |
| `data/bronze/*_bronze.csv` | Raw lakehouse bronze tables | Bronze should preserve source fidelity and enable replay. |
| `data/silver/customer_identity.csv` | Identity-resolution table or graph | The sample uses deterministic rules; large systems often combine deterministic, probabilistic, and steward-reviewed matches. |
| `data/silver/customer_profile_base.csv` | Cleaned and conformed profile entity | This is the auditable customer entity before business-specific serving logic. |
| `data/gold/customer_360.csv` | Serving mart, feature table, or profile API backing table | Gold is optimized for activation and analytics, so it contains consent, value, churn, support, and segment fields. |
| `reports/customer360_quality_report.md` | Data product SLO report | A production version should be emitted to observability systems and tied to release gates. |
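
The first row of the table describes CDC records as ordered changes with primary keys, operation types, and source timestamps. The following is a minimal sketch of that idea, not the project's implementation: the `ChangeEvent` envelope, its field names, and the `c`/`u`/`d` operation codes are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    # Hypothetical CDC envelope; field names are illustrative only.
    table: str
    primary_key: str
    op: str                 # "c" = create, "u" = update, "d" = delete
    source_ts: str          # ISO-8601 commit timestamp from the source
    tx_sequence: int        # position in the source's commit order
    after: Optional[dict]   # row image after the change; None for deletes


def apply_changes(events):
    """Replay changes in commit order and return the current row per key."""
    current = {}
    for ev in sorted(events, key=lambda e: e.tx_sequence):
        key = (ev.table, ev.primary_key)
        if ev.op == "d":
            current.pop(key, None)   # a delete removes the row
        else:
            current[key] = ev.after  # create and update both upsert
    return current


events = [
    ChangeEvent("customers", "C001", "c", "2024-01-01T10:00:00", 1,
                {"email": "a@example.com"}),
    ChangeEvent("customers", "C002", "c", "2024-01-01T10:05:00", 2,
                {"email": "b@example.com"}),
    ChangeEvent("customers", "C001", "u", "2024-01-02T09:00:00", 3,
                {"email": "a.new@example.com"}),
    ChangeEvent("customers", "C002", "d", "2024-01-03T08:00:00", 4, None),
]
snapshot = apply_changes(events)
print(snapshot)
```

Because the full change log is retained, the same replay can rebuild the current state at any time, which is the property a static CSV snapshot cannot provide.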

## Expected quality interpretation

The validation report should show `pass` for `profile_coverage_rate`, `duplicate_profile_count`, and `critical_null_rate`. Freshness checks are deliberately lenient because the sample project uses deterministic dates. A student who changes event dates should either preserve the sample thresholds or explain the resulting warnings.
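
For discussion purposes, the three pass/fail checks could be computed roughly as follows. This is a sketch under stated assumptions, not the logic in `scripts/validate_outputs.py`: the function name, the `customer_id` column, the critical field list, and the threshold values are all hypothetical.

```python
def evaluate_profile_quality(source_ids, profiles, critical_fields,
                             min_coverage=0.99, max_null_rate=0.01):
    """Return {metric_name: (value, status)} for three profile checks.

    Thresholds are illustrative defaults, not the project's settings.
    """
    expected = set(source_ids)
    profile_ids = [p["customer_id"] for p in profiles]
    unique_ids = set(profile_ids)

    # Share of source customers that have at least one gold profile.
    coverage = len(unique_ids & expected) / len(expected) if expected else 0.0
    # Extra rows beyond one profile per customer.
    duplicates = len(profile_ids) - len(unique_ids)
    # Empty or whitespace-only strings count as null, as CSV has no None.
    cells = len(profiles) * len(critical_fields)
    nulls = sum(1 for p in profiles for f in critical_fields
                if not (p.get(f) or "").strip())
    null_rate = nulls / cells if cells else 0.0

    def status(ok):
        return "pass" if ok else "fail"

    return {
        "profile_coverage_rate": (coverage, status(coverage >= min_coverage)),
        "duplicate_profile_count": (duplicates, status(duplicates == 0)),
        "critical_null_rate": (null_rate, status(null_rate <= max_null_rate)),
    }


good = evaluate_profile_quality(
    ["C001", "C002"],
    [{"customer_id": "C001", "email": "a@example.com", "country": "DE"},
     {"customer_id": "C002", "email": "b@example.com", "country": "FR"}],
    ["email", "country"],
)
bad = evaluate_profile_quality(
    ["C001", "C002"],
    [{"customer_id": "C001", "email": "a@example.com"},
     {"customer_id": "C001", "email": ""}],
    ["email"],
)
print(good)
print(bad)
```

The `bad` case fails all three checks at once: one source customer has no profile, one customer has two rows, and half of the critical cells are empty.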

The `marketing_consent_rate` row is informational. It should not be treated as a performance metric to optimize, because raising marketing reach by ignoring consent would be a governance failure. The correct interpretation is that activation must filter or suppress records based on current consent state.
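
One way to illustrate suppression by current consent state is shown below. It is a sketch only: the column names `customer_id`, `event_ts`, and `marketing_consent`, the `granted`/`revoked` values, and both function names are assumptions, and timestamps are assumed to share one ISO-8601 format so that string comparison matches time order.

```python
def current_consent(consent_events):
    """Return the latest marketing-consent flag per customer by event time."""
    latest = {}
    for ev in consent_events:
        cid = ev["customer_id"]
        if cid not in latest or ev["event_ts"] > latest[cid]["event_ts"]:
            latest[cid] = ev
    return {cid: ev["marketing_consent"] == "granted"
            for cid, ev in latest.items()}


def activation_candidates(profiles, consent_by_customer):
    """Keep only profiles whose current consent is granted.

    Customers with no consent record are suppressed (default deny).
    """
    return [p for p in profiles
            if consent_by_customer.get(p["customer_id"], False)]


consent_events = [
    {"customer_id": "C001", "event_ts": "2024-01-01T09:00:00",
     "marketing_consent": "granted"},
    {"customer_id": "C002", "event_ts": "2024-01-02T09:00:00",
     "marketing_consent": "granted"},
    {"customer_id": "C001", "event_ts": "2024-02-01T09:00:00",
     "marketing_consent": "revoked"},
]
profiles = [{"customer_id": "C001"}, {"customer_id": "C002"},
            {"customer_id": "C003"}]

consent = current_consent(consent_events)
candidates = activation_candidates(profiles, consent)
print([p["customer_id"] for p in candidates])
```

In this example `C001` is suppressed because the most recent event revokes consent, and `C003` is suppressed because no consent was ever recorded.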

## Reference answer for design discussion

A strong production design would ingest operational changes with CDC, publish customer, order, consent, campaign, support, and clickstream events to partitioned topics, and process those topics with a stateful stream processor. The processor would maintain profile state by `customer_id` or resolved identity key, apply event-time handling for late events, and write both replayable lakehouse tables and low-latency serving stores.
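
The keyed state and late-event handling described above can be sketched in plain Python. This is a teaching illustration rather than a stream-processor API: the `ProfileStateStore` class, the `event_ts` field, and the one-hour allowed lateness are assumptions, and a real processor would persist and checkpoint this state.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class ProfileStateStore:
    """Per-customer daily event counts, bucketed by event time."""

    def __init__(self, allowed_lateness=timedelta(hours=1)):
        self.allowed_lateness = allowed_lateness
        self.max_event_ts = None              # highest event time seen so far
        self.daily_counts = defaultdict(int)  # (customer_id, date) -> count
        self.too_late = []                    # held back for batch replay

    def process(self, event):
        event_ts = datetime.fromisoformat(event["event_ts"])
        # Events older than the lateness bound are not applied to state.
        if (self.max_event_ts is not None
                and event_ts < self.max_event_ts - self.allowed_lateness):
            self.too_late.append(event)
            return
        # Bucket by when the event happened, not by when it arrived.
        self.daily_counts[(event["customer_id"], event_ts.date())] += 1
        if self.max_event_ts is None or event_ts > self.max_event_ts:
            self.max_event_ts = event_ts


store = ProfileStateStore()
arrivals = [  # listed in arrival order, not event-time order
    {"customer_id": "C001", "event_ts": "2024-03-01T23:50:00"},
    {"customer_id": "C001", "event_ts": "2024-03-02T00:10:00"},
    {"customer_id": "C001", "event_ts": "2024-03-01T23:55:00"},  # 15 min late
    {"customer_id": "C001", "event_ts": "2024-03-01T20:00:00"},  # too late
]
for ev in arrivals:
    store.process(ev)
print(dict(store.daily_counts), len(store.too_late))
```

The third event arrives after midnight but is still counted in the 1 March bucket, while the fourth falls outside the lateness bound and is set aside, which is where the replayable lakehouse tables become the correction path.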

The lakehouse should preserve the bronze/silver/gold separation. Bronze tables keep raw source fidelity. Silver tables enforce schemas, deduplicate records, resolve identities, and attach governance metadata. Gold tables expose business-ready Customer 360 views, activation candidates, segment summaries, and ML-ready features. The serving plane can use a key-value store or profile API when operational applications need millisecond lookups.
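
To make the bronze-to-silver step concrete, the sketch below enforces a small schema and deduplicates by key while leaving the bronze rows untouched. The `REQUIRED_FIELDS` tuple, the column names, and the "latest `updated_at` wins" rule are hypothetical choices for illustration, not the project's actual silver logic.

```python
REQUIRED_FIELDS = ("customer_id", "email", "updated_at")  # hypothetical schema


def bronze_to_silver(bronze_rows):
    """Return (silver_rows, rejected_rows) without mutating bronze input."""
    latest = {}
    rejected = []
    for row in bronze_rows:
        # Schema enforcement: every required field must be non-empty.
        if any(not (row.get(f) or "").strip() for f in REQUIRED_FIELDS):
            rejected.append(row)
            continue
        cleaned = {f: row[f].strip() for f in REQUIRED_FIELDS}
        cleaned["email"] = cleaned["email"].lower()
        # Deduplication: keep the most recently updated row per customer.
        key = cleaned["customer_id"]
        if key not in latest or cleaned["updated_at"] > latest[key]["updated_at"]:
            latest[key] = cleaned
    silver = sorted(latest.values(), key=lambda r: r["customer_id"])
    return silver, rejected


bronze = [
    {"customer_id": "C001", "email": "A@Example.com ",
     "updated_at": "2024-01-01T00:00:00"},
    {"customer_id": "C001", "email": "a.new@example.com",
     "updated_at": "2024-02-01T00:00:00"},
    {"customer_id": "C002", "email": "",
     "updated_at": "2024-01-05T00:00:00"},
]
silver, rejected = bronze_to_silver(bronze)
print(silver)
print(rejected)
```

Rejected rows are returned rather than dropped so that they can be reported and reprocessed; bronze keeps all three input rows regardless of what silver accepts.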

## Common mistakes

The most common mistake is to treat Customer 360 as a single denormalized table without specifying ownership, consent, identity rules, freshness, or quality thresholds. Another common mistake is to aggregate behavior without event time, which makes late-arriving data difficult to correct. A third mistake is to expose profiles directly to activation systems without a serving contract or a validation report.

## Grading guidance

| Criterion | Strong evidence | Weak evidence |
|---|---|---|
| Architecture mapping | Clearly maps local artifacts to CDC, topics, stream processors, lakehouse layers, and serving APIs. | Names technologies without explaining how state and replay work. |
| Identity handling | Explains deterministic identity rules and limitations. | Assumes all systems already share one perfect customer key. |
| Consent handling | Treats consent as a blocking governance field for activation. | Optimizes campaign reach without preserving current consent. |
| Quality controls | Interprets coverage, duplicates, critical nulls, and freshness thresholds. | Only checks that the scripts run. |
| Production readiness | Discusses late events, replay, schema changes, rollback, and monitoring. | Presents the CSV pipeline as though it were production-ready by itself. |
