# Chapter 2 Solution Guide: Data Models, Formats, and Quality

This guide gives a reference interpretation for the Chapter 2 exercises. It is intentionally separated from the starter lab so instructors can assign the exercises without exposing the answers immediately.

## Reference Answers

| Exercise | Reference solution |
|---|---|
| Add `returned` status | `returned` should be added to the valid status set only if the business process treats returns as a post-payment lifecycle state. It should not be mixed with `cancelled`, because cancellation usually occurs before fulfillment while return occurs after delivery. |
| Duplicate `order_id` | The uniqueness check should fail because `orders['order_id'].is_unique` becomes `False`. The correct response is to quarantine or deduplicate records using a deterministic rule, not silently aggregate duplicates. |
| `dim_campaign` | A minimal dimension includes `campaign_key`, `campaign_id`, `campaign_channel`, and `utm_source`. In a production warehouse, it should also include effective dates if campaign metadata changes over time. |
| CSV vs Parquet benchmark | Parquet usually becomes more efficient as row counts grow because typed columnar data compresses repeated values well and allows analytical engines to scan selected columns. The exact ratio depends on cardinality and compression settings. |
| Slowly changing customer dimension | A type-2 dimension should include `customer_key`, `customer_id`, descriptive attributes such as `segment`, `effective_from`, `effective_to`, and `is_current`. Facts should join to the correct historical dimension row using event time. |

## Reference DuckDB Query for Campaign Revenue

```sql
WITH event_campaign AS (
    SELECT customer_id, campaign_id, campaign_channel
    FROM read_parquet('shared/labs/ch02_data_models_formats_quality/output/parquet/events.parquet')
), paid_orders AS (
    SELECT customer_id, campaign_id, total_amount
    FROM read_parquet('shared/labs/ch02_data_models_formats_quality/output/parquet/orders.parquet')
    WHERE payment_status = 'paid'
)
SELECT
    COALESCE(e.campaign_channel, 'unknown') AS campaign_channel,
    ROUND(SUM(o.total_amount), 2) AS paid_revenue
FROM paid_orders o
LEFT JOIN event_campaign e
  ON o.customer_id = e.customer_id
 AND o.campaign_id = e.campaign_id
GROUP BY 1
ORDER BY 2 DESC;
```

## Quality Checklist

A complete solution should preserve the original raw data, make schema and type assumptions explicit, fail visibly when critical assumptions are violated, and explain how the curated Parquet output supports the star schema introduced in the chapter.
