# Chapter 9 Lab: Streaming Clickstream Windows

This lab simulates a small TuranMart clickstream pipeline without requiring Kafka or Flink on your laptop. The goal is to make **event time**, **watermarks**, **allowed lateness**, and **deterministic replay** visible with a file-based simulator that you can inspect line by line.

The simulator reads JSONL events sorted by arrival time, tracks the maximum event time observed so far, and subtracts the allowed-lateness interval from it to form a watermark. It writes one row per tumbling event-time window. Events whose event time is older than the watermark are written to the optional late-event output instead of changing the window metrics.
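The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `streaming_window_simulator.py`: the function name `simulate`, the `(arrival_time, event_time)` tuple layout, the per-window event count as the only metric, and the choice to advance the watermark before checking each event are all assumptions made for the example.

```python
from collections import defaultdict


def simulate(events, window_seconds, allowed_lateness_seconds):
    """Hypothetical sketch of event-time windowing with a watermark.

    `events` is an iterable of (arrival_time, event_time) pairs in epoch
    seconds, already sorted by arrival time. The real simulator's field
    names and metrics may differ.
    """
    max_event_time = None
    counts = defaultdict(int)  # window start -> number of on-time events
    late = []                  # events that fell behind the watermark

    for arrival_time, event_time in events:
        # Assumption: the watermark advances before the event is checked.
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness_seconds

        if event_time < watermark:
            # Too old: route to the late output, leave metrics untouched.
            late.append((arrival_time, event_time))
            continue

        # Tumbling window: align the event time down to a window boundary.
        window_start = event_time - (event_time % window_seconds)
        counts[window_start] += 1

    return dict(sorted(counts.items())), late


if __name__ == "__main__":
    sample = [(100, 95), (101, 130), (102, 105), (103, 125), (104, 58)]
    print(simulate(sample, window_seconds=60, allowed_lateness_seconds=20))
```

With the five sample events, the run prints `({60: 1, 120: 2}, [(102, 105), (104, 58)])`: once event time 130 arrives the watermark is 110, so the events stamped 105 and 58 are late while 125 still counts toward the window starting at 120.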

| Asset | Purpose |
|---|---|
| `data/clickstream_events.jsonl` | Deterministic sample events with arrival time and event time. |
| `streaming_window_simulator.py` | Dependency-free event-time window simulator. |
| `expected_output/lateness_20_metrics.csv` | Expected metrics for a 60-second window and 20-second allowed lateness. |
| `expected_output/lateness_20_late_events.csv` | Expected late-event side output for the same run. |
| `expected_output/lateness_0_metrics.csv` | Comparison output when no lateness is allowed. |
| `validate_outputs.py` | Exact CSV validator for deterministic grading. |

## Run the lab

From the repository root, run the simulator with a 60-second window and 20 seconds of allowed lateness, the default configuration for Chapter 9.

```bash
python3 shared/labs/ch09_streaming_clickstream/streaming_window_simulator.py \
  --input shared/labs/ch09_streaming_clickstream/data/clickstream_events.jsonl \
  --window-seconds 60 \
  --allowed-lateness-seconds 20 \
  --output /tmp/ch09_window_metrics.csv \
  --late-output /tmp/ch09_late_events.csv
```

Then validate the deterministic metrics.

```bash
python3 shared/labs/ch09_streaming_clickstream/validate_outputs.py \
  --actual /tmp/ch09_window_metrics.csv \
  --expected shared/labs/ch09_streaming_clickstream/expected_output/lateness_20_metrics.csv
```

You can also validate the late-event side output.

```bash
python3 shared/labs/ch09_streaming_clickstream/validate_outputs.py \
  --actual /tmp/ch09_late_events.csv \
  --expected shared/labs/ch09_streaming_clickstream/expected_output/lateness_20_late_events.csv
```
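To make "exact" validation concrete, the following sketch compares two CSV files row by row and reports the first difference. It is an illustration of the idea only and does not reproduce `validate_outputs.py`; the helper names `read_rows` and `first_mismatch` are hypothetical, and the real validator may normalize or report differences in another way.

```python
import csv
from itertools import zip_longest


def read_rows(path):
    """Load a CSV file as a list of rows, each a list of strings."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def first_mismatch(actual_rows, expected_rows):
    """Return (line_number, actual_row, expected_row) for the first
    difference, or None when the two row lists are identical.

    A missing row on either side is reported as None, so a file that is
    too short or too long also fails the comparison.
    """
    pairs = zip_longest(actual_rows, expected_rows)
    for line_number, (actual, expected) in enumerate(pairs, start=1):
        if actual != expected:
            return line_number, actual, expected
    return None
```

A caller would pass `read_rows(actual_path)` and `read_rows(expected_path)` to `first_mismatch` and treat any non-`None` result as a failed check. Because cells are compared as strings, `1.0` and `1.00` count as different, which is the kind of strictness deterministic grading relies on.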

## Explore the design

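Run the simulator again with `--allowed-lateness-seconds 0` and compare the results with `expected_output/lateness_0_metrics.csv`. With zero allowed lateness the watermark equals the maximum observed event time, so every event that arrives behind a newer one is treated as late and excluded from the window metrics. The same input therefore yields different published numbers under different settings, which demonstrates why production teams must agree on a lateness policy before publishing real-time metrics.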
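The effect of the lateness setting can be seen without the lab data by sweeping a few values over a toy sequence. The sketch below is an assumption-laden illustration: `count_late` is a hypothetical helper, the event times are invented for the example, and the watermark is taken to be the maximum event time seen so far minus the allowed lateness, updated before each event is checked.

```python
def count_late(event_times, allowed_lateness_seconds):
    """Count events that fall behind the watermark.

    `event_times` lists event times (epoch seconds) in arrival order.
    """
    max_event_time = float("-inf")
    late = 0
    for event_time in event_times:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness_seconds
        if event_time < watermark:
            late += 1
    return late


if __name__ == "__main__":
    # Invented sample: three events arrive after a newer event (130).
    event_times_in_arrival_order = [95, 130, 105, 125, 58]
    for lateness in (0, 20, 30, 80):
        print(lateness, count_late(event_times_in_arrival_order, lateness))
```

For this sequence the late counts are 3, 2, 1, and 0 for lateness values of 0, 20, 30, and 80 seconds. A larger allowance keeps more delayed events in the metrics, at the cost of waiting longer before a window's numbers can be considered final.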
