# Chapter 2 Lab: Data Models, Formats, and Quality

This lab supports Chapter 2 of **Data Engineering in Action**. It provides a small TuranMart dataset and a notebook that converts CSV and JSON Lines source files into validated Parquet outputs.

## Learning Goals

By completing the lab, you will practice reading raw CSV and JSON Lines files, applying explicit types, checking basic data quality expectations, writing Parquet datasets, and validating results with DuckDB SQL.
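
The first two goals can be sketched with pandas as follows. This is a minimal illustration, not the notebook's code: the column names, the inline sample rows, and the helper names (`load_orders`, `load_items`, `quality_report`) are assumptions made for the example, and the real files in `data/` may use a different schema.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real exports in data/ may have other columns.
ORDERS_CSV = """order_id,order_date,region,status
1001,2024-01-05,North,shipped
1002,2024-01-05,South,shipped
1003,2024-01-06,North,cancelled
"""
ITEMS_CSV = """order_id,sku,quantity,unit_price
1001,A-1,2,9.50
1002,B-7,1,20.00
1004,C-3,1,5.00
"""
EVENTS_JSONL = (
    '{"event_id": "e1", "event_type": "click", "campaign": "spring"}\n'
    '{"event_id": "e2", "event_type": "view"}\n'
)


def load_orders(source):
    """Read orders with explicit dtypes instead of relying on type inference."""
    return pd.read_csv(
        source,
        dtype={"order_id": "int64", "region": "string", "status": "string"},
        parse_dates=["order_date"],
    )


def load_items(source):
    """Read line items with explicit numeric types."""
    return pd.read_csv(
        source,
        dtype={
            "order_id": "int64",
            "sku": "string",
            "quantity": "int64",
            "unit_price": "float64",
        },
    )


def quality_report(orders, items):
    """Run a few basic expectations and label each one PASS or FAIL."""
    checks = {
        "order_id is unique": bool(orders["order_id"].is_unique),
        "quantity is positive": bool((items["quantity"] > 0).all()),
        "items reference known orders": bool(
            items["order_id"].isin(orders["order_id"]).all()
        ),
    }
    return {name: "PASS" if ok else "FAIL" for name, ok in checks.items()}


orders = load_orders(io.StringIO(ORDERS_CSV))
items = load_items(io.StringIO(ITEMS_CSV))
# lines=True tells pandas to parse one JSON object per line (JSON Lines).
events = pd.read_json(io.StringIO(EVENTS_JSONL), lines=True)

report = quality_report(orders, items)
for name, result in report.items():
    print(f"{result}  {name}")
```

The sample line items deliberately include an `order_id` that does not appear in the orders, so the referential check reports `FAIL` while the other two pass.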

## Files

Paths are relative to this lab folder, `shared/labs/ch02_data_models_formats_quality/`.

| Path | Purpose |
|---|---|
| `data/orders.csv` | Small operational order export. |
| `data/order_items.csv` | Line-item export used for referential and revenue checks. |
| `data/events.jsonl` | Semi-structured campaign and clickstream events. |
| `sql/turanmart_operational_model.sql` | Simplified normalized ER model as SQL DDL. |
| `sql/turanmart_star_schema.sql` | Simplified dimensional model as SQL DDL. |
| `expected_output/revenue_by_date_region.csv` | Expected DuckDB revenue result for validation. |
| `tests/validate_lab_outputs.py` | Lightweight validation script for generated Parquet outputs. |
| `exercises/README.md` | Student exercises that extend the guided lab. |
| `../../notebooks/ch02_formats_quality_lab.ipynb` | Executable conversion and validation notebook. |
| `../../solutions/ch02_data_models_formats_quality/solution.md` | Instructor/reference solution guide. |

## Quick Start

From the repository root, run:

```bash
python -m pip install -r requirements.txt
jupyter lab shared/notebooks/ch02_formats_quality_lab.ipynb
```

The notebook writes Parquet files to `shared/labs/ch02_data_models_formats_quality/output/parquet/`. The `output/` folder is generated by the notebook, so you can delete it safely at any time. After running the notebook, validate the generated output with:

```bash
python shared/labs/ch02_data_models_formats_quality/tests/validate_lab_outputs.py
```
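
If the validation script reports missing inputs, a quick pre-check can confirm that the notebook actually produced its files. The sketch below is not part of the lab: the helper name `missing_parquet_files` is hypothetical, and it assumes the three Parquet files sit directly inside the output directory.

```python
from pathlib import Path

# File names taken from the completion checklist; the flat layout is an assumption.
EXPECTED_FILES = ("orders.parquet", "order_items.parquet", "events.parquet")


def missing_parquet_files(parquet_dir):
    """Return the expected Parquet file names that are absent from parquet_dir."""
    base = Path(parquet_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    # Run from the repository root, like the commands above.
    missing = missing_parquet_files(
        "shared/labs/ch02_data_models_formats_quality/output/parquet"
    )
    print("Missing files:", ", ".join(missing) if missing else "none")
```

An empty result means all three files are present and the validation script has something to check.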

This chapter does not require Docker Compose because the lab runs locally with Python, pandas, PyArrow, and DuckDB.

## Completion Checklist

| Check | Expected result |
|---|---|
| Raw files load | Orders, order items, and events row counts are printed. |
| Quality report passes | All checks show `PASS`. |
| Parquet files exist | `orders.parquet`, `order_items.parquet`, and `events.parquet` are created. |
| DuckDB query succeeds | Revenue by date and region is returned from Parquet files. |
