# Chapter 17 Guided Lab: Design a RAG Knowledge Pipeline

This lab turns Chapter 17 into a concrete design artifact for an internal policy assistant. The goal is not to call a specific vendor API. The goal is to practice the data engineering decisions that make retrieval-augmented generation reliable: source inventory, chunk schema, access-control metadata, embedding versioning, evaluation questions, and operational validation.

## Scenario

A company wants an assistant that answers employee policy, support, IT, and security questions using approved internal documents. The assistant must cite its sources, avoid restricted content, refuse unsafe requests, and stay current as source documents change.

## Materials

| File | Purpose |
|---|---|
| `source_inventory.csv` | Starter list of knowledge sources, owners, sensitivity levels, update cadence, and access tags. |
| `chunk_schema.json` | Canonical chunk record that students can extend before implementing indexing. |
| `evaluation_questions.csv` | Starter retrieval, grounding, freshness, and refusal tests. |
| `tests/validate_rag_lab.py` | Lightweight validator for required fields and risk coverage. |
| `exercises/README.md` | Optional extension tasks after the main lab. |

## Workflow

1. Review `source_inventory.csv` and decide which sources belong in the first production release. For each selected source, document the owner, freshness requirement, sensitivity level, and access policy.
2. Adapt `chunk_schema.json` to your selected stack, making sure every chunk can be traced back to a document, source URI, section path, content hash, access tags, chunking strategy, embedding model, and vector dimension.
3. Expand `evaluation_questions.csv` with at least 30 realistic questions, including normal questions, source-lookup questions, freshness tests, permission tests, and refusal cases.
4. Write a short design note explaining your chunking strategy, vector-store choice, hybrid retrieval strategy, reranking approach, and rollback plan.
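
The traceability requirement for chunks can be made concrete with a small sketch. The key names, the `build_chunk_record` helper, and every sample value below are assumptions for illustration; they are not the keys defined in `chunk_schema.json`, so rename them to match your adapted schema.

```python
import hashlib
from datetime import datetime, timezone


def build_chunk_record(document_id, source_uri, section_path, text,
                       access_tags, chunking_strategy, embedding_model,
                       vector_dimension):
    """Return one chunk record carrying the traceability fields from the workflow.

    All key names are illustrative assumptions, not the keys in chunk_schema.json.
    """
    # Hash the exact chunk text so changed content can be detected on re-ingestion.
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "chunk_id": f"{document_id}:{content_hash[:12]}",
        "document_id": document_id,
        "source_uri": source_uri,
        "section_path": list(section_path),
        "content_hash": content_hash,
        "access_tags": sorted(access_tags),
        "chunking_strategy": chunking_strategy,
        "embedding_model": embedding_model,
        "vector_dimension": vector_dimension,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "text": text,
    }


record = build_chunk_record(
    document_id="hr-leave-policy",                         # hypothetical document
    source_uri="https://intranet.example/policies/leave",  # hypothetical URI
    section_path=["Leave", "Parental leave"],
    text="Employees may request parental leave through the HR portal.",
    access_tags={"all-employees"},                         # hypothetical tag
    chunking_strategy="heading-v1",                        # hypothetical label
    embedding_model="example-embedder-v1",                 # hypothetical model name
    vector_dimension=768,
)
print(record["chunk_id"])
```

Storing the embedding model and chunking strategy on every record is what lets you re-embed or roll back one version at a time instead of rebuilding the whole index blind.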

## Validation

Run the dependency-free validator from the lab directory:

```bash
python3 tests/validate_rag_lab.py
```

Expected output:

```text
PASS validate_source_inventory
PASS validate_chunk_schema
PASS validate_evaluation_questions
PASS Chapter 17 RAG guided-lab artifacts are internally consistent
```

The validator does not prove that a RAG system is accurate. It only verifies that the design artifacts include the minimum metadata, schema, and evaluation coverage required before implementation.
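
To illustrate what a structural coverage check looks like, as opposed to an accuracy check, here is a minimal sketch. It is not the code in `tests/validate_rag_lab.py`; the `category` column, the category labels, and the `coverage_gaps` helper are assumptions chosen to mirror the question types named in the workflow.

```python
from collections import Counter

# Hypothetical category labels mirroring the question types named in the workflow.
REQUIRED_CATEGORIES = {"normal", "source_lookup", "freshness", "permission", "refusal"}


def coverage_gaps(questions, minimum_total=30):
    """Return a list of readable problems; an empty list means coverage is met."""
    counts = Counter((q.get("category") or "").strip().lower() for q in questions)
    problems = []
    if len(questions) < minimum_total:
        problems.append(f"only {len(questions)} questions, need at least {minimum_total}")
    for category in sorted(REQUIRED_CATEGORIES):
        if counts[category] == 0:
            problems.append(f"no questions in category '{category}'")
    return problems


# Tiny in-memory sample; real rows would come from csv.DictReader.
sample = [{"category": "normal"}, {"category": "Refusal"}]
for problem in coverage_gaps(sample, minimum_total=2):
    print("FAIL", problem)
```

A check like this can confirm that refusal and permission questions exist, but it says nothing about whether the assistant answers them correctly.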

## Expected Deliverables

| Deliverable | Acceptance criteria |
|---|---|
| Source inventory | Includes owners, sensitivity levels, update cadence, format, document counts, and access tags for each source. |
| Chunk schema | Includes traceability, access control, embedding versioning, chunking versioning, quality status, and timestamps. |
| Evaluation set | Contains at least 30 questions with expected source documents and refusal cases. |
| Architecture note | Explains ingestion cadence, parsing quality gates, chunking, indexing, hybrid retrieval, reranking, evaluation, monitoring, and rollback. |
| Security note | Describes permission filtering, prompt-injection handling, sensitive-data controls, logging policy, and incident review. |
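
For the permission-filtering part of the security note, a sketch like the following can anchor the discussion. It assumes each retrieved chunk carries an `access_tags` list and that a user is described by a set of group names; both the field name and the matching rule (at least one shared tag, untagged chunks denied) are assumptions, not requirements stated by the lab.

```python
def filter_by_access(chunks, user_groups):
    """Keep only chunks the user may see.

    Assumption: a chunk is visible when it shares at least one access tag with
    the user's groups, and a chunk with no tags is treated as restricted.
    """
    allowed = set(user_groups)
    return [c for c in chunks if allowed & set(c.get("access_tags") or ())]


# Hypothetical retrieval results, before any text reaches the prompt.
retrieved = [
    {"chunk_id": "a", "access_tags": ["all-employees"]},
    {"chunk_id": "b", "access_tags": ["security-team"]},
    {"chunk_id": "c", "access_tags": []},  # untagged: denied by default
]

visible = filter_by_access(retrieved, user_groups={"all-employees", "it-support"})
print([c["chunk_id"] for c in visible])
```

Denying untagged chunks by default means a metadata gap during ingestion fails closed. Where your vector store supports metadata filters, applying the same rule inside the query is usually preferable, since restricted text then never leaves the index.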

## Cleanup

This lab writes no generated files by default. If you implement a prototype, keep generated indexes, embeddings, and logs under `outputs/` and remove them before committing unless they are small, deterministic, and intentionally part of the exercise.

## Troubleshooting

If validation fails because a column is missing, compare your file headers with the starter files. If the validator reports missing risk coverage, add refusal or security questions to `evaluation_questions.csv`. If your design cannot support a required schema field, explain the exception in your architecture note and identify the compensating control.
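
When a missing-column failure is hard to spot by eye, a short header comparison can help. This is a sketch under assumptions: the column names are invented for the example, `header_diff` is a hypothetical helper, and in-memory strings stand in for the starter file and your edited copy.

```python
import csv
import io


def read_header(csv_file):
    """Return the first row of an open CSV file as a list of column names."""
    return next(csv.reader(csv_file), [])


def header_diff(starter_file, edited_file):
    """Return (missing, extra) column names of the edited file versus the starter."""
    starter = read_header(starter_file)
    edited = read_header(edited_file)
    missing = [c for c in starter if c not in edited]
    extra = [c for c in edited if c not in starter]
    return missing, extra


# In-memory stand-ins with hypothetical column names.
starter = io.StringIO("question_id,question,expected_source,category\n")
edited = io.StringIO("question_id,question,category,notes\n")

missing, extra = header_diff(starter, edited)
print("missing:", missing)
print("extra:", extra)
```

With real files, pass handles opened with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` objects.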