# Chapter 20 Guided Lab: Build a Governed Semantic Search Prototype with pgvector

This lab turns Chapter 20 into a concrete, reviewable semantic-search prototype for TuranMart product, support, and policy content. The goal is neither to depend on a paid embedding API nor to benchmark a large production vector database. It is to practice the data-engineering decisions that make vector search dependable: document chunk metadata, deterministic embedding generation, pgvector schema design, metadata filters, golden-query evaluation, and promotion criteria for an ANN index.

## Scenario

TuranMart wants a search service that helps product managers and support agents find relevant catalog and support passages even when a user describes the problem in natural language. The first release indexes a small set of product and support snippets locally in PostgreSQL with pgvector. The service must keep vectors tied to source metadata, support filters such as language and visibility, and prove that retrieval quality is measured before any ANN tuning is promoted.

## Materials

| File | Purpose |
|---|---|
| `docker-compose.yml` | Local PostgreSQL service using the pgvector image for readers who want to execute SQL end to end. |
| `sql/01_schema.sql` | Starter schema for `document_chunks`, including vector, lineage, metadata, and index examples. |
| `starter_semantic_search.py` | Dependency-free Python prototype that creates deterministic hash-based embeddings for local experimentation. |
| `data/source_documents.csv` | Starter TuranMart source passages with owners, language, visibility, and source checksums. |
| `data/golden_queries.csv` | Retrieval evaluation set with expected sources and query categories. |
| `expected_output/sample_search_results.json` | Example output showing the expected shape of a search response. |
| `tests/validate_vector_search_lab.py` | Lightweight validator for lab artifacts, schema fields, golden-query coverage, and sample output. |
| `exercises/README.md` | Extension exercises after the main guided lab. |

## Workflow

1. Inspect `data/source_documents.csv` and confirm that every passage has a stable `source_id`, `chunk_id`, owner, language, visibility level, checksum, and source URI. These fields are the governance layer around the vector.
2. Read `sql/01_schema.sql` and compare the table definition with the metadata requirements in the chapter. If you change the embedding dimension, update both the SQL vector type and the prototype configuration.
3. Run `starter_semantic_search.py` to produce deterministic local results without calling an external model. The hash-based embedding is intentionally simple; it exists so that readers can test chunking, filtering, and evaluation flow before replacing it with a real embedding model.
4. Expand `data/golden_queries.csv` with more product, support, exact-code, and permission-sensitive queries.
5. If Docker is available, start PostgreSQL with pgvector and load the schema so you can translate the prototype workflow into SQL-backed search.
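To make the idea of a deterministic hash-based embedding concrete, here is a minimal sketch of how such a harness could work. It is not the code in `starter_semantic_search.py`: the bucket-hashing scheme, the `search` helper, the dimension of 16, and the sample chunks are all assumptions chosen for illustration. Only the names `embed_text` and `EMBEDDING_DIMENSION` come from this README.

```python
import hashlib
import math

EMBEDDING_DIMENSION = 16  # assumption: must match vector(...) in the SQL schema


def embed_text(text, dim=EMBEDDING_DIMENSION):
    """Hash each lowercase token into one of `dim` signed buckets, then L2-normalize.

    This is a toy feature-hashing scheme (an assumption), not a semantic model.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def search(query, chunks, k=3, language=None, visibility=None):
    """Apply metadata filters first, then rank the remaining chunks by similarity."""
    candidates = [
        c for c in chunks
        if (language is None or c["language"] == language)
        and (visibility is None or c["visibility"] == visibility)
    ]
    q = embed_text(query)
    scored = [(cosine(q, embed_text(c["passage"])), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"score": round(s, 4), "source_id": c["source_id"], "chunk_id": c["chunk_id"]}
        for s, c in scored[:k]
    ]


# Hypothetical sample chunks; the real rows live in data/source_documents.csv.
chunks = [
    {"chunk_id": "c1", "source_id": "s1", "language": "en",
     "visibility": "public", "passage": "Refunds accepted within thirty days"},
    {"chunk_id": "c2", "source_id": "s2", "language": "tr",
     "visibility": "public", "passage": "Iade otuz gun icinde kabul edilir"},
    {"chunk_id": "c3", "source_id": "s3", "language": "en",
     "visibility": "internal", "passage": "Escalate damaged item claims to tier two"},
]

for hit in search("refunds within thirty days", chunks, language="en"):
    print(hit)
```

Because the hash is deterministic, the same text always maps to the same vector, which keeps filter and evaluation tests repeatable even though the scores carry no real semantic meaning.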

## Optional pgvector startup

From the lab directory, run the following commands if Docker is available:

```bash
docker compose up -d
psql postgresql://postgres:postgres@localhost:5432/turanmart_vectors -f sql/01_schema.sql
```

When you finish experimenting, stop the service and remove its volume with:

```bash
docker compose down -v
```

## Validation

Run the dependency-free validator from the lab directory:

```bash
python3 tests/validate_vector_search_lab.py
```

Expected output:

```text
PASS validate_source_documents
PASS validate_golden_queries
PASS validate_schema_sql
PASS validate_sample_output
PASS Chapter 20 vector search guided-lab artifacts are internally consistent
```

The validator does not prove that vector search is accurate. It verifies that the lab artifacts include the minimum metadata, schema, evaluation coverage, and output shape required before readers tune embeddings, indexes, or rerankers.
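As an illustration of what a coverage check of this kind might look like, the sketch below reports which required query categories are absent from a golden-query file. It is not the logic of `tests/validate_vector_search_lab.py`: the `category` column name, the exact category labels, and the inline sample rows are assumptions.

```python
import csv
import io

# Assumption: category labels are spelled this way in the golden-query file.
REQUIRED_CATEGORIES = {
    "semantic",
    "exact-code",
    "metadata-filter",
    "support",
    "permission-sensitive",
}


def missing_categories(csv_text, column="category"):
    """Return the required categories that no row in the CSV text covers."""
    rows = csv.DictReader(io.StringIO(csv_text))
    seen = {row[column].strip().lower() for row in rows if row.get(column)}
    return sorted(REQUIRED_CATEGORIES - seen)


# Hypothetical rows; the real file is data/golden_queries.csv.
sample = """query_id,query,expected_source_id,category
q1,how do I send an item back,s1,semantic
q2,SKU TM-1042,s2,exact-code
q3,return policy in Turkish,s3,metadata-filter
"""

print(missing_categories(sample))
```

To run the same check against the lab file, one could pass the contents of `data/golden_queries.csv` as `csv_text`, adjusting `column` if the header uses a different name.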

## Expected Deliverables

| Deliverable | Acceptance criteria |
|---|---|
| Source inventory | Contains at least five searchable chunks with stable IDs, checksums, owners, visibility, source URIs, and non-empty passage text. |
| Vector schema | Stores chunk text, vector, embedding model/version, dimension, checksum, source URI, language, visibility, and timestamps. |
| Semantic-search prototype | Produces top-*k* results with scores, source IDs, chunk IDs, passages, and metadata filters. |
| Golden-query evaluation | Includes semantic, exact-code, metadata-filter, support, and permission-sensitive queries. |
| Benchmark note | Compares exact search, a candidate ANN configuration, and hybrid or reranked search using Recall@k and latency assumptions. |
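For the benchmark note, Recall@k can be computed per golden query as the fraction of expected sources that appear in the top *k* results, then averaged across queries. The sketch below shows one way to do this; the function names, the query IDs, and the sample result lists are hypothetical and do not come from the lab files.

```python
def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of expected IDs that appear among the first k retrieved IDs."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("expected_ids must not be empty")
    hits = expected & set(retrieved_ids[:k])
    return len(hits) / len(expected)


def mean_recall_at_k(results, golden, k):
    """Average Recall@k over every golden query; missing results count as zero."""
    scores = [
        recall_at_k(results.get(query_id, []), expected, k)
        for query_id, expected in golden.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical golden set: query ID -> expected source IDs.
golden = {"q1": ["s1"], "q2": ["s2", "s3"]}

# Hypothetical ranked results from an exact scan and a candidate ANN index.
exact_results = {"q1": ["s1", "s4", "s5"], "q2": ["s2", "s3", "s1"]}
ann_results = {"q1": ["s4", "s1", "s5"], "q2": ["s2", "s5", "s1"]}

print("exact Recall@2:", mean_recall_at_k(exact_results, golden, k=2))
print("ANN   Recall@2:", mean_recall_at_k(ann_results, golden, k=2))
```

Treating exact search as the baseline makes any recall lost to an ANN configuration visible before that configuration is promoted.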

## Cleanup

The starter Python script prints results to standard output and writes no generated files by default. If you create real embeddings, indexes, benchmark logs, or exported reports, store them under `outputs/` and remove large generated files before committing. The `docker compose down -v` command shown in the pgvector startup section also removes the local PostgreSQL volume.

## Troubleshooting

| Problem | Likely cause | Practical fix |
|---|---|---|
| Docker cannot bind port `5432` | A local PostgreSQL server is already running. | Change the host port in `docker-compose.yml`, or stop the conflicting service. |
| `CREATE EXTENSION vector` fails | The database image or server does not include pgvector. | Use the included pgvector image or install the extension in your PostgreSQL environment. |
| Query results look random | The starter uses deterministic hash embeddings, not a trained semantic model. | Treat it as a workflow harness, then replace `embed_text` with a real embedding model. |
| Validator reports missing coverage | The golden query file lacks one or more required query categories. | Add rows for semantic, exact-code, metadata-filter, support, and permission-sensitive behavior. |
| Dimension mismatch | SQL vector dimension and embedding generator dimension differ. | Update `EMBEDDING_DIMENSION` in the script and `vector(...)` in the SQL schema together. |
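A dimension mismatch can be caught before any data is loaded by comparing the `vector(...)` declarations in the schema with the script's configured dimension. The following is a rough sketch under assumptions: the regular expression, the helper names, the value 16, and the sample DDL are illustrative, and the actual contents of `sql/01_schema.sql` may differ.

```python
import re

EMBEDDING_DIMENSION = 16  # assumption: the value configured in the prototype script

# Matches declarations such as vector(16) or VECTOR ( 384 ).
VECTOR_TYPE = re.compile(r"\bvector\s*\(\s*(\d+)\s*\)", re.IGNORECASE)


def schema_dimensions(sql_text):
    """Return every dimension declared with vector(N) in the SQL text."""
    return [int(n) for n in VECTOR_TYPE.findall(sql_text)]


def dimensions_match(sql_text, expected=EMBEDDING_DIMENSION):
    """True only if at least one vector column exists and all use `expected`."""
    dims = schema_dimensions(sql_text)
    return bool(dims) and all(d == expected for d in dims)


# Hypothetical DDL fragment standing in for sql/01_schema.sql.
sample_sql = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
    chunk_id TEXT PRIMARY KEY,
    embedding vector(16) NOT NULL
);
"""

print(schema_dimensions(sample_sql), dimensions_match(sample_sql))
```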