# Chapter 20 Solution Guide: Governed Semantic Search with pgvector

A strong Chapter 20 submission shows that the student understands vector search as a governed data product rather than as a standalone index. The reference design keeps every embedding connected to its source passage, source URI, checksum, owner, visibility level, embedding model, embedding version, and evaluation evidence.

## Expected interpretation of the starter artifacts

| Artifact | What a correct solution should demonstrate |
|---|---|
| `data/source_documents.csv` | The source inventory contains stable identifiers, owner metadata, visibility labels, checksums, language, section titles, and enough passage text to support realistic retrieval tests. |
| `sql/01_schema.sql` | The schema stores vectors and metadata together, creates conventional indexes for filters, keeps an exact-search query for baseline evaluation, and introduces HNSW only as a benchmarked candidate index. |
| `starter_semantic_search.py` | The prototype proves the end-to-end workflow without external dependencies. Students should recognize that hash embeddings are a deterministic test harness, not a production semantic model. |
| `data/golden_queries.csv` | The evaluation set includes semantic questions, exact identifiers, support questions, metadata-filter scenarios, and permission-sensitive behavior. |
| `expected_output/sample_search_results.json` | The response shape includes query context, retrieval mode, embedding version, filters, source IDs, chunk IDs, scores, visibility, source URIs, and returned passage text. |
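
To make the schema row concrete, the sketch below shows one plausible shape for a pgvector table that keeps vectors and governance metadata together. It is not a reproduction of `sql/01_schema.sql`. The table name `passage_chunks`, the column names, the 384-dimension vector, and the psycopg-style named placeholders are all assumptions chosen for illustration. The SQL is held in Python strings so that the filter parameters can be built alongside it.

```python
# Hypothetical schema sketch; names and the vector dimension are assumptions.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS passage_chunks (
    chunk_id          text PRIMARY KEY,
    source_id         text NOT NULL,
    source_uri        text NOT NULL,
    checksum          text NOT NULL,
    owner             text NOT NULL,
    visibility        text NOT NULL,
    language          text NOT NULL,
    embedding_model   text NOT NULL,
    embedding_version text NOT NULL,
    passage_text      text NOT NULL,
    embedding         vector(384) NOT NULL  -- dimension is an assumption
);

-- Conventional B-tree index for the metadata filters.
CREATE INDEX IF NOT EXISTS passage_chunks_filter_idx
    ON passage_chunks (language, visibility, embedding_version);
"""

# Created separately so the HNSW index stays a benchmarked candidate.
HNSW_CANDIDATE_SQL = """
CREATE INDEX IF NOT EXISTS passage_chunks_hnsw_idx
    ON passage_chunks USING hnsw (embedding vector_cosine_ops);
"""

# Filters sit in the WHERE clause; <=> is pgvector's cosine distance operator.
SEARCH_SQL = """
SELECT chunk_id, source_uri, visibility, passage_text,
       1 - (embedding <=> %(query)s::vector) AS score
FROM passage_chunks
WHERE language = %(language)s
  AND embedding_version = %(embedding_version)s
  AND visibility = ANY(%(visibility)s)
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values):
    """Render floats in pgvector's text input format, e.g. [0.5,1.0]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def search_params(query_embedding, language, allowed_visibility,
                  embedding_version, k):
    """Build the parameter dict that SEARCH_SQL expects."""
    return {
        "query": to_vector_literal(query_embedding),
        "language": language,
        "embedding_version": embedding_version,
        "visibility": list(allowed_visibility),
        "k": int(k),
    }
```

Before the HNSW index exists, the `ORDER BY ... <=>` query scans every row, which is what makes it usable as the exact baseline.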

## Reference architecture explanation

The preferred architecture separates the offline indexing path from the online query path. The offline path reads approved TuranMart product and support content, extracts passages, assigns stable chunk IDs, computes checksums, generates embeddings, validates dimensions, and upserts the records into PostgreSQL with pgvector. The online path receives a user query, generates a query embedding, applies language and visibility filters, runs exact or ANN similarity search, optionally combines lexical retrieval, and returns cited passages with scores and metadata.
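
The two paths can be sketched in plain Python without a database. This is a minimal illustration under stated assumptions, not the contents of `starter_semantic_search.py`: the `Chunk` fields, the `source_id#position` chunk ID format, the 16-dimension hash embedding, and the `hash-v1` version label are all hypothetical.

```python
import hashlib
import math
from dataclasses import dataclass

EMBEDDING_DIM = 16              # assumption; a real model fixes its own size
EMBEDDING_VERSION = "hash-v1"   # hypothetical version label


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source_id: str
    text: str
    checksum: str
    language: str
    visibility: str
    embedding_version: str
    embedding: tuple


def hash_embed(text, dim=EMBEDDING_DIM):
    """Deterministic bag-of-words hash embedding (test harness only)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return tuple(v / norm for v in vec) if norm else tuple(vec)


def build_chunk(source_id, position, text, language, visibility):
    """Offline path: stable ID, checksum, embedding, dimension check."""
    embedding = hash_embed(text)
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError("embedding dimension mismatch")
    return Chunk(
        chunk_id=f"{source_id}#{position:04d}",
        source_id=source_id,
        text=text,
        checksum=hashlib.sha256(text.encode("utf-8")).hexdigest(),
        language=language,
        visibility=visibility,
        embedding_version=EMBEDDING_VERSION,
        embedding=embedding,
    )


def exact_search(chunks, query, language, allowed_visibility, k=3):
    """Online path: filter first, then score every remaining chunk."""
    q = hash_embed(query)
    candidates = [
        c for c in chunks
        if c.language == language and c.visibility in allowed_visibility
    ]
    # Vectors are unit length, so the dot product is the cosine similarity.
    scored = [
        (sum(a * b for a, b in zip(q, c.embedding)), c) for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Because the filter runs before scoring, a chunk outside the caller's visibility set never enters the candidate list and so cannot appear in the results.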

A production implementation should replace the deterministic hash embedding function with a real embedding model. The replacement must preserve model/version fields and must not mix incompatible embedding spaces in the same active index. When a new model is evaluated, the safer migration pattern is to build a new index version, run golden-query evaluation, shadow traffic when possible, and promote the new index only after recall, latency, and safety criteria are met.
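
The promotion step can be expressed as an explicit gate. The sketch below is illustrative only: the `EvalReport` fields, the function names, and the thresholds passed in are assumptions, and a real course rubric may define the criteria differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    """Golden-query results for one candidate index (hypothetical fields)."""
    index_version: str
    embedding_model: str
    recall_at_k: float
    p95_latency_ms: float
    permission_leaks: int


def check_single_embedding_space(rows):
    """Raise if the rows of one index mix embedding model/version pairs."""
    spaces = {(r["embedding_model"], r["embedding_version"]) for r in rows}
    if len(spaces) > 1:
        raise ValueError(f"mixed embedding spaces: {sorted(spaces)}")


def promotion_decision(report, min_recall, max_p95_ms):
    """Return (ok, reasons); promote only when every criterion passes."""
    reasons = []
    if report.recall_at_k < min_recall:
        reasons.append(
            f"recall {report.recall_at_k:.2f} is below {min_recall:.2f}"
        )
    if report.p95_latency_ms > max_p95_ms:
        reasons.append(
            f"p95 {report.p95_latency_ms:.0f} ms exceeds {max_p95_ms:.0f} ms"
        )
    if report.permission_leaks > 0:
        reasons.append(
            f"{report.permission_leaks} permission-sensitive queries leaked"
        )
    return (not reasons, reasons)
```

Returning the list of failed criteria, rather than a bare boolean, gives the migration log a record of why a candidate index was held back.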

## Suggested benchmark note

| Retrieval mode | Expected role in the solution | Promotion guidance |
|---|---|---|
| Exact baseline | Establishes the quality reference because the query is compared against every stored vector, so no true neighbor is skipped. | Keep it available for offline evaluation even if it is too slow for production traffic. |
| HNSW candidate | Reduces query latency by using an approximate graph index. | Promote only if Recall@k remains above the target and p95 latency meets the user-experience requirement. |
| Hybrid retrieval | Combines exact lexical matches with semantic retrieval. | Use when product codes, error codes, SKUs, or policy identifiers must be found exactly. |
| Reranking | Improves final ordering by applying a stronger model or rule set to top candidates. | Use when quality gains justify added latency and cost. |
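
The two promotion metrics in the table are straightforward to compute. The following sketch assumes result lists are ordered ID lists and latencies are in milliseconds; the function names are hypothetical, and the p95 uses the nearest-rank definition, which is one of several common percentile conventions.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant IDs that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def ann_recall_vs_exact(ann_ids, exact_ids, k):
    """Fraction of the exact top-k that the ANN index also returns."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("exact_ids must not be empty")
    return len(exact_top & set(ann_ids[:k])) / len(exact_top)


def p95_latency(latencies_ms):
    """Nearest-rank 95th percentile of the observed latencies."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

`recall_at_k` scores a retrieval mode against the golden query labels, while `ann_recall_vs_exact` measures how much the HNSW candidate loses relative to the exact baseline on the same query.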

## Common grading issues

| Issue | Why it matters | Corrective feedback |
|---|---|---|
| Missing visibility metadata | Search can leak restricted passages. | Add visibility, department, tenant, or role fields and enforce the filters in the retrieval query itself, not only before display. |
| Treating ANN latency as the only metric | Fast retrieval can miss critical documents. | Require Recall@k or Precision@k against a golden query set before promotion. |
| Mixing embedding model versions | Vectors from different models occupy different embedding spaces, so distances between them are not comparable. | Store model/version metadata and migrate with blue-green index versions. |
| Weak source lineage | Results cannot be cited, refreshed, or debugged. | Store source URI, checksum, section title, chunk ID, and creation/update timestamps. |
| Uncontrolled re-embedding | Cost can spike after small experiments. | Cache by source checksum and model version, then re-embed only changed chunks. |
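
The corrective feedback for uncontrolled re-embedding can be sketched as a small cache keyed by content checksum and model version. This is an in-memory illustration only; the class name, the `misses` counter, and the use of SHA-256 are assumptions, and a real pipeline would more likely persist the key alongside the stored chunk.

```python
import hashlib


class EmbeddingCache:
    """Re-embed a passage only when its text or the model version changes."""

    def __init__(self, embed_fn, model_version):
        self._embed_fn = embed_fn          # any callable: text -> vector
        self._model_version = model_version
        self._store = {}
        self.misses = 0                    # number of actual embedding calls

    def _key(self, text):
        checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return (checksum, self._model_version)

    def get(self, text):
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Including the model version in the key means that a model upgrade invalidates every entry, while an unchanged passage under the same model never triggers a second embedding call.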

## Instructor notes

The validator intentionally checks artifact completeness rather than semantic quality. It is acceptable for the deterministic starter search to return imperfect rankings because the point is to teach a repeatable workflow, metadata discipline, and evaluation design. A high-quality student extension should replace the embedding function, populate PostgreSQL, compare exact and HNSW retrieval, and explain the trade-offs among recall, latency, cost, and governance.
