Chapter 20: Vector Databases and Embeddings - Data Engineering in Action

A modern AI application is only as good as the data it can retrieve. In Chapter 17, you built the data-engineering foundations of retrieval-augmented generation. In Chapter 18, you turned machine learning into a repeatable pipeline. In Chapter 19, you learned how feature stores and model serving connect training-time features with online inference. This chapter completes Part 5 by focusing on the storage and retrieval layer that powers semantic search, recommendations, anomaly detection, duplicate detection, multimodal search, and many retrieval-augmented generation systems: embeddings and vector databases.

Opening Scenario: TuranMart’s Semantic Search Challenge

Imagine that TuranMart, the e-commerce company used throughout this part of the book, has years of product descriptions, customer-support notes, seller policies, return rules, troubleshooting articles, and logistics exceptions. A support agent asks, “What should I do when the smart kettle shows E17?” A product manager asks, “Which air purifier pages mention HEPA filter replacement?” A general support employee should not see restricted vendor-risk investigations. A keyword engine helps with exact terms, but it struggles when users describe intent in natural language or when business rules require semantic search plus permission filtering. A vector-powered retrieval platform solves this problem only if the data engineering is disciplined: source records must be prepared, chunked, embedded, indexed, filtered, evaluated, versioned, and monitored.

A vector database stores and searches high-dimensional vector representations of data. In production, it is not merely an index. It is part of a governed retrieval platform that includes source ingestion, metadata, embedding versioning, approximate nearest-neighbor search, hybrid ranking, access control, quality evaluation, cost control, and observability.^[1]

Figure 1: Chapter overview: a production vector system combines raw content ingestion, chunking, embedding generation, vector indexing, hybrid retrieval, and evaluation monitoring.

Learning Objectives

By the end of this chapter, you should be able to explain how embeddings represent text, images, audio, code, and other data as vectors; design an embedding pipeline that is reproducible, versioned, and cost-aware; choose between exact k-nearest-neighbor search and approximate nearest-neighbor indexes; compare pgvector, Milvus, and managed vector-search options using engineering criteria; implement hybrid retrieval with metadata filters and reranking; and evaluate vector-search quality using Recall@k, Precision@k, latency, freshness, and operational cost.

Capability	What you should be able to do	Practical artifact
Embedding design	Choose an embedding model, chunking strategy, metadata schema, and versioning approach.	Embedding pipeline specification.
Vector search	Explain distance metrics, top-k search, exact scan, and ANN indexes.	Search design with index parameters.
Database selection	Compare pgvector, Milvus, search engines, and managed vector databases.	Technology decision matrix.
Retrieval quality	Measure Recall@k, Precision@k, p95 latency, freshness, and query failure modes before tuning.	Retrieval benchmark table.
Guided lab	Build a governed semantic-search prototype using PostgreSQL with pgvector-style schema and deterministic local embeddings.	Runnable lab in `shared/labs/ch20_vector_search_pgvector/`.

20.1 Embeddings: Turning Meaning into Data

An embedding is a dense vector of numbers produced by a model. The vector is not manually designed. It is learned from data so that similar items are usually located near each other in vector space. A sentence about “stream processing with Kafka” and another sentence about “real-time event pipelines” may not share many exact words, but a good embedding model should place them close enough that similarity search can retrieve both. Microsoft’s vector database guidance describes embeddings as numerical representations that map words, documents, or other objects into a vector space where similarity can be computed.^[1]

Embeddings are valuable because they transform unstructured and semi-structured information into data that systems can index, compare, cluster, and score. Text embeddings are common in business applications, but the same pattern applies to image embeddings for visual search, audio embeddings for speech retrieval, graph embeddings for relationship analysis, code embeddings for developer assistance, and multimodal embeddings that place different data types in a shared vector space.

Data type	Example source	Embedding use case	Metadata that must travel with the vector
Text	Policies, tickets, documentation, contracts	Semantic search, RAG, duplicate detection	Document ID, language, section, owner, access policy, timestamp.
Image	Product photos, diagrams, scanned documents	Visual similarity, product discovery, quality inspection	Image URL, category, license, resolution, moderation labels.
Audio	Call-center recordings, lectures, meetings	Speech search after transcription, speaker analytics	Speaker, recording time, transcript pointer, consent status.
Code	Source files, notebooks, SQL scripts	Code search, migration assistance, developer support	Repository, branch, file path, language, commit hash.
User-item events	Clicks, purchases, ratings	Recommendations and personalization	User segment, item catalog ID, event time, experiment variant.

A data engineer should treat embeddings as a derived data product. They are generated from source content through deterministic preparation steps such as extraction, cleaning, chunking, tokenization, model inference, normalization, and validation. If an upstream parser changes, if a chunking rule changes, or if the embedding model changes, the downstream vectors may change even when the original document looks the same to a user. Production embedding pipelines therefore need versioned source snapshots, model identifiers, chunking parameters, schema contracts, and repeatable backfill jobs.

Embedding dimensionality, distance, and normalization

Embedding models output vectors with a fixed number of dimensions. A vector with 768 dimensions stores 768 floating-point values; a vector with 1,536 dimensions stores twice as many values. More dimensions can represent richer information, but they also increase storage size, memory pressure, index build time, network transfer, and query cost. The right dimensionality is not always the largest available option. It is the smallest representation that satisfies the application’s retrieval-quality target.

Similarity is usually computed with one of three measures. Cosine similarity measures the angle between vectors, ignores their length, and is common for text embeddings. Inner product is often used when embeddings are normalized or when the model was trained for dot-product ranking. Euclidean (L2) distance measures straight-line distance and can work well for some numeric or image embeddings. pgvector supports exact and approximate nearest-neighbor search and provides distance operations for L2 distance, inner product, cosine distance, and L1 distance, plus Hamming distance and Jaccard distance for binary vectors.^[3]
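The three measures can be written out in a few lines. The following is a minimal sketch in plain Python; the helper names and the toy three-dimensional vectors are illustrative assumptions, not part of the lab code.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to unit length."""
    n = l2_norm(a)
    return [x / n for x in a]

# Toy vectors (illustrative only).
query = [3.0, 4.0, 0.0]
doc = [6.0, 8.0, 0.0]    # same direction as query, twice the length
other = [0.0, 0.0, 5.0]  # orthogonal to query

print(cosine_similarity(query, doc))   # direction matches, length is ignored
print(euclidean_distance(query, doc))  # length difference shows up here
print(dot(normalize(query), normalize(doc)))  # equals cosine after normalization
```

The example shows why the metric must match the model: `query` and `doc` are identical under cosine similarity but five units apart under Euclidean distance.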

Design choice	Engineering question	Practical implication
Dimension count	How many numbers does each embedding contain?	Higher dimensions increase storage, memory, and index cost.
Numeric type	Are vectors stored as float32, float16, binary, or quantized values?	Smaller types reduce cost but may reduce recall.
Normalization	Are vectors normalized before indexing?	For unit-length vectors, cosine similarity equals the inner product, so both metrics rank results identically.
Distance metric	Which metric matches the model and retrieval task?	A wrong metric can silently degrade search quality.
Model version	Which embedding model produced this vector?	Mixed model versions in one active index can produce inconsistent retrieval.

A useful sizing estimate is simple enough to calculate during architecture review. If each vector has d dimensions and each dimension is stored as a 4-byte float, raw vector storage is roughly number_of_vectors × d × 4 bytes, before metadata, index overhead, replication, compression, write-ahead logs, and backups.

Number of vectors	Dimensions	Raw vector size with float32 (binary units)	Practical planning note
100,000	384	~146 MiB	Fits comfortably in a local development database.
1,000,000	768	~2.9 GiB	Metadata and indexes may dominate operational planning.
10,000,000	1,536	~57.2 GiB	Requires careful memory, index, and backup planning.
100,000,000	1,536	~572 GiB	Usually requires distributed storage, sharding, compression, or tiering.

These numbers do not replace benchmarking. They help engineers avoid underestimating vector storage. In real systems, metadata columns, HNSW graph structures, inverted lists, replication, and snapshots can multiply the footprint. A production design should therefore include both raw vector size and index expansion factor in the capacity plan.
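The sizing estimate is easy to script for an architecture review. The sketch below reproduces the raw sizes from the table using binary units (1 GiB = 2^30 bytes). The `index_expansion_factor` and `replicas` parameters, and the default value of 2.0, are placeholder assumptions that should be replaced with measured values.

```python
GIB = 1024 ** 3  # bytes in one gibibyte

def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (4 bytes per value means float32)."""
    return num_vectors * dimensions * bytes_per_value

def planned_bytes(num_vectors: int, dimensions: int,
                  index_expansion_factor: float = 2.0, replicas: int = 1) -> float:
    """Raw size scaled by a hypothetical index expansion factor and replica count."""
    return raw_vector_bytes(num_vectors, dimensions) * index_expansion_factor * replicas

# The four collection sizes from the planning table.
for n, d in [(100_000, 384), (1_000_000, 768), (10_000_000, 1536), (100_000_000, 1536)]:
    print(f"{n:>11,} vectors x {d:>4} dims: {raw_vector_bytes(n, d) / GIB:7.2f} GiB raw")
```

Metadata, write-ahead logs, and backups are deliberately left out of this sketch; they belong in separate rows of the capacity plan.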

20.2 The Vector Search Problem

The core query in a vector database is the k-nearest-neighbor problem. Given a query vector, find the k stored vectors that are closest according to the selected distance metric. With a small table, a system can compute the distance from the query vector to every stored vector, sort the results, and return the top matches. This is called exact search or brute-force scan. It is accurate, simple, and often good enough for small datasets, evaluation baselines, and compliance-sensitive workloads.
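Exact search is short enough to write by hand, which is one reason it makes a good baseline. The following sketch assumes cosine distance and a small in-memory dictionary of vectors; the chunk identifiers and vector values are invented for illustration.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity, so smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k):
    """Brute-force k-nearest-neighbor search.

    vectors maps an id to its vector. Every stored vector is scored,
    so the result is exact but the cost grows linearly with the collection.
    """
    scored = ((cosine_distance(query, vec), vec_id) for vec_id, vec in vectors.items())
    return [(vec_id, dist) for dist, vec_id in heapq.nsmallest(k, scored)]

# Toy corpus (illustrative ids and values).
corpus = {
    "kettle_e17": [0.9, 0.1, 0.0],
    "hepa_filter": [0.1, 0.9, 0.1],
    "return_policy": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

for vec_id, dist in exact_top_k(query, corpus, k=2):
    print(vec_id, round(dist, 4))
```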

The challenge appears when the collection grows. A table with 100 million vectors and 1,536 dimensions requires roughly 1.5 × 10^11 multiply-add operations per query (100,000,000 × 1,536) if every query compares against every vector. Users expect interactive search to respond quickly. Databricks notes that latency expectations differ by application: a RAG application may tolerate a low-second time-to-first-token target, while a search box may need responses closer to the hundred-millisecond range.^[2] Exact search over a large unpartitioned vector set often cannot satisfy such latency targets economically.

Approximate nearest-neighbor algorithms solve this by trading a measured amount of recall for a large improvement in speed and cost. Instead of guaranteeing the mathematically exact nearest neighbors, they return highly likely nearest neighbors. This trade-off is acceptable only when it is measured. If a system retrieves irrelevant passages, a RAG application may hallucinate or answer incompletely. If it misses relevant documents, users lose trust.

Figure 2: ANN index trade-offs: exact scan maximizes recall but becomes expensive at scale, while IVF, HNSW, DiskANN-style approaches, and quantization balance latency, memory, and retrieval quality differently.

Search method	How it works	Strength	Limitation	Good first use case
Exact scan	Computes distance against every vector.	Highest recall and easiest debugging.	Expensive at large scale.	Small datasets, evaluation baselines, compliance-sensitive matching.
IVF	Partitions vectors into clusters and searches selected clusters.	Efficient for large collections when tuned.	Requires choosing cluster count and probe count.	Large text collections with moderate latency requirements.
HNSW	Builds a navigable graph of nearby vectors.	Strong recall-latency balance for many workloads.	Index memory can be high and build time can be significant.	Interactive semantic search and recommendations.
DiskANN-style	Optimizes ANN search with disk-aware graph and storage layouts.	Helps with very large collections.	Adds operational complexity and tuning effort.	Large-scale vector search where memory is constrained.
Quantization	Compresses vectors or index representations.	Reduces memory and storage footprint.	Can reduce recall if too aggressive.	Cost-sensitive systems after benchmark validation.

A mature vector-search project should always keep an exact-search or high-recall baseline for evaluation. The baseline does not need to serve all production traffic, but it should be available in experiments so the team can estimate how much quality is lost when ANN parameters are tuned for speed. This is the same engineering pattern used throughout the book: optimize only after establishing a measurable baseline.

20.3 Vector Database Reference Architecture

A vector database is only one component in the complete architecture. The system begins with raw data, usually stored in object storage, relational tables, a lakehouse, application repositories, content-management systems, or operational databases. A processing pipeline extracts text or other features, splits long content into chunks, enriches each chunk with metadata, calls an embedding model, validates the output, and writes vectors to a database. A retrieval API receives user queries, embeds the query, applies metadata filters and access-control filters, searches the vector index, optionally combines results with keyword search, reranks candidates, and returns citations or source records to the application.

Figure 3: Vector search reference architecture: ingestion and embedding workers feed a vector database and metadata store, while the query path combines filters, ANN search, reranking, and observability.

The most important design decision is to keep the vector and its metadata together logically, even if they are physically stored in different systems. A vector without metadata is hard to govern. A document chunk without a model version is hard to re-embed safely. A search result without a source pointer cannot be cited. A vector collection without access-control attributes can leak restricted information to users.

Architecture layer	Responsibility	Typical implementation choices
Source storage	Stores canonical documents, tables, images, and event data.	Object storage, lakehouse tables, CMS, Git repositories, operational databases.
Extraction and cleaning	Converts raw files into normalized text or feature records.	Python, Spark, DuckDB, OCR, parsers, data-quality checks.
Chunking and metadata	Splits records and attaches retrieval filters.	Chunk-size rules, overlap policy, document hierarchy, security labels.
Embedding workers	Generate vectors reproducibly and cost-effectively.	API-based embeddings, open-source models, batch inference jobs.
Vector database	Stores vectors and supports similarity search.	pgvector, Milvus, Qdrant, Weaviate, Pinecone, cloud vector search.
Retrieval API	Serves user queries with filters, ranking, and observability.	FastAPI, model-serving endpoint, application backend, service mesh.
Evaluation and monitoring	Measures quality, latency, drift, and operational health.	Golden query sets, logs, dashboards, tracing, offline evaluation notebooks.

This architecture separates the write path from the read path. The write path is optimized for correctness, backfills, idempotent upserts, and schema evolution. The read path is optimized for latency, filtering, access control, and ranking quality. Combining both concerns in one script is fine for a prototype, but it becomes dangerous when the collection grows or when the system starts serving real users.

Choosing a vector database

The vector database landscape is active and still evolving. Some teams start with PostgreSQL and pgvector because they already operate PostgreSQL and want vectors next to relational metadata. pgvector is open-source vector similarity search for Postgres and preserves familiar database capabilities such as SQL queries, joins, transactions, point-in-time recovery, and operational tooling.^[3] Other teams choose a dedicated vector database such as Milvus when they need distributed scale, independent scaling of query and ingestion components, and specialized ANN operations. Milvus describes itself as an open-source, high-performance vector database that supports ANN search, filtered search, range search, hybrid search, full-text search, reranking, and multiple index algorithms across Lite, Standalone, and Distributed deployment modes.^[4]

Option	Best fit	Strength	Watch carefully
pgvector on PostgreSQL	Small to medium collections, strong relational metadata, local labs.	SQL, transactions, joins, backups, and easy adoption by database teams.	Memory, index build time, and horizontal scaling limits at very large scale.
Milvus	Large-scale vector workloads and distributed serving.	Dedicated vector architecture, multiple index types, scalable query and ingestion components.	More moving parts than a single database.
Qdrant or Weaviate	Application teams building vector-native services.	Developer-friendly APIs and metadata filtering.	Operational maturity, backup strategy, and ecosystem fit.
Managed vector database	Teams prioritizing time-to-market and managed operations.	Reduced infrastructure burden and cloud-native scaling.	Cost visibility, portability, data residency, and vendor-specific behavior.
Search engine with vector support	Hybrid keyword/vector search in an existing search stack.	Mature keyword search, BM25, filters, and relevance tooling.	ANN quality and scaling differ by engine and configuration.

A practical rule is to choose the simplest system that can satisfy the next 12 months of scale, quality, governance, and operational requirements. Starting with pgvector is reasonable for a lab, a prototype, or a product whose vector collection is moderate and metadata joins matter. Moving to a dedicated vector database is reasonable when query volume, memory pressure, index build time, latency targets, or distributed ingestion require a specialized architecture.

20.4 Hybrid Search and Reranking

Pure vector search is powerful, but it is not always enough. Many business queries contain exact tokens that matter. A user searching for “Policy DE-2026-17” expects the exact policy, not a semantically similar policy. A support engineer searching for error code E17 needs lexical precision. A student searching for a course code such as CS-315 expects exact matching. At the same time, a user asking “How do I build a streaming pipeline?” may benefit from semantic retrieval even if the documents use the phrase “event processing.”

Hybrid search combines lexical search and vector search. The lexical branch may use BM25 or SQL full-text search. The vector branch uses embeddings and ANN. Results are fused, filtered, and reranked. Databricks recommends treating retrieval quality as an engineering problem that starts with reproducible evaluation and can be improved through hybrid search, metadata filtering, reranking, query optimization, adaptive retrieval, and better parsing and chunking.^[2]

Figure 4: Hybrid search and reranking flow: lexical and vector candidates are retrieved in parallel, filtered by metadata and permissions, fused, reranked, and evaluated before results are returned.

Retrieval stage	Purpose	Example engineering decision
Query normalization	Clean and classify the user query.	Detect exact IDs, language, entity names, and query intent.
Lexical retrieval	Capture exact matches and rare terms.	Use BM25, PostgreSQL full-text search, or a search engine.
Vector retrieval	Capture semantic matches.	Embed the query and search top-k vectors with metadata filters.
Metadata filtering	Enforce scope, freshness, tenant, and access policy.	Filter by language, department, document status, ACL, and timestamp.
Candidate fusion	Merge lexical and vector candidates.	Use reciprocal rank fusion or weighted score normalization.
Reranking	Improve final ordering with a stronger model or business logic.	Rerank the top 50 candidates into the top 5 to 10 final results.
Evaluation	Measure quality and latency.	Track Recall@k, Precision@k, p95 latency, and zero-result rate.
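The candidate-fusion stage can be illustrated with reciprocal rank fusion (RRF), which combines ranked lists without needing comparable scores. This is a minimal sketch: the constant `k = 60` is a commonly used default rather than a tuned value, and the document identifiers are invented for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each id receives 1 / (k + rank) from every list it appears in
    (rank starts at 1). Ties are broken alphabetically for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical candidates from the two retrieval branches.
lexical = ["policy_de_2026_17", "returns_faq", "kettle_e17"]
vector = ["kettle_e17", "policy_de_2026_17", "streaming_guide"]

print(reciprocal_rank_fusion([lexical, vector]))
```

Documents that appear in both lists rise to the top, which is the behavior a hybrid system usually wants before a reranker sees the candidates.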

The strongest vector systems are not necessarily the systems with the most advanced ANN algorithm. They are the systems with the best end-to-end retrieval process: clean source data, useful chunking, correct metadata, reliable permissions, measured ranking quality, and feedback loops. Vector search should therefore be owned jointly by data engineering, search engineering, ML engineering, security, and application teams.

Evaluation before optimization

Search relevance is easy to debate and hard to improve without data. Before tuning index parameters or changing embedding models, create a golden query set. A golden query set is a table of representative queries, expected relevant documents, expected answer passages, and sometimes unacceptable results. It should include easy questions, ambiguous questions, exact-code queries, long natural-language queries, multilingual queries, and permission-sensitive queries.

Metric	Definition	When it matters most
Recall@k	Fraction of relevant documents found in the top k.	RAG and compliance search where missing a key document is costly.
Precision@k	Fraction of top-k results that are relevant.	Search experiences where users inspect only the first few results.
MRR	Mean reciprocal rank of the first relevant result.	Question answering and navigational search.
nDCG	Ranking quality with graded relevance.	Search systems where some results are partially relevant.
p95 latency	95th percentile response time.	User-facing systems with strict experience targets.
Freshness lag	Time from source update to searchable vector.	Policy, product, inventory, and news-like data.
Cost per 1,000 queries	Serving, embedding, and infrastructure cost normalized by usage.	Commercial products and high-volume internal platforms.
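The first three metrics in the table are simple to compute once a golden query set exists. The sketch below assumes each query has a ranked list of retrieved ids without duplicates and a set of relevant ids; the sample ids are placeholders.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant (assumes k > 0)."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results):
    """results is a list of (retrieved, relevant) pairs, one per golden query."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)

# One golden query with placeholder ids.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}

print(recall_at_k(retrieved, relevant, 5))
print(precision_at_k(retrieved, relevant, 5))
print(reciprocal_rank(retrieved, relevant))
```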

A good benchmark includes both offline and online signals. Offline evaluation uses a fixed dataset and golden labels, so it is reproducible. Online evaluation uses user clicks, thumbs-up/down, conversions, escalation rates, and human review. Offline evaluation is better for controlled engineering changes; online evaluation is better for discovering how real users behave.

20.5 Managing Embedding Pipelines in Production

Embedding pipelines look simple when demonstrated in a notebook, but they become complex in production because they create long-lived derived data. Every vector is tied to a source record, a text extraction method, a chunking strategy, an embedding model, a model version, a normalization rule, and an index configuration. Changing one of those variables may require reprocessing millions of records.

The pipeline should therefore be designed like a data product with lineage. The source document ID, source checksum, extraction version, chunk ID, chunk position, embedding model name, embedding model version, vector dimension, distance metric, created timestamp, and access policy should be stored as first-class fields. If a team cannot answer “Which model generated this vector?” or “Which source document created this chunk?”, it cannot safely debug retrieval quality.

Field	Example	Why it matters
`source_id`	`policy_2026_001`	Links a vector back to the canonical source.
`source_checksum`	SHA-256 hash	Detects whether content changed and needs re-embedding.
`chunk_id`	`policy_2026_001#chunk_004`	Provides a stable key for idempotent upserts.
`chunk_text`	Text passage	Enables reranking, citation, and debugging.
`visibility`	`public`, `internal`, `restricted`	Enforces access-control and display rules.
`embedding_model`	`text-embedding-model-x`	Separates incompatible embedding spaces.
`embedding_version`	`2026-05-01`	Enables controlled re-embedding campaigns.
`index_version`	`hnsw_m16_ef128_v3`	Connects search behavior to index configuration.
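The lineage fields in the table map naturally onto a typed record. The following sketch uses the field names and example values from the table; the helper functions, the three-digit chunk numbering rule, and the sample policy text are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass

def sha256_hex(text: str) -> str:
    """SHA-256 checksum of the text, used to detect content changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def make_chunk_id(source_id: str, position: int) -> str:
    """Stable chunk key derived from the source id and chunk position."""
    return f"{source_id}#chunk_{position:03d}"

@dataclass(frozen=True)
class ChunkRecord:
    source_id: str
    source_checksum: str
    chunk_id: str
    chunk_text: str
    visibility: str          # 'public', 'internal', or 'restricted'
    embedding_model: str
    embedding_version: str
    index_version: str

# Illustrative source text, not taken from the lab data.
source_text = "Refund requests are reviewed by the policy owner."

record = ChunkRecord(
    source_id="policy_2026_001",
    source_checksum=sha256_hex(source_text),
    chunk_id=make_chunk_id("policy_2026_001", 4),
    chunk_text=source_text,
    visibility="internal",
    embedding_model="text-embedding-model-x",
    embedding_version="2026-05-01",
    index_version="hnsw_m16_ef128_v3",
)
print(record.chunk_id, record.source_checksum[:12])
```

Because `chunk_id` is derived only from the source id and position, re-running the pipeline on unchanged input produces the same key, which is what makes upserts idempotent.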

The main production risk is embedding version drift. If half the collection uses one model and half uses another, distances may no longer mean the same thing. The safe migration strategy is to create a new index version, backfill vectors into it, evaluate quality against the old version, shadow traffic if possible, then switch queries only after acceptance. This is similar to a blue-green deployment for services, but the artifact is a vector collection rather than a container image.

Migration step	Purpose	Acceptance criterion
Freeze source snapshot	Make evaluation repeatable.	All documents have stable checksums.
Generate new embeddings	Build a candidate vector space.	No dimension mismatch; failure rate below threshold.
Build new index	Tune ANN parameters independently.	Index build completes within operational window.
Run golden benchmark	Compare quality and latency.	Recall@k and p95 latency meet target.
Shadow production queries	Observe real query behavior safely.	Error rate and cost remain acceptable.
Switch traffic	Promote the new index.	Rollback plan and previous index remain available.
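The acceptance criteria can be encoded as an explicit gate so that promotion is a recorded decision rather than a judgment made in a meeting. This is a sketch only: the class name, the function name, and every threshold value are hypothetical and should come from the team's own targets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexBenchmark:
    recall_at_k: float
    p95_latency_ms: float
    embedding_failure_rate: float

def promotion_checks(candidate, current, *, max_p95_ms=200.0,
                     max_failure_rate=0.001, max_recall_drop=0.01):
    """Return one named boolean per acceptance criterion (thresholds are placeholders)."""
    return {
        "recall_not_worse": candidate.recall_at_k >= current.recall_at_k - max_recall_drop,
        "latency_within_target": candidate.p95_latency_ms <= max_p95_ms,
        "failures_below_threshold": candidate.embedding_failure_rate <= max_failure_rate,
    }

# Illustrative benchmark results for the old and new index versions.
current = IndexBenchmark(recall_at_k=0.95, p95_latency_ms=140.0, embedding_failure_rate=0.0)
candidate = IndexBenchmark(recall_at_k=0.96, p95_latency_ms=130.0, embedding_failure_rate=0.0005)

checks = promotion_checks(candidate, current)
print(checks, "promote" if all(checks.values()) else "hold")
```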

Cost management is another major concern. Embedding generation can become expensive when documents change frequently or when the team repeatedly re-embeds the entire corpus for small experiments. Cache embeddings by source checksum and model version. Re-embed only changed chunks. Use batch inference when possible. Separate experimental collections from production collections. Track cost per 1,000 embedded chunks and cost per 1,000 served queries.
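Caching by checksum and model version can be sketched as a lookup that decides which chunks actually need a new embedding call. The function names and the in-memory set used as a cache are assumptions; a production pipeline would more likely keep these keys in a table.

```python
import hashlib

def cache_key(chunk_text, embedding_model, embedding_version):
    """Key that changes when the text, the model, or the model version changes."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return (digest, embedding_model, embedding_version)

def plan_reembedding(chunks, cache, embedding_model, embedding_version):
    """Return the chunk ids whose embeddings are missing from the cache.

    chunks maps chunk_id to chunk text; cache is a set of cache keys
    for embeddings that already exist.
    """
    return [
        chunk_id
        for chunk_id, text in chunks.items()
        if cache_key(text, embedding_model, embedding_version) not in cache
    ]

model, version = "text-embedding-model-x", "2026-05-01"

# Illustrative chunks that were embedded in an earlier run.
old_chunks = {
    "doc1#chunk_000": "Returns are handled by the seller policy team.",
    "doc1#chunk_001": "Contact support when the kettle shows E17.",
}
cache = {cache_key(text, model, version) for text in old_chunks.values()}

# Only one chunk changed, so only one chunk should be re-embedded.
new_chunks = dict(old_chunks)
new_chunks["doc1#chunk_001"] = "Contact support or the seller when the kettle shows E17."
print(plan_reembedding(new_chunks, cache, model, version))
```

Changing `embedding_version` invalidates every key, which is the intended behavior for a controlled re-embedding campaign.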

20.6 Guided Lab: Governed Semantic Search with pgvector

The guided lab for this chapter builds a governed semantic-search prototype using the files in `shared/labs/ch20_vector_search_pgvector/`. The lab includes a pgvector-oriented schema, Docker Compose for readers who want to run PostgreSQL locally, a deterministic Python search harness that requires no paid embedding API, source-document metadata, golden queries, expected output, and a dependency-free validator. The instructor solution lives in `shared/solutions/ch20_vector_search_pgvector/solution.md`.

Figure 5: Guided lab architecture: Docker Compose can run PostgreSQL with pgvector, the starter script creates deterministic local embeddings, SQL stores vectors beside metadata, and benchmark evidence records recall, latency, and governance decisions.

Lab scenario

You are building a semantic search service for TuranMart’s product, support, and policy content. The service should index approved passages and return the most relevant snippets for natural-language questions. The first version uses a local pgvector-style schema because it teaches vectors, metadata, SQL filtering, exact baselines, and ANN promotion criteria in one familiar workflow. Later, the same pipeline can be adapted to Milvus or another vector database if scale requires it.

Component	Local lab choice	Why this choice is useful for learning
Database	PostgreSQL with pgvector schema	Teaches vectors, metadata, SQL filtering, and indexes in one system.
Source data	TuranMart product, support, and policy snippets	Keeps the lab business-oriented and reproducible.
Prototype	Dependency-free Python script	Makes embedding, scoring, filtering, and evaluation explicit before adding infrastructure.
Embeddings	Deterministic hash-based vectors	Avoids external API dependencies while preserving the data-engineering workflow.
Evaluation	Golden query CSV	Teaches that search quality must be measured before index promotion.

Step 1: Inspect the governed source inventory

Open `data/source_documents.csv` and verify that every passage has a stable `source_id`, `chunk_id`, owner, language, visibility level, source URI, checksum, token count, section title, and chunk text. These fields are not decoration. They are the governance layer that turns a vector into an auditable data product. A system that stores only anonymous vectors cannot explain results, refresh changed content, enforce role filters, or debug ranking failures.

Step 2: Study the pgvector schema

The lab schema in `sql/01_schema.sql` creates a `document_chunks` table with vectors and metadata side by side. It also keeps conventional B-tree indexes for metadata filters and introduces an HNSW index as a candidate optimization. The exact-search query remains in the schema because exact or high-recall retrieval is the baseline against which approximate indexes should be judged. The excerpt below shows only the extension and table definition.

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS document_chunks (
    chunk_id TEXT PRIMARY KEY,
    source_id TEXT NOT NULL,
    source_uri TEXT NOT NULL,
    owner TEXT NOT NULL,
    visibility TEXT NOT NULL CHECK (visibility IN ('public', 'internal', 'restricted')),
    language TEXT NOT NULL DEFAULT 'en',
    section_title TEXT NOT NULL,
    chunk_text TEXT NOT NULL,
    source_checksum TEXT NOT NULL,
    embedding_model TEXT NOT NULL,
    embedding_version TEXT NOT NULL,
    embedding_dimension INTEGER NOT NULL CHECK (embedding_dimension = 384),
    embedding vector(384) NOT NULL
);
```

The example uses `vector(384)` because many small local embedding models produce 384-dimensional vectors. If you use a different model, change the dimension in both the schema and the loader. Dimension mismatches should fail early because mixing dimensions or model families in one index makes results unreliable.

Step 3: Run the local prototype and validator

From the lab directory, run the validator first:

```bash
python3 tests/validate_vector_search_lab.py
```

The expected output is:

```text
PASS validate_source_documents
PASS validate_golden_queries
PASS validate_schema_sql
PASS validate_sample_output
PASS Chapter 20 vector search guided-lab artifacts are internally consistent
```

Then run the starter script:

```bash
python3 starter_semantic_search.py
```

The script uses deterministic hash-based embeddings so the lab can run anywhere with Python. The rankings are not meant to represent production semantic quality. They are a workflow harness that lets you test source metadata, visibility filters, output shape, and golden-query evaluation before replacing embed_text with a real embedding model.
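To make the idea of a deterministic hash-based embedding concrete, here is one possible way such a function could work, using the feature-hashing trick. It is a sketch written for this explanation and is not the lab's actual `embed_text`; the tokenization rule, the use of SHA-256, and the sign trick are all assumptions.

```python
import hashlib
import math
import re

def embed_text(text: str, dimensions: int = 384) -> list:
    """Map text to a unit-length vector by hashing each token into a bucket.

    hashlib is used instead of the built-in hash() because hash() is
    salted per process and would not be reproducible across runs.
    """
    vector = [0.0] * dimensions
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dimensions  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0              # reduces collision bias
        vector[index] += sign
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return vector  # empty or fully cancelled input stays as the zero vector
    return [v / norm for v in vector]

first = embed_text("What should I do when the smart kettle shows E17?")
second = embed_text("What should I do when the smart kettle shows E17?")
print(len(first), first == second)
```

Vectors built this way capture only token overlap, not meaning, which is why the rankings should be treated as a workflow check rather than a measure of semantic quality.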

Step 4: Add an ANN index and benchmark it

Start with exact search to establish a baseline. Then create an ANN index such as HNSW or IVFFlat (pgvector's IVF-style index) if your pgvector version and workload support it. The benchmark should compare latency and Recall@k against the exact baseline. If the ANN index returns faster results but misses critical documents, the index parameters or retrieval strategy need adjustment.
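The statements involved in this step can be kept as constants in a small Python module so that the benchmark script and the loader share them. The sketch below only builds strings and does not connect to a database. The index name, the `m` and `ef_construction` values, and the `%(name)s` placeholder style (as used by psycopg) are assumptions; the `<=>` cosine-distance operator, the `hnsw` index method, `vector_cosine_ops`, and the `hnsw.ef_search` setting come from pgvector.

```python
# Exact baseline: with no ANN index on the column, ORDER BY distance scans every row.
EXACT_QUERY = """
SELECT chunk_id, source_uri, embedding <=> %(query)s::vector AS cosine_distance
FROM document_chunks
WHERE visibility = ANY(%(allowed)s)
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""

# Candidate ANN index (hypothetical name; parameter values are starting points, not tuned).
CREATE_HNSW_INDEX = """
CREATE INDEX IF NOT EXISTS document_chunks_embedding_hnsw
ON document_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""

def ef_search_statement(ef_search: int) -> str:
    """Build the session setting that controls HNSW search breadth."""
    if not isinstance(ef_search, int) or ef_search < 1:
        raise ValueError("ef_search must be a positive integer")
    return f"SET hnsw.ef_search = {ef_search};"

def to_vector_literal(values) -> str:
    """Format a Python sequence in pgvector's text form, for example '[0.1,0.2]'."""
    return "[" + ",".join(f"{float(v):.6f}" for v in values) + "]"

print(ef_search_statement(128))
print(to_vector_literal([0.5, 1]))
```

With an approximate index, pgvector applies the `WHERE` filter after the index scan, so a selective visibility filter may return fewer than `k` rows. That behavior is worth including in the benchmark rather than assuming it away.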

Experiment	Index configuration	Recall@5	p95 latency	Decision
Baseline	Exact scan	1.00	850 ms	Too slow for interactive use but useful as quality reference.
HNSW small	Lower search breadth	0.88	70 ms	Fast, but recall may be too low for RAG.
HNSW tuned	Higher search breadth	0.96	130 ms	Good candidate if latency target is below 200 ms.
Hybrid + rerank	ANN + lexical + reranker	0.98	420 ms	Best quality, but cost and latency must be justified.

These numbers are illustrative. Readers should replace them with measured results from their own machine and dataset. The important lesson is the method: define a target, measure the baseline, tune one variable at a time, and keep the benchmark results in version control.
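A benchmark row can be produced from per-query measurements with two small functions. This sketch compares ANN results with the exact baseline and computes p95 latency with the nearest-rank method; the function names, the sample chunk ids, and the latency values are made up for illustration.

```python
def p95(latencies_ms):
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]

def recall_against_exact(exact_ids, ann_ids):
    """Fraction of the exact top-k ids that the ANN search also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact baseline returned no ids")
    return len(exact & set(ann_ids)) / len(exact)

def summarize(runs):
    """runs is a list of (exact_ids, ann_ids, ann_latency_ms), one per golden query."""
    recalls = [recall_against_exact(exact, ann) for exact, ann, _ in runs]
    return {
        "mean_recall": sum(recalls) / len(recalls),
        "p95_latency_ms": p95([latency for _, _, latency in runs]),
    }

# Two illustrative golden queries with k = 5.
runs = [
    (["c1", "c2", "c3", "c4", "c5"], ["c1", "c2", "c3", "c4", "c5"], 62.0),
    (["c6", "c7", "c8", "c9", "c10"], ["c6", "c7", "c8", "c11", "c12"], 71.0),
]
print(summarize(runs))
```

A real run needs many more queries than this before p95 is meaningful; with only a handful of samples the percentile is simply the slowest query.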

20.7 Common Pitfalls and Operational Lessons

The first pitfall is treating the vector database as magic. If the source data is messy, the chunks are too large, the metadata is missing, and the evaluation set is weak, switching from one vector database to another will not fix the system. Retrieval quality begins before the vector index.

The second pitfall is ignoring permissions. Many internal knowledge bases contain documents with different access levels. The vector database must store access-control metadata, and the retrieval API must enforce permissions before results are shown to the user or sent to an LLM.
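Permission enforcement can be sketched as a mapping from roles to allowed visibility levels, applied before any passage reaches the user or the LLM. The visibility values match the lab schema; the role names, the mapping, and the sample candidates are hypothetical.

```python
# Hypothetical role-to-visibility policy; unknown roles get no access.
ROLE_VISIBILITY = {
    "customer": {"public"},
    "support_agent": {"public", "internal"},
    "risk_analyst": {"public", "internal", "restricted"},
}

def allowed_visibility(role):
    """Visibility levels a role may read (deny by default)."""
    return ROLE_VISIBILITY.get(role, set())

def filter_candidates(candidates, role):
    """Drop every candidate the role is not allowed to see, preserving rank order."""
    allowed = allowed_visibility(role)
    return [c for c in candidates if c["visibility"] in allowed]

# Illustrative ranked candidates from a retrieval step.
candidates = [
    {"chunk_id": "kettle_e17#chunk_000", "visibility": "public"},
    {"chunk_id": "vendor_risk#chunk_002", "visibility": "restricted"},
    {"chunk_id": "returns#chunk_001", "visibility": "internal"},
]

print([c["chunk_id"] for c in filter_candidates(candidates, "support_agent")])
```

Filtering after retrieval, as shown here, can leave fewer than k results. Where the database supports it, passing `allowed_visibility(role)` into the query as a metadata filter is usually the safer design, with this check kept as a second line of defense.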

The third pitfall is optimizing latency without preserving quality. Databricks explicitly emphasizes that retrieval tuning should start with a reproducible evaluation framework rather than guesswork.^[2] If engineers only measure p95 latency, they may choose an index configuration that is fast but misses relevant documents. If they only measure Recall@k, they may build a system that is accurate but too slow or expensive for users.

Pitfall	Symptom	Practical fix
Mixed embedding versions	Similar documents stop ranking consistently.	Store model version and migrate with a blue-green index strategy.
Weak chunking	Results contain partial or irrelevant context.	Tune chunk size and overlap against golden queries.
Missing metadata	Users cannot filter by source, date, tenant, or permission.	Define a metadata schema before loading vectors.
No exact baseline	ANN tuning is based on opinion.	Keep an exact or high-recall evaluation collection.
Over-reliance on vector search	Exact IDs and rare terms are missed.	Add lexical search and hybrid fusion.
No freshness monitoring	Updated documents do not appear in search.	Track source-to-index lag and failed upserts.
Uncontrolled re-embedding	Costs spike after model experiments.	Cache by checksum and model version; re-embed only changed chunks.
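
The last row of the table can be made concrete with a small sketch. It assumes a hypothetical in-memory set of cache keys and a `chunks` dictionary of chunk ID to text. In a real pipeline the keys would more likely live in a metadata table next to the vectors.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def cache_key(chunk_text: str, model_version: str) -> str:
    """Combine the embedding model version with a SHA-256 checksum of the chunk text."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return f"{model_version}:{digest}"


def chunks_to_embed(
    chunks: Dict[str, str], model_version: str, cached_keys: Set[str]
) -> List[Tuple[str, str, str]]:
    """Return (chunk_id, key, text) only for chunks with no cached embedding for this model version."""
    pending = []
    for chunk_id, text in chunks.items():
        key = cache_key(text, model_version)
        if key not in cached_keys:
            pending.append((chunk_id, key, text))
    return pending


# Hypothetical state: chunk "a" was already embedded with model version "v1".
cached = {cache_key("Descale the kettle monthly.", "v1")}
chunks = {
    "a": "Descale the kettle monthly.",          # unchanged
    "b": "Error E17 means the sensor is dry.",   # new or edited
}
pending_v1 = chunks_to_embed(chunks, "v1", cached)
pending_v2 = chunks_to_embed(chunks, "v2", cached)
```

Because the model version is part of the key, an experiment with a new model re-embeds everything once, while routine runs under the same model only pay for chunks whose text changed.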

The fourth pitfall is forgetting that vector search is part of the user experience. Search results should be explainable. A RAG answer should cite sources. A support tool should show the retrieved passage, not only the final generated answer. A benchmark should include both machine metrics and human review for representative queries.

Exercises

1. Build a sizing spreadsheet for a vector collection with 2 million, 20 million, and 200 million chunks at 384, 768, and 1,536 dimensions. Estimate raw vector storage, metadata storage, index overhead, replication, and monthly backup cost.
2. Create a golden query set for ten questions about this book. For each question, record the expected chapter, expected section, and at least one relevant passage. Use the set to compare exact search, ANN search, and hybrid search.
3. Modify the lab schema to include additional access-control metadata such as department, tenant_id, or role_policy. Write a query that ensures users only retrieve documents they are allowed to see.
4. Compare two chunking strategies: 250-token chunks with 50-token overlap and 600-token chunks with 100-token overlap. Measure Precision@5 and manually inspect whether the returned passages are more useful for RAG.
5. Write an architecture decision record explaining whether a startup should begin with pgvector, Milvus, or a managed vector database. Include scale, latency, cost, operational maturity, team skill, portability, and data-residency requirements.
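
As a starting point for the sizing exercise, the raw-storage arithmetic can be sketched as follows. It assumes uncompressed 4-byte float32 values. The index-overhead fraction and replica count are placeholder assumptions that you should replace with figures measured for your own index type and deployment.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors: one fixed-width value per dimension (float32 assumed)."""
    return num_vectors * dimensions * bytes_per_value


def estimate_total_bytes(
    num_vectors: int, dimensions: int, index_overhead: float, replicas: int
) -> float:
    """Raw vectors plus an assumed index-overhead fraction, multiplied by the number of replicas."""
    return raw_vector_bytes(num_vectors, dimensions) * (1 + index_overhead) * replicas


def to_gib(num_bytes: float) -> float:
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


# Raw storage for every combination in the exercise, keyed by (chunks, dimensions).
sizes = {
    (n, d): raw_vector_bytes(n, d)
    for n in (2_000_000, 20_000_000, 200_000_000)
    for d in (384, 768, 1536)
}
```

For example, 2 million vectors at 384 dimensions occupy 2,000,000 × 384 × 4 = 3,072,000,000 bytes, roughly 2.9 GiB, before metadata, index structures, replication, or backups are added.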

Review Questions

Question	What a strong answer should include
Why should an embedding be treated as a derived data product?	Source lineage, model/version metadata, chunking rules, checksums, reproducible backfills, and evaluation evidence.
When is exact vector search still useful?	Small collections, debugging, compliance-sensitive matching, and offline quality baselines for ANN comparison.
Why is hybrid search often better than pure vector search?	It preserves exact-token precision while adding semantic recall, especially for error codes, product IDs, policies, and rare terms.
What can go wrong when embedding model versions are mixed?	Distances between vectors from different models are not meaningful because the vectors may not share the same semantic space, so rankings become inconsistent.
What must be measured before promoting an ANN index?	Recall@k or Precision@k against a baseline, p95 latency, cost, failure cases, and permission-filter correctness.

Chapter Summary

Embeddings and vector databases turn unstructured information into searchable data products. The central engineering task is not only to store vectors; it is to build a repeatable pipeline that extracts content, chunks it, generates versioned embeddings, stores metadata, indexes vectors, serves filtered queries, evaluates retrieval quality, and monitors cost and freshness. Exact search provides a baseline, while ANN techniques such as IVF, HNSW, DiskANN-style indexes, and vector quantization make large-scale search practical when tuned carefully.

Production vector systems are most reliable when they combine semantic retrieval with lexical search, metadata filtering, reranking, and disciplined evaluation. pgvector is an excellent local and relational starting point, while Milvus and other dedicated vector databases become attractive when scale, performance, and independent component scaling matter. The same data-engineering principles from the rest of this book still apply: design for reproducibility, observability, governance, cost control, and safe change management.

This chapter completes Part 5, where you connected data engineering with modern AI systems: RAG pipelines, ML pipeline engineering, feature stores, model serving, and vector search. In the final part of the book, you will bring these capabilities together in capstone case studies that show how production data platforms solve real business problems end to end.

References

Footnotes

Microsoft Learn, Understanding Vector Databases.
pgvector GitHub Repository, Open-source Vector Similarity Search for Postgres.
Databricks Documentation, Vector Search Retrieval Quality Guide.
Milvus Documentation, What is Milvus?.