
Chapter 16: Vector Databases and Embeddings

In our exploration of data engineering for RAG applications, we introduced two fundamental concepts at the heart of the modern AI stack: embeddings and vector databases. Embeddings are numerical representations of data, the language that machine learning models understand; vector databases are the specialized systems designed to store, index, and query these embeddings at massive scale. While we touched on these concepts in the context of RAG, their importance extends far beyond LLM applications. From recommendation systems and image search to anomaly detection and drug discovery, the ability to represent data as vectors and perform efficient similarity search is a foundational capability for a wide range of AI and ML tasks.

This chapter is dedicated to a deep dive into the world of vector databases and embeddings. We will start by building a more formal understanding of what embeddings are and how they are generated. We will then explore the core problem that vector databases are designed to solve: the challenge of performing fast and accurate similarity search on billions of high-dimensional vectors. We will look at the key indexing algorithms that make this possible, and we will survey the rapidly growing landscape of open-source and managed vector databases. Finally, we will discuss some of the practical challenges of managing embedding pipelines in production. By the end of this chapter, you will have a solid foundation in this critical area of AI data engineering and be ready to start building your own vector-powered applications.

16.1 Embeddings: The Lingua Franca of AI

An embedding is a learned representation of a piece of data as a dense vector of real numbers, typically far lower-dimensional than the raw input (a few hundred to a few thousand dimensions, rather than, say, a vocabulary-sized one-hot encoding). The key idea is that the embedding captures the semantic meaning of the data: data points that are semantically similar have embeddings that are close to each other in the vector space, as measured by a metric such as cosine similarity or Euclidean distance.
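To make "close in the vector space" concrete, here is a minimal pure-Python sketch of cosine similarity, the most common metric for comparing embeddings. The vectors here are toy values, not real model outputs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; real embeddings have hundreds of dimensions.
cat = [0.9, 0.1, 0.8, 0.2]
kitten = [0.85, 0.15, 0.75, 0.25]
car = [0.1, 0.9, 0.2, 0.8]

print(cosine_similarity(cat, kitten))  # close to 1.0: semantically similar
print(cosine_similarity(cat, car))     # much smaller: semantically distant
```

Because semantically related items point in similar directions, their cosine similarity is high even if their magnitudes differ, which is why many embedding models normalize their outputs to unit length.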

While embeddings are most commonly associated with text, you can create embeddings for a wide variety of data types:

  - Text: words, sentences, paragraphs, or entire documents

  - Images: for visual similarity search and image retrieval

  - Audio: speech, music, and environmental sounds

  - Code: functions and snippets for code search

  - Structured entities: users, products, and graph nodes, often learned from interaction data

How are Embeddings Generated?

Embeddings are generated by training a deep learning model on a large dataset. The model learns to map the input data to a vector representation in a way that preserves the semantic relationships in the data. For example, a text embedding model might be trained on a massive corpus of text with the objective of predicting the next word in a sentence. In the process of learning to do this, the model learns to create embeddings that capture the meaning of words and their relationships to each other.

As we discussed in the RAG chapter, there are two main options for generating embeddings:

  1. Use a Pre-trained Model via an API: Services like OpenAI, Cohere, and Google’s Vertex AI provide easy-to-use APIs for generating high-quality, general-purpose embeddings.

  2. Use an Open-Source Model: You can use a pre-trained model from a library like Hugging Face’s sentence-transformers and run it on your own infrastructure. This gives you more control and can be more cost-effective, but it also requires more engineering effort.
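Whichever option you choose, the calling pattern is similar: send texts to the model in batches to respect rate limits and amortize per-request overhead. The sketch below illustrates this pattern with a hypothetical `embed_in_batches` helper; `embed_stub` is a deterministic hashing stand-in for a real embedding model, used here only so the example runs without a model or network access:

```python
import hashlib

def embed_stub(text: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in for a real embedding model (illustration only).

    A real implementation would call an embedding API or a local model here.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def embed_in_batches(texts: list[str], embed_fn, batch_size: int = 32) -> list[list[float]]:
    """Embed texts in fixed-size batches, as one would when calling a paid API."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        vectors.extend(embed_fn(t) for t in batch)
    return vectors

documents = ["vector databases", "embeddings", "similarity search"]
vectors = embed_in_batches(documents, embed_stub, batch_size=2)
```

In production you would also add retries, caching of already-embedded texts, and back-off on rate-limit errors around the `embed_fn` call.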

16.2 The Challenge of Vector Search: Finding a Needle in a Billion-Needle Haystack

Once you have your embeddings, you need a way to search through them. The core problem is this: given a query vector, find the k vectors in your database that are most similar to it. This is known as the k-Nearest Neighbor (k-NN) problem.

For a small number of vectors, you can solve this problem with a simple brute-force search: compute the distance between the query vector and every other vector in the database, then take the top k. This is exact, but it costs O(n × d) work per query for n vectors of d dimensions, which is not feasible for real-world applications where you might have billions of vectors, each with hundreds or even thousands of dimensions.
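A minimal sketch of this brute-force baseline, in pure Python with Euclidean distance:

```python
import math

def knn_brute_force(query: list[float], vectors: list[list[float]], k: int = 3) -> list[int]:
    """Exact k-nearest neighbors by scanning every vector: O(n * d) per query."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Rank all indices by distance to the query and keep the k closest.
    ranked = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return ranked[:k]

database = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [9.0, 9.5]]
print(knn_brute_force([9.5, 9.5], database, k=2))  # indices of the two closest vectors
```

Besides being slow at scale, the full scan also requires keeping every vector in memory or streaming it from disk on each query, which is exactly the overhead ANN indexes are built to avoid.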

This is where Approximate Nearest Neighbor (ANN) algorithms come in. ANN algorithms trade a small amount of accuracy for a massive improvement in search speed. Instead of guaranteeing that they will find the exact nearest neighbors, they provide a high probability of finding them, and they do so in a fraction of the time.

Key ANN Indexing Algorithms

There are several different families of ANN algorithms, each with its own trade-offs between index build time, memory footprint, query speed, and recall:

  - Hash-based: locality-sensitive hashing (LSH) maps similar vectors into the same buckets

  - Tree-based: space-partitioning trees, as used by libraries such as Annoy

  - Clustering-based: inverted-file (IVF) indexes that search only the clusters closest to the query

  - Quantization-based: product quantization (PQ), which compresses vectors to shrink memory usage and speed up distance computations

  - Graph-based: navigable small-world graphs such as HNSW, which currently offer some of the best speed/recall trade-offs
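To give a flavor of how ANN indexes cut down the search space, here is a minimal sketch of the clustering idea behind inverted-file (IVF) indexes: vectors are grouped under coarse centroids at build time, and a query probes only the nearest cluster or clusters. Real implementations learn the centroids with k-means and usually combine this with quantization; the fixed centroids here are for illustration only.

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf_index(vectors, centroids):
    """Assign every vector to its nearest centroid, forming the 'inverted lists'."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: euclidean(v, centroids[c]))
        lists[nearest].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe clusters nearest the query, not the whole database."""
    probe = sorted(range(len(centroids)), key=lambda c: euclidean(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: euclidean(query, vectors[i]))[:k]

centroids = [[0.0, 0.0], [10.0, 10.0]]          # coarse cluster centers
vectors = [[0.5, 0.5], [1.0, 0.0], [9.0, 9.0], [10.0, 11.0]]
lists = build_ivf_index(vectors, centroids)
result = ivf_search([10.0, 10.5], vectors, centroids, lists, k=1)
```

The accuracy/speed trade-off is explicit in the `nprobe` parameter: probing more clusters raises recall at the cost of scanning more candidates, which is exactly the knob IVF-style indexes expose in practice.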

16.3 The Vector Database Landscape

The market for vector databases is one of the most active and rapidly evolving areas of the data infrastructure landscape. There are a growing number of open-source and managed vector databases to choose from.

Open-Source Vector Databases

Popular open-source options include Milvus, Weaviate, Qdrant, and Chroma. These can be self-hosted, giving you full control over deployment, cost, and data residency, at the price of operating the system yourself.

Managed Vector Databases

Fully managed services such as Pinecone, along with hosted offerings of the open-source engines above, remove the operational burden of running the database yourself, in exchange for less control and usage-based pricing.

Vector Search in Existing Databases

In addition to these specialized vector databases, many existing databases are adding support for vector search. For example, PostgreSQL has the pgvector extension, and Redis has the RediSearch module. While these can be a good option for getting started, they may not provide the same level of performance and scalability as a dedicated vector database for very large-scale applications.

16.4 Managing Embedding Pipelines in Production

Building a production-grade vector search application involves more than just choosing a vector database. You also need to build and manage the data pipeline that generates the embeddings and keeps the vector database up-to-date.

Key Challenges:

  - Keeping embeddings fresh: when source documents change, their embeddings must be regenerated and the index updated, ideally incrementally rather than by re-embedding everything

  - Embedding model versioning: vectors produced by different model versions are not comparable, so upgrading the model means re-embedding the entire corpus

  - Consistency: the vector index must stay in sync with the system of record; deletions in particular are easy to miss

  - Cost and throughput: embedding APIs have rate limits and per-token pricing, so batching and caching matter

  - Monitoring: track index size, query latency, and recall against a brute-force baseline
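One common way to keep the index up-to-date without re-embedding everything is to track a content hash per document and re-embed only what has changed. A minimal sketch with hypothetical helper names (the hash store and document dictionaries stand in for your vector database's metadata and your system of record):

```python
import hashlib

def content_hash(text: str) -> str:
    """Fingerprint of a document's content; changes whenever the text changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_updates(docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Decide which documents to (re-)embed and which stale ids to delete.

    docs: {doc_id: text} -- the current system of record
    indexed_hashes: {doc_id: content hash} -- what the vector index currently holds
    """
    to_embed = [doc_id for doc_id, text in docs.items()
                if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in docs]
    return to_embed, to_delete

docs = {"a": "hello", "b": "world (edited)"}
indexed = {"a": content_hash("hello"), "b": content_hash("world"), "c": content_hash("removed doc")}
to_embed, to_delete = plan_updates(docs, indexed)
```

New and edited documents land in `to_embed`, unchanged ones are skipped, and ids that no longer exist in the source are flagged for deletion, which addresses the easy-to-miss deletion case directly.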

Chapter Summary

In this chapter, we have taken a deep dive into the world of vector databases and embeddings, two of the most important foundational technologies for modern AI applications. We have learned what embeddings are and how they are generated. We have explored the challenge of vector search and the key ANN algorithms that make it possible to perform similarity search on billions of vectors. We have surveyed the rapidly growing landscape of vector databases, and we have discussed some of the practical challenges of managing embedding pipelines in production.

This chapter concludes our exploration of the key technologies in the world of data engineering for AI. We have covered RAG, ML pipelines, feature stores, and vector databases. In the final part of the book, we will bring all these concepts together and look at how they are used to solve real-world business problems in a series of case studies.