Vector Databases Explained: Embeddings, ANN Indexes & Semantic Search (Visualized)
A vector database stores and searches high-dimensional embedding vectors produced by ML models, enabling semantic search, recommendation, and RAG at scale. This guide covers similarity metrics, HNSW and IVF indexes, real systems (Pinecone, Weaviate, Milvus, pgvector), and RAG pipelines, with live animations of each core idea.
A vector database is a storage and retrieval system purpose-built for high-dimensional embedding vectors, enabling fast nearest-neighbor search by meaning rather than exact keyword match. Where a traditional database asks "does this row equal that value?", a vector database asks "which stored vectors are most similar to this query vector?" That is a fundamentally different operation, and it powers modern AI features like semantic search, recommendation engines, and retrieval-augmented generation (RAG).
The rise of large language models made vector databases mainstream. Every time an embedding model encodes a sentence, it produces a list of floating-point numbers (perhaps 768 or 1536 of them) that capture semantic meaning as a point in high-dimensional space. Two semantically similar sentences land close together; unrelated ones land far apart. A vector database indexes millions of these points so a similarity search completes in milliseconds, not minutes.
Embeddings: Turning Meaning into Coordinates
An embedding is a dense numerical vector produced by a neural network that encodes the semantic content of an input. An embedding model (such as OpenAI text-embedding-3-small, Sentence-BERT, or CLIP for images) maps raw data into a continuous vector space where geometric distance correlates with semantic similarity. A sentence about "machine learning" and one about "deep learning" will have vectors separated by a small angle; a sentence about "gardening" will sit at a much wider angle from both, though real embeddings of unrelated text are rarely exactly orthogonal.
The dimensionality matters. A 128-dimensional vector is compact but coarse; a 3072-dimensional vector carries far more nuance but costs more memory and compute. In production you index millions of these vectors and need to answer queries in tens of milliseconds, which is exactly what a vector database is optimised for.
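To make the memory side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes 4-byte float32 values and counts only the raw vectors, not any index overhead; the function name and the 10-million figure are illustrative choices, not part of any library.

```python
def raw_vector_memory_gib(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Memory for the raw vectors alone, in GiB (index overhead excluded)."""
    return num_vectors * dims * bytes_per_value / 2**30

# Compare a few common embedding sizes at 10 million vectors (assumed corpus size).
for dims in (128, 768, 1536, 3072):
    gib = raw_vector_memory_gib(10_000_000, dims)
    print(f"{dims:>4} dims x 10M vectors = {gib:.1f} GiB")
```

Memory grows linearly with dimensionality, so moving from 128 to 3072 dimensions multiplies the raw footprint by 24.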
Similarity Metrics: Cosine, Dot Product, and Euclidean
Three distance functions dominate vector search. Cosine similarity measures the angle between two vectors, ignoring magnitude, which makes it ideal for text embeddings where the length of a vector is irrelevant to its meaning. A cosine similarity of 1.0 means identical direction; 0 means orthogonal (unrelated); -1 means opposite. Dot product similarity rewards both angle and magnitude, and is used when magnitudes carry information (e.g., in some recommendation models). On unit-length vectors, which is how OpenAI returns its embeddings, the dot product gives the same ranking as cosine similarity and is cheaper to compute. Euclidean distance (L2) measures straight-line distance in the vector space and is natural for image or sensor data where absolute position matters.
| Metric | Formula (simplified) | Best for | Notes |
|---|---|---|---|
| Cosine similarity | A·B / (‖A‖ ‖B‖) | Text embeddings, NLP | Magnitude-invariant; range [-1, 1] |
| Dot product | A·B | Recommendation, unit-normalized embeddings (e.g., OpenAI) | Magnitude matters; same ranking as cosine on unit vectors |
| Euclidean (L2) | √(Σ(Aᵢ − Bᵢ)²) | Image / sensor embeddings | Sensitive to magnitude; most intuitive |
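The three metrics in the table can be written out in a few lines of plain Python. This is a minimal sketch for small lists of floats; the helper names and the sample vectors are illustrative, and a real system would use vectorized math.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, twice the length
c = [-1.0, -2.0, -3.0]  # opposite direction to a

print("cosine(a, b)   =", cosine_similarity(a, b))   # direction only, length ignored
print("cosine(a, c)   =", cosine_similarity(a, c))
print("dot(a, b)      =", dot(a, b))                 # grows with magnitude
print("euclid(a, b)   =", euclidean_distance(a, b))  # nonzero despite same direction
print("dot on units   =", dot(normalize(a), normalize(b)))
```

Note how `a` and `b` are identical under cosine similarity but not under Euclidean distance, and how the dot product of the normalized vectors matches the cosine similarity.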
The Curse of Dimensionality and Why Exact Search Fails
With 768 dimensions and 10 million stored vectors, a brute-force nearest-neighbor search computes 10 million dot products per query. At around 1 microsecond each that is 10 seconds per query, which is completely unusable. Worse, in high-dimensional spaces all points tend to become roughly equidistant from each other, a phenomenon called the curse of dimensionality, which makes exact space-partitioning indexes such as k-d trees degrade toward a full scan. Both problems force vector databases to use approximate nearest-neighbor (ANN) algorithms that trade a small accuracy loss for orders-of-magnitude speed gains.
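For reference, this is what the exact baseline looks like. The sketch below is a toy: it uses squared Euclidean distance rather than dot product (so a stored vector is always its own nearest neighbor), small random data, and the per-comparison cost quoted above as an assumed constant.

```python
import heapq
import random

def brute_force_knn(query, vectors, k):
    """Exact k-NN: score every stored vector, keep the k smallest distances."""
    scored = ((sum((q - x) ** 2 for q, x in zip(query, vec)), idx)
              for idx, vec in enumerate(vectors))
    return [idx for _, idx in heapq.nsmallest(k, scored)]

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
query = vectors[42]
top = brute_force_knn(query, vectors, k=3)
print("nearest ids:", top)  # the query vector itself comes first

# Cost estimate from the text: one comparison per stored vector.
n_vectors = 10_000_000
seconds_per_comparison = 1e-6  # assumed figure, taken from the paragraph above
estimated_seconds = n_vectors * seconds_per_comparison
print(f"estimated scan time: {estimated_seconds:.0f} s per query")
```

The work is linear in the number of stored vectors, which is the cost an ANN index is built to avoid.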
ANN Index 1: HNSW (Hierarchical Navigable Small World)
HNSW is the most popular ANN index in production today (used by Weaviate, Milvus, Qdrant, and pgvector). It builds a multi-layer proximity graph inspired by skip lists. The top layer is a sparse graph of long-range connections; lower layers are progressively denser. A query starts at an entry point in the top layer and greedily hops to closer neighbors layer by layer, funneling into an ever-tighter neighborhood until the bottom layer yields the actual approximate k-nearest neighbors.
Key parameters: M controls how many neighbors each node connects to (higher M = better recall, more memory); ef_construction controls build-time search depth (higher = better graph quality, slower build); ef_search controls query-time beam width (higher = better recall, slower query). HNSW achieves sub-linear query time (roughly O(log N) in practice) and is highly parallelisable.
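The layered descent can be sketched in plain Python. This is a toy under loud assumptions: the graph is built by brute force (each node linked to its M nearest neighbors on every layer it belongs to) instead of HNSW's incremental insertion with ef_construction and neighbor pruning, so it illustrates only the query path. All function names here are hypothetical.

```python
import heapq
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_layers(vectors, M=4, seed=0):
    """Toy build: random top level per node, brute-force M-nearest links per layer."""
    rng = random.Random(seed)
    # Exponentially decaying level assignment: most nodes only reach layer 0.
    levels = [int(-math.log(1.0 - rng.random()) / math.log(M)) for _ in vectors]
    layers = []
    for level in range(max(levels) + 1):
        members = [i for i, top in enumerate(levels) if top >= level]
        graph = {i: set() for i in members}
        for i in members:
            nearest = sorted((j for j in members if j != i),
                             key=lambda j: dist(vectors[i], vectors[j]))[:M]
            for j in nearest:
                graph[i].add(j)
                graph[j].add(i)  # bidirectional; real HNSW also prunes to cap degree
        layers.append(graph)
    entry = max(range(len(vectors)), key=lambda i: levels[i])
    return layers, entry

def search_layer(vectors, graph, query, entry_points, ef):
    """Best-first search on one layer, keeping the ef closest nodes seen so far."""
    visited = set(entry_points)
    candidates = [(dist(query, vectors[i]), i) for i in entry_points]  # min-heap
    heapq.heapify(candidates)
    best = [(-d, i) for d, i in candidates]  # max-heap via negated distance
    heapq.heapify(best)
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # closest open candidate is worse than everything we keep
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(query, vectors[nb])
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current farthest
    return sorted((-nd, i) for nd, i in best)

def hnsw_search(vectors, layers, entry, query, k, ef_search=10):
    current = [entry]
    for graph in reversed(layers[1:]):  # upper layers: greedy hop, ef = 1
        current = [search_layer(vectors, graph, query, current, ef=1)[0][1]]
    found = search_layer(vectors, layers[0], query, current, ef=max(ef_search, k))
    return [i for _, i in found[:k]]

rng = random.Random(1)
vectors = [[rng.random(), rng.random()] for _ in range(300)]
layers, entry = build_layers(vectors, M=4)
query = [0.5, 0.5]
approx = hnsw_search(vectors, layers, entry, query, k=3, ef_search=20)
exact = sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))[:3]
print("approximate:", approx)
print("exact:      ", exact)
```

Raising `ef_search` widens the beam on the bottom layer, which is the recall-versus-latency knob described above; the two printed lists may differ, since the search is approximate.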
ANN Index 2: IVF (Inverted File Index)
IVF partitions the vector space into a fixed number of clusters (a parameter often called nlist) using k-means clustering at build time. Each cluster has a centroid. At query time, the search finds the nprobe closest centroids and only scans vectors in those clusters, skipping the vast majority of the dataset. IVF is highly memory-efficient (especially when combined with product quantization, IVF-PQ) and works well when you can afford a build step. HNSW typically wins on recall-vs-latency for real-time serving; IVF-PQ wins when you must compress millions of vectors into RAM.
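A minimal IVF sketch in plain Python follows. It assumes a tiny Lloyd's k-means with a fixed iteration count and stores whole vectors in each list (no product quantization); the function names and the `n_clusters` parameter are illustrative, not any library's API.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(v, centroids):
    return min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))

def kmeans(vectors, n_clusters, iters=10, seed=0):
    """Plain Lloyd's algorithm; centroids start as randomly sampled data points."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for v in vectors:
            buckets[nearest_centroid(v, centroids)].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(vectors, n_clusters):
    """Build step: cluster, then file each vector id under its nearest centroid."""
    centroids = kmeans(vectors, n_clusters)
    inverted_lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        inverted_lists[nearest_centroid(v, centroids)].append(idx)
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, k, nprobe):
    """Query step: scan only the nprobe lists whose centroids are closest."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in inverted_lists[c]]
    candidates.sort(key=lambda idx: sq_dist(query, vectors[idx]))
    return candidates[:k]

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(500)]
centroids, inverted_lists = build_ivf(vectors, n_clusters=8)
query = vectors[7]
fast = ivf_search(query, vectors, centroids, inverted_lists, k=5, nprobe=2)
full = ivf_search(query, vectors, centroids, inverted_lists, k=5, nprobe=8)
print("nprobe=2:", fast)
print("nprobe=8:", full)  # probing every list is equivalent to exact search
```

With `nprobe` equal to the number of clusters the search is exact; lowering it scans fewer vectors and may drop true neighbors that sit just across a cluster boundary.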
Vector Databases in the Wild: Pinecone, Weaviate, Milvus, pgvector
The vector database ecosystem has grown rapidly since 2022. Each system makes different trade-offs between ease of use, scalability, and integration with existing infrastructure.
| System | Index | Deployment | Metadata filtering | Best for |
|---|---|---|---|---|
| Pinecone | Proprietary (HNSW-like) | Fully managed SaaS | Yes (pre-filter) | Teams that want zero ops |
| Weaviate | HNSW | Self-hosted / cloud | Yes (hybrid BM25+vector) | Hybrid search, multi-modal |
| Milvus | HNSW, IVF, DiskANN | Self-hosted / Zilliz cloud | Yes (rich) | Billion-scale, complex queries |
| pgvector | IVFFlat, HNSW (v0.5+) | PostgreSQL extension | Full SQL | Existing Postgres users |
| Qdrant | HNSW | Self-hosted / cloud | Yes (payload filters) | Rust performance, filtering |
Retrieval-Augmented Generation (RAG) with Vector DBs
RAG is the dominant pattern for giving LLMs access to private or up-to-date knowledge. The pipeline works in two phases. Ingestion: documents are split into chunks, each chunk is embedded with an embedding model, and the vectors are stored in a vector database alongside the original text. Retrieval at query time: the user's question is embedded with the same model, a nearest-neighbor search retrieves the top-k relevant chunks, and those chunks are injected into the LLM prompt as context, letting the model answer with grounded facts instead of hallucinating.
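The chunking step of ingestion could look something like the sketch below. The word-based window, the default sizes, and the record layout are all assumptions made for illustration; the only detail taken from this article is that the chunk text is kept in a `text` metadata field so it can be read back at query time.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of words (sizes are counted in words)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def build_records(doc_id: str, text: str, chunk_size: int = 200, overlap: int = 50) -> list[dict]:
    """Hypothetical record layout: an id plus the chunk text as metadata.
    An embedding still has to be attached to each record before it is stored."""
    return [
        {"id": f"{doc_id}-{n}", "metadata": {"text": chunk}}
        for n, chunk in enumerate(chunk_text(text, chunk_size, overlap))
    ]

sample = " ".join(str(i) for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
print(build_records("doc1", sample, chunk_size=4, overlap=1)[0])
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some words twice.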
from openai import OpenAI
import pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = pinecone.Pinecone(api_key="YOUR_KEY")
index = pc.Index("knowledge-base")

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def rag_query(question: str, top_k: int = 5) -> str:
    # 1. Embed the question
    q_vec = embed(question)
    # 2. Retrieve top-k similar chunks from the vector DB
    results = index.query(vector=q_vec, top_k=top_k, include_metadata=True)
    context = "\n\n".join(m["metadata"]["text"] for m in results["matches"])
    # 3. Prompt the LLM with retrieved context
    prompt = f"""Answer using ONLY the context below.\n\nContext:\n{context}\n\nQuestion: {question}"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

Metadata Filtering and Hybrid Search
Pure vector search retrieves by semantic similarity but cannot enforce hard constraints like "only documents from 2024" or "only products under $50". Production vector databases support metadata filtering: each vector is stored with a JSON payload, and filters are applied either before the ANN search (pre-filter, safer but slower) or after (post-filter, faster but may miss results). Hybrid search combines a sparse keyword index (BM25) with the dense vector index, merging results with reciprocal rank fusion, which is useful when exact keyword matching matters alongside semantic relevance.
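Both ideas fit in a short sketch. Two assumptions to flag: an exact scan stands in for the ANN index so the filtering behaviour is easy to see, and the fusion uses the commonly cited constant k = 60. The item layout, ids, and function names are made up for the example.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, items, k):
    """Exact scan standing in for the ANN index."""
    return sorted(items, key=lambda it: sq_dist(query, it["vector"]))[:k]

def pre_filter_search(query, items, predicate, k):
    """Filter first, then search only the vectors that pass."""
    return top_k(query, [it for it in items if predicate(it["payload"])], k)

def post_filter_search(query, items, predicate, k):
    """Search first, then drop hits that fail the filter (may return fewer than k)."""
    return [it for it in top_k(query, items, k) if predicate(it["payload"])]

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

items = [
    {"id": "a", "vector": [0.0, 0.0], "payload": {"year": 2023}},
    {"id": "b", "vector": [0.1, 0.0], "payload": {"year": 2023}},
    {"id": "c", "vector": [0.5, 0.5], "payload": {"year": 2024}},
    {"id": "d", "vector": [0.9, 0.9], "payload": {"year": 2024}},
]

def is_2024(payload):
    return payload["year"] == 2024

query = [0.0, 0.0]
pre = [it["id"] for it in pre_filter_search(query, items, is_2024, k=2)]
post = [it["id"] for it in post_filter_search(query, items, is_2024, k=2)]
print("pre-filter: ", pre)   # both 2024 documents are found
print("post-filter:", post)  # the two nearest hits were from 2023, so nothing survives

bm25_ranking = ["d3", "d1", "d7"]   # hypothetical keyword results
dense_ranking = ["d1", "d4", "d3"]  # hypothetical vector results
print("fused:", reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

Documents that appear in both rankings rise to the top of the fused list, which is why the fusion works without having to put BM25 scores and vector distances on a common scale.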
Frequently Asked Questions
What is the difference between a vector database and a traditional database?
Traditional databases answer equality and range queries (WHERE price < 50) using B-tree or hash indexes. Vector databases answer similarity queries ("find the 10 vectors closest to this query vector") using ANN indexes like HNSW or IVF. The two access patterns are fundamentally different: B-trees are useless in 768-dimensional space, and HNSW cannot answer WHERE price < 50. In practice, production systems often combine both: pgvector adds vector search to Postgres, so you get SQL predicates and vector search in one query.
When should I use HNSW versus IVF?
Use HNSW when you need low-latency real-time queries (under 10 ms), frequent inserts, and can afford the higher memory footprint: the graph links add roughly 100–200 bytes per vector at typical M values, on top of the raw vector itself. HNSW supports incremental inserts without rebuilding. Use IVF (especially IVF-PQ with product quantization) when you have a massive static dataset that must fit in limited RAM, can tolerate a build step, and are willing to tune nprobe to balance recall versus speed. Many systems let you choose among them: Milvus supports HNSW, IVF, IVF-PQ, and DiskANN index types.
How do vector databases scale to billions of vectors?
At billion scale, three techniques are essential. Sharding: the vector set is partitioned across multiple nodes; a query fans out to all shards and the top-k results are merged. Quantization: product quantization (PQ) compresses each vector from, say, 3072 floats (12 KB) to 64–128 bytes, reducing memory by roughly 100–200× at the cost of a small recall loss. DiskANN: systems like Microsoft's DiskANN store the graph on NVMe SSD rather than RAM, reducing cost dramatically while staying within single-digit millisecond latency for queries on billion-scale datasets. Milvus and Weaviate both support distributed deployments with sharding and quantization built in, and Milvus also offers a DiskANN index.
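The sharding fan-out and merge can be sketched in a few lines. This assumes each shard is a plain dict of id to vector and is searched exactly; in a real deployment the per-shard search would be an ANN index on a separate node, and the function names here are illustrative.

```python
import heapq

def search_shard(query, shard, k):
    """Top-k within one shard; returns (squared distance, vector id) pairs."""
    scored = ((sum((q - x) ** 2 for q, x in zip(query, vec)), vid)
              for vid, vec in shard.items())
    return heapq.nsmallest(k, scored)

def scatter_gather(query, shards, k):
    """Fan the query out to every shard, then merge the partial top-k lists."""
    partials = [search_shard(query, shard, k) for shard in shards]
    return heapq.nsmallest(k, (hit for part in partials for hit in part))

shards = [
    {"a": [0.0, 0.0], "b": [5.0, 5.0]},
    {"c": [1.0, 0.0], "d": [0.5, 0.0]},
]
hits = scatter_gather([0.0, 0.0], shards, k=2)
print([vid for _, vid in hits])
```

Each shard only needs to return its own top k, because no vector outside a shard's local top k can appear in the global top k.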
A vector database does not replace your relational database; it replaces the brute-force loop you would otherwise write to find semantic neighbors. Add an ANN index, and meaning becomes searchable.
- alokknight Engineering
