Part 4: How Vector Indexes Work Internally: HNSW, IVF, PQ, And The Algorithms Behind Them

A vector database has one brutal problem: given a query vector, find the closest vectors among millions or billions of stored vectors.

The naive solution is simple:

For every vector:
  compute distance(query, vector)
return the top k closest

That is exact search. It works beautifully for small datasets. But if you have 100 million vectors, each with 768 dimensions, brute force means a huge number of floating-point operations per query.

Vector indexes exist to avoid scanning everything.

They do this by being a little clever and a little imperfect. Most production vector indexes are approximate nearest neighbor indexes. They trade a small amount of accuracy for a large gain in speed.

The Core Idea

Every vector index answers the same question:

How can we avoid comparing the query against every vector?

Different indexes use different data structures:

Index Families (Mental Model)

Index Type	Data Structure Idea
HNSW	Multi-layer proximity graph
IVF	K-means clusters plus inverted lists
IVFPQ	IVF plus compressed vector codes
LSH	Hash tables that preserve locality
DiskANN	Graph index optimized for SSD storage

Let’s start with the two big ones: HNSW and IVF.

HNSW: Search As Graph Navigation

HNSW stands for Hierarchical Navigable Small World.

The data structure is a graph. Each vector is a node. Edges connect nearby vectors.

At the bottom layer, almost every node exists. This layer contains many short-range connections between nearby points.

Above that, HNSW creates sparse layers. Only some nodes appear in higher layers. The higher the layer, the fewer nodes there are. This is similar in spirit to a skip list: upper layers let you move quickly across the space, while lower layers refine the search.

Visually:

Layer 3:        A ----------- M
Layer 2:        A ----- F ---- M
Layer 1:   A -- C -- F -- H -- M
Layer 0: A-B-C-D-E-F-G-H-I-J-K-L-M

Not literally a line, of course. In real vector space, this is a graph over high-dimensional points.

HNSW Search

HNSW search starts from an entry point, usually one of the nodes in the top layer.

At upper layers, it performs greedy graph search:

current = entry_point
 
for each layer from top to bottom:
  while there is a neighbor closer to the query:
    move to that neighbor

This gives a fast rough location.

At the bottom layer, HNSW does a wider search. It maintains candidate nodes using priority queues. One heap tracks nodes to explore. Another tracks the best results found so far.

The parameter efSearch controls how wide this bottom-layer search is.

Higher efSearch = better recall, more work
Lower efSearch = faster search, more misses

So HNSW is not just “walk to the nearest neighbor.” It is closer to a best-first graph traversal with a bounded search budget.

HNSW Insertion

When inserting a new vector, HNSW randomly assigns it a maximum layer. Most vectors only live at the bottom. A few are promoted to higher layers. Very few reach the top.

Then it searches for good neighbors at each layer and connects the new node to them.

The main parameter is M, the maximum number of neighbors per node. Larger M gives more routes through the graph, which usually improves recall, but increases memory and build cost.

During construction, efConstruction controls how carefully neighbors are selected.

Higher efConstruction = better graph, slower indexing
Lower efConstruction = faster indexing, weaker graph

From a data structures perspective, HNSW is:

Adjacency lists + random levels + greedy search + priority queues

Its strength is fast, high-recall search when the graph fits in memory.

IVF: Search By Partitioning Space

IVF stands for Inverted File.

The name comes from classic search engines. In text search, an inverted index maps a word to a posting list of documents:

"database" -> [doc1, doc7, doc19]

IVF does something similar for vectors, but instead of words, it uses cluster centroids.

First, the system runs k-means over training vectors. Suppose we choose nlist = 1000. That means we create 1000 centroids.

Each vector is assigned to its nearest centroid.

centroid_17 -> [vector ids near centroid_17]
centroid_42 -> [vector ids near centroid_42]
centroid_913 -> [vector ids near centroid_913]

Each centroid owns an inverted list.

IVF Search

At query time, IVF does not search every vector. It first compares the query to the centroids.

Then it chooses the closest nprobe centroids and scans only those lists.

query -> find nearest centroids
      -> scan vectors in those selected lists
      -> return top k

The parameter nprobe controls the search width.

Higher nprobe = search more clusters, better recall
Lower nprobe = search fewer clusters, lower latency

A rough cost model looks like this:

Brute force:  O(N * d)
 
IVF:          O(nlist * d) + O((N / nlist) * nprobe * d)

Where:

N = number of vectors
d = vector dimensions
nlist = number of clusters
nprobe = number of clusters searched

The approximation comes from cluster boundaries. A true nearest neighbor may live in a nearby cluster that was not probed.

IVF is basically:

k-means partitioning + hash-table-like buckets + partial scanning

It shines when datasets are large, relatively stable, and batch/GPU search matters.

IVFFlat Versus IVFPQ

In IVFFlat, the vectors inside each inverted list are stored as full-precision vectors. Once IVF selects lists to scan, distances are computed against real vectors.

That means:

IVFFlat = approximate cluster selection + exact distance inside selected clusters

In IVFPQ, vectors are compressed using Product Quantization.

PQ splits a vector into smaller subvectors. Each subvector is replaced by the ID of its nearest learned codeword.

For example, a 768-dimensional vector might be split into 96 chunks of 8 dimensions. Each chunk is quantized separately.

Instead of storing:

[0.13, -0.82, 0.44, ...]

PQ stores compact codes:

[12, 201, 44, 8, ...]

This massively reduces memory. It also speeds up distance approximation using lookup tables.

The tradeoff is accuracy. You are no longer comparing against the original vector, but against a compressed approximation.

IVFFlat = more memory, better accuracy
IVFPQ   = less memory, faster scans, more approximation

Why HNSW And IVF Feel So Different

HNSW is graph-first.

It asks:

Can I navigate from a known point to better and better neighbors?

IVF is partition-first.

It asks:

Can I divide space into regions and search only the most promising regions?

HNSW usually has excellent recall-latency behavior for in-memory search. IVF is often easier to scale to huge datasets, especially when combined with compression and GPU-friendly scanning.

Quick Comparison

Index	Main Data Structure	Best For	Tradeoff
HNSW	Layered graph	Low-latency, high-recall in-memory search	Higher memory use
IVF	Centroids + inverted lists	Large-scale approximate search	Can miss boundary neighbors
IVFPQ	IVF + compressed codes	Billion-scale search with limited memory	Lower precision
DiskANN	SSD-aware graph	Very large datasets beyond RAM	More complex storage/search design
LSH	Locality-preserving hash tables	Some special similarity workloads	Often weaker recall/space tradeoffs for modern embeddings

Configuring Vector Indexes In Real Databases

The same algorithmic ideas show up across products, but the knobs are named differently. Some systems expose HNSW/IVF directly. Others, like Pinecone, manage the internals and ask you to configure higher-level choices like dimension, metric, cloud, and index type.

Index Knobs Across Databases

Concept	pgvector	MongoDB Atlas	Milvus	Qdrant	Pinecone
Distance metric	opclass like `vector_cosine_ops`	`similarity`	`metric_type`	`distance`	`metric`
HNSW connections	`m`	`maxEdges`	`M`	`m`	managed
HNSW build quality	`ef_construction`	`numEdgeCandidates`	`efConstruction`	`ef_construct`	managed
HNSW query quality	`hnsw.ef_search`	`numCandidates` at query time	`ef`	query/search params	managed/top-k/filtering
IVF clusters	`lists`	not exposed as IVF	`nlist`	not IVF-style	managed
IVF probes	`ivfflat.probes`	not exposed as IVF	`nprobe`	not IVF-style	managed

pgvector

pgvector exposes both HNSW and IVFFlat directly.

CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE items (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(1536)
);

For HNSW with cosine distance:

CREATE INDEX items_embedding_hnsw_idx
ON items
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

At query time:

SET hnsw.ef_search = 100;
 
SELECT id, content
FROM items
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;

For IVFFlat:

CREATE INDEX items_embedding_ivfflat_idx
ON items
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

At query time:

SET ivfflat.probes = 10;
 
SELECT id, content
FROM items
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;

Rule of thumb: use HNSW for better recall-latency behavior; use IVFFlat when you want faster builds and lower memory, especially after bulk loading data.

Pinecone

Pinecone generally does not ask you to choose HNSW or IVF directly. It manages the index internals. You configure the vector shape, similarity metric, deployment type, and region.

from pinecone import Pinecone, ServerlessSpec
 
pc = Pinecone(api_key="YOUR_API_KEY")
 
pc.create_index(
    name="docs-index",
    vector_type="dense",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ),
    deletion_protection="disabled"
)

Querying is configured through things like top_k, metadata filters, namespace, and whether you use dense, sparse, or hybrid retrieval.

index = pc.Index("docs-index")
 
results = index.query(
    vector=query_embedding,
    top_k=10,
    include_metadata=True,
    filter={"tenant_id": {"$eq": "acme"}}
)

So with Pinecone, think less “tune HNSW parameters” and more “choose metric, dimension, deployment, metadata model, and retrieval pipeline.”

MongoDB Atlas Vector Search

MongoDB Atlas exposes HNSW-style configuration through Vector Search index definitions.

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine",
      "quantization": "scalar",
      "indexingMethod": "hnsw",
      "hnswOptions": {
        "maxEdges": 16,
        "numEdgeCandidates": 100
      }
    },
    {
      "type": "filter",
      "path": "tenant_id"
    }
  ]
}

Query example:

db.docs.aggregate([
  {
    $vectorSearch: {
      index: "docs_vector_index",
      path: "embedding",
      queryVector: queryEmbedding,
      numCandidates: 100,
      limit: 10,
      filter: { tenant_id: "acme" }
    }
  }
])

In MongoDB terms:

maxEdges is similar to HNSW M
numEdgeCandidates is similar to efConstruction
numCandidates controls how many candidates are explored at query time
quantization can be none, scalar, or binary

Milvus

Milvus exposes many index types directly, including HNSW, IVF_FLAT, IVF_PQ, and DISKANN.

HNSW:

index_params = MilvusClient.prepare_index_params()
 
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    index_name="embedding_hnsw_idx",
    metric_type="COSINE",
    params={
        "M": 16,
        "efConstruction": 100
    }
)

Search:

search_params = {
    "params": {
        "ef": 100
    }
}

IVF_FLAT:

index_params.add_index(
    field_name="embedding",
    index_type="IVF_FLAT",
    index_name="embedding_ivf_idx",
    metric_type="COSINE",
    params={
        "nlist": 128
    }
)

Search:

search_params = {
    "params": {
        "nprobe": 10
    }
}

Qdrant

Qdrant uses HNSW as its core vector index and lets you tune it at collection level.

from qdrant_client import QdrantClient, models
 
client = QdrantClient(url="http://localhost:6333")
 
client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE
    ),
    hnsw_config=models.HnswConfigDiff(
        m=16,
        ef_construct=100,
        full_scan_threshold=10000
    )
)

You can also configure quantization:

client.update_collection(
    collection_name="docs",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True
        )
    )
)

The Practical Rule

If the database exposes HNSW knobs:

Increase m / maxEdges / M -> better graph, more memory.
Increase efConstruction / numEdgeCandidates -> better build quality, slower indexing.
Increase efSearch / numCandidates / ef -> better recall, slower queries.

If the database exposes IVF knobs:

Increase lists / nlist -> more clusters, smaller scan regions.
Increase probes / nprobe -> search more clusters, better recall, slower queries.

If the database is fully managed:

Tune metric, dimensions, metadata filters, top_k, namespaces, reranking, and evaluation.

The best configuration is not universal. Start with defaults, build a small retrieval evaluation set, and tune for your actual recall, latency, memory, and cost targets.

The Big Picture

Vector indexes are not magic. They are classic algorithms wearing modern AI clothing.

HNSW borrows from graphs, greedy search, skip-list-like layering, and priority queues.

IVF borrows from clustering, inverted indexes, partitioning, and partial scanning.

PQ borrows from compression, quantization, and lookup-table acceleration.

All of them are trying to spend distance computations where they matter most.

That is the heart of vector indexing:

Do not compare against everything.
Build a data structure that points search toward likely neighbors.
Accept a controlled amount of approximation.
Tune recall, latency, memory, and build cost.

Once you see the data structures underneath, vector databases feel much less mysterious. They are search engines built on graphs, clusters, heaps, arrays, and clever shortcuts.