Indexing Basics

7 minutes

Vector databases utilize a crucial element known as the “Vector Index” to process data. The creation of this index involves applying an algorithm to the vector embeddings stored within the database. This algorithm functions to map these vectors to a specialized data structure, facilitating rapid searches. Searches are more efficient this way due to the index’s condensed representation of the original vector data. This compactness reduces memory requirements and enhances accessibility when contrasted with processing searches via raw embeddings.

You don’t need to know the details of creating indices themselves with KDB.AI; they are easily created through simple commands that you choose. However, grasping the fundamental workings of vector indices and their various forms can be helpful in knowing which one to choose and when. This article discusses three primary categories of vector indices:

  • Flat (e.g. Brute Force)
  • Graph (e.g. HNSW)
  • Inverted (e.g. IVF, IVF-PQ)

Flat Index

Flat indices (otherwise known as “Brute Force”) deliver the best accuracy of all indexing methods, however, they are notoriously slow. This is because flat indices are a direct representation of the vector embedding. Unlike other indexing methods, generating flat indices does not use pre-trained clusters or any other modification of the vector embeddings. With flat indices, search is exhaustive: it’s performed from the query vector across every single vector embedding and distances are calculated for each pair. Then, k number of embeddings are returned using k-nearest neighbors (kNN) search:

This type of pairwise distance calculation doesn’t seem very efficient (hint: it’s not), however, the high accuracy of search results is sometimes required depending on your application. Specifically, here are some scenarios in which flat indexing is beneficial:

  • Low-Dimensional Data: With low-dimensional vectors, typically up to a few dozen dimensions, flat indices can be sufficiently fast while maintaining high accuracy.
  • Small-Scale Databases: When the database contains a relatively small number of vectors, a flat index can be sufficient to provide quick retrieval times.
  • Simple Querying: If the primary use case involves simple queries, a flat index can offer competitive performance compared to other indices without the complexity.
  • Real-time Data Ingestion: When vectors are continuously added to the database, they must be indexed quickly. Due to the simplicity of flat indexing, minimal computation is needed to generate the new indices.
  • Low Query Volume: If the rate of queries being performed on the database is low, a flat index can handle these queries effectively.
  • Benchmarking Comparisons: In situations where you want to evaluate the accuracy of other index methods, using the perfectly accurate flat index can be used as a benchmark for comparison purposes.

Graph Index

Graph indices use nodes and edges to construct a network-like structure. The nodes (or “vertices”) represent the vector embeddings while the edges represent the relationships between embeddings. The most common type of graph index is Hierarchical Navigable Small Words (HNSW).

HNSW is a “proximity graph” where two embedding vertices are linked based on their proximity – often defined by Euclidean Distance. Each vertex in the graph connects to other vertices called “friends” and each vertex keeps a “friends list.” Starting at a predefined “entry point”, the algorithm traverses connected friend vertices until it finds the nearest neighbor to the query vector:

Each traversal is continued on via Nearest Neighboring vertices in the “friends list” until no nearer vertex is found – a local minimum is reached for distance.

The entry point is typically on high-degree vertices (vertices with many connections) to reduce the chance of stopping early by starting on low-degree vertices:

The number of degrees on the entry vertex can be specified and is typically balanced between recall and search speed.

Rather than doing this across one layer (which would take considerable time and memory), the index space is split into hierarchical layers, thus the “H” in HNSW:

The top layer has the longest distances but includes the highest degree vertices. The algorithm starts by traversing the longer distances until it finds a local minimum with the high degree vertices, then it drops down a layer where more “friend” vertices are available to the previous layer’s high degree vertex. This traversal continues until the nearest neighbor’s distance to the query vector reaches a local minimum.

Specifically, here are the scenarios where HNSW indexing makes the most sense:

  • High-Dimensional Data: HNSW is specifically designed for high-dimensional data, typically hundreds or thousands of dimensions.
  • Efficient Nearest Neighbor Search: HNSW is well-suited for applications where quickly finding the most similar data points to a given query is critical, such as recommendation systems, content-based image retrieval, and natural language processing tasks.
  • Approximate Nearest Neighbor Search: While HNSW is primarily designed for accurate nearest neighbor search, it can also be used for approximate nearest neighbor search tasks where the user seeks to minimize search computation cost.
  • Large-Scale Databases: HNSW is designed to scale well with large datasets.
  • Real-time and Dynamic Data: HNSW is capable of accommodating dynamically changing data, such as real-time updates.
  • Highly-Resourced Environments: HNSW’s performance is not solely reliant on the memory of a single machine, making it well-suited to distributed and parallel computing environments.

Inverted Index (and Product Quantization)

Inverted indices are widely used in search engines because they provide an efficient way to find document contents. Imagine you have a collection of documents, like web pages in a search engine database. Each document contains words, and you want to be able to quickly find all the documents that contain a particular word. For each unique word in the entire collection of documents, the inverted index creates a list of document references or identifiers (such as document IDs) where that word appears:

This is an initial broad stroke of the indexing. We can further optimize the index with “Product Quantization”.

Product Quantization works in 4 steps:

  1. Take a high dimensional vector
  2. Split into equally sized sub-vectors
  3. Assign each subvector to its nearest centroid (defined by kmeans clustering)
  4. Replace each centroid value with Unique IDs

Once the quantized subvectors are available, they can be arranged in Voronoi cells. This is shown below with subvectors in yellow, Voronoi cells bounded by white lines, and their centroids in white circles. The query vector is the green star:

The Voronoi cells chosen are defined by the query vector and probe parameter. Search is restricted to the cell the query vector lies in, and nearest centroids of surrounding cells. Below, a probe parameter of 3 grabs the 3 nearest cells to the query vector:

This yields a huge reduction in memory usage and search time, with great recall. Like HNSW, it’s well-suited to high dimensional vectors and when you need fast nearest neighbor search. In fact, IVFPQ shares a lot of similar benefits as HNSW ranging from Highly-Resourced Environments, to High-Dimensionality Vectors, to Large-Scale Databases. It would be more useful to highlight the nuanced differences in the two index types:

IVFPQ Index is better for:

  • Accuracy: Excels when high precision of nearest neighbors is important.
  • Approximate Search: Both exact and approximate nearest neighbor search are used, however, it’s particularly strong in scenarios where exact results are desired.
  • Memory Efficiency: Due to the use of product quantization, IVFPQ is very memory efficient.

HNSW Index is better for:

  • Approximate Search: HNSW is designed for efficient approximate nearest neighbor calculations. If your goal is to find a relatively good solution quickly, even if it’s not the exact nearest neighbor, choose HNSW.
  • Simplicity: HNSW is relatively simpler to implement and configure compared to IVFPQ. There are fewer steps in constructing the index since PQ is not involved.
  • Scalability: Handles large datasets with ease and is adaptive to real-time and dynamic data.
  • Storage: HNSW does not require PQ, and therefore needs less storage space.
  • Speed (over accuracy): If you can accept approximate results, the speed benefits of HNSW may be worth it.

Final Considerations

The benefits of each of the three index types described here can be summarized in the following table:

Retrieval SpeedLowVery HighHigh
Indexing SpeedVery HighHighModerate
Database SizeLowHighHigh
AccuracyVery HighModerateHigh
DimensionalityLowVery HighHigh
Query VolumeLowHighHigh

It’s important to remember that KDB.AI does vector indexing for you through simple commands, however, this article covered some important considerations and inner workings of how those indices are created as well as the specific tradeoffs to help you better decide how to use them.

To index your own vector embeddings, read our documentation.