In our ongoing effort to improve the efficiency and scalability of vector indexing, we've introduced a new vector index called qHNSW. This index addresses one of the primary limitations of existing vector indices by allowing on-disk storage with memory-mapped access. The result is a more scalable and efficient indexing solution that can handle large datasets without being bound by available memory.
The Problem with Traditional Vector Indices
Traditional vector indices, such as HNSW, require the entire index to be stored in memory. While this approach provides fast search speeds, it becomes a significant bottleneck when dealing with large datasets or limited memory resources. To address this, we've developed qHNSW, which stores the index on disk and maps it into memory as needed.
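To make the mechanism concrete, here is a minimal, self-contained sketch of memory-mapped access in general, using NumPy's memmap. This illustrates the technique, not qHNSW's internals: the operating system pages in only the parts of the file a query actually reads, so memory usage stays proportional to the data touched rather than the size of the whole index. The file name and sizes are placeholders.
import numpy as np
# Hypothetical on-disk store of 100,000 vectors with 1536 dimensions (~600 MB file);
# only a small write buffer lives in RAM while populating it.
num_vectors, dims = 100_000, 1536
vectors = np.memmap("vectors.bin", dtype="float32", mode="w+", shape=(num_vectors, dims))
vectors[:1000] = np.random.rand(1000, dims).astype("float32")
vectors.flush()
# Reopen read-only: slicing pulls just the needed pages from disk, not the whole file.
on_disk = np.memmap("vectors.bin", dtype="float32", mode="r", shape=(num_vectors, dims))
query = np.random.rand(dims).astype("float32")
distances = np.linalg.norm(on_disk[:1000] - query, axis=1)  # touches ~1000 rows only
nearest = int(np.argmin(distances))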
Key Benefits of qHNSW
qHNSW offers several benefits over traditional vector indices:
- Reduced Memory Footprint: Data inserts use far less memory than with an in-memory HNSW index.
- Incremental Disk Access: Data searches read from disk incrementally, keeping memory utilization extremely low.
- Cost Effectiveness: On-disk storage is generally less expensive and consumes less power than in-memory storage.
- Improved Scalability: With qHNSW, users can create as many indexes as disk space allows and search across all of them.
Scalability and Performance
The scalability benefits of qHNSW are significant. As long as disk space is available, an index can grow and be searched without being limited by available memory. This is particularly beneficial for large-scale applications that require fast, efficient indexing.
Implementation Schema
# Set up the schema and indexes for a KDB.AI table, specifying an embedding column with 1536 dimensions, Euclidean distance (L2), and a qHNSW index
pdf_schema = [
    {"name": "document_id", "type": "bytes"},
    {"name": "text", "type": "bytes"},
    {"name": "embedding", "type": "float64s"},
]

indexes = [
    {
        "name": "qhnsw_index",
        "type": "qHnsw",
        "column": "embedding",
        "params": {"dims": 1536, "metric": "L2"},
    }
]

table = database.create_table("pdf", schema=pdf_schema, indexes=indexes)
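As a rough continuation of the example above, the sketch below inserts a few rows and queries the qhnsw_index. The insert and search calls follow the pattern used by the KDB.AI Python client (table.insert on a pandas DataFrame, table.search keyed by index name), but treat the exact signatures as assumptions and check the documentation for your client version; the embeddings here are random placeholders standing in for real model output.
import numpy as np
import pandas as pd

# Insert a couple of rows into the table created above.
rows = pd.DataFrame({
    "document_id": [b"doc-1", b"doc-2"],
    "text": [b"first chunk of text", b"second chunk of text"],
    "embedding": [np.random.rand(1536).tolist(), np.random.rand(1536).tolist()],
})
table.insert(rows)

# Retrieve the 3 nearest neighbours of a query vector from the qHNSW index.
query_vector = np.random.rand(1536).tolist()
results = table.search(vectors={"qhnsw_index": [query_vector]}, n=3)
print(results[0])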
See a full sample in our GitHub repository, or open the code directly in Google Colab.
Assumptions and Considerations
To take full advantage of qHNSW, users will need to ensure:
- Read access to the disk storage holding existing index data
- Write access to the disk for new data
Additionally, disk speed has a direct impact on search speed, so it's worth ensuring your storage is as fast as practical.
Try it Today
qHNSW represents a significant advancement in vector indexing technology, allowing users to break free from memory-bound limitations and scale their applications with ease. With its fast search speeds, low memory utilization, and improved scalability, qHNSW is poised to revolutionize the way we approach large-scale vector indexing challenges.
Take a look at our qHNSW documentation and sample code to learn more.
We’re excited to bring this innovative solution to our community and look forward to hearing your feedback and experiences with qHNSW!