Vector Search HNSW Indexing Encoding #2316

Beihao-Zhou · 2024-05-15T23:52:23Z

Beihao-Zhou
May 15, 2024
Collaborator

Vector Search HNSW Indexing Encoding

Goal

We aim to integrate the Hierarchical Navigable Small World (HNSW) algorithm into Kvrocks to provide an efficient, scalable solution for approximate nearest neighbour searches in high-dimensional spaces.

Background

HNSW builds a multi-layered structure where each layer acts as a separate graph. The top layers contain fewer nodes and are used to facilitate fast traversal over large distances within the dataset. Lower layers have more nodes, enabling precise navigation and search in the neighborhood of the query point. In short, HNSW is a skip list of graphs.

The bottom-most layer (level 0) is an NSW graph that contains every vector. Some vectors are randomly promoted to upper layers, which are also NSW graphs; the higher the layer, the fewer elements it holds. A search starts from the top-most layer and uses the nearest neighbors found in each layer as the entry points for the layer below.
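The proposal does not say how a vector's top layer is drawn. As a minimal sketch, assuming the rule from the original HNSW paper (level = floor(-ln(U) * mL) with mL = 1 / ln(M)), the "fewer elements in upper layers" property looks like this. The function name `random_level` and the fixed seed are illustrative only.

```python
import math
import random


def random_level(m: int, rng: random.Random) -> int:
    """Draw the top layer for a new node (assumes m >= 2)."""
    m_l = 1.0 / math.log(m)      # level normalisation factor mL = 1 / ln(M)
    u = 1.0 - rng.random()       # uniform in (0, 1], so log(u) is always defined
    return int(-math.log(u) * m_l)


rng = random.Random(42)
counts = {}
for _ in range(100_000):
    lvl = random_level(16, rng)
    counts[lvl] = counts.get(lvl, 0) + 1

# With M = 16, roughly 15/16 of the nodes stay on level 0 only,
# and each further layer holds about 1/16 of the layer below.
for lvl in sorted(counts):
    print(lvl, counts[lvl])
```

Under this rule the expected number of layers grows logarithmically with the number of vectors, which is what makes the skip-list analogy hold.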

Reference: Write a vector Database

HNSW metadata

Key: index_name
Value
- Basic Metadata
  - flags (1 byte)
  - expiration (Ebytes)
  - version (8 bytes)
  - size (Sbytes)
- HNSW Specific Metadata
  - storing_data_type (1 byte): Vectors can be stored in either hashes or JSON (0 = hashes, 1 = JSON).
  - vector_field (Sbytes)
  - entryPoint (Sbytes): Entry node ID for search initiation, i.e. the key of the data to be returned (still an open question).
  - type (1 byte): Enum property to indicate vector type (0 = FLOAT32, 1 = FLOAT64).
  - dim (2 bytes): Stores the vector dimension as a 16-bit integer.
  - distanceMetric (1 byte): Enum for distance metric (0 = L2, 1 = IP, 2 = COSINE).
  - initialCap (4 bytes): Initial capacity as a 32-bit integer.
  - M (2 bytes): Maximum number of outgoing edges per node as a 16-bit integer.
  - efConstruction (4 bytes): Size of the dynamic candidate list during construction as a 32-bit integer.
  - efRuntime (4 bytes): Size of the candidate list during search as a 32-bit integer.
  - epsilon (4 bytes): Floating point to extend search radius, stored as a 32-bit float.
  - maxElements (4 bytes): Maximum elements stored as a 32-bit integer.
  - currentLevel (2 bytes): Highest level nodes currently reach as a 16-bit integer.

HNSW Graph sub key-values

Nodes

Key: index_name | level | node_id
- level is the current level of the node
- node_id is the key for the indexed data point
Value: num_neighbours

Edges

Key: index_name | level | node_id | connected_node_id
Value: computed_distance
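
One consequence of this layout is that a node's neighbours on a given level share the node key as a prefix, so they can be read with a single prefix scan. The sketch below simulates that with a sorted in-memory dict standing in for RocksDB. The length-prefixed key segments, the 2-byte level, the 2-byte num_neighbours and the 8-byte distance are all assumptions for illustration, not the actual Kvrocks encoding.

```python
import struct
from bisect import bisect_left


def _sized(data: bytes) -> bytes:
    """Length-prefix a segment so one id can never be a prefix of another."""
    return struct.pack(">I", len(data)) + data


def node_key(index: bytes, level: int, node_id: bytes) -> bytes:
    return _sized(index) + struct.pack(">H", level) + _sized(node_id)


def edge_key(index: bytes, level: int, node_id: bytes, neighbour: bytes) -> bytes:
    return node_key(index, level, node_id) + _sized(neighbour)


def neighbours(store: dict, index: bytes, level: int, node_id: bytes):
    """Prefix scan: return (neighbour_id, distance) pairs for one node and level."""
    prefix = node_key(index, level, node_id)
    keys = sorted(store)  # RocksDB keeps keys ordered; emulate that here
    result = []
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        rest = key[len(prefix):]
        if not rest:      # the node record itself (value: num_neighbours)
            continue
        (n,) = struct.unpack_from(">I", rest, 0)
        (distance,) = struct.unpack(">d", store[key])
        result.append((rest[4:4 + n], distance))
    return result


store = {
    node_key(b"idx", 0, b"doc:1"): struct.pack(">H", 2),
    edge_key(b"idx", 0, b"doc:1", b"doc:2"): struct.pack(">d", 0.5),
    edge_key(b"idx", 0, b"doc:1", b"doc:3"): struct.pack(">d", 1.25),
    edge_key(b"idx", 1, b"doc:1", b"doc:9"): struct.pack(">d", 3.0),
    node_key(b"idx", 0, b"doc:10"): struct.pack(">H", 0),
}
print(neighbours(store, b"idx", 0, b"doc:1"))
```

Each neighbour costs one key-value pair here, which is the granularity the later space and lookup-cost comments in this thread are concerned with.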

Inverted Index Key-values

Once an index is created, Redis Stack automatically indexes any existing, modified, or newly created JSON documents stored in the database. To mirror this, Kvrocks needs to know on insertion whether a new entry belongs to an HNSW index, which is what the following mapping is for.

Key: type | prefix | vector_field
Value: index_name

APIs

All interfaces and internal APIs will be implemented based on:

Appendix

Redis Vector: https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/
Kvrocks data structure encoding: https://kvrocks.apache.org/community/data-structure-on-rocksdb/
Other Sample Impl: https://github.com/swapneel/hnsw-rust

Beihao-Zhou · 2024-05-15T23:54:50Z

Beihao-Zhou
May 15, 2024
Collaborator Author

Hi @PragmaTwice @git-hulk, as discussed last time, this is the encoding plan for implementing basic vector search in Kvrocks. Any questions, concerns, or feedback are welcome!

I couldn't think of a better choice for node_id (for now it's just the key of the original data that should be returned). Let me know if you have better ideas!

1 reply

git-hulk May 16, 2024
Collaborator

Thanks for your awesome design proposal; I'm good with this design. As for naming, I think node_id is fine in the context of the graph, and we could add a specific prefix to disambiguate it in the implementation, e.g. graph_node_id. Let's see if @PragmaTwice @mapleFU have any comments.

mapleFU · 2024-05-16T02:40:10Z

mapleFU
May 16, 2024
Collaborator

From my experience, the space amplification of a vector index can be huge, so it might not be well suited to the KV case. We could rush a quick PoC and compare the space used when storing the index as key-values versus in a self-defined blob format.

1 reply

Beihao-Zhou May 26, 2024
Collaborator Author

Apologies for the late reply; I was busy for the past two weeks.

Could you please elaborate a little on what I could do for the PoC, particularly the blob format? I'll first try to implement the current solution, as it needs to be done anyway, and then compare it with the blob format.

PragmaTwice · 2024-05-16T11:33:42Z

PragmaTwice
May 16, 2024
Collaborator

Thank you for your great design proposal and your efforts!

I think there are generally no major issues with your proposal, but there are a few points to note:

- The encoding of indexes is different from the encoding of Redis data types. We maintain the encoding of all index types in search_encoding.h (such as the tag type: https://github.com/apache/kvrocks/blob/unstable/src/search/search_encoding.h#L107), which you can refer to.
- Could you explain a little bit about the role of the inverted index key-values? It seems that we don't really need to maintain this in KQIR.

2 replies

Beihao-Zhou May 26, 2024
Collaborator Author

Apologies for the late reply as well.

Thanks for sharing the code pointers! The inverted index was intended for the index updater, but after looking around the codebase I found that it's already implemented in PR #2111, so never mind.

I'll try to get more familiar with the search module over the next two days and start implementing the proposed solution.

PragmaTwice May 27, 2024
Collaborator

Thank you!

You can check this issue for the current encoding (and a newly proposed one) of the search framework: #2329

So basically you'll need to add a new field type, associated with the metadata encoding and index data encoding of this type, e.g.

vector field metadata encoding:

ns | FIELD_META | index name | field name -> field flag | ... (like type, dim..)

vector field index encoding:

ns | FIELD | index name | field name | ... -> ... (like nodes, edges)

suppersam1 · 2024-06-13T10:09:36Z

suppersam1
Jun 13, 2024

How do you search? Is it by querying the underlying RocksDB to fetch the individual points of the HNSW graph?

6 replies

suppersam1 Jun 19, 2024

One option is node_id -> node_data (neighbors, vector_data, level), where the neighbors data holds the node's neighbors on each layer of the HNSW graph and stores only node IDs. You can refer to implementations such as pgvector or hnswlib. This effectively gives a pure in-memory HNSW on top of RocksDB's block cache: when the block cache is large enough, all KV data is loaded in memory; when it is not, queries degrade into disk reads.
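
A minimal sketch of what such a single-value node record could look like, so that one point lookup returns the vector and all neighbour lists. The field order, the 2-byte counts, the float32 vector components and the function names are assumptions for illustration; this is not how pgvector or hnswlib lay out their data.

```python
import struct


def encode_node(level: int, vector: list, neighbours_per_layer: list) -> bytes:
    """neighbours_per_layer[i] holds the neighbour ids (bytes) on layer i, 0..level."""
    if len(neighbours_per_layer) != level + 1:
        raise ValueError("need one neighbour list per layer from 0 to level")
    parts = [struct.pack(">HH", level, len(vector)),
             struct.pack(f">{len(vector)}f", *vector)]
    for ids in neighbours_per_layer:
        parts.append(struct.pack(">H", len(ids)))
        for nid in ids:
            parts.append(struct.pack(">H", len(nid)) + nid)
    return b"".join(parts)


def decode_node(buf: bytes):
    level, dim = struct.unpack_from(">HH", buf, 0)
    pos = 4
    vector = list(struct.unpack_from(f">{dim}f", buf, pos))
    pos += 4 * dim
    layers = []
    for _ in range(level + 1):
        (count,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        ids = []
        for _ in range(count):
            (n,) = struct.unpack_from(">H", buf, pos)
            pos += 2
            ids.append(buf[pos:pos + n])
            pos += n
        layers.append(ids)
    return level, vector, layers


blob = encode_node(1, [1.0, 0.5, -2.0], [[b"doc:2", b"doc:3"], [b"doc:9"]])
print(decode_node(blob))
```

The trade-off against the per-edge key layout is that adding or removing one edge rewrites the whole record, while a search visits one key per node instead of one key per edge.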

Beihao-Zhou Jun 20, 2024
Collaborator Author

Thanks for sharing the advice! HNSW's first draft PR is here: #2368
The indexing encoding is defined under search_encoding.h within the PR.

The implementation of insertion/search closely follows bustub-vectordb; it is basically a modified on-disk version of the algorithm in the original HNSW paper.

As for the block cache, I see that it's managed by the storage layer when RocksDB is opened. I think RocksDB may need careful tuning for vector search.

PragmaTwice Jun 20, 2024
Collaborator

The HNSW implementation in FAISS is also worth checking out: https://github.com/facebookresearch/faiss/blob/main/faiss/impl/HNSW.h

suppersam1 Jun 20, 2024

I'm not sure if RocksDB's point lookup performance can support HNSW queries. Based on my tests, for 500,000 points in a 1536-dimensional space with m=16, each query needs to access around 2000 nodes. In other implementations, where node data is stored in arrays, accessing 2000 nodes takes approximately 100 microseconds. If those 2000 nodes are fetched from RocksDB at around 5 microseconds per point lookup, a query takes about 2000 × 5 µs = 10 milliseconds.
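
Spelling out the arithmetic behind that estimate, using only the figures quoted in this comment (they are the commenter's measurements and estimates, not new benchmark results):

```python
nodes_per_query = 2000      # nodes visited per query (500k points, dim 1536, m=16)
rocksdb_lookup_us = 5       # estimated cost of one RocksDB point lookup
in_memory_total_us = 100    # array-based implementation, all 2000 nodes

rocksdb_total_us = nodes_per_query * rocksdb_lookup_us
rocksdb_total_ms = rocksdb_total_us / 1000
in_memory_per_node_ns = in_memory_total_us * 1000 / nodes_per_query
slowdown = rocksdb_total_us / in_memory_total_us

print(f"RocksDB:   {rocksdb_total_ms} ms per query")
print(f"In-memory: {in_memory_per_node_ns} ns per node")
print(f"Slowdown:  {slowdown}x")
```

So under these numbers the per-node cost goes from about 50 ns to 5 µs, a factor of roughly 100 per query.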

PragmaTwice Jun 21, 2024
Collaborator

In terms of performance, it's possible that the initial version of on-disk HNSW may not be ideal. What matters is that we have a baseline as a reference, and can then try various optimizations on top of it (index encoding, RocksDB tuning, ...), or even explore other indexing algorithms such as DiskANN.

Replies: 4 comments 10 replies