Handle larger Indices (current limit 1GB) #1
Comments
@asg017: Do you have any plans for this fix? Any timeline?
No timeline! I haven't had time to work on the core library lately. But if anyone reading this would like to sponsor this work, let me know!
Do we have a rough estimate of how many vectors can be indexed within 1GB?
Depends on the dimensions of your vectors, and on whether you use any additional Faiss factory strings with `factory=`. If you're using the default settings, the size of your index in bytes is roughly:

`number of vectors * dimensions * 4`

(where 4 is the size of a float in bytes), with some additional overhead (a BTree that maps rowids to each vector, some Faiss-specific storage, etc.). In the "headlines" example in this blog post, there are 209,527 vectors with 384 dimensions (using sentence transformers), and it takes up roughly 323.5MB of space.

You can also use the `factory=` option to shrink the index. I don't have too many examples of this, but it was discussed at length in this issue. The general gist is to use a PCA factory string:

```sql
create virtual table vss_pca using vss0(a(1536) factory="PCA384,Flat,IDMap2");
```

Then train your index:

```sql
insert into vss_pca(operation, a)
  select 'training', value from vectors;
```

Then insert your vector data:

```sql
insert into vss_pca(rowid, a)
  select key, value from vectors;
```

This example reduces 1536-dimension vectors to 384 dimensions. The storage savings get better with larger datasets, but in my quick test with 100,000 vectors the result was about 25% the size of the original full-length index. This approach does reduce the accuracy of KNN-style searches, however, so use caution.
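As a quick, editorial sanity check of that estimate (not from the thread; the function name is made up, and the figures are the ones quoted above):

```python
def estimated_index_bytes(num_vectors: int, dimensions: int) -> int:
    # Rough default-settings estimate: 4 bytes per float32 component,
    # not counting the rowid BTree and Faiss bookkeeping overhead.
    return num_vectors * dimensions * 4

# The "headlines" example: 209,527 vectors x 384 dimensions
print(estimated_index_bytes(209_527, 384) / 1e6)  # ~321.8 MB, close to the observed ~323.5 MB

# Roughly how many 384-dimension vectors fit under the ~1GB BLOB limit?
print(1_000_000_000 // (384 * 4))  # ~651,000 vectors
```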
I did just that.
Hey @baughmann, thanks for the donation! I'm going to give this a shot over the holiday. Although to be clear, this will only bump the 1GB limit that vss0 indices currently have. However, other shortcomings will still exist.
I've been thinking a lot about these shortcomings. There is #30, where I lay out a possible workaround, but that's separate from this issue. I'll also probably chew a bit on this over my break, but will definitely try to lift this 1GB limit in the following weeks!
Just wondering if there were any developments on this front. I guess my use case is a simple one: I have around 3-4 million embeddings to index, which is >1GB; however, as the data is static, I don't have the update concerns you outline in #30.
+1 for solving this problem, as it would unlock vector use cases and possibly integrate with @tursodatabase, which scales.
For everyone waiting on this, it may be better to just create an implementation using the Repository pattern and have that repository instance maintain a FAISS index alongside the SQLite database. On startup, the repository should load a BLOB column of vectors into the index. You can utilize the IndexIDMap to track which documents were returned in your similarity search by linking that ID column to an auto-incrementing integer column on the SQLite table. I implemented one of these myself. It will probably end up being faster and more scalable than anything that can be done in a native SQLite plugin.
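A minimal sketch of the repository approach described above, assuming a hypothetical `documents(id, body, embedding)` table with float32 bytes in the `embedding` BLOB column and a made-up dimension; it is an illustration, not the commenter's actual code:

```python
import sqlite3

import faiss  # pip install faiss-cpu
import numpy as np

DIM = 384  # hypothetical embedding dimension


class VectorRepository:
    """Keeps an in-memory FAISS index alongside a SQLite table of documents."""

    def __init__(self, db_path: str):
        self.db = sqlite3.connect(db_path)
        # IndexIDMap lets us search by vector and get back SQLite row ids.
        self.index = faiss.IndexIDMap(faiss.IndexFlatL2(DIM))
        self._load()

    def _load(self):
        rows = self.db.execute("SELECT id, embedding FROM documents").fetchall()
        if not rows:
            return
        ids = np.array([r[0] for r in rows], dtype=np.int64)
        vecs = np.vstack([np.frombuffer(r[1], dtype=np.float32) for r in rows])
        self.index.add_with_ids(vecs, ids)

    def search(self, query: np.ndarray, k: int = 10):
        # Returns (distance, row) pairs for the k nearest documents.
        distances, ids = self.index.search(query.reshape(1, -1).astype(np.float32), k)
        results = []
        for dist, doc_id in zip(distances[0], ids[0]):
            if doc_id == -1:  # FAISS pads missing results with -1
                continue
            row = self.db.execute(
                "SELECT id, body FROM documents WHERE id = ?", (int(doc_id),)
            ).fetchone()
            results.append((float(dist), row))
        return results
```

Rebuilding the index with a full table scan on startup matches the "load on startup" behavior described in the comment.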
Any chance you could share an example implementation of that?
I was trying to think of a way to make it into a library, but because people's metadata structures are so different it would be difficult to make it generic enough.

However, I believe I have a notebook that just uses Pandas/PyTorch/Numpy/SQLite that I could minimize and throw up in a gist or something. I can try to do the same for the FAISS/Numpy/SQLite implementation. I'm out of the country for a few weeks but will try to get around to it this week if I can.

That said, it's actually not difficult at all. If you can't wait on me, look up some FAISS/PyTorch similarity search tutorials. The key difference is that you need to deserialize/serialize the index members and save them to a BLOB column in SQLite.

The downside of the FAISS implementation is that metadata filtering needs to be done *after* the similarity search, since you can't filter a FAISS index by anything other than its vector content. The complication with the PyTorch/Pandas/Numpy implementation is that you need to copy the entire active data frame if you want to filter by metadata before the similarity search.

Neither implementation is perfect, but either will probably get you pretty far with some basic optimizations.
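A hedged sketch of the serialize/deserialize step and the filter-after-search downside mentioned above; the `faiss_blobs` table, its single-row layout, and `filtered_search` are illustrative assumptions (and a single-row BLOB still runs into SQLite's ~1GB limit for very large indices):

```python
import sqlite3

import faiss
import numpy as np


def save_index(db: sqlite3.Connection, index: faiss.Index) -> None:
    # faiss.serialize_index returns a numpy uint8 array of the index bytes.
    blob = faiss.serialize_index(index).tobytes()
    db.execute("CREATE TABLE IF NOT EXISTS faiss_blobs(name TEXT PRIMARY KEY, data BLOB)")
    db.execute("INSERT OR REPLACE INTO faiss_blobs(name, data) VALUES ('main', ?)", (blob,))
    db.commit()


def load_index(db: sqlite3.Connection) -> faiss.Index:
    (blob,) = db.execute("SELECT data FROM faiss_blobs WHERE name = 'main'").fetchone()
    return faiss.deserialize_index(np.frombuffer(blob, dtype=np.uint8))


def filtered_search(index: faiss.Index, query: np.ndarray, k: int, allowed_rowids: set):
    # Metadata filtering happens *after* the similarity search: over-fetch,
    # then keep only hits whose rowid passes the metadata predicate.
    distances, ids = index.search(query.reshape(1, -1).astype(np.float32), k * 5)
    hits = [
        (float(d), int(i))
        for d, i in zip(distances[0], ids[0])
        if i != -1 and int(i) in allowed_rowids
    ]
    return hits[:k]
```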
The vector indices that support the `vss0` virtual table are limited to 1GB. This is because they are stored as a BLOB in a single row of a shadow table, which has a limit of ~1GB.

Instead, we should store large FAISS indices across several rows, so they can (in theory) grow with infinite space. This will likely be complicated and require the SQLite BLOB I/O API and a custom faiss IOWriter.
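As an editorial illustration of the chunked-storage idea only (not the streaming IOWriter approach the issue proposes, since this simple sketch still materializes the whole serialized index in memory), a hypothetical `vss_index_chunks` table could hold the index split across rows:

```python
import sqlite3

import faiss
import numpy as np

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per row, well under SQLite's ~1GB BLOB cap


def write_chunked(db: sqlite3.Connection, index: faiss.Index) -> None:
    data = faiss.serialize_index(index).tobytes()
    db.execute("CREATE TABLE IF NOT EXISTS vss_index_chunks(seq INTEGER PRIMARY KEY, chunk BLOB)")
    db.execute("DELETE FROM vss_index_chunks")
    for seq, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        db.execute(
            "INSERT INTO vss_index_chunks(seq, chunk) VALUES (?, ?)",
            (seq, data[start:start + CHUNK_SIZE]),
        )
    db.commit()


def read_chunked(db: sqlite3.Connection) -> faiss.Index:
    chunks = [row[0] for row in db.execute("SELECT chunk FROM vss_index_chunks ORDER BY seq")]
    return faiss.deserialize_index(np.frombuffer(b"".join(chunks), dtype=np.uint8))
```

A true fix along the lines described above would stream writes row by row via a custom IOWriter rather than serializing the whole index first.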