Handle larger Indices (current limit 1GB) #1
Comments
@asg017: Do you have any plans for this fix? Any timeline?
No timeline! I haven't had time to work on the core library lately. But if anyone reading this would like to sponsor this work, let me know!
Do we have a rough estimate of how many vectors can be indexed within 1GB?
Depends on the dimensions of your vectors, and on whether you use any additional Faiss factory strings with `factory=`. If you're using the default settings, the size of your index in bytes is roughly:

`number of vectors * dimensions * 4`

(where 4 is the size of a float in bytes), with some additional overhead (a BTree that maps rowids to each vector, some Faiss-specific storage, etc.). In the "headlines" example in this blog post, there are 209,527 vectors with 384 dimensions (using sentence transformers), and it takes up roughly 323.5MB of space.

You can also use the `factory=` option to shrink the index. I don't have too many examples of this, but it was discussed at length in this issue. The general gist is to use a PCA factory string:

```sql
create virtual table vss_pca using vss0(a(1536) factory="PCA384,Flat,IDMap2");
```

Then train your index:

```sql
insert into vss_pca(operation, a)
  select 'training', value from vectors;
```

Then insert your vector data:

```sql
insert into vss_pca(rowid, a)
  select key, value from vectors;
```

This example reduces 1536-dimension vectors to 384 dimensions. The storage savings get better with larger datasets, but in my quick test with 100,000 vectors the result was about 25% the size of the original full-length index. This approach does reduce the accuracy of KNN-style searches, however, so use caution.
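As a quick, editorial sanity check of that estimate (not from the thread; the function name is made up, and the figures are the ones quoted above):

```python
def estimated_index_bytes(num_vectors: int, dimensions: int) -> int:
    # Rough default-settings estimate: 4 bytes per float32 component,
    # not counting the rowid BTree and Faiss bookkeeping overhead.
    return num_vectors * dimensions * 4

# The "headlines" example: 209,527 vectors x 384 dimensions
print(estimated_index_bytes(209_527, 384) / 1e6)  # ~321.8 MB, close to the observed ~323.5 MB

# Roughly how many 384-dimension vectors fit under the ~1GB BLOB limit?
print(1_000_000_000 // (384 * 4))  # ~651,000 vectors
```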
I did just that.
Hey @baughmann, thanks for the donation! I'm going to give this a shot over the holiday. Although to be clear, this will only bump the 1GB limit that vss0 indices currently have. However, other shortcomings will still exist.
I've been thinking a lot about these shortcomings. There is #30, where I lay out a possible workaround, but that's separate from this issue. I'll also probably chew a bit on this over my break, but will definitely try to lift this 1GB limit in the following weeks!
Just wondering if there were any developments on this front. I guess my use case is a simple one: I have around 3-4 million embeddings to index, which is >1GB; however, as the data is static, I don't have the update concerns you outline in #30.
+1 for solving this problem, as it would unlock vector use cases and possibly integrate with @tursodatabase, which scales.
For everyone waiting on this, it may be better to just create an implementation using the Repository pattern and have that repository instance maintain a FAISS index alongside the SQLite database. On startup, the repository should load a BLOB column of vectors into the index. You can utilize the IndexIDMap to track which documents were returned in your similarity search by linking that ID column to an auto-incrementing integer column on the SQLite table. I implemented one of these myself. It will probably end up being faster and more scalable than anything that can be done in a native SQLite plugin.
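A minimal sketch of the repository approach described above, assuming a hypothetical `documents(id, body, embedding)` table with float32 bytes in the `embedding` BLOB column and a made-up dimension; it is an illustration, not the commenter's actual code:

```python
import sqlite3

import faiss  # pip install faiss-cpu
import numpy as np

DIM = 384  # hypothetical embedding dimension


class VectorRepository:
    """Keeps an in-memory FAISS index alongside a SQLite table of documents."""

    def __init__(self, db_path: str):
        self.db = sqlite3.connect(db_path)
        # IndexIDMap lets us search by vector and get back SQLite row ids.
        self.index = faiss.IndexIDMap(faiss.IndexFlatL2(DIM))
        self._load()

    def _load(self):
        rows = self.db.execute("SELECT id, embedding FROM documents").fetchall()
        if not rows:
            return
        ids = np.array([r[0] for r in rows], dtype=np.int64)
        vecs = np.vstack([np.frombuffer(r[1], dtype=np.float32) for r in rows])
        self.index.add_with_ids(vecs, ids)

    def search(self, query: np.ndarray, k: int = 10):
        # Returns (distance, row) pairs for the k nearest documents.
        distances, ids = self.index.search(query.reshape(1, -1).astype(np.float32), k)
        results = []
        for dist, doc_id in zip(distances[0], ids[0]):
            if doc_id == -1:  # FAISS pads missing results with -1
                continue
            row = self.db.execute(
                "SELECT id, body FROM documents WHERE id = ?", (int(doc_id),)
            ).fetchone()
            results.append((float(dist), row))
        return results
```

Rebuilding the index with a full table scan on startup matches the "load on startup" behavior described in the comment.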
Any chance you could share an example implementation of that?
I was trying to think of a way to make it into a library, but because people's metadata structures are so different it would be difficult to make it generic enough.

However, I believe I have a notebook that just uses Pandas/PyTorch/Numpy/SQLite that I could minimize and throw up in a gist or something. I can try to do the same for the FAISS/Numpy/SQLite implementation. I'm out of the country for a few weeks but will try to get around to it this week if I can.

That said, it's actually not difficult at all. If you can't wait on me, look up some FAISS/PyTorch similarity search tutorials. The key difference is that you need to deserialize/serialize the index members and save them to a BLOB column in SQLite.

The downside of the FAISS implementation is that metadata filtering needs to be done *after* the similarity search, since you can't filter a FAISS index by anything other than its vector content. The complication with the PyTorch/Pandas/Numpy implementation is that you need to copy the entire active data frame if you want to filter by metadata before the similarity search.

Neither implementation is perfect, but either will probably get you pretty far with some basic optimizations.
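A hedged sketch of the serialize/deserialize step and the filter-after-search downside mentioned above; the `faiss_blobs` table, its single-row layout, and `filtered_search` are illustrative assumptions (and a single-row BLOB still runs into SQLite's ~1GB limit for very large indices):

```python
import sqlite3

import faiss
import numpy as np


def save_index(db: sqlite3.Connection, index: faiss.Index) -> None:
    # faiss.serialize_index returns a numpy uint8 array of the index bytes.
    blob = faiss.serialize_index(index).tobytes()
    db.execute("CREATE TABLE IF NOT EXISTS faiss_blobs(name TEXT PRIMARY KEY, data BLOB)")
    db.execute("INSERT OR REPLACE INTO faiss_blobs(name, data) VALUES ('main', ?)", (blob,))
    db.commit()


def load_index(db: sqlite3.Connection) -> faiss.Index:
    (blob,) = db.execute("SELECT data FROM faiss_blobs WHERE name = 'main'").fetchone()
    return faiss.deserialize_index(np.frombuffer(blob, dtype=np.uint8))


def filtered_search(index: faiss.Index, query: np.ndarray, k: int, allowed_rowids: set):
    # Metadata filtering happens *after* the similarity search: over-fetch,
    # then keep only hits whose rowid passes the metadata predicate.
    distances, ids = index.search(query.reshape(1, -1).astype(np.float32), k * 5)
    hits = [
        (float(d), int(i))
        for d, i in zip(distances[0], ids[0])
        if i != -1 and int(i) in allowed_rowids
    ]
    return hits[:k]
```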
The vector indices that support the `vss0` virtual table are limited to 1GB. This is because they are stored as a BLOB in a single row of a shadow table, which has a limit of ~1GB.

Instead, we should store large FAISS indices across several rows, so they can (in theory) grow with infinite space. This will likely be complicated and require the SQLite BLOB I/O API and a custom faiss IOWriter.
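As an editorial illustration of the chunked-storage idea only (not the streaming IOWriter approach the issue proposes, since this simple sketch still materializes the whole serialized index in memory), a hypothetical `vss_index_chunks` table could hold the index split across rows:

```python
import sqlite3

import faiss
import numpy as np

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per row, well under SQLite's ~1GB BLOB cap


def write_chunked(db: sqlite3.Connection, index: faiss.Index) -> None:
    data = faiss.serialize_index(index).tobytes()
    db.execute("CREATE TABLE IF NOT EXISTS vss_index_chunks(seq INTEGER PRIMARY KEY, chunk BLOB)")
    db.execute("DELETE FROM vss_index_chunks")
    for seq, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        db.execute(
            "INSERT INTO vss_index_chunks(seq, chunk) VALUES (?, ?)",
            (seq, data[start:start + CHUNK_SIZE]),
        )
    db.commit()


def read_chunked(db: sqlite3.Connection) -> faiss.Index:
    chunks = [row[0] for row in db.execute("SELECT chunk FROM vss_index_chunks ORDER BY seq")]
    return faiss.deserialize_index(np.frombuffer(b"".join(chunks), dtype=np.uint8))
```

A true fix along the lines described above would stream writes row by row via a custom IOWriter rather than serializing the whole index first.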