Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for Sqlite vector store #122

Merged
merged 7 commits into from
Dec 3, 2024

Conversation

tarrencev
Copy link
Contributor

@tarrencev tarrencev commented Nov 25, 2024

New vector store: Sqlite

Description

Sqlite using sqlite-vec

Fixes # (issue)

Resolves #124

Testing

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce your results.

  • Run example with cargo run --package rig-sqlite --example vector_search_sqlite

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • [x I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I've reviewed the vector store API documentation and implemented the types of response accurately

Notes

I think this might have some conflicts with #120?

Perhaps it would be easiest to merge this first then adapt it in that PR or I can adapt it after that is merged. It seems like it might remove a bunch of the code.

@0xMochan
Copy link
Contributor

Hello, @tarrencev, welcome to the Rig community! We appreciate your pull request. I'll be doing a deeper dive and replicating your sqlite setup locally soon but I had a couple of preliminary comments.

  1. In the future, make sure you check out the CONTRIBUTING.md guide. In general, we request all contributions have a corresponding issue made alongside using the PR templates available for new provider and vector store integrations.

  2. I was glancing through your PR and I noticed that you've leveraged zerocopy for memory transmutation in one of the examples (in the unsafe block). This is an unusual dependency and method here being used for an example, is there a specific reason this is needed for the example? I think if you had some documentation on the rusqlite side that made this a normal code pattern, it would make it a bit more clear to us!

  3. Please do link the specific documentation being used for the vector store implementation and even consider adding it to the code documentation as it makes it convenient for future engineers to be able to maintain this crate!

  4. As per your question about feat: embeddings overhaul #120, we are looking to merge this very soon. In any case, if this PR merges before that one, we can add a follow up commit will be added to feat: embeddings overhaul #120 post-merge. I'll follow up with the engineer leading that PR to match up what parts are changing related to this PR.

@tarrencev tarrencev force-pushed the sqlitestore branch 2 times, most recently from e0ed435 to 00265b1 Compare November 25, 2024 19:40
@tarrencev
Copy link
Contributor Author

Hey there, thanks for the response:

  1. I was glancing through your PR and I noticed that you've leveraged zerocopy for memory transmutation in one of the examples (in the unsafe block). This is an unusual dependency and method here being used for an example, is there a specific reason this is needed for the example? I think if you had some documentation on the rusqlite side that made this a normal code pattern, it would make it a bit more clear to us!

In this case, we were using it to efficiently copy the embedding vector to a bytes representation for storage in the db but we can do it safely with a copy.

  1. Please do link the specific documentation being used for the vector store implementation and even consider adding it to the code documentation as it makes it convenient for future engineers to be able to maintain this crate!

I've added a reference to the sqlite-vec rust docs for initializing the extension.

  1. As per your question about Feat/embeddings overhaul #120, we are looking to merge this very soon. In any case, if this PR merges before that one, we can add a follow up commit will be added to Feat/embeddings overhaul #120 post-merge. I'll follow up with the engineer leading that PR to match up what parts are changing related to this PR.

Sounds good!

@tarrencev tarrencev changed the title Add sqlite vector store feat: Add support for Sqlite vector store Nov 25, 2024
@tarrencev
Copy link
Contributor Author

The failing test here seems to depend on a api token

@0xMochan
Copy link
Contributor

The failing test here seems to depend on a api token

Looking into it!

Copy link
Contributor

@0xMochan 0xMochan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A couple of suggested changes (thanks for adding to the PR after my prelim glance). The unsafe block makes sense since it's straight from the docs so it should be fine to users of sqlite (the comment helps a lot!)

  1. The earlier zerocopy usage, I accidentally confused the usage of zerocopy with the unsafe block in the example (which, I didn't see why an extra dep. was needed for an example). I see that you reverted this change already (but I don't see it in the commit history) but I think it's actually fine to use it here. Apologies!

  2. The failing tests, we'll resolve this soon!

  3. We are also pursuing a set of integration tests soon. I think this crate should be fairly straightforward to integrate with that but we'll leave updates for that process in our Discord if you wanted to contribute to the testing effort later!

rig-sqlite/src/lib.rs Outdated Show resolved Hide resolved
rig-sqlite/src/lib.rs Outdated Show resolved Hide resolved
rig-sqlite/src/lib.rs Outdated Show resolved Hide resolved
@tarrencev tarrencev force-pushed the sqlitestore branch 2 times, most recently from d6a959e to 4838e66 Compare November 26, 2024 02:10
@tarrencev
Copy link
Contributor Author

Addressed most feedback, if you feel strongly about extracting queries / columns / etc I can try find time to land that too.

@0xMochan
Copy link
Contributor

0xMochan commented Nov 26, 2024

Addressed most feedback, if you feel strongly about extracting queries / columns / etc I can try find time to land that too.

I do think it would be proper, since the user might be using a sqlite database file that is defined outside of rig. We could use prepare and have a sanitized sql query constructed via a builder pattern (since column naming could be theoretically malicious).

If timing is an issue, I could also quickly scaffold or complete this last feedback.

^ Old response, addressed again later

Also, for the EmbeddingModel, it might be possible to save only a reference rather than cloning the model since it's only used for ndim. This is fairly minor though.

Great work on the feedback, this will be a very handy addition as a general replacement for InMemoryVectorStore when I'm handling testing scripts!

Edit:
I just realized you replied directly to my review suggestion (thanks mobile UI). It might be a good point for us to do later, so perhaps we can create an issue for it and then plan to tackle it after the embeddings PR. I do concur with the encapsulation POV hence why I think a builder pattern w/ reasonable defaults is a happy medium.

@tarrencev
Copy link
Contributor Author

Sounds good! I've updated the PR to pass the embedding model by reference

@0xMochan
Copy link
Contributor

Last thing I think is a usage comment on the SqliteVectorStore struct. Something similar to Lancedb's:

/// # Example
/// ```
/// use std::{env, sync::Arc};
///
/// use arrow_array::RecordBatchIterator;
/// use fixture::{as_record_batch, schema};
/// use rig::{
/// embeddings::{EmbeddingModel, EmbeddingsBuilder},
/// providers::openai::{Client, TEXT_EMBEDDING_ADA_002},
/// vector_store::VectorStoreIndexDyn,
/// };
/// use rig_lancedb::{LanceDbVectorStore, SearchParams};
/// use serde::Deserialize;
///
/// #[derive(Deserialize, Debug)]
/// pub struct VectorSearchResult {
/// pub id: String,
/// pub content: String,
/// }
///
/// // Initialize OpenAI client. Use this to generate embeddings (and generate test data for RAG demo).
/// let openai_api_key = env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY not set");
/// let openai_client = Client::new(&openai_api_key);
///
/// // Select the embedding model and generate our embeddings
/// let model = openai_client.embedding_model(TEXT_EMBEDDING_ADA_002);
///
/// let embeddings = EmbeddingsBuilder::new(model.clone())
/// .simple_document("doc0", "Definition of *flumbrel (noun)*: a small, seemingly insignificant item that you constantly lose or misplace, such as a pen, hair tie, or remote control.")
/// .simple_document("doc1", "Definition of *zindle (verb)*: to pretend to be working on something important while actually doing something completely unrelated or unproductive")
/// .simple_document("doc2", "Definition of *glimber (adjective)*: describing a state of excitement mixed with nervousness, often experienced before an important event or decision.")
/// .build()
/// .await?;
///
/// // Define search_params params that will be used by the vector store to perform the vector search.
/// let search_params = SearchParams::default();
///
/// // Initialize LanceDB locally.
/// let db = lancedb::connect("data/lancedb-store").execute().await?;
///
/// // Create table with embeddings.
/// let record_batch = as_record_batch(embeddings, model.ndims());
/// let table = db
/// .create_table(
/// "definitions",
/// RecordBatchIterator::new(vec![record_batch], Arc::new(schema(model.ndims()))),
/// )
/// .execute()
/// .await?;
///
/// let vector_store = LanceDbVectorStore::new(table, model, "id", search_params).await?;
///
/// // Query the index
/// let results = vector_store
/// .top_n("My boss says I zindle too much, what does that mean?", 1)
/// .await?
/// .into_iter()
/// .map(|(score, id, doc)| {
/// anyhow::Ok((
/// score,
/// id,
/// serde_json::from_value::<VectorSearchResult>(doc)?,
/// ))
/// })
/// .collect::<Result<Vec<_>, _>>()?;
///
/// println!("Results: {:?}", results);
/// ```

We actually do more of this in the #120 (mongodb gets a usage comment).

w.r.t the builder pattern, etc. we'll do a follow-up PR after embeddings get merged.
LGTM 🎉

Copy link
Contributor

@0xMochan 0xMochan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments and updates from the embeddings merge are fantastic. I made one comment about the change with un-hardcoding of the table names and such, w.r.t sql injections. Once that is fixed, this should be ready to merge!

rig-sqlite/src/lib.rs Show resolved Hide resolved
@mateobelanger mateobelanger added this to the v0.5 milestone Dec 2, 2024
@cvauclair cvauclair merged commit 5378796 into 0xPlaygrounds:main Dec 3, 2024
3 of 4 checks passed
@github-actions github-actions bot mentioned this pull request Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Add support for Sqlite vector store
4 participants