VectorDBCloud/Open-Source-Embedding-Cookbook

This repository contains a collection of Python scripts demonstrating how to use open-source embeddings with various vector databases. These cookbooks provide practical examples for data ingestion and similarity search using popular vector databases.

Vector databases are specialized database systems designed to store and query high-dimensional vectors efficiently. They are crucial for machine learning applications, particularly in natural language processing and computer vision.

Table of Contents

  1. About Vector Database Cloud
  2. Introduction
  3. Supported Vector Databases
  4. Prerequisites
  5. Installation
  6. Dependencies
  7. Usage
  8. Cookbooks
  9. Customization
  10. Best Practices
  11. Troubleshooting
  12. Contributing
  13. Related Resources
  14. License
  15. Disclaimer

About Vector Database Cloud

Vector Database Cloud is a platform that provides one-click cloud deployment of popular vector databases, including Qdrant, Milvus, ChromaDB, and pgvector. The platform offers a secure API, a customer dashboard, efficient vector search, and real-time monitoring.

Introduction

Vector Database Cloud is designed to seamlessly integrate with your existing data workflows. Whether you're working with structured data, unstructured data, or high-dimensional vectors, you can leverage popular ETL (Extract, Transform, Load) tools to streamline the process of moving data into and out of Vector Database Cloud.

Supported Vector Databases

  • pgvector
  • Milvus
  • ChromaDB
  • Qdrant

Prerequisites

  • Python 3.7+
  • Access to Vector Database Cloud (VectorDBCloud), with an API URL and API key for each database you plan to use

Installation

  1. Clone this repository:

    git clone https://github.com/VectorDBCloud/Open-Source-Embedding-Cookbook.git
    cd Open-Source-Embedding-Cookbook
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    

Dependencies

The requirements.txt file includes the following main dependencies:

  • sentence-transformers
  • psycopg2-binary
  • pymilvus
  • chromadb
  • qdrant-client

Usage

Each cookbook is a standalone Python script demonstrating how to:

  • Connect to the respective vector database
  • Use open-source embeddings (Sentence Transformers with the 'all-MiniLM-L6-v2' model)
  • Insert sample data with embeddings
  • Perform similarity searches

Before running any script, set the appropriate environment variables:

export VECTORDBCLOUD_<DATABASE>_API_URL="https://your-vector-db-cloud-url.com"
export VECTORDBCLOUD_<DATABASE>_API_KEY="your-api-key"

Replace <DATABASE> with the specific database name (e.g., PGVECTOR, MILVUS, CHROMADB, QDRANT).
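
As a minimal sketch of how a script might read these variables: the helper name `load_credentials` and the `Credentials` type are hypothetical and not taken from the cookbooks, which may read the environment differently.

```python
import os
from typing import Mapping, NamedTuple


class Credentials(NamedTuple):
    api_url: str
    api_key: str


def load_credentials(database: str, env: Mapping[str, str] = os.environ) -> Credentials:
    """Read VECTORDBCLOUD_<DATABASE>_API_URL and _API_KEY for one database.

    Hypothetical helper for illustration; not part of the cookbooks.
    """
    prefix = f"VECTORDBCLOUD_{database.upper()}"
    url_var = f"{prefix}_API_URL"
    key_var = f"{prefix}_API_KEY"

    # Report every missing or empty variable at once instead of failing on the first.
    missing = [name for name in (url_var, key_var) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    return Credentials(api_url=env[url_var], api_key=env[key_var])
```

Calling `load_credentials("qdrant")` would then look up `VECTORDBCLOUD_QDRANT_API_URL` and `VECTORDBCLOUD_QDRANT_API_KEY`, and fail with a clear message if either is unset.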

To run a cookbook:

python <cookbook_name>.py

For example:

python pgvector_cookbook.py

Cookbooks

  1. pgvector_cookbook.py: Demonstrates usage with pgvector
  2. milvus_cookbook.py: Demonstrates usage with Milvus
  3. chromadb_cookbook.py: Demonstrates usage with ChromaDB
  4. qdrant_cookbook.py: Demonstrates usage with Qdrant

Each cookbook includes examples of:

  • Connecting to the database
  • Creating a collection/table
  • Inserting sample data with embeddings
  • Performing a similarity search
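
Conceptually, the similarity search step ranks stored vectors by how close they are to a query vector. The sketch below shows that idea as a brute-force cosine-similarity search in plain Python over hand-made 3-dimensional vectors. It is an illustration only, not the code the cookbooks run: there, the vectors come from the embedding model and the ranking is done by the database. The function names and toy data are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    items: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(text, cosine_similarity(query, vector)) for text, vector in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (made up for the example, not produced by a real model).
documents = [
    ("cats", [1.0, 0.0, 0.0]),
    ("dogs", [0.9, 0.1, 0.0]),
    ("stock markets", [0.0, 0.0, 1.0]),
]

results = top_k([1.0, 0.0, 0.0], documents, k=2)
for text, score in results:
    print(f"{text}: {score:.3f}")
```

A vector database answers the same kind of query, but typically uses an index to avoid comparing the query against every stored vector.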

Customization

To adapt these scripts for your own use case:

  1. Replace the sample data with your own dataset.
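  2. Change the embedding model if needed (the scripts use 'all-MiniLM-L6-v2', which produces 384-dimensional vectors).
  3. Modify the schema or collection structure to fit your data requirements; the declared vector dimension must match the embedding model's output.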
  4. Customize the similarity search query and parameters as per your needs.
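
When swapping the embedding model, it helps to read the vector dimension from the model rather than hard-coding it. A minimal sketch, assuming the `sentence-transformers` package from requirements.txt is installed: the helper names `load_model_and_dimension` and `check_dimension` are hypothetical, and how the dimension is passed to each database differs per cookbook.

```python
from typing import Sequence


def load_model_and_dimension(model_name: str = "all-MiniLM-L6-v2"):
    """Load a Sentence Transformers model and report its output dimension.

    Hypothetical helper; the model is downloaded on first use.
    """
    # Imported lazily so the rest of this module works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    dimension = model.get_sentence_embedding_dimension()
    return model, dimension


def check_dimension(vector: Sequence[float], expected: int) -> None:
    """Fail early if a vector does not match the dimension declared in the schema."""
    if len(vector) != expected:
        raise ValueError(
            f"Vector has {len(vector)} dimensions, but the schema expects {expected}"
        )
```

A typical use would be `model, dim = load_model_and_dimension()`, creating the table or collection with `dim`, and calling `check_dimension(vector, dim)` before each insert.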

Best Practices

When working with vector databases and embeddings, consider the following best practices:

  1. Choose the right embedding model: Select an embedding model that's appropriate for your data type and use case.

  2. Normalize your vectors: Scale vectors to unit length so that dot-product and cosine similarity produce the same ranking and scores stay comparable across queries.

  3. Use appropriate index types: Choose the right index type for your specific use case to optimize search performance.

  4. Batch operations: When inserting or querying large amounts of data, use batch operations to improve efficiency.

  5. Monitor performance: Regularly monitor and optimize your database performance, especially as your data grows.

  6. Keep your embeddings up to date: Regenerate your embeddings when your data changes or when you switch to an improved embedding model.

  7. Implement error handling: Robust error handling can help prevent data loss and improve the reliability of your applications.

  8. Secure your API keys: Always keep your Vector Database Cloud API keys secure and never expose them in client-side code.
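
Practices 2 and 4 can be sketched in plain Python as follows. The helper names `normalize` and `batched` are illustrative, and the batch size of 2 is an arbitrary value for the demo; a suitable size depends on the database and on the size of each record. Sentence Transformers can also return unit-length vectors directly via `model.encode(texts, normalize_embeddings=True)`.

```python
import math
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return list(vector)
    return [x / norm for x in vector]


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


unit = normalize([3.0, 4.0])
batches = list(batched(["a", "b", "c", "d", "e"], 2))
print(unit)
print(batches)
```

Each batch would then be passed to a single insert call instead of issuing one request per record.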

Related Resources

Contributing

We welcome contributions to improve and expand our Open-Source Embedding Cookbook! Here's how you can contribute:

  1. Fork the repository: Create your own fork of the code.

  2. Create a new branch: Make your changes in a new git branch.

  3. Make your changes: Enhance existing cookbooks or add new ones.

  4. Follow the style guidelines: Ensure your code follows our coding standards.

  5. Write clear commit messages: Your commit messages should clearly describe the changes you've made.

  6. Submit a pull request: Open a new pull request with your changes.

  7. Respond to feedback: Be open to feedback and make necessary adjustments to your pull request.

For more detailed information on contributing, please refer to our Contribution Guidelines.

We also encourage you to:

  • Report bugs and issues through our Issue Tracker.
  • Suggest new features or improvements.
  • Help improve documentation.
  • Share your experiences and use cases with the community.

Remember, all contributors are expected to adhere to our Code of Conduct. We appreciate your efforts to make this project better for everyone!

Troubleshooting

If you encounter issues:

  1. Ensure all environment variables are correctly set.
  2. Check your internet connection for API access.
  3. Verify that you have the correct permissions for the Vector Database Cloud services.
  4. Make sure all dependencies are correctly installed.

For specific error messages, please refer to the documentation of the respective vector database or create an issue in this repository.
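
A quick preflight check can rule out the most common setup problems before a cookbook is run. The sketch below uses only the standard library; the function names are hypothetical, and the import names are the ones the packages in requirements.txt are normally imported under.

```python
import importlib.util
import sys
from typing import Dict, List, Tuple

# pip package name -> module name used in "import ..."
REQUIRED_MODULES: Dict[str, str] = {
    "sentence-transformers": "sentence_transformers",
    "psycopg2-binary": "psycopg2",
    "pymilvus": "pymilvus",
    "chromadb": "chromadb",
    "qdrant-client": "qdrant_client",
}


def python_is_supported(minimum: Tuple[int, int] = (3, 7)) -> bool:
    """Check the interpreter against the Python 3.7+ prerequisite."""
    return sys.version_info >= minimum


def missing_packages(modules: Dict[str, str] = REQUIRED_MODULES) -> List[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


print("Python version supported:", python_is_supported())
print("Missing packages:", missing_packages())
```

If the second line lists any packages, re-running `pip install -r requirements.txt` in the active environment is the usual fix.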

License

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Copyright (c) 2024 Vector Database Cloud

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

  • Attribution — You must give appropriate credit to Vector Database Cloud, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests Vector Database Cloud endorses you or your use.

Additionally, we require that any use of this guide includes visible attribution to Vector Database Cloud. This attribution should be in the form of "Open Source Embedding curated by Vector Database Cloud" or "Based on Vector Database Cloud Open Source Embedding", along with a link to https://vectordbcloud.com, in any public-facing applications, documentation, or redistributions of this guide.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

For the full license text, visit: https://creativecommons.org/licenses/by/4.0/legalcode

Disclaimer

The information and resources provided in this community repository are for general informational purposes only. While we strive to keep the information up-to-date and correct, we make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information, products, services, or related graphics contained in this repository for any purpose. Any reliance you place on such information is therefore strictly at your own risk.

Vector Database Cloud configurations may vary, and it's essential to consult the official documentation before implementing any solutions or suggestions found in this community repository. Always follow best practices for security and performance when working with databases and cloud services.

The content in this repository may change without notice. Users are responsible for ensuring they are using the most current version of any information or code provided.

This disclaimer applies to Vector Database Cloud, its contributors, and any third parties involved in creating, producing, or delivering the content in this repository.

The use of any information or code in this repository may carry inherent risks, including but not limited to data loss, system failures, or security vulnerabilities. Users should thoroughly test and validate any implementations in a safe environment before deploying to production systems.

For complex implementations or critical systems, we strongly recommend seeking advice from qualified professionals or consulting services.

By using this repository, you acknowledge and agree to this disclaimer. If you do not agree with any part of this disclaimer, please do not use the information or resources provided in this repository.
