Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ feat(Destination PGVector): new connector #45428

Merged
merged 30 commits into from
Sep 16, 2024

Conversation

aldogonzalez8
Copy link
Contributor

What

We don't have a connector for pgvector databases. PGvector is a big player and so we should support them.

How

Build a new destination connector for PGVector.

Review guide

Well, This a long PR with a bunch of new files so here is a walkthrough of the connector functionality in case it gives you more context on what we are building.

  1. airbyte-integrations/connectors/destination-pgvector/destination_pgvector/pgvector_processor.py: You can probably start with this one and jump to others :)

User Impact

People can start dropping data to Postgres DB with PGvector support.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@aldogonzalez8 aldogonzalez8 self-assigned this Sep 12, 2024
Copy link

vercel bot commented Sep 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 16, 2024 4:14pm

@aldogonzalez8
Copy link
Contributor Author

Working on these ones

destination-pgvector - ❌ Failed - Connectors must have user facing documentation: User facing documentation file is missing. Please create it under ./docs/integrations/s/.md.
destination-pgvector - ❌ Failed - Connectors must have a changelog entry for each version: Could not check changelog entry as the documentation file is missing. Please create it..
destination-pgvector - ❌ Failed - Connectors must have valid metadata.yaml file: Metadata file is invalid: Validation error: Could not find docs/integrations/destinations/pgvector.md..

Comment on lines +19 to +22
[tool.poetry.dependencies.airbyte-cdk]
# Version cnstrained by PyAirbyte (`airbyte`) version
version = ">=1.0"
extras = ["vector-db-based"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason not to use the latest (^5) version of CDK?

I can see that pyairbyte uses 4.6.2 https://github.com/airbytehq/PyAirbyte/blob/main/pyproject.toml#L18

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest (^5) version of CDK Pandas would be a problem as 2.2.0 introduced some change that broke SQLAlchemy interop.

We still can <5.0.0 CDK (see the below image), but not sure if is worth making an effort now till the pandas issue is resolved to have the latest CDK as there is still some stuff to fix with these versions that didn't conflict and we wish to ship a version of the connector soon. I could add this to the brainstorm list to work on subsequent iterations.

image
image
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also something to do in destination-snowflake-cortex (our base for pgvector) when we have time to pick this up.

Comment on lines 14 to 17
from airbyte_protocol.models import DestinationSyncMode

if TYPE_CHECKING:
from airbyte_protocol.models import ConfiguredAirbyteCatalog, ConfiguredAirbyteStream
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is better not to use direct import of CDK-related packages

Suggested change
from airbyte_protocol.models import DestinationSyncMode
if TYPE_CHECKING:
from airbyte_protocol.models import ConfiguredAirbyteCatalog, ConfiguredAirbyteStream
from airbyte_cdk.models import DestinationSyncMode, ConfiguredAirbyteCatalog, ConfiguredAirbyteStream

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[tool.poetry.dependencies]
python = "^3.9,<3.12"

airbyte = "^0.12.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just curious — why do we need both the airbyte and airbyte-cdk packages to build a destination?
If there are missing parts, should we move components or modules from airbyte to airbyte-cdk?

Copy link
Contributor Author

@aldogonzalez8 aldogonzalez8 Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/pgvector_processor.py

from airbyte._processors.file.jsonl import JsonlWriter
from airbyte.secrets import SecretString

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/common/sql/sql_processor.py

from airbyte._util.name_normalizers import LowerCaseNormalizer
from airbyte.constants import AB_EXTRACTED_AT_COLUMN, AB_META_COLUMN, AB_RAW_ID_COLUMN, DEBUG_MODE
from airbyte.progress import progress
from airbyte.strategies import WriteStrategy
from airbyte.types import SQLTypeConverter
    from airbyte._batch_handles import BatchHandle
    from airbyte._processors.file.base import FileWriterBase
    from airbyte.secrets.base import SecretString

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/destination.py

from airbyte.secrets import SecretString
from airbyte.strategies import WriteStrategy

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/common/destinations/record_processor.py

from airbyte import exceptions as exc
from airbyte.strategies import WriteStrategy
from airbyte._batch_handles import BatchHandle

airbyte-integrations/connectors/destination-pgvector/unit_tests/destination_test.py

from airbyte.strategies import WriteStrategy

Copy link
Collaborator

@aaronsteers aaronsteers Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal would be to move the SQLProcessor logic (and other classes/functions) into the CDK. But that will take a bit more time, I think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asking because I have concerns about circular import of pyairbyte.

Should we incorporate airbyte inside CDK before publishing the connector?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it would be feasible without a larger project, but I think we will be able to propose as a priority in next cycle. Especially since we are trying to get a more formal path for destinations support for our partners, this work will become increasingle important.

And to clarify: it isn't a circular reference per se - but it is redundant/duplicative and we should refactor when we have a chance.

@aldogonzalez8 aldogonzalez8 changed the title Destination pgvector/new start ✨ feat(Destination PGVector): new connector Sep 14, 2024
@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Sep 14, 2024

/approve-regression-tests

Check job output.

✅ Approving regression tests

@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Sep 15, 2024

/approve-regression-tests

Check job output.

✅ Approving regression tests

Copy link
Collaborator

@aaronsteers aaronsteers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Sep 16, 2024

/approve-regression-tests

Check job output.

✅ Approving regression tests

@aldogonzalez8 aldogonzalez8 merged commit 9ae2cbe into master Sep 16, 2024
34 checks passed
@aldogonzalez8 aldogonzalez8 deleted the destination-pgvector/new-start branch September 16, 2024 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/destination/pgvector
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants