✨ feat(Destination PGVector): new connector #45428
Conversation
Working on these ones: destination-pgvector - ❌ Failed - Connectors must have user facing documentation: User facing documentation file is missing. Please create it under ./docs/integrations/s/.md.
[tool.poetry.dependencies.airbyte-cdk]
# Version constrained by PyAirbyte (`airbyte`) version
version = ">=1.0"
extras = ["vector-db-based"]
Is there any reason not to use the latest (^5) version of CDK?
I can see that pyairbyte uses 4.6.2 https://github.com/airbytehq/PyAirbyte/blob/main/pyproject.toml#L18
With the latest (^5) version of the CDK, pandas would be a problem: 2.2.0 introduced a change that broke SQLAlchemy interop.
We can still use a <5.0.0 CDK (see the image below), but I'm not sure it's worth the effort to move to the latest CDK now, before the pandas issue is resolved; there is still some stuff to fix even with the versions that don't conflict, and we want to ship a version of the connector soon. I could add this to the brainstorm list to work on in subsequent iterations.
This is also something to do in destination-snowflake-cortex (our base for pgvector) when we have time to pick it up.
from airbyte_protocol.models import DestinationSyncMode

if TYPE_CHECKING:
    from airbyte_protocol.models import ConfiguredAirbyteCatalog, ConfiguredAirbyteStream
It is better not to import directly from the underlying airbyte_protocol package; use the CDK's re-exported models instead.
Suggested change: replace

from airbyte_protocol.models import DestinationSyncMode

if TYPE_CHECKING:
    from airbyte_protocol.models import ConfiguredAirbyteCatalog, ConfiguredAirbyteStream

with

from airbyte_cdk.models import DestinationSyncMode, ConfiguredAirbyteCatalog, ConfiguredAirbyteStream
I fixed the imports, thanks @artem1205
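For reference, a minimal sketch of what the consolidated imports could look like after the fix, assuming the TYPE_CHECKING guard is kept for the type-only names (all class names come from the suggestion above):

from __future__ import annotations

from typing import TYPE_CHECKING

# Runtime import routed through the CDK's re-exported models rather than airbyte_protocol directly.
from airbyte_cdk.models import DestinationSyncMode

if TYPE_CHECKING:
    # Type-only imports, also routed through airbyte_cdk.models.
    from airbyte_cdk.models import ConfiguredAirbyteCatalog, ConfiguredAirbyteStream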
[tool.poetry.dependencies]
python = "^3.9,<3.12"

airbyte = "^0.12.0"
I'm just curious: why do we need both the airbyte and airbyte-cdk packages to build a destination? If there are missing parts, should we move components or modules from airbyte to airbyte-cdk?
These are the places where the connector imports from the airbyte (PyAirbyte) package (a sketch of how the two packages divide responsibilities follows the list):

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/pgvector_processor.py
from airbyte._processors.file.jsonl import JsonlWriter
from airbyte.secrets import SecretString
airbyte-integrations/connectors/destination-pgvector/destination_pgvector/common/sql/sql_processor.py
from airbyte._util.name_normalizers import LowerCaseNormalizer
from airbyte.constants import AB_EXTRACTED_AT_COLUMN, AB_META_COLUMN, AB_RAW_ID_COLUMN, DEBUG_MODE
from airbyte.progress import progress
from airbyte.strategies import WriteStrategy
from airbyte.types import SQLTypeConverter
from airbyte._batch_handles import BatchHandle
from airbyte._processors.file.base import FileWriterBase
from airbyte.secrets.base import SecretString
airbyte-integrations/connectors/destination-pgvector/destination_pgvector/destination.py
from airbyte.secrets import SecretString
from airbyte.strategies import WriteStrategy
airbyte-integrations/connectors/destination-pgvector/destination_pgvector/common/destinations/record_processor.py
from airbyte import exceptions as exc
from airbyte.strategies import WriteStrategy
from airbyte._batch_handles import BatchHandle
airbyte-integrations/connectors/destination-pgvector/unit_tests/destination_test.py
from airbyte.strategies import WriteStrategy
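To make the split concrete, here is a minimal, hypothetical sketch (not code from this PR): airbyte-cdk supplies the destination entry point and protocol models, while the airbyte (PyAirbyte) package supplies helpers such as SecretString and WriteStrategy from the list above. The class name, config keys, and method body are illustrative assumptions only.

# Hypothetical illustration; only the airbyte.* import paths are taken from the file list above.
from __future__ import annotations

from typing import Any, Iterable, Mapping

from airbyte_cdk.destinations import Destination         # CDK: destination base class
from airbyte_cdk.models import ConfiguredAirbyteCatalog   # CDK: protocol models

from airbyte.secrets import SecretString                  # PyAirbyte: secret wrapper
from airbyte.strategies import WriteStrategy               # PyAirbyte: write strategies


class PgVectorDestinationSketch(Destination):
    """Skeleton only; check(), spec(), and the real write path are omitted."""

    def write(
        self, config: Mapping[str, Any], configured_catalog: ConfiguredAirbyteCatalog, input_messages: Iterable[Any]
    ) -> Iterable[Any]:
        password = SecretString(config["credentials"]["password"])  # assumed config shape
        strategy = WriteStrategy.AUTO  # let the SQL processor choose append vs. merge
        # ... hand records off to a processor built on the PyAirbyte helpers above ...
        yield from ()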
The goal would be to move the SQLProcessor logic (and other classes/functions) into the CDK. But that will take a bit more time, I think.
I'm asking because I have concerns about a circular import with PyAirbyte.
Should we incorporate airbyte into the CDK before publishing the connector?
I don't think it would be feasible without a larger project, but I think we will be able to propose it as a priority in the next cycle. Especially since we are trying to establish a more formal path for destinations support for our partners, this work will become increasingly important.
And to clarify: it isn't a circular reference per se - but it is redundant/duplicative and we should refactor when we have a chance.
/approve-regression-tests
/approve-regression-tests
Looks good!
And tested successfully here:
/approve-regression-tests
What
We don't have a connector for pgvector databases. pgvector is a big player, so we should support it.
How
Build a new destination connector for PGVector.
Review guide
Well, this is a long PR with a bunch of new files, so here is a walkthrough of the connector functionality in case it gives you more context on what we are building.
airbyte-integrations/connectors/destination-pgvector/destination_pgvector/pgvector_processor.py: You can probably start with this one and jump to the others. :)

User Impact
People can start loading data into a Postgres database with pgvector support.
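As a purely hypothetical illustration of that impact (not part of this PR), once records with embeddings land in Postgres they can be queried with pgvector's distance operators. The table and column names below (document_embeddings, embedding, document_content), the vector dimensionality, and the connection string are all assumptions:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Connection details and schema are placeholders.
conn = psycopg.connect("postgresql://user:password@localhost:5432/postgres")
register_vector(conn)  # register the pgvector type adapter for this connection

# Stand-in for a real query embedding; must match the stored vector dimensionality.
query_embedding = np.random.rand(1536).astype(np.float32)

rows = conn.execute(
    "SELECT document_content FROM document_embeddings "
    "ORDER BY embedding <-> %s LIMIT 5",  # <-> is pgvector's L2 distance operator
    (query_embedding,),
).fetchall()
conn.close()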
Can this PR be safely reverted and rolled back?