✨ feat(Destination PGVector): new connector #45428

aldogonzalez8 · 2024-09-12T19:19:39Z

What

We don't have a connector for pgvector databases. PGVector is widely used, so we should support it.

How

Build a new destination connector for PGVector.

Review guide

This is a long PR with a lot of new files, so here is a walkthrough of the connector functionality in case it gives you more context on what we are building.

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/pgvector_processor.py: You can probably start with this one and jump to others :)

User Impact

People can start loading data into Postgres databases that have PGVector support.

Can this PR be safely reverted and rolled back?

YES 💚
NO ❌

vercel · 2024-09-12T19:19:43Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
airbyte-docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Sep 16, 2024 4:14pm

aldogonzalez8 · 2024-09-12T20:13:08Z

Working on these:

destination-pgvector - ❌ Failed - Connectors must have user facing documentation: User facing documentation file is missing. Please create it under ./docs/integrations/s/.md.
destination-pgvector - ❌ Failed - Connectors must have a changelog entry for each version: Could not check changelog entry as the documentation file is missing. Please create it..
destination-pgvector - ❌ Failed - Connectors must have valid metadata.yaml file: Metadata file is invalid: Validation error: Could not find docs/integrations/destinations/pgvector.md..

artem1205 · 2024-09-13T08:14:56Z

airbyte-integrations/connectors/destination-pgvector/pyproject.toml

+[tool.poetry.dependencies.airbyte-cdk]
+# Version constrained by PyAirbyte (`airbyte`) version
+version = ">=1.0"
+extras = ["vector-db-based"]


Is there any reason not to use the latest (^5) version of CDK?

I can see that pyairbyte uses 4.6.2 https://github.com/airbytehq/PyAirbyte/blob/main/pyproject.toml#L18

With the latest (^5) version of the CDK, Pandas would be a problem, as 2.2.0 introduced a change that broke SQLAlchemy interop.

We could still use a <5.0.0 CDK (see the image below), but I'm not sure it is worth the effort now, before the pandas issue is resolved and we can move to the latest CDK. There is still some work needed to fix the versions that didn't conflict, and we want to ship a version of the connector soon. I could add this to the brainstorm list to work on in subsequent iterations.

It is also something to do in destination-snowflake-cortex (our base for pgvector) when we have time to pick this up.
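
For reference on the `^5` notation used in this thread: Poetry documents a caret constraint as allowing updates that do not change the leftmost non-zero component, so `^5` means `>=5.0.0,<6.0.0`. The snippet below is only a simplified sketch of that rule for plain numeric versions with up to three components. The helper names (`parse_version`, `caret_bounds`, `satisfies_caret`) are made up for illustration and are not part of Poetry or the CDK.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse '4.6.2' or '5' into a three-component tuple, padding with zeros."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def caret_bounds(spec: str) -> Tuple[Version, Version]:
    """Return (inclusive lower, exclusive upper) bounds for a caret spec like '^5'."""
    raw = spec.lstrip("^")
    given = [int(p) for p in raw.split(".")]
    lower = parse_version(raw)
    # Bump the leftmost non-zero component; if all are zero, bump the last one given.
    idx = next((i for i, p in enumerate(given) if p != 0), len(given) - 1)
    upper = list(lower)
    upper[idx] += 1
    for j in range(idx + 1, 3):
        upper[j] = 0
    return lower, (upper[0], upper[1], upper[2])


def satisfies_caret(version: str, spec: str) -> bool:
    """Check whether a plain numeric version falls inside the caret range."""
    lower, upper = caret_bounds(spec)
    return lower <= parse_version(version) < upper
```

Under this reading, the 4.6.2 version mentioned above falls outside `^5`, while the `>=1.0` constraint in the diff accepts it.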

artem1205 · 2024-09-13T08:17:16Z

...ons/connectors/destination-pgvector/destination_pgvector/common/catalog/catalog_providers.py

+from airbyte_protocol.models import DestinationSyncMode
+
+if TYPE_CHECKING:
+    from airbyte_protocol.models import ConfiguredAirbyteCatalog, ConfiguredAirbyteStream


It is better not to import CDK-related packages such as airbyte_protocol directly; import the models through airbyte_cdk instead.

Suggested change

-from airbyte_protocol.models import DestinationSyncMode
-
-if TYPE_CHECKING:
-    from airbyte_protocol.models import ConfiguredAirbyteCatalog, ConfiguredAirbyteStream
+from airbyte_cdk.models import DestinationSyncMode, ConfiguredAirbyteCatalog, ConfiguredAirbyteStream

I fixed the imports, thanks @artem1205
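
For readers less familiar with the pattern in the original diff: names imported under `if TYPE_CHECKING:` are visible to static type checkers only and are not bound at runtime. That is fine for annotations, but not for values the code actually uses. The sketch below shows the behaviour with a standard-library class standing in for the Airbyte models; the function names are hypothetical and not part of the connector.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; this import never executes at runtime.
    from fractions import Fraction


def halve(value: "Fraction") -> "Fraction":
    # String annotations are not evaluated, so the unbound runtime name is harmless.
    return value / 2


def guarded_name_is_bound() -> bool:
    # True only if the guarded import actually ran in this module.
    return "Fraction" in globals()
```

A plain top-level import, as in the suggested change, binds the names at runtime as well, so they can be used both in annotations and in ordinary expressions.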

artem1205 · 2024-09-13T08:28:36Z

airbyte-integrations/connectors/destination-pgvector/pyproject.toml

+[tool.poetry.dependencies]
+python = "^3.9,<3.12"
+
+airbyte = "^0.12.0"


I'm just curious — why do we need both the airbyte and airbyte-cdk packages to build a destination?
If there are missing parts, should we move components or modules from airbyte to airbyte-cdk?

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/pgvector_processor.py

from airbyte._processors.file.jsonl import JsonlWriter
from airbyte.secrets import SecretString

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/common/sql/sql_processor.py

from airbyte._util.name_normalizers import LowerCaseNormalizer
from airbyte.constants import AB_EXTRACTED_AT_COLUMN, AB_META_COLUMN, AB_RAW_ID_COLUMN, DEBUG_MODE
from airbyte.progress import progress
from airbyte.strategies import WriteStrategy
from airbyte.types import SQLTypeConverter
from airbyte._batch_handles import BatchHandle
from airbyte._processors.file.base import FileWriterBase
from airbyte.secrets.base import SecretString

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/destination.py

from airbyte.secrets import SecretString
from airbyte.strategies import WriteStrategy

airbyte-integrations/connectors/destination-pgvector/destination_pgvector/common/destinations/record_processor.py

from airbyte import exceptions as exc
from airbyte.strategies import WriteStrategy
from airbyte._batch_handles import BatchHandle

airbyte-integrations/connectors/destination-pgvector/unit_tests/destination_test.py

from airbyte.strategies import WriteStrategy

The goal would be to move the SQLProcessor logic (and other classes/functions) into the CDK. But that will take a bit more time, I think.

I'm asking because I have concerns about circular import of pyairbyte.

Should we incorporate airbyte inside CDK before publishing the connector?

I don't think it would be feasible without a larger project, but I think we will be able to propose it as a priority in the next cycle. Especially since we are trying to get a more formal path for destinations support for our partners, this work will become increasingly important.

And to clarify: it isn't a circular reference per se - but it is redundant/duplicative and we should refactor when we have a chance.

aldogonzalez8 · 2024-09-14T20:33:34Z

/approve-regression-tests

Check job output.

✅ Approving regression tests

aldogonzalez8 · 2024-09-15T02:47:51Z

/approve-regression-tests

Check job output.

✅ Approving regression tests

aaronsteers

Looks good!

And tested successfully here:

https://colab.research.google.com/drive/1o4xTpnP58lUABMRxsDp0AibhPAtSomRL#scrollTo=BeGeIDzAxbPS&uniqifier=5

aldogonzalez8 · 2024-09-16T16:29:51Z

/approve-regression-tests

Check job output.

✅ Approving regression tests

aaronsteers and others added 23 commits August 16, 2024 11:41

checkpoint: copy in files from destination-snowflake-cortex

ffab5f7

rename base python src folder

e25b9a1

some global renames

edcbe73

refactor/replace SQLConfig and SQLTypeConverter classes

90e482b

refactor of sqlprocessor class

53fd970

updated config.py for PGVector

54faae9

updated Destination class

cf071e7

update main.py

8d082be

update pyproject, poetry and metadata files

16bded5

make common unit test destination_test.py pass

8ca7b35

fix unit test class name

f64279a

fix integration tests

d6bee0c

update acceptance-tests-config, and sample_config and spec in integration_tests folder

f648cb3

Update readme file

afe701e

Update bootstrap file

c6f5614

Update icon to postgres one

207731c

fix command in readme file

769f4e3

Merge branch 'master' into destination-pgvector/new-start

04b09b6

Merge branch 'master' into destination-pgvector/new-start

9adea7a

remove todos from metadata.yaml

47af773

remove todos from pgvector_processor.py

1be7a53

update definitionId in metadata.yaml

a60967b

chore: format code

d1b9fa7

aldogonzalez8 self-assigned this Sep 12, 2024

octavia-squidington-iii added the area/connectors (Connector related issues) and connectors/destination/pgvector labels Sep 12, 2024

aldogonzalez8 requested a review from a team September 12, 2024 20:12

fix(destination-pgvector): fix cli entrypoint

a09b84c

aaronsteers and others added 2 commits September 12, 2024 16:18

chore: enable pypi publish

1de95ed

add pgvector doc

976f9ab

octavia-squidington-iii added the area/documentation (Improvements or additions to documentation) label Sep 12, 2024

vercel bot deployed to Preview September 12, 2024 23:13 View deployment

artem1205 reviewed Sep 13, 2024

aaronsteers and others added 3 commits September 13, 2024 06:35

fix: missing run() function

aee08eb

fix image tag version

aa86c32

fix import of models

2e27767

aldogonzalez8 requested review from artem1205 and aaronsteers September 14, 2024 19:18

aldogonzalez8 changed the title ~~Destination pgvector/new start~~ ✨ feat(Destination PGVector): new connector Sep 14, 2024

aaronsteers approved these changes Sep 15, 2024

Fix release date

589102a

vercel bot deployed to Preview September 16, 2024 16:14 View deployment

aldogonzalez8 merged commit 9ae2cbe into master Sep 16, 2024
34 checks passed

aldogonzalez8 deleted the destination-pgvector/new-start branch September 16, 2024 16:34
