Add mechanism to allow using custom ParquetFileReaderFactory from the ParquetFormat options #13773

Open
nathanielc opened this issue Dec 13, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@nathanielc

Is your feature request related to a problem or challenge?

I'd like to use a custom ParquetFileReaderFactory to return impls of the AsyncFileReader so I can cache hot parquet files locally.

My general plan is to build an impl of AsyncFileReader that first checks the cache and then falls back to the default object-store-based impl. The ParquetFileReaderFactory makes this possible by providing a mechanism to return custom AsyncFileReaders. However, ParquetFormat and its options have no mechanism to specify a custom impl of ParquetFileReaderFactory.
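To make the cache-then-fall-back plan concrete, here is a minimal std-only sketch. Everything in it is a hypothetical stand-in: `RangeReader` is a simplified, synchronous substitute for `AsyncFileReader` (the real trait is async and also exposes metadata access), and `RemoteReader` plays the role of the default object-store-based reader. The cache is an in-memory map rather than local disk, purely to keep the example short.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical, synchronous stand-in for `AsyncFileReader`.
pub trait RangeReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Result<Vec<u8>, String>;
}

/// Hypothetical stand-in for the default object-store-backed reader.
/// `fetches` counts how often the "remote" side is actually hit.
pub struct RemoteReader {
    pub data: Vec<u8>,
    pub fetches: usize,
}

impl RangeReader for RemoteReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.fetches += 1;
        self.data
            .get(range.clone())
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| format!("range {:?} out of bounds", range))
    }
}

/// Wraps any reader and serves repeated range requests from a local cache.
pub struct CachingReader<R: RangeReader> {
    cache: HashMap<(usize, usize), Vec<u8>>,
    fallback: R,
}

impl<R: RangeReader> CachingReader<R> {
    pub fn new(fallback: R) -> Self {
        Self { cache: HashMap::new(), fallback }
    }

    /// Exposes the wrapped reader, e.g. to inspect its fetch counter.
    pub fn fallback(&self) -> &R {
        &self.fallback
    }
}

impl<R: RangeReader> RangeReader for CachingReader<R> {
    fn get_bytes(&mut self, range: Range<usize>) -> Result<Vec<u8>, String> {
        let key = (range.start, range.end);
        // Cache hit: return a copy without touching the fallback reader.
        if let Some(hit) = self.cache.get(&key) {
            return Ok(hit.clone());
        }
        // Cache miss: read through the fallback, then remember the bytes.
        let bytes = self.fallback.get_bytes(range)?;
        self.cache.insert(key, bytes.clone());
        Ok(bytes)
    }
}
```

Requesting the same range twice through `CachingReader` reaches the wrapped reader only once. Failed reads are not cached, so an error from the fallback is retried on the next call.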

Describe the solution you'd like

A possible solution is to add a file_reader_factory field to ParquetOptions that specifies the factory to use. However, from my limited reading of the configuration system, config options need to serialize to/from strings, which would not work in this case.

Is there another place or mechanism we can leverage?

The key bit of code that I expect would need to change is the create_physical_plan implementation of FileFormat on ParquetFormat. In this method the ParquetExecBuilder is created, but there is no logic to set its file_reader_factory. Working backwards from this code leads to ParquetOptions as a possible way to configure the builder, but that has the problems already mentioned.
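One shape this could take, sketched with stand-in types only: the format struct itself carries an optional factory (outside the string-serialized options) and hands it to the builder inside create_physical_plan. `Format`, `ExecBuilder`, `Exec`, `ReaderFactory`, and every method name below are hypothetical simplifications, not the actual DataFusion API, and this is only an illustration of the plumbing being asked about.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `ParquetFileReaderFactory`.
pub trait ReaderFactory: Send + Sync {
    fn name(&self) -> &str;
}

pub struct DefaultFactory;
impl ReaderFactory for DefaultFactory {
    fn name(&self) -> &str { "default" }
}

pub struct CachingFactory;
impl ReaderFactory for CachingFactory {
    fn name(&self) -> &str { "caching" }
}

/// Hypothetical stand-in for the physical plan node.
pub struct Exec {
    factory: Arc<dyn ReaderFactory>,
}

impl Exec {
    pub fn factory_name(&self) -> &str { self.factory.name() }
}

/// Hypothetical stand-in for `ParquetExecBuilder`.
pub struct ExecBuilder {
    factory: Option<Arc<dyn ReaderFactory>>,
}

impl ExecBuilder {
    pub fn new() -> Self { Self { factory: None } }

    pub fn with_file_reader_factory(mut self, factory: Arc<dyn ReaderFactory>) -> Self {
        self.factory = Some(factory);
        self
    }

    pub fn build(self) -> Exec {
        // Fall back to the default factory when none was supplied.
        let factory: Arc<dyn ReaderFactory> = match self.factory {
            Some(factory) => factory,
            None => Arc::new(DefaultFactory),
        };
        Exec { factory }
    }
}

/// Hypothetical stand-in for `ParquetFormat`: the factory is a plain field,
/// so it never has to round-trip through string-based config options.
pub struct Format {
    file_reader_factory: Option<Arc<dyn ReaderFactory>>,
}

impl Format {
    pub fn new() -> Self { Self { file_reader_factory: None } }

    pub fn with_file_reader_factory(mut self, factory: Arc<dyn ReaderFactory>) -> Self {
        self.file_reader_factory = Some(factory);
        self
    }

    pub fn create_physical_plan(&self) -> Exec {
        let mut builder = ExecBuilder::new();
        // The missing step described above: forward the factory to the builder.
        if let Some(factory) = &self.file_reader_factory {
            builder = builder.with_file_reader_factory(Arc::clone(factory));
        }
        builder.build()
    }
}
```

With this layout, a format built without a factory behaves as before, while `Format::new().with_file_reader_factory(Arc::new(CachingFactory))` produces a plan that reads through the custom factory.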

Describe alternatives you've considered

As my real goal is to locally cache parquet files, I could perhaps do this via the ObjectStore trait. However, that trait has a large surface compared to the AsyncFileReader trait. When I asked a similar question in Discord, I was directed away from that solution and towards using a catalog to manage the TableProviders more directly. In exploring that direction, it seems that the simplest and most appropriate trait to leverage is AsyncFileReader.

Open to other ideas as well.

Additional context

No response

@nathanielc nathanielc added the enhancement New feature or request label Dec 13, 2024