Is your feature request related to a problem or challenge?
I'd like to use a custom ParquetFileReaderFactory to return impls of the AsyncFileReader so I can cache hot parquet files locally.
My general plan is to build an impl of AsyncFileReader that first checks the cache and then falls back to the default object-store-based impl. The ParquetFileReaderFactory makes this possible by providing a mechanism to return custom AsyncFileReaders. However, ParquetFormat and its options do not have a mechanism to specify a custom impl of ParquetFileReaderFactory.
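To make the cache-then-fall-back idea concrete, here is a minimal sketch using only the standard library. Everything in it is hypothetical: `RangeReader` is a simplified, synchronous stand-in for AsyncFileReader (the real trait is async and also serves metadata), `RemoteReader` stands in for the default object-store-based reader, and the cache is an in-memory map rather than local disk.

```rust
use std::collections::HashMap;
use std::ops::Range;
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified stand-in for `AsyncFileReader`
/// (synchronous, byte ranges only).
trait RangeReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Result<Vec<u8>, String>;
}

/// Hypothetical stand-in for the default object-store-backed reader.
/// `fetches` counts how often the "remote" side is actually hit.
struct RemoteReader {
    data: Vec<u8>,
    fetches: usize,
}

impl RangeReader for RemoteReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.fetches += 1;
        self.data
            .get(range.clone())
            .map(|slice| slice.to_vec())
            .ok_or_else(|| format!("range {:?} out of bounds", range))
    }
}

/// Shared cache keyed by (path, start, end). An assumption for this
/// sketch; a real cache would likely live on local disk.
type Cache = Arc<Mutex<HashMap<(String, usize, usize), Vec<u8>>>>;

/// Checks the cache first and falls back to the wrapped reader on a miss.
struct CachingReader<R: RangeReader> {
    path: String,
    cache: Cache,
    inner: R,
}

impl<R: RangeReader> RangeReader for CachingReader<R> {
    fn get_bytes(&mut self, range: Range<usize>) -> Result<Vec<u8>, String> {
        let key = (self.path.clone(), range.start, range.end);
        // The lock guard is released at the end of this statement.
        let cached = self.cache.lock().unwrap().get(&key).cloned();
        if let Some(hit) = cached {
            return Ok(hit);
        }
        // Cache miss: read through the fallback, then remember the bytes.
        let bytes = self.inner.get_bytes(range)?;
        self.cache.lock().unwrap().insert(key, bytes.clone());
        Ok(bytes)
    }
}
```

A custom factory would then hand out `CachingReader`-style wrappers instead of the plain reader. Failed reads are not cached in this sketch, so a later retry still goes to the fallback.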
Describe the solution you'd like
A possible solution is to add a file_reader_factory field to ParquetOptions that specifies the factory to use. However, from my limited reading of the configuration system, config options need to serialize to/from strings, which would not work in this case.
Is there another place or mechanism we can leverage?
The key bit of code that I expect would need to change is the create_physical_plan implementation of FileFormat for ParquetFormat. This method creates the ParquetExecBuilder, but there is no logic to set its file_reader_factory. Working backwards from this code leads to ParquetOptions as a possible way to configure the builder, but that has the problems already mentioned.
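One shape this could take is for the format itself, rather than the string-serializable options, to carry an optional factory and pass it to the builder. The sketch below illustrates that wiring with standard-library types only. All names (`ReaderFactory`, `Options`, `ExecBuilder`, `Format`, `with_reader_factory`) are hypothetical stand-ins and are not the actual DataFusion API.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `ParquetFileReaderFactory`.
trait ReaderFactory: Send + Sync {
    fn name(&self) -> &str;
}

struct DefaultFactory;
impl ReaderFactory for DefaultFactory {
    fn name(&self) -> &str {
        "default-object-store"
    }
}

struct CachingFactory;
impl ReaderFactory for CachingFactory {
    fn name(&self) -> &str {
        "local-cache"
    }
}

/// Stand-in for the string-serializable options; no trait objects here.
#[derive(Clone, Default)]
struct Options {
    pushdown_filters: bool,
}

/// Stand-in for the physical plan node produced by the builder.
struct Exec {
    pushdown_filters: bool,
    factory: Arc<dyn ReaderFactory>,
}

/// Stand-in for the exec builder, which already accepts a factory.
struct ExecBuilder {
    options: Options,
    factory: Option<Arc<dyn ReaderFactory>>,
}

impl ExecBuilder {
    fn new(options: Options) -> Self {
        Self { options, factory: None }
    }

    fn with_reader_factory(mut self, factory: Arc<dyn ReaderFactory>) -> Self {
        self.factory = Some(factory);
        self
    }

    fn build(self) -> Exec {
        // Fall back to the default factory when none was supplied.
        let factory: Arc<dyn ReaderFactory> = match self.factory {
            Some(f) => f,
            None => Arc::new(DefaultFactory),
        };
        Exec { pushdown_filters: self.options.pushdown_filters, factory }
    }
}

/// Stand-in for the format: the factory sits next to the options,
/// so it never has to round-trip through strings.
#[derive(Default)]
struct Format {
    options: Options,
    reader_factory: Option<Arc<dyn ReaderFactory>>,
}

impl Format {
    fn with_reader_factory(mut self, factory: Arc<dyn ReaderFactory>) -> Self {
        self.reader_factory = Some(factory);
        self
    }

    fn create_physical_plan(&self) -> Exec {
        let mut builder = ExecBuilder::new(self.options.clone());
        // The missing step described above: forward the factory if one is set.
        if let Some(factory) = &self.reader_factory {
            builder = builder.with_reader_factory(Arc::clone(factory));
        }
        builder.build()
    }
}
```

With this layout, `Format::default().with_reader_factory(Arc::new(CachingFactory))` produces plans that use the caching factory, while a plain `Format::default()` keeps the existing default behaviour.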
Describe alternatives you've considered
As my real goal is to be able to locally cache parquet files, I could perhaps do this via the ObjectStore trait. However, that trait has a large surface compared to the AsyncFileReader trait. When I asked a similar question in Discord, I was directed away from that solution and towards using a catalog to manage the TableProviders more directly. In exploring that direction, it seems that the simplest and most appropriate trait to leverage is the AsyncFileReader trait.
Open to other ideas as well.
Additional context
No response