
doc: Add documentation+FAQs for Parquet DataSource #153

Conversation

Meghajit (Member)

PR for #109

@Meghajit Meghajit self-assigned this May 23, 2022
@Meghajit Meghajit linked an issue May 23, 2022 that may be closed by this pull request
@Meghajit Meghajit marked this pull request as ready for review May 30, 2022 11:10
@Meghajit (Member Author)

Added the configurations plus some common gotchas for setting them up, along with some FAQs and instructions on how to create the stream config for Parquet. I haven't touched the diagrams yet, as they will require some effort to be redrawn since the original resources are missing. However, I have changed the descriptions in relevant places.


> or data partitioned, such as:
@kevinbheda (Contributor), Jun 6, 2022

Small typo here; I believe we meant "date partitioned".

@Meghajit (Member Author)
Fixed via commit 31f5ec3

> - A Stream defines a logical grouping of a data source and its associated [`protobuf`](https://developers.google.com/protocol-buffers)
> schema. All data produced by a source follows the protobuf schema. The source can be a bounded one such as `KAFKA_SOURCE` or `KAFKA_CONSUMER`
> in which case, a single stream can consume from one or more topics all sharing the same schema. Otherwise, the source
> can be an unbounded one such as `PARQUET_SOURCE` in which case, one or more parquet files as provided are consumed in a single stream.
@Meghajit (Member Author)

Kafka is an unbounded data source and Parquet is a bounded one. Fixing this as well.
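For context, the distinction matters when writing the stream config this PR documents: a bounded source reads a fixed set of files and terminates, while an unbounded source consumes indefinitely. A minimal sketch of what a Parquet-source stream entry might look like is below. The key names, bucket path, and proto class here are illustrative assumptions for this sketch, not authoritative; refer to the merged documentation for the exact configuration keys.

```json
[
  {
    "SOURCE_DETAILS": [
      {
        "SOURCE_TYPE": "BOUNDED",
        "SOURCE_NAME": "PARQUET_SOURCE"
      }
    ],
    "SOURCE_PARQUET_FILE_PATHS": ["gs://example-bucket/bookings/dt=2022-05-01/"],
    "INPUT_SCHEMA_PROTO_CLASS": "com.example.BookingLogMessage",
    "INPUT_SCHEMA_TABLE": "booking"
  }
]
```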

@kevinbheda (Contributor)
good catch

@Meghajit (Member Author), Jun 6, 2022
Fixed via commit 91a73a5

@kevinbheda kevinbheda merged commit d97e37e into raystack:dagger-parquet-file-processing Jun 6, 2022
Successfully merging this pull request may close these issues.

doc: Add documentation+FAQs for Parquet DataSource
2 participants