doc: Add documentation+FAQs for Parquet DataSource #153
Changes from 14 commits
9e6e304
7a2eec1
2dfd54a
930e206
36fe8c8
f297fdd
6bf9adf
0d4271d
83fb139
ccbe793
f1ed0d2
33ec343
6402329
f67c832
21f4030
91a73a5
31f5ec3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
@@ -6,11 +6,19 @@ For Stream processing and hence for dagger user must know about some basic conce | |
|
||
### Stream Processing | ||
|
||
`Stream processing` commonly known as `Real-Time processing` lets users process and query continuous streams of unbounded Data which is Kafka events for Dagger. | ||
`Stream processing`, commonly known as `Real-Time processing`, lets users process and query data continuously, at the same | ||
time as it is being produced. The source producing this data can either be a bounded source, such as Parquet files, | ||
or an unbounded source, such as Kafka. | ||
|
||
### Streams | ||
|
||
A group of Kafka topics sharing the same schema define a stream. The schema is defined using [`protobuf`](https://developers.google.com/protocol-buffers). You can have any number of schemas you want but the streaming queries become more complex with the addition of new schemas. | ||
A Stream defines a logical grouping of a data source and its associated [`protobuf`](https://developers.google.com/protocol-buffers) | ||
schema. All data produced by a source follows the protobuf schema. The source can be an unbounded one such as `KAFKA_SOURCE` or `KAFKA_CONSUMER`, | ||
in which case a single stream can consume from one or more topics all sharing the same schema. Otherwise, the source | ||
can be a bounded one such as `PARQUET_SOURCE`, in which case a single stream consumes from one or more of the provided Parquet files. | ||
Review comment: Same as https://github.com/odpf/dagger/pull/153/files#r890081887. Will fix this.
Reply: Fixed via commit 91a73a5 |
||
|
||
Dagger allows the creation of multiple streams, each with its own schema, for use-cases such as SQL joins. However, the SQL | ||
queries become more complex as the number of streams increases. | ||
|
||
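For illustration, whether a stream's source is bounded or unbounded is declared through its `SOURCE_DETAILS` configuration, following the stream configuration example in the quickstart guide (a minimal sketch showing only the source-related keys):

```json
"SOURCE_DETAILS": [
  {
    "SOURCE_TYPE": "UNBOUNDED",
    "SOURCE_NAME": "KAFKA_CONSUMER"
  }
]
```

and, following the same pattern, for a stream reading from Parquet files:

```json
"SOURCE_DETAILS": [
  {
    "SOURCE_TYPE": "BOUNDED",
    "SOURCE_NAME": "PARQUET_SOURCE"
  }
]
```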
### Apache Flink | ||
|
||
|
@@ -6,13 +6,79 @@ This page contains how-to guides for creating a Dagger job and configure it. | |
|
||
Dagger is a stream processing framework built with Apache Flink to process/aggregate/transform protobuf data. To run a dagger in any environment you need to have the following things set up beforehand. | ||
|
||
#### `JDK and Gradle` | ||
### `JDK and Gradle` | ||
|
||
- Java 1.8 and Gradle (5+) need to be installed to run in local mode. Follow this [link](https://www.oracle.com/in/java/technologies/javase/javase-jdk8-downloads.html) to download Java 1.8 in your setup and [this](https://gradle.org/install/) to set up Gradle; a quick verification is sketched below. | ||
|
||
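A quick way to verify the toolchain before running anything (plain shell; the versions are the ones stated above):

```sh
# Verify that Java 1.8 and Gradle 5+ are on the PATH
$ java -version
$ gradle --version
```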
#### `Kafka Cluster` | ||
### `A Source` | ||
|
||
- Dagger use [Kafka](https://kafka.apache.org/) as the source of Data. So you need to set up Kafka(1.0+) either in a local or clustered environment. Follow this [quick start](https://kafka.apache.org/quickstart) to set up Kafka in the local machine. If you have a clustered Kafka you can configure it to use in Dagger directly. | ||
Dagger currently supports 3 kinds of Data Sources. Here are the requirements for each: | ||
|
||
##### `KAFKA_SOURCE` and `KAFKA_CONSUMER` | ||
|
||
Both these sources use [Kafka](https://kafka.apache.org/) as the source of data. So you need to set up Kafka (1.0+) either | ||
in a local or clustered environment. Follow this [quick start](https://kafka.apache.org/quickstart) to set up Kafka on | ||
the local machine. If you already have a clustered Kafka, you can configure Dagger to use it directly. | ||
|
||
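As a minimal local sketch (the commands follow the upstream Kafka quickstart for recent distributions and assume you run them from the extracted Kafka directory; `test-topic` matches the topic name used in the configuration example later in this guide):

```sh
# Start ZooKeeper and a single Kafka broker, each in its own terminal
$ bin/zookeeper-server-start.sh config/zookeeper.properties
$ bin/kafka-server-start.sh config/server.properties

# Create a topic for Dagger to consume from (requires Kafka 2.2+ for --bootstrap-server)
$ bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092
```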
##### `PARQUET_SOURCE` | ||
|
||
This source uses Parquet files as the source of data. The parquet files can be either hourly partitioned, such as | ||
```text | ||
root_folder | ||
  - booking_log | ||
    - dt=2022-02-05 | ||
      - hr=09 | ||
        * g6agdasgd6asdgvadhsaasd829ajs.parquet | ||
        * . . . (more parquet files) | ||
      - (...more hour folders) | ||
    - (... more date folders) | ||
|
||
``` | ||
|
||
or date partitioned, such as: | ||
Review comment: small typo here, I believe we meant date partitioned here.
Reply: Fixed via commit 31f5ec3 |
||
|
||
```text | ||
root_folder | ||
  - shipping_log | ||
    - dt=2021-01-11 | ||
      * hs7hasd6t63eg7wbs8swssdasdasdasda.parquet | ||
      * ...(more parquet files) | ||
    - (... more date folders) | ||
|
||
``` | ||
|
||
The file paths can be either in the local file system or in a GCS bucket. When Parquet files are provided from a GCS bucket, | ||
Dagger will require a `core_site.xml` to be configured in order to connect and read from GCS. A sample `core_site.xml` is | ||
present in Dagger and looks like this: | ||
```xml | ||
<configuration> | ||
<property> | ||
<name>google.cloud.auth.service.account.enable</name> | ||
<value>true</value> | ||
</property> | ||
<property> | ||
<name>google.cloud.auth.service.account.json.keyfile</name> | ||
<value>/Users/dummy/secrets/google_service_account.json</value> | ||
</property> | ||
<property> | ||
<name>fs.gs.requester.pays.mode</name> | ||
<value>CUSTOM</value> | ||
<final>true</final> | ||
</property> | ||
<property> | ||
<name>fs.gs.requester.pays.buckets</name> | ||
<value>my_sample_bucket_name</value> | ||
<final>true</final> | ||
</property> | ||
<property> | ||
<name>fs.gs.requester.pays.project.id</name> | ||
<value>my_billing_project_id</value> | ||
<final>true</final> | ||
</property> | ||
</configuration> | ||
``` | ||
You can look into the official [GCS Hadoop Connectors](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md) | ||
documentation to learn more about how to edit this XML as per your needs. | ||
|
||
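For illustration, a stream backed by such Parquet files could look roughly like the sketch below. Treat the `SOURCE_PARQUET_FILE_PATHS` key and its value format as assumptions to be verified against the [configuration reference](../reference/configuration.md); the remaining keys follow the stream configuration example later in this guide, and the path, table name, and schema class are placeholders:

```properties
STREAMS = [{
  "INPUT_SCHEMA_TABLE": "booking_log",
  "INPUT_SCHEMA_PROTO_CLASS": "com.tests.TestMessage",
  "INPUT_SCHEMA_EVENT_TIMESTAMP_FIELD_INDEX": "41",
  "SOURCE_PARQUET_FILE_PATHS": ["gs://my_sample_bucket_name/booking_log/dt=2022-02-05/"],
  "SOURCE_DETAILS": [
    {
      "SOURCE_TYPE": "BOUNDED",
      "SOURCE_NAME": "PARQUET_SOURCE"
    }]
}]
```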
#### `Flink [optional]` | ||
|
||
|
@@ -26,7 +92,7 @@ Dagger is a stream processing framework built with Apache Flink to process/aggre | |
$ ./gradlew dagger-core:runFlink | ||
``` | ||
|
||
- Tu run the Flink jobs in the local machine with java jar and local properties run the following commands. | ||
- To run the Flink jobs in the local machine with java jar and local properties run the following commands. | ||
|
||
```sh | ||
# Creating a fat jar | ||
|
@@ -36,11 +102,20 @@ $ ./gradlew :dagger-core:fatJar | |
$ java -jar dagger-core/build/libs/dagger-core-<dagger-version>-fat.jar ConfigFile=<filepath> | ||
``` | ||
|
||
#### `Protobuf Data` | ||
|
||
- Dagger exclusively supports [protobuf](https://developers.google.com/protocol-buffers) encoded data i.e. Dagger consumes protobuf data from Kafka topics, do the processing and produces data in protobuf format to a Kafka topic(when the sink is Kafka). | ||
- So you need to push proto data to a Kafka topic to run a dagger. This you can do using any of the Kafka client libraries. Follow this [tutorial](https://www.conduktor.io/how-to-produce-and-consume-protobuf-records-in-apache-kafka/) to produce proto data to a Kafka topic. | ||
- Also you need to define the [java compiled protobuf schema](https://developers.google.com/protocol-buffers/docs/javatutorial) in the classpath or use our in-house schema registry tool like [Stencil](https://github.com/odpf/stencil) to let dagger know about the data schema. Stencil is a event schema registry that provides an abstraction layer for schema handling, schema caching, and dynamic schema updates. [These configurations](../reference/configuration.md#schema-registry) needs to be set if you are using stencil for proto schema handling. | ||
#### `Protobuf Schema` | ||
|
||
- Dagger exclusively supports [protobuf](https://developers.google.com/protocol-buffers) encoded data. That is, for a | ||
source reading from Kafka, Dagger consumes protobuf data from Kafka topics and does the processing. For a source reading | ||
from Parquet files, Dagger uses the protobuf schema to parse the row groups. When pushing the results to a sink, Dagger produces | ||
data as per the output protobuf schema to a Kafka topic (when the sink is Kafka). | ||
- When using Kafka as a source, you can push protobuf-encoded data to a Kafka topic using any of the Kafka client | ||
libraries. You can follow this [tutorial](https://www.conduktor.io/how-to-produce-and-consume-protobuf-records-in-apache-kafka/). | ||
- For all kinds of sources, you need to define the | ||
[java compiled protobuf schema](https://developers.google.com/protocol-buffers/docs/javatutorial) in the classpath or | ||
use our in-house schema registry tool, [Stencil](https://github.com/odpf/stencil), to let Dagger know about the data | ||
schema. Stencil is an event schema registry that provides an abstraction layer for schema handling, schema caching, and | ||
dynamic schema updates. [These configurations](../reference/configuration.md#schema-registry) need to be set if you are | ||
using Stencil for proto schema handling; a sketch follows below. | ||
|
||
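As a minimal sketch of what the Stencil-related properties look like (the key names are assumed from the schema-registry section of the configuration reference, and the URL is a placeholder for your own registry endpoint):

```properties
# Fetch protobuf descriptors from a Stencil registry instead of the classpath
SCHEMA_REGISTRY_STENCIL_ENABLE=true
# Placeholder URL; point this at your Stencil descriptor endpoint
SCHEMA_REGISTRY_STENCIL_URLS=http://localhost:8000/v1beta1/namespaces/my-namespace/schemas/my-schema
```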
#### `Sinks` | ||
|
||
|
@@ -59,26 +134,29 @@ $ java -jar dagger-core/build/libs/dagger-core-<dagger-version>-fat.jar ConfigFi | |
|
||
## Common Configurations | ||
|
||
- These configurations are mandatory for dagger creation and are sink independent. Here you need to set the Kafka source-level information as well as SQL required for the dagger. In local execution, they would be set inside [`local.properties`](https://github.com/odpf/dagger/blob/main/dagger-core/env/local.properties) file. In the clustered environment they can be passed as job parameters to the Flink exposed job creation API. | ||
- Configuration for a given schema involving one or more Kafka topics is consolidated as a Stream. This involves properties for the Kafka cluster, schema, etc. In daggers, you could configure one or more streams for a single job. | ||
- These configurations are mandatory for dagger creation and are sink independent. Here you need to set configurations such as the source details, the protobuf schema class, the SQL query to be applied on the streaming data, etc. In local execution, they would be set inside the [`local.properties`](https://github.com/odpf/dagger/blob/main/dagger-core/env/local.properties) file. In a clustered environment, they can be passed as job parameters to the job creation API exposed by Flink. | ||
- Configuration for a given schema involving a single source is consolidated as a Stream. In daggers, you can configure one or more streams for a single job. To know how to configure a stream based on a source, check [here](../reference/configuration.md#streams). | ||
- The `FLINK_JOB_ID` defines the name of the flink job. `ROWTIME_ATTRIBUTE_NAME` is the key name of [row time attribute](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/concepts/time_attributes/) required for stream processing. | ||
- In clustered mode, you can set up the `parallelism` configuration for distributed processing. | ||
- Read more about the mandatory configurations [here](../reference/configuration.md). | ||
|
||
```properties | ||
STREAMS=[ | ||
{ | ||
STREAMS = [{ | ||
"SOURCE_KAFKA_TOPIC_NAMES": "test-topic", | ||
"INPUT_SCHEMA_TABLE": "data_stream", | ||
"INPUT_SCHEMA_PROTO_CLASS": "com.tests.TestMessage", | ||
"INPUT_SCHEMA_EVENT_TIMESTAMP_FIELD_INDEX": "41", | ||
"SOURCE_KAFKA_CONFIG_BOOTSTRAP_SERVERS": "localhost:9092", | ||
"SOURCE_KAFKA_CONFIG_AUTO_COMMIT_ENABLE": "", | ||
"SOURCE_KAFKA_CONFIG_AUTO_OFFSET_RESET": "latest", | ||
"SOURCE_KAFKA_CONFIG_GROUP_ID": "dummy-consumer-group", | ||
"NAME": "local-kafka-stream" | ||
} | ||
] | ||
"SOURCE_KAFKA_CONSUMER_CONFIG_BOOTSTRAP_SERVERS": "localhost:9092", | ||
"SOURCE_KAFKA_CONSUMER_CONFIG_AUTO_COMMIT_ENABLE": "false", | ||
"SOURCE_KAFKA_CONSUMER_CONFIG_AUTO_OFFSET_RESET": "latest", | ||
"SOURCE_KAFKA_CONSUMER_CONFIG_GROUP_ID": "dummy-consumer-group", | ||
"SOURCE_KAFKA_NAME": "local-kafka-stream", | ||
"SOURCE_DETAILS": [ | ||
{ | ||
"SOURCE_TYPE": "UNBOUNDED", | ||
"SOURCE_NAME": "KAFKA_CONSUMER" | ||
}] | ||
}] | ||
|
||
FLINK_ROWTIME_ATTRIBUTE_NAME=rowtime | ||
FLINK_JOB_ID=TestDagger | ||
|
Review comment: Kafka is an unbounded and Parquet a bounded data source. Fixing this as well.
Reply: Good catch.
Reply: Fixed via commit 91a73a5