KafkaIO should return one split for each partition. #491
Conversation
This is the actual unit of parallelism for a Kafka topic. The desiredNumSplits that Dataflow passes to a custom source is very low when maxNumWorkers is set: it asks for just one split for each of the workers. This limits use of CPU cores on the workers, essentially making autoscaling use more resources without improving performance. This includes a hack to force a single split in many unit tests, since DirectPipelineRunner and InProcessPipelineRunner don't seem to read from more than one split.
+R: @davorbonaci, @dhalperi, @tgroh
DirectPipelineRunner does not call generateInitialSplits(). Rather than forcing a single split through a special config, force it when the source is invoked from within KafkaIO itself.
Updated based on Thomas's comment in chat.
}
for (int i = 0; i < partitions.size(); i++) {
  assignments.get(i % numSplits).add(partitions.get(i));
if (desiredNumSplits < 0) {
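For context, the round-robin assignment in the diff above can be exercised standalone. A minimal sketch (the class and helper names here are mine, not KafkaIO's):

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionAssignment {
  // Round-robin assignment of topic partitions to splits, as in the diff
  // above: partition i goes to split (i % numSplits). When numSplits
  // equals the partition count, each split gets exactly one partition.
  static List<List<Integer>> assign(List<Integer> partitions, int numSplits) {
    List<List<Integer>> assignments = new ArrayList<>();
    for (int i = 0; i < numSplits; i++) {
      assignments.add(new ArrayList<>());
    }
    for (int i = 0; i < partitions.size(); i++) {
      assignments.get(i % numSplits).add(partitions.get(i));
    }
    return assignments;
  }

  public static void main(String[] args) {
    // 4 partitions into 4 splits: one partition per split (this PR's default).
    System.out.println(assign(List.of(0, 1, 2, 3), 4)); // [[0], [1], [2], [3]]
    // 4 partitions into 2 splits: the generic case the old code handled.
    System.out.println(assign(List.of(0, 1, 2, 3), 2)); // [[0, 2], [1, 3]]
  }
}
```

The same loop covers both the new default (one split per partition) and the generic case where the counts differ, which is why the old code could be reused.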
Can this be inlined? More specifically, you could factor out the partitions preprocessing, and then just call the constructor in generateInitialSplits.
I was not sure exactly what you meant. I minimized the diff by reusing the old code that handles the generic case where the number of partitions and splits might not match. PTAL.
Updated after a clarification from Thomas. It makes sense. There is no special case for a single split in generateInitialSplits(). createReader() creates a single reader if there aren't any partitions assigned (as happens with the direct runner). Updated a couple of javadoc comments as well.
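Putting those pieces together, here is a hypothetical, SDK-free sketch of the flow described above (class and method names are illustrative, not the real KafkaIO/Dataflow API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for an unbounded Kafka source; not the real API.
class KafkaSourceSketch {
  final List<Integer> allPartitions;
  final List<Integer> assignedPartitions; // empty for the unsplit source

  KafkaSourceSketch(List<Integer> all, List<Integer> assigned) {
    this.allPartitions = all;
    this.assignedPartitions = assigned;
  }

  // One split per partition, with no single-split special case: the
  // partition is the real unit of parallelism, so the runner's (often
  // too-low) desiredNumSplits hint is not used to shrink the count.
  List<KafkaSourceSketch> generateInitialSplits() {
    List<KafkaSourceSketch> splits = new ArrayList<>();
    for (int p : allPartitions) {
      splits.add(new KafkaSourceSketch(allPartitions, List.of(p)));
    }
    return splits;
  }

  // If no partitions were assigned (e.g. a runner like
  // DirectPipelineRunner that never calls generateInitialSplits()),
  // a single reader covers all of them.
  List<Integer> partitionsToRead() {
    return assignedPartitions.isEmpty() ? allPartitions : assignedPartitions;
  }

  public static void main(String[] args) {
    KafkaSourceSketch source =
        new KafkaSourceSketch(List.of(0, 1, 2), List.of());
    System.out.println(source.generateInitialSplits().size()); // 3
    System.out.println(source.partitionsToRead()); // [0, 1, 2]
  }
}
```

This keeps the split logic in one place: splitting runners get one split per partition, and an unsplit source degrades gracefully to a single reader over every partition.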
[ERROR] src/test/java/com/google/cloud/dataflow/contrib/kafka/KafkaIOTest.java:[29,8] (imports) UnusedImports: Unused import: com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory. Otherwise LGTM.
Thanks. Just pushed the fix for the unused import. I will ping once travis-ci is happy.
@tgroh all the checks passed. Thanks for the review. |
Recent PR GoogleCloudPlatform#491 changes how KafkaIO splits. This makes it incompatible with Dataflow update across these two versions.