Version 2.0.0-beta1
Pre-release

The Dataflow SDK for Java 2.0.0-beta1 is the first 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
- Breaking Changes: The Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases. Please see below for details.
- Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.
Beta
This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:
- No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
- Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release, intended to let you begin the transition to the 2.x series at your convenience. Beta releases are supported by the Dataflow service, but obtaining bug fixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
- Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.
New Functionality
This release is based on a subset of Apache Beam 0.4.0. The most relevant changes for Cloud Dataflow customers include:
- Improvements to compression: `CompressedSource` supports reading ZIP-compressed files (see the sketch after this list). `TextIO.Write` and `AvroIO.Write` support compressed output.
- `AvroIO` functionality: `Write` supports the addition of custom user metadata.
- `BigQueryIO` functionality: `Write` now splits large (> 12 TiB) bulk imports into multiple BigQuery load jobs, enabling it to handle very large datasets.
- `BigtableIO` functionality: `Write` supports unbounded `PCollection`s and so can be used in the `DataflowRunner` in streaming mode.
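As an illustration of the compression support, here is a minimal sketch of reading ZIP-compressed text files. The `gs://my-bucket/...` paths are placeholders, and it is an assumption of this sketch that the new ZIP support is exposed as a `TextIO.CompressionType.ZIP` value (alongside the existing `GZIP` and `BZIP2` values); verify against the release you are using:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadZippedText {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read
            .from("gs://my-bucket/input/*.zip")               // placeholder path
            .withCompressionType(TextIO.CompressionType.ZIP)) // assumed enum value for the new ZIP support
     .apply(TextIO.Write.to("gs://my-bucket/output/lines"));  // placeholder path

    p.run();
  }
}
```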
For complete details, please see the release notes for Apache Beam 0.3.0-incubating and 0.4.0.
Other Apache Beam modules from version 0.4.0 can be used with this distribution, including additional I/O connectors like Java Message Service (JMS), Apache Kafka, Java Database Connectivity (JDBC), MongoDB, and Amazon Kinesis. Please see the Apache Beam site for details.
Major changes from Dataflow SDK 1.x for Java
This release has a number of significant changes from the 1.x series of releases.
All users need to read and understand these changes in order to upgrade to 2.x versions.
Package rename and restructuring
As part of generalizing Apache Beam to work well with environments beyond Google Cloud Platform, the SDK code has been renamed and restructured.
- Renamed `com.google.cloud.dataflow` to `org.apache.beam`: The SDK is now declared in the package `org.apache.beam` instead of `com.google.cloud.dataflow`. You need to update all of your import statements with this change.
- New subpackages: `runners.dataflow`, `runners.direct`, and `io.gcp`: Runners have been reorganized into their own packages, so many things from `com.google.cloud.dataflow.sdk.runners` have moved into either `org.apache.beam.runners.direct` or `org.apache.beam.runners.dataflow`.

  Pipeline options specific to running on the Dataflow service have moved from `com.google.cloud.dataflow.sdk.options` to `org.apache.beam.runners.dataflow.options`.

  Most I/O connectors to Google Cloud Platform services have been moved into subpackages. For example, `BigQueryIO` has moved from `com.google.cloud.dataflow.sdk.io` to `org.apache.beam.sdk.io.gcp.bigquery`.

Most IDEs will be able to help identify the new locations. To verify the new location for specific files, you can use the `t` keyboard shortcut to search the code on GitHub. The Dataflow SDK 1.x for Java releases are built from the GoogleCloudPlatform/DataflowJavaSDK repository. The Dataflow SDK 2.x for Java releases correspond to code from the apache/beam repository.
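As a quick illustration, here is how a representative set of 1.x imports maps to 2.x; the class list is just an example, not exhaustive:

```java
// Dataflow SDK 1.x for Java:
// import com.google.cloud.dataflow.sdk.Pipeline;
// import com.google.cloud.dataflow.sdk.io.BigQueryIO;
// import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
// import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

// Dataflow SDK 2.x for Java, after the package rename:
import org.apache.beam.sdk.Pipeline;                                      // SDK core
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;                    // io.gcp subpackage
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;  // runners.dataflow options
import org.apache.beam.runners.dataflow.DataflowRunner;                   // runner, also renamed (see below)

class ImportExample {} // placeholder so the snippet compiles
```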
Runners
- Removed `Pipeline` from Runner names: The names of all Runners have been shortened by removing `Pipeline` from the names. For example, `DirectPipelineRunner` is now `DirectRunner`, and `DataflowPipelineRunner` is now `DataflowRunner`.
- Require setting `--tempLocation` to a Google Cloud Storage path: Instead of allowing you to specify only one of `--stagingLocation` or `--tempLocation` and having Dataflow infer the other, the Dataflow service now requires `--gcpTempLocation` to be set to a Google Cloud Storage path; it can, however, be inferred from the more general `--tempLocation`. Unless overridden, this will also be used for the `--stagingLocation`.
- `DirectRunner` supports unbounded `PCollection`s: The `DirectRunner` continues to run on a user's local machine, but now additionally supports multithreaded execution, unbounded `PCollection`s, and triggers for speculative and late outputs. It more closely aligns with the documented Beam model, and may (correctly) cause additional unit test failures. As this functionality is now in the `DirectRunner`, the `InProcessPipelineRunner` (Dataflow SDK 1.6+ for Java) has been removed.
- Replaced `BlockingDataflowPipelineRunner` with `PipelineResult.waitUntilFinish()`: The `BlockingDataflowPipelineRunner` has been removed. If your code programmatically expects to run a pipeline and wait until it has terminated, it should use the `DataflowRunner` and explicitly call `pipeline.run().waitUntilFinish()`, as sketched after this list. If you used `--runner BlockingDataflowPipelineRunner` on the command line to interactively induce your main program to block until the pipeline has terminated, then this is a concern of the main program; it should provide an option, such as `--blockOnRun`, that induces it to call `waitUntilFinish()`.
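Putting these changes together, here is a minimal sketch of a main program that sets a temp location, runs on the `DataflowRunner`, and blocks until the pipeline terminates. The project ID and bucket paths are placeholders, and `--blockOnRun` above is likewise just an example option name your program could define, not a built-in flag:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BlockingMain {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-project"); // placeholder project ID
    // --gcpTempLocation is inferred from this unless set explicitly.
    options.setTempLocation("gs://my-bucket/temp"); // placeholder path

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("gs://my-bucket/input/*"))          // placeholder path
     .apply(TextIO.Write.to("gs://my-bucket/output/result"));    // placeholder path

    // Replaces BlockingDataflowPipelineRunner: run, then block until a terminal state.
    PipelineResult result = p.run();
    result.waitUntilFinish();
  }
}
```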
`ParDo` and `DoFn`
- `DoFn`s use annotations instead of method overrides: In order to allow for more flexibility and customization, `DoFn` now uses method annotations to customize processing instead of requiring users to override specific methods.

  The differences between the new `DoFn` and the old are demonstrated in the following code sample. Previously (with Dataflow SDK 1.x for Java), your code would look like this:

  ```java
  new DoFn<Foo, Baz>() {
    @Override
    public void processElement(ProcessContext c) { … }
  }
  ```

  Now (with Dataflow SDK 2.x for Java), your code will look like this:

  ```java
  new DoFn<Foo, Baz>() {
    @ProcessElement  // <-- This is the only difference
    public void processElement(ProcessContext c) { … }
  }
  ```

  If your `DoFn` accessed `ProcessContext#window()`, then there is a further change. Instead of this:

  ```java
  public class MyDoFn extends DoFn<Foo, Baz> implements RequiresWindowAccess {
    @Override
    public void processElement(ProcessContext c) {
      … (MyWindowType) c.window() …
    }
  }
  ```

  you will write this:

  ```java
  public class MyDoFn extends DoFn<Foo, Baz> {
    @ProcessElement
    public void processElement(ProcessContext c, MyWindowType window) {
      … window …
    }
  }
  ```

  or:

  ```java
  return new DoFn<Foo, Baz>() {
    @ProcessElement
    public void processElement(ProcessContext c, MyWindowType window) {
      … window …
    }
  };
  ```

  The runtime will automatically provide the window to your `DoFn`.
- `DoFn`s are reused across multiple bundles: In order to allow for performance improvements, the same `DoFn` may now be reused to process multiple bundles of elements, instead of guaranteeing a fresh instance per bundle. Any `DoFn` that keeps local state (e.g. instance variables) beyond the end of a bundle may encounter behavioral changes, as the next bundle will now start with that state instead of a fresh copy.

  To manage the lifecycle, new `@Setup` and `@Teardown` methods have been added (see the sketch following the note below). The full lifecycle is as follows, though a failure may truncate it at any point:

  - `@Setup`: Per-instance initialization of the `DoFn`, such as opening reusable connections.
  - Any number of repetitions of the sequence:
    - `@StartBundle`: Per-bundle initialization, such as resetting the state of the `DoFn`.
    - `@ProcessElement`: The usual element processing.
    - `@FinishBundle`: Per-bundle concluding steps, such as flushing side effects.
  - `@Teardown`: Per-instance teardown of the resources held by the `DoFn`, such as closing reusable connections.
Note: This change is expected to have limited impact in practice. However, it does not generate a compile-time error and has the potential to silently cause unexpected results.
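For instance, a `DoFn` that batches writes to an external service could use the lifecycle methods as in this minimal sketch; `ExternalClient` and its methods are hypothetical stand-ins for whatever resource your `DoFn` manages:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

class BatchingWriteFn extends DoFn<String, Void> {

  // Hypothetical stand-in for a connection to some external service.
  static class ExternalClient {
    static ExternalClient connect() { return new ExternalClient(); }
    void writeAll(List<String> records) { /* send the batch */ }
    void close() { /* release the connection */ }
  }

  private transient ExternalClient client;
  private transient List<String> batch;

  @Setup
  public void setup() {
    // Per-instance: open a reusable connection once, not once per bundle.
    client = ExternalClient.connect();
  }

  @StartBundle
  public void startBundle() {
    // Per-bundle: reset local state, because this instance may be reused.
    batch = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    batch.add(c.element());
  }

  @FinishBundle
  public void finishBundle() {
    // Per-bundle: flush side effects before the bundle completes.
    client.writeAll(batch);
  }

  @Teardown
  public void teardown() {
    // Per-instance: release resources held by this DoFn.
    client.close();
  }
}
```

The fields are marked `transient` and initialized in `@Setup` rather than in the constructor, since a `DoFn` is serialized for distribution while connections generally are not.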
PTransforms
- Removed `.named()`: The `.named()` methods have been removed from `PTransform`s and their subclasses. Instead, use `PCollection.apply("name", PTransform)`, as shown in the sketch after this list.
- Renamed `PTransform.apply()` to `PTransform.expand()`: `PTransform.apply()` was renamed to `PTransform.expand()` to avoid confusion with `PCollection.apply()`. All user-written composite transforms will need to rename the overridden `apply()` method to `expand()`. There is no change to how pipelines are constructed.
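Here is a minimal sketch of both changes together; `CountWords` is a made-up composite transform. In 1.x the method below would have been named `apply()`, and the name at the call site would have been attached with `.named()`:

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// A made-up composite transform illustrating the rename.
class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> words) {
    // In 1.x this override was named apply(); only the method name changes.
    return words.apply(Count.<String>perElement());
  }
}
```

At the call site, pipelines are constructed exactly as before, with the name passed to `apply`: `PCollection<KV<String, Long>> counts = lines.apply("CountWords", new CountWords());`.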
Additional breaking changes
Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.