This repository has been archived by the owner on Nov 11, 2022. It is now read-only.

Version 2.0.0-beta1 (Pre-release)

Released by @dhalperi on 10 Jan 21:31 · 111 commits to master since this release

v2.0.0-beta1
The Dataflow SDK for Java 2.0.0-beta1 is the first 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.

  • Breaking Changes: The Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases. Please see below for details.
  • Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.

Beta

This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:

  • No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
  • Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release, intended to let you begin the transition to the 2.x series at your convenience. Beta releases are supported by the Dataflow service, but obtaining bug fixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
  • Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.

New Functionality

This release is based on a subset of Apache Beam 0.4.0. The most relevant changes for Cloud Dataflow customers include:

  • Improvements to compression: CompressedSource supports reading ZIP-compressed files. TextIO.Write and AvroIO.Write support compressed output.
  • AvroIO functionality: Write supports the addition of custom user metadata.
  • BigQueryIO functionality: Write now splits large (> 12 TiB) bulk imports into multiple BigQuery load jobs, enabling it to handle very large datasets.
  • BigtableIO functionality: Write supports unbounded PCollections, so it can be used with the DataflowRunner in streaming mode.

For complete details, please see the release notes for Apache Beam 0.3.0-incubating and 0.4.0.

Other Apache Beam modules from version 0.4.0 can be used with this distribution, including additional I/O connectors like Java Message Service (JMS), Apache Kafka, Java Database Connectivity (JDBC), MongoDB, and Amazon Kinesis. Please see the Apache Beam site for details.

Major changes from Dataflow SDK 1.x for Java

This release has a number of significant changes from the 1.x series of releases.

All users need to read and understand these changes in order to upgrade to 2.x versions.

Package rename and restructuring

As part of generalizing Apache Beam to work well with environments beyond Google Cloud Platform, the SDK code has been renamed and restructured.

  • Renamed com.google.cloud.dataflow to org.apache.beam

    The SDK is now declared in the package org.apache.beam instead of com.google.cloud.dataflow. You will need to update all of your import statements to reflect this change.

  • New subpackages: runners.dataflow, runners.direct, and io.gcp

    Runners have been reorganized into their own packages, so many things from com.google.cloud.dataflow.sdk.runners have moved into either org.apache.beam.runners.direct or org.apache.beam.runners.dataflow.

    Pipeline options specific to running on the Dataflow service have moved from com.google.cloud.dataflow.sdk.options to org.apache.beam.runners.dataflow.options.

    Most I/O connectors to Google Cloud Platform services have been moved into subpackages. For example, BigQueryIO has moved from com.google.cloud.dataflow.sdk.io to org.apache.beam.sdk.io.gcp.bigquery.

    Most IDEs will be able to help identify the new locations. To verify the new location of a specific file, you can press t on GitHub to search the repository's code. The Dataflow SDK 1.x for Java releases are built from the GoogleCloudPlatform/DataflowJavaSDK repository. The Dataflow SDK 2.x for Java releases correspond to code from the apache/beam repository.
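    As a concrete illustration (a sketch, not an exhaustive list), typical import updates look like this:

    // Dataflow SDK 1.x for Java:
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;

    // Dataflow SDK 2.x for Java:
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;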

Runners

  • Removed Pipeline from Runner names

    The names of all Runners have been shortened by removing Pipeline from the names. For example, DirectPipelineRunner is now DirectRunner, and DataflowPipelineRunner is now DataflowRunner.

  • Require setting --tempLocation to a Google Cloud Storage path

    Instead of allowing you to specify only one of --stagingLocation or --tempLocation and inferring the other, the Dataflow service now requires --gcpTempLocation to be set to a Google Cloud Storage path. It can be inferred from the more general --tempLocation, and unless overridden it will also be used for --stagingLocation.
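    For example, the options for a Dataflow job might now be configured as follows (the project and bucket names are illustrative):

    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .as(DataflowPipelineOptions.class);
    options.setProject("my-project");
    options.setTempLocation("gs://my-bucket/temp");  // gcpTempLocation is inferred from this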

  • DirectRunner supports unbounded PCollections

    The DirectRunner continues to run on a user's local machine, but now additionally supports multithreaded execution, unbounded PCollections, and triggers for speculative and late outputs. It more closely aligns to the documented Beam model, and may (correctly) cause additional unit test failures.

    As this functionality is now in the DirectRunner, the InProcessPipelineRunner (Dataflow SDK 1.6+ for Java) has been removed.

  • Replaced BlockingDataflowPipelineRunner with PipelineResult.waitToFinish()

    The BlockingDataflowPipelineRunner has been removed. If your code programmatically expects to run a pipeline and wait until it has terminated, it should use the DataflowRunner and explicitly call pipeline.run().waitToFinish().

    If you used --runner BlockingDataflowPipelineRunner on the command line to interactively induce your main program to block until the pipeline has terminated, then this is a concern of the main program; it should provide an option such as --blockOnRun that will induce it to call waitToFinish().
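    For example (a minimal sketch; options setup is elided):

    DataflowPipelineOptions options = … ;
    options.setRunner(DataflowRunner.class);
    Pipeline p = Pipeline.create(options);
    … // construct the pipeline
    p.run().waitToFinish();  // blocks until the pipeline terminates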

ParDo and DoFn

  • DoFns use annotations instead of method overrides

    In order to allow for more flexibility and customization, DoFn now uses method annotations to customize processing instead of requiring users to override specific methods.

    The differences between the new DoFn and the old are demonstrated in the following code sample. Previously (with Dataflow SDK 1.x for Java), your code would look like this:

    new DoFn<Foo, Baz>() {
      @Override
      public void processElement(ProcessContext c) { … }
    }

    Now (with Dataflow SDK 2.x for Java), your code will look like this:

    new DoFn<Foo, Baz>() {
      @ProcessElement   // <-- This is the only difference
      public void processElement(ProcessContext c) { … }
    }

    If your DoFn accessed ProcessContext#window(), then there is a further change. Instead of this:

    public class MyDoFn extends DoFn<Foo, Baz> implements RequiresWindowAccess {
      @Override
      public void processElement(ProcessContext c) {
        … (MyWindowType) c.window() …
      }
    }

    you will write this:

    public class MyDoFn extends DoFn<Foo, Baz> {
      @ProcessElement
      public void processElement(ProcessContext c, MyWindowType window) {
        … window …
      }
    }

    or:

    return new DoFn<Foo, Baz>() {
      @ProcessElement
      public void processElement(ProcessContext c, MyWindowType window) {
        … window …
      }
    }

    The runtime will automatically provide the window to your DoFn.

  • DoFns are reused across multiple bundles

    In order to allow for performance improvements, the same DoFn instance may now be reused to process multiple bundles of elements, instead of being guaranteed a fresh instance per bundle. Any DoFn that keeps local state (e.g., instance variables) beyond the end of a bundle may encounter behavioral changes, as the next bundle will now start with that state instead of a fresh copy.

    To manage the lifecycle, new @Setup and @Teardown methods have been added. The full lifecycle is as follows (though a failure may truncate the lifecycle at any point):

    • @Setup: Per-instance initialization of the DoFn, such as opening reusable connections.
    • Any number of the sequence:
      • @StartBundle: Per-bundle initialization, such as resetting the state of the DoFn.
      • @ProcessElement: The usual element processing.
      • @FinishBundle: Per-bundle concluding steps, such as flushing side effects.
    • @Teardown: Per-instance teardown of the resources held by the DoFn, such as closing reusable connections.

    Note: This change is expected to have limited impact in practice. However, it does not generate a compile-time error and has the potential to silently cause unexpected results.
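    The lifecycle above can be sketched as follows (the connection type and helper methods are hypothetical):

    new DoFn<Foo, Baz>() {
      private transient MyConnection conn;   // reused across bundles

      @Setup
      public void setup() {
        conn = MyConnection.open();          // per-instance initialization
      }

      @StartBundle
      public void startBundle(Context c) {
        … // reset any per-bundle state here
      }

      @ProcessElement
      public void processElement(ProcessContext c) { … }

      @FinishBundle
      public void finishBundle(Context c) {
        … // flush buffered side effects
      }

      @Teardown
      public void teardown() {
        conn.close();                        // per-instance cleanup
      }
    }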

PTransforms

  • Removed .named()

    The .named() methods have been removed from PTransforms and their subclasses. Instead, use PCollection.apply("name", PTransform).
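    For example, where 1.x code looked like this (FormatRowsFn is a hypothetical DoFn used for illustration):

    input.apply(ParDo.named("FormatRows").of(new FormatRowsFn()));

    2.x code names the application site instead:

    input.apply("FormatRows", ParDo.of(new FormatRowsFn()));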

  • Renamed PTransform.apply() to PTransform.expand()

    PTransform.apply() was renamed to PTransform.expand() to avoid confusion with PCollection.apply(). All user-written composite transforms will need to rename the overridden apply() method to expand(). There is no change to how pipelines are constructed.
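    For example, a composite transform that previously overrode apply() is updated like this (CountWords is an illustrative transform, not from the SDK):

    public class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
      @Override
      public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
        … // same body as the old apply(); only the method name changes
      }
    }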

Additional breaking changes

Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.