Releases · GoogleCloudPlatform/DataflowJavaSDK
Version 0.4.20150727
- Removed the requirement to explicitly set `--project` if the Google Cloud SDK has a default project configured.
- Added support for creating BigQuery sources from a query.
- Added support for custom unbounded sources in the `DirectPipelineRunner` and `DataflowPipelineRunner`. See `UnboundedSource` for details.
- Removed the unnecessary `ExecutionContext` argument in `BoundedSource.createReader` and related methods.
- Changed `BoundedReader.splitAtFraction` to require thread safety (i.e., it must be safe to call asynchronously with `advance` or `start`). Added `RangeTracker` to help implement thread-safe readers; users are strongly encouraged to use this class rather than implementing an ad-hoc solution (see the sketch after this list).
- Modified `Combine` transforms by lifting them into (and above) the `GroupByKey`, resulting in better performance.
- Modified triggers such that, after a `GroupByKey`, the system switches to a "continuation trigger" that attempts to preserve the original intent regarding speculative and late firings, instead of returning to the default trigger.
- Added `WindowFn.getOutputTimestamp` and changed `GroupByKey` behavior so that incomplete overlapping windows no longer hold up progress of earlier, completed windows.
- Changed triggering behavior so that empty panes are produced if they are the first pane after the watermark (`ON_TIME`) or the final pane.
- Removed the `Window.Trigger` intermediate builder class.
- Added validation that allowed lateness is specified on the `Window` `PTransform` whenever a trigger is specified.
- Re-enabled verification of `GroupByKey` usage. Specifically, the key must have a deterministic coder, and using `GroupByKey` with an unbounded `PCollection` requires windowing or triggers.
- Changed `PTransform` names so that they may no longer contain the `=` or `;` characters.
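
To make the new thread-safety contract concrete, here is a minimal sketch of a bounded source whose reader delegates all position bookkeeping to `OffsetRangeTracker`, so `splitAtFraction` can run concurrently with `start`/`advance`. The toy `OffsetSource`/`OffsetReader` pair is hypothetical, and the `OffsetRangeTracker` method names are assumed from the SDK's published API rather than taken from these notes.

```java
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;

import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.io.BoundedSource;
import com.google.cloud.dataflow.sdk.io.range.OffsetRangeTracker;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;

// A toy source over the offsets [start, end), emitting each offset as a
// string. It exists only to give the reader below a range to split.
class OffsetSource extends BoundedSource<String> {
  final long start, end;

  OffsetSource(long start, long end) { this.start = start; this.end = end; }

  @Override
  public List<? extends BoundedSource<String>> splitIntoBundles(
      long desiredBundleSizeBytes, PipelineOptions options) {
    return Collections.singletonList(this);  // rely on dynamic splitting only
  }

  @Override
  public long getEstimatedSizeBytes(PipelineOptions options) { return (end - start) * 8; }

  @Override
  public boolean producesSortedKeys(PipelineOptions options) { return false; }

  @Override
  public BoundedReader<String> createReader(PipelineOptions options) {
    return new OffsetReader(this);
  }

  @Override
  public void validate() {}

  @Override
  public Coder<String> getDefaultOutputCoder() { return StringUtf8Coder.of(); }
}

class OffsetReader extends BoundedSource.BoundedReader<String> {
  private final OffsetSource source;
  private final OffsetRangeTracker tracker;  // owns all split bookkeeping
  private long next;
  private String current;

  OffsetReader(OffsetSource source) {
    this.source = source;
    this.tracker = new OffsetRangeTracker(source.start, source.end);
    this.next = source.start;
  }

  @Override
  public boolean start() { return advance(); }

  @Override
  public boolean advance() {
    // Atomically checks `next` against the range, which a concurrent
    // trySplitAtPosition() call below may already have shrunk.
    if (!tracker.tryReturnRecordAt(true /* isAtSplitPoint */, next)) {
      return false;
    }
    current = Long.toString(next++);
    return true;
  }

  @Override
  public String getCurrent() throws NoSuchElementException {
    if (current == null) { throw new NoSuchElementException(); }
    return current;
  }

  @Override
  public BoundedSource<String> getCurrentSource() {
    return source;  // simplified: a real reader would report the shrunk range
  }

  @Override
  public BoundedSource<String> splitAtFraction(double fraction) {
    long splitOffset = tracker.getPositionForFractionConsumed(fraction);
    if (!tracker.trySplitAtPosition(splitOffset)) {
      return null;  // split rejected; keep reading the original range
    }
    return new OffsetSource(splitOffset, source.end);  // the residual source
  }

  @Override
  public Double getFractionConsumed() { return tracker.getFractionConsumed(); }

  @Override
  public void close() {}
}
```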
Version 0.4.20150710
- Added support for per-window tables to `BigQueryIO`.
- Added support for a custom source implementation for Avro. See `AvroSource` for more details.
- Removed the 250 GiB Google Cloud Storage file-size upload restriction.
- Fixed a `BigQueryIO.Write` table creation bug in streaming mode.
- Changed `Source.createReader()` and `BoundedSource.createReader()` to be abstract.
- Moved `Source.splitIntoBundles()` to `BoundedSource.splitIntoBundles()`.
- Added support for reading bounded views of a Pub/Sub stream in `PubsubIO` for non-streaming Dataflow pipeline runners and the `DirectPipelineRunner`.
- Added support for getting a `Coder` from the `CoderRegistry` using a `Class`.
- Changed `CoderRegistry.registerCoder(Class<T>, Coder<T>)` to enforce that the provided coder actually encodes values of the given class; its use with raw types of generic classes is forbidden, as it will rarely work correctly.
- Migrated to `Create.withCoder()` and `CreateTimestamped.withCoder()` instead of calling `setCoder()` on the resulting `PCollection` when the `Create` `PTransform` is applied (see the first sketch after this list).
- Added three successively more detailed `WordCount` examples.
- Removed `PTransform.getDefaultName()`, which was redundant with `PTransform.getKindString()`.
- Added a unique-name check for `PTransform`s during job creation.
- Removed `PTransform.withName()` and `PTransform.setName()`. The name of a transform is now immutable after construction. Library transforms (like `Combine`) can provide builder-like methods to change the name. Names can always be overridden at the location where the transform is applied, using `apply("name", transform)` (see the second sketch after this list).
- Added the ability to select the network for worker VMs using `DataflowPipelineWorkerPoolOptions.setNetwork(String)`.
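
To make the `Create` change concrete, here is a minimal sketch of the pattern this release migrates to; the `MyType` class and the choice of `SerializableCoder` are illustrative assumptions.

```java
import java.io.Serializable;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.SerializableCoder;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class CreateWithCoderExample {
  // Hypothetical element type; any Serializable class works with SerializableCoder.
  static class MyType implements Serializable {}

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Previously: apply Create.of(...), then call setCoder() on the resulting
    // PCollection. Now the coder is specified on the transform itself.
    PCollection<MyType> values = p.apply(
        Create.of(new MyType(), new MyType())
              .withCoder(SerializableCoder.of(MyType.class)));

    p.run();
  }
}
```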
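And since `withName()`/`setName()` are gone, a per-application name is now supplied where the transform is applied; `Count.perElement()` here is just a stand-in transform.

```java
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

class NamedApplyExample {
  static PCollection<KV<String, Long>> countWords(PCollection<String> words) {
    // "CountWords" overrides the transform's default name for this one
    // application, replacing the removed withName()/setName() methods.
    return words.apply("CountWords", Count.<String>perElement());
  }
}
```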
Version 0.4.20150602
- Added a dependency on the `gcloud core` component version 2015.02.05 or newer. Update to the latest version of gcloud by running `gcloud components update`. See Application Default Credentials for more details on how credentials can be specified.
- Removed the previously deprecated `Flatten.create()`. Use `Flatten.pCollections()` instead (see the first sketch after this list).
- Removed the previously deprecated `Coder.isDeterministic()`. Implement `Coder.verifyDeterministic()` instead.
- Replaced `DoFn.Context#createAggregator` with `DoFn#createAggregator`.
- Added support for querying the current value of an `Aggregator`. See `PipelineResult` for more information.
- Added the experimental `DoFnWithContext` to simplify accessing additional information from a `DoFn`.
- Removed the experimental `RequiresKeyedState`.
- Added `CannotProvideCoderException` to indicate an inability to infer a coder, instead of returning `null` in such cases.
- Added `CoderProperties` for assembling test suites for user-defined coders.
- Replaced a constructor of `PDone` with a static factory `PDone.in(Pipeline)`.
- Updated the string formatting of `TIMESTAMP` values returned by the BigQuery source when using the `DirectPipelineRunner` or when BigQuery data is used as a side input, aligning it with the case when BigQuery data is used as a main input.
- Added the requirement that the value returned by `Source.Reader.getCurrent()` must be immutable and remain valid indefinitely.
- Replaced some usages of `Source` with `BoundedSource`. For example, the `Read.from()` transform can now only be applied to `BoundedSource` objects.
- Moved experimental late-data handling (i.e., data that arrives in a streaming pipeline after the watermark has passed it) from `PubsubIO` to `Window`. Late data now defaults to being dropped at the first `GroupByKey` following a `Read` operation. To allow late data through, use `Window.Bound#withAllowedLateness` (see the second sketch after this list).
- Added experimental support for accumulating elements within a window across panes.
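
For the `Flatten` migration above, a quick sketch (the element type `String` is arbitrary):

```java
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

class FlattenExample {
  static PCollection<String> merge(PCollection<String> a, PCollection<String> b) {
    // Flatten.create() is gone; Flatten.pCollections() is the replacement.
    return PCollectionList.of(a).and(b).apply(Flatten.<String>pCollections());
  }
}
```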
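And a sketch of letting late data through with `Window.Bound#withAllowedLateness`; the window size, lateness horizon, and downstream `Count.perElement()` are illustrative choices, not from the notes.

```java
import org.joda.time.Duration;

import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

class AllowedLatenessExample {
  static PCollection<KV<String, Long>> windowedCounts(PCollection<String> words) {
    return words
        // Without withAllowedLateness, records behind the watermark are
        // dropped at the first GroupByKey (here, inside Count.perElement).
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                     .withAllowedLateness(Duration.standardMinutes(10)))
        .apply(Count.<String>perElement());
  }
}
```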
Version 0.4.20150414
- Initial Beta release of the Dataflow SDK for Java.
- Improved execution performance in many areas of the system.
- Added support for progress estimation and dynamic work rebalancing for user-defined sources.
- Added support for user-defined sources to provide the timestamp of the values read via
Reader.getCurrentTimestamp()
. - Added support for user-defined sinks.
- Added support for custom types in
PubsubIO
. - Added support for reading and writing XML files. See
XmlSource
andXmlSink
. - Renamed
DatastoreIO.Write.to
toDatastoreIO.writeTo
. In addition, entities written to Cloud Datastore must have complete keys. - Renamed
ReadSource
transform intoRead
. - Replaced
Source.createBasicReader
withSource.createReader
. - Added support for triggers, which allows getting early or partial results for a window, and specifying when to process late data. See
Window.into.triggering
. - Reduced visibility of
PTransform
'sgetInput()
,getOutput()
,getPipeline()
, andgetCoderRegistry()
. These methods will soon be deleted. - Renamed
DoFn.ProcessContext#windows
toDoFn.ProcessContext#window
. In order for aDoFn
to callDoFn.ProcessContext#window
, it must implementRequiresWindowAccess
. - Added
DoFn.ProcessContext#windowingInternals
to enable windowing on third-party runners. - Added support for side inputs when running streaming pipelines on the
[Blocking]DataflowPipelineRunner
. - Changed
[Keyed]CombineFn.addInput()
to return the new accumulator value. RenamedCombine.perElement().withHotKeys()
toCombine.perElement().withHotKeyFanout()
. - Renamed
First.of
toSample.any
andRateLimiting
toIntraBundleParallelization
to better represent its functionality.
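
A sketch of the new triggering support. The builder names used here (`AfterWatermark.pastEndOfWindow()`, `AfterProcessingTime.pastFirstElementInPane()`) are assumed from later SDK releases and may differ slightly in this version.

```java
import org.joda.time.Duration;

import com.google.cloud.dataflow.sdk.transforms.windowing.AfterProcessingTime;
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterWatermark;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.PCollection;

class TriggerExample {
  static PCollection<String> withEarlyResults(PCollection<String> input) {
    return input.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
            .triggering(
                // Emit speculative (early) panes 30s after the first element
                // of the pane arrives, then a final pane when the watermark
                // passes the end of the window.
                AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardSeconds(30)))));
  }
}
```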
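And a minimal `DoFn` that opts into window access via the `RequiresWindowAccess` marker, as the rename above now requires.

```java
import com.google.cloud.dataflow.sdk.transforms.DoFn;

// Without implementing DoFn.RequiresWindowAccess, calling c.window() from
// processElement() is not permitted.
class TagWithWindowFn extends DoFn<String, String>
    implements DoFn.RequiresWindowAccess {
  @Override
  public void processElement(ProcessContext c) {
    c.output(c.element() + " in " + c.window());
  }
}
```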
Version 0.3.20150326
- Added support for accessing `PipelineOptions` in the Dataflow worker.
- Removed one of the type parameters in `PCollectionView`, which may require simple changes to user code that uses `PCollectionView`.
- Changed the side input API to apply per window. Calls to `sideInput()` now return values only in the specific window corresponding to the window of the main input element, not the whole side input `PCollectionView`. Consequently, `sideInput()` can no longer be called from `startBundle` and `finishBundle` of a `DoFn` (see the first sketch after this list).
- Added support for viewing a `PCollection` as a `Map` when used as a side input. See `View.asMap()`.
- Renamed the custom source API to use the term "bundle" instead of "shard" in all names. Additionally, the term "fork" has been replaced with "dynamic split".
- The custom source `Reader` now requires implementing the new method `start()`. Existing code can be fixed by simply adding this method so that it just calls `advance()` and returns its value (see the second sketch after this list). Additionally, code that uses the `Reader` should be updated to use both `start()` and `advance()`, instead of `advance()` only.
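
A sketch of the per-window side input behavior together with `View.asMap()`. This assumes the signature from later SDK releases, where viewing a `PCollection<KV<K, V>>` with `View.asMap()` yields a `PCollectionView<Map<K, V>>`.

```java
import java.util.Map;

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.View;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionView;

class SideInputExample {
  static PCollection<String> annotate(
      PCollection<String> words, PCollection<KV<String, Long>> counts) {
    final PCollectionView<Map<String, Long>> countsView =
        counts.apply(View.<String, Long>asMap());
    return words.apply(ParDo.withSideInputs(countsView).of(
        new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            // sideInput() now returns only the map for the window of the
            // current element, so it cannot be called from startBundle().
            Map<String, Long> m = c.sideInput(countsView);
            c.output(c.element() + ":" + m.get(c.element()));
          }
        }));
  }
}
```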
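The `start()` migration itself is mechanical; per the note above, the new method can simply delegate to `advance()` inside an existing `Reader` implementation:

```java
// Inside an existing custom-source Reader implementation:
@Override
public boolean start() throws java.io.IOException {
  // New contract: start() positions the reader at the first record.
  // Delegating to advance() reproduces the pre-change behavior.
  return advance();
}
```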
Version 0.3.20150227
- Initial Alpha version of the Dataflow SDK for Java with support for streaming pipelines.
- Added a determinism checker in `AvroCoder` to make it easier to interoperate with `GroupByKey` (see the sketch after this list).
- Added support for accessing `PipelineOptions` in the worker.
- Added support for compressed sources.
Version 0.3.20150211
- Removed the dependency on the `gcloud core` component version 2015.02.05 or newer.
Version 0.3.20150210
Caution: depends on the `gcloud core` component version 2015.02.05 or newer.
- Included streaming pipeline runner, which, for now, requires additional whitelisting.
- Renamed several windowing-related APIs in a non-backward-compatible way.
- Added support for custom sources, which you can use to read from your own input formats.
- Introduced worker parallelism: one task per processor.
Version 0.3.20150109
- Fixed several platform-specific issues for Microsoft Windows.
- Fixed several Java 8-specific issues.
- Added a few new examples.
Version 0.3.20141216
- Initial Alpha version of the Dataflow SDK for Java.