Releases: GoogleCloudPlatform/DataflowPythonSDK
Future Releases
All future releases will be announced in the Release Notes: Dataflow SDK for Python, and releases will be available on PyPI.
See README.md for more information.
Version 0.2.7
The 0.2.7 release includes the following changes:
- Introduce `OperationCounters.should_sample` for sampling during size estimation.
- Implement fixed sharding in `TextFileSink` (sketched below).
- Use multiple file-rename threads in the `finalize_write` method.
- Retry idempotent I/O operations on GCS timeout.
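
A minimal sketch of the fixed-sharding feature on the text sink. The `google.cloud.dataflow` import path and label-first transform arguments follow this SDK's bundled examples, and the `num_shards` parameter name is an assumption based on the sink's later public signature.

```python
import google.cloud.dataflow as df

# Ask the sink for exactly three output shards instead of letting
# the runner choose the shard count.
with df.Pipeline('DirectPipelineRunner') as p:
  (p
   | df.Create('create', ['one', 'two', 'three'])
   | df.Write('write', df.io.TextFileSink(
       './out',         # file path prefix for the shards
       num_shards=3)))  # fixed sharding; parameter name assumed
```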
Version 0.2.6
The 0.2.6 release includes the following changes:
- Allow `Pipeline` objects to be used in Python `with` statements (example after this list).
- Several bug fixes.
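
A minimal sketch of the new context-manager support, assuming the `google.cloud.dataflow` import path and label-first transform arguments used by this SDK's examples.

```python
import google.cloud.dataflow as df

# Exiting the `with` block finishes the pipeline, so the explicit
# p.run() call needed in earlier releases can be dropped.
with df.Pipeline('DirectPipelineRunner') as p:
  (p
   | df.Create('create', ['hello', 'world'])
   | df.Map('upper', lambda word: word.upper())
   | df.Write('write', df.io.TextFileSink('./output')))
```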
Version 0.2.5
The 0.2.5 release includes the following changes:
- Support for creating custom sources, and reading them with `DirectRunner` and `DataflowRunner` (see the sketch after this list).
- `DiskCachedPipelineRunner` as a disk-backed alternative to `DirectRunner`.
- Ignore undeclared side outputs of `DoFn`s in the cloud executor.
- Fix a pickling issue when the Seaborn package is loaded.
- Enable gzip compression on the text file sink.
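
A sketch of a custom bounded source, assuming the `BoundedSource`/`RangeTracker` interface from the Dataflow custom-sources documentation; the exact module paths (`iobase`, `range_trackers`), the `SourceBundle` constructor, and the `df.Read` spelling are assumptions.

```python
import google.cloud.dataflow as df
from google.cloud.dataflow.io import iobase, range_trackers


class CountingSource(iobase.BoundedSource):
  """A toy source that emits the integers [0, count)."""

  def __init__(self, count):
    self._count = count

  def estimate_size(self):
    # Rough size estimate the service can use for splitting decisions.
    return self._count

  def get_range_tracker(self, start_position, stop_position):
    start = start_position or 0
    stop = self._count if stop_position is None else stop_position
    return range_trackers.OffsetRangeTracker(start, stop)

  def read(self, range_tracker):
    # Claim each position before emitting it so dynamic work
    # rebalancing stays consistent.
    position = range_tracker.start_position()
    while range_tracker.try_claim(position):
      yield position
      position += 1

  def split(self, desired_bundle_size, start_position=None,
            stop_position=None):
    # One bundle for brevity; a real source should split its range.
    yield iobase.SourceBundle(self._count, self, 0, self._count)


# The same source can be read by the direct and Dataflow runners.
with df.Pipeline('DirectPipelineRunner') as p:
  p | df.Read('read', CountingSource(10)) | df.Map('fmt', str)
```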
Version 0.2.4
The 0.2.4 release includes the following changes:
- Support for large iterable side inputs.
- Enable all supported counter types.
- Modify --requirements_file behavior to cache packages locally (example below).
- Support for a non-native `TextFileSink`.
- Several fixes.
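
A sketch of staging extra PyPI dependencies with --requirements_file, whose downloads 0.2.4 now caches locally across runs. The argv-style option passing follows this SDK's examples; the project and bucket names are placeholders.

```python
import google.cloud.dataflow as df

# Pipeline options passed as argv; all values are placeholders.
p = df.Pipeline(argv=[
    '--runner', 'DataflowPipelineRunner',
    '--project', 'my-project',
    '--staging_location', 'gs://my-bucket/staging',
    '--temp_location', 'gs://my-bucket/temp',
    # Packages listed in this file are staged for the workers; with
    # 0.2.4 the downloaded packages are cached locally between runs.
    '--requirements_file', 'requirements.txt',
])
```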
Version 0.2.3
The 0.2.3 release includes several fixes:
- Removed version pin for google-apitools package.
- Removed version pin for the oauth2client package.
- Improved interoperability with the gcloud package.
- Raised the correct exception for failures in `DoFn` start/finish methods.
Version 0.2.2
The 0.2.2 release includes the following changes:
- Improved memory footprint for DirectPipelineRunner.
- Multiple bug fixes (BigQuerySink schema handling for record field types, clearer error messages for missing files, etc.).
- Several performance improvements (cythonize some files, reduced debug logging, etc.).
- New example using more complex BigQuery schemas (sketched below).
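
A sketch of the record-field case covered by the `BigQuerySink` schema fix. The `TableSchema`/`TableFieldSchema` classes come from the SDK's generated BigQuery client; the import path is an assumption, and the table name is a placeholder.

```python
import google.cloud.dataflow as df
from google.cloud.dataflow.internal.clients import bigquery

# A schema with a nested RECORD field, the case covered by the fix.
city = bigquery.TableFieldSchema()
city.name = 'city'
city.type = 'STRING'

address = bigquery.TableFieldSchema()
address.name = 'address'
address.type = 'RECORD'  # nested record field
address.mode = 'NULLABLE'
address.fields.append(city)

name = bigquery.TableFieldSchema()
name.name = 'name'
name.type = 'STRING'
name.mode = 'REQUIRED'

schema = bigquery.TableSchema()
schema.fields.extend([name, address])

sink = df.io.BigQuerySink('my-project:my_dataset.my_table', schema=schema)
```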
This release supports only batch execution. Streaming processing is not available yet.
Batch execution can run locally (for development and testing) or on Google Cloud using the Cloud Dataflow service. Running on Google Cloud requires whitelisting using this form.
Version 0.2.1
The 0.2.1 release includes the following changes:
- Optimized performance for the following features:
  - Logging
  - Shuffle writing
  - Using coders
  - Compiling some of the worker modules with Cython
- Changed the default behavior for cloud execution: instead of downloading the SDK from a Cloud Storage bucket, you now download it as a tarball from GitHub. When you run jobs using the Dataflow service, the SDK version used matches the version you downloaded to your local environment. You can use the --sdk_location pipeline option to override this behavior and provide an explicit tarball location, either a Cloud Storage path or a URL (see the example after this list).
- Fixed several pickling issues related to how Dataflow serializes user functions and data.
- Fixed several worker lease expiration issues experienced when processing large datasets.
- Improved validation to detect various common errors, such as access issues and invalid parameter combinations, much earlier.
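
A sketch of the --sdk_location override described above; the argv-style option passing follows this SDK's examples, and all paths are placeholders.

```python
import google.cloud.dataflow as df

p = df.Pipeline(argv=[
    '--runner', 'DataflowPipelineRunner',
    '--project', 'my-project',
    '--staging_location', 'gs://my-bucket/staging',
    '--temp_location', 'gs://my-bucket/temp',
    # Explicit SDK tarball (Cloud Storage path or URL) to use instead
    # of the default tarball downloaded from GitHub:
    '--sdk_location', 'gs://my-bucket/sdks/dataflow-python-sdk.tar.gz',
])
```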
Version 0.2.0
Initial release of the open-sourced Dataflow SDK for Python.