[Python BQ] Allow setting a fixed number of Storage API streams #28592
Changes from 3 commits
@@ -1869,6 +1869,7 @@ def __init__(
       # TODO(https://github.com/apache/beam/issues/20712): Switch the default
       # when the feature is mature.
       with_auto_sharding=False,
+      num_storage_api_streams=0,
       ignore_unknown_columns=False,
       load_job_project_id=None,
       max_insert_payload_size=MAX_INSERT_PAYLOAD_SIZE,
@@ -2018,6 +2019,8 @@ def __init__(
         determined number of shards to write to BigQuery. This can be used for
         all of FILE_LOADS, STREAMING_INSERTS, and STORAGE_WRITE_API. Only
         applicable to unbounded input.
+      num_storage_api_streams: If set, the Storage API sink will default to
+        using this number of write streams. Only applicable to unbounded data.
Review thread on the num_storage_api_streams docstring:

- Suggested wording: "num_storage_api_streams: specifies the number of write streams that the Storage API sink will use. This parameter is only applicable to unbounded data."
- "Shall we check whether this should now always be set for unbounded data, since it won't work otherwise?"
- "Streaming writes with at-least-once still work without setting this parameter."
- "Changed the documentation, thanks for the suggestion!"
- "I see. Shall we add something like 'This parameter must be set for Storage API writes with the exactly-once method.'?"
- "I hesitate to do this because conventionally we don't do runner-based checks in the SDK."
- "I clarified it in the public BigQuery connector doc (https://beam.apache.org/documentation/io/built-in/google-bigquery/)."
       ignore_unknown_columns: Accept rows that contain values that do not match
         the schema. The unknown values are ignored. Default is False,
         which treats unknown values as errors. This option is only valid for
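For context, a minimal usage sketch of the new parameter. This is not taken from the PR: the table, schema, source, and pipeline options below are placeholders, while method, triggering_frequency, and num_storage_api_streams are the arguments this diff actually touches.

import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

# Sketch of a streaming pipeline that pins the number of Storage API write
# streams. Everything marked "placeholder" is hypothetical.
with beam.Pipeline(options=pipeline_options) as p:  # placeholder options
  _ = (
      p
      | "Read" >> read_unbounded_source()  # placeholder streaming source
      | "Write" >> WriteToBigQuery(
          table="project:dataset.table",        # placeholder table
          schema="user:STRING,score:INTEGER",   # placeholder schema
          method=WriteToBigQuery.Method.STORAGE_WRITE_API,
          triggering_frequency=5,      # flush roughly every 5 seconds
          num_storage_api_streams=4))  # fixed stream count added by this PR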
@@ -2060,6 +2063,7 @@ def __init__(
     self.use_at_least_once = use_at_least_once
     self.expansion_service = expansion_service
     self.with_auto_sharding = with_auto_sharding
+    self._num_storage_api_streams = num_storage_api_streams
     self.insert_retry_strategy = insert_retry_strategy
     self._validate = validate
     self._temp_file_format = temp_file_format or bigquery_tools.FileFormat.JSON
@@ -2259,6 +2263,7 @@ def find_in_nested_dict(schema):
             triggering_frequency=triggering_frequency,
             use_at_least_once=self.use_at_least_once,
             with_auto_sharding=self.with_auto_sharding,
+            num_storage_api_streams=self._num_storage_api_streams,
             expansion_service=self.expansion_service))

     if is_rows:
@@ -2521,6 +2526,7 @@ def __init__(
       triggering_frequency=0,
       use_at_least_once=False,
       with_auto_sharding=False,
+      num_storage_api_streams=0,
       expansion_service=None):
     """Initialize a StorageWriteToBigQuery transform.
@@ -2558,6 +2564,7 @@ def __init__(
     self._triggering_frequency = triggering_frequency
     self._use_at_least_once = use_at_least_once
     self._with_auto_sharding = with_auto_sharding
+    self._num_storage_api_streams = num_storage_api_streams
     self._expansion_service = (
         expansion_service or _default_io_expansion_service())
     self.schematransform_config = SchemaAwareExternalTransform.discover_config(
@@ -2569,6 +2576,7 @@ def expand(self, input):
         expansion_service=self._expansion_service,
         rearrange_based_on_discovery=True,
         autoSharding=self._with_auto_sharding,
+        numStreams=self._num_storage_api_streams,
         createDisposition=self._create_disposition,
         table=self._table,
         triggeringFrequencySeconds=self._triggering_frequency,
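A short note on the hunk above: the camelCase keyword arguments (autoSharding, numStreams, createDisposition, triggeringFrequencySeconds) are configuration field names of the cross-language Storage Write API SchemaTransform that the Python sink discovers and invokes, which is why they do not follow Python naming. A minimal sketch of the correspondence, limited to the fields visible in this diff:

# Illustrative mapping between Python-side kwargs and the external
# SchemaTransform config fields, as seen in this hunk only.
PYTHON_TO_EXTERNAL_FIELDS = {
    "with_auto_sharding": "autoSharding",
    "num_storage_api_streams": "numStreams",  # added by this PR
    "create_disposition": "createDisposition",
    "table": "table",
    "triggering_frequency": "triggeringFrequencySeconds",
}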
Review comment on this hunk: "Took me some time to parse this. Nit, and it might be common practice, but

auto_sharding = (num_streams == 0)

looks better."
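For illustration, a minimal sketch of the readability tweak being suggested, assuming the sharding decision is derived from the stream count; the function and variable names here are hypothetical and not taken from the Beam codebase:

def resolve_auto_sharding(num_storage_api_streams: int) -> bool:
  # Hypothetical helper: auto-sharding applies only when no fixed number of
  # write streams was requested (0 means "let the sink decide").
  auto_sharding = (num_storage_api_streams == 0)
  return auto_sharding

The point of the suggestion is that the one-line boolean expression states the relationship between the two settings directly, rather than leaving it implied by branching.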