[Bug]: BigQuery Storage Write API does not write with no complaint #28168

onurdialpad · 2023-08-25T20:52:02Z

What happened?

I wanted to test the Storage Write API with SDK 2.49.0 and tried to write some simple data on Dataflow, but the "writing" step does nothing and logs nothing.

Here is my code snippet.

  with beam.Pipeline(options=pipeline_options) as pipeline:
    ...
    # pylint: disable=line-too-long
    result = objects_for_storage | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(table_spec,
                                                                                schema=_SCHEMA,
                                                                                method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API)
    _ = (result.failed_rows_with_errors
         | 'Get Errors' >> beam.Map(lambda e: {
              "destination": e[0],
              "row": json.dumps(e[1]),
              "error_message": e[2][0]['message']
            })
         | "LogElements" >> beam.ParDo(LogElements()))

Here the step does not produce output

Issue Priority

Priority: 1

Issue Components


onurdialpad · 2023-08-28T15:50:13Z

.add-labels P1
.remove-labels P3

liferoad · 2023-08-28T16:42:17Z

Can you share more details about your job? Is it streaming? What is the source? Since this is a Dataflow job, you could open a cloud ticket: https://cloud.google.com/dataflow/docs/support/getting-support#file-bugs-or-feature-requests

ahmedabu98 · 2023-08-28T16:48:35Z

Hey @onurdialpad, can you provide a reproducible snippet?

onurdialpad · 2023-08-28T17:28:07Z

@ahmedabu98 sure, here is a snippet. I tested the snippet as well and it did the same thing: no write, no log or error.

import logging
import sys
from typing import Dict, Iterable, List

import apache_beam as beam
from apache_beam.options import pipeline_options as beam_pipeline_options


class CustomPipelineOptions(beam_pipeline_options.PipelineOptions):

  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_argument('--tablePrefix', help='The name of the table to write data to')

if __name__ == '__main__':
  _TABLE_PREFIX = 'tablePrefix'
  _PROJECT = 'project'

  table_schema = {
    'fields': [{
      'name': 'rand', 'type': 'STRING', 'mode': 'NULLABLE'
    }]
  }

  pipeline_options = CustomPipelineOptions(flags=sys.argv, streaming=True, save_main_session=True)

  with beam.Pipeline(options=pipeline_options) as pipeline:
    options = pipeline_options.get_all_options()
    project = options[_PROJECT]
    table_spec: str = f'{project}:mydataset.{_TABLE_PREFIX}'

    coll = pipeline | beam.Create([
      {
        'rand': 'Mahatma Gandhi'
      },
      {
        'rand': 'ABCD'
      },
    ])
    
    result = coll | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(table_spec,
                                                                 schema=table_schema,
                                                                 method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API)

@liferoad In the real environment the data comes from Pub/Sub and the Dataflow job runs in streaming mode. It works if I use STREAMING_INSERTS as the method here instead of STORAGE_WRITE_API. No matter what I try, the STORAGE_WRITE_API method does not work for me.

ahmedabu98 · 2023-08-29T14:15:37Z

@onurdialpad I tried running the snippet you provided on 2.49.0 and it worked on both the local and Dataflow runners. Can you provide any relevant logs you're seeing?

+1 to @liferoad's suggestion of opening a Dataflow ticket; it will help the internal engineers debug your pipeline.

onurdialpad · 2023-08-29T17:18:52Z

@ahmedabu98 thanks for trying that. Can you elaborate on what you meant by "it worked"? Did you see the job write records to BQ?

Regarding opening a Dataflow ticket, sure, I will do it. Just a note: when I run the snippet locally with DirectRunner, it does not write anything to BQ and logs nothing. To clarify, it "works" but not as intended: it is supposed to write to BQ but does not, and it just runs without doing anything.

ahmedabu98 · 2023-08-30T14:19:08Z

I mean that it did write records to the BQ table.

However, I think it's because the snippet uses a batch source (beam.Create()). I tried again with a streaming source and I'm getting a pipeline that doesn't output anything. I'm seeing these errors too:

This streaming case may be broken. I'm still investigating why.

ahmedabu98 · 2023-08-30T15:18:10Z

Hey @onurdialpad, I'm still digging into it but I've narrowed it down to runner V2 (both Java and Python jobs exhibit this behavior). I suspect a recent internal change is tripping up this behavior.

I'll continue investigating, but for now you may be able to mitigate this by running with the legacy runner. Python Dataflow jobs default to Runner V2, but you can disable it as long as you're using a Beam version earlier than ~~2.50.0~~ 2.45.0. Just use --experiments=disable_runner_v2_until_2023

ahmedabu98 · 2023-08-30T15:29:22Z

Ahh sorry, never mind. This xlang Storage Write connector was introduced in 2.47.0, so that mitigation won't work.

onurdialpad · 2023-08-30T16:37:48Z

Hey @ahmedabu98, thanks for the effort! It's interesting that the snippet I shared here doesn't produce any output on the BQ side even though it uses a batch source (beam.Create()), as you mentioned.

ahmedabu98 · 2023-09-21T15:48:20Z

Hey @onurdialpad, we've confirmed it is a bug in Dataflow's Runner V2 that gets hit by Storage Write API with autosharding.

One workaround is to use use_at_least_once=True, which will use at-least-once semantics (as opposed to exactly-once). More on that here: https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics

I'm going to open a PR to also allow setting a fixed number of shards as another workaround, which may be available in Beam 2.51.0 and will work with exactly-once semantics.
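
For anyone applying these workarounds to the repro snippet, here is a minimal sketch of how the two options might be passed to WriteToBigQuery. The keyword names `use_at_least_once` and `num_storage_api_streams` are assumptions about the Python SDK's `WriteToBigQuery` signature (the second only in versions that include the fixed-streams change), and the helper names `storage_write_kwargs` and `write_with_workaround` are hypothetical. Please check them against the Beam version you have installed.

```python
from typing import Any, Dict


def storage_write_kwargs(exactly_once: bool, num_streams: int = 0) -> Dict[str, Any]:
    """Return WriteToBigQuery keyword arguments for one of the two workarounds.

    The keyword names are assumptions about the Python SDK signature.
    """
    if not exactly_once:
        # Workaround 1: at-least-once semantics (rows may be duplicated).
        return {"use_at_least_once": True}
    if num_streams <= 0:
        raise ValueError("exactly-once mode needs a positive, fixed stream count")
    # Workaround 2: keep exactly-once semantics but pin the number of streams
    # instead of relying on autosharding.
    return {"num_storage_api_streams": num_streams}


def write_with_workaround(rows, table_spec, schema, exactly_once=False, num_streams=0):
    """Apply WriteToBigQuery to the PCollection `rows` with a workaround applied."""
    # Imported lazily so the helper above can be used without Beam installed.
    import apache_beam as beam

    return rows | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
        table_spec,
        schema=schema,
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
        **storage_write_kwargs(exactly_once, num_streams),
    )
```

In the repro snippet, the write step would then become something like `result = write_with_workaround(coll, table_spec, table_schema)` for at-least-once, or `write_with_workaround(coll, table_spec, table_schema, exactly_once=True, num_streams=2)` once a fixed stream count is supported. The value 2 is only an illustration.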

kennknowles · 2023-09-25T19:45:36Z

This is tagged as blocking 2.51.0 which is in progress now. This does seem like a major lack of functionality. I see followups and comments on and about #28592. Is there a cherrypick open or is it not yet resolved?

ahmedabu98 · 2023-09-25T19:54:18Z

Hey @kennknowles, this is resolved and a CP is ready in #28631

ahmedabu98 · 2023-09-25T19:56:54Z

Fixed by #28618 (follow-up of #28592)

onurdialpad added awaiting triage bug labels Aug 25, 2023

github-actions bot added python P3 labels Aug 25, 2023

onurdialpad changed the title ~~[Bug]: BigQuery Storage Write API doesn'~~ [Bug]: BigQuery Storage Write API does not write with no complaint Aug 25, 2023

github-actions bot added the bigquery label Aug 25, 2023

github-actions bot added P1 and removed P3 labels Aug 28, 2023

ahmedabu98 mentioned this issue Aug 30, 2023

Updating Storage API Autosharding documentation to include that it doesn't work on Runner V2 #28233

Merged

tvalentyn added the io label Sep 1, 2023

ahmedabu98 mentioned this issue Sep 21, 2023

[Python BQ] Allow setting a fixed number of Storage API streams #28592

Merged

liferoad added this to the 2.51.0 Release milestone Sep 21, 2023

ahmedabu98 closed this as completed Sep 25, 2023

github-actions bot modified the milestones: 2.51.0 Release, 2.52.0 Release Sep 25, 2023
