
[YAML] - Kafka Proto String schema #29835

Merged
merged 3 commits into apache:master on Jan 9, 2024

Conversation

@ffernandez92 (Contributor) commented Dec 20, 2023


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch): badges for "Build python source distribution and wheels", Python tests, Java tests, and Go tests.
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

github-actions bot commented:

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @riteshghorse for label python.
R: @damondouglas for label java.
R: @damondouglas for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@ffernandez92 (Contributor, Author) commented Dec 21, 2023

@brucearctor this PR contains the String Proto schema feature for Kafka. There is a test check that fails with: CData section too big found, line 100046, column 254 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 100046).
Apparently the tests for that class (which I haven't touched) generate XML result files that are validated downstream. It looks like we are hitting a limitation where the XML result file is bigger than 10 MB.

I've also tested this with Dataflow using different configurations, and it seems to be working fine.

@ffernandez92 (Contributor, Author) commented:
A bit more info about that failed test:

  • The test beam_PreCommit_Java_Kafka_IO_Direct runs fine. However, the Publish JUnit Test Results step shows the following error when publishing the test results:
Run EnricoMi/publish-unit-test-result-action@v2

/usr/local/bin/docker run --name ghcrioenricomipublishunittestresultactionv2110_da9c74 --label 9ada59 --workdir /github/workspace --rm -e "GRADLE_ENTERPRISE_ACCESS_KEY" -e "GRADLE_ENTERPRISE_CACHE_USERNAME" -e "GRADLE_ENTERPRISE_CACHE_PASSWORD" -e "KUBELET_GCLOUD_CONFIG_PATH" -e "GRADLE_BUILD_ACTION_SETUP_COMPLETED" -e "GRADLE_BUILD_ACTION_CACHE_RESTORED" -e "INPUT_COMMIT" -e "INPUT_COMMENT_MODE" -e "INPUT_FILES" -e "INPUT_GITHUB_TOKEN" -e "INPUT_GITHUB_TOKEN_ACTOR" -e "INPUT_GITHUB_RETRIES" -e "INPUT_CHECK_NAME" -e "INPUT_COMMENT_TITLE" -e "INPUT_FAIL_ON" -e "INPUT_ACTION_FAIL" -e "INPUT_ACTION_FAIL_ON_INCONCLUSIVE" -e "INPUT_JUNIT_FILES" -e "INPUT_NUNIT_FILES" -e "INPUT_XUNIT_FILES" -e "INPUT_TRX_FILES" -e "INPUT_TIME_UNIT" -e "INPUT_TEST_FILE_PREFIX" -e "INPUT_REPORT_INDIVIDUAL_RUNS" -e "INPUT_REPORT_SUITE_LOGS" -e "INPUT_DEDUPLICATE_CLASSES_BY_FILE_NAME" -e "INPUT_LARGE_FILES" -e "INPUT_IGNORE_RUNS" -e "INPUT_JOB_SUMMARY" -e "INPUT_COMPARE_TO_EARLIER_COMMIT" -e "INPUT_PULL_REQUEST_BUILD" -e "INPUT_EVENT_FILE" -e "INPUT_EVENT_NAME" -e "INPUT_TEST_CHANGES_LIMIT" -e "INPUT_CHECK_RUN_ANNOTATIONS" -e "INPUT_CHECK_RUN_ANNOTATIONS_BRANCH" -e "INPUT_SECONDS_BETWEEN_GITHUB_READS" -e "INPUT_SECONDS_BETWEEN_GITHUB_WRITES" -e "INPUT_SECONDARY_RATE_LIMIT_WAIT_SECONDS" -e "INPUT_JSON_FILE" -e "INPUT_JSON_THOUSANDS_SEPARATOR" -e "INPUT_JSON_SUITE_DETAILS" -e "INPUT_JSON_TEST_CASE_RESULTS" -e "INPUT_SEARCH_PULL_REQUESTS" -e "HOME" -e "GITHUB_JOB" -e "GITHUB_REF" -e "GITHUB_SHA" -e "GITHUB_REPOSITORY" -e "GITHUB_REPOSITORY_OWNER" -e "GITHUB_REPOSITORY_OWNER_ID" -e "GITHUB_RUN_ID" -e "GITHUB_RUN_NUMBER" -e "GITHUB_RETENTION_DAYS" -e "GITHUB_RUN_ATTEMPT" -e "GITHUB_REPOSITORY_ID" -e "GITHUB_ACTOR_ID" -e "GITHUB_ACTOR" -e "GITHUB_TRIGGERING_ACTOR" -e "GITHUB_WORKFLOW" -e "GITHUB_HEAD_REF" -e "GITHUB_BASE_REF" -e "GITHUB_EVENT_NAME" -e "GITHUB_SERVER_URL" -e "GITHUB_API_URL" -e "GITHUB_GRAPHQL_URL" -e "GITHUB_REF_NAME" -e "GITHUB_REF_PROTECTED" -e "GITHUB_REF_TYPE" -e "GITHUB_WORKFLOW_REF" -e "GITHUB_WORKFLOW_SHA" -e "GITHUB_WORKSPACE" -e "GITHUB_EVENT_PATH" -e "GITHUB_PATH" -e "GITHUB_ENV" -e "GITHUB_STEP_SUMMARY" -e "GITHUB_STATE" -e "GITHUB_OUTPUT" -e "GITHUB_ACTION" -e "GITHUB_ACTION_REPOSITORY" -e "GITHUB_ACTION_REF" -e "RUNNER_OS" -e "RUNNER_ARCH" -e "RUNNER_NAME" -e "RUNNER_ENVIRONMENT" -e "RUNNER_TOOL_CACHE" -e "RUNNER_TEMP" -e "RUNNER_WORKSPACE" -e "ACTIONS_RUNTIME_URL" -e "ACTIONS_RUNTIME_TOKEN" -e "ACTIONS_CACHE_URL" -e "ACTIONS_RESULTS_URL" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/runner/_work/_temp/_github_home":"/github/home" -v "/runner/_work/_temp/_github_workflow":"/github/workflow" -v "/runner/_work/_temp/_runner_file_commands":"/github/file_commands" -v "/runner/_work/beam/beam":"/github/workspace" ghcr.io/enricomi/publish-unit-test-result-action:v2.11.0
2023-12-21 08:50:11 +0000 - publish -  INFO - Available memory to read files: 17.9 GiB
2023-12-21 08:50:13 +0000 - publish -  INFO - Reading files **/build/test-results/**/*.xml (34 files, 85.9 MiB)
2023-12-21 08:50:14 +0000 - publish -  INFO - Detected 34 JUnit XML files (85.9 MiB)
2023-12-21 08:50:14 +0000 - publish -  INFO - Finished reading 34 files in 1.43 seconds
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 100046, column 254
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 99247, column 127
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 98891, column 128
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 99542, column 58
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 99182, column 243
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 98943, column 96
2023-12-21 08:50:15 +0000 - publish -  INFO - Publishing failure results for commit f295bda46585a7acff61ed373379a3b7e0dfeff5
2023-12-21 08:50:17 +0000 - publish -  INFO - Created check https://github.com/apache/beam/runs/19853749247
2023-12-21 08:50:17 +0000 - publish -  INFO - Created job summary
2023-12-21 08:50:17 +0000 - publish -  INFO - Commenting on pull requests disabled
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 100046, column 254
Error: Error processing result file: CData section too big found, line 100046, column 254 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 100046)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 99247, column 127
Error: Error processing result file: CData section too big found, line 99247, column 127 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 99247)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 98891, column 128
Error: Error processing result file: CData section too big found, line 98891, column 128 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 98891)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 99542, column 58
Error: Error processing result file: CData section too big found, line 99542, column 58 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 99542)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 99182, column 243
Error: Error processing result file: CData section too big found, line 99182, column 243 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 99182)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 98943, column 96
Error: Error processing result file: CData section too big found, line 98943, column 96 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 98943)
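For context on the failure above: the 10 MB ceiling is libxml2's default cap on a single text/CDATA node, which lxml (the parser the publish action evidently uses, per the lxml.etree.XMLSyntaxError lines) enforces unless the parser is constructed with huge_tree=True. The action's environment list above even includes an INPUT_LARGE_FILES option that presumably toggles this. A minimal sketch reproducing and lifting the limit, assuming only that lxml is installed:

```python
# Hedged sketch: reproduce libxml2's default 10,000,000-byte cap on a single
# CDATA/text node, then lift it with huge_tree=True (XML_PARSE_HUGE).
from lxml import etree

xml = (b"<testsuite><system-out><![CDATA["
       + b"x" * 11_000_000
       + b"]]></system-out></testsuite>")

try:
    etree.fromstring(xml)  # default parser enforces the cap
except etree.XMLSyntaxError as e:
    print(e)  # e.g. "CData section too big found, line 1, column ..."

parser = etree.XMLParser(huge_tree=True)  # opt in to huge text nodes
root = etree.fromstring(xml, parser)
print(len(root.findtext("system-out")))  # 11000000
```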

@brucearctor (Contributor) commented Dec 21, 2023

I still want to dig more closely into the code; I've only superficially skimmed it at this point...

Rereading @ffernandez92's comments -- this seems likely to be an issue with EnricoMi/publish-unit-test-result-action@v2 ... and not the test itself. So, that's a positive!


github-actions bot commented Jan 5, 2024

Reminder, please take a look at this PR: @riteshghorse @damondouglas @damondouglas

@brucearctor (Contributor) commented:

I'll be curious to hear anyone else's thoughts.

As I understand it, this PR is fine from a code perspective, BUT it surfaces an issue in some of our testing infrastructure. It's not that a test 'fails', but rather that we're hitting a limitation in something we rely on.

I'm inclined to merge the PR and then address the testing-infra limitation afterwards [if it persists]. Thoughts?

@Polber (Contributor) left a review:

Thanks for adding this! I gave a few suggestions, mostly nits on formatting and organization.

Comment on lines +179 to 182
  } else {
    throw new IllegalArgumentException(
-       "Expecting both descriptorPath and messageName to be non-null.");
+       "At least a descriptorPath or a proto Schema is required.");
  }
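To make the new contract concrete, here is a minimal sketch of the accepted configurations, transposed to Python for brevity; the function name, parameter names, and return values are illustrative assumptions, not Beam's actual API:

```python
# Hypothetical mirror of the validation in the diff above: the schema may come
# either from a descriptor set (descriptorPath + messageName) or from an
# inline proto schema string; otherwise we fail with the new error message.
from typing import Optional

def resolve_proto_schema_source(descriptor_path: Optional[str],
                                message_name: Optional[str],
                                schema: Optional[str]) -> str:
    if descriptor_path is not None and message_name is not None:
        return "descriptor_set"   # load the message from a compiled descriptor set
    if schema is not None:
        return "schema_string"    # parse the inline proto schema string
    raise ValueError("At least a descriptorPath or a proto Schema is required.")

# Usage: either pair a descriptor path with a message name, or pass a schema.
assert resolve_proto_schema_source("msgs.pb", "com.example.Order", None) == "descriptor_set"
assert resolve_proto_schema_source(None, None, 'syntax = "proto3"; ...') == "schema_string"
```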
Contributor:
More for my understanding - why is a schema provided by the Configuration required here? The other data formats use the schema from the incoming PCollectionRowTuple input to create the schema for the outgoing PCollectionRowTuple output. Can the output Proto schema not be constructed from the input Row schema?

Contributor (Author):

Providing a separate schema in the Configuration offers flexibility and explicit control over the translation process, particularly when addressing variations in field mapping, data types, nested structures, default values, and schema evolution. Other alternatives, such as using the StorageApiProto, were considered, but that approach could prevent the resulting output from matching the Proto schema expected by the downstream reader. Another option explored was the approach used in Scio (https://spotify.github.io/scio/io/Protobuf.html#write-protobuf-files), where a wrapper is created; however, this introduces a layer of abstraction, potentially resulting in output that does not precisely align with the user's desired schema. I remain open to suggestions for alternative approaches here.

Contributor:

Interesting -- I didn't realize a translation function existed from Beam Row to proto; I imagine the relevant pieces are around: https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.html

We'd definitely need to understand the translation much more to ensure it is sufficiently deterministic. Passing the information explicitly removes all doubt.
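As a concrete illustration of that nondeterminism (a hedged sketch using the protobuf runtime, not Beam code): a Row's integer field alone does not pin down the proto wire format, because, for example, int64 and sint64 serialize the same value differently:

```python
# Two proto messages that both plausibly "match" a Beam Row with one integer
# field, yet produce different bytes on the wire; the demo.* names are made up.
from collections import OrderedDict
from google.protobuf import descriptor_pb2, proto_builder

AsInt64 = proto_builder.MakeSimpleProtoClass(
    OrderedDict(value=descriptor_pb2.FieldDescriptorProto.TYPE_INT64),
    full_name="demo.AsInt64")
AsSint64 = proto_builder.MakeSimpleProtoClass(
    OrderedDict(value=descriptor_pb2.FieldDescriptorProto.TYPE_SINT64),
    full_name="demo.AsSint64")

print(AsInt64(value=-1).SerializeToString().hex())   # 08ffffffffffffffffff01 (ten-byte varint)
print(AsSint64(value=-1).SerializeToString().hex())  # 0801 (single zigzag byte)
```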

Contributor:

I see - in that case I think providing an explicit schema, at least optionally, makes sense. Perhaps support for an implicit schema could be added in a future FR.

@Polber (Contributor) commented Jan 8, 2024

> I'll be curious to hear anyone else's thoughts.
>
> As I understand it, this PR is fine from a code perspective, BUT it surfaces an issue in some of our testing infrastructure. It's not that a test 'fails', but rather that we're hitting a limitation in something we rely on.
>
> I'm inclined to merge the PR and then address the testing-infra limitation afterwards [if it persists]. Thoughts?

I agree with this.

@damccorm, do you have any reservations about merging?

@damccorm (Contributor) commented Jan 9, 2024

> > I'll be curious to hear anyone else's thoughts.
> >
> > As I understand it, this PR is fine from a code perspective, BUT it surfaces an issue in some of our testing infrastructure. It's not that a test 'fails', but rather that we're hitting a limitation in something we rely on.
> >
> > I'm inclined to merge the PR and then address the testing-infra limitation afterwards [if it persists]. Thoughts?
>
> I agree with this.
>
> @damccorm, do you have any reservations about merging?

Where is the failed check? If the test infra is flaky, then I agree we shouldn't block on it. If we are turning a meaningful suite perma-red, then I think we should address that before proceeding. Looking at the current PR, I only see the failing Kafka check, which looks like it is running into a timeout (maybe it's stuck)? We likely should not ignore that.

@damccorm (Contributor) commented Jan 9, 2024

I added #29964 to address the timeout issue.

@brucearctor (Contributor) commented:

> > > I'll be curious to hear anyone else's thoughts.
> > >
> > > As I understand it, this PR is fine from a code perspective, BUT it surfaces an issue in some of our testing infrastructure. It's not that a test 'fails', but rather that we're hitting a limitation in something we rely on.
> > >
> > > I'm inclined to merge the PR and then address the testing-infra limitation afterwards [if it persists]. Thoughts?
> >
> > I agree with this.
> >
> > @damccorm, do you have any reservations about merging?
>
> Where is the failed check? If the test infra is flaky, then I agree we shouldn't block on it. If we are turning a meaningful suite perma-red, then I think we should address that before proceeding. Looking at the current PR, I only see the failing Kafka check, which looks like it is running into a timeout (maybe it's stuck)? We likely should not ignore that.

Also see --> #29835 (comment):

Error: Error processing result file: CData section too big found, ...

This seems to be a limitation in EnricoMi/publish-unit-test-result-action@v2 ...?

Since an issue has been filed, it also seems like we can proceed, see whether this is a persistent or a flaky problem, and then prioritize a fix if warranted -- rather than treating it as a blocker.

@Polber (Contributor) left a review:

Approving, assuming the conversation over the failing test is resolved before merging.

@damccorm (Contributor) commented Jan 9, 2024

> > > > I'll be curious to hear anyone else's thoughts.
> > > >
> > > > As I understand it, this PR is fine from a code perspective, BUT it surfaces an issue in some of our testing infrastructure. It's not that a test 'fails', but rather that we're hitting a limitation in something we rely on.
> > > >
> > > > I'm inclined to merge the PR and then address the testing-infra limitation afterwards [if it persists]. Thoughts?
> > >
> > > I agree with this.
> > >
> > > @damccorm, do you have any reservations about merging?
> >
> > Where is the failed check? If the test infra is flaky, then I agree we shouldn't block on it. If we are turning a meaningful suite perma-red, then I think we should address that before proceeding. Looking at the current PR, I only see the failing Kafka check, which looks like it is running into a timeout (maybe it's stuck)? We likely should not ignore that.
>
> Also see --> #29835 (comment):
>
> Error: Error processing result file: CData section too big found, ...
>
> This seems to be a limitation in EnricoMi/publish-unit-test-result-action@v2 ...?

Oh, I see - this failed silently and the workflow still succeeded. Yeah, I think this is fine to ignore. It's actually not a new issue (e.g., a scheduled run on master ran into this earlier today - https://github.com/apache/beam/actions/runs/7458106150).

> Since an issue has been filed, it also seems like we can proceed, see whether this is a persistent or a flaky problem, and then prioritize a fix if warranted -- rather than treating it as a blocker.

Have we actually filed the issue? I don't see one referenced in the comments, and I couldn't find one.

@brucearctor merged commit 6066af3 into apache:master on Jan 9, 2024
89 of 90 checks passed
@brucearctor (Contributor) commented:

Merged ... And filed: #29966 ...
