
Enforce a size limit on StringSetData #32650

Merged
merged 2 commits into apache:master from stringsetcap on Oct 8, 2024

Conversation


@Abacn Abacn commented Oct 4, 2024

Fix #32649

  • Make the StringSetData set mutable. This avoids copying and creating a new ImmutableSet on every update (see the sketch below).
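
As a rough illustration of why this helps, here is a minimal sketch with simplified, hypothetical names (not the actual Beam classes): accumulating into one mutable set makes each update an amortized O(1) insert instead of an O(n) copy into a fresh ImmutableSet.

import java.util.HashSet;
import java.util.Set;

// Hypothetical simplification of the accumulator behind the StringSet metric.
class StringSetAccumulatorSketch {
  private final Set<String> values = new HashSet<>();

  // Before the change, each add was conceptually:
  //   values = ImmutableSet.<String>builder().addAll(values).add(value).build();
  // i.e. an O(n) copy per element, O(n^2) total for n adds.
  // With a mutable backing set, adding is an in-place, amortized O(1) operation.
  void add(String value) {
    values.add(value);
  }

  Set<String> snapshot() {
    // Copy only when a snapshot is actually needed (e.g. for reporting).
    return new HashSet<>(values);
  }
}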



Abacn commented Oct 4, 2024

Tested TextIOIT: write then read 100,000 files.

  • The user counter elements_read is 3,754,869 (should be 100,000,000); the counter is still dropped after the overall stringset grows.

  • "Elements added 100,000" and job progress both work while the job is running.

  • It emits the following warning log:

WARNING 2024-10-04T03:25:05.938Z StringSetData reaches capacity. Current size: 100023, last element size: 77. Further incoming elements will be dropped.

Dataflow job id: 2024-10-03_20_11_47-11842727745111768860

In contrast, with the same TextIOIT running on master, the job gets stuck (for 20-25 min) at the StringSetData add (see also #32649). Then, during the read, the Dataflow UI counters do not update (other than the throughput chart).

job id 2024-10-03_20_13_23-15614223084919276642

@Abacn Abacn force-pushed the stringsetcap branch 3 times, most recently from ad83b4d to d8c1399 Compare October 4, 2024 13:38

Abacn commented Oct 4, 2024

After more testing, I changed the cap to 1 MB. This should be good for roughly 10k elements.

Also tested on a simple pipeline: GenerateSequence -> report Lineage on every element.

HEAD (job id: 2024-10-04_06_15_24-8505055598897377194)

Throughput is several hundred elements/s; Cloud Profiler shows the stringset add taking most of the time.

This PR (job id: 2024-10-04_06_23_39-4373466717598467131)

The pipeline finishes immediately.

Saw this log:

StringSetData reaches capacity. Current size: 1000004, last element size: 58. Further incoming elements will be dropped.

confirming the cap is effective.

The user counter works: elements_read is 1,000,000.
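
For readers following along, here is a hedged sketch of the capped behavior described above; the constant, field names, and byte-size accounting are assumptions inferred from the warning log, not the exact Beam code.

import java.util.HashSet;
import java.util.Set;

// Illustrative only: drop further elements once the accumulated string size passes a cap.
class CappedStringSetSketch {
  private static final long STRING_SET_SIZE_LIMIT = 1_000_000; // roughly the 1 MB cap mentioned above
  private final Set<String> set = new HashSet<>();
  private long currentSize = 0;
  private boolean warned = false;

  void add(String value) {
    if (currentSize > STRING_SET_SIZE_LIMIT) {
      if (!warned) {
        warned = true;
        System.err.printf(
            "StringSetData reaches capacity. Current size: %d, last element size: %d."
                + " Further incoming elements will be dropped.%n",
            currentSize, value.length());
      }
      return; // drop the element
    }
    if (set.add(value)) {
      currentSize += value.length();
    }
  }
}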

* Make StringSetData set mutable. This avoids
  copy and create new ImutableSet every time
@Abacn Abacn marked this pull request as ready for review October 4, 2024 16:58

github-actions bot commented Oct 4, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

// check capacity both before insert and after insert one, so the warning only emit once.
if (currentSize > STRING_SET_SIZE_LIMIT) {
  LOG.warn(
      "StringSetData reaches capacity. Current size: {}, last element size: {}. Further "

This sounds a bit cryptic for the user, and dropping elements sounds scary. Could we have a message that would be informative and actionable from the user's perspective?


@Abacn Abacn Oct 4, 2024


Changed the language so it is clear this is about a metric (not the actual data being processed). This is not really actionable by the end user; it happens when there are lots of sources/sinks in the pipeline and the Beam metrics system cannot record all of them. For the current occurrence (file-based IO) it should be largely mitigated by #32662. The remaining case -- reading lots of files from a match that does not have a common directory -- will still need this guard to avoid messing up the metrics system.


I am a little worried about this specific change (silent drop) and the possible consequences it can have. The silent drop can cause more confusion because it can go unnoticed (as it is now, most customers don't look at job logs as long as the job is working fine, which will be the case here). Without the drop, the job metrics degrade, which is bad in itself but noticeable by the customer and actionable.

Also, StringSet is a core metric in Beam; Lineage is just built on top of it. So Lineage-specific changes should not be pushed down into the StringSet metric itself, since a customer can use the StringSet metric to build their own metric of this type for something else.

The root of the issue here is the size limit:

  • 20 MB ReportWorkItem limit
  • 32 MB GFE limit (even if we bump up the ReportWorkItem limit).

This limit gets hit whenever the metric size (irrespective of the type of metric) becomes too large. Generally this happens when metrics are emitted per record (increasing the size of the data). See a previous case where counter metrics caused this: https://buganizer.corp.google.com/issues/195841708#comment42; in that case the customer was advised not to emit per record.

The challenge is that in the case above, where the customer does this themselves, we can suggest they change their usage pattern (limiting, but it is the approach we have for now); but in the case of Lineage, which uses the StringSet metric, it is done by the system itself and the customer has no control over it.

As you said, the current occurrence (file-based IO) should be largely mitigated by #32662 (thank you for the prompt change). We still need to solve "read lots of files from a match that does not have a common directory"; see below.

Beyond the above change for file-based IO, my thoughts are:

  1. We should not limit the size of a particular metric like StringSet. If we decide to do it to protect customers from issues arising from the above size limit, then it should be a limit on the size of the metric being reported, applicable to all metrics for consistent behavior, rather than on a particular metric type.
  2. For Lineage specifically, if we want to limit the cardinality of reported lineage information to fit within the size limit, the limit should be applied to the Lineage being reported. Possible approaches: a) limit the unique Dataplex FQNs which can be reported as sources/sinks in a job; b) do some grouping based on FQN pattern, e.g. if the total size > n then report at one level up, etc.

Also cc: @robertwb if he has any specific thoughts/views on this.


Also, if we just want to enforce a stop-gap size check for lineage rather than doing it at the StringSet metric level, we could do it in the Lineage class, enforcing that the number of sources/sinks and/or their data size stays under some limit (see the sketch below).
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/Lineage.java#L33
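
A hedged sketch of what such a stop-gap check inside Lineage could look like; the constant and helper names are illustrative, not the actual Beam API.

import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Illustrative only: cap how many source/sink FQNs a pipeline forwards as lineage.
class LineageCapSketch {
  private static final int MAX_REPORTED_FQNS = 10_000; // hypothetical limit
  private final Set<String> reported = new HashSet<>();

  // stringSetAdd stands in for the underlying StringSet metric update.
  void add(String fqn, Consumer<String> stringSetAdd) {
    if (reported.size() >= MAX_REPORTED_FQNS) {
      return; // stop forwarding new FQNs once the cap is reached
    }
    if (reported.add(fqn)) {
      stringSetAdd.accept(fqn);
    }
  }
}

Note that, as discussed in the follow-up comments, such a check only sees the view of a single worker/container, which is part of why the check ended up in the metric cell instead.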


@Abacn Abacn Oct 4, 2024


Thanks for the comments. The trigger of the issue was a large Lineage, but the cause is that the Beam metrics implementation cannot handle large metrics (in total size), so the change is proposed in StringSetCell (the actual place where the worker stores the metric before reporting it). We can of course revisit the change once the limitation is fixed in the implementations. For now the goal is to make sure the whole metrics system does not break due to a large stringset.


@rohitsinha54 rohitsinha54 left a comment


Besides the ongoing discussion about limiting the stringset size, the other changes LGTM.

Thanks for the prompt fix. Appreciate it a ton!


@rohitsinha54 rohitsinha54 left a comment


Can we move the size enforcement to the Lineage class rather than forcing it on a particular metric type only, which makes it inconsistent with the others?


github-actions bot commented Oct 4, 2024

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @jrmccluskey for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).


Abacn commented Oct 4, 2024

Can we move the size enforcement to the Lineage class rather than forcing it on a particular metric type only, which makes it inconsistent with the others?

Actually, I found this is impossible. Lineage is a wrapper around StringSet, whose actual implementation is DelegateStringSet, backed by the runner's StringSetData. The Lineage class has no visibility into the current stringset size, and even if we tracked (e.g. with a static int) how many strings we put into Lineage, we would not know the current set size, because we do not know whether a newly added element already exists in the set.

@rohitsinha54

Can we move the size enforcement to the Lineage class rather than forcing it on a particular metric type only, which makes it inconsistent with the others?

Actually, I found this is impossible. Lineage is a wrapper around StringSet, whose actual implementation is DelegateStringSet, backed by the runner's StringSetData. The Lineage class has no visibility into the current stringset size, and even if we tracked (e.g. with a static int) how many strings we put into Lineage, we would not know the current set size, because we do not know whether a newly added element already exists in the set.

Thank you for considering and exploring this. This seems like a correct description of the current implementation. We can incrementally add more to support the above need.

Here is what I am thinking:
StringSet could expose a size() API which gives the current size of the StringSet metric. This would of course be backed by DelegateStringSet and the runner's StringSetData. It would give the Lineage class visibility into the current size, and based on that the Lineage class could enforce a limit. More generally, other consumers/users of the StringSet metric might want to know the current size of the set (see the sketch below).
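
A hedged sketch of the proposed (not existing) API surface, with illustrative names only:

// Proposed shape only; not the current Beam StringSet interface.
interface StringSetWithSize {
  void add(String value);

  // New: expose the current number of distinct strings so a consumer such as
  // Lineage could stop adding once its own limit is reached.
  long size();
}

// How a consumer like Lineage might use it (LINEAGE_LIMIT is hypothetical):
//   if (stringSet.size() < LINEAGE_LIMIT) {
//     stringSet.add(fqn);
//   }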

At large, besides the two points mentioned in (#32650 (comment)), I believe a better design would be for the StringSet metric to expose APIs and features enabling consumers such as Lineage to use the metric for their use cases (in this case a size limit), bottom up, rather than top down, wherein the consumer's use case is built into the metric itself. We want Beam's core metrics to be generic.

Also, the size limit on metrics is a Dataflow-specific limitation based on the ReportWorkItem/GFE limits. Other runners might or might not have it. We should avoid coupling it with Beam's core metrics.


Abacn commented Oct 4, 2024

Other runners might or might not have it. We should avoid coupling it with Beam's core metrics.

Though I didn't dig into it, I suspect this kind of limit is not specific to Dataflow. Under the portability framework, where the runner API is backed by gRPC messages, the response size should have some limit when a client queries for job status.

@rohitsinha54

Other runners might or might not have it. We should avoid coupling it with Beam's core metrics.

Though I didn't dig into it, I suspect this kind of limit is not specific to Dataflow. Under the portability framework, where the runner API is backed by gRPC messages, the response size should have some limit when a client queries for job status.

That makes sense.

@rohitsinha54

WDYT of breaking this PR in two parts:

  1. Making StringSetData set mutable
  2. Size limit enforcement (This can go in later separately)


Abacn commented Oct 4, 2024

WDYT of breaking this PR in two parts:

  1. Making StringSetData set mutable
  2. Size limit enforcement (This can go in later separately)

Yeah, this PR essentially contains two fixes, corresponding to the two related issues in #32649. The mutable set fixes the slowness.
However, we still need the limit enforcement to fix the metrics breakage. Even with #32662 in, the metrics system will still break for FileIO read patterns like gs://bucket/folder/**/some_data.txt, so I think both fixes are needed for the current release.

@rohitsinha54

WDYT of breaking this PR in two parts:

  1. Making StringSetData set mutable
  2. Size limit enforcement (This can go in later separately)

Yeah, this PR essentially contains two fixes, corresponding to the two related issues in #32649. The mutable set fixes the slowness. However, we still need the limit enforcement to fix the metrics breakage. Even with #32662 in, the metrics system will still break for FileIO read patterns like gs://bucket/folder/**/some_data.txt, so I think both fixes are needed for the current release.

Yes, correct. I was thinking we could submit "Making StringSetData set mutable" today, but it is up to you if you want to keep them together.


liferoad commented Oct 8, 2024

Can we do the minimum here to unblock Beam 2.60.0? We can keep improving later. @rohitsinha54 @tvalentyn


liferoad commented Oct 8, 2024

Please update CHANGES.md to mention the Beam versions impacted by this issue. @Abacn


Abacn commented Oct 8, 2024

Please update CHANGES.md to mention the Beam versions impacted by this issue. @Abacn

Done in #32664. The issue was introduced in Beam 2.59.0.

@rohitsinha54

Can we move the size enforcement to the Lineage class rather than forcing it on a particular metric type only, which makes it inconsistent with the others?

Actually, I found this is impossible. Lineage is a wrapper around StringSet, whose actual implementation is DelegateStringSet, backed by the runner's StringSetData. The Lineage class has no visibility into the current stringset size, and even if we tracked (e.g. with a static int) how many strings we put into Lineage, we would not know the current set size, because we do not know whether a newly added element already exists in the set.

Thank you for considering and exploring this. This seems like a correct description of the current implementation. We can incrementally add more to support the above need.

Here is what I am thinking: StringSet could expose a size() API which gives the current size of the StringSet metric. This would of course be backed by DelegateStringSet and the runner's StringSetData. It would give the Lineage class visibility into the current size, and based on that the Lineage class could enforce a limit. More generally, other consumers/users of the StringSet metric might want to know the current size of the set.

At large, besides the two points mentioned in (#32650 (comment)), I believe a better design would be for the StringSet metric to expose APIs and features enabling consumers such as Lineage to use the metric for their use cases (in this case a size limit), bottom up, rather than top down, wherein the consumer's use case is built into the metric itself. We want Beam's core metrics to be generic.

Also, the size limit on metrics is a Dataflow-specific limitation based on the ReportWorkItem/GFE limits. Other runners might or might not have it. We should avoid coupling it with Beam's core metrics.

We discussed this offline.

Here is a summary of the discussion for future reference.
If size() is exposed based on StringSetData, then the size will be for that specific bundle, since one metric may have multiple metrics containers (backed by different StringSetData instances). This can confuse users if exposed at the API level, and it also breaks the abstraction that Beam as a programming model strives to provide.

The current size enforcement also has this issue, but it is not exposed for users to consume. The StringSetData limit happens to work with FileIO because FileIO currently reports source and sink metrics in a loop, so it happens to write to a single metrics container on a single worker. Hence the size enforcement is tied to the FileIO implementation. So this is not an ideal fix, but we need to solve the FileIO issue with large numbers of files somehow.

I would propose dropping the limit enforcement on the string set completely anyway. We will limit the number of reported files in FileIO itself to ensure we do not hit the limit. This avoids the FileIO implementation being tied to a metric type.

See comment: #32662 (comment)


Abacn commented Oct 8, 2024

Communicated offline. We decided to keep the size limit, as it is parallel to #32662: even with the known issue fixed by #32662, the size limit could still be hit.

@Abacn Abacn merged commit 8a6f248 into apache:master Oct 8, 2024
105 checks passed
@Abacn Abacn deleted the stringsetcap branch October 8, 2024 23:31
Abacn added a commit to Abacn/beam that referenced this pull request Oct 8, 2024
* Enforce a size limit on StringSetData

* Make StringSetData set mutable. This avoids
  copy and create new ImutableSet every time

* adjust warning log
damccorm pushed a commit that referenced this pull request Oct 9, 2024
* Enforce a size limit on StringSetData

* Make StringSetData set mutable. This avoids
  copy and create new ImutableSet every time

* adjust warning log
@Abacn Abacn mentioned this pull request Nov 8, 2024
reeba212 pushed a commit to reeba212/beam that referenced this pull request Dec 4, 2024
* Enforce a size limit on StringSetData

* Make StringSetData set mutable. This avoids
  copy and create new ImutableSet every time

* adjust warning log

Successfully merging this pull request may close these issues.

[Bug]: Slowness and/or broken metrics visualization when Lineage metrics is large