[Bug][prism]: Excessive Splitting #29180
Referencing my comment from the other issue here, since we might be referring to the same problem of the pipeline hanging, though it may be related to excessive splitting rather than failures caused by unregistered functions. It also happens in other … Dumping some logs below.
[log output omitted]
I'm facing a similar issue to the one posted in the previous comment by @johannaojeling, in this sample pipeline: it's derived from the minimal wordcount and changed to demonstrate xlang transform execution. However, it can also run as a "standalone" pipeline, using just Go pipeline code, if no expansion server is provided. Without any expansion server (i.e., just Go code), it works fine under DirectRunner or DataflowRunner, but does not work on Prism, stalling in the loop reading from the TextIO.
The pipeline does not seem to complete regardless of how long it is left running. I have tried to use register.Function2x1 to possibly fix it, but it consistently fails when executing with Prism. Not sure if this is caused by unregistered functions, but the symptom is the same. Should I open a separate issue for tracking this?
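For reference, this is roughly how register.Function2x1 is used with a wordcount-style DoFn. This is a minimal sketch, not the reporter's actual pipeline; the function name and types are illustrative assumptions.

```go
package main

import (
	"fmt"

	"github.com/apache/beam/sdks/v2/go/pkg/beam/register"
)

// formatFn mirrors the minimal_wordcount formatting DoFn: two inputs, one output.
func formatFn(w string, c int) string {
	return fmt.Sprintf("%s: %v", w, c)
}

func init() {
	// Function2x1 registers a func with two inputs and one return value,
	// so portable runners such as Prism can look it up at execution time
	// instead of falling back to reflection or failing.
	register.Function2x1[string, int, string](formatFn)
}

func main() {
	// Pipeline construction omitted in this sketch.
}
```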
I suspect that this is an issue with Prism doing overly aggressive splitting, which is probably not playing nicely with textio, and is not related to function registration.
@ronoaldo I think a different issue should track what you've seen. Do tag me in that issue, as I think it shouldn't be too hard to start improving the split approach to better tolerate fileio while still quickly passing the separation harness test. The splitting decision is very open to change right now. I opened this one to track a specific bug in Prism with respect to pipeline hangs, and I seem to have missed closing it after resolving the race condition. The minimal_wordcount now correctly fails with errors relating to registration, instead of sometimes hanging forever, so I'd like to close this issue.
Lastly, @ronoaldo, I'm also unable to replicate the bug you're seeing at repo HEAD or through versions of Prism from Beam releases 2.50.0 to 2.53.0. What version are you seeing it on? (edit: you do say it's all the time...) OK. So far I'm simply unable to replicate it. Hmmm.
I can trigger the "aggressive splitting" if I add a sleep to the pipeline. While this is still a basic heuristic, having the TotalCount of elements emitted since the last progress report available will let us make it more sophisticated in the future, based on the ratios of dataChannel inputs consumed since last time and total outputs and so on, so it's not only splitting when a bundle appears to be stuck (e.g. if we produced less than half an output per input since the last progress tick); see the sketch below. Anyway... #29968 might solve your issue, if you're able to patch it in and test it.
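A rough sketch of the kind of decision the TotalCount progress value enables. This is not Prism's actual implementation; the struct and field names are hypothetical, and the threshold idea is only the refinement mentioned above.

```go
package split

// progressSnapshot captures what the progress loop sees on each tick.
type progressSnapshot struct {
	InputsConsumed int64 // data channel elements read so far (hypothetical name)
	TotalCount     int64 // total elements emitted so far (hypothetical name)
}

// shouldSplit returns true only when the bundle looks stuck: the input
// index hasn't advanced AND output hasn't grown since the last tick.
// A later refinement could also split when the output/input ratio since
// the last tick drops below some threshold (e.g. < 0.5 outputs per input).
func shouldSplit(prev, cur progressSnapshot) bool {
	inputStalled := cur.InputsConsumed == prev.InputsConsumed
	outputStalled := cur.TotalCount == prev.TotalCount
	return inputStalled && outputStalled
}
```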
Updated this issue to track the split discussion, since I do have a PR that appears to avoid the splitting aggression, at the cost of general throughput (described elsewhere).
[prism] Return total element count to progress loop. Split less aggressively. (#29968)
* Update comments.
Co-authored-by: lostluck <[email protected]>
Sorry for the lack of a quick reply! I'll try to run the same pipeline inside a Docker container, which may help reproduce it. I'll try the fix as well and report back, thanks for the quick reply!
tl;dr: network speed / storage caching / CDN was causing the latency and triggering the bug I reported; I guess that using the sleep is the best way to reproduce it. Just sharing my findings when testing... Previously it was consistently failing for me, perhaps due to the network speed/latency. At first I thought it was due to Go 1.21.4 -- testing with another Go version was not triggering it, but I was able to reproduce the issue with that specific version using Docker. Then I went ahead and started a for loop to test several Go minor versions, and they all passed without issues. I then switched the file from kinglear to othello and it failed once, then never failed again. I guess that Cloud Storage somehow cached the file, and that reduced network latency during the local pipeline execution. Combining this with your comment, it does seem to be related to the latency, as adding a time.Sleep reproduced it. I'll test with the newer version of the library now to confirm that it passes as well. Thanks!
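One way to simulate the slow read locally is to insert an artificial delay in the per-element path right after the textio read. This is a minimal sketch under that assumption; the DoFn name and delay value are illustrative, not the reporter's actual code.

```go
package repro

import (
	"time"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/register"
)

// slowIdentity passes each line through after an artificial delay,
// standing in for network/storage latency on the read path.
func slowIdentity(line string) string {
	time.Sleep(50 * time.Millisecond)
	return line
}

func init() {
	register.Function1x1[string, string](slowIdentity)
}

// addLatency wraps a PCollection of lines with the delaying DoFn.
func addLatency(s beam.Scope, lines beam.PCollection) beam.PCollection {
	return beam.ParDo(s, slowIdentity, lines)
}
```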
Thank you for reporting! Being inside Google's VPN definitely doesn't hurt my personal latency. Longer term I'd like Prism to track metrics like this so it's easier to write a test to reproduce it and avoid a regression. I've added some initial considerations to #29650.
What happened?
The current policy for Prism is to split very aggressively when the element index hasn't advanced.
It should split, but only when a bundle is giving no other indication of progress. Optimizing the splits for processing throughput is a separate issue.
(Previously, this issue was about a hang due to not having DoFns registered, but after other deflake work that now reliably catches bundle failures, the original issue no longer occurs. Commentary pulled this issue into the aggressive-splitting problem.)
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components