Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: create buffers and buckets before updating Vertices #2112

Merged
merged 1 commit into from
Oct 5, 2024

Conversation

juliev0
Copy link
Contributor

@juliev0 juliev0 commented Oct 2, 2024

Fixes #2083

I simply swapped 1. the Job creation to occur prior to 2. the creation/deletion/Update of the Vertex CRs.

Previously the creation of the Jobs was dependent on the Vertex specs (newBuffers and newBuckets are calculated based on Vertex specs).

So if there's a transient error that occurs after or in the middle of creating/updating the Vertex CRs, the next reconciliation will incorrectly calculate the Jobs it needs to perform, as some or all of the Vertex specs have changed.

Copy link

codecov bot commented Oct 2, 2024

Codecov Report

Attention: Patch coverage is 16.12903% with 26 lines in your changes missing coverage. Please review.

Project coverage is 64.45%. Comparing base (3dbed43) to head (e5cb995).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
pkg/reconciler/pipeline/controller.go 16.12% 24 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2112      +/-   ##
==========================================
+ Coverage   64.25%   64.45%   +0.19%     
==========================================
  Files         324      324              
  Lines       30650    30650              
==========================================
+ Hits        19695    19755      +60     
+ Misses       9913     9852      -61     
- Partials     1042     1043       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@whynowy
Copy link
Member

whynowy commented Oct 2, 2024

This looks good to me. @kohlisid and @KeranYang , can you double check if there's any issue with the order switch?

@juliev0
Copy link
Contributor Author

juliev0 commented Oct 2, 2024

I actually haven't tested it to tell you the truth yet. I wanted to see if it would pass the CI. If we can build an image for it, I can start to incorporate it into Numaplane, which is aggressively testing this. We can keep it in Draft stage for a little while until we feel confident.

@juliev0
Copy link
Contributor Author

juliev0 commented Oct 2, 2024

What this does mean is that the Buffers and Buckets that the vertices are currently using will disappear and those Vertices will not be able to work with the isbsvc. Then the vertices will get updated and they will get reconciled by Vertex Controller to replace the old Pods with the new ones, and I think those should come up.

Of course, if the Pipeline is paused first, then the Vertices will be down when the buffers and buckets get wiped and replaced, so nothing should be broken then.

@KeranYang
Copy link
Member

Thanks Julie for making the change. I don't think this change will break anything so should be safe to merge. My only concern is that we didn't validate the fix. +1 on using numaplane e2e to aggressively test it. Meanwhile, we should develop our own numaflow test case to catch such issue - a test with pipeline topology change would work.

@juliev0 , if we merge this in, will numaplane e2e immediately get it? If so, we can monitor.

@kohlisid
Copy link
Contributor

kohlisid commented Oct 2, 2024

Thanks @juliev0
Looks good on a high level! +1 on testing it out before rolling it.
Meanwhile will think of defining some cases for verification on our end as well that can be incorporated to help with such changes in the future.

@juliev0
Copy link
Contributor Author

juliev0 commented Oct 5, 2024

I saw this same thing happen again when running our e2e test so I guess it's not that rare. Meanwhile, I don't believe I've seen any adverse consequences of this change. Therefore, I'll make this "ready for review". Feel free to merge either now or we can give it a little more time if you prefer.

@juliev0 juliev0 marked this pull request as ready for review October 5, 2024 16:11
@juliev0 juliev0 requested review from whynowy and vigith as code owners October 5, 2024 16:11
@whynowy whynowy merged commit 06515a2 into numaproj:main Oct 5, 2024
25 checks passed
@KeranYang
Copy link
Member

Thanks, @juliev0 . Please continue monitoring and let us know if this fixes.

@juliev0 juliev0 deleted the jobs-before-vertices branch October 6, 2024 14:48
whynowy pushed a commit that referenced this pull request Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No "Create Job" executed - Possible Race Condition
4 participants