Fix overwriting etcd data when local subnet file exists #1505
Conversation
If host_A goes offline and gets a different subnet when it comes back online, what happens to the pods running on that host? Will they keep using the old subnet? |
I persist the subnet information to the /opt/software/flannel/subnet.env file. I expect the subnet to be reassigned when the node restarts, but flannel loads the local file instead. When the updateSubnet method is called, etcd is not locked, so the data under the same key gets overwritten. You can reproduce the problem quickly this way. |
@sjoerdsimons could you review this PR? I am not familiar with etcd as subnet manager |
This looks correct to me. To be honest, I can't really work out what the previous logic was trying to do; I don't see why it would ever update a subnet that's no longer correlated with the node's IP (which should imply someone else is using it). So dropping that whole chunk seems fine. |
Thanks a lot! |
@sjoerdsimons @zhangzhangzf can we wrap any testing around this? Our concern is that it has worked one way for so long, and we're ripping out code that people could be relying on. If this is backwards compatible with no issues, we can merge. |
@luthermonson That's one for @zhangzhangzf; I was just asked for (and gave) my opinion about it ;) It is a bit of a behavioural change for sure. How it will impact users I don't really know; I hope some of the flannel maintainers can answer that. After wondering about it in the back of my head for a few days, I suspect this code was meant to allow nodes that change their IP address to recover their previous subnet. However, the way it does this is fundamentally racy (as it's only the IPv4 address that "identifies" the owner). Basically what the now-removed code handles correctly is:
Where it goes wrong is:
The latter case is not that likely to happen in practice, as there is some randomisation involved in lease allocation (it picks a random subnet out of the first 100 free subnets), at least assuming people don't do silly things like cloning machines including a persistently stored subnet.env file. But if it happens it's obviously quite bad, which is what this patch fixes. The code that's left after this patch with respect to previous subnet handling will only opportunistically re-use the subnet from subnet.env if there is no current lease tied to the node's IP address and there is no current lease for the previous subnet, which is race-free. So I guess the question is whether deployments rely on flannel keeping or recovering the same subnet after the host IP address changes, which is not something I can really answer. |
@sjoerdsimons You're right, this patch fixes the latter problem. But I don't agree that the problem is infrequent. The 100 candidate subnets in the pool are taken in ascending order and one is picked at random, so repeats happen easily in large-scale clusters, especially when the cluster is not stable enough. A repeat can lead to serious problems. We have hit this problem many times in our deployment environment. |
@luthermonson I didn't find any compatibility problem. The maintainers can discuss that as a next step. I have deployed the patched code in a large-scale cluster, where it solves the latter problem described by sjoerdsimons. |
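The re-use rule described in the comment above can be sketched roughly as follows. This is a minimal illustration, not flannel's actual code: the `Lease` type, the `Decision` values and the `decide` function are hypothetical names, and the lease list is a plain slice rather than data read from etcd. In a real subnet manager the "re-use" branch would still have to be an atomic create against the store, so that two nodes cannot both claim the same free subnet.

```go
package main

import "fmt"

// Lease is a simplified, hypothetical stand-in for a lease record in the
// subnet store; it is not flannel's actual type.
type Lease struct {
	Subnet   string // e.g. "10.244.1.0/24"
	PublicIP string // IP address of the node that holds the lease
}

// Decision says what a node should do about its subnet on startup.
type Decision int

const (
	KeepCurrent   Decision = iota // a lease already exists for this node's IP
	ReusePrevious                 // the previous subnet is unclaimed; try to take it
	AllocateNew                   // no previous subnet, or someone else holds it
)

// decide applies the rule from the discussion. prevSubnet is the value read
// from subnet.env, or "" if there is none.
func decide(leases []Lease, nodeIP, prevSubnet string) (Decision, string) {
	// A lease tied to this node's IP always wins.
	for _, l := range leases {
		if l.PublicIP == nodeIP {
			return KeepCurrent, l.Subnet
		}
	}
	if prevSubnet == "" {
		return AllocateNew, ""
	}
	// If any lease already covers the previous subnet, it belongs to another
	// node and must not be overwritten.
	for _, l := range leases {
		if l.Subnet == prevSubnet {
			return AllocateNew, ""
		}
	}
	return ReusePrevious, prevSubnet
}

func main() {
	leases := []Lease{{Subnet: "10.244.1.0/24", PublicIP: "192.168.0.11"}}
	// A second node holds a stale subnet.env that names the first node's subnet.
	d, s := decide(leases, "192.168.0.12", "10.244.1.0/24")
	fmt.Println(d == AllocateNew, s == "") // prints "true true"
}
```

The point of the sketch is that the stale file is only ever a hint: it can make the node ask for a particular free subnet, but it can never cause a write to a key that another node already owns.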
Long answer: Short answer: |
Hey @AleksandrNull, thanks for the comment. If you upvote the PR, would you mind approving it? Then we could merge. @zhangzhangzf, could you please rebase? I added a linter. |
LGTM.
@zhangzhangzf could you please rebase? Then we will merge |
@zhangzhangzf we will release the new version of flannel without this PR. But we can add it in subsequent patch releases once you rebase. Thanks |
Sorry, I'm a junior user of GitHub. I'm not sure whether the operation I just did solved the problem? |
No worries! I think they did :) |
Description
Fix #1289
Remove the code that loads the local subnet file and updates etcd. This behavior overwrites the record stored under the same key in etcd.
A description of the problem is recorded in issue #1289. In that case, the flannel processes on two nodes end up running on the same subnet, which makes the node network unavailable.
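To make the overwrite concrete, here is a small sketch contrasting an unconditional write with a create-only write. It assumes a toy in-memory `Store` with hypothetical `Put`, `Create` and `Owner` methods; none of these are flannel or etcd client APIs, and the key format is made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrKeyExists is returned by Create when the key is already present.
var ErrKeyExists = errors.New("key already exists")

// Store is a hypothetical in-memory stand-in for the lease store.
// It maps a subnet key to the IP of the node that owns the lease.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

func NewStore() *Store { return &Store{data: make(map[string]string)} }

// Put writes unconditionally, replacing whatever was stored under key.
func (s *Store) Put(key, owner string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = owner
}

// Create writes only if key is absent, so an existing lease is never replaced.
func (s *Store) Create(key, owner string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.data[key]; ok {
		return ErrKeyExists
	}
	s.data[key] = owner
	return nil
}

// Owner returns the node IP stored under key, or "" if there is none.
func (s *Store) Owner(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

func main() {
	const key = "10.244.1.0-24" // illustrative key format only

	// Unconditional update: node B replays a stale subnet.env and silently
	// replaces node A's record.
	s := NewStore()
	s.Put(key, "192.168.0.11") // node A
	s.Put(key, "192.168.0.12") // node B
	fmt.Println(s.Owner(key))  // node A's record is gone

	// Create-only: node B's attempt fails and node A keeps its lease.
	s2 := NewStore()
	_ = s2.Create(key, "192.168.0.11")
	err := s2.Create(key, "192.168.0.12")
	fmt.Println(err, s2.Owner(key))
}
```

With the unconditional write, both nodes go on believing they own the subnet, which is the failure this PR describes. With the create-only write, the second node learns that the subnet is taken and can request a different one.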
Release Note
None required