
Fix overwriting etcd data when local subnet file exists #1505

Merged
2 commits merged into flannel-io:master on Mar 8, 2022

Conversation

zhangzhangzf
Contributor

Description

Fixes #1289
Remove the code that loads the local subnet file and then updates etcd. That behavior overwrites records for the same key in etcd.
The problem is described in issue #1289: in that case the flannel processes of two nodes end up running on the same subnet, which makes the node network unavailable.
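For context, here is a minimal sketch of the pattern this PR removes. It is not flannel's actual code; the names (readPrevSubnet, registry, PutLease, reclaimPrevSubnet) are hypothetical stand-ins. The point is that the previous subnet read from subnet.env is written back to etcd without checking whether another node now owns that subnet:

```go
// Hypothetical sketch (not flannel's code) of the removed behaviour:
// trust a locally persisted subnet.env and write its subnet back to etcd
// without checking who currently owns that subnet.
package sketch

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// readPrevSubnet parses FLANNEL_SUBNET from a subnet.env-style file.
func readPrevSubnet(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		line := s.Text()
		if strings.HasPrefix(line, "FLANNEL_SUBNET=") {
			return strings.TrimPrefix(line, "FLANNEL_SUBNET="), nil
		}
	}
	return "", fmt.Errorf("FLANNEL_SUBNET not found in %s", path)
}

// registry is a stand-in for the etcd-backed subnet registry.
type registry interface {
	// PutLease writes subnet -> ownerIP, overwriting any existing owner.
	PutLease(subnet, ownerIP string) error
}

// reclaimPrevSubnet is the risky pattern this PR removes: it never checks
// whether another node has since acquired the lease for this subnet, so a
// stale subnet.env can overwrite a live node's record for the same key.
func reclaimPrevSubnet(r registry, path, myIP string) error {
	subnet, err := readPrevSubnet(path)
	if err != nil {
		return err
	}
	return r.PutLease(subnet, myIP) // blindly overwrites the current owner
}
```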

Release Note

None required

@zhangzhangzf changed the title from "Fix overwriting etcd database data when local subnet file exists" to "Fix overwriting etcd data when local subnet file exists" on Nov 19, 2021
@manuelbuil
Collaborator

#1289

@manuelbuil
Collaborator

If host_A goes offline and, when it comes back online, gets a different subnet, what happens to the pods running on that host? Will they still be using the old subnet?

@zhangzhangzf
Contributor Author

I persist the subnet information to the /opt/software/flannel/subnet.env file. I expected the subnet to be reassigned when the node restarts, but flannel loads the local file instead. When the updateSubnet method is called, etcd is not locked, so the data under the same key gets overwritten. You can reproduce it quickly this way.
Let's simulate the scenario where nodeA's lease expires and nodeB joins the cluster:
(1) Stop the flannel daemons on nodeA and nodeB and delete their records from etcd.
(2) Start flannel on nodeA.
(3) Delete nodeA's data in etcd (you will see the process exit because of the watch mechanism).
(4) Configure nodeB's subnet information to match nodeA's.
(5) Start flannel on nodeB.
(6) Start flannel on nodeA.
You will see two flannel instances working in the same network segment (one way to check the etcd state is sketched below).
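As a rough way to verify the final state, here is a small Go sketch that lists the subnet leases with the etcd v3 client. The endpoint and the /coreos.com/network prefix are assumptions (the prefix is configurable, and flannel versions that still register leases through the etcd v2 API store keys a v3 client will not see), so treat this only as a starting point:

```go
// Rough verification sketch: list flannel's subnet leases from etcd using the
// v3 client. Endpoint and key prefix are assumptions; adjust for your cluster.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // assumed local etcd
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Assumed default prefix; flannel stores one key per leased subnet here.
	resp, err := cli.Get(ctx, "/coreos.com/network/subnets/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
	}
	// In the buggy scenario, after step (6) there is a single lease key for the
	// shared subnet even though two flannel daemons are actively using it.
}
```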

@manuelbuil
Collaborator

@sjoerdsimons could you review this PR? I am not familiar with etcd as a subnet manager.

@sjoerdsimons
Contributor

This looks correct to me; to be honest I can't really work out what the previous logic was trying to do. I don't see why it would ever update a subnet that's no longer correlated with the node's IP (which should imply someone else is using it). So dropping that whole chunk seems fine.

@manuelbuil
Collaborator

This looks correct to me; to be honest I can't really work out what the previous logic was trying to do. I don't see why it would ever update a subnet that's no longer correlated with the node's IP (which should imply someone else is using it). So dropping that whole chunk seems fine.

Thanks a lot!

@luthermonson
Contributor

@sjoerdsimons @zhangzhangzf can we wrap any testing around this? Our concern is that it has been this way for so long and we're ripping out code which people could be relying on. If this is backwards compatible with no issues, we can merge.

@sjoerdsimons
Contributor

@luthermonson That's one for @zhangzhangzf; I was just asked for (and gave) my opinion on it ;)

It is a bit of a behavioural change for sure; how it will impact users I don't really know. I hope someone from the flannel maintainers can answer that.

After mulling it over for a few days, I suspect what this code was meant to do is allow nodes that change their IP address to recover their previous subnet; however, how it does this is fundamentally racy (as it's only the IPv4 address that "identifies" the owner).

Basically, what the now-removed code handles correctly is:

  • Node A gets a fresh lease
  • Node A shuts down flannel and gets a different IP address (e.g. it rebooted?)
  • Node A rejoins the flannel mesh and the lease has not expired yet
    • In this case it will find its previous lease for the subnet and take ownership again, which means everything is happy

Where it goes wrong is:

  • Node A gets a fresh lease (LA)
  • Node A shuts down flannel (e.g. it is temporarily shut down)
  • LA expires (which can happen after "just" an hour if the default renew margin is used)
  • Node B starts and gets a fresh lease (LB) with the same subnet as the now-gone lease LA
  • Node A starts again and tries to recover its previous subnet
    • In this case it will find LB, which is owned by node B, and hijack it. At that point there are two active nodes with the same subnet, which of course goes badly wrong

The latter case is not that likely to happen in practice, as there is some randomisation involved in lease allocation (it picks a random subnet out of the first 100 free subnets), at least assuming people don't do silly things like cloning machines including a persistently stored subnet.env file. But if it happens it's obviously quite bad, which is what this patch fixes.

The code that's left after this patch, with respect to previous-subnet handling, will only opportunistically re-use the subnet from subnet.env if there is no current lease tied to the node's IP address and there is no current lease for the previous subnet, which is race-free.
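To make that concrete, here is a hypothetical sketch of that race-free decision. It is not flannel's actual implementation; Lease, chooseSubnet, and allocate are illustrative names. The previous subnet is reused only when neither the node's IP nor the subnet itself appears in an existing lease:

```go
package sketch

// Lease is a simplified view of one record in the etcd subnet registry.
type Lease struct {
	Subnet  string
	OwnerIP string
}

// chooseSubnet decides which subnet a (re)joining node should use.
// leases is the current set read from etcd; prevSubnet comes from the locally
// persisted subnet.env and may be empty or stale; allocate picks a fresh one.
func chooseSubnet(leases []Lease, myIP, prevSubnet string, allocate func() string) string {
	// 1) If a lease is already tied to this node's IP, take ownership of it again.
	for _, l := range leases {
		if l.OwnerIP == myIP {
			return l.Subnet
		}
	}
	// 2) Otherwise, reuse the previous subnet only if nobody else holds it now.
	if prevSubnet != "" {
		taken := false
		for _, l := range leases {
			if l.Subnet == prevSubnet {
				taken = true // another node owns our old subnet: do not hijack it
				break
			}
		}
		if !taken {
			return prevSubnet
		}
	}
	// 3) Fall back to allocating a fresh subnet.
	return allocate()
}
```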

So I guess the question is whether deployments rely on flannel keeping/recovering the same subnet after changing the host IP address, which is not something I can really answer.

@zhangzhangzf
Contributor Author

@sjoerdsimons You're right, this patch fixes the latter problem. But I don't agree that this problem happens infrequently. The 100 subnets in the pool are taken from smallest to largest and one is picked randomly, so repeats are easy to hit in large-scale clusters, especially when the cluster is not stable enough. Repetition leads to bad problems; we have hit this many times in our deployment environment.
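As a back-of-the-envelope illustration of why repeats are plausible (under strong simplifying assumptions: each joining node picks uniformly from the first 100 free subnets, and the expired subnet stays inside that window until someone takes it), the chance that at least one of k nodes lands on a specific expired subnet is roughly 1 - 0.99^k:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Chance that at least one of k rejoining nodes picks a specific expired
	// subnet, assuming each pick is uniform over the first 100 free subnets.
	for _, k := range []int{1, 10, 50, 100} {
		p := 1 - math.Pow(0.99, float64(k))
		fmt.Printf("k=%3d nodes -> collision chance ~ %.0f%%\n", k, 100*p)
	}
}
```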

@zhangzhangzf
Contributor Author

@sjoerdsimons @zhangzhangzf can we wrap any testing around this? Our concern is that it has been this way for so long and we're ripping out code which people could be relying on. If this is backwards compatible with no issues, we can merge.

@luthermonson I didn't find any compatibility problem. The maintainers can discuss it as a next step. I have deployed the fixed code in a large-scale cluster, and it solves the latter problem described by sjoerdsimons.

@AleksandrNull
Contributor

Long answer:
I'm trying to wrap my head around the problem. One side of it: a node keeps information about state that gets propagated back to the cluster even if the node was physically deleted from etcd. That's bad in general. Once a node was deleted/re-created, it shouldn't be able to force its subnet/segment to be its own again, or update any data in etcd based on leftovers. From a security perspective that's not good behavior; from a user perspective it may currently be expected, so that needs to be considered.
Regarding the problems in large clusters: I believe it's an exaggeration, and let me explain why. I can understand nodes going down and coming back, but k8s is built in a way that you don't have to remove them on each and every event, unless you build a 'reliable' cluster on spot instances. So I basically see one issue here: if a node was deleted from the cluster, it actually needs to be wiped. The same is true for the certs/envs/etc. of k8s in general.

Short answer:
I believe losing node data regarding subnets/IPs should not impact users in general, because the idea is about scalability and cattle. So no one should statically refer to pod IPs. It will change behavior, but for the better :) So I'm upvoting the change and have only one suggestion: maybe it's worth merging the two IFs :)
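Purely as an illustration of the "merge the two IFs" suggestion (the actual conditions in the PR may differ; these helpers are hypothetical), two nested checks on the previous subnet can be collapsed into one guard:

```go
// Illustrative only: collapsing two nested checks into a single condition.
package sketch

func subnetInUse(leases map[string]string, subnet string) bool {
	_, ok := leases[subnet]
	return ok
}

// before: nested ifs
func reuseNested(leases map[string]string, prevSubnet string) (string, bool) {
	if prevSubnet != "" {
		if !subnetInUse(leases, prevSubnet) {
			return prevSubnet, true
		}
	}
	return "", false
}

// after: one merged condition, same behaviour
func reuseMerged(leases map[string]string, prevSubnet string) (string, bool) {
	if prevSubnet != "" && !subnetInUse(leases, prevSubnet) {
		return prevSubnet, true
	}
	return "", false
}
```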

@manuelbuil
Collaborator

Long answer: I'm trying to wrap my head around the problem. One side of it: a node keeps information about state that gets propagated back to the cluster even if the node was physically deleted from etcd. That's bad in general. Once a node was deleted/re-created, it shouldn't be able to force its subnet/segment to be its own again, or update any data in etcd based on leftovers. From a security perspective that's not good behavior; from a user perspective it may currently be expected, so that needs to be considered. Regarding the problems in large clusters: I believe it's an exaggeration, and let me explain why. I can understand nodes going down and coming back, but k8s is built in a way that you don't have to remove them on each and every event, unless you build a 'reliable' cluster on spot instances. So I basically see one issue here: if a node was deleted from the cluster, it actually needs to be wiped. The same is true for the certs/envs/etc. of k8s in general.

Short answer: I believe losing node data regarding subnets/IPs should not impact users in general, because the idea is about scalability and cattle. So no one should statically refer to pod IPs. It will change behavior, but for the better :) So I'm upvoting the change and have only one suggestion: maybe it's worth merging the two IFs :)

Hey @AleksandrNull, thanks for the comment. If you upvote the PR, would you mind approving it? Then we could merge

@zhangzhangzf could you please rebase? I added a linter

@AleksandrNull
Contributor

@AleksandrNull left a comment

LGTM.

@manuelbuil
Collaborator

manuelbuil commented Feb 25, 2022

@zhangzhangzf could you please rebase? Then we will merge

@manuelbuil
Collaborator

@zhangzhangzf we will release the new version of flannel without this PR, but we can add it in a subsequent patch release once you rebase. Thanks

@zhangzhangzf
Contributor Author

zhangzhangzf commented Mar 6, 2022

workflows awaiting approval

Sorry, I'm fairly new to GitHub, so I'm not sure whether the steps I just took solved the problem.

@manuelbuil
Collaborator

workflows awaiting approval

Sorry, I'm fairly new to GitHub, so I'm not sure whether the steps I just took solved the problem.

No worries! I think they did :)

@manuelbuil merged commit 7d379f0 into flannel-io:master on Mar 8, 2022