Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed a route conflict bug. #1546

Merged
merged 10 commits into from
May 4, 2022

Conversation

zhangzhangzf
Copy link
Contributor

@zhangzhangzf zhangzhangzf commented Mar 9, 2022

Description

Fixed a routing conflict bug.

Recurrence path:
(1) Cetos7, Flannel work in vxlan mode.
(2) Select two normal nodes from the cluster as the target nodes.(nodeA and nodeB)
(3) Flannel daemon remove in nodeA and nodeB.
(4) Input "systemctl stop flanneld" in NodeB. (Simulate nodeB down.)
(5) Delete the network segment corresponding to the nodeA in etcd.(nodeA flannel will exit).
(6) Copy nodeA subnet.env to NodeB. (Used to simulate nodeB obtain the subnets used by node).
(7) Start flanneld in nodeB.
(8) Execute "ip r s" in nodeB.
Through the above operations, you will see that the route to the local docker0 is conflict with previous.

Reason:
When nodeB down, it cannot get nodeA subnets is deleted by etcd, so it's route is not update. When nodeB flannel starting, nodeB pick the subnet who used by nodeA (the subnets is not in etcd now). The previous route has no chance to be updated.

Resolvent:
When a node pick a network. We should ensure that this subnet has no conflicts in the routing table.

@zhangzhangzf zhangzhangzf changed the title Fixed a routing conflict bug. Fixed a route conflict bug. Mar 9, 2022
Copy link
Collaborator

@manuelbuil manuelbuil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain what your algorithm is doing please? Thanks

@@ -101,3 +101,15 @@ done
if [ "$combined_opts" = true ]; then
echo "${combined_opts_key}=\"${docker_opts}\"" >>$docker_env
fi

if [ "x${FLANNEL_SUBNET}" != "x" ];then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is x? If yo want to check if it exists, what not using if [ -n "$FLANNEL_SUBNET" ]; then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is x? If yo want to check if it exists, what not using if [ -n "$FLANNEL_SUBNET" ]; then?

Yes, -n is right.

if [ "x${FLANNEL_SUBNET}" != "x" ];then
dot_four=`echo ${FLANNEL_SUBNET}|cut -d. -f4|cut -d/ -f1`
dot_four=$((${dot_four}-1))
subnets_pre=`echo ${subnets}|cut -d. -f 1,2,3`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subnets does not exist, or?

Copy link
Contributor Author

@zhangzhangzf zhangzhangzf Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subnets does not exist, or?

This is a way for me to judge whether the string is empty. Please ignore this bad expression

@zhangzhangzf
Copy link
Contributor Author

zhangzhangzf commented Mar 18, 2022

This is a very special situation. I have explained the recurrence path in PR.

Node pick a subnets named 192.168.6.15/27. And there is such a route (192.168.6.14/27 via 31.24.28.1 dev eth0 ) in the routing table before flannel is started. When flannel and docker up. There append a route (192.168.6.14/27 dev docker0 proto kernel scope link ..). But the last route(192.168.6.14/27 via 31.24.28.1 dev eth0) was not deleted.

This code means to clean up the route before completing the real startup of flannel

@manuelbuil
Copy link
Collaborator

This is a very special situation. I have explained the recurrence path in PR.

Node pick a subnets named 192.168.6.15/27. And there is such a route (192.168.6.14/27 via 31.24.28.1 dev eth0 ) in the routing table before flannel is started. When flannel and docker up. There append a route (192.168.6.14/27 dev docker0 proto kernel scope link ..). But the last route(192.168.6.14/27 via 31.24.28.1 dev eth0) was not deleted.

This code means to clean up the route before completing the real startup of flannel

In my opinion, flannel does not need to delete that subnet. The user needs to be careful that there are no active routes which could conflict with the flannel subnets. What if that route is important for the user and flannel deletes it?

@zhangzhangzf
Copy link
Contributor Author

This is a very special situation. I have explained the recurrence path in PR.
Node pick a subnets named 192.168.6.15/27. And there is such a route (192.168.6.14/27 via 31.24.28.1 dev eth0 ) in the routing table before flannel is started. When flannel and docker up. There append a route (192.168.6.14/27 dev docker0 proto kernel scope link ..). But the last route(192.168.6.14/27 via 31.24.28.1 dev eth0) was not deleted.
This code means to clean up the route before completing the real startup of flannel

In my opinion, flannel does not need to delete that subnet. The user needs to be careful that there are no active routes which could conflict with the flannel subnets. What if that route is important for the user and flannel deletes it?

I agree that deleting a user's routing table is rude. However, in my reproduction path, this routing information is written by flannel. If the subnets to be occupied by the flannel already exists in the local routing table, we should remind the user of this failure in another way. For example, flannel failed to start. Instead of normal process and abnormal service.

@manuelbuil
Copy link
Collaborator

This is a very special situation. I have explained the recurrence path in PR.
Node pick a subnets named 192.168.6.15/27. And there is such a route (192.168.6.14/27 via 31.24.28.1 dev eth0 ) in the routing table before flannel is started. When flannel and docker up. There append a route (192.168.6.14/27 dev docker0 proto kernel scope link ..). But the last route(192.168.6.14/27 via 31.24.28.1 dev eth0) was not deleted.
This code means to clean up the route before completing the real startup of flannel

In my opinion, flannel does not need to delete that subnet. The user needs to be careful that there are no active routes which could conflict with the flannel subnets. What if that route is important for the user and flannel deletes it?

I agree that deleting a user's routing table is rude. However, in my reproduction path, this routing information is written by flannel. If the subnets to be occupied by the flannel already exists in the local routing table, we should remind the user of this failure in another way. For example, flannel failed to start. Instead of normal process and abnormal service.

I like that approach of flannel logging the problem or failing if it finds that

When the route table already contains the subnet to be allocated, exit and prompt the user to handle the error
@manuelbuil manuelbuil merged commit 74b5e50 into flannel-io:master May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants