Subnet conflict in different worker nodes #1289

Closed
xiangyanw opened this issue May 5, 2020 · 2 comments
@xiangyanw

In one of our production Kubernetes clusters, two worker nodes were competing for the same subnet, which caused connectivity issues on one of the worker nodes.

What happened:

  1. Host A joined the cluster, obtained a subnet, and saved its config in /run/flannel/subnet.env;
  2. Host A went offline (no lease renewals were done after that);
  3. After the TTL expired, the subnet lease was removed from etcd;
  4. Host B joined the cluster and obtained the same subnet that had been assigned to Host A;
  5. Host A came online again, reused its previous subnet, and renewed its lease;
  6. Now two different hosts compete for the same subnet: whichever host renews the lease updates the subnet's public IP in etcd, which causes network issues on the other host.

Note: Hosts A and B have different public IPs.
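
The sequence above can be replayed with a small in-memory model. This is only a sketch: the `lease` and `registry` types, the subnet, and the IP addresses are made up for illustration and are not flannel's real types or etcd layout. It shows how an unconditional renew lets the last writer take ownership of the subnet.

```go
package main

import "fmt"

// lease is a hypothetical stand-in for a flannel subnet lease:
// the public IP of the owning host plus a remaining lifetime.
type lease struct {
	publicIP string
	ttl      int // remaining lifetime, in hours
}

// registry is a hypothetical stand-in for the lease store in etcd,
// keyed by subnet.
type registry map[string]*lease

// acquire hands out the subnet only if no live lease exists for it.
func (r registry) acquire(subnet, publicIP string, ttl int) bool {
	if _, taken := r[subnet]; taken {
		return false
	}
	r[subnet] = &lease{publicIP: publicIP, ttl: ttl}
	return true
}

// renew models the behavior reported in this issue: it overwrites the
// lease without checking which host currently holds it.
func (r registry) renew(subnet, publicIP string, ttl int) {
	r[subnet] = &lease{publicIP: publicIP, ttl: ttl}
}

// tick ages every lease by n hours and drops the expired ones.
func (r registry) tick(n int) {
	for subnet, l := range r {
		l.ttl -= n
		if l.ttl <= 0 {
			delete(r, subnet)
		}
	}
}

func main() {
	const subnet = "10.244.3.0/24" // example value
	const hostA, hostB = "192.0.2.10", "192.0.2.20"

	r := registry{}
	r.acquire(subnet, hostA, 24) // 1. Host A joins
	r.tick(24)                   // 2-3. Host A is offline, lease expires
	r.acquire(subnet, hostB, 24) // 4. Host B obtains the same subnet
	r.renew(subnet, hostA, 24)   // 5. Host A returns and renews
	fmt.Println("owner after A renews:", r[subnet].publicIP)
	r.renew(subnet, hostB, 24) // 6. Host B renews and takes it back
	fmt.Println("owner after B renews:", r[subnet].publicIP)
}
```

Each renew flips the recorded public IP, which matches the symptom that only one of the two hosts has working connectivity at any time.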

Expected Behavior

If a host goes offline and then comes online again, it should not reuse its previous subnet if that subnet has already been assigned to another host.

Current Behavior

If a host goes offline and then comes online again, it reuses its previous subnet and renews the lease, regardless of whether another host is already using that subnet.

Possible Solution

In tryAcquireLease (flannel/subnet/etcdv2/local_manager.go), validate the lease's public IP before reusing the previous subnet.

    // no existing match, check if there was a previous subnet to use
    var sn ip.IP4Net
    if !m.previousSubnet.Empty() {
        // use previous subnet
        if l := findLeaseBySubnet(leases, m.previousSubnet); l != nil {
            // Make sure the existing subnet is still within the configured network
            if isSubnetConfigCompat(config, l.Subnet) {
                log.Infof("Found lease (%v) matching previously leased subnet, reusing", l.Subnet)
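
A minimal sketch of the proposed check, written as a standalone program rather than a patch to flannel. The `Lease` struct and the `canReusePrevious` helper are hypothetical and much simpler than flannel's own types. The idea is to reuse the previous subnet only when the live lease for it records this host's public IP. Treating a subnet with no live lease as free to reclaim is an assumption of this sketch.

```go
package main

import "fmt"

// Lease is a hypothetical, simplified lease record; it is not
// flannel's real lease type.
type Lease struct {
	Subnet   string
	PublicIP string
}

// canReusePrevious reports whether a host with public IP myPublicIP may
// reuse previousSubnet given the leases currently stored.
func canReusePrevious(leases []Lease, previousSubnet, myPublicIP string) bool {
	for _, l := range leases {
		if l.Subnet == previousSubnet {
			// A live lease exists: reuse it only if this host owns it.
			return l.PublicIP == myPublicIP
		}
	}
	// No live lease for the subnet: assumed free to reclaim.
	return true
}

func main() {
	// Example values only: Host B (192.0.2.20) now holds the subnet.
	leases := []Lease{{Subnet: "10.244.3.0/24", PublicIP: "192.0.2.20"}}

	fmt.Println(canReusePrevious(leases, "10.244.3.0/24", "192.0.2.10")) // Host A returning
	fmt.Println(canReusePrevious(leases, "10.244.3.0/24", "192.0.2.20")) // Host B restarting
	fmt.Println(canReusePrevious(leases, "10.244.7.0/24", "192.0.2.10")) // subnet not leased
}
```

With a check like this, a returning Host A would fall through to allocating a new subnet instead of renewing the lease that Host B now holds.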

Steps to Reproduce (for bugs)

  1. Add a worker (Host A) to the Kubernetes cluster and confirm that the subnet info is saved in /run/flannel/subnet.env;
  2. Stop the flanneld service on Host A;
  3. After the TTL expires (24 hours), confirm that the subnet lease has been removed from etcd;
  4. Add another host (Host B) to the cluster and verify that the same subnet is assigned to it (this does not always happen and may take several attempts);
  5. Restart the flanneld service on Host A;
  6. Host A now seizes the subnet, and both Host A and Host B are using the same subnet.

Context

When two hosts use the same subnet, only one of them works. The other node loses its connection to the rest of the cluster, which leads to a business outage.

Your Environment

  • Flannel version: 0.9.1
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.1.10
  • Kubernetes version (if used): 1.11.5
  • Operating System and version: CentOS 7.4
  • Link to your project (optional):
@rainbow211

Recently, we have encountered this problem many times in our production environment. #1289

Your Environment

Flannel version: 0.12.0
Etcd version: 3.4.12
Kubernetes version: 1.18.4
Operating System and version: CentOS 7.4

@stale

stale bot commented Jan 25, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 25, 2023
@stale stale bot closed this as completed Feb 15, 2023