✨ AWS CNI Mode (#461)

* implements #298 Support for target-type IP

This introduces a new operational mode that registers the **Ingress pods directly as targets in the target groups**, instead of the current mode where the Ingress pods are reached through a **HostPort**.

The core idea is that in a standard AWS EKS cluster running the AWS VPC CNI, pods are first-class members of the VPC: their IPs are directly reachable from the ALB/NLB target groups just like the node IPs, so there is no need for the HostPort indirection.

There are several drivers for, and advantages of, accessing the pod directly instead of via a HostPort:

ref: https://kubernetes.io/docs/concepts/configuration/overview/#services

The biggest operational trouble has been that the update of node members in target groups is slower than the physical replacement of the nodes, which ends in a black hole of **no Ingress being available** for a time. We regularly face downtime, especially when spot interruptions or ASG node rollings happen, because the ALB/NLB takes up to 2 minutes to reflect the group change. For smaller clusters this leads to no Skipper instance being registered and hence no target available to forward traffic to.
With this new mode the registration happens independently of ASGs and almost instantly: from scheduling a pod to it serving ALB traffic takes less than 10 seconds!

With HostPort there is an inherent dependency on available nodes for scaling the Ingress.
Moreover, the Ingress pod cannot be replaced in place; it requires a termination first and then a rescheduling, and for a certain time, which can be more than a minute, that node is offline as an Ingress.
With this mode, host networking and HostPort become obsolete, which allows node-independent scaling of Skipper pods! Skipper becomes a regular deployment whose ReplicaSet can be sized independently of the cluster, which simplifies operations especially for smaller clusters. The custom HPA metric attached to the node group size, which we used to counteract this deployment/daemonset hybrid, is now obsolete!

The core mechanism is event-based registration with Kubernetes using a pod `Informer`, which receives immediate notifications about pod changes and thus allows near-zero-delay updates of the target groups.
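
For illustration, the event wiring looks roughly like the following minimal sketch (hypothetical, simplified code — names and log output are placeholders, not the controller's actual implementation; a real setup would also filter to the ingress pods via a label selector):

```go
package main

import (
	"log"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The shared informer keeps a local cache of pods and additionally
	// resyncs periodically, so missed events are eventually corrected.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			pod := newObj.(*v1.Pod)
			switch {
			case pod.DeletionTimestamp != nil:
				// Terminating: deregister before the pod is gone.
				log.Printf("deregister %s", pod.Status.PodIP)
			case pod.Status.PodIP != "":
				// The VPC CNI assigned an IP: register right away and let
				// the target group health check decide when to send traffic.
				log.Printf("register %s", pod.Status.PodIP)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// Deletes may arrive as DeletedFinalStateUnknown, hence the check.
			if pod, ok := obj.(*v1.Pod); ok && pod.Status.PodIP != "" {
				log.Printf("deregister %s", pod.Status.PodIP)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // block; a real controller would handle shutdown signals
}
```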

Registration happens as soon as the pod has received an IP from the AWS VPC. The ALB/NLB health check therefore starts monitoring the pod already while it is starting up, so it can serve at the earliest possible moment. Lab tests show pods serving ALB traffic well under 10s after scheduling!

Deregistration is bound to the Kubernetes deletion event. That means the LB is now in sync with the cluster and stops sending traffic before the pod is actually terminated. This implements safe deregistration without traffic loss: lab tests measured no packet loss even under aggressive deployment scaling!

Since the IP-based target groups are now managed by this controller, their targets represent pods: all of them show up healthy, and anything else is cleaned up by the controller.
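
Conceptually, the per-target-group reconciliation boils down to two set differences between the desired pod IPs and the currently registered targets. A condensed sketch, using the `difference` helper added in `aws/adapter.go` (the full version is `SetTargetsOnCNITargetGroups` in the diff below; error handling and logging trimmed):

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/elbv2"
	"github.com/aws/aws-sdk-go/service/elbv2/elbv2iface"
)

// reconcile brings one `ip` target group in line with the desired pod IPs.
func reconcile(svc elbv2iface.ELBV2API, tgARN string, podIPs []string) error {
	health, err := svc.DescribeTargetHealth(&elbv2.DescribeTargetHealthInput{
		TargetGroupArn: aws.String(tgARN),
	})
	if err != nil {
		return err
	}
	registered := make([]string, 0, len(health.TargetHealthDescriptions))
	for _, d := range health.TargetHealthDescriptions {
		registered = append(registered, aws.StringValue(d.Target.Id))
	}

	toTargets := func(ips []string) []*elbv2.TargetDescription {
		ts := make([]*elbv2.TargetDescription, 0, len(ips))
		for _, ip := range ips {
			ts = append(ts, &elbv2.TargetDescription{Id: aws.String(ip)})
		}
		return ts
	}

	// Pods that exist in the cluster but are not registered yet.
	if missing := difference(podIPs, registered); len(missing) > 0 {
		if _, err := svc.RegisterTargets(&elbv2.RegisterTargetsInput{
			TargetGroupArn: aws.String(tgARN),
			Targets:        toTargets(missing),
		}); err != nil {
			return err
		}
	}
	// Targets without a matching pod: dead, killed, or added manually.
	if stale := difference(registered, podIPs); len(stale) > 0 {
		if _, err := svc.DeregisterTargets(&elbv2.DeregisterTargetsInput{
			TargetGroupArn: aws.String(tgARN),
			Targets:        toTargets(stale),
		}); err != nil {
			return err
		}
	}
	return nil
}
```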

* client-go `Informer`: these high-level functions provide convenient access to Kubernetes event registration. Since event registration is the key to fast reaction, and is efficient compared to high-rate polling, using these standard factory methods stands to reason (see the sketch above).

* successful transition of a target group from type `instance` to type `ip` and vice versa
* the controller registers pods that are discovered
* the controller deregisters pods that are in "terminating" status
* the controller recovers the desired state by "resyncing" after manual intervention on the target group
* it removes targets whose pods were killed or have died

| access mode | HostNetwork | HostPort | Result |                      Notes                       |
| :---------: | :---------: | :------: | :----: | :----------------------------------------------: |
| `Instance`  |   `true`    |  `true`  |  `ok`  |                    status quo                    |
|    `IP`     |   `true`    |  `true`  |  `ok`  |       PodIP == HostIP --> limited benefits       |
|    `IP`     |   `false`   |  `true`  |  `ok`  | PodIP != HostIP --> limited scaling, node bound  |
|    `IP`     |   `true`    | `false`  | `N/A`  |            impossible, HN implies HP             |
|    `IP`     |   `false`   | `false`  |  `ok`  |        goal achieved: free scaling and HA        |

Example logs:
```
time="2021-12-17T15:36:36Z" level=info msg="Deleted Pod skipper-ingress-575548f66-k99vs IP:172.16.49.249"
time="2021-12-17T15:36:36Z" level=info msg="Deregistering CNI targets: [172.16.49.249]"
time="2021-12-17T15:36:37Z" level=info msg="New Pod skipper-ingress-575548f66-qff2q IP:172.16.60.193"
time="2021-12-17T15:36:40Z" level=info msg="Registering CNI targets: [172.16.60.193]"
```

* extended the service account with required RBAC permissions to watch/list pods
* added example of Skipper without a HostPort and HostNetwork
Signed-off-by: Samuel Lang <[email protected]>

* fixing golangci-lint timeouts

```
Run make lint
golangci-lint run ./...
level=error msg="Running error: context loading failed: failed to load packages: timed out to load packages: context deadline exceeded"
level=error msg="Timeout exceeded: try increasing it by passing --timeout option"
```

ref: golangci/golangci-lint#825

This seems to be an incompatibility between the client-go package and the module's Go version.
Updating to Go 1.16 and recreating the go.sum fixes the timeouts and slims down the dependency list.

Also updated the golangci-lint install script in the GitHub Actions workflow, as the old one is deprecated.

Signed-off-by: Samuel Lang <[email protected]>
universam1 authored Jan 31, 2022
1 parent 62c2568 commit 55ec0b6
Showing 24 changed files with 1,449 additions and 22 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -14,7 +14,7 @@ jobs:
- run: go get github.com/mattn/goveralls
env:
GO111MODULE: off
- run: curl -sfL https://install.goreleaser.com/github.com/golangci/golangci-lint.sh | sh -s -- -b $(go env GOPATH)/bin ${GOLANGCI_RELEASE}
- run: curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(go env GOPATH)/bin ${GOLANGCI_RELEASE}
env:
GOLANGCI_RELEASE: v1.32.1
- run: make test
2 changes: 1 addition & 1 deletion Makefile
@@ -24,7 +24,7 @@ test:

## lint: runs golangci-lint
lint:
golangci-lint run ./...
golangci-lint run --timeout 5m ./...

## fmt: formats all go files
fmt:
23 changes: 23 additions & 0 deletions README.md
@@ -31,6 +31,7 @@ This information is used to manage AWS resources for each ingress objects of the
- Can be used in clusters created by [Kops](https://github.com/kubernetes/kops), see our [deployment guide for Kops](deploy/kops.md)
- [Support Multiple TLS Certificates per ALB (SNI)](https://aws.amazon.com/blogs/aws/new-application-load-balancer-sni/).
- Support for AWS WAF and WAFv2
- Support for AWS CNI pod direct access

## Upgrade

@@ -600,6 +601,28 @@ By default, the controller will expose both HTTP and HTTPS ports on the load bal
The controller used to have only the `--health-check-port` flag available, and would use the same port as health check and the target port.
Those ports are now configured individually. If you relied on this behavior, please include the `--target-port` in your configuration.

## AWS CNI Mode (experimental)

In the default operation mode (`target-access-mode=HostPort`), the controller links the target groups to the autoscaling group. The target group type is `instance`, which requires the ingress pod to be accessible through `HostNetwork` and `HostPort`.

In *AWS CNI Mode* (`target-access-mode=AWSCNI`), the controller actively manages the target group members. Since pods of an AWS EKS cluster running the AWS VPC CNI are first-class members of the VPC, they can receive traffic directly and are managed through a target group of type `ip`, which removes the need for the HostPort indirection.

### Notes

- For security reasons, the `HostPort` requirement of the default mode might be a concern
- Direct management of the target group members is significantly faster than the ASG-linked mode, but it requires a running controller for updates. As of now, the controller is not prepared for a replicated high-availability setup.
- Registration and deregistration are synced with the pod lifecycle, hence a pod in the terminating phase is deregistered from the target group before it shuts down.
- Ingress pods are not bound to nodes in CNI mode and the deployment can scale independently.

### Configuration options

| access mode | HostNetwork | HostPort | Notes |
| :---------: | :---------: | :------: | :---------------------------------------------: |
| `HostPort` | `true` | `true` | default setup |
| `AWSCNI` | `true` | `true` | PodIP == HostIP: limited scaling and host bound |
| `AWSCNI` | `false` | `true` | PodIP != HostIP: limited scaling and host bound |
| `AWSCNI` | `false` | `false` | free scaling, pod VPC CNI IP used |

## Trying it out

The Ingress Controller's responsibility is limited to managing load balancers, as described above. To have a fully
87 changes: 85 additions & 2 deletions aws/adapter.go
@@ -74,6 +74,12 @@ type Adapter struct {
denyInternalRespBody string
denyInternalRespContentType string
denyInternalRespStatusCode int
TargetCNI *TargetCNIconfig
}

type TargetCNIconfig struct {
Enabled bool
TargetGroupCh chan []string
}

type manifest struct {
@@ -129,6 +135,8 @@ const (
LoadBalancerTypeNetwork = "network"
IPAddressTypeIPV4 = "ipv4"
IPAddressTypeDualstack = "dualstack"
TargetAccessModeAWSCNI = "AWSCNI"
TargetAccessModeHostPort = "HostPort"
)

var (
@@ -225,6 +233,10 @@ func NewAdapter(clusterID, newControllerID, vpcID string, debug, disableInstrume
nlbCrossZone: DefaultNLBCrossZone,
nlbHTTPEnabled: DefaultNLBHTTPEnabled,
customFilter: DefaultCustomFilter,
TargetCNI: &TargetCNIconfig{
Enabled: false,
TargetGroupCh: make(chan []string, 10),
},
}

adapter.manifest, err = buildManifest(adapter, clusterID, vpcID)
@@ -432,6 +444,12 @@ func (a *Adapter) WithInternalDomains(domains []string) *Adapter {
return a
}

// WithTargetAccessMode returns the receiver adapter after defining the target access mode
func (a *Adapter) WithTargetAccessMode(t string) *Adapter {
a.TargetCNI.Enabled = t == TargetAccessModeAWSCNI
return a
}

// WithDenyInternalDomains returns the receiver adapter after setting
// the denyInternalDomains config.
func (a *Adapter) WithDenyInternalDomains(deny bool) *Adapter {
@@ -556,13 +574,26 @@ func (a *Adapter) FindManagedStacks() ([]*Stack, error) {
// config to have relevant Target Groups and registers/deregisters single
// instances (that do not belong to ASG) in relevant Target Groups.
func (a *Adapter) UpdateTargetGroupsAndAutoScalingGroups(stacks []*Stack, problems *problem.List) {
targetGroupARNs := make([]string, 0, len(stacks))
allTargetGroupARNs := make([]string, 0, len(stacks))
for _, stack := range stacks {
if len(stack.TargetGroupARNs) > 0 {
targetGroupARNs = append(targetGroupARNs, stack.TargetGroupARNs...)
allTargetGroupARNs = append(allTargetGroupARNs, stack.TargetGroupARNs...)
}
}
// split the full list into TG types
targetTypesARNs, err := categorizeTargetTypeInstance(a.elbv2, allTargetGroupARNs)
if err != nil {
problems.Add("failed to categorize Target Type Instance: %w", err)
return
}

// update the CNI TG list
if a.TargetCNI.Enabled {
a.TargetCNI.TargetGroupCh <- targetTypesARNs[elbv2.TargetTypeEnumIp]
}

// remove the IP TGs from the list keeping all other TGs including problematic #127 and nonexistent #436
targetGroupARNs := difference(allTargetGroupARNs, targetTypesARNs[elbv2.TargetTypeEnumIp])
// don't do anything if there are no target groups
if len(targetGroupARNs) == 0 {
return
@@ -665,6 +696,7 @@ func (a *Adapter) CreateStack(certificateARNs []string, scheme, securityGroup, o
http2: http2,
tags: a.stackTags,
internalDomains: a.internalDomains,
targetAccessModeCNI: a.TargetCNI.Enabled,
denyInternalDomains: a.denyInternalDomains,
denyInternalDomainsResponse: denyResp{
body: a.denyInternalRespBody,
@@ -720,6 +752,7 @@ func (a *Adapter) UpdateStack(stackName string, certificateARNs map[string]time.
http2: http2,
tags: a.stackTags,
internalDomains: a.internalDomains,
targetAccessModeCNI: a.TargetCNI.Enabled,
denyInternalDomains: a.denyInternalDomains,
denyInternalDomainsResponse: denyResp{
body: a.denyInternalRespBody,
@@ -1009,3 +1042,53 @@ func nonTargetedASGs(ownedASGs, targetedASGs map[string]*autoScalingGroupDetails

return nonTargetedASGs
}

// SetTargetsOnCNITargetGroups implements desired state for CNI target groups
// by polling the current list of targets thus creating a diff of what needs to be added and removed.
func (a *Adapter) SetTargetsOnCNITargetGroups(endpoints, cniTargetGroupARNs []string) error {
log.Debugf("setting targets on CNI target groups: '%v'", cniTargetGroupARNs)
for _, targetGroupARN := range cniTargetGroupARNs {
tgh, err := a.elbv2.DescribeTargetHealth(&elbv2.DescribeTargetHealthInput{TargetGroupArn: &targetGroupARN})
if err != nil {
log.Errorf("unable to describe target health %v", err)
// continue for processing of the rest of the target groups
continue
}
registeredInstances := make([]string, len(tgh.TargetHealthDescriptions))
for i, target := range tgh.TargetHealthDescriptions {
registeredInstances[i] = *target.Target.Id
}
toRegister := difference(endpoints, registeredInstances)
if len(toRegister) > 0 {
log.Info("Registering CNI targets: ", toRegister)
err := registerTargetsOnTargetGroups(a.elbv2, []string{targetGroupARN}, toRegister)
if err != nil {
return err
}
}
toDeregister := difference(registeredInstances, endpoints)
if len(toDeregister) > 0 {
log.Info("Deregistering CNI targets: ", toDeregister)
err := deregisterTargetsOnTargetGroups(a.elbv2, []string{targetGroupARN}, toDeregister)
if err != nil {
return err
}
}
}
return nil
}

// difference returns the elements in `a` that aren't in `b`.
func difference(a, b []string) []string {
mb := make(map[string]struct{}, len(b))
for _, x := range b {
mb[x] = struct{}{}
}
var diff []string
for _, x := range a {
if _, found := mb[x]; !found {
diff = append(diff, x)
}
}
return diff
}
75 changes: 75 additions & 0 deletions aws/adapter_test.go
@@ -947,3 +947,78 @@ func TestWithxlbHealthyThresholdCount(t *testing.T) {
require.Equal(t, uint(4), b.nlbHealthyThresholdCount)
})
}

func TestAdapter_SetTargetsOnCNITargetGroups(t *testing.T) {
tgARNs := []string{"asg1"}
thOut := elbv2.DescribeTargetHealthOutput{TargetHealthDescriptions: []*elbv2.TargetHealthDescription{}}
m := &mockElbv2Client{
outputs: elbv2MockOutputs{
describeTargetHealth: &apiResponse{response: &thOut, err: nil},
registerTargets: R(mockDTOutput(), nil),
deregisterTargets: R(mockDTOutput(), nil),
},
}
a := &Adapter{elbv2: m, TargetCNI: &TargetCNIconfig{}}

t.Run("adding a new endpoint", func(t *testing.T) {
require.NoError(t, a.SetTargetsOnCNITargetGroups([]string{"1.1.1.1"}, tgARNs))
require.Equal(t, []*elbv2.RegisterTargetsInput{{
TargetGroupArn: aws.String("asg1"),
Targets: []*elbv2.TargetDescription{{Id: aws.String("1.1.1.1")}},
}}, m.rtinputs)
require.Equal(t, []*elbv2.DeregisterTargetsInput(nil), m.dtinputs)
})

t.Run("two new endpoints, registers the new EPs only", func(t *testing.T) {
thOut = elbv2.DescribeTargetHealthOutput{TargetHealthDescriptions: []*elbv2.TargetHealthDescription{
{Target: &elbv2.TargetDescription{Id: aws.String("1.1.1.1")}}},
}
m.rtinputs, m.dtinputs = nil, nil

require.NoError(t, a.SetTargetsOnCNITargetGroups([]string{"1.1.1.1", "2.2.2.2", "3.3.3.3"}, tgARNs))
require.Equal(t, []*elbv2.TargetDescription{
{Id: aws.String("2.2.2.2")},
{Id: aws.String("3.3.3.3")},
}, m.rtinputs[0].Targets)
require.Equal(t, []*elbv2.DeregisterTargetsInput(nil), m.dtinputs)
})

t.Run("removing one endpoint, causing deregistration of it", func(t *testing.T) {
thOut = elbv2.DescribeTargetHealthOutput{TargetHealthDescriptions: []*elbv2.TargetHealthDescription{
{Target: &elbv2.TargetDescription{Id: aws.String("1.1.1.1")}},
{Target: &elbv2.TargetDescription{Id: aws.String("2.2.2.2")}},
{Target: &elbv2.TargetDescription{Id: aws.String("3.3.3.3")}},
}}
m.rtinputs, m.dtinputs = nil, nil

require.NoError(t, a.SetTargetsOnCNITargetGroups([]string{"1.1.1.1", "3.3.3.3"}, tgARNs))
require.Equal(t, []*elbv2.RegisterTargetsInput(nil), m.rtinputs)
require.Equal(t, []*elbv2.TargetDescription{{Id: aws.String("2.2.2.2")}}, m.dtinputs[0].Targets)
})

t.Run("restoring desired state after external manipulation, adding and removing one", func(t *testing.T) {
thOut = elbv2.DescribeTargetHealthOutput{TargetHealthDescriptions: []*elbv2.TargetHealthDescription{
{Target: &elbv2.TargetDescription{Id: aws.String("1.1.1.1")}},
{Target: &elbv2.TargetDescription{Id: aws.String("2.2.2.2")}},
{Target: &elbv2.TargetDescription{Id: aws.String("4.4.4.4")}},
}}
m.rtinputs, m.dtinputs = nil, nil

require.NoError(t, a.SetTargetsOnCNITargetGroups([]string{"1.1.1.1", "2.2.2.2", "3.3.3.3"}, tgARNs))
require.Equal(t, []*elbv2.TargetDescription{{Id: aws.String("3.3.3.3")}}, m.rtinputs[0].Targets)
require.Equal(t, []*elbv2.TargetDescription{{Id: aws.String("4.4.4.4")}}, m.dtinputs[0].Targets)
})
}

func TestWithTargetAccessMode(t *testing.T) {
t.Run("WithTargetAccessMode enables AWS CNI mode", func(t *testing.T) {
a := &Adapter{TargetCNI: &TargetCNIconfig{Enabled: false}}
a = a.WithTargetAccessMode("AWSCNI")
require.True(t, a.TargetCNI.Enabled)
})
t.Run("WithTargetAccessMode disables AWS CNI mode", func(t *testing.T) {
a := &Adapter{TargetCNI: &TargetCNIconfig{Enabled: true}}
a = a.WithTargetAccessMode("HostPort")
require.False(t, a.TargetCNI.Enabled)
})
}
19 changes: 19 additions & 0 deletions aws/asg.go
@@ -260,6 +260,25 @@ func describeTargetGroups(elbv2svc elbv2iface.ELBV2API) (map[string]struct{}, er
return targetGroups, err
}

// map the target group slice into specific types such as instance, ip, etc
func categorizeTargetTypeInstance(elbv2svc elbv2iface.ELBV2API, allTGARNs []string) (map[string][]string, error) {
targetTypes := make(map[string][]string)
err := elbv2svc.DescribeTargetGroupsPagesWithContext(context.TODO(), &elbv2.DescribeTargetGroupsInput{},
func(resp *elbv2.DescribeTargetGroupsOutput, lastPage bool) bool {
for _, tg := range resp.TargetGroups {
for _, v := range allTGARNs {
if v != aws.StringValue(tg.TargetGroupArn) {
continue
}
targetTypes[aws.StringValue(tg.TargetType)] = append(targetTypes[aws.StringValue(tg.TargetType)], aws.StringValue(tg.TargetGroupArn))
}
}
return true
})
log.Debugf("categorized target group arns: %#v", targetTypes)
return targetTypes, err
}

// tgHasTags returns true if the specified resource has the expected tags.
func tgHasTags(descs []*elbv2.TagDescription, arn string, tags map[string]string) bool {
for _, desc := range descs {
54 changes: 54 additions & 0 deletions aws/asg_test.go
@@ -263,6 +263,7 @@ func TestAttach(t *testing.T) {
TargetGroups: []*elbv2.TargetGroup{
{
TargetGroupArn: aws.String("foo"),
TargetType: aws.String(elbv2.TargetTypeEnumInstance),
},
},
}, nil),
@@ -301,6 +302,7 @@
TargetGroups: []*elbv2.TargetGroup{
{
TargetGroupArn: aws.String("foo"),
TargetType: aws.String(elbv2.TargetTypeEnumInstance),
},
},
}, nil),
@@ -466,9 +468,11 @@
TargetGroups: []*elbv2.TargetGroup{
{
TargetGroupArn: aws.String("foo"),
TargetType: aws.String(elbv2.TargetTypeEnumInstance),
},
{
TargetGroupArn: aws.String("bar"),
TargetType: aws.String(elbv2.TargetTypeEnumInstance),
},
},
}, nil),
@@ -674,3 +678,53 @@ func TestProcessChunked(t *testing.T) {
})
}
}

func Test_categorizeTargetTypeInstance(t *testing.T) {
for _, test := range []struct {
name string
targetGroups map[string][]string
}{
{
name: "one from any type",
targetGroups: map[string][]string{
elbv2.TargetTypeEnumInstance: {"instancy"},
elbv2.TargetTypeEnumAlb: {"albly"},
elbv2.TargetTypeEnumIp: {"ipvy"},
elbv2.TargetTypeEnumLambda: {"lambada"},
},
},
{
name: "one type many target groups",
targetGroups: map[string][]string{
elbv2.TargetTypeEnumInstance: {"instancy", "foo", "void", "bar", "blank"},
},
},
{
name: "several types many target groups",
targetGroups: map[string][]string{
elbv2.TargetTypeEnumInstance: {"instancy", "foo", "void", "bar", "blank"},
elbv2.TargetTypeEnumAlb: {"albly", "alblily"},
elbv2.TargetTypeEnumIp: {"ipvy"},
},
},
} {
t.Run(test.name, func(t *testing.T) {
tg := []string{}
tgResponse := []*elbv2.TargetGroup{}
for k, v := range test.targetGroups {
for _, i := range v {
tg = append(tg, i)
tgResponse = append(tgResponse, &elbv2.TargetGroup{TargetGroupArn: aws.String(i), TargetType: aws.String(k)})
}
}

mockElbv2Svc := &mockElbv2Client{outputs: elbv2MockOutputs{describeTargetGroups: R(&elbv2.DescribeTargetGroupsOutput{TargetGroups: tgResponse}, nil)}}
got, err := categorizeTargetTypeInstance(mockElbv2Svc, tg)
assert.NoError(t, err)
for k, v := range test.targetGroups {
assert.Len(t, got[k], len(v))
assert.Equal(t, got[k], v)
}
})
}
}
1 change: 1 addition & 0 deletions aws/cf.go
@@ -164,6 +164,7 @@ type stackSpec struct {
denyInternalDomainsResponse denyResp
internalDomains []string
tags map[string]string
targetAccessModeCNI bool
}

type healthCheck struct {