✨ AWS CNI Mode #461

universam1 · 2021-12-17T15:40:37Z

new operational mode of Direct Ingress Pod access

implements #298 Support for target-type IP

This adds a new operational mode of adding the Ingress Pods as Targets in the Target Groups directly, instead of the current operation mode where the Ingress pods are accessed through a HostPort.

The core idea is based on the fact that standard AWS EKS cluster running AWS VPC CNI have their pods as first class members in the VPCs. Their IPs are directly reachable from the ALB/NLB target groups like the nodes are too, which means there is no necessity for the HostPort indirection.

There are several drivers and advantages accessing the pod directly vs a HostPort:

HostNetwork and HostPort conflict with best practices

ref: https://kubernetes.io/docs/concepts/configuration/overview/#services

Delayed, eventual consistency of AutoscalingGroup to ALB/NLB target group updates

This has been the biggest trouble in operations, that the update of node members in target groups is slower than the nodes are physically replaced, which ends up in a black hole of no Ingresses available at a time. We are facing regularly downtimes especially when spot interruptions or ASG node rollings happen that the ALB/NLB takes up to 2 minutes to reflect the group change. For smaller clusters this leads to no Skipper instance being registered hence no target available to forward traffic to.
With this new mode the registration happens independently of ASGs and instantly, the scheduling of pods up to be serving traffic from ALB takes less than 10 seconds!

Independent scaling from nodes

With HostPort there is an eventual dependency on available nodes to scale the Ingress.
Plus the Ingress pod cannot be replaced in place but requires a termination first and then rescheduling. For a certain time, which can be more than a minute, this node is offline as an Ingress.
With this mode the (host networking ?) and HostPort is obsolete, which allows node independent scaling of Skipper pods! Skipper becomes a regular deployment and its ReplicaSet can be independent on the cluster size, which simplifies operations especially for smaller clusters. We are using a custom HPA metric attached to node group size to counteract this deployment / daemonset hybrid combination, which is obsolete now!

Synced de/registering and instantaneous response

Core idea is the event based registration to Kubernetes using pod Informer that receives immediate notifications about pod changes, which allow almost zero delayed updates on the target groups.

The registration happens as soon as the pod received an IP from AWS VPC. Hence the readiness probe of the ALB/NLB starts to monitor already during scheduling of the pod, serving the earliest possible. Tests in lab show pods serving ALB traffic well under 10s from scheduling!

Deregistration happens bound to the kubernetes event. That means the LB is now in sync with the cluster and will stop sending traffic before the pod is actually terminated. This implement save deregistering without traffic loss. Tests in lab show even under aggressive deployment scalings there are no packet losses measured!

TG without unhealthy targets

Since the IP based TGs are managed now by this controller, they represent pods and thus all of them are shown healthy, otherwise cleaned up by this controller.

Implementation details:

client-go Informer: This high level functions are providing a convenient access to event registrations of kubernetes. Since the event registration is the key of fast response and efficient compared to high rate polling, using this standard factory methods seems standing to reason.

Tests

successful transistion of TG from type Instance to type IP vice versa
the controller registers pods that are discovered
the controller deregisters pods that are "terminating" status
the controller recovers desired state if manual intervention on the TG happened by "resyncing"
it removes pods that are killed or dead

HostNetwork and HostPort combinations tested:

access mode	HostNetwork	HostPort	Result	Notes
`Instance`	`true`	`true`	`ok`	status quo
`IP`	`true`	`true`	`ok`	PodIP == HostIP --> limited benefits
`IP`	`false`	`true`	`ok`	PodIP != HostIP --> limited scaling, node bounde
`IP`	`true`	`false`	`N/A`	impossible, HN implies HP
`IP`	`false`	`false`	`OK`	goal achieved: free scaling and HA

Example logs:

time="2021-12-17T15:36:36Z" level=info msg="Deleted Pod skipper-ingress-575548f66-k99vs IP:172.16.49.249" 
time="2021-12-17T15:36:36Z" level=info msg="Deregistering CNI targets: [172.16.49.249]" 
time="2021-12-17T15:36:37Z" level=info msg="New Pod skipper-ingress-575548f66-qff2q IP:172.16.60.193"
time="2021-12-17T15:36:40Z" level=info msg="Registering CNI targets: [172.16.60.193]"

misc

extended the service account with required RBAC permissions to watch/list pods
added example of Skipper without a HostPort and HostNetwork

universam1 · 2021-12-17T15:43:42Z

fixes #298

cc @szuecs

szuecs · 2021-12-17T17:36:45Z

@universam1 thanks, sounds like a great idea for aws-vpc cni.
Does the change also work with plain hostnetwork without nodeport?
We don't use nodeport service. see also https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/skipper/deployment.yaml#L47-L56

One drawback is that we need informers and because of that client-go WATCH (internal to informers). We tried to omit client-go, because of kubernetes/client-go#374 (comment) zalando-incubator/stackset-controller#263 , but since kubernetes/kubernetes#95981 was done maybe it's also fine to use client-go.

universam1 · 2021-12-18T10:06:23Z

Thanks @szuecs for review!

Does the change also work with plain hostnetwork without nodeport?

Please have a look at the roster of tests:

access mode	HostNetwork	HostPort	Result	Notes
`Instance`	`true`	`true`	`ok`	status quo
`IP`	`true`	`true`	`ok`	PodIP == HostIP --> limited benefits
`IP`	`false`	`true`	`ok`	PodIP != HostIP --> limited scaling, node bound
`IP`	`true`	`false`	`N/A`	impossible, HN implies HP
`IP`	`false`	`false`	`OK`	goal: free scaling and HA

From my testing it is impossible to define HostNetwork but disable HostPort. AFAIK this combination is not avail as the HostNetworking implies pod port mapping.

We don't use nodeport service

Likewise, we don't either! Sorry I mixed up nodeport with hostport, while the first stands for a service and the latter for a port on the host. I was only referring to the host port, corrected the description

drawback is that we need informers and because of that client-go WATCH (internal to informers)

Curious, unaware of this issue. Wondering, as the informer is used across SIGS projects such as ClusterAutoscaler or even suggested pattern for controller - is that issue you face a corner case?

szuecs · 2021-12-18T15:25:31Z

Thanks for the testing table, looks great to me.
Shared informers are afaik only a different api working on cache and internal using watch and hopefully Timeouts, which we can check or if not try to pass a Transport with Timeouts when we create the client.
You will observe the issue not often and only if you check "node issues" that likely everyone will just terminate because it is one way to fix it fast. For ingress (skipper, this one and external-dns), we were conservative and use things that work reliably. I think external-dns changed it or we added passing a transport to it in the early days.
If we change to informers we should change all of it to informers in my opinion.
Maybe as a follow up PR.

universam1 · 2021-12-18T16:23:37Z

we can check or if not try to pass a Transport with Timeouts when we create the client.

Thank you for the context, pretty impressive deep dive! Issues like this are incredible challenging to surface!
AFAIK the client defaults to https://pkg.go.dev/net/http#DefaultTransport that looks like it incorporates a timeout and H2 at least for Go 1.16!?
The rest.Client supports Timeout as well https://github.com/kubernetes/client-go/blob/4ae0d1917adff2d93ea4c6a7ae3a9e5c19280b9f/rest/transport.go#L32

Since I'm not familiar with this issue, can you help me out what the desired config would look like?

If we change to informers we should change all of it to informers in my opinion. Maybe as a follow up PR.

IMHO a full refactor could simplify things, but might render this PR large to impossible to land.

szuecs · 2021-12-20T15:48:28Z

we can check or if not try to pass a Transport with Timeouts when we create the client.

Thank you for the context, pretty impressive deep dive! Issues like this are incredible challenging to surface! AFAIK the client defaults to https://pkg.go.dev/net/http#DefaultTransport that looks like it incorporates a timeout and H2 at least for Go 1.16!?

Interesting not sure if this helps in this case. On the other hand client-go uses H2 ping frames to detect broken connections, so should be fine.

The rest.Client supports Timeout as well https://github.com/kubernetes/client-go/blob/4ae0d1917adff2d93ea4c6a7ae3a9e5c19280b9f/rest/transport.go#L32

Since I'm not familiar with this issue, can you help me out what the desired config would look like?

I would just start to pass a custom Transport, you can use the DefaultTransport to pass it.
When I would do it I would likely use https://pkg.go.dev/github.com/zalando/skipper/net#NewTransport.
Let's see if we observe errors and then we can work on a fix for these. Maybe H2 ping frames detect all things we need already.

If we change to informers we should change all of it to informers in my opinion. Maybe as a follow up PR.

IMHO a full refactor could simplify things, but might render this PR large to impossible to land.

Agree out of scope, but we should have an issue that we can work on later to migrate.

universam1 · 2021-12-27T15:15:23Z

I would likely use https://pkg.go.dev/github.com/zalando/skipper/net#NewTransport.

@szuecs struggling to leverage the skipper lib as a custom Transport here, running into the error kubernetes/client-go#452 "using a custom transport with TLS certificate options or the insecure flag is not allowed". I don't see TLSClientConfig being exported to be provided for TLS setup and alternatively disabling TLS doesn't sound like an option to me. Thanks for any hint!
Found a solution by providing a http.Transport to the skipper Options which can be configured accordingly.

universam1 · 2021-12-29T16:39:53Z

@szuecs Added complete set of unit tests, implemented the Skipper NewTransport, tested in Lab, I believe this should be ready to go.
Anything missing?

go.mod

szuecs · 2022-01-03T10:27:14Z

@universam1 from my side only DCO :)

szuecs · 2022-01-03T18:04:47Z

Run make lint
golangci-lint run ./...
level=error msg="Running error: context loading failed: failed to load packages: timed out to load packages: context deadline exceeded"
level=error msg="Timeout exceeded: try increasing it by passing --timeout option"

restarted, in case it happens again please check why it takes so long and if it really takes so long increase the timeout, please

universam1 · 2022-01-03T18:52:02Z

please check why it takes so long and if it really takes so long increase the timeout, please

It requires a couple of adjustments on the CI config as much as I see, fine from my side if that's desired

szuecs · 2022-01-03T19:15:49Z

yes, please

szuecs · 2022-01-04T09:31:09Z

👍

universam1 · 2022-01-10T12:05:58Z

added a section in the readme to describe that mode, anything missing @szuecs ?

README.md

aws/adapter.go

worker.go

kubernetes/pods.go

worker.go

aws/adapter.go

kubernetes/pods.go

worker.go

AlexanderYastrebov · 2022-01-20T14:15:19Z

kubernetes/pods.go

+// PodInformer is a event handler for Pod events registered to, that builds a local list of valid and relevant pods
+// and sends an event to the endpoint channel, triggering a resync of the targets.
+func (a *Adapter) PodInformer(ctx context.Context, endpointChan chan<- []string) error {
+	podEndpoints := sync.Map{}


It seems we need to pre-populate podEndpoints map before launching informer, otherwise it is empty after controller restart and would trigger an update after receiving a single pod change.

there is a guard for that in here already:
https://github.com/zalando-incubator/kube-ingress-aws-controller/pull/461/files#diff-0ca2bf28ee146c014db869dd42c216dff322e265933c8b179b939f18016386d6R673

As far as I can tell, on controller restart:

the main cycle will send a list of the relevant target groups.

this informer will receive a change of a single pod and would publish an updated list of endpoints of size 1 since podEndpoints is empty.
The SetTargetsOnCNITargetGroups will take the difference between registered targets and this single item endpoint list and will de-register other valid endpoints.

The "guard" only verifies that TGs and EPs were received at least onec, but the update itself assumes that endpoint list contains all available EPs.

That case is not existing because the informer is long time in sync before the TGs are enumerated. It is flushing those first messages. There are several seconds between

I guess @AlexanderYastrebov you should take some time to explore this concept in a lab to learn how it behaves

#461 (comment) doesn't work, the events are generated yet by informer.Run, not from factory.Start . That's why we cannot omit the informer.Run, and also the semaphore via atomic.StoreInt64 is effectless here, unfortunately! 😞

I think hypothetical or not, if the problem exist we need to tackle it.
We have sometimes more than 1k Ingress objects in a cluster with an SLA that 40k is fine. This can easily lead to kubernetes slower than AWS API response. time.Sleep is never a solution, I would rather accept a TODO comment that needs to be addressed some day.

What if we just list endpoints once on creation to populate the cache?
One issue that could happen is that if we do not execute an update operation on cloud load balancers to update the endpoints list, then a deployment of kube-ingress-aws-controller could miss a new created endpoint while it was down.
I think the right thing would be at start to make sure we list endpoints through informer (warm up cache) and use the list to trigger updates to all cloud load balancers. After that we can rely onUpdate events.

That's actually a smart idea to ~~determine when the cache equilibrium~~ is reached - I like that and it doesn't sound hard to implement...

@szuecs @AlexanderYastrebov Please check if you like this version. Using the list is convenient in several ways as it delivers the warm cache right from the start!
The successive informer updates result in a no-op since we do anyway a map reduce! via LoadOrStore and LoadAndDelete
That means, there is nothing like a warm-cache sync validation required!

The beauty is that even in a hard race we will process correctly a pod deletion right after the initial list just fine.

I think we should not over complicate things see also #461
@AlexanderYastrebov do you agree?

kubernetes/pods.go

AlexanderYastrebov · 2022-01-24T16:10:32Z

kubernetes/pods.go

+	informer := factory.Core().V1().Pods().Informer()
+	factory.Start(ctx.Done())


factory.Start() apparently starts informers but we configure informer EventHandler later which seems wrong.

From this perspective informer.Run(ctx.Done()) also feels wrong as factory.Start already runs it.

I guess we can do what others do:
workqueue: https://github.com/kubernetes/client-go/blob/master/examples/workqueue/main.go#L128-L132
fakeclient: https://github.com/kubernetes/client-go/blob/master/examples/fake-client/main_test.go#L69-L72

The cache.WaitForCacheSync() does what we need:
https://github.com/kubernetes/client-go/blob/3255cdcd9cf0ce5ea718c97f4aea519130649aca/informers/factory.go#L141
https://github.com/kubernetes/client-go/blob/3255cdcd9cf0ce5ea718c97f4aea519130649aca/tools/cache/shared_informer.go#L254

I also asked in sig-apimachinery, if they consider this safe or if we have to do something else.

so Kubernetes sig-apimachinery says that it's not an issue because List is used before Watch, which means there is no race.

@szuecs
In another discussion #461 (comment) this issue kubernetes/kubernetes#92777 (comment) was brought up which is exactly about the problem we have now - configuring informer after factory start.

Then we should just move the AddEventhandler call with body to before Start()

Note the extended unit test

See the sources https://github.com/kubernetes/client-go/blob/master/tools/cache/shared_informer.go#L552-L556 that explain how a safe join happens to a running informer

I prefer this solution over a semaphore since it is clean and easier to grasp

So this makes sure we get all entries from a fresh LIST, right?

Yes- so theoretically we could omit the explicit List before since we get this one implicitly by late binding, but it serves two reasons so I'd keep it

So this makes sure we get all entries from a fresh LIST, right?
Yes- so theoretically we could omit the explicit List

Idk, I see it uses addNotification but we only handle updates.

``` Run make lint golangci-lint run ./... level=error msg="Running error: context loading failed: failed to load packages: timed out to load packages: context deadline exceeded" level=error msg="Timeout exceeded: try increasing it by passing --timeout option" ``` ref: golangci/golangci-lint#825 Seems like an incompatibility with the client-go pkg and the module go version. Updating to 1.16 and recreating the go.sum fixes the timeouts and slims down the dependency list. Also updating the install script for github action script of golangci-lint as it is deprecated Signed-off-by: Samuel Lang <[email protected]>

AlexanderYastrebov · 2022-01-31T09:50:51Z

I guess we've spent enough time pondering on the proposed solution so it is good enough to be accepted.
@universam1 Thank you for the commitment and persistence 👍

AlexanderYastrebov · 2022-01-31T09:50:56Z

👍

szuecs · 2022-01-31T18:16:18Z

👍

szuecs · 2022-01-31T18:16:48Z

@universam1 thanks and let's have follow up fixes if you detect something that does not work as great as it could.

universam1 · 2022-01-31T18:53:48Z

Thank you @szuecs for releasing v0.12.17 - just wondering if this is really considered a bugfix release bump? 😺

szuecs · 2022-01-31T18:55:56Z

@universam1 we increase minor version in case there is a need to change flags or something like this. We add always migration guide to the readme for minor release increase.

…` version Before version `0.12.17` loadbalancer target group target type was not specified and defaulted to `instance` in Cloudformation, see https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticloadbalancingv2-targetgroup.html#cfn-elasticloadbalancingv2-targetgroup-targettype PR #461 introduced AWS CNI mode and configured target group target type either to 'ip' or 'instance'. Changing target type from unset to 'instance' in Cloudformation triggers target group re-creation which makes it impossible to upgrade from pre `0.12.17` without downtime. This change: - makes `target-access-mode` flag required to force users to choose proper value - renames `HostPort` option of `target-access-mode` into `InstanceFilter` - introduces a new 'Legacy' option that does not set target type and thus enables upgrade from pre `0.12.17`. Fixes #507 Signed-off-by: Alexander Yastrebov <[email protected]>

…` version Before version `0.12.17` loadbalancer target group target type was not specified and defaulted to `instance` in CloudFormation, see https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticloadbalancingv2-targetgroup.html#cfn-elasticloadbalancingv2-targetgroup-targettype PR #461 introduced AWS CNI mode and configured target group target type either to `ip` or `instance`. Changing target type from unset to 'instance' in Cloudformation triggers target group re-creation which makes it impossible to upgrade from pre `0.12.17` without downtime. This change: - makes `target-access-mode` flag required to force users to choose proper value - renames `HostPort` option of `target-access-mode` into `InstanceFilter` - introduces a new 'Legacy' option that does not set target type and thus enables upgrade from pre `0.12.17`. Fixes #507 Signed-off-by: Alexander Yastrebov <[email protected]>

…` version Before version `0.12.17` loadbalancer target group target type was not specified and defaulted to `instance` in CloudFormation, see https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticloadbalancingv2-targetgroup.html#cfn-elasticloadbalancingv2-targetgroup-targettype PR #461 introduced AWS CNI mode and configured target group target type either to `ip` or `instance`. Changing target type from unset to `instance` in Cloudformation triggers target group re-creation which makes it impossible to upgrade from pre `0.12.17` without downtime. This change: - makes `target-access-mode` flag required to force users to choose proper value - renames `HostPort` option of `target-access-mode` into `InstanceFilter` - introduces a new 'Legacy' option that does not set target type and thus enables upgrade from pre `0.12.17`. Fixes #507 Signed-off-by: Alexander Yastrebov <[email protected]>

…` version Before version `0.12.17` loadbalancer target group target type was not specified and defaulted to `instance` in CloudFormation, see https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticloadbalancingv2-targetgroup.html#cfn-elasticloadbalancingv2-targetgroup-targettype PR #461 introduced AWS CNI mode and configured target group target type either to `ip` or `instance`. Changing target type from unset to `instance` in Cloudformation triggers target group re-creation which makes it impossible to upgrade from pre `0.12.17` without downtime. This change: - makes `target-access-mode` flag required to force users to choose proper value - renames `HostPort` option of `target-access-mode` into `InstanceFilter` - introduces a new `Legacy` option that does not set target type and thus enables upgrade from pre `0.12.17`. Fixes #507 Signed-off-by: Alexander Yastrebov <[email protected]>

…` version Before version `0.12.17` loadbalancer target group target type was not specified and defaulted to `instance` in CloudFormation, see https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticloadbalancingv2-targetgroup.html#cfn-elasticloadbalancingv2-targetgroup-targettype PR #461 introduced AWS CNI mode and configured target group target type either to `ip` or `instance`. Changing target type from unset to `instance` in Cloudformation triggers target group re-creation which makes it impossible to upgrade from pre `0.12.17` without downtime. This change: - makes `target-access-mode` flag required to force users to choose proper value - introduces a new `Legacy` option that does not set target type and thus enables upgrade from pre `0.12.17`. Fixes #507 Signed-off-by: Alexander Yastrebov <[email protected]>

universam1 force-pushed the TGIP branch from f7be221 to 5358f8c Compare December 17, 2021 15:42

universam1 force-pushed the TGIP branch 3 times, most recently from 80db96d to a04d016 Compare December 18, 2021 10:49

universam1 force-pushed the TGIP branch from a04d016 to 18570f4 Compare December 29, 2021 16:37

universam1 force-pushed the TGIP branch from 18570f4 to 664cd10 Compare December 29, 2021 17:24

szuecs reviewed Jan 3, 2022

View reviewed changes

go.mod Show resolved Hide resolved

universam1 force-pushed the TGIP branch 3 times, most recently from 712b195 to 653679c Compare January 3, 2022 13:27

universam1 force-pushed the TGIP branch 2 times, most recently from 782d97a to cd5f94b Compare January 4, 2022 08:50

universam1 force-pushed the TGIP branch from cd5f94b to 1848ce8 Compare January 10, 2022 12:05

AlexanderYastrebov reviewed Jan 10, 2022

View reviewed changes

README.md Outdated Show resolved Hide resolved

universam1 force-pushed the TGIP branch from 1848ce8 to bd8ede1 Compare January 10, 2022 13:25

AlexanderYastrebov reviewed Jan 10, 2022

View reviewed changes