Add Readiness Probe for EventListener Deployment #467
Conversation
/test pull-tekton-triggers-integration-tests
1 similar comment
/test pull-tekton-triggers-integration-tests
I don't see the relationship between this change and the failure on the integration tests.
/test pull-tekton-triggers-integration-tests
/retest
I don't either, but that test has not been flaky before 🤔
/test pull-tekton-triggers-integration-tests
2 similar comments
/test pull-tekton-triggers-integration-tests
/test pull-tekton-triggers-integration-tests
Force-pushed the branch from 1841749 to 699ab14
I'm super keen to help get this over the line: because of the way the GCE ingress controller works, using an Ingress with a default EventListener service will never set the upstream health check correctly unless a readiness probe is present. In an effort to understand what's going on, I've been looking at the build logs. What I'm seeing is this:
But I'm dead confused here, as that doesn't look like anything to do with this change. It looks more like we're screwing up the installation of the triggers, rather than failing an actual test. Am I looking in the wrong place? Is there an easy command I can use to replicate these test failures locally? Sorry for the total novice questions. Any help getting up to speed would be great.
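For reference, the probe this PR adds points at the same /live endpoint the sink already exposes for liveness. A minimal sketch of what that looks like in the Deployment's container spec, assuming the sink's 8080 port and with illustrative timings (the exact values in the PR may differ):

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// readinessProbe is a sketch of a readiness probe for the EventListener sink
// container. Without one, the GCE ingress controller falls back to a health
// check on "/", which the sink does not serve, so the upstream check never
// turns healthy. Port and timings here are illustrative assumptions.
var readinessProbe = &corev1.Probe{
	// Note: in k8s.io/api releases from v1.23 on, this embedded field is
	// named ProbeHandler instead of Handler.
	Handler: corev1.Handler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/live",
			Port: intstr.FromInt(8080),
		},
	},
	PeriodSeconds:    10,
	FailureThreshold: 3,
}
```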
@lawrencejones thanks for looking into this. In previous runs the failure was caused by the pod for the sink going into a crash loop. I suspect that the test node may be slow, and thus the default values for the readiness probe were not enough, so in the latest version of the PR I raised those values. The error you saw instead seems unrelated indeed; I'm just going to try a retest, as last week our CI was busted for a few hours, so it may be that.
/retest
The error happens in the scale test, which creates a single EventListener with a list of a thousand binding/TriggerTemplate pairs:
My suspicion is that in this case the sink takes longer than 3×10s to start because of the scale. I wonder if that is supposed to happen. I would not want to change the default values much more just to fit this test, but we might be able to override the values for this test specifically.
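To make the timing concrete: with a periodSeconds of 10 and a failureThreshold of 3, a probe gives up after roughly 3 × 10s = 30s of failed checks (plus any initialDelaySeconds). A hedged sketch of what a per-test override could look like; the function name and numbers are purely illustrative, not anything this PR actually adds:

```go
import corev1 "k8s.io/api/core/v1"

// looserProbeForScaleTest returns a copy of the sink's probe with a wider
// window, so an EventListener carrying ~1000 triggers has time to come up.
func looserProbeForScaleTest(base *corev1.Probe) *corev1.Probe {
	p := base.DeepCopy()
	p.InitialDelaySeconds = 30 // let the large EventListener finish start-up
	p.FailureThreshold = 12    // ~2 minutes of failed checks instead of ~30s
	return p
}
```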
I can repro this locally as well... The EL Sink container goes into a CrashLoopBackoff and the test times out. What I find surprising is that adding a Readiness probe is what causes this failure. This test has been around for a while, and if it's the long startup time that is causing issues, I'd have expected the test to start failing after we added the Liveness probe (not the Readiness one). Will keep digging, but initial thoughts:
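One way to see why the Readiness probe (and not the earlier Liveness probe) changed the test's behaviour: a pod only counts towards a Deployment's Available condition once its readiness probe passes, whereas a failing liveness probe merely restarts the container. A small sketch of the kind of check an e2e test typically waits on (the helper name here is hypothetical, not the actual test code):

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentAvailable reports whether a Deployment's Available condition is
// True. Readiness probes feed directly into this condition, so adding one can
// flip a previously "available" sink to unavailable; a liveness probe alone
// only restarts the container and leaves this condition untouched for longer.
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentAvailable {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```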
As a data point, I deployed this PR on the dogfooding and robocat clusters, and had no issues there.
Great, looks like the next step is debugging the e2e setup further. I'll look into what exactly is causing the crash loop for the sink pods.
Interestingly, the test passes when I run it on Minikube.
Upon more investigation, I do not think this is related to the number of triggers at all... adding the Readiness probe in GKE is just somehow exposing a race condition in the test. The root cause is a permission error for the Sink. Some findings:
The Sink created by the test eventually goes into a CrashLoop because it lacks the permissions needed to get the logging ConfigMap, resulting in the LivenessProbe restarting it. However, there is initially a short window in which the Sink is marked `Available`, which is apparently enough for the test to pass. The bug was discovered in tektoncd#467, where adding a ReadinessProbe and running the test in GKE makes it consistently fail. This commit fixes the bug by adding a ServiceAccount to the EL with permission to get the logging ConfigMap in the namespace.
Fixes tektoncd#546
Signed-off-by: Dibyo Mukherjee <[email protected]>
Should be fixed by #547 @afrittoli
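For context on the fix, the missing piece is RBAC: the sink's ServiceAccount needs read access to the logging ConfigMap in its namespace. A minimal sketch of the sort of namespaced Role that grants it; the role and ConfigMap names here are assumptions, not necessarily what #547 uses:

```go
import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// loggingConfigReader grants read access to a single logging ConfigMap in the
// EventListener's namespace; bind it to the EL ServiceAccount with a
// RoleBinding. Names are illustrative.
var loggingConfigReader = rbacv1.Role{
	ObjectMeta: metav1.ObjectMeta{Name: "el-logging-config-reader"},
	Rules: []rbacv1.PolicyRule{{
		APIGroups:     []string{""},
		Resources:     []string{"configmaps"},
		ResourceNames: []string{"config-logging"},
		Verbs:         []string{"get", "list", "watch"},
	}},
}
```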
/approve
Will need a rebase once #547 is merged!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dibyom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
The Sink created by the test eventually goes into a CrashLoop because it lacks the permissions needed to get the logging ConfigMap, resulting in the LivenessProbe restarting it. However, there is initially a short window in which the Sink is marked `Available`, which is apparently enough for the test to pass. The bug was discovered in #467, where adding a ReadinessProbe and running the test in GKE makes it consistently fail. This commit fixes the bug by adding a ServiceAccount to the EL with permission to get the logging ConfigMap in the namespace.
Fixes #546
Signed-off-by: Dibyo Mukherjee <[email protected]>
Add a Readiness Probe for EventListener Deployment at URL /live
Force-pushed the branch from 699ab14 to 6e1e40c
/test pull-tekton-triggers-integration-tests
/lgtm
Changes
Add a Readiness Probe for EventListener Deployment at URL /live
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
See the contribution guide for more details.
Release Notes
Add a Readiness Probe for EventListener Deployment at URL /live.