
Add Readiness Probe for EventListener Deployment #467

Merged
1 commit merged into tektoncd:master on Apr 21, 2020

Conversation

@afrittoli (Member) commented Mar 2, 2020

Changes

Add a Readiness Probe for EventListener Deployment at URL /live

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

See the contribution guide for more details.

Release Notes

Add a Readiness Probe for EventListener Deployment at URL /live.
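For readers skimming the diff, the change boils down to a readiness probe on the EventListener sink container. The sketch below is illustrative only: the path and port come from the discussion and logs in this thread, the Deployment itself is generated by the EventListener reconciler, and unrelated fields are omitted.

```yaml
# Illustrative sketch, not the actual generated manifest.
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: event-listener
          readinessProbe:
            httpGet:
              path: /live
              port: 8080
          livenessProbe:        # already present before this change, same endpoint
            httpGet:
              path: /live
              port: 8080
```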

@tekton-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files) on Mar 2, 2020
@afrittoli (Member Author)

/test pull-tekton-triggers-integration-tests

1 similar comment
@afrittoli (Member Author)

/test pull-tekton-triggers-integration-tests

@afrittoli (Member Author)

I don't see the relationship between this change and the failure on the TestEventListenerScale test. Is this a flaky test or should I dig further?

@afrittoli (Member Author)

/test pull-tekton-triggers-integration-tests

@khrm (Contributor) commented Mar 4, 2020

/retest

@dibyom (Member) commented Mar 6, 2020

> I don't see the relationship between this change and the failure on the TestEventListenerScale test. Is this a flaky test or should I dig further?

I don't either, but that test has not been flaky before 🤔

@dibyom (Member) commented Mar 12, 2020

/test pull-tekton-triggers-integration-tests

2 similar comments
@afrittoli (Member Author)

/test pull-tekton-triggers-integration-tests

@afrittoli (Member Author)

/test pull-tekton-triggers-integration-tests

@lawrencejones (Contributor)

I'm super keen to help get this over the line: because of the way the GCE ingress controller works, an Ingress in front of a default EventListener service will never get its upstream health check set correctly unless a readiness probe is present.

In an effort to understand what's going on, I've been looking at the build logs. What I'm seeing is this:

I0409 16:23:28.255] service/tekton-triggers-webhook created
I0409 16:23:28.288] 2020/04/09 16:23:28 error processing import paths in "config/webhook.yaml": error resolving image references: GET https://gcr.io/v2/token?scope=repository%3Atekton-prow-11%2Fttriggers-e2e-img%2Fwebhook-dd1edc925ee1772a9f76e2c1bc291ef6%3Apush%2Cpull&scope=repository%3Adistroless%2Fstatic%3Apull&service=gcr.io: UNAUTHORIZED: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
I0409 16:23:28.352] ERROR: Tekton Triggers installation failed
I0409 16:23:28.354] ***************************************
I0409 16:23:28.355] ***         E2E TEST FAILED         ***
I0409 16:23:28.355] ***    Start of information dump    ***
...

But I'm dead confused here, as that doesn't look like it has anything to do with this change. It looks more like we're screwing up the installation of the triggers rather than failing an actual test.

Am I looking in the wrong place? Is there an easy command I can use to replicate these test failures locally?

Sorry for the total novice questions. Any help getting up-to-speed would be great.

@afrittoli (Member Author)

> I'm super keen to help get this over the line [...] Is there an easy command I can use to replicate these test failures locally?

@lawrencejones thanks for looking into this. In previous runs the failure was caused by the pod for the sink going into a crash loop. I suspect that the test node may be slow, and thus the default values for the readiness probe were not enough. In the latest version of the PR I raised the failure-threshold to 3 so that the pod has more time to start, since the time between retries (periodSeconds) defaults to 10 seconds.
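For reference, that tuning in probe terms looks roughly like this; the field names are the standard Kubernetes ones, and only the values stated above come from the PR:

```yaml
readinessProbe:
  httpGet:
    path: /live
    port: 8080
  periodSeconds: 10    # Kubernetes default interval between probe attempts
  failureThreshold: 3  # raised in this PR; roughly 3 x 10s = 30s of grace before the pod is marked unready
```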

The error you saw seems unrelated indeed. I'm just going to try a retest, as last week our CI was busted for a few hours, so it may be that.

@afrittoli (Member Author)

/retest

@afrittoli (Member Author)

The error happens in the scale test, which creates a single EventListener with a list of one thousand binding/TriggerTemplate pairs:

I0414 13:36:22.620]                 reason: CrashLoopBackOff
I0414 13:36:22.621]                 message: Back-off 5m0s restarting failed container=event-listener pod=el-my-eventlistener-f7556469-f7hr8_arrakis-9d6fw(c8a4fdce-f45b-45f6-a53c-ba44fb82b1ab)
I0414 13:36:22.621]               running: null
I0414 13:36:22.621]               terminated: null
I0414 13:36:22.621]             lastterminationstate:
I0414 13:36:22.621]               waiting: null
I0414 13:36:22.621]               running: null
I0414 13:36:22.622]               terminated:
I0414 13:36:22.622]                 exitcode: 1
I0414 13:36:22.622]                 signal: 0
I0414 13:36:22.622]                 reason: Error
I0414 13:36:22.622]                 message: ""

My suspicion is that in this case the sink takes longer than 3x10s to start because of the scale. I wonder whether that is expected. I would not want to change the default values much more just to fit this test, but we might be able to override the values for this test specifically.

@dibyom (Member) commented Apr 16, 2020

I can repro this locally as well. The EL Sink container goes into a CrashLoopBackOff and the test times out. What I find surprising is that adding a Readiness probe is what causes this failure. This test has been around for a while, and if it's the long startup time that is causing issues, I'd have expected the test to start failing after we added the Liveness probe (not the Readiness one).

Will keep digging but initial thoughts:

  1. We should investigate why the Sink is taking a long time to start up.
  2. We should look into the scale test numbers, i.e. is 1000 triggers realistic for an EL?
  3. We'd probably want to be able to configure/tune the probe parameters (initialDelaySeconds etc.) via something like "Feature request: EventListener should support customized podTemplate" #505

@afrittoli (Member Author)

As a data point, I deployed this PR on the dogfooding and robocat clusters and had no issues there.

@dibyom (Member) commented Apr 20, 2020

> As a data point, I deployed this PR on the dogfooding and robocat clusters and had no issues there.

Great. Looks like the next step is debugging the e2e setup further. I'll look into what exactly is causing the crash loop for the sink pods.

@dibyom (Member) commented Apr 20, 2020

Interestingly, the test passes when I run it on Minikube.

@dibyom (Member) commented Apr 20, 2020

Upon more investigation, I do not think this is related to the number of triggers at all. Adding the Readiness probe in GKE is just somehow exposing a race condition in the test. The root cause is a permission error for the Sink. Some findings:

  • The Liveness and Readiness probes fail because the port is not listening:
 Readiness probe failed: Get http://10.36.1.9:8080/live: dial tcp 10.36.1.9:8080: connect: connection refused
  • The EL sink fails because of a permission error related to the logging configMap:
"level":"fatal","logger":"eventlistener","caller":"logging/logging.go:52","msg":"failed to start configuration manager: error waiting for ConfigMap informer to sync","knative.dev/controller":"eventlistener","stacktrace":"github.com/tektoncd/triggers/pkg/logging.ConfigureLogging\n\t/usr/local/google/home/dibyajyoti/dev/go/src/github.com/tektoncd/triggers/pkg/logging/logging.go:52\nmain.main\n\t/usr/local/google/home/dibyajyoti/dev/go/src/github.com/tektoncd/triggers/cmd/eventlistenersink/main.go:63\nruntime.main\n\t/usr/lib/google-golang/src/runtime/proc.go:203"}
  • The above failure happens even on the 0.4 release (w/o the Readiness fix):
arrakis-jlhs4      0s          Normal    Created                  pod/el-my-eventlistener-6ddfccfc69-x8xvt            Created container event-listener
arrakis-jlhs4      0s          Normal    Started                  pod/el-my-eventlistener-6ddfccfc69-x8xvt            Started container event-listener
arrakis-jlhs4      0s          Warning   Unhealthy                pod/el-my-eventlistener-6ddfccfc69-x8xvt            Liveness probe failed: Get http://10.36.1.13:8080/live: dial tcp 10.36.1.13:8080: connect: connection refused
arrakis-jlhs4      0s          Normal    Killing                  pod/el-my-eventlistener-6ddfccfc69-x8xvt            Container event-listener failed liveness probe, will be restarted
arrakis-jlhs4      0s          Normal    Pulled                   pod/el-my-eventlistener-6ddfccfc69-x8xvt            Container image "gcr.io/tekton-releases/github.com/tektoncd/triggers/cmd/eventlistenersink@sha256:76c208ec1d73d9733dcaf850240e1b3990e5977709a03c2bd98ad5b20fab9867" already present on machine
  • The test keeps polling the EL every second until the EL's status indicates that the deployment is healthy, i.e. the "Available" status.Condition on the deployment is marked true (see the sketch after this list). There is a race here: the deployment is "Available" for a short while before the liveness probe kills it.

  • Before the Readiness probe was added, there was enough of a delay for the test's polling to pass once, which marked the test successful. However, if you queried the EL after the test completed, you'd eventually see the same CrashLoop.

  • With the addition of the Readiness probe in the GKE environment, none of the test's polling calls ever pass and the test times out (this happens even if you reduce the number of triggers from 1000 to 1). In other environments like Minikube, the test continues to pass due to the behavior described above.
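To spell out what the test is waiting for: the standard Kubernetes "Available" condition on the Deployment, roughly the shape below (illustrative, not output from the failing run). The race is that this condition flips to True briefly before the liveness probe restarts the container.

```yaml
# Illustrative shape of the Deployment status the test polls for.
status:
  conditions:
    - type: Available
      status: "True"
      reason: MinimumReplicasAvailable
      message: Deployment has minimum availability.
```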

dibyom added a commit to dibyom/triggers that referenced this pull request Apr 20, 2020
The Sink created by the test eventually goes into a CrashLoop because it lacks
the permissions needed to get the logging configMap, resulting in the
LivenessProbe restarting it. However, there is a short delay initially when the
Sink is marked `Available`, which is apparently enough for the test to pass. The
bug was discovered in tektoncd#467 when adding a ReadinessProbe, and running the test in
GKE made it consistently fail. This commit fixes the bug by adding a
ServiceAccount to the EL with permission to get the logging configMap in the
namespace.

Fixes tektoncd#546

Signed-off-by: Dibyo Mukherjee <[email protected]>
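For illustration, the kind of namespace-scoped grant the commit message describes would look roughly like this; the names below are hypothetical, the real manifests live in #547, and the ConfigMap informer needs list/watch in addition to get:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: el-logging-config-reader    # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: el-logging-config-reader    # hypothetical name
subjects:
  - kind: ServiceAccount
    name: el-test-sa                # hypothetical; the ServiceAccount bound to the EL in the test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: el-logging-config-reader
```
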
@dibyom (Member) commented Apr 20, 2020

Should be fixed by #547 @afrittoli

@dibyom added this to the Triggers v0.5.0 milestone on Apr 20, 2020
@dibyom (Member) left a review comment

/approve

Will need a rebase once #547 is merged!

@tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dibyom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Apr 20, 2020
tekton-robot pushed a commit that referenced this pull request Apr 21, 2020
The Sink created by the test eventually goes into a CrashLoop because it lacks
the permissions needed to get the logging configMap, resulting in the
LivenessProbe restarting it. However, there is a short delay initially when the
Sink is marked `Available`, which is apparently enough for the test to pass. The
bug was discovered in #467 when adding a ReadinessProbe, and running the test in
GKE made it consistently fail. This commit fixes the bug by adding a
ServiceAccount to the EL with permission to get the logging configMap in the
namespace.

Fixes #546

Signed-off-by: Dibyo Mukherjee <[email protected]>
Add a Readiness Probe for EventListener Deployment at URL /live
@afrittoli (Member Author)

/test pull-tekton-triggers-integration-tests

@dibyom (Member) commented Apr 21, 2020

/lgtm

@tekton-robot added the lgtm label (Indicates that a PR is ready to be merged) on Apr 21, 2020
@tekton-robot merged commit bf53476 into tektoncd:master on Apr 21, 2020