Add Readiness Probe for EventListener Deployment #467
Conversation
/test pull-tekton-triggers-integration-tests
1 similar comment
/test pull-tekton-triggers-integration-tests
I don't see the relationship between this change and the failure on the integration tests.
/test pull-tekton-triggers-integration-tests
/retest
I don't either, but that test has not been flaky before 🤔
/test pull-tekton-triggers-integration-tests
2 similar comments
/test pull-tekton-triggers-integration-tests
/test pull-tekton-triggers-integration-tests
Force-pushed the branch from 1841749 to 699ab14
I'm super keen to help get this over the line: because of the way the GCE ingress controller works, using an Ingress with a default EventListener service will never set the upstream health check correctly unless a readiness probe is present. In an effort to understand what's going on, I've been looking at the build logs. What I'm seeing is this:
But I'm dead confused here, as that doesn't look like anything to do with this change. It looks more like we're screwing up the installation of the triggers, rather than failing an actual test. Am I looking in the wrong place? Is there an easy command I can use to replicate these test failures locally? Sorry for the total novice questions. Any help getting up to speed would be great.
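For reference, the probe this PR adds points at the same /live endpoint the sink already exposes for liveness. A minimal sketch of what that looks like in the Deployment's container spec, assuming the sink's 8080 port and with illustrative timings (the exact values in the PR may differ):

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// readinessProbe is a sketch of a readiness probe for the EventListener sink
// container. Without one, the GCE ingress controller falls back to a health
// check on "/", which the sink does not serve, so the upstream check never
// turns healthy. Port and timings here are illustrative assumptions.
var readinessProbe = &corev1.Probe{
	// Note: in k8s.io/api releases from v1.23 on, this embedded field is
	// named ProbeHandler instead of Handler.
	Handler: corev1.Handler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/live",
			Port: intstr.FromInt(8080),
		},
	},
	PeriodSeconds:    10,
	FailureThreshold: 3,
}
```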
@lawrencejones thanks for looking into this. In previous runs the failure was caused by the pod for the sink going into a crash loop. I suspect that the test node may be slow, and thus the default values for the readiness probe were not enough, so in the latest version of the PR I raised those values. The error you saw instead seems unrelated indeed; I'm just going to try a retest, as last week our CI was busted for a few hours, so it may be that.
/retest
The error happens in the scale test, which creates a single EventListener with a list of a thousand binding/TriggerTemplate pairs:
My suspicion is that in this case the sink takes longer than 3×10s to start because of the scale. I wonder if that is supposed to happen. I would not want to change the default values much more just to fit this test, but we might be able to override the values for this test specifically.
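To make the timing concrete: with a periodSeconds of 10 and a failureThreshold of 3, a probe gives up after roughly 3 × 10s = 30s of failed checks (plus any initialDelaySeconds). A hedged sketch of what a per-test override could look like; the function name and numbers are purely illustrative, not anything this PR actually adds:

```go
import corev1 "k8s.io/api/core/v1"

// looserProbeForScaleTest returns a copy of the sink's probe with a wider
// window, so an EventListener carrying ~1000 triggers has time to come up.
func looserProbeForScaleTest(base *corev1.Probe) *corev1.Probe {
	p := base.DeepCopy()
	p.InitialDelaySeconds = 30 // let the large EventListener finish start-up
	p.FailureThreshold = 12    // ~2 minutes of failed checks instead of ~30s
	return p
}
```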
I can repro this locally as well... The EL Sink container goes into a CrashLoopBackoff and the test times out. What I find surprising is that adding a Readiness probe is what causes this failure. This test has been around for a while, and if it's the long startup time that is causing issues, I'd have expected the test to start failing after we added the Liveness probe (not the Readiness one). Will keep digging, but initial thoughts:
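One way to see why the Readiness probe (and not the earlier Liveness probe) changed the test's behaviour: a pod only counts towards a Deployment's Available condition once its readiness probe passes, whereas a failing liveness probe merely restarts the container. A small sketch of the kind of check an e2e test typically waits on (the helper name here is hypothetical, not the actual test code):

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentAvailable reports whether a Deployment's Available condition is
// True. Readiness probes feed directly into this condition, so adding one can
// flip a previously "available" sink to unavailable; a liveness probe alone
// only restarts the container and leaves this condition untouched for longer.
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentAvailable {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```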
As a data point, I deployed this PR on the dogfooding and robocat clusters, and had no issues there.
Great, looks like the next step is debugging the e2e setup further. I'll look into what exactly is causing the crash loop for the sink pods.
Interestingly, the test passes when I run it on Minikube.
Upon more investigation, I do not think this is related to the number of triggers at all... adding the Readiness probe in GKE is just somehow exposing a race condition in the test. The root cause is a permission error for the Sink. Some findings:
The Sink created by the test eventually goes into a CrashLoop because it lacks the permissions needed to get the logging ConfigMap, resulting in the LivenessProbe restarting it. However, there is initially a short window in which the Sink is marked `Available`, which is apparently enough for the test to pass. The bug was discovered in tektoncd#467, where adding a ReadinessProbe and running the test in GKE makes it consistently fail. This commit fixes the bug by adding a ServiceAccount to the EL with permission to get the logging ConfigMap in the namespace.
Fixes tektoncd#546
Signed-off-by: Dibyo Mukherjee <[email protected]>
Should be fixed by #547 @afrittoli
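For context on the fix, the missing piece is RBAC: the sink's ServiceAccount needs read access to the logging ConfigMap in its namespace. A minimal sketch of the sort of namespaced Role that grants it; the role and ConfigMap names here are assumptions, not necessarily what #547 uses:

```go
import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// loggingConfigReader grants read access to a single logging ConfigMap in the
// EventListener's namespace; bind it to the EL ServiceAccount with a
// RoleBinding. Names are illustrative.
var loggingConfigReader = rbacv1.Role{
	ObjectMeta: metav1.ObjectMeta{Name: "el-logging-config-reader"},
	Rules: []rbacv1.PolicyRule{{
		APIGroups:     []string{""},
		Resources:     []string{"configmaps"},
		ResourceNames: []string{"config-logging"},
		Verbs:         []string{"get", "list", "watch"},
	}},
}
```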
/approve
Will need a rebase once #547 is merged!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dibyom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
The Sink created by the test eventually goes into a CrashLoop because it lacks the permissions needed to get the logging ConfigMap, resulting in the LivenessProbe restarting it. However, there is initially a short window in which the Sink is marked `Available`, which is apparently enough for the test to pass. The bug was discovered in #467, where adding a ReadinessProbe and running the test in GKE makes it consistently fail. This commit fixes the bug by adding a ServiceAccount to the EL with permission to get the logging ConfigMap in the namespace.
Fixes #546
Signed-off-by: Dibyo Mukherjee <[email protected]>
Add a Readiness Probe for EventListener Deployment at URL /live
Force-pushed the branch from 699ab14 to 6e1e40c
/test pull-tekton-triggers-integration-tests
/lgtm
Changes
Add a Readiness Probe for EventListener Deployment at URL /live
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
See the contribution guide for more details.
Release Notes
Add a Readiness Probe for EventListener Deployment at URL /live.