Activator Pod keep in CrashLoopBackoff #4407

Closed · yuxiaoba opened this issue Jun 18, 2019 · 31 comments · Fixed by #4514
Labels: area/networking, kind/bug

yuxiaoba commented Jun 18, 2019

In what area(s)?

/area networking

What version of Knative?

Knative 0.6

Expected Behavior

Install the component of Knative Serving successfully

Actual Behavior

kubectl get pod -n knative-serving
NAME                                READY   STATUS    RESTARTS   AGE
activator-dfdb7f85-mfwhg            1/2     CrashLoopBackOff   15         32m
autoscaler-565cdf546b-hhkm7         2/2     Running   0          32m
controller-78564fd45c-x5mm5         1/1     Running   0          32m
networking-istio-6d95d868fb-tqwqw   1/1     Running   0          32m
webhook-5dc6f74b5b-5xdqv            1/1     Running   0          32m

Steps to Reproduce the Problem

Install Kubernetes 1.13

Install Istio 1.1.3

kubectl get pod -n istio-system 
NAME                                      READY   STATUS      RESTARTS   AGE
grafana-749c78bcc5-vtdm4                  1/1     Running     0          4h19m
istio-citadel-899dfb67c-7w5lz             1/1     Running     0          4h19m
istio-egressgateway-748d5fd794-khlm2      1/1     Running     0          4h5m
istio-egressgateway-748d5fd794-xddj4      1/1     Running     0          4h19m
istio-galley-555dd7c7d7-dffls             1/1     Running     0          4h19m
istio-ingressgateway-55dd86767f-lpp9w     1/1     Running     0          4h19m
istio-pilot-7979d58649-ffwjt              2/2     Running     0          4h19m
istio-pilot-7979d58649-lbqfv              2/2     Running     0          4h3m
istio-policy-f89c945dc-pfct8              2/2     Running     0          4h19m
istio-sidecar-injector-998dd6cbb-w5428    1/1     Running     1          4h19m
istio-telemetry-7d9d866c65-dw8mq          2/2     Running     0          4h19m
istio-tracing-595796cf54-zmqdb            1/1     Running     1          4h19m
kiali-5df77dc9b6-dlztz                    1/1     Running     0          4h19m
prometheus-7f87866f5f-gdl95               1/1     Running     1          4h19m

Install Knative Serving

kubectl apply --selector knative.dev/crd-install=true  --filename serving.yaml
kubectl apply --filename serving.yaml  --selector networking.knative.dev/certificate-provider!=cert-manager
@yuxiaoba yuxiaoba added the kind/bug Categorizes issue or PR as related to a bug. label Jun 18, 2019
yuxiaoba (Author) commented:

The logs of the activator pod are here:

  • For activator container
kubectl logs -n knative-serving activator-dfdb7f85-mfwhg activator
...
{"level":"error","ts":"2019-06-18T13:10:58.667Z","logger":"activator","caller":"websocket/connection.go:158","msg":"Failed to send ping message","knative.dev/controller":"activator","error":"connection has not yet been established","stacktrace":"github.com/knative/serving/vendor/github.com/knative/pkg/websocket.NewDurableConnection.func3\n\t/go/src/github.com/knative/serving/vendor/github.com/knative/pkg/websocket/connection.go:158"}
{"level":"error","ts":"2019-06-18T13:11:02.003Z","logger":"activator","caller":"websocket/connection.go:158","msg":"Failed to send ping message","knative.dev/controller":"activator","error":"connection has not yet been established","stacktrace":"github.com/knative/serving/vendor/github.com/knative/pkg/websocket.NewDurableConnection.func3\n\t/go/src/github.com/knative/serving/vendor/github.com/knative/pkg/websocket/connection.go:158"}
{"level":"error","ts":"2019-06-18T13:11:05.236Z","logger":"activator","caller":"websocket/connection.go:158","msg":"Failed to send ping message","knative.dev/controller":"activator","error":"connection has not yet been established","stacktrace":"github.com/knative/serving/vendor/github.com/knative/pkg/websocket.NewDurableConnection.func3\n\t/go/src/github.com/knative/serving/vendor/github.com/knative/pkg/websocket/connection.go:158"}

  • For istio-proxy container
...
[2019-06-18T13:11:18.469Z] "GET / HTTP/1.1" 404 NR "-" 0 0 0 - "-" "Go-http-client/1.1" "105ae136-0eb5-9ee9-8781-83a39e5a5ef8" "autoscaler.knative-serving.svc.cluster.local.:8080" "-" - - 10.68.254.15:8080 172.20.9.17:51564 -
[2019-06-18T13:11:19.867Z] "GET / HTTP/1.1" 404 NR "-" 0 0 0 - "-" "Go-http-client/1.1" "a544b997-356b-9de0-b840-f271d4ffcaa7" "autoscaler.knative-serving.svc.cluster.local.:8080" "-" - - 10.68.254.15:8080 172.20.9.17:51568 -
[2019-06-18T13:11:19.890Z] "GET /healthz HTTP/1.1" 500 - "-" 0 40 1 0 "-" "kube-probe/1.13" "4dac30b2-a6ee-91a2-a444-54c4dda5621b" "172.20.9.17:8012" "127.0.0.1:8012" inbound|80|http|activator-service.knative-serving.svc.cluster.local - 172.20.9.17:8012 172.20.9.1:58396 -
[2019-06-18T13:11:21.468Z] "GET / HTTP/1.1" 404 NR "-" 0 0 0 - "-" "Go-http-client/1.1" "04ac8022-4ac0-9ec2-9976-01b834135dd2" "autoscaler.knative-serving.svc.cluster.local.:8080" "-" - - 10.68.254.15:8080 172.20.9.17:51576 -
[2019-06-18T13:11:22.223Z] "GET /healthz HTTP/1.1" 500 - "-" 0 40 1 0 "-" "kube-probe/1.13" "a1ccb430-b763-9e64-86e5-72332884943e" "172.20.9.17:8012" "127.0.0.1:8012" inbound|80|http|activator-service.knative-serving.svc.cluster.local - 172.20.9.17:8012 172.20.9.1:58404 -
[2019-06-18T13:11:23.752Z] "GET / HTTP/1.1" 404 NR "-" 0 0 0 - "-" "Go-http-client/1.1" "8b589bee-a8c2-9651-92f3-4adda79006db" "autoscaler.knative-serving.svc.cluster.local.:8080" "-" - - 10.68.254.15:8080 172.20.9.17:51582 -
[2019-06-18T13:11:26.704Z] "GET / HTTP/1.1" 404 NR "-" 0 0 0 - "-" "Go-http-client/1.1" "0029075a-1c7f-9fee-9551-3c7862992a4a" "autoscaler.knative-serving.svc.cluster.local.:8080" "-" - - 10.68.254.15:8080 172.20.9.17:51590 -
[2019-06-18T13:11:29.890Z] "GET /healthz HTTP/1.1" 500 - "-" 0 40 1 0 "-" "kube-probe/1.13" "f8b5097b-d827-9f72-8c3b-44e22bda7a87" "172.20.9.17:8012" "127.0.0.1:8012" inbound|80|http|activator-service.knative-serving.svc.cluster.local - 172.20.9.17:8012 172.20.9.1:58420 -
[2019-06-18T13:11:30.932Z] "GET / HTTP/1.1" 404 NR "-" 0 0 0 - "-" "Go-http-client/1.1" "7431f8ba-c511-96bf-9079-a47a913a59c2" "autoscaler.knative-serving.svc.cluster.local.:8080" "-" - - 10.68.254.15:8080 172.20.9.17:51598 -
[2019-06-18T13:11:32.223Z] "GET /healthz HTTP/1.1" 500 - "-" 0 40 1 0 "-" "kube-probe/1.13" "ae17054e-73ca-9a14-866f-bae098da2ac1" "172.20.9.17:8012" "127.0.0.1:8012" inbound|80|http|activator-service.knative-serving.svc.cluster.local - 172.20.9.17:8012 172.20.9.1:58426 -
[2019-06-18T13:11:35.459Z] "GET / HTTP/1.1" 404 NR "-" 0 0 0 - "-" "Go-http-client/1.1" "cb88188c-cf9d-9474-ba50-b3419d12d6b7" "autoscaler.knative-serving.svc.cluster.local.:8080" "-" - - 10.68.254.15:8080 172.20.9.17:51608 -
[2019-06-18T13:11:39.890Z] "GET /healthz HTTP/1.1" 500 - "-" 0 40 1 0 "-" "kube-probe/1.13" "e031aff1-1b53-9fdf-b28a-c4d2c0a3b076" "172.20.9.17:8012" "127.0.0.1:8012" inbound|80|http|activator-service.knative-serving.svc.cluster.local - 172.20.9.17:8012 172.20.9.1:58440 -
[2019-06-18T13:11:42.222Z] "GET /healthz HTTP/1.1" 500 - "-" 0 40 3 2 "-" "kube-probe/1.13" "06d1fcb9-c392-9db1-bc88-c63795a263ce" "172.20.9.17:8012" "127.0.0.1:8012" inbound|80|http|activator-service.knative-serving.svc.cluster.local - 172.20.9.17:8012 172.20.9.1:58444 -
[2019-06-18T13:11:13.682Z] "- - -" 0 - "-" 2899 132999 28591 - "-" "-" "-" "-" "192.168.199.181:6443" outbound|443||kubernetes.default.svc.cluster.local 172.20.9.17:44104 10.68.0.1:443 172.20.9.17:57854 -

yuxiaoba (Author) commented:

@gyliu513

gyliu513 (Contributor) commented:

@zxDiscovery can you help check this?

zxDiscovery commented:

@yuxiaoba
Can you show me the Gateway information in your environment?

kubectl get gateway -n knative-serving

yuxiaoba (Author) commented:

@zxDiscovery Thank you for your help.

$ kubectl get gateway -n knative-serving
NAME                      AGE
cluster-local-gateway     12h
knative-ingress-gateway   12h

$ kubectl get gateway -n knative-serving -o yaml
apiVersion: v1
items:
- apiVersion: networking.istio.io/v1alpha3
  kind: Gateway
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"networking.istio.io/v1alpha3","kind":"Gateway","metadata":{"annotations":{},"labels":{"networking.knative.dev/ingress-provider":"istio","serving.knative.dev/release":"v0.6.1"},"name":"cluster-local-gateway","namespace":"knative-serving"},"spec":{"selector":{"istio":"cluster-local-gateway"},"servers":[{"hosts":["*"],"port":{"name":"http","number":80,"protocol":"HTTP"}}]}}
    creationTimestamp: "2019-06-18T12:32:45Z"
    generation: 1
    labels:
      networking.knative.dev/ingress-provider: istio
      serving.knative.dev/release: v0.6.1
    name: cluster-local-gateway
    namespace: knative-serving
    resourceVersion: "1643314"
    selfLink: /apis/networking.istio.io/v1alpha3/namespaces/knative-serving/gateways/cluster-local-gateway
    uid: 2c984332-91c5-11e9-b455-5254004f8e0e
  spec:
    selector:
      istio: cluster-local-gateway
    servers:
    - hosts:
      - '*'
      port:
        name: http
        number: 80
        protocol: HTTP
- apiVersion: networking.istio.io/v1alpha3
  kind: Gateway
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"networking.istio.io/v1alpha3","kind":"Gateway","metadata":{"annotations":{},"labels":{"networking.knative.dev/ingress-provider":"istio","serving.knative.dev/release":"v0.6.1"},"name":"knative-ingress-gateway","namespace":"knative-serving"},"spec":{"selector":{"istio":"ingressgateway"},"servers":[{"hosts":["*"],"port":{"name":"http","number":80,"protocol":"HTTP"}},{"hosts":["*"],"port":{"name":"https","number":443,"protocol":"HTTPS"},"tls":{"mode":"PASSTHROUGH"}}]}}
    creationTimestamp: "2019-06-18T12:32:45Z"
    generation: 1
    labels:
      networking.knative.dev/ingress-provider: istio
      serving.knative.dev/release: v0.6.1
    name: knative-ingress-gateway
    namespace: knative-serving
    resourceVersion: "1643312"
    selfLink: /apis/networking.istio.io/v1alpha3/namespaces/knative-serving/gateways/knative-ingress-gateway
    uid: 2c815ed7-91c5-11e9-b455-5254004f8e0e
  spec:
    selector:
      istio: ingressgateway
    servers:
    - hosts:
      - '*'
      port:
        name: http
        number: 80
        protocol: HTTP
    - hosts:
      - '*'
      port:
        name: https
        number: 443
        protocol: HTTPS
      tls:
        mode: PASSTHROUGH
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

zxDiscovery commented:

@yuxiaoba Please show the information about VirtualService in your environment. Thanks!

kubectl get virtualService

yuxiaoba (Author) commented Jun 19, 2019

@zxDiscovery My environment doesn't seem to have any VirtualService resources:

$ kubectl get virtualservice --all-namespaces
No resources found.

mattmoor (Member) commented:

The activator is attempting to reach the autoscaler, and won't report healthy until it does. If you are restricting network traffic, then you have to allow this for it to ever become healthy. This is what the activator logs tell me.

You should also share a describe of that pod, which will hopefully corroborate that hypothesis.

yuxiaoba (Author) commented Jun 20, 2019

@mattmoor Thank you for your help.
Here is the describe output for that pod:

kubectl describe pod -n knative-serving activator-dfdb7f85-5467c
  Normal   Created      41m                 kubelet, 192.168.199.190  Created container
  Warning  Unhealthy    40m (x11 over 41m)  kubelet, 192.168.199.190  Readiness probe failed: HTTP probe failed with statuscode: 500
  Normal   Started      15m (x14 over 41m)  kubelet, 192.168.199.190  Started container
  Warning  BackOff      7s (x167 over 39m)  kubelet, 192.168.199.190  Back-off restarting failed container

Also, I am new to Istio and Knative, so I do not know how I might have restricted network traffic. Is it this?

# Source: istio/charts/security/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-security-custom-resources
  namespace: istio-system
  labels:
    app: security
    chart: security
    heritage: Tiller
    release: release-name
    istio: citadel
data:
  custom-resources.yaml: |-
    # Authentication policy to enable permissive mode for all services (that have sidecar) in the mesh.
    apiVersion: "authentication.istio.io/v1alpha1"
    kind: "MeshPolicy"
    metadata:
      name: "default"
      labels:
        app: security
        chart: security
        heritage: Tiller
        release: release-name
    spec:
      peers:
      - mtls:
          mode: PERMISSIVE

mattmoor (Member) commented:

As I suspected, the activator's readiness probes are failing, seemingly because it cannot open a websocket to the autoscaler's metrics endpoint.

mattmoor (Member) commented:

I'm not sure why...

yuxiaoba (Author) commented:

@mattmoor I am also confused about this. Is this a bug in the code, or is there an error in the Knative Serving deployment YAML?

mattmoor (Member) commented:

I don't see any indication that this is a Knative problem (yet). I'd launch an ubuntu container with a sidecar in the knative-serving namespace and try to curl the autoscaler service from it.
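
For reference, a minimal sketch of that kind of check (assuming the mesh honors the sidecar.istio.io/inject annotation; the pod name and image here are arbitrary):

# Run a throwaway pod with an Envoy sidecar in the knative-serving namespace.
kubectl run net-debug -n knative-serving --restart=Never -it \
  --image=ubuntu:18.04 \
  --overrides='{"apiVersion":"v1","metadata":{"annotations":{"sidecar.istio.io/inject":"true"}}}' \
  -- bash

# Inside the pod, install curl and try to reach the autoscaler service:
apt-get update && apt-get install -y curl
curl -v http://autoscaler.knative-serving.svc.cluster.local:8080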

patrickshan commented Jun 20, 2019

We are seeing a similar issue when Knative is under some load (around 450 ksvc). Currently we are running Knative 0.6.1.

Results of some of the commands mentioned above:

$ kubectl get gateway -n knative-serving
NAME                      AGE
cluster-local-gateway     40d
knative-ingress-gateway   40d
$ kubectl get virtualService -n knative-serving --no-headers
route-00e90f5c-930f-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-6oe146xrcc3ifhch8iz.knative.test.example.com dev-test-6oe146xrcc3ifhch8iz.knative.svc.cluster.local]       33m
route-0da6136e-930f-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-88czshxetkdu0lejrfy6.knative.test.example.com dev-test-88czshxetkdu0lejrfy6.knative.svc.cluster.local]     27m
route-312f9031-930f-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-e0q247cj3d7k45i8yfas.knative.test.example.com dev-test-e0q247cj3d7k45i8yfas.knative.svc.cluster.local]     20m
route-407a0e40-930f-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-g0m7713xog5u32dqkf7.knative.test.example.com dev-test-g0m7713xog5u32dqkf7.knative.svc.cluster.local]       15m
route-d36e1ded-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-064qojgxhoj79jtkx85k.knative.test.example.com dev-test-064qojgxhoj79jtkx85k.knative.svc.cluster.local]     35m
route-d3e3d532-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-07f5l78xqrhv2wdx1vql.knative.test.example.com dev-test-07f5l78xqrhv2wdx1vql.knative.svc.cluster.local]     35m
route-d45a0690-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-098whrdu50ugaukqi19.knative.test.example.com dev-test-098whrdu50ugaukqi19.knative.svc.cluster.local]       35m
route-d4a80fdc-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-09gt2knxqc6o8slap0n3.knative.test.example.com dev-test-09gt2knxqc6o8slap0n3.knative.svc.cluster.local]     35m
route-d54b45f0-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-0af0co4r5z4alym6s8a.knative.test.example.com dev-test-0af0co4r5z4alym6s8a.knative.svc.cluster.local]       35m
route-d5c81091-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-0c9jjv26ipb5gci6sqzi.knative.test.example.com dev-test-0c9jjv26ipb5gci6sqzi.knative.svc.cluster.local]     35m
route-d6349933-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-0g6m1w9hsrn8xm1q91jm.knative.test.example.com dev-test-0g6m1w9hsrn8xm1q91jm.knative.svc.cluster.local]     35m
route-d6bae3d9-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-0i1g1f0hxvkrpv8be1pi.knative.test.example.com dev-test-0i1g1f0hxvkrpv8be1pi.knative.svc.cluster.local]     35m
route-d739a8d4-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-0vsbctd82q3l38uzood.knative.test.example.com dev-test-0vsbctd82q3l38uzood.knative.svc.cluster.local]       35m
route-d7cf85df-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-0ytgtl3t9crdvj3gdxr.knative.test.example.com dev-test-0ytgtl3t9crdvj3gdxr.knative.svc.cluster.local]       35m
route-d858cd26-930e-11e9-861e-06a53c0a941c   [knative-ingress-gateway mesh]   [dev-test-110ytza8c8pxftn05sn.knative.test.example.com dev-test-110ytza8c8pxftn05sn.knative.svc.cluster.local]       35m
......
$ kubectl get virtualService -n knative-serving --no-headers  | wc -l
91
$ kubectl get ksvc -n knative --no-headers | wc -l
449

We can see some error logs from the Knative autoscaler container, which failed to talk to the kube-apiserver, but at the same time we were able to reach the kube-apiserver without problems from our local machines, and all other parts of the cluster were running fine.

{"level":"error","ts":"2019-06-20T04:27:49.373Z","logger":"activator","caller":"activator/main.go:147","msg":"Failed to get k8s version","knative.dev/controller":"activator","error":"Get https://10.32.0.1:443/version?timeout=32s: dial tcp 10.32.0.1:443: connect: connection refused","stacktrace":"main.main.func1\n\t/go/src/github.com/knative/serving/cmd/activator/main.go:147\ngithub.com/knative/serving/vendor/k8s.io/apimachinery/pkg/util/wait.WaitFor\n\t/go/src/github.com/knative/serving/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:331\ngithub.com/knative/serving/vendor/k8s.io/apimachinery/pkg/util/wait.pollInternal\n\t/go/src/github.com/knative/serving/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:227\ngithub.com/knative/serving/vendor/k8s.io/apimachinery/pkg/util/wait.pollImmediateInternal\n\t/go/src/github.com/knative/serving/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:252\ngithub.com/knative/serving/vendor/k8s.io/apimachinery/pkg/util/wait.PollImmediate\n\t/go/src/github.com/knative/serving/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:241\nmain.main\n\t/go/src/github.com/knative/serving/cmd/activator/main.go:145\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}

patrickshan commented:

After looking further into our issue, it seems to be related to the sidecar container resource limits for the activator and autoscaler. Apparently the default 128Mi of memory for the sidecar container isn't enough with a 450-ksvc workload. We bumped those limits by adding annotations to the PodSpec of their deployments: sidecar.istio.io/proxyCPU: 500m and sidecar.istio.io/proxyMemory: 512Mi. After that, both the autoscaler and the activator are up and running.
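
Roughly what that change looks like in the deployment spec (a sketch; only the annotations are the relevant part, and the values are the ones that worked for our load):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: activator
  namespace: knative-serving
spec:
  template:
    metadata:
      annotations:
        # Istio sidecar resource overrides, read by the injector
        sidecar.istio.io/proxyCPU: "500m"
        sidecar.istio.io/proxyMemory: "512Mi"

The annotations are only picked up at injection time, so the pods have to be recreated for the new limits to take effect.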

mattmoor (Member) commented:

@yuxiaoba Any chance you are doing this at a scale similar to @patrickshan's?

mattmoor (Member) commented:

@yuxiaoba could you try starting a container in the knative-serving namespace with a sidecar and curling the autoscaler service?

I notice it doesn't have a readiness probe itself, which is troubling.

yuxiaoba (Author) commented Jun 21, 2019

@mattmoor I have tried this, but as I am new to this, I may have done something incorrectly.

First, start a container:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx
  namespace: knative-serving
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.13-alpine
        ports:
        - name: http
          containerPort: 80

---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: knative-serving
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: ClusterIP

kubectl apply -f <(istioctl kube-inject -f nginx.yaml)

Second, show the status of the pods:

kubectl get pod -n knative-serving 
NAME                                READY   STATUS             RESTARTS   AGE
activator-dfdb7f85-pckpn            1/2     CrashLoopBackOff   27         69m
autoscaler-565cdf546b-fkdwk         2/2     Running            0          69m
controller-78564fd45c-ck4p2         1/1     Running            0          69m
networking-istio-6d95d868fb-mt6jc   1/1     Running            0          69m
nginx-798c57cf6f-9jkxx              2/2     Running            0          11m
webhook-5dc6f74b5b-8vvd2            1/1     Running            0          69m

kubectl get svc -n knative-serving 
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
activator-service   ClusterIP   10.68.140.88    <none>        80/TCP,81/TCP,9090/TCP   69m
autoscaler          ClusterIP   10.68.86.227    <none>        8080/TCP,9090/TCP        69m
controller          ClusterIP   10.68.145.199   <none>        9090/TCP                 69m
nginx               ClusterIP   10.68.194.81    <none>        80/TCP                   30m
webhook             ClusterIP   10.68.64.29     <none>        443/TCP                  69m

Third, curl the autoscaler service and show the results:

kubectl exec -it -n  knative-serving nginx-798c57cf6f-9jkxx -- /bin/sh
/ # curl 10.68.86.227:8080
Bad Request
/ # curl 10.68.86.227:9090
404 page not found
/# curl autoscaler.knative-serving.svc.cluster.local.:8080
Bad Request

mattmoor (Member) commented:

What if you add -H "k-network-probe: activator"? I also wonder whether it is websocket support that's the problem (cc @tcnghia).
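
For reference, from the nginx pod above the suggested probe would look roughly like this (a sketch; the host and port come from the kubectl get svc output earlier, and whether the autoscaler answers such probes is exactly what is in question):

curl -v -H "k-network-probe: activator" http://autoscaler.knative-serving.svc.cluster.local:8080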

yuxiaoba (Author) commented Jun 21, 2019

@mattmoor Do you mean changing the YAML like this?

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx
  namespace: knative-serving
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.13-alpine
        livenessProbe:
          httpGet:
            httpHeaders:
            - name: k-kubelet-probe
              value: activator
            path: /healthz
            port: 8012
        ports:
        - name: http
          containerPort: 80
        - name: http1
          containerPort: 8012
        readinessProbe:
          httpGet:
            httpHeaders:
            - name: k-kubelet-probe
              value: activator
            path: /healthz
            port: 8012

mattmoor (Member) commented:

Actually nevermind, the autoscaler doesn't respond to network probes.

yuxiaoba (Author) commented:

@mattmoor I think this problem may be in the sidecar, because when I install Istio without the sidecar, the activator pod runs normally.

tcnghia (Contributor) commented Jun 21, 2019

@yuxiaoba did you enable authorization in your Istio installation?

You may want to look at https://istio.io/docs/concepts/security/#enabling-authorization and exclude the namespace knative-serving, to see if this is still a problem.
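
If authorization is enabled, the exclusion would look roughly like this (a sketch based on the Istio 1.1 ClusterRbacConfig API; only relevant if RBAC is actually turned on in the mesh):

apiVersion: rbac.istio.io/v1alpha1
kind: ClusterRbacConfig
metadata:
  name: default
spec:
  mode: ON_WITH_EXCLUSION
  exclusion:
    # Leave knative-serving out of authorization enforcement
    namespaces: ["knative-serving"]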

@yuxiaoba
Copy link
Author

@tcnghia

# cat /etc/resolv.conf
nameserver 10.68.0.2
search istio-system.svc.cluster.local. svc.cluster.local. cluster.local.
options ndots:5

tcnghia (Contributor) commented Jun 22, 2019

Thanks, I believe https://github.com/knative/serving/blob/master/pkg/network/domain.go#L69 didn't handle this right.

We need to strip the trailing dot.
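
A minimal sketch of the kind of fix (illustrative only; the actual names and flow in pkg/network/domain.go differ): strip the trailing dot that a resolv.conf search entry can carry before using the cluster domain to build hostnames.

package main

import (
	"fmt"
	"strings"
)

// normalizeClusterDomain is a hypothetical helper: a domain read from a
// search line such as "cluster.local." (as in the resolv.conf shown above)
// becomes "cluster.local", so hostnames built from it do not end in a dot.
func normalizeClusterDomain(domain string) string {
	return strings.TrimSuffix(domain, ".")
}

func main() {
	fmt.Println(normalizeClusterDomain("cluster.local.")) // prints: cluster.local
}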

majuansari commented:

@yuxiaoba Were you able to fix this? I am still getting this locally on macOS.

yuxiaoba (Author) commented Aug 2, 2019

@majuansari I fixed this by changing CLUSTER_DNS_DOMAIN from cluster.local. to cluster.local (removing the trailing dot).
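
To check whether this applies to you, look at the search line inside any pod; after the change it should carry no trailing dots. (CLUSTER_DNS_DOMAIN is the variable name in my cluster deployment tooling and may differ in yours.)

kubectl exec -n knative-serving <any-pod> -- cat /etc/resolv.conf
# before: search knative-serving.svc.cluster.local. svc.cluster.local. cluster.local.
# after:  search knative-serving.svc.cluster.local svc.cluster.local cluster.local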

jgnoonan commented:

I am having this issue with OpenShift 3.11. I should note that I am using the multitenant CNI from Red Hat, which won't allow project-to-project communications except from the default namespace. Could this be the issue?

vagababov (Contributor) commented Aug 26, 2019 via email

jgnoonan commented:

Yup, that was it. I ran the following command to join the project networks, and both pods are now running. Again, this worked because I am using the multi-tenant SDN from Red Hat:

oc adm pod-network join-projects --to=istio-system knative-serving knative-build knative-monitoring

marcjimz commented:

> After looking further into our issue, it seems to be related to the sidecar container resource limits for the activator and autoscaler. Apparently the default 128Mi of memory for the sidecar container isn't enough with a 450-ksvc workload. We bumped those limits by adding annotations to the PodSpec of their deployments: sidecar.istio.io/proxyCPU: 500m and sidecar.istio.io/proxyMemory: 512Mi. After that, both the autoscaler and the activator are up and running.

Just want to comment that this ended up being our solution, as we had tested our system to over 2,000 ksvcs. Fortunately, our entire service mesh didn't require full service discovery, so we limited the egress via Sidecar definitions (a sketch follows the output below). But it's probably not scalable to keep increasing the resources on the activator pod, so we're hoping that in the future we can reduce the number of routes kept in the sidecar.

[~/Documents]: istioctl proxy-config clusters activator-668db7f4d5-4q986 -n knative-serving | wc -l
    8707
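
Roughly what limiting the egress looks like (a sketch of an Istio Sidecar resource; the actual host list depends on which namespaces the knative-serving pods need to reach):

apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: knative-serving
spec:
  egress:
  - hosts:
    # Keep only routes for the local namespace and istio-system in the
    # sidecar, instead of every service in the mesh.
    - "./*"
    - "istio-system/*"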
