Switch off alpine as a base image #2051

Closed

mjpitz opened this issue May 15, 2019 · 12 comments

mjpitz commented May 15, 2019

Describe the bug

In versions of alpine 3.4 and later, a name resolution issue was introduced that causes problems with certain deployments of Kubernetes. Because of this bug, I've run into a problem where flux cannot clone the cluster-state repository, rendering the weaveworks-flux project useless.

This issue appears to be inconsistent and difficult to reproduce. Because of these inconsistencies, I'd suggest migrating off of alpine as the base image. This should create a more stable image for consumers of the system to leverage.

To Reproduce

What's your setup?

OpenStack, running a CentOS 7 base image with Docker. I used Rancher's RKE to provision the cluster with kube-dns and Canal (really just the out-of-the-box configuration). After the cluster is up and running, you can test the various versions of alpine to see name resolution support against the cluster.

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.3 nslookup google.com 
nslookup: can't resolve '(null)': Name does not resolve

Name:      google.com
Address 1: 172.217.14.174 dfw28s22-in-f14.1e100.net
Address 2: 2607:f8b0:4000:806::200e dfw28s22-in-x0e.1e100.net
pod "alpine" deleted

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.4 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.5 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.6 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.7 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.8 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.9 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

The same behavior occurs when running nslookup against AWS CodeCommit (git-codecommit.{region}.amazonaws.com). When you enable debug logging on the DNS server, you can see it exhaust the full chain of search domains.

I0514 20:43:14.761492       1 nanny.go:116] dnsmasq[13]: query[A] google.com.default.svc.cluster.local from 10.42.2.7
I0514 20:43:14.761519       1 nanny.go:116] dnsmasq[13]: forwarded google.com.default.svc.cluster.local to 127.0.0.1
I0514 20:43:14.761525       1 nanny.go:116] dnsmasq[13]: query[AAAA] google.com.default.svc.cluster.local from 10.42.2.7
I0514 20:43:14.761533       1 nanny.go:116] dnsmasq[13]: forwarded google.com.default.svc.cluster.local to 127.0.0.1
I0514 20:43:14.761747       1 nanny.go:116] dnsmasq[13]: reply google.com.default.svc.cluster.local is NXDOMAIN
I0514 20:43:14.761764       1 nanny.go:116] dnsmasq[13]: reply google.com.default.svc.cluster.local is NXDOMAIN
I0514 20:43:14.762392       1 nanny.go:116] dnsmasq[13]: query[A] google.com.svc.cluster.local from 10.42.2.7
I0514 20:43:14.762476       1 nanny.go:116] dnsmasq[13]: forwarded google.com.svc.cluster.local to 127.0.0.1
I0514 20:43:14.762483       1 nanny.go:116] dnsmasq[13]: query[AAAA] google.com.svc.cluster.local from 10.42.2.7
I0514 20:43:14.762528       1 nanny.go:116] dnsmasq[13]: forwarded google.com.svc.cluster.local to 127.0.0.1
I0514 20:43:14.762807       1 nanny.go:116] dnsmasq[13]: reply google.com.svc.cluster.local is NXDOMAIN
I0514 20:43:14.762898       1 nanny.go:116] dnsmasq[13]: reply google.com.svc.cluster.local is NXDOMAIN
I0514 20:43:14.763491       1 nanny.go:116] dnsmasq[13]: query[A] google.com.cluster.local from 10.42.2.7
I0514 20:43:14.763568       1 nanny.go:116] dnsmasq[13]: forwarded google.com.cluster.local to 127.0.0.1
I0514 20:43:14.763579       1 nanny.go:116] dnsmasq[13]: query[AAAA] google.com.cluster.local from 10.42.2.7
I0514 20:43:14.763679       1 nanny.go:116] dnsmasq[13]: forwarded google.com.cluster.local to 127.0.0.1
I0514 20:43:14.763899       1 nanny.go:116] dnsmasq[13]: reply google.com.cluster.local is NXDOMAIN
I0514 20:43:14.763914       1 nanny.go:116] dnsmasq[13]: reply google.com.cluster.local is NXDOMAIN

Once the full chain is exhausted, it looks like it never attempts google.com directly. (google.com has fewer dots than the ndots:5 threshold, so the resolver walks the search list first; it should then fall back to querying the absolute name, but that query never shows up in the log.)

Expected behavior
The weaveworks/flux project should be able to resolve host names that contain fewer dots than the ndots value of 5 (as configured by Kubernetes in /etc/resolv.conf).

Logs
The resulting fluxd logs show that it simply fails to clone the repo for any git URL whose host contains fewer than 5 dots:

ts=2019-04-25T14:52:11.311781787Z caller=loop.go:90 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository."

Additional context


  • Flux version: all versions of flux (afaict)
  • Helm Operator version: un-tested
  • Kubernetes version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Git provider: AWS CodeCommit
  • Container registry provider: N/A
mjpitz added the blocked-needs-validation and bug labels on May 15, 2019
hiddeco removed the blocked-needs-validation label on May 15, 2019

hiddeco commented May 15, 2019

Bug report quality 11/10.


We are looking into switching to a different base image and I expect a PR before the end of the week (probably tomorrow). Please bear with us until then.


mjpitz commented May 15, 2019

+1 @hiddeco Thanks!


mjpitz commented May 15, 2019

I was able to backport the current image to alpine-3.3 in the meantime. This unblocks me for now. Will keep an eye out for the new image.


mjpitz commented May 20, 2019

As a note, I was seeing similar nslookup issues using minideb.


hiddeco commented May 20, 2019

@mjpitz would you be able to give this image a try? hiddeco/flux:2015-base-debian-a2775981


mjpitz commented May 20, 2019

nslookup isn't installed on Debian by default:

$ kubectl run -it --rm --restart=Never nslookup --image=hiddeco/flux:2015-base-debian-a2775981 --image-pull-policy=Always --command sh -- -c nslookup google.com
google.com: 1: google.com: nslookup: not found
pod "nslookup" deleted
pod default/nslookup terminated (Error)


mjpitz commented May 20, 2019

This morning, I was able to bypass the ndots issue by specifically setting it to "1" for the flux deployment container.
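
For reference, a minimal sketch of that workaround as a pod-spec fragment (field names follow the standard Kubernetes dnsConfig API, available on K8S >= 1.10; this is a sketch only, not the exact manifest, and your Deployment layout may differ):

# Fragment of the flux Deployment's pod template -- only the dnsConfig part is shown.
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "1"  # names with at least one dot are tried as absolute before the search list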


hiddeco commented May 20, 2019

@mjpitz this one has: hiddeco/flux:2015-base-debian-a2775981-wip


mjpitz commented May 20, 2019

$ kubectl run -it --rm --restart=Never nslookup --image=hiddeco/flux:2015-base-debian-a2775981-wip --image-pull-policy=Always --command sh
If you don't see a command prompt, try pressing enter.

# nslookup google.com
Server:         10.43.0.10
Address:        10.43.0.10#53

** server can't find google.com.cluster.local: SERVFAIL

# cat /etc/resolv.conf
nameserver 10.43.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Interesting. Seeing the same issue with that one. I was able to successfully resolve with ubuntu last week, but I'm hitting the same issues again this week.

$ kubectl run -it --rm --restart=Never nslookup --image=mjpitz/dockerfiles-debutils --image-pull-policy=Always -- nslookup google.com
Server:         10.43.0.10
Address:        10.43.0.10#53

** server can't find google.com.cluster.local: SERVFAIL

pod "nslookup" deleted
pod default/nslookup terminated (Error)


hiddeco commented May 21, 2019

> Seeing the same issue with that one. I was able to successfully resolve with ubuntu last week, but I'm hitting the same issues again this week.

This lowers my expectations of fixing it by moving away from Alpine (at least, for this particular bug).

> This morning, I was able to bypass the ndots issue by specifically setting it to "1" for the flux deployment container.

This is something people can control by adapting their pod config (K8S >= 1.10), and something we can incorporate in our chart.

https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config
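
Roughly, the chart could expose that as a pass-through value that gets templated into the Deployment's pod spec. A sketch with hypothetical values.yaml keys (illustrative names only, not necessarily the chart's final API):

# Hypothetical values.yaml fragment -- key names are illustrative only.
dnsPolicy: "ClusterFirst"  # the default; dnsConfig settings are merged with the policy-generated config
dnsConfig:
  options:
    - name: ndots
      value: "1"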


mjpitz commented May 21, 2019

> This lowers my expectations of fixing it by moving away from Alpine (at least, for this particular bug).

Agreed. I've been going through a slew of images and it seems pretty consistent across the various operating systems. CentOS worked, but that's too heavy of a base image for something like this.

> we can incorporate in our chart

Definitely seems like the way to go for something like this, sorry for the misdirection on the base image. I imagine some people are hosting their state repos out of GitHub, so it probably needs to be set to "1".

hiddeco added commits that referenced this issue on May 31 and Jun 3, 2019:

Mainly to provide people with the tools to overcome nslookup issues on
certain Kubernetes setups, as one solution seems to be to configure the
ndots value to "1".

Ref: #2051 (comment)

hiddeco commented Jun 3, 2019

With #2116 merged, this got resolved in an alternative way.

hiddeco closed this as completed on Jun 3, 2019