Switch off alpine as a base image #2051

Closed

mjpitz opened this issue May 15, 2019 · 12 comments

mjpitz commented May 15, 2019

Describe the bug

In versions of alpine 3.4 and later, a name resolution issue was introduced that causes problems with certain deployments of Kubernetes. Because of this bug, I've run into a problem where flux cannot clone the cluster-state repository, rendering the weaveworks-flux project useless.

This issue appears to be inconsistent and difficult to reproduce. Because of these inconsistencies, I'd suggest migrating off of alpine as the base image. This should create a more stable image for consumers of the system to leverage.

To Reproduce

What's your setup?

OpenStack, running a CentOS 7 base image with Docker. I used Rancher's RKE to provision the cluster with kube-dns and Canal (really just the out-of-the-box configuration). After the cluster is up and running, you can test the various versions of alpine to see name resolution support against the cluster.

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.3 nslookup google.com 
nslookup: can't resolve '(null)': Name does not resolve

Name:      google.com
Address 1: 172.217.14.174 dfw28s22-in-f14.1e100.net
Address 2: 2607:f8b0:4000:806::200e dfw28s22-in-x0e.1e100.net
pod "alpine" deleted

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.4 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.5 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.6 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.7 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.8 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

$ kubectl run -it --rm --restart=Never alpine --image=alpine:3.9 nslookup google.com 
If you don't see a command prompt, try pressing enter.
nslookup: can't resolve 'google.com': Try again
pod "alpine" deleted
pod default/alpine terminated (Error)

The same behavior occurs when running nslookup against AWS CodeCommit (git-codecommit.{region}.amazonaws.com). When you enable debug logging on the DNS server, you can see it exhaust the full chain of search domains.

I0514 20:43:14.761492       1 nanny.go:116] dnsmasq[13]: query[A] google.com.default.svc.cluster.local from 10.42.2.7
I0514 20:43:14.761519       1 nanny.go:116] dnsmasq[13]: forwarded google.com.default.svc.cluster.local to 127.0.0.1
I0514 20:43:14.761525       1 nanny.go:116] dnsmasq[13]: query[AAAA] google.com.default.svc.cluster.local from 10.42.2.7
I0514 20:43:14.761533       1 nanny.go:116] dnsmasq[13]: forwarded google.com.default.svc.cluster.local to 127.0.0.1
I0514 20:43:14.761747       1 nanny.go:116] dnsmasq[13]: reply google.com.default.svc.cluster.local is NXDOMAIN
I0514 20:43:14.761764       1 nanny.go:116] dnsmasq[13]: reply google.com.default.svc.cluster.local is NXDOMAIN
I0514 20:43:14.762392       1 nanny.go:116] dnsmasq[13]: query[A] google.com.svc.cluster.local from 10.42.2.7
I0514 20:43:14.762476       1 nanny.go:116] dnsmasq[13]: forwarded google.com.svc.cluster.local to 127.0.0.1
I0514 20:43:14.762483       1 nanny.go:116] dnsmasq[13]: query[AAAA] google.com.svc.cluster.local from 10.42.2.7
I0514 20:43:14.762528       1 nanny.go:116] dnsmasq[13]: forwarded google.com.svc.cluster.local to 127.0.0.1
I0514 20:43:14.762807       1 nanny.go:116] dnsmasq[13]: reply google.com.svc.cluster.local is NXDOMAIN
I0514 20:43:14.762898       1 nanny.go:116] dnsmasq[13]: reply google.com.svc.cluster.local is NXDOMAIN
I0514 20:43:14.763491       1 nanny.go:116] dnsmasq[13]: query[A] google.com.cluster.local from 10.42.2.7
I0514 20:43:14.763568       1 nanny.go:116] dnsmasq[13]: forwarded google.com.cluster.local to 127.0.0.1
I0514 20:43:14.763579       1 nanny.go:116] dnsmasq[13]: query[AAAA] google.com.cluster.local from 10.42.2.7
I0514 20:43:14.763679       1 nanny.go:116] dnsmasq[13]: forwarded google.com.cluster.local to 127.0.0.1
I0514 20:43:14.763899       1 nanny.go:116] dnsmasq[13]: reply google.com.cluster.local is NXDOMAIN
I0514 20:43:14.763914       1 nanny.go:116] dnsmasq[13]: reply google.com.cluster.local is NXDOMAIN

Once the full chain is exhausted, it looks like it never attempts google.com directly. (google.com has fewer dots than the ndots:5 threshold, so the resolver walks the search list first; it should then fall back to querying the absolute name, but that query never shows up in the log.)

Expected behavior
The weaveworks/flux project should be able to resolve host names that contain fewer dots than the ndots value of 5 (as configured by Kubernetes in /etc/resolv.conf).

Logs
The resulting fluxd logs show that it simply fails to clone the repo for any git URL whose host contains fewer than 5 dots:

ts=2019-04-25T14:52:11.311781787Z caller=loop.go:90 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository."

Additional context


  • Flux version: all versions of flux (afaict)
  • Helm Operator version: un-tested
  • Kubernetes version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Git provider: AWS CodeCommit
  • Container registry provider: N/A
mjpitz added the blocked-needs-validation and bug labels on May 15, 2019
hiddeco removed the blocked-needs-validation label on May 15, 2019

hiddeco commented May 15, 2019

Bug report quality 11/10.


We are looking into switching to a different base image and I expect a PR before the end of the week (probably tomorrow). Please bear with us until then.


mjpitz commented May 15, 2019

+1 @hiddeco Thanks!


mjpitz commented May 15, 2019

I was able to backport the current image to alpine-3.3 in the meantime. This unblocks me for now. Will keep an eye out for the new image.


mjpitz commented May 20, 2019

As a note, I was seeing similar nslookup issues using minideb.


hiddeco commented May 20, 2019

@mjpitz would you be able to give this image a try? hiddeco/flux:2015-base-debian-a2775981


mjpitz commented May 20, 2019

nslookup isn't installed on Debian by default:

$ kubectl run -it --rm --restart=Never nslookup --image=hiddeco/flux:2015-base-debian-a2775981 --image-pull-policy=Always --command sh -- -c nslookup google.com
google.com: 1: google.com: nslookup: not found
pod "nslookup" deleted
pod default/nslookup terminated (Error)


mjpitz commented May 20, 2019

This morning, I was able to bypass the ndots issue by specifically setting it to "1" for the flux deployment container.
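
For reference, a minimal sketch of that workaround as a pod-spec fragment (field names follow the standard Kubernetes dnsConfig API, available on K8S >= 1.10; this is a sketch only, not the exact manifest, and your Deployment layout may differ):

# Fragment of the flux Deployment's pod template -- only the dnsConfig part is shown.
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "1"  # names with at least one dot are tried as absolute before the search list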


hiddeco commented May 20, 2019

@mjpitz this one has: hiddeco/flux:2015-base-debian-a2775981-wip


mjpitz commented May 20, 2019

$ kubectl run -it --rm --restart=Never nslookup --image=hiddeco/flux:2015-base-debian-a2775981-wip --image-pull-policy=Always --command sh
If you don't see a command prompt, try pressing enter.

# nslookup google.com
Server:         10.43.0.10
Address:        10.43.0.10#53

** server can't find google.com.cluster.local: SERVFAIL

# cat /etc/resolv.conf
nameserver 10.43.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Interesting. Seeing the same issue with that one. I was able to successfully resolve with ubuntu last week, but I'm hitting the same issues again this week.

$ kubectl run -it --rm --restart=Never nslookup --image=mjpitz/dockerfiles-debutils --image-pull-policy=Always -- nslookup google.com
Server:         10.43.0.10
Address:        10.43.0.10#53

** server can't find google.com.cluster.local: SERVFAIL

pod "nslookup" deleted
pod default/nslookup terminated (Error)


hiddeco commented May 21, 2019

> Seeing the same issue with that one. I was able to successfully resolve with ubuntu last week, but I'm hitting the same issues again this week.

This lowers my expectations of fixing it by moving away from Alpine (at least, for this particular bug).

> This morning, I was able to bypass the ndots issue by specifically setting it to "1" for the flux deployment container.

This is something people can control by adapting their pod config (K8S >= 1.10), and something we can incorporate in our chart.

https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config
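
Roughly, the chart could expose that as a pass-through value that gets templated into the Deployment's pod spec. A sketch with hypothetical values.yaml keys (illustrative names only, not necessarily the chart's final API):

# Hypothetical values.yaml fragment -- key names are illustrative only.
dnsPolicy: "ClusterFirst"  # the default; dnsConfig settings are merged with the policy-generated config
dnsConfig:
  options:
    - name: ndots
      value: "1"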


mjpitz commented May 21, 2019

> This lowers my expectations of fixing it by moving away from Alpine (at least, for this particular bug).

Agreed. I've been going through a slew of images and it seems pretty consistent across the various operating systems. CentOS worked, but that's too heavy of a base image for something like this.

> we can incorporate in our chart

Definitely seems like the way to go for something like this, sorry for the misdirection on the base image. I imagine some people are hosting their state repos out of GitHub, so it probably needs to be set to "1".

hiddeco added commits that referenced this issue on May 31 and Jun 3, 2019:

Mainly to provide people with the tools to overcome nslookup issues on
certain Kubernetes setups, as one solution seems to be to configure the
ndots value to "1".

Ref: #2051 (comment)

hiddeco commented Jun 3, 2019

With #2116 merged, this got resolved in an alternative way.

hiddeco closed this as completed on Jun 3, 2019