Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quickly segfaults on FreeBSD #90

Closed
asomers opened this issue Apr 2, 2020 · 9 comments
Closed

Quickly segfaults on FreeBSD #90

asomers opened this issue Apr 2, 2020 · 9 comments

Comments

@asomers
Copy link
Contributor

asomers commented Apr 2, 2020

grok_exporter 1.0.0.RC3 (and master too) quickly segfault on FreeBSD. It takes anywhere from 0.1s to about 3 minutes.

$ ./grok_exporter -config config.yml 
Starting server on http://XXXXX.net:9146/metrics

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xf5a47e]

goroutine 1 [running]:
main.main()
	/usr/home/asomers/src/github/fstab/grok_exporter/grok_exporter.go:116 +0x1b0e

This happens when built with Go 1.14. So far we haven't observed the problem using Go 1.13.

$ go version
go version go1.14 freebsd/amd64
@fstab
Copy link
Owner

fstab commented Apr 13, 2020

Thanks for reporting this. I'll try to reproduce it. Are you running on an amd64 architecture (go version should print this information)?

@asomers
Copy link
Contributor Author

asomers commented Apr 13, 2020

Yes, this is amd64.

@fstab
Copy link
Owner

fstab commented Apr 19, 2020

I cannot reproduce this. I got a FreeBSD 12.1 x64 zfs Droplet on Digitalocean, and the latest go1.14.2.freebsd distribution from dl.golang.org. I ran pkg install git to clone grok_exporter, apart from that I didn't change anything. I needed to copy all *_darwin* source files to *_freebsd*, then I compiled it with go install. The resulting binary runs stable.

Do you see any difference between your setup and mine?

@fstab
Copy link
Owner

fstab commented Apr 19, 2020

I will keep it running for a few days.

@asomers
Copy link
Contributor Author

asomers commented Apr 19, 2020

It could have something to do with the type of files you're monitoring. I'll see if I can find a configuration that reproduces the failure quickly.

@fstab
Copy link
Owner

fstab commented Apr 23, 2020

I just checked again: It's still up and running after 4 days. So yes, if you have any way to reproduce this it would be awesome.

@asomers
Copy link
Contributor Author

asomers commented Apr 30, 2020

We have some progress to report. I still don't have any easy recipe for reproducing the problem, but we've determined that the failure is related to asyncronous preemption, which was introduced in go 1.14 (see https://golang.org/doc/go1.14#runtime).

We can see this asynchronous preemption in progress by using truss:

truss -o truss_0429_2057.log -eH ./grok_exporter -config /usr/local/etc/grok_exporter/config.yml :

...
106367: _umtx_op(0xbf2b90,UMTX_OP_WAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
106367: SIGNAL 16 (SIGURG) code=SI_LWP pid=73399 uid=1406
101696: _umtx_op(0xbf2b90,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x0) = 0 (0x0)
101148: thr_kill(106367,SIGURG)			 = 0 (0x0)
106367: sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },0x0) = 0 (0x0)
106367: sigreturn(0xc0001076c0)			 = 0 (0x0)
101148: nanosleep({ 0.000020000 })		 = 0 (0x0)
101696: nanosleep({ 0.000003000 })		 = 0 (0x0)
101148: thr_kill(106367,SIGURG)			 = 0 (0x0)
106367: SIGNAL 16 (SIGURG) code=SI_LWP pid=73399 uid=1406
106367: compat11.kevent(3,0x0,0,0xc000142a00,10,0x0) ERR#4 'Interrupted system call'
106367: SIGNAL 16 (SIGURG) code=SI_LWP pid=73399 uid=1406
106367: sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },0x0) = 0 (0x0)
106367: sigreturn(0xc0001076c0)			 ERR#4 'Interrupted system call'
...

Note the compat11.kevent call that returned EINTR.

Finally, we've found that disabling asynchronous preemption eliminates the problem. If we invoke the grok_exporter like this, nothing crashes:

set GODEBUG=asyncpreemptoff=1
go run grok_exporter.go -config /usr/local/etc/grok_exporter/config.yml

I see that grok_exporter is calling kevent in tailer/fswatcher/fseventProducerLoop_darwin.go, function runKeventLoop. So I think that the early return in the case of EINTR must be failing to set a variable, or some code later on must be wrongly assuming that some gochans will be non-empty. But unfortunately my go skills aren't good enough to provide a patch.

@a1exanderpetrov
Copy link

There are some details for a problem report:
grok_exporter fails in

for {
	select {
		case err := <-tail.Errors():
		...
		case line := <-tail.Lines():
		...
	}
}

with nil error or line, it seems that channels tail.Errors() and tail.Lines() are empty, and it's causing SIGSEGV.
and going to runKeventLoop() from fseventProducerLoop_darwin.go - if we won't exit from polling kevent in case of EINTR, grok_exporter won't crash with empty gochans values.

@fstab fstab closed this as completed in 6a3c1c0 Jul 5, 2020
@fstab
Copy link
Owner

fstab commented Jul 5, 2020

Fixed with #103

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants