gps: remove arbitrary command timeouts #1110
Conversation
oh, cool! thanks for jumping right on this. this does go a little further than i'd like in an initial step, though.

as noted in one of the individual line comments, moving to `CommandContext` is a regression, as it relies on `Kill` semantics for subprocess termination.

so, my preferred route here would be to retain `monitoredCmd`'s interrupt-then-kill semantics, but modify it to have a mode without any kind of explicit timeout. then, for commands whose output we can't effectively monitor, we can avoid the timeout.

for example, `git clone` has `--progress`, so we can monitor it and retain the inactivity timeout. `git checkout` does not, however, because that flag was added in a sufficiently recent version of git that i didn't want to run the risk.
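A minimal sketch of what that could look like, assuming illustrative names (`runWithInactivityTimeout`, the one-second poll, the 10s grace period) rather than dep's actual `monitoredCmd`: watch the command's combined output, and only when it goes quiet for too long, interrupt first and hard-kill as a last resort.

```go
// NOT dep's monitoredCmd; a sketch of interrupt-then-kill termination
// driven by an optional inactivity timer.
package sketch

import (
	"os"
	"os/exec"
	"sync/atomic"
	"time"
)

// activityWriter records when the subprocess last produced output, so a
// watchdog can tell "slow but alive" apart from "hung".
type activityWriter struct {
	last atomic.Int64 // unix nanos of the most recent write
}

func (w *activityWriter) Write(p []byte) (int, error) {
	w.last.Store(time.Now().UnixNano())
	return len(p), nil
}

func runWithInactivityTimeout(inactivity time.Duration, name string, args ...string) error {
	w := &activityWriter{}
	w.last.Store(time.Now().UnixNano())

	cmd := exec.Command(name, args...)
	cmd.Stdout = w
	cmd.Stderr = w
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	if inactivity <= 0 {
		// "No explicit timeout" mode: just wait for the command to finish.
		return <-done
	}

	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-tick.C:
			if time.Since(time.Unix(0, w.last.Load())) < inactivity {
				continue
			}
			// No output for too long: interrupt so the tool can clean up
			// (avoiding half-written repositories), then hard-kill only if
			// it ignores the interrupt.
			cmd.Process.Signal(os.Interrupt)
			select {
			case err := <-done:
				return err
			case <-time.After(10 * time.Second):
				cmd.Process.Kill()
				return <-done
			}
		}
	}
}
```

Passing a non-positive inactivity duration gives the "no explicit timeout" mode described above, which is what a command without `--progress`-style output (e.g. `git checkout`) would use.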
internal/gps/vcs_repo.go
Outdated
out, err := runFromCwd(ctx, expensiveCmdTimeout, "git", "clone", "--recursive", "-v", "--progress", r.Remote(), r.LocalPath())
if err != nil {
	return newVcsRemoteErrorOr("unable to get repository", err, string(out))
cmd := exec.CommandContext(
relying on `CommandContext` represents a regression vs. our current logic, as it relies on `Process.Kill()` to terminate the subprocess. we used to do that, and it produced lots of corrupted repositories. what we have now still isn't without problems, but it's better than all that.
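For context, a hedged aside: in Go 1.20+ the `os/exec` package gained `Cmd.Cancel` and `Cmd.WaitDelay`, which let `CommandContext` interrupt first and only fall back to a kill after a grace period. These fields postdate this discussion; the helper name, arguments, and grace period below are purely illustrative.

```go
package sketch

import (
	"context"
	"os"
	"os/exec"
	"time"
)

// gentleCheckout is a hypothetical helper: it keeps CommandContext's context
// plumbing, but overrides the default cancel behavior (Process.Kill) with an
// interrupt, hard-killing only if the process lingers past a grace period.
func gentleCheckout(ctx context.Context, dir, rev string) error {
	cmd := exec.CommandContext(ctx, "git", "checkout", rev)
	cmd.Dir = dir
	cmd.Cancel = func() error {
		return cmd.Process.Signal(os.Interrupt) // ask nicely on cancellation
	}
	cmd.WaitDelay = time.Minute // then kill if it still hasn't exited
	return cmd.Run()
}
```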
internal/gps/vcs_repo.go
Outdated
if out, err := cmd.CombinedOutput(); err != nil {
	return newVcsRemoteErrorOr(
		"unable to update repository",
		errors.Wrapf(err, "command failed: %v", cmd.Args),
these errors are going to end up being `errors.Wrap`'d higher up in the call stack - i'm not sure we gain much by doing it again here.
I believe these wraps are preserving the wrap that was happening inside of `runFromRepoDir` via `combinedOutput`.
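A toy illustration of how the layers stack with `github.com/pkg/errors` (which the diff above already uses; the messages here are made up):

```go
package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func main() {
	// Each Wrap layer prepends its own message, so wrapping both at the call
	// site and again higher up the stack yields a chain like the one below.
	cause := errors.New("exit status 128")
	inner := errors.Wrapf(cause, "command failed: %v", []string{"git", "fetch"})
	outer := errors.Wrap(inner, "unable to update repository")
	fmt.Println(outer)
	// unable to update repository: command failed: [git fetch]: exit status 128
}
```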
ahh, i see, that's right, you just added those. OK
(aside - man i wish we had consistent strategies for error handling in various subsystems)
Alright, this is at least green now. I still think that the fancy stuff with
This is a good point; that's quite unfortunate that
💯 definitely should be, yeah.
we've seen all manner of weird vcs behavior and slowness reported, especially at higher concurrency, but we never really have enough context from reports to know if something's truly hung, let alone what it's hung on. however, if we're removing these inactivity-based timeouts, then
Alright, this is ready for a look (and is green on travis)! AppVeyor failed with a problem installing bzr, but I built a windows binary locally (
internal/gps/cmd_unix.go
Outdated
	// immediately to hard kill.
	c.cancel()
} else {
	time.AfterFunc(time.Minute, c.cancel)
The old threshold was 3s - why bump it up to a minute? (seeing it now, i have a thought about why this might be better, but i'm curious to hear your reasoning first)
also, just noting that this is slightly leaky - each of these timers will remain in memory until after the timeout expires. just ~200 bytes per, though, so i can't imagine a realistic scenario in which it makes any kind of difference.
Good call; plugged the leak!
I bumped it to a minute because there's no real benefit to being aggressive here; we expect most commands to shut down gracefully in a reasonable amount of time. I'm open to tuning this, but I think this value is just fine for now.
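One possible shape for that fix, with guessed field names and a placeholder `done` channel rather than the actual patch: keep the `*time.Timer` returned by `AfterFunc` and stop it once the process exits.

```go
package sketch

import (
	"os"
	"os/exec"
	"time"
)

// Illustrative stand-in for the real monitoredCmd; field names are guesses.
type monitoredCmd struct {
	cmd    *exec.Cmd
	cancel func() // hard-kills the subprocess
}

// terminate asks the process to exit, arms a one-minute hard-kill backstop,
// and disarms that backstop as soon as the process exits (done is closed when
// cmd.Wait returns), so the timer doesn't sit in memory for the full minute
// after a clean shutdown.
func (c *monitoredCmd) terminate(done <-chan struct{}) {
	c.cmd.Process.Signal(os.Interrupt)
	t := time.AfterFunc(time.Minute, c.cancel)
	<-done
	t.Stop()
}
```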
Use better context plumbing while I'm here.
yeah so, re: timeouts, my thinking went something like this: there are two scenarios in which that cancel gets tripped:
- on a signal
- on background work that was queued up during a normal solve run, but is now being terminated because `Release()` was called
the first scenario really doesn't concern me at all. the user has already issued an interrupt, which we've effectively passed on to child processes. if the child processes aren't exiting as we'd hoped, then this basically puts power back in the hands of the user to decide when to issue a hard kill, rather than doing it after the relatively short window of 3s.
the other scenario is slightly more concerning, as it means that errant subprocesses may end up delaying the exit of the command, and the user isn't already in the frame of mind to cancel. i'd say this scenario is unlikely, except it seems it literally just came up: #1163.
however, i think that a longer timeout still ends up being better. i mean, why terminate so fast? what are we trying to achieve? there's some potential running time harm, of course, but we default to the safe position on so many other things. we should do it here, too.
@sdboyer
fixes #1101
fixes #1034