gps: remove arbitrary command timeouts #1110
Conversation
oh, cool! thanks for jumping right on this. this does go a little further than i'd like in an initial step, though.

as noted in one of the individual line comments, moving to `CommandContext` is a regression, as it relies on `Kill` semantics for subprocess termination.

so, my preferred route here would be to retain `monitoredCmd`'s interrupt-then-kill semantics, but modify it to have a mode without any kind of explicit timeout. then, for commands whose output we can't effectively monitor, we can avoid the timeout.

for example, `git clone` has `--progress`, so we can monitor it and retain the inactivity timeout. `git checkout` does not, however, because that flag was added in a sufficiently recent version of git that i didn't want to run the risk.
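A minimal sketch of what that could look like, assuming illustrative names (`runWithInactivityTimeout`, the one-second poll, the 10s grace period) rather than dep's actual `monitoredCmd`: watch the command's combined output, and only when it goes quiet for too long, interrupt first and hard-kill as a last resort.

```go
// NOT dep's monitoredCmd; a sketch of interrupt-then-kill termination
// driven by an optional inactivity timer.
package sketch

import (
	"os"
	"os/exec"
	"sync/atomic"
	"time"
)

// activityWriter records when the subprocess last produced output, so a
// watchdog can tell "slow but alive" apart from "hung".
type activityWriter struct {
	last atomic.Int64 // unix nanos of the most recent write
}

func (w *activityWriter) Write(p []byte) (int, error) {
	w.last.Store(time.Now().UnixNano())
	return len(p), nil
}

func runWithInactivityTimeout(inactivity time.Duration, name string, args ...string) error {
	w := &activityWriter{}
	w.last.Store(time.Now().UnixNano())

	cmd := exec.Command(name, args...)
	cmd.Stdout = w
	cmd.Stderr = w
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	if inactivity <= 0 {
		// "No explicit timeout" mode: just wait for the command to finish.
		return <-done
	}

	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-tick.C:
			if time.Since(time.Unix(0, w.last.Load())) < inactivity {
				continue
			}
			// No output for too long: interrupt so the tool can clean up
			// (avoiding half-written repositories), then hard-kill only if
			// it ignores the interrupt.
			cmd.Process.Signal(os.Interrupt)
			select {
			case err := <-done:
				return err
			case <-time.After(10 * time.Second):
				cmd.Process.Kill()
				return <-done
			}
		}
	}
}
```

Passing a non-positive inactivity duration gives the "no explicit timeout" mode described above, which is what a command without `--progress`-style output (e.g. `git checkout`) would use.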
internal/gps/vcs_repo.go
Outdated
out, err := runFromCwd(ctx, expensiveCmdTimeout, "git", "clone", "--recursive", "-v", "--progress", r.Remote(), r.LocalPath())
if err != nil {
	return newVcsRemoteErrorOr("unable to get repository", err, string(out))
cmd := exec.CommandContext(
relying on `CommandContext` represents a regression vs. our current logic, as it relies on `Process.Kill()` to terminate the subprocess. we used to do that, and it produced lots of corrupted repositories. what we have now still isn't without problems, but it's better than all that.
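For context, a hedged aside: in Go 1.20+ the `os/exec` package gained `Cmd.Cancel` and `Cmd.WaitDelay`, which let `CommandContext` interrupt first and only fall back to a kill after a grace period. These fields postdate this discussion; the helper name, arguments, and grace period below are purely illustrative.

```go
package sketch

import (
	"context"
	"os"
	"os/exec"
	"time"
)

// gentleCheckout is a hypothetical helper: it keeps CommandContext's context
// plumbing, but overrides the default cancel behavior (Process.Kill) with an
// interrupt, hard-killing only if the process lingers past a grace period.
func gentleCheckout(ctx context.Context, dir, rev string) error {
	cmd := exec.CommandContext(ctx, "git", "checkout", rev)
	cmd.Dir = dir
	cmd.Cancel = func() error {
		return cmd.Process.Signal(os.Interrupt) // ask nicely on cancellation
	}
	cmd.WaitDelay = time.Minute // then kill if it still hasn't exited
	return cmd.Run()
}
```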
internal/gps/vcs_repo.go
Outdated
if out, err := cmd.CombinedOutput(); err != nil {
	return newVcsRemoteErrorOr(
		"unable to update repository",
		errors.Wrapf(err, "command failed: %v", cmd.Args),
these errors are going to end up being `errors.Wrap`'d higher up in the call stack - i'm not sure we gain much by doing it again here.
I believe these wraps are preserving the wrap that was happening inside of `runFromRepoDir` via `combinedOutput`.
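A toy illustration of how the layers stack with `github.com/pkg/errors` (which the diff above already uses; the messages here are made up):

```go
package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func main() {
	// Each Wrap layer prepends its own message, so wrapping both at the call
	// site and again higher up the stack yields a chain like the one below.
	cause := errors.New("exit status 128")
	inner := errors.Wrapf(cause, "command failed: %v", []string{"git", "fetch"})
	outer := errors.Wrap(inner, "unable to update repository")
	fmt.Println(outer)
	// unable to update repository: command failed: [git fetch]: exit status 128
}
```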
ahh, i see, that's right, you just added those. OK
(aside - man i wish we had consistent strategies for error handling in various subsystems)
Alright, this is at least green now. I still think that the fancy stuff with
This is a good point; that's quite unfortunate that
💯 definitely should be, yeah.
we've seen all manner of weird vcs behavior and slowness reported, especially at higher concurrency, but we never really have enough context from reports to know if something's truly hung, let alone what it's hung on. however, if we're removing these inactivity-based timeouts, then
Alright, this is ready for a look (and is green on travis)! AppVeyor failed with a problem installing bzr, but I built a windows binary locally (
internal/gps/cmd_unix.go
Outdated
	// immediately to hard kill.
	c.cancel()
} else {
	time.AfterFunc(time.Minute, c.cancel)
The old threshold was 3s - why bump it up to a minute? (seeing it now, i have a thought about why this might be better, but i'm curious to hear your reasoning first)
also, just noting that this is slightly leaky - each of these timers will remain in memory until after the timeout expires. just ~200 bytes per, though, so i can't imagine a realistic scenario in which it makes any kind of difference.
Good call; plugged the leak!
I bumped it to a minute because there's no real benefit to being aggressive here; we expect most commands to shut down gracefully in a reasonable amount of time. I'm open to tuning this, but I think this value is just fine for now.
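One possible shape for that fix, with guessed field names and a placeholder `done` channel rather than the actual patch: keep the `*time.Timer` returned by `AfterFunc` and stop it once the process exits.

```go
package sketch

import (
	"os"
	"os/exec"
	"time"
)

// Illustrative stand-in for the real monitoredCmd; field names are guesses.
type monitoredCmd struct {
	cmd    *exec.Cmd
	cancel func() // hard-kills the subprocess
}

// terminate asks the process to exit, arms a one-minute hard-kill backstop,
// and disarms that backstop as soon as the process exits (done is closed when
// cmd.Wait returns), so the timer doesn't sit in memory for the full minute
// after a clean shutdown.
func (c *monitoredCmd) terminate(done <-chan struct{}) {
	c.cmd.Process.Signal(os.Interrupt)
	t := time.AfterFunc(time.Minute, c.cancel)
	<-done
	t.Stop()
}
```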
Use better context plumbing while I'm here.
yeah so, re: timeouts, my thinking went something like this: there are two scenarios in which that cancel gets tripped:
- on a signal
- on background work that was queued up during a normal solve run, but is now being terminated because `Release()` was called
the first scenario really doesn't concern me at all. the user has already issued an interrupt, which we've effectively passed on to child processes. if the child processes aren't exiting as we'd hoped, then this basically puts power back in the hands of the user to decide when to issue a hard kill, rather than doing it after the relatively short window of 3s.
the other scenario is slightly more concerning, as it means that errant subprocesses may end up delaying the exit of the command, and the user isn't already in the frame of mind to cancel. i'd say this scenario is unlikely, except it seems it literally just came up: #1163.
however, i think that a longer timeout still ends up being better. i mean, why terminate so fast? what are we trying to achieve? there's some potential running time harm, of course, but we default to the safe position on so many other things. we should do it here, too.
@sdboyer
fixes #1101
fixes #1034