This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

gps: remove arbitrary command timeouts #1110

Merged: 1 commit merged into golang:master from remove-timeouts on Sep 14, 2017

Conversation

@tamird (Contributor) commented Sep 1, 2017

Use better context plumbing while I'm here.

@sdboyer

fixes #1101
fixes #1034

@tamird tamird requested a review from sdboyer as a code owner September 1, 2017 12:22

@sdboyer (Member) left a comment

oh, cool! thanks for jumping right on this. this does go a little further than i'd like to in an initial step, though.

as noted in one of the individual line comments, moving to CommandContext is a regression, as it relies on Kill semantics for subprocess termination.

so, my preferred route here would be to retain monitoredCmd's interrupt-then-kill semantics, but modify it to have a mode without any kind of explicit timeout. then, for commands that don't generate output we can effectively monitor, we can avoid the timeout.

for example, git clone has --progress, so we can monitor it and retain the inactivity timeout. git checkout does not, however, because that flag was added in a sufficiently recent version of git that i didn't want to run the risk.
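
A minimal sketch of the activity-based timeout being described, for commands like `git clone --progress` that keep producing output; `activityWriter` and `runWithInactivityTimeout` are hypothetical names, not dep's actual monitoredCmd:

```go
package main

import (
	"context"
	"os"
	"os/exec"
	"sync/atomic"
	"time"
)

// activityWriter records when the subprocess last produced output, so a
// watchdog can tell "slow but alive" apart from "hung". Hypothetical name.
type activityWriter struct {
	last atomic.Value // time.Time
	dst  *os.File
}

func (a *activityWriter) Write(p []byte) (int, error) {
	a.last.Store(time.Now())
	return a.dst.Write(p)
}

// runWithInactivityTimeout interrupts the command only if it goes silent
// for longer than timeout; a command that keeps reporting progress can run
// as long as it needs to.
func runWithInactivityTimeout(ctx context.Context, timeout time.Duration, name string, args ...string) error {
	out := &activityWriter{dst: os.Stdout}
	out.last.Store(time.Now())

	cmd := exec.Command(name, args...)
	cmd.Stdout = out
	cmd.Stderr = out
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-ctx.Done():
			_ = cmd.Process.Signal(os.Interrupt) // graceful first; a hard kill would follow later
			return <-done
		case <-tick.C:
			if time.Since(out.last.Load().(time.Time)) > timeout {
				_ = cmd.Process.Signal(os.Interrupt)
				return <-done
			}
		}
	}
}

func main() {
	// `git clone --progress` keeps writing progress to stderr, so inactivity
	// is a meaningful signal here; `git checkout` without --progress is not
	// a good fit for this mode.
	_ = runWithInactivityTimeout(context.Background(), 2*time.Minute,
		"git", "clone", "--progress", "https://example.com/some/repo.git")
}
```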

out, err := runFromCwd(ctx, expensiveCmdTimeout, "git", "clone", "--recursive", "-v", "--progress", r.Remote(), r.LocalPath())
if err != nil {
	return newVcsRemoteErrorOr("unable to get repository", err, string(out))
cmd := exec.CommandContext(

Member:

relying on CommandContext represents a regression vs. our current logic, as it relies on Process.Kill() to terminate the subprocess. we used to do that, and it produced lots of corrupted repositories. what we have still isn't without problems, but it's better than all that.
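
For illustration only, the behavior being called a regression, assuming a plain exec.CommandContext invocation (the git command here is just an example, not a line from this PR):

```go
package main

import (
	"context"
	"os/exec"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// exec.CommandContext's only reaction to the context being cancelled is
	// cmd.Process.Kill() -- no SIGINT first and no grace period for git to
	// finish writing, which is how repositories end up corrupted.
	cmd := exec.CommandContext(ctx, "git", "checkout", "master")
	if err := cmd.Start(); err != nil {
		return
	}
	cancel()       // triggers an immediate hard kill of the child
	_ = cmd.Wait() // typically reports "signal: killed"
}
```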

if out, err := cmd.CombinedOutput(); err != nil {
	return newVcsRemoteErrorOr(
		"unable to update repository",
		errors.Wrapf(err, "command failed: %v", cmd.Args),

Member:

these errors are going to end up being errors.Wrap'd higher up in the call stack - i'm not sure we gain much by doing it again here.

Collaborator:

I believe these wraps are preserving the wrap that was happening inside of runFromRepoDir via combinedOutput

Member:

ahh, i see, that's right, you just added those. OK

(aside - man i wish we had consistent strategies for error handling in various subsystems)

@tamird (Contributor, author) commented Sep 11, 2017

Alright, this is at least green now. I still think that the fancy stuff with --progress isn't necessary; vcs commands aren't expected to randomly hang.

> as noted in one of the individual line comments, moving to CommandContext is a regression, as it relies on Kill semantics for subprocess termination.

This is a good point; it's unfortunate that Kill is hardcoded. I think keeping the cmd.Process.Signal(os.Interrupt) semantics while removing everything else is possible, though.
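
A sketch of what that could look like, assuming a hypothetical runGracefully helper that keeps the interrupt-then-kill escalation but drops the inactivity timeout entirely:

```go
package main

import (
	"context"
	"os"
	"os/exec"
	"time"
)

// runGracefully keeps the Signal(os.Interrupt) semantics with no inactivity
// timeout: the command runs until it finishes or ctx is cancelled, and only
// on cancellation do we escalate from SIGINT to a hard kill. Hypothetical
// helper name, not the code that landed in this PR.
func runGracefully(ctx context.Context, grace time.Duration, name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// Ask nicely first so git can leave the repository in a consistent state...
		_ = cmd.Process.Signal(os.Interrupt)
		select {
		case err := <-done:
			return err
		case <-time.After(grace):
			// ...and only hard-kill if it ignores us.
			_ = cmd.Process.Kill()
			return <-done
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	_ = runGracefully(ctx, time.Minute, "git", "checkout", "master")
}
```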

@sdboyer (Member) commented Sep 11, 2017

> I think keeping the cmd.Process.Signal(os.Interrupt) semantics while removing everything else is possible, though.

💯 definitely should be, yeah.

> I still think that the fancy stuff with --progress isn't necessary; vcs commands aren't expected to randomly hang.

we've seen all manner of weird vcs behavior and slowness reported, especially at higher concurrency, but we never really have enough context from reports to know if something's truly hung, let alone what on. however, if we're removing these inactivity-based timeouts, then --progress becomes superfluous, because we no longer really care. so, yeah, 🗑 it 😄

@tamird (Contributor, author) commented Sep 12, 2017

Alright, this is ready for a look (and is green on travis)! AppVeyor failed with a problem installing bzr, but I built a windows binary locally (GOOS=windows go build ./cmd/dep) to confirm everything compiles, at least.

	// immediately to hard kill.
	c.cancel()
} else {
	time.AfterFunc(time.Minute, c.cancel)

Member:

The old threshold was 3s - why bump it up to a minute? (seeing it now, i have a thought about why this might be better, but i'm curious to hear your reasoning first)

Member:

also, just noting that this is slightly leaky - each of these will remain in memory until after the timeout expires. just ~200 bytes per, though, so i can't imagine a realistic scenario in which it makes any kind of difference.

Contributor (author):

Good call; plugged the leak!

I bumped it to a minute because there's no real benefit to being aggressive here; we expect most commands to shut down gracefully in a reasonable amount of time. I'm open to tuning this, but I think this value is just fine for now.
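
A sketch of how the leak can be plugged, assuming the escalation path shown in the diff above; the names and structure are illustrative, not necessarily the exact code that landed:

```go
package main

import (
	"context"
	"os"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// The command's own context supplies the hard kill (stand-in for
	// monitoredCmd's internal cancel).
	cmd := exec.CommandContext(ctx, "git", "checkout", "master")
	if err := cmd.Start(); err != nil {
		return
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	// ...later, when the outer context is cancelled:
	if err := cmd.Process.Signal(os.Interrupt); err != nil {
		// Interrupt isn't supported (e.g. on Windows), so go immediately
		// to hard kill.
		cancel()
	} else {
		// Keep the *time.Timer returned by AfterFunc and stop it once the
		// process has exited, so nothing lingers for the full minute.
		t := time.AfterFunc(time.Minute, cancel)
		defer t.Stop()
	}
	<-done
}
```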

Use better context plumbing while I'm here.

@sdboyer (Member) left a comment

yeah so, re: timeouts, my thinking went something like this: there are two scenarios in which that cancel gets tripped:

  1. on a signal
  2. on background work that was queued up during a normal solve run, but is now being terminated because Release() was called

the first scenario really doesn't concern me at all. the user has already issued an interrupt, which we've effectively passed on to child processes. if the child processes aren't exiting as we'd hoped, then this basically puts power back in the hands of the user to decide when to issue a hard kill, rather than doing it after the relatively short window of 3s.

the other scenario is slightly more concerning, as it means that errant subprocesses may end up delaying the exit of the command, and the user isn't already in the frame of mind to cancel. i'd say this scenario is unlikely, except it seems it literally just came up: #1163.

however, i think that a longer timeout still ends up being better. i mean, why terminate so fast? what are we trying to achieve? there's some potential running time harm, of course, but we default to the safe position on so many other things. we should do it here, too
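
A sketch of the first scenario's wiring, assuming a hypothetical signal handler that cancels the shared context on the first Ctrl-C and leaves the hard-kill decision to the user's second one (not dep's actual signal handling):

```go
package main

import (
	"context"
	"os"
	"os/signal"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, os.Interrupt)
	go func() {
		<-sigs   // first interrupt: begin a graceful shutdown
		cancel() // subprocess runners see ctx.Done() and forward SIGINT to children
		<-sigs   // second interrupt: the user decides when to stop waiting
		os.Exit(1)
	}()

	<-ctx.Done() // placeholder for the solve run / vcs commands
}
```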

@sdboyer sdboyer merged commit 95db3e1 into golang:master Sep 14, 2017
@tamird tamird deleted the remove-timeouts branch September 14, 2017 13:44

Successfully merging this pull request may close these issues.

- Add separate timeout for commands that touches the disk
- Issue with cloudfoundry cli dependency
5 participants