fix: reduce deadlock potential in shell->client #180
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This patch attempts to reduce the potential to deadlock in the shell->client thread by removing the mutex lock around the shell->client unix socket stream. While recent changes to reduce deadlocks have successfully prevented a single locked session from gumming up all the other sessions, the core issue seems to be persisting. I don't see a smoking gun in the logs that a reporter sent me, but reducing the number of locks in the shell->client thread, which seems to be the one that gets locked up, might help. This should also be slight better for performance (not that it really matters here).
In addition to that main change, this patch contains a few little bugfixes:
Previously, every time the client probed for daemon presence, the daemon would put a distracting stack trace in its log. This wasn't a real issue, but it could be confusing for people not familiar with how shpool works. I've had a few people point to it in bug reports. I suppressed the stack trace and replaced it with a more civilized log line. Stack traces in the logs should be indicative of an issue, not business as usual.
I fixed the child_watcher thread to directly call waitpid rather than using the shpool_pty crate to do the work. This isn't ideal, but is required to avoid a use-after-close issue that was causing trouble. I think this bug was already present, but it just wasn't presenting a problem during tests for whatever reason (likely just timing). I had to fix this to get the tests passing.