Resync from fsevents journal on UserDropped #287
Conversation
@wez updated the pull request.
// Block until fsevents_root_start is waiting for our initialization
pthread_mutex_lock(&state->fse_mtx);
if (!fse_stream) {
  root->failure_reason = w_string_new("OOM");
Presumably in an OOM you wouldn't be able to allocate a new string?
OK, I'll probably have another pass over it.
This factors out the code that sets up the event stream for fsevents. This is prep work for a more intelligent re-sync operation for dealing with UserDropped events.
During initial setup it is valid to directly update root->failure_reason, but in the resync case we don't want the failure to stick in the root, because later code could treat that as a failed recrawl.
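For illustration, a minimal sketch of the shape this factoring could take; the helper name fse_stream_make and its out-parameter are hypothetical, not the exact code in this PR:

// Hypothetical sketch: report the failure via an out-parameter instead of
// writing root->failure_reason directly, so a failed resync is not later
// mistaken for a failed recrawl.
static struct fse_stream *fse_stream_make(w_root_t *root,
                                          FSEventStreamEventId since,
                                          w_string_t **why) {
  struct fse_stream *fse_stream = calloc(1, sizeof(*fse_stream));
  if (!fse_stream) {
    *why = w_string_new("OOM");
    return NULL;
  }
  /* ... build the FSEventStream for this root starting at `since` ... */
  return fse_stream;
}

During initial setup the caller can copy *why into root->failure_reason; during a resync it can simply log it and fall back to a recrawl.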
@wez updated the pull request.
This asserts that the resync leaves us with a correct state, and turns on this feature by default. I inspected the log files and can see that we receive the cookie notifications for a prior cookie again after we've set up the replacement stream. It looks good, but we can't easily make assertions on the log file contents in the test without a high chance of flakiness.
The docs for FSEventsCopyUUIDForDevice have more commentary on the rules behind this. The TL;DR is that we need to ensure that the UUID matches the last one we saw before we can legitimately re-sync from the last_good event id.
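To make that rule concrete, here is a hedged sketch of the check; the FSEvents and CoreFoundation calls are real, but the surrounding names (can_resync_from_journal, saved_uuid, last_good) are assumptions rather than the exact patch:

#include <CoreServices/CoreServices.h>
#include <stdbool.h>
#include <sys/types.h>

// Returns true only if the device's fsevents journal UUID still matches the
// UUID we recorded when last_good was observed.  A NULL result means the
// journal is gone (or was never there), so a full recrawl is required.
static bool can_resync_from_journal(dev_t dev, CFUUIDRef saved_uuid) {
  CFUUIDRef current = FSEventsCopyUUIDForDevice(dev);
  if (!current || !saved_uuid) {
    if (current) {
      CFRelease(current);
    }
    return false;
  }
  bool same = CFEqual(current, saved_uuid);
  CFRelease(current);
  return same;
}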
Log when we resync, along with the eventid, so that we can get a feel for how many events we dropped. Also remember when the event id overflows so that we don't bother trying to resync.
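As a rough sketch of where those decisions happen (the flag constants are the real FSEvents ones; the state fields and log message are assumptions, not the exact patch):

// Hypothetical callback fragment: note event id wrap-around, and log the
// event id we resync from when the kernel reports a drop.
static void fse_callback(ConstFSEventStreamRef stream, void *info,
                         size_t num_events, void *event_paths,
                         const FSEventStreamEventFlags flags[],
                         const FSEventStreamEventId ids[]) {
  struct fse_state *state = info;

  for (size_t i = 0; i < num_events; i++) {
    if (flags[i] & kFSEventStreamEventFlagEventIdsWrapped) {
      // The event id counter overflowed: last_good is no longer meaningful,
      // so don't bother trying to resync after this point.
      state->event_id_wrapped = true;
    }
    if (flags[i] & kFSEventStreamEventFlagUserDropped) {
      if (!state->event_id_wrapped) {
        w_log(W_LOG_ERR, "UserDropped: resyncing from event id %llu\n",
              (unsigned long long)state->last_good);
        // ... tear down this stream and create a replacement with
        //     sinceWhen = state->last_good ...
        return;
      }
      // otherwise fall back to a full recrawl
    }
    state->last_good = ids[i];
  }
}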
@wez updated the pull request.
if (!replacement) {
  w_log(W_LOG_ERR,
        "Failed to rebuild fsevent stream (%.*s) while trying to "
fsevents
LGTM
This PR implements a new configurable feature that will attempt to resync from the persistent fsevents journal in the event of a user-space overflow.
In my recent performance tests I've been unable to trigger a UserDropped event while watchman is running under instrumentation. Furthermore, the performance data shows that the processing callback takes only a handful of milliseconds to run, with the critical section taking microseconds. This leads me to believe that the root cause of the UserDropped events is not tardiness in the processing loop, but rather that we are not getting the CPU in a timely fashion.
Therefore it seems reasonable that we should be able to successfully resync from the fsevents journal when we do eventually get back on the CPU.
The approach taken here is to refactor the code that creates the stream such that we can create more than one, and pass in the event id as the since parameter.
Initially we sync from the current point in time, but if we do experience a UserDropped event we will set up a new stream using the last observed good event id.
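A minimal sketch of that distinction, assuming hypothetical variable names (resyncing, last_good, paths_to_watch, context) around the real FSEventStreamCreate call:

// Initial watch: start from "now".  Resync: start from the last good event
// id so that fsevents replays the journal entries we missed.
FSEventStreamEventId since =
    resyncing ? last_good : kFSEventStreamEventIdSinceNow;

FSEventStreamRef stream = FSEventStreamCreate(
    NULL,            // default allocator
    fse_callback,    // event callback
    &context,        // FSEventStreamContext carrying our per-root state
    paths_to_watch,  // CFArrayRef containing the watched root
    since,
    0.0001,          // latency, in seconds
    kFSEventStreamCreateFlagNoDefer | kFSEventStreamCreateFlagFileEvents);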
This resync is not as heavyweight as a full recrawl because fsevents promises to tell us about the changes we didn't see; we're effectively rewinding the stream to resync. This should cut out a lot of recursive crawling I/O and take less time overall.
Testing: I've spent a couple of hours so far and have been unable to trigger a UserDropped event with the 4.6 release candidate code (my test case for this involves changing the sparse status of a deep tree), but our telemetry shows that this has happened for a couple of our Mac users. I'm going to try some build-intensive tests when I get back into the office.