Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resync from fsevents journal on UserDropped #287

Merged
merged 10 commits into from
Jun 22, 2016
Merged

Resync from fsevents journal on UserDropped #287

merged 10 commits into from
Jun 22, 2016

Conversation

wez
Copy link
Contributor

@wez wez commented Jun 19, 2016

This PR implements a new configurable feature that will attempt to resync from the persistent fsevents journal in the event of a user-space overflow.

In my recent performance tests I've been unable to trigger a UserDropped event while watchman is running under instrumentation. Furthermore, the performance data shows that the processing callback takes only a handful of milliseconds to run, with the critical section taking microseconds. This leads me to believe that the root cause of the UserDropped events is not tardiness in the processing loop, but rather that we are not getting the CPU in a timely fashion.

Therefore it seems reasonable that we should be able to successfully resync from the fsevents journal when we do eventually get back on the CPU.

The approach taken here is to refactor the code that creates the stream such that we can create more than one, and pass in the event id as the since parameter.

Initially we sync from the current point in time, but if we do experience a UserDropped event we will set up a new stream using the last observed good event id.

This resync is not as heavyweight as a full recrawl because fsevents promises to tell us about the changes that we didn't see; we're effectively rewinding the stream to resync. This should cut out a lot of recursive crawling IO and take less time over all.

Testing: I've spent a couple of hours and have been unable to trigger a UserDropped event with the 4.6 release candidate code thus far (my test case for this involves changing the sparse status of a deep tree), but our telemetry shows that this has happened for a couple of our mac users. I'm going to try some build intensive tests when I get back into the office.

@ghost
Copy link

ghost commented Jun 19, 2016

@wez updated the pull request.

2 similar comments
@ghost
Copy link

ghost commented Jun 19, 2016

@wez updated the pull request.

@ghost
Copy link

ghost commented Jun 19, 2016

@wez updated the pull request.

// Block until fsevents_root_start is waiting for our initialization
pthread_mutex_lock(&state->fse_mtx);
if (!fse_stream) {
root->failure_reason = w_string_new("OOM");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably in an OOM you wouldn't be able to allocate a new string?

@sunshowers
Copy link
Contributor

OK, I'll probably have another pass over it.

wez added 7 commits June 20, 2016 13:39
This factors out the code that sets up the event stream for fsevents.
This is prep work for a more intelligent re-sync operation for dealing
with UserDropped events.
During initial setup it is valid to directup update
root->failure_reason, but in the resync case, we don't want it to stick
in the root because we may treat that later on as a failure to recrawl.
@ghost
Copy link

ghost commented Jun 20, 2016

@wez updated the pull request.

wez added 3 commits June 20, 2016 14:43
This asserts that the resync leaves us with a correct state,
and turns on this feature by default.

I inspected the log files and can see that we receive the cookie
notifications for a prior cookie again after we've set up the
replacement stream.  It looks good, but we can't easily make assertions
on the log file contents in the test without a high chance of flakiness.
The docs for FSEventsCopyUUIDForDevice have more commentary on the
rules behind this.

The TL;DR is that we need to ensure that the UUID matches the last
one we saw before we can legitimately re-sync from the last_good
event id.
Log when we resync, along with the eventid, so that we can get a feel
for how many events we dropped.

Also remember when the event id overflows so that we don't bother trying
to resync.
@ghost
Copy link

ghost commented Jun 20, 2016

@wez updated the pull request.


if (!replacement) {
w_log(W_LOG_ERR,
"Failed to rebuild fsevent stream (%.*s) while trying to "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsevents

@sunshowers
Copy link
Contributor

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants