Resync from fsevents journal on UserDropped #287

wez · 2016-06-19T18:35:41Z

This PR implements a new configurable feature that will attempt to resync from the persistent fsevents journal in the event of a user-space overflow.

In my recent performance tests I've been unable to trigger a UserDropped event while watchman is running under instrumentation. Furthermore, the performance data shows that the processing callback takes only a handful of milliseconds to run, with the critical section taking microseconds. This leads me to believe that the root cause of the UserDropped events is not tardiness in the processing loop, but rather that we are not getting the CPU in a timely fashion.

Therefore it seems reasonable that we should be able to successfully resync from the fsevents journal when we do eventually get back on the CPU.

The approach taken here is to refactor the code that creates the stream so that we can create more than one stream, passing in an event id as the since parameter.

Initially we sync from the current point in time, but if we do experience a UserDropped event we will set up a new stream using the last observed good event id.
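A minimal sketch of how the since value could be chosen. It does not call the FSEvents API, and it is not watchman's actual code: `struct resync_state`, `choose_since_event_id` and `SINCE_NOW` are hypothetical names. `SINCE_NOW` is defined to mirror `kFSEventStreamEventIdSinceNow` from FSEvents.h, which is the value `FSEventStreamCreate` accepts as its `sinceWhen` argument to mean "only events from now on".

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors kFSEventStreamEventIdSinceNow from FSEvents.h. */
#define SINCE_NOW UINT64_C(0xFFFFFFFFFFFFFFFF)

/* Hypothetical per-root state; not watchman's real struct. */
struct resync_state {
  bool have_last_good; /* true once at least one event was processed */
  uint64_t last_good;  /* id of the last event we handled successfully */
};

/* Pick the since value for a new stream: "now" for the initial stream,
 * or the last good event id when rebuilding after UserDropped, so that
 * fsevents replays the events that followed it. Without a recorded id
 * there is nothing to rewind to, so fall back to "now". */
static uint64_t choose_since_event_id(const struct resync_state *st,
                                      bool is_resync) {
  if (is_resync && st->have_last_good) {
    return st->last_good;
  }
  return SINCE_NOW;
}
```

The initial stream and the replacement stream then go through the same creation path and differ only in this one argument.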

This resync is not as heavyweight as a full recrawl because fsevents promises to tell us about the changes that we didn't see; we're effectively rewinding the stream to resync. This should cut out a lot of recursive crawling IO and take less time overall.
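The choice between rewinding and recrawling can be summarized as a small decision function. This is a hedged sketch: `enum drop_action`, `struct drop_context` and `on_user_dropped` are hypothetical, and the conditions gathered here (the config option, a recorded last good id, event id wraparound, an unchanged journal UUID) are the ones discussed in this PR, not watchman's exact logic.

```c
#include <stdbool.h>

/* What to do after a UserDropped event; hypothetical names. */
enum drop_action { ACTION_RESYNC, ACTION_RECRAWL };

/* Hypothetical inputs to the decision. */
struct drop_context {
  bool resync_enabled;    /* the config option controlling this feature */
  bool have_last_good;    /* we observed at least one good event id */
  bool ids_wrapped;       /* event ids overflowed, so last_good is unusable */
  bool journal_uuid_same; /* journal UUID unchanged since last_good */
};

/* Rewinding the stream is cheaper than a recursive crawl, but it is only
 * valid when fsevents can still replay history from last_good. Any doubt
 * falls back to the always-correct (but expensive) recrawl. */
static enum drop_action on_user_dropped(const struct drop_context *ctx) {
  if (ctx->resync_enabled && ctx->have_last_good && !ctx->ids_wrapped &&
      ctx->journal_uuid_same) {
    return ACTION_RESYNC;
  }
  return ACTION_RECRAWL;
}
```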

Testing: I've spent a couple of hours on this and have so far been unable to trigger a UserDropped event with the 4.6 release candidate code (my test case involves changing the sparse status of a deep tree), but our telemetry shows that this has happened for a couple of our Mac users. I'm going to try some build-intensive tests when I get back into the office.

ghost · 2016-06-19T21:22:09Z

@wez updated the pull request.

ghost · 2016-06-19T21:56:35Z

@wez updated the pull request.

ghost · 2016-06-19T22:15:06Z

@wez updated the pull request.

sunshowers · 2016-06-20T17:23:02Z

watcher/fsevents.c

-  // Block until fsevents_root_start is waiting for our initialization
-  pthread_mutex_lock(&state->fse_mtx);
+  if (!fse_stream) {
+    root->failure_reason = w_string_new("OOM");


Presumably in an OOM you wouldn't be able to allocate a new string?

sunshowers · 2016-06-20T18:04:01Z

OK, I'll probably have another pass over it.

ghost · 2016-06-20T20:40:49Z

@wez updated the pull request.

This asserts that the resync leaves us with a correct state, and turns on this feature by default. I inspected the log files and can see that we receive the cookie notifications for a prior cookie again after we've set up the replacement stream. It looks good, but we can't easily make assertions on the log file contents in the test without a high chance of flakiness.

ghost · 2016-06-20T21:44:34Z

@wez updated the pull request.

sunshowers · 2016-06-22T02:46:50Z

watcher/fsevents.c

+
+            if (!replacement) {
+              w_log(W_LOG_ERR,
+                    "Failed to rebuild fsevent stream (%.*s) while trying to "


sunshowers · 2016-06-22T02:48:51Z

LGTM

ghost added GH Review: review-needed CLA Signed labels Jun 19, 2016

sunshowers reviewed Jun 20, 2016

sunshowers added GH Review: needs-revision and removed GH Review: review-needed labels Jun 20, 2016

wez added 7 commits June 20, 2016 13:39

refactor: fsevents stream setup

2e2c3e6

This factors out the code that sets up the event stream for fsevents. This is prep work for a more intelligent re-sync operation for dealing with UserDropped events.

allow for passing a stream id when setting up the fsevents stream

587fcd2

save the last good event id

c5d181f

track the fsevents stream in the state, not a local

e0a478b

add config option to control whether we want to try to resync ourselves

ce7c711

resync since last good event id on user-dropped

adf3583

return failure reason as "out" parameter from fse_stream_make

1b5ad6e

During initial setup it is valid to directly update root->failure_reason, but in the resync case we don't want it to stick in the root, because we may later treat that as a failure to recrawl.
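A sketch of the out-parameter pattern this commit describes, with hypothetical `stream_make`, `initial_setup` and `resync_setup` standing in for fse_stream_make and its two callers (the real signatures are not shown in this PR). Reporting reasons as string literals is an assumption of this sketch; it also means the "OOM" path needs no allocation, which was the concern raised in review above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not watchman's real types. */
struct stream {
  uint64_t since;
};
struct root {
  const char *failure_reason; /* non-NULL marks the root as failed */
};

/* Builds a stream. On failure returns NULL and reports the reason through
 * the out parameter instead of touching the root. */
static struct stream *stream_make(const char *path, uint64_t since,
                                  const char **failure_reason) {
  *failure_reason = NULL;
  if (path == NULL || path[0] == '\0') {
    *failure_reason = "invalid path";
    return NULL;
  }
  struct stream *s = malloc(sizeof(*s));
  if (s == NULL) {
    *failure_reason = "OOM"; /* literal: no allocation needed */
    return NULL;
  }
  s->since = since;
  return s;
}

/* Initial setup: a failure is fatal for the root, so record it there. */
static struct stream *initial_setup(struct root *root, const char *path,
                                    uint64_t since) {
  const char *reason;
  struct stream *s = stream_make(path, since, &reason);
  if (s == NULL) {
    root->failure_reason = reason;
  }
  return s;
}

/* Resync: log the failure but leave root->failure_reason alone, so it is
 * not later mistaken for a failed recrawl. */
static struct stream *resync_setup(struct root *root, const char *path,
                                   uint64_t since) {
  (void)root; /* deliberately not modified */
  const char *reason;
  struct stream *s = stream_make(path, since, &reason);
  if (s == NULL) {
    fprintf(stderr, "failed to rebuild stream for %s: %s\n",
            path ? path : "(null)", reason);
  }
  return s;
}
```

Keeping the reason out of the root lets each caller decide whether a failure is fatal, which is the distinction the commit message draws.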

wez force-pushed the resync branch from 99de878 to 773c217 June 20, 2016 20:40

ghost added GH Review: review-needed and removed GH Review: needs-revision labels Jun 20, 2016

wez added 3 commits June 20, 2016 14:43

handle the case where the journal UUID changed

3bcc382

The docs for FSEventsCopyUUIDForDevice have more commentary on the rules behind this. The TL;DR is that we need to ensure that the UUID matches the last one we saw before we can legitimately re-sync from the last_good event id.
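A minimal sketch of the UUID guard, assuming the journal UUID has already been obtained (for example via FSEventsCopyUUIDForDevice) and its bytes copied into a hypothetical `struct journal_uuid`. Treating a missing UUID as unsafe is this sketch's own conservative choice, not something stated in the PR.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical raw 16-byte form of a device's fsevents journal UUID. */
struct journal_uuid {
  bool valid; /* false if no UUID could be obtained */
  uint8_t bytes[16];
};

/* Event ids are only meaningful within one journal, so resyncing from
 * last_good is legitimate only if the UUID recorded alongside last_good
 * matches the device's current UUID. */
static bool can_resync_from_last_good(const struct journal_uuid *at_last_good,
                                      const struct journal_uuid *now) {
  if (!at_last_good->valid || !now->valid) {
    return false;
  }
  return memcmp(at_last_good->bytes, now->bytes, sizeof(now->bytes)) == 0;
}
```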

handle a couple of flags that may be set in fsevents

0ee4801

Log when we resync, along with the eventid, so that we can get a feel for how many events we dropped. Also remember when the event id overflows so that we don't bother trying to resync.
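A sketch of per-batch flag handling, assuming the flags in question are UserDropped and EventIdsWrapped (the commit does not name them). The macros mirror the values of kFSEventStreamEventFlagUserDropped and kFSEventStreamEventFlagEventIdsWrapped in FSEvents.h; `struct event_tracker` and `track_batch` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Mirror kFSEventStreamEventFlagUserDropped and
 * kFSEventStreamEventFlagEventIdsWrapped from FSEvents.h. */
#define FLAG_USER_DROPPED 0x00000002u
#define FLAG_EVENT_IDS_WRAPPED 0x00000008u

/* Hypothetical per-stream bookkeeping. */
struct event_tracker {
  uint64_t last_good; /* id of the last event handled before any drop */
  bool ids_wrapped;   /* once set, resyncing by event id is pointless */
  bool need_resync;   /* a UserDropped was seen in this batch */
};

/* Process one callback batch of (id, flags) pairs in delivery order. */
static void track_batch(struct event_tracker *t, const uint64_t *ids,
                        const uint32_t *flags, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (flags[i] & FLAG_EVENT_IDS_WRAPPED) {
      t->ids_wrapped = true; /* remember the overflow */
    }
    if (flags[i] & FLAG_USER_DROPPED) {
      t->need_resync = true;
      /* Log both ids; the gap hints at how many events were dropped. */
      fprintf(stderr, "UserDropped at event %llu; last good was %llu\n",
              (unsigned long long)ids[i], (unsigned long long)t->last_good);
      return; /* do not advance last_good past the drop */
    }
    t->last_good = ids[i];
  }
}
```

Stopping at the drop keeps last_good pointing before the gap; any events after it in the same batch would be replayed by the rewound stream anyway.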

wez force-pushed the resync branch from 773c217 to 0ee4801 June 20, 2016 21:43

sunshowers reviewed Jun 22, 2016

sunshowers added GH Review: accepted and removed GH Review: review-needed labels Jun 22, 2016

wez merged commit 0ee4801 into facebook:master Jun 22, 2016

wez deleted the resync branch June 22, 2016 14:30

vtjnash mentioned this pull request Mar 9, 2020

os x: FSEventStreamFlushSync assertions logged to console nodejs/node#854

Closed