-
Notifications
You must be signed in to change notification settings - Fork 39
/
l12-bayou.html
796 lines (652 loc) · 24 KB
/
l12-bayou.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
<h1>6.824 2015 Lecture 12: Eventual Consistency</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the
6.824 <a href="http://nil.csail.mit.edu/6.824/2015/schedule.html">course website</a> from
Spring 2015.</p>
<h2>Exam</h2>
<ul>
<li>Bring papers and lecture notes for exam</li>
</ul>
<h2>Bayou: Eventual consistency</h2>
<ul>
<li>a set of copies of the data, where applications can use any copy of the data</li>
<li>local read/write</li>
<li>even if the network breaks, I can still use the local copy
<ul>
<li><em>disconnected operation</em></li>
</ul></li>
<li>ad-hoc synchronization
<ul>
<li>laptop, phone, tablet can synchronize amongst each other instead
of relying on Internet connection</li>
</ul></li>
<li>can work with database servers that have different data and synchronize
with each other</li>
<li>similar to Ficus, but Bayou has more sophisticated conflict resolution</li>
</ul>
<h3>Conflicts</h3>
<ul>
<li>what to do about the inevitable conflicts that happen when you allow people
to write to their local copies and synchronize them later</li>
</ul>
<h3>Meeting room scheduler</h3>
<p>Traditional approach (central server):</p>
<pre><code> PDA
|-----------------
|9am 824 staff |----------------|
|-- | Server |
|10am -------------> | DB | |
|-- | 9am | |
|11am | 10am | |
|-- |
|12pm |
|--
</code></pre>
<p>Not a good approach because it requires everyone to have connectivity to
the server.</p>
<p>Would be nice if you have PDA send appointment to laptop, which can then send it
to the server.</p>
<pre><code> PDA
|-----------------
|9am 824 staff |----------------|
|-- | Server |
|10am | DB | |
|-- | 9am | | <-----\
|11am | 10am | | \
|-- | |
|12pm | laptop
|-- \ /
\----------------------------------->/
</code></pre>
<h3>Update functions</h3>
<p><strong>Main idea:</strong> Update functions. Instead of the application saying "write this DB
record", the application hands a function that behaves differently based on
what's in the DB.</p>
<p>Example:</p>
<ul>
<li>if free at 10am
<ul>
<li>reserve @10am</li>
</ul></li>
<li>else if free at 9am
<ul>
<li>reserve @9am</li>
</ul></li>
<li>else
<ul>
<li>reserve</li>
</ul></li>
</ul>
<p>Bayou takes this function from the PDA and gives it to the laptop.</p>
<p>Suppose A and B want the same times:</p>
<ul>
<li>A wants: either staff meeting at 10 or 11</li>
<li>B wants: hiring meeting at 10 or 11</li>
</ul>
<p>If you simply apply these functions to node A's and B's databases, that's not
enough:</p>
<ul>
<li>X syncs with A: X gets 10am staff meeting</li>
<li>X syncs with B: X gets 11am hiring meeting</li>
<li>Y syncs with B: Y gets 10am hiring meeting</li>
<li>Y syncs with A: Y gets 11am staff meeting</li>
<li><strong>Bad:</strong> now X and Y have differing views</li>
</ul>
<p><code>=></code> have to execute <code>A</code>'s and <code>B</code>'s update functions in the same order</p>
<h3>Numbering updates</h3>
<p><strong>Next idea:</strong> number update functions, so that you can view them as being a log</p>
<ul>
<li>Classic way to order things is to stamp them with numbers and sorting</li>
<li>initially let the Bayou update ID be <code><time T, nodeId></code>
<ul>
<li>possible for time <code>T</code> to be the same for two update IDs, but then
the node IDs will differ (presumably)</li>
</ul></li>
<li>ordering rules:
<ul>
<li><code>a < b</code> if <code>a.T < b.T</code> or <code>a.T == b.T and a.ID < b.ID</code></li>
</ul></li>
</ul>
<p>If we take the previous example:</p>
<pre><code><T=10, nodeId=A>, A wants: either staff meeting at 10 or 11
<T=20, nodeId=B>, B wants: hiring meeting at 10 or 11
</code></pre>
<ul>
<li>When Y syncs with B and then with A, it'll see A's update occurred earlier</li>
<li>so it undoes B's update, applies A's and then B's again</li>
</ul>
<p>We need to be able to roll back and re-execute the log.</p>
<p>Are the updates consistent with causality?</p>
<ul>
<li>PDA A adds a meeting</li>
<li>A synchronizes with B</li>
<li>B deletes A's meeting</li>
</ul>
<p>If some 3rd node sees these updates, it would be necessary to have the meeting
creation timestamp be smaller than the deletion timestamp.</p>
<h3>Lamport logical clock</h3>
<p>Each node maintains <code>T_max</code>, the highest timestamp this node has ever seen
from itself or from another node.</p>
<p>When a node creates an event and adds it to the log, it picks timestamp <code>T =
max (T_max + 1, wall clock time)</code></p>
<ul>
<li>new timestamps are always higher than timestamps the node has ever seen</li>
</ul>
<h3>Tentative entries, commit scheme</h3>
<p>It's annoying that entries in the calendar are always displayed as tentative
because another (earlier) update could come in and replace it.</p>
<ul>
<li>maybe because the new update sender was disconnected for a long time</li>
</ul>
<p>We're looking for a way to all agree that anything above a certain point in the
log will never change (it's frozen, no one can modify stuff there)</p>
<p><strong>Bad idea:</strong> One possibility is to have all the replicas exchange summary w/
each other about what they've seen:</p>
<ul>
<li>X has seen all A's updates through 20, B's through 17, and C's through 72
<ul>
<li>these are timestamps (logical clocks)</li>
</ul></li>
<li>we know that X will never create a timestamp less than 72</li>
<li>similarly, node Y also has a min timestamp that he will generate next
<ul>
<li>say 30</li>
</ul></li>
<li>we can take the minimum over all these minimums <code>min(30, 72) = 30</code> and
commit all operations up to that point</li>
<li>problem is it requires every node to be up and connected to all other nodes</li>
</ul>
<h4>Commit scheme for Bayou</h4>
<p>They have one magic node, a primary. Every update that passes through the primary,
the primary stamps it with a <em>commit sequence number</em> (CSN), the actual ordering number
becomes: <code><csn, T, node ID></code></p>
<ul>
<li>primary does not wait for earlier updates (with smaller <code>T</code>) to arrive first,
it just timestamps things as they come</li>
<li>commit preserves causal order</li>
<li>commit does not preserve wall clock order</li>
</ul>
<p>If you don't have a CSN: <code><-, T, nodeID></code>, all commited operations are considered
to occur before uncommitted ones. </p>
<p><strong>TODO:</strong> not clear what this example was supposed to show</p>
<ul>
<li>A's meeting created</li>
<li>B's meeting created</li>
<li>B synchronizes with C</li>
<li>B synchronizes with A</li>
<li>C synchronizes with primary</li>
<li>primary applies CSN to A's op, but not B's</li>
<li>B synchronizes with primary</li>
</ul>
<h3>Vector timestamps</h3>
<p>Synchronization</p>
<ul>
<li>A has
<ul>
<li><code><-, 10, X></code></li>
<li><code><-, 20, Y></code></li>
<li><code><-, 30, X></code></li>
<li><code><-, 40, X></code></li>
</ul></li>
<li>B has
<ul>
<li><code><-, 10, X></code></li>
<li><code><-, 20, Y></code></li>
<li><code><-, 30, X></code></li>
</ul></li>
<li>A syncs with B
<ul>
<li>sends a version vector to B describe which updates it has
from every node
<ul>
<li>A: <code>[X 40, Y 20]</code></li>
<li>(remember that the timestamps are always increased by senders)</li>
<li>B: <code>[X 30, Y 20]</code></li>
<li>If B compares A's VT with his, he notices that he needs
updates by X between timestamp 30 and 40</li>
</ul></li>
</ul></li>
</ul>
<h3>A new node joins</h3>
<p>Now some VTs will have an entry for some new node Z. For instance, in the previous
example</p>
<ul>
<li>A can send [X 40, Y 20, Z 60] to B</li>
</ul>
<p>We also need a way to remove nodes.</p>
<p>But B won't know if <code>Z</code> is newly added or newly deleted?</p>
<ul>
<li>Z joins the system</li>
<li>Z talks to X</li>
<li>X generates Z's unique node ID
<ul>
<li>Z's ID = <code><Tz, X's node ID></code>, where Tz is the time Z talked to X</li>
</ul></li>
<li><p>X sends an update timestamped with <code><-, Tz, X></code> that says "new server z"</p>
<ul>
<li><p>Everybody will see this first before seeing Z's updates</p>
<ul>
<li><p>Z's updates have timestamps higher than Tz</p></li>
<li><p>note that IDs are unbounded in size</p></li>
</ul></li>
</ul></li>
</ul>
<p>Forgetting nodes:</p>
<ul>
<li>Z's ID = <code><20, X></code></li>
<li>A syncs -> B</li>
<li>A has log entry from Z <code><-, 25, <20, X>></code></li>
<li>B has no VT entry for Z</li>
</ul>
<p>Now B needs to figure out from A's updates if Z was added or removed</p>
<p>Case 1: If B's VT entry for <code>X</code> is less than the timestamp in <code>Z</code>'s ID, then
that means that <code>B</code> hasn't even seen the creation for <code>Z</code>, let alone any updates
from <code>Z</code> => <code>B</code> should create the entry for <code>Z</code> because <code>Z</code> is new to <code>B</code></p>
<p>Case 2: If B's VT entry for <code>X</code> is higher than the timestamp in <code>Z</code>'s ID, (ie.
B has seen updates from <code>X</code> after it created <code>Z</code>), then B must've seen <code>Z</code>'s
creation <code>=></code> B must have seen a deletion notice</p>
<p><strong>Q:</strong> If Z's entry is missing from <code>B</code> then <code>Z</code> (probably?) says <code><-, T, Z> bye, T > Tz</code></p>
<hr />
<h1>6.824 notes</h1>
<p><a href="papers/bayou-conflicts.pdf">Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage
System</a> Terry, Theimer, Petersen, Demers, Spreitzer,
Hauser, SOSP 95</p>
<p>Some material from <a href="http://people.cs.umass.edu/~arun/cs677/notes/Bayou.pdf">Flexible Update Propagation for Weakly Consistent
Replication</a>, SOSP 97</p>
<h2>Why this paper?</h2>
<ul>
<li>Eventual consistency is pretty common
<ul>
<li>git, iPhone sync, Dropbox, Amazon Dynamo</li>
</ul></li>
<li>Why do people like eventual consistency?
<ul>
<li>fast read/write of local copy (no primary, no paxos)</li>
<li>disconnected operation</li>
</ul></li>
<li>What goes wrong?
<ul>
<li>doesn't look like "single copy" (no primary, no paxos)</li>
<li>conflicting writes to different copies</li>
<li>how to reconcile them when discovered?</li>
</ul></li>
<li>Bayou has the most sophisticated reconciliation story</li>
</ul>
<p>Paper context:</p>
<ul>
<li>Early 1990s (like Ficus)</li>
<li>Dawn of PDAs, laptops, tablets
<ul>
<li>H/W clunky but clear potential</li>
<li>Commercial devices did not have wireless</li>
</ul></li>
<li>Devices might be off or not have network access
<ul>
<li>This problem has not gone away!</li>
<li>iPhone sync, Dropbox sync, Dynamo</li>
</ul></li>
</ul>
<p>Let's build a conference room scheduler</p>
<ul>
<li>Only one meeting allowed at a time (one room).</li>
<li>Each entry has a time and a description.</li>
<li>We want everyone to end up seeing the same set of entries.</li>
</ul>
<p>Traditional approach: one server</p>
<ul>
<li>Server executes one client request at a time</li>
<li>Checks for conflicting time, says yes or no</li>
<li>Updates DB</li>
<li>Proceeds to next request</li>
<li>Server implicitly chooses order for concurrent requests</li>
</ul>
<p>Why aren't we satisfied with central server?
- I want to use scheduler on disconnected iPhone &c
+ So need DB replica in each node.
+ Modify on any node, as well as read.
- Periodic connectivity to net.
- Periodic direct contact with other calendar users (e.g. bluetooth).</p>
<h2>Straw man 1: merge DBs</h2>
<ul>
<li>Similar to iPhone calendar sync, or file sync.</li>
<li>May need to compare every DB entry -- lots of time and net b/w.</li>
<li>Still need a story for conflicting entries, i.e. two meetings at same time.
<ul>
<li>User may not be available to decide at time of DB merge.</li>
<li>So need automatic reconciliation.</li>
</ul></li>
</ul>
<p>Idea for conflicts: update functions</p>
<ul>
<li>Application supplies a function, not a new value.</li>
<li>Read current state of DB, decide best change.</li>
<li>E.g. "Meet at 9 if room is free at 9, else 10, else 11."
<ul>
<li>Rather than just "Meet at 9"</li>
</ul></li>
<li>Function can make reconciliation decision for absent user.</li>
<li>Sync exchanges functions, not DB content.</li>
</ul>
<p><strong>Problem:</strong> can't just apply update functions to DB replica</p>
<ul>
<li>A's fn: staff meeting at 10:00 or 11:00</li>
<li>B's fn: hiring meeting at 10:00 or 11:00</li>
<li>X syncs w/ A, then B</li>
<li>Y syncs w/ B, then A</li>
<li>Will X put A's meeting at 10:00, and Y put A's at 11:00?</li>
</ul>
<p><strong>Goal:</strong> eventual consistency</p>
<ul>
<li>OK for X and Y to disagree initially</li>
<li>But after enough syncing, all nodes' DBs should be identical</li>
</ul>
<p><strong>Idea:</strong> ordered update log</p>
<ul>
<li>Ordered log of updates at each node.</li>
<li>Syncing == ensure both nodes have same updates in log.</li>
<li>DB is result of applying update functions in order.</li>
<li>Same log <code>=></code> same order <code>=></code> same DB content.</li>
</ul>
<p>How can nodes agree on update order?</p>
<ul>
<li>Update ID: <code><time T, node ID></code></li>
<li>T is creating node's wall-clock time.</li>
<li>Ordering updates a and b:
<ul>
<li><code>a < b</code> if <code>a.T < b.T</code> or (<code>a.T = b.T</code> and <code>a.ID < b.ID</code>)</li>
</ul></li>
</ul>
<p>Example:</p>
<pre><code> <10,A>: staff meeting at 10:00 or 11:00
<20,B>: hiring meeting at 10:00 or 11:00
what's the correct eventual outcome?
the result of executing update functions in timestamp order
staff at 10:00, hiring at 11:00
</code></pre>
<p>What DB content before sync?</p>
<ul>
<li>A: staff at 10:00</li>
<li>B: hiring at 10:00</li>
<li>This is what A/B user will see before syncing.</li>
</ul>
<p>Now A and B sync with each other</p>
<ul>
<li>Each sorts new entries into its log, order by time-stamp</li>
<li>Both now know the full set of updates</li>
<li>A can just run B's update function</li>
<li>But B has <em>already</em> run B's operation, too soon!</li>
</ul>
<p>Roll back and replay</p>
<ul>
<li>B needs to to "roll back" DB, re-run both ops in the right order</li>
<li>Big point: the log holds the truth; the DB is just an optimization</li>
<li>We will optimize roll-back in a bit</li>
</ul>
<p>Displayed meeting room calendar entries are "tentative"</p>
<ul>
<li>B's user saw hiring at 10, then it changed to hiring at 11</li>
</ul>
<p>Will update order be consistent with wall-clock time?</p>
<ul>
<li>Maybe A went first (in wall-clock time) with <code><10,A></code></li>
<li>Node clocks unlikely to be perfectly synchronized</li>
<li>So B could then generate <9,B></li>
<li>B's meeting gets priority, even though A asked first</li>
<li>Not "externally consistent"</li>
</ul>
<p>Will update order be consistent with causality?</p>
<ul>
<li>What if A adds a meeting,
<ul>
<li>then B sees A's meeting,</li>
<li>then B deletes A's meeting.</li>
</ul></li>
<li>Perhaps
<ul>
<li><code><10,A> add</code></li>
<li><code><9,B> delete</code> -- B's clock is slow</li>
</ul></li>
<li>Now delete will be ordered before add!</li>
</ul>
<h3>Lamport logical clocks for causal consistency</h3>
<ul>
<li>Want to timestamp events s.t.
<ul>
<li>if node observes E1, then generates E2, then <code>TS(E2) > TS(E1)</code></li>
</ul></li>
<li>So all nodes will order E1, then E2</li>
<li><code>Tmax</code> = highest time-stamp seen from any node (including self)</li>
<li><code>T = max(Tmax + 1, wall-clock time)</code> -- to generate a timestamp</li>
<li>Note properties:
<ul>
<li>E1 then E2 on same node <code>=> TS(E1) < TS(E2)</code></li>
<li>BUT</li>
<li><code>TS(E1) < TS(E2)</code> does not imply E1 came before E2</li>
</ul></li>
</ul>
<p>Logical clock solves add/delete causality example
- When B sees <code><10,A></code>,
+ B will set its Tmax to 10, so
+ B will generate <code><11,B></code> for its delete</p>
<p>Irritating that there could always be a long-delayed update with lower TS</p>
<ul>
<li>That can cause the results of my update to change
<ul>
<li>User can never be sure if meeting time is final!</li>
</ul></li>
<li>Would be nice if updates were eventually "stable"
<ul>
<li><code>=></code> no changes in update order up to that point</li>
<li><code>=></code> results can never again change -- you know for sure when your meeting is</li>
<li><code>=></code> don't have to roll back, re-run committed updates</li>
</ul></li>
</ul>
<p><strong>Bad idea:</strong> a fully decentralized "commit" scheme</p>
<ul>
<li>Proposal: <code><10,A></code> is stable if all nodes have seen all updates w/ <code>TS <= 10</code></li>
<li>Have sync always send in log order -- "prefix property"</li>
<li>If you have seen updates w/ <code>TS > 10</code> from <em>every</em> node
<ul>
<li>Then you'll never again see one <code>< <10,A></code></li>
<li>So <code><10,A></code> is stable</li>
</ul></li>
<li>Why doesn't Bayou do this?
<ul>
<li>Not all nodes are connected to each other</li>
</ul></li>
</ul>
<p>How does Bayou commit updates, so that they are stable?</p>
<ul>
<li>One node designated "primary replica".</li>
<li>It marks each update it receives with a permanent CSN.
<ul>
<li>Commit Sequence Number.</li>
<li>That update is committed.</li>
<li>So a complete time stamp is <code><CSN, local-time, node-id></code></li>
<li>Uncommitted updates (are considered to) come after all committed updates
w/ this new timestamping scheme</li>
</ul></li>
<li>CSN notifications are synced between nodes.</li>
<li>The CSNs define a total order for committed updates.
<ul>
<li>All nodes will eventually agree on it.</li>
</ul></li>
</ul>
<p>Will commit order match tentative order?</p>
<ul>
<li>Often.</li>
<li>Syncs send in log order (prefix property)
<ul>
<li>Including updates learned from other nodes.</li>
</ul></li>
<li>So if A's update log says
<ul>
<li><code><-,10,X></code></li>
<li><code><-,20,A></code></li>
</ul></li>
<li>A will send both to primary, in that order
<ul>
<li>Primary will assign CSNs in that order</li>
<li>Commit order will, in this case, match tentative order</li>
</ul></li>
</ul>
<p>Will commit order always match tentative order?</p>
<ul>
<li>No: primary may see newer updates before older ones.</li>
<li>A has just: <code><-,10,A> W1</code></li>
<li>B has just: <code><-,20,B> W2</code></li>
<li>If <code>C</code> sees both, C's order: <code>W1 W2</code></li>
<li>B syncs with primary, gets <code>CSN=5</code>.</li>
<li>Later A syncs w/ primary, gets <code>CSN=6</code>.</li>
<li>When C syncs w/ primary, its order will change to <code>W2 W1</code>
<ul>
<li><code><5,20,B> W1</code></li>
<li><code><6,10,A> W2</code></li>
</ul></li>
<li>So: committing may change order.</li>
</ul>
<p>Committing allows app to tell users which calendar entries are stable.</p>
<ul>
<li>A stable meeting room time is final.</li>
</ul>
<p>Nodes can discard committed updates.</p>
<ul>
<li>Instead, keep a copy of the DB as of the highest known CSN</li>
<li>Roll back to that DB when replaying tentative update log</li>
<li>Never need to roll back farther
<ul>
<li>Prefix property guarantees seen <code>CSN=x => seen CSN<x</code></li>
<li>No changes to update order among committed updates</li>
</ul></li>
</ul>
<p>How do I sync if I've discarded part of my log?</p>
<ul>
<li>Suppose I've discarded all updates with CSNs.</li>
<li>I keep a copy of the stable DB reflecting just discarded entries.</li>
<li>When I propagate to node <code>X</code>:
<ul>
<li>If node X's highest CSN is less than mine,
<ul>
<li>I can send him my stable DB reflecting just committed updates.</li>
<li>Node X can use my DB as starting point.</li>
<li>And X can discard all CSN log entries.</li>
<li>Then play his tentative updates into that DB.</li>
</ul></li>
<li>If node X's highest CSN is greater than mine,
<ul>
<li>X doesn't need my DB.</li>
</ul></li>
</ul></li>
<li>In practice, Bayou nodes keep the last few committed updates.
<ul>
<li>To reduce chance of having to send whole DB during sync.</li>
</ul></li>
</ul>
<p>How to sync?</p>
<ul>
<li>A sending to B</li>
<li>Need a quick way for B to tell A what to send</li>
<li>Committed updates easy: B sends its CSN to A</li>
<li>What about tentative updates?</li>
<li>A has:
<code><-,10,X></code>
<code><-,20,Y></code>
<code><-,30,X></code>
<code><-,40,X></code></li>
<li>B has:
<code><-,10,X></code>
<code><-,20,Y></code>
<code><-,30,X></code></li>
<li>At start of sync, B tells A "X 30, Y 20"
<ul>
<li>Sync prefix property means B has all X updates before 30, all Y before 20</li>
</ul></li>
<li>A sends all X's updates after <code><-,30,X></code>, all Y's updates after <code><-,20,X></code>, &c</li>
<li>This is a version vector -- it summarize log content
<ul>
<li>It's the "F" vector in Figure 4</li>
<li>A's F: <code>[X:40,Y:20]</code></li>
<li>B's F: <code>[X:30,Y:20]</code></li>
</ul></li>
</ul>
<p>How could we cope with a new server Z joining the system?</p>
<ul>
<li>Could it just start generating writes, e.g. <code><-,1,Z></code> ?</li>
<li>And other nodes just start including Z in VVs?</li>
<li>If A syncs to B, A has <code><-,10,Z></code>, but B has no Z in VV
<ul>
<li>A should pretend B's VV was <code>[Z:0,...]</code></li>
</ul></li>
</ul>
<p>What happens when Z retires (leaves the system)?</p>
<ul>
<li>We want to stop including Z in VVs!</li>
<li>How to announce that Z is gone?
<ul>
<li>Z sends update <code><-,?,Z> "retiring"</code></li>
</ul></li>
<li>If you see a retirement update, omit Z from VV</li>
<li>How to deal with a VV that's missing Z?</li>
<li>If A has log entries from Z, but B's VV has no Z entry:
<ul>
<li>e.g. A has <code><-,25,Z></code>, B's VV is just <code>[A:20, B:21]</code></li>
<li>Maybe Z has retired, B knows, A does not</li>
<li>Maybe Z is new, A knows, B does not</li>
</ul></li>
<li>Need a way to disambiguate: Z missing from VV b/c new, or b/c retired?</li>
</ul>
<p>Bayou's retirement plan</p>
<ul>
<li>Z joins by contacting some server <code>X</code></li>
<li>Z's ID is generated by X as <code><Tz,X></code>
<ul>
<li>Tz is X's logical clock as of when Z joined</li>
<li>Note: unbounded ID size</li>
</ul></li>
<li>X issues <code><-,Tz,X> "new server Z"</code></li>
</ul>
<p>How does <code>ID=<Tz,X></code> scheme help disambiguate new vs forgotten?</p>
<ul>
<li>Suppose Z's ID is <code><20,X></code></li>
<li>A syncs to B
<ul>
<li>A has log entry from <code>Z <-,25,<20,X>></code></li>
<li>B's VV has no Z entry</li>
</ul></li>
<li>One case:
<ul>
<li>B's VV: <code>[X:10, ...]</code></li>
<li><code>10 < 20</code> implies B hasn't yet seen X's "new server Z" update</li>
</ul></li>
<li>The other case:
<ul>
<li>B's VV: <code>[X:30, ...]</code></li>
<li><code>20 < 30</code> implies B once knew about Z, but then saw a retirement update</li>
</ul></li>
</ul>
<p>Let's step back.</p>
<p>Is eventual consistency a useful idea?</p>
<ul>
<li>Yes: people want fast writes to local copies</li>
<li>iPhone sync, Dropbox, Dynamo, Riak, Cassandra, &c</li>
</ul>
<p>Are update conflicts a real problem?</p>
<ul>
<li>Yes -- all systems have some more or less awkward solution</li>
</ul>
<p>Is Bayou's complexity warranted?</p>
<ul>
<li>I.e. log of update functions, version vectors, tentative operations</li>
<li>Only critical if you want peer-to-peer sync
<ul>
<li>I.e. both disconnected operation AND ad-hoc connectivity</li>
</ul></li>
<li>Only tolerable if humans are main consumers of data</li>
<li>Otherwise you can sync through a central server (iPhone, Dropbox)</li>
<li>Or read locally but send updates through a master (PNUTS, Spanner)</li>
</ul>
<p>But there's are good ideas for us to learn from Bayou</p>
<ul>
<li>Update functions for automatic application-driven conflict resolution</li>
<li>Ordered update log is the real truth, not the DB</li>
<li>Logical clock for causal consistency</li>
</ul>