6.824 2009 Lecture 18: Security: Byzantine Fault Tolerance
so far we have been assuming fail-stop behavior
computer/network either executes protocol correctly
or halts (and maybe repaired after a while)
even fail-stop is hard to cope with
did the server crash, or is the network broken?
is the network broken, or just delaying packets?
is the network partitioned?
what if fail-stop failures/repair so frequent we can do no work?
if system is quiet for a while,
ping+paxos will produce a useful configuration
now we're going to be more ambitious
use replication to help with security
what is Byzantine behavior?
a larger set of ways in which a computer can misbehave
includes fail-stop faults
plus bugs (i.e. incorrect execution)
plus intentional malice
plus conspiracies -- multiple bad nodes cooperating to trick the good ones
(devious, surreptitious)
what kinds of Byzantine behavior could cause trouble in the labs?
lock server primary gives away same lock to multiple clients
lock server backup responds to pings, but does nothing else
will be kept in Paxos config, but will prevent progress
backup might fake a double-grant from primary,
trick system administrator into thinking primary is faulty,
thus trick admin into replacing good primary with malicious backup
backup DDoSs primary, Paxos causes backup to take over as primary,
then gives away locks to multiple clients
specific assumptions in the paper's Byzantine fault model
network can delay, duplicate, re-order, or drop any message
"faulty" vs "non-faulty" replicas
faulty replicas may be controlled by a malicious attacker
the attacker:
supplies the code that faulty replicas run
knows the code the non-faulty replicas are running
can read network messages
cannot guess non-faulty replicas' cryptographic keys
knows the faulty replicas' keys
can force messages to be delayed
how can we possibly cope with Byzantine replicas?
assume no more than some fraction of replicas are faulty
the basic Byzantine fault tolerance approach
replicated state machines
ask them all to do each operation
compare answers from replicas
if enough agree, perhaps that's the correct answer
straw man 1:
[client, n servers]
n servers
client sends request to all of them
waits for all n to reply
only proceeds if all n agree
why might this not work well?
faulty replica can stop progress by disagreeing
straw man 2:
let's have replicas vote
2f+1 servers, assume no more than f are faulty
if client gets f+1 matching replies, success
otherwise, failure
why might this not work well?
client can't wait for replies from the last f replicas
they might be faulty, never going to reply
so must be able to make a decision after n-f replies, or f+1 if n=2f+1
but f of the first f+1 replies might be from faulty replicas!
i.e. f+1 is not enough to vote
also waiting for f+1 of 2f+1 doesn't ensure that majority of good nodes executed
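  to make the counting concrete (illustrative numbers, not from the notes):
    f = 1, so 2f+1 = 3 servers S0 S1 S2, and suppose S2 is faulty
    the client can only count on n-f = 2 replies, say from S1 and S2
    if those 2 replies match, one may be S2's lie, so only one good
      server actually vouches for the answer
    and only 1 of the 2 good servers is known to have executed the op,
      so the op may be missing from the state a later request sees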
straw man 3:
3f+1 servers, of which at most f are faulty
client waits for 2f+1 replies
client takes majority of those 2f+1
why does this work? informally...
client will get at least 2f+1 replies: all non-faulty replicas eventually respond
at most f of 2f+1 are faulty
the f+1 non-faulty replicas will agree on the answer
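a rough Go sketch of the straw-man-3 client's counting (my illustration,
not code from the paper or the labs; a real client would also check
signatures and match replies to its own request):

  package main

  import "fmt"

  // acceptReply returns the value reported by at least f+1 of the first
  // 2f+1 replies; since at most f repliers are faulty, at least one
  // non-faulty replica vouches for that value.
  func acceptReply(replies []string, f int) (string, bool) {
      if len(replies) < 2*f+1 {
          return "", false // keep waiting for more replies
      }
      counts := map[string]int{}
      for _, r := range replies[:2*f+1] {
          counts[r]++
          if counts[r] >= f+1 {
              return r, true
          }
      }
      return "", false
  }

  func main() {
      // f = 1: 3f+1 = 4 replicas, wait for 2f+1 = 3 replies, need f+1 = 2 matching
      fmt.Println(acceptReply([]string{"GRANT", "DENY", "GRANT"}, 1))
  }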
what about handling multiple clients?
non-faulty replicas must process operations in the same order!
let's use a primary to choose operation order
picks an order and assigns sequence numbers
for now, assume primary is non-faulty
straw man 4: with primary
pick sequence #
send *signed* message to each replica (including itself)
from here on, every message signed by sender using public key crypto
wait for 2f+1 replies
f+1 must match
reply to client
REQ MSG REP REP
C
0
1
2
3
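a tiny Go sketch of "every message signed by sender" (ed25519 is my
choice of scheme here; the notes just say public key crypto):

  package main

  import (
      "crypto/ed25519"
      "fmt"
  )

  func main() {
      // each node has its own key pair; replicas know each other's public
      // keys, and (by assumption) the attacker cannot forge signatures
      pub, priv, _ := ed25519.GenerateKey(nil) // nil reader => crypto/rand

      // e.g. the primary signing PRE-PREPARE for op "grant(lock1)", seq 7
      msg := []byte("PRE-PREPARE op=grant(lock1) n=7")
      sig := ed25519.Sign(priv, msg)

      fmt.Println(ed25519.Verify(pub, msg, sig)) // true: accept
      fmt.Println(ed25519.Verify(pub, []byte("PRE-PREPARE op=grant(lock1) n=8"), sig)) // false: tampered
  }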
what if the primary is faulty?
it might send different operations to different replicas
or send them in different orders to different replicas
or send a wrong result to the client
or do nothing
general approach to handling faulty primary
clients notify replicas of each operation, as well as primary
each replica watches progress of each operation
if no progress, force view change -> new primary
view change must sort out state of last operation
can a replica execute an operation when it first receives it from primary?
no: maybe primary gave different ops to different replicas
if we execute before we're sure, we've wrecked the replica's state
so we need a second round of messages to make sure all good replicas got the same op
straw man 5:
3f+1 servers, one is primary, f faulty, primary might be faulty
client sends request to primary AND to each replica
primary sends PRE-PREPARE(op, n) to replicas
each replica sends PREPARE(op, n) to all replicas
if replica gets PREPARE(op, n) from 2f+1 replicas (incl itself)
(with same op and n)
execute the operation, possibly modifying state
send reply to client
client is happy when it gets f+1 matching replies
REQ PRE-P PREPARE REPLY
C
0
1
2
3
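a rough Go sketch of one replica's PREPARE handling in straw man 5
(my illustration; it omits signatures, views, and client-reply plumbing,
and a real replica would also verify the PRE-PREPARE first):

  package main

  import "fmt"

  type Prepare struct {
      Op      string
      N       int // sequence number assigned by the primary
      Replica int // sender
  }

  type Replica struct {
      f        int
      prepares map[int]map[int]string // seq -> sender -> op
      executed map[int]bool
  }

  // onPrepare executes the op once 2f+1 replicas (including this one)
  // have sent PREPAREs with the same op and n; a primary that sends
  // different ops to different replicas can never reach 2f+1 matches.
  func (r *Replica) onPrepare(p Prepare) {
      if r.prepares[p.N] == nil {
          r.prepares[p.N] = map[int]string{}
      }
      r.prepares[p.N][p.Replica] = p.Op

      matching := 0
      for _, op := range r.prepares[p.N] {
          if op == p.Op {
              matching++
          }
      }
      if matching >= 2*r.f+1 && !r.executed[p.N] {
          r.executed[p.N] = true
          fmt.Printf("execute %q at seq %d, send reply to client\n", p.Op, p.N)
      }
  }

  func main() {
      r := &Replica{f: 1, prepares: map[int]map[int]string{}, executed: map[int]bool{}}
      for id := 0; id < 3; id++ { // 2f+1 = 3 matching PREPAREs (incl. its own)
          r.onPrepare(Prepare{Op: "grant(lock1)", N: 7, Replica: id})
      }
  }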
does this cope with the primary sending different ops to different replicas?
yes: replicas won't see 2f+1 matching PREPAREs
also handles primary assigning different sequence #s
result: system will stop
does this cope with the primary lying to the client about the reply?
yes: each replica sends a reply, not the primary
result: system will stop
does this cope with the primary not sending operations to replicas?
yes: replicas receive request from the client
but don't see a PRE-PREPARE, or don't see a PREPARE
result: system will stop
how to resume operation after a faulty primary?
need a view change to choose new primary
view change does *not* choose a set of "live" servers
2f+1 of 3f+1 deals with faulty servers
when to trigger a view change?
if a replica sees a client op but doesn't see 2f+1 matching PREPAREs
what if faulty replicas try to trigger a view change to shoot down a non-faulty primary?
require view change requests from 2f+1 replicas
who is the next primary?
need to make sure faulty replicas can't always make themselves next primary
view number v
primary is v mod n
so primary rotates among servers
at most f faulty primaries in a row
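  e.g. (numbers mine) n = 3f+1 = 7 and the faulty replicas happen to be
    2 and 3: views 2 and 3 get faulty primaries, but view 4's primary is
    replica 4, so there are never more than f faulty primaries in a row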
view change straw man
replicas send VIEW-CHANGE requests to *new* primary
new primary waits for 2f+1 view-change requests
new primary announces view change w/ NEW-VIEW
includes the 2f+1 VIEW-CHANGE requests
as proof that enough replicas wanted to change views
new primary starts numbering operations at last n it saw + 1
will all non-faulty replicas agree about operation numbering across view change?
problem:
I saw 2f+1 PREPAREs for operation n, so I executed it
new primary did not, so it did not execute it
maybe new primary didn't even see the PRE-PREPARE for operation n
since old primary only waited for 2f+1 replies
or old primary may never have sent PRE-PREPARE to next primary
thus new primary may start numbering at n, yielding two different op #n
can new primary ask all replicas for set of operations they have executed?
doesn't work: new primary can only wait for 2f+1 replies
the 2f+1 replies may include faulty replicas and exclude me, so new primary may never hear from me
solution:
don't execute operation until sure a new primary will hear about it
add a third phase: PRE-PREPARE, PREPARE, then COMMIT
only execute after commit
operation protocol:
client sends op to primary
primary sends PRE-PREPARE(op, n) to all
all send PREPARE(op, n) to all
after replica receives 2f+1 matching PREPARE(op, n)
send COMMIT(op, n) to all
after receiving 2f+1 matching COMMIT(op, n)
execute op
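a rough Go sketch of the three-phase rule from one replica's point of
view (my illustration; it just counts messages and omits the op-matching,
signature, and view-number checks sketched earlier):

  package main

  import "fmt"

  type Replica struct {
      f        int
      prepares map[int]int // seq -> # matching PREPAREs seen
      commits  map[int]int // seq -> # matching COMMITs seen
      sentCmt  map[int]bool
      executed map[int]bool
  }

  // onPrepare: 2f+1 matching PREPAREs make the op "prepared" here, but the
  // replica does NOT execute yet -- it only broadcasts COMMIT.
  func (r *Replica) onPrepare(n int) {
      r.prepares[n]++
      if r.prepares[n] >= 2*r.f+1 && !r.sentCmt[n] {
          r.sentCmt[n] = true
          fmt.Printf("seq %d prepared: broadcast COMMIT\n", n)
      }
  }

  // onCommit: only after 2f+1 matching COMMITs is it safe to execute,
  // because then a new primary is guaranteed to learn about the op.
  func (r *Replica) onCommit(n int) {
      r.commits[n]++
      if r.commits[n] >= 2*r.f+1 && !r.executed[n] {
          r.executed[n] = true
          fmt.Printf("seq %d committed: execute op, reply to client\n", n)
      }
  }

  func main() {
      r := &Replica{f: 1,
          prepares: map[int]int{}, commits: map[int]int{},
          sentCmt: map[int]bool{}, executed: map[int]bool{}}
      for i := 0; i < 3; i++ {
          r.onPrepare(7) // 2f+1 matching PREPAREs
      }
      for i := 0; i < 3; i++ {
          r.onCommit(7) // 2f+1 matching COMMITs
      }
  }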
view change:
each replica sends new primary 2f+1 PREPAREs for recent ops (if it has them)
new primary waits for 2f+1 VIEW-CHANGE requests
new primary sends NEW-VIEW msg to all replicas with
complete set of VIEW-CHANGE msgs
list of every op for which some VIEW-CHANGE contained 2f+1 PREPAREs
i.e. list of final ops from last view
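a rough Go sketch of the new primary assembling the NEW-VIEW op list
from the VIEW-CHANGE messages (my illustration; real PBFT view changes
also carry checkpoints and the signed PREPARE certificates themselves):

  package main

  import "fmt"

  // ViewChange from one replica: for each recent sequence number, the op
  // for which that replica holds 2f+1 matching PREPAREs.
  type ViewChange struct {
      Replica  int
      Prepared map[int]string // seq -> op with a 2f+1 PREPARE certificate
  }

  // buildNewView: given VIEW-CHANGE messages from 2f+1 replicas, carry
  // into the new view every op for which some message has a certificate.
  func buildNewView(vcs []ViewChange, f int) map[int]string {
      if len(vcs) < 2*f+1 {
          return nil // not enough VIEW-CHANGE requests yet
      }
      ops := map[int]string{}
      for _, vc := range vcs {
          for n, op := range vc.Prepared {
              ops[n] = op
          }
      }
      return ops
  }

  func main() {
      vcs := []ViewChange{
          {Replica: 0, Prepared: map[int]string{7: "grant(lock1)"}},
          {Replica: 1, Prepared: map[int]string{}},
          {Replica: 3, Prepared: map[int]string{7: "grant(lock1)"}},
      }
      fmt.Println(buildNewView(vcs, 1)) // op at seq 7 survives into the new view
  }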
informal argument why if a replica executes an op, new primary will know of that op
replica only executed after receiving 2f+1 COMMITS
maybe f of those were lies, from faulty replicas, who won't tell new primary
but f+1 COMMITs were from replicas that got matching PREPAREs from 2f+1 replicas
new primary waits for view-change requests from 2f+1 replicas
(2f+1) + (f+1) = 3f+2 > 3f+1, so the two sets overlap in at least one replica
so at least 1 of those f+1 will report its PREPARE set to the new primary
can a replica trick new primary into believing a manufactured op?
no -- replica must present 2f+1 PREPAREs for that op
the PREPAREs must have the original signatures
so the non-faulty nodes must have seen that op
can the new primary omit some of the reported recent operations?
replicas must check that the primary announced the right set
compare against the PREPARE info in the VIEW-CHANGE msgs included in the NEW-VIEW
paper also discusses
checkpoints and logs to help good nodes recover
various cryptographic optimizations
optimizations to reduce # of msgs in common case
fast read-only operations
what are the consequences of more than f corrupt servers?
can the system recover?
suppose a single server has a fail-stop fault (e.g. powered off)
can the system still survive f additional malicious faulty nodes?
will it be live?
can bad nodes trick it into executing incorrect operations?
what if the client is corrupt?
suppose we had a technique to provide Byzantine fault tolerance (BFT)
would it be useful to apply it to a set of identical servers?
how about non-identical servers run by same organization?