RICON keynote: outwards from the middle of the maze
slides from my RICON keynote

  • USER-CENTRIC
  • OMG pause here. Remember Brewer 2012? Top-down vs bottom-up designs? We had this top-down thing and it was beautiful.
  • It was so beautiful that it didn’t matter that it was somewhat ugly
  • The abstraction was so beautiful, IT DOESN'T MATTER WHAT'S UNDERNEATH. Wait, or does it? When does it?
  • We’ve known for a long time that it is hard to hide the complexities of distribution
  • Focus not on semantics, but on the properties of components: thin interfaces, understandable latency & failure modes. DEV-centric. But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  • FIX ME: Joe's idea: sketch of a castle being filled in, vs bricks. But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  • In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  • Meaning: translation
  • DS are hard because of uncertainty – nondeterminism – which is fundamental to the environment and can "leak" into the results. It's astoundingly difficult to face these demons at the same time – tempting to try to defeat them one at a time.
  • Async isn't a problem: just need to be careful to number messages and interleave correctly. Ignore arrival order. Whoa, this is easy so far.
  • Failure isn't a problem: just do redundant computation and store redundant data. Make more copies than there will be failures. I win.
  • We can't do deterministic interleaving if producers may fail. Nondeterministic message order makes it hard to keep replicas in agreement.
  • To guard against failures, we replicate. NB: asynchrony => replicas might not agree
  • Very similar looking criteria (1 safe 1 live). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions
  • FIX: make it about translation vs. prayer
  • I.e., reorderability, batchability, tolerance to duplication / retry. Now the programmer must map from application invariants to the object API (with richer semantics than read/write).
  • Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  • However, not sufficient to synchronize GC. Perhaps more importantly, not *compositional* -- what guarantees does my app – pieced together from many convergent objects – give? To reason compositionally, we need guarantees about what comes OUT of my objects, and how it transits the app. *** main point to make here: we'd like to reason backwards from the outcomes, at the level of abstraction of the application.
  • We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs).
  • Confluence is compositional: Composing confluent components yields a confluent dataflow
  • All of these components are confluent! Composing confluent components yields a confluent dataflow. But annotations are burdensome.
  • A separate question is choosing a coordination strategy that "fits" the problem without "overpaying." For example, we could establish a global ordering of messages, but that would essentially cost us what linearizable storage cost us. We can solve the GC problem with SEALING: establishing a big barrier; damming the stream.
  • M – a semantic property of code – implies confluence. An appropriately constrained language provides a conservative syntactic test for M.
  • Also note that a data-centric language gives us the dataflow graph automatically, via dependencies (across LOC, modules, processes, nodes, etc.)
  • Try not to use it! Learn how to choose it. Tools help!
  • Start with a hard problem. Hard problem: does my FT protocol work? Harder: is the composition of my components FT?
  • Point: we need to replicate data to both copies of a replica. We need to commit multiple partitions together.
  • Start with a hard problem. Hard problem: does my FT protocol work? Harder: is the composition of my components FT?
  • Examples! 2PC and replication. Properties, etc.
  • Talk about speed too.
  • After all, FT is an end-to-end concern.
  • (synchronous)
  • TALK ABOUT SAT!!!

RICON keynote: outwards from the middle of the maze Presentation Transcript

  • 1. Outwards from the middle of the maze Peter Alvaro UC Berkeley
  • 2. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 3. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 4. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 5. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 6. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
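The four slides above step through Jim Gray's DEBIT_CREDIT example one highlight at a time. For readers skimming the flattened transcript, here is a minimal Python sketch of the same all-or-nothing contract, assuming a hypothetical in-memory store and an illustrative transaction() helper; it shows the shape of the idea, not Gray's original syntax.

    from contextlib import contextmanager

    # Hypothetical in-memory tables standing in for the opaque store.
    accounts = {"acct-1": 100}
    history = []
    cash_drawer = {"teller-1": 0}
    branch_balance = {"branch-1": 0}

    @contextmanager
    def transaction():
        # Sketch of BEGIN_TRANSACTION / COMMIT with abort-on-error:
        # snapshot the state and restore it if the body raises.
        snapshot = (dict(accounts), list(history),
                    dict(cash_drawer), dict(branch_balance))
        try:
            yield
        except Exception:
            accounts.clear(); accounts.update(snapshot[0])
            history[:] = snapshot[1]
            cash_drawer.clear(); cash_drawer.update(snapshot[2])
            branch_balance.clear(); branch_balance.update(snapshot[3])
            raise

    def debit_credit(account, delta, teller, branch):
        with transaction():
            if account not in accounts or accounts[account] + delta < 0:
                return "NEGATIVE RESPONSE"
            accounts[account] += delta
            history.append((account, delta))
            cash_drawer[teller] += delta
            branch_balance[branch] += delta
            return "NEW BALANCE = %d" % accounts[account]

    print(debit_credit("acct-1", -30, "teller-1", "branch-1"))   # NEW BALANCE = 70

The point of the slides is that the application asserts its invariant (balance > 0) once, inside the transaction, and the opaque store upholds it.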
  • 7. The “top-down” ethos
  • 8. The “top-down” ethos
  • 9. The “top-down” ethos
  • 10. The “top-down” ethos
  • 11. The “top-down” ethos
  • 12. The “top-down” ethos
  • 13. Transactions: a holistic contract Write Read Application Opaque store Transactions
  • 14. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 15. Transactions: a holistic contract Assert: balance > 0 Write Read Application Opaque store Transactions
  • 16. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 17. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 18. Incidental complexities • The “Internet.” Searching it. • Cross-datacenter replication schemes • CAP Theorem • Dynamo & MapReduce • “Cloud”
  • 19. Fundamental complexity “[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.” Jim Waldo et al., A Note on Distributed Computing (1994)
  • 20. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  • 21. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  • 22. Are you blithely asserting that transactions aren’t webscale? Some people just want to see the world burn. Those same people want to see the world use inconsistent databases. - Emin Gun Sirer
  • 23. Alternative to top-down design? The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.
  • 24. Alternative: the “bottom-up,” systems ethos
  • 25. The “bottom-up” ethos
  • 26. The “bottom-up” ethos
  • 27. The “bottom-up” ethos
  • 28. The “bottom-up” ethos
  • 29. The “bottom-up” ethos
  • 30. The “bottom-up” ethos
  • 31. The “bottom-up” ethos “‘Tis a fine barn, but sure ‘tis no castle, English”
  • 32. The “bottom-up” ethos Simple, reusable components first. Semantics later. This is how we live now. Question: Do we ever get those application-level guarantees back?
  • 33. Low-level contracts Write Read Application Distributed store KVS
  • 34. Low-level contracts Write Read Application Distributed store KVS
  • 35. Low-level contracts Write Read Application Distributed store KVS R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 36. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 37. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 causal? PRAM? delta? fork/join? red/blue? Release? R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 38. When do contracts compose? Application Distributed service Assert: balance > 0
  • 39. ew, did I get mongo in my riak? Assert: balance > 0
  • 40. Composition is the last hard problem Composing modules is hard enough We must learn how to compose guarantees
  • 41. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 42. Why distributed systems are hard² Asynchrony Partial Failure Fundamental Uncertainty
  • 43. Asynchrony isn't that hard Amelioration: Logical timestamps Deterministic interleaving
  • 44. Partial failure isn't that hard Amelioration: Replication Replay
  • 45. (asynchrony * partial failure) = hard² Logical timestamps Deterministic interleaving Replication Replay
  • 46. (asynchrony * partial failure) = hard² Logical timestamps Deterministic interleaving Replication Replay
  • 47. (asynchrony * partial failure) = hard² Tackling one clown at a time Poor strategy for programming distributed systems Winning strategy for analyzing distributed programs
  • 48. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 49. Distributed consistency Today: A quick summary of some great work.
  • 50. Consider a (distributed) graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 51. Partitioned, for scalability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 52. Replicated, for availability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 53. Deadlock detection Task: Identify strongly-connected components Waits-for graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 54. Garbage collection Task: Identify nodes not reachable from Root. Root Refers-to graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 55. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  • 56. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  • 57. Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Root
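A small Python sketch of the two graph tasks contrasted on the slides above; the toy waits_for and refers_to graphs are invented for illustration. Deadlock detection asks whether a transaction can transitively wait for itself (a cycle); garbage collection asks which nodes are unreachable from Root.

    # Toy graphs, invented for illustration: adjacency lists keyed by node name.
    waits_for = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"], "T4": []}
    refers_to = {"Root": ["T1"], "T1": ["T2"], "T2": [], "T3": ["T4"], "T4": []}

    def reachable(graph, start):
        # Depth-first reachability (includes the start node itself).
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, []))
        return seen

    def deadlocks(graph):
        # Deadlock detection: a transaction is deadlocked iff something it
        # waits for can reach it again (it sits on a waits-for cycle).
        return {t for t in graph
                if any(t in reachable(graph, succ) for succ in graph.get(t, []))}

    def garbage(graph, root):
        # GC safety: never reclaim anything reachable from the root.
        # GC liveness: everything else is orphaned and may be reclaimed.
        return set(graph) - reachable(graph, root)

    print(deadlocks(waits_for))          # {'T1', 'T2', 'T3'}
    print(garbage(refers_to, "Root"))    # {'T3', 'T4'}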
  • 58. Consistency at the extremes Application Language Custom solutions? Flow Object Storage Linearizable key-value store?
  • 59. Consistency at the extremes Application Language Custom solutions? Flow Object Storage Linearizable key-value store?
  • 60. Consistency at the extremes Application Language Custom solutions? Flow Efficient Object Correct Storage Linearizable key-value store?
  • 61. Object-level consistency Capture semantics of data structures that • allow greater concurrency • maintain guarantees (e.g. convergence) Application Language Flow Object Storage
  • 62. Object-level consistency
  • 63. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 64. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 65. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 66. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence Reordering Batching Retry/duplication Tolerant to
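A minimal sketch of the kind of convergent object these slides describe, assuming a grow-only set (G-Set) CRDT; the class and method names are illustrative rather than taken from any particular library. Because merge is set union, it is commutative, associative, and idempotent, which is exactly what makes the object tolerant to reordering, batching, and retry/duplication.

    class GSet:
        """Grow-only set CRDT sketch: state is a set, merge is union."""
        def __init__(self, elems=()):
            self.elems = set(elems)

        def insert(self, x):
            self.elems.add(x)

        def merge(self, other):
            # Union is commutative, associative, and idempotent, so any
            # delivery order (or re-delivery) of replica states converges.
            self.elems |= other.elems

        def read(self):
            return frozenset(self.elems)

    # Two replicas see the same inserts in different orders, with a duplicate...
    a, b = GSet(), GSet()
    a.insert("x"); a.insert("y")
    b.insert("y"); b.insert("x"); b.insert("x")
    a.merge(b); b.merge(a)
    assert a.read() == b.read()   # ...and still converge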
  • 67. Object-level composition? Application Convergent data structures Assert: Graph replicas converge
  • 68. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed Assert: Graph replicas converge
  • 69. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed ? ? Assert: Graph replicas converge
  • 70. Flow-level consistency Application Language Flow Object Storage
  • 71. Flow-level consistency Capture semantics of data in motion • Asynchronous dataflow model • component properties → system-wide guarantees Graph store Transaction manager Transitive closure Deadlock detector
  • 72. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 73. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 74. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 75. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 76. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) =
  • 77. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) { } = { }
  • 78. Confluence is compositional output set = f ∘ g(input set)
  • 79. Confluence is compositional output set = f ∘ g(input set)
  • 80. Confluence is compositional output set = f ∘ g(input set)
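A sketch of confluence as defined on the preceding slides, using hypothetical operators: a confluent component's eventual output set depends only on its input set, not on arrival order, and the composition of confluent components is again confluent.

    from itertools import permutations

    def selector(inputs):
        # Confluent: the output set is a function of the input set (a filter).
        return {x for x in inputs if x % 2 == 0}

    def doubler(inputs):
        # Also confluent: a pointwise map over the input set.
        return {2 * x for x in inputs}

    def composed(inputs):
        # Composing confluent components yields a confluent dataflow (f . g).
        return doubler(selector(inputs))

    def run_streaming(component, stream):
        # Deliver inputs one at a time in the given order and union together
        # everything the component ever emits.
        seen, emitted = set(), set()
        for msg in stream:
            seen.add(msg)
            emitted |= component(seen)
        return frozenset(emitted)

    msgs = [3, 4, 7, 10]
    results = {run_streaming(composed, order) for order in permutations(msgs)}
    assert len(results) == 1   # order-insensitivity: output set = f(input set)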
  • 81. Graph queries as dataflow Graph store Memory allocator Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent
  • 82. Graph queries as dataflow Graph store Memory allocator Confluent Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Coordinate here
  • 83. Coordination: what is that? Strategy 1: Establish a total order Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent
  • 84. Coordination: what is that? Strategy 2: Establish a producer-consumer barrier Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent
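A sketch of Strategy 2, the producer-consumer barrier (what the speaker notes call sealing): the non-confluent consumer simply refuses to act until the confluent producer declares its stream complete. The SealedStream name and API are invented for illustration.

    class SealedStream:
        """A stream that can be dammed: consume only after seal()."""
        def __init__(self):
            self.items, self.sealed = set(), False

        def produce(self, x):
            assert not self.sealed, "no new inputs after the barrier"
            self.items.add(x)

        def seal(self):
            # The barrier: the producer promises its output set is complete.
            self.sealed = True

        def consume(self):
            # The non-confluent stage (e.g. the garbage collector) waits here
            # instead of paying for a global message order.
            assert self.sealed, "coordinate here: wait for the barrier"
            return frozenset(self.items)

    live = SealedStream()
    for node in ("Root", "T1", "T2"):     # confluent upstream: reachable nodes
        live.produce(node)
    live.seal()
    all_nodes = {"Root", "T1", "T2", "T3", "T4"}
    print(all_nodes - live.consume())     # safe to reclaim: {'T3', 'T4'}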
  • 85. Fundamental costs: FT via replication (mostly) free! Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Graph store Transitive closure Deadlock detector Confluent Confluent Confluent
  • 86. Fundamental costs: FT via replication global synchronization! Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Garbage Collector Confluent Not Confluent Confluent Paxos Not Confluent
  • 87. Fundamental costs: FT via replication The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton Garbage Collector Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Confluent Not Confluent Confluent Barrier Not Confluent Barrier
  • 88. Language-level consistency DSLs for distributed programming? • Capture consistency concerns in the type system Application Language Flow Object Storage
  • 89. Language-level consistency CALM Theorem: Monotonic → confluent Conservative, syntactic test for confluence
  • 90. Language-level consistency Deadlock detector Garbage collector
  • 91. Language-level consistency Deadlock detector Garbage collector nonmonotonic
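A sketch of why the deadlock detector passes the CALM test while the garbage collector does not, using set comprehensions as stand-ins for Datalog rules over an invented toy graph. Transitive closure is monotonic: new edges can only add facts. "Not reachable" is nonmonotonic: a node judged garbage can be un-judged by a later edge, which is where coordination (the barrier above) is needed.

    def transitive_closure(edges):
        # Monotonic: adding an edge can only add pairs to the closure,
        # so any arrival order converges to the same answer (CALM).
        closure = set(edges)
        while True:
            new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
            if new <= closure:
                return closure
            closure |= new

    def unreachable(edges, root, nodes):
        # Nonmonotonic: a node judged garbage now may become reachable when
        # one more edge arrives, so the answer needs a sealed input (a barrier).
        reach = {root} | {d for (s, d) in transitive_closure(edges) if s == root}
        return nodes - reach

    nodes = {"Root", "T1", "T2", "T3"}
    print(unreachable({("Root", "T1")}, "Root", nodes))                # {'T2', 'T3'}
    print(unreachable({("Root", "T1"), ("T1", "T2")}, "Root", nodes))  # {'T3'}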
  • 92. Let’s review • Consistency is tolerance to asynchrony • Tricks: – focus on data in motion, not at rest – avoid coordination when possible – choose coordination carefully otherwise (Tricks are great, but tools are better)
  • 93. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 94. Grand challenge: composition Hard problem: Is a given component fault-tolerant? Much harder: Is this system (built up from components) fault-tolerant?
  • 95. Example: Atomic multi-partition update T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Two-phase commit
  • 96. Example: replication T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Reliable broadcast
  • 97. Popular wisdom: don’t reinvent
  • 98. Example: Kafka replication bug Three “correct” components: 1. Primary/backup replication 2. Timeout-based failure detectors 3. Zookeeper One nasty bug: Acknowledged writes are lost
  • 99. A guarantee would be nice Bottom up approach: • use formal methods to verify individual components (e.g. protocols) • Build systems from verified components Shortcomings: • Hard to use • Hard to compose Investment Returns
  • 100. Bottom-up assurances Formal verification Environment Program Correctness Spec
  • 101. Composing bottom-up assurances
  • 102. Composing bottom-up assurances Issue 1: incompatible failure models eg, crash failure vs. omissions Issue 2: Specs do not compose (FT is an end-to-end property) If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess. -- Butler Lampson
  • 103. Composing bottom-up assurances
  • 104. Composing bottom-up assurances
  • 105. Composing bottom-up assurances
  • 106. Top-down “assurances”
  • 107. Top-down “assurances” Testing
  • 108. Top-down “assurances” Fault injection Testing
  • 109. Top-down “assurances” Fault injection Testing
  • 110. End-to-end testing would be nice Top-down approach: • Build a large-scale system • Test the system under faults Shortcomings: • Hard to identify complex bugs • Fundamentally incomplete Investment Returns
  • 111. Lineage-driven fault injection Goal: top-down testing that • finds all of the fault-tolerance bugs, or • certifies that none exist
  • 112. Lineage-driven fault injection Correctness Specification Malevolent sentience Molly
  • 113. Lineage-driven fault injection Molly Correctness Specification Malevolent sentience
  • 114. Lineage-driven fault injection (LDFI) Approach: think backwards from outcomes Question: could a bad thing ever happen? Reframe: • Why did a good thing happen? • What could have gone wrong along the way?
  • 115. Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.
  • 116. The game • Both players agree on a failure model • The programmer provides a protocol • The adversary observes executions and chooses failures for the next execution.
  • 117. Dedalus: it’s about data log(B, “data”)@5 What Where When Some data
  • 118. Dedalus: it's like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload);
  • 119. Dedalus: it's like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload); (Which is like SQL) create view log as select Node, Pload from bcast;
  • 120. Dedalus: it's about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
  • 121. Dedalus: it's about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); State change Natural join (bcast.Node1 == node.Node1) Communication
  • 122. The match Protocol: Reliable broadcast Specification: Pre: A correct process delivers a message m Post: All correct process delivers m Failure Model: (Permanent) crash failures Message loss / partitions
  • 123. Round 1 node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node, Pload)@next :- log(Node, Pload); log(Node, Pload) :- bcast(Node, Pload); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); "An effort" delivery protocol
  • 124. Round 1 in space / time Process b Process a Process c 2 1 2 log log
  • 125. Round 1: Lineage log(B, data)@5
  • 126. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  • 127. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3
  • 128. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2
  • 129. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1; log(A, data)@1
  • 130. An execution is a (fragile) "proof" of an outcome log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 (which required a message from A to B at time 1)
  • 131. Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”
  • 132. Round 1: counterexample Process b Process a Process c 1 2 log (LOST) log The adversary wins!
  • 133. Round 2 Same as Round 1, but A retries. bcast(N, P)@next :- bcast(N, P);
  • 134. Round 2 in spacetime Process b Process a Process c 2 3 4 5 1 2 3 4 2 3 4 5 log log log log log log log log
  • 135. Round 2 log(B, data)@5
  • 136. Round 2 log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  • 137. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
  • 138. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3
  • 139. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2
  • 140. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  • 141. Round 2 Retry provides redundancy in time log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  • 142. Traces are forests of proof trees log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 node(A, B)@1 r3 node(A, B)@2 AB2 r2 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 AB3 r2 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 r1 log(A, data)@4 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 r3 node(A, B)@4 AB4 r2 log(B, data)@5 AB1 ^ AB2 ^ AB3 ^ AB4
  • 143. Traces are forests of proof trees log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 node(A, B)@1 r3 node(A, B)@2 AB2 r2 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 AB3 r2 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 r1 log(A, data)@4 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 r3 node(A, B)@4 AB4 r2 log(B, data)@5 AB1 ^ AB2 ^ AB3 ^ AB4
  • 144. Round 2: counterexample Process b Process a Process c 1 log (LOST) log CRASHED 2 The adversary wins!
  • 145. Round 3 Same as in Round 2, but symmetrical. bcast(N, P)@next :- log(N, P);
  • 146. Round 3 in space / time Process b Process a Process c 2 3 4 5 1 log log 2 3 4 5 2 3 4 5 log log log log log log log log log log log log log log log log log log Redundancy in space and time
  • 147. Round 3 -- lineage log(B, data)@5
  • 148. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4
  • 149. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 Log(B, data)@3 log(A, data)@3 log(C, data)@3
  • 150. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 Log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B,data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  • 151. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 Log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B,data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  • 152. Round 3 The programmer wins!
  • 153. Let’s reflect Fault-tolerance is redundancy in space and time. Best strategy for both players: reason backwards from outcomes using lineage Finding bugs: find a set of failures that “breaks” all derivations Fixing bugs: add additional derivations
  • 154. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. (AB1 ∨ BC2) Disjunction
  • 155. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
  • 156. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
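A sketch of the adversary's job as a CNF problem, solved here by brute force rather than a real SAT solver; the message names AB1, AC1, AC2, and BC2 follow the slides, and everything else is illustrative.

    from itertools import chain, combinations

    # Each proof of the good outcome is the set of messages it relies on;
    # dropping any one of them breaks that proof (a disjunction of drops).
    proofs = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]

    def breaks_all(drops):
        # A set of dropped messages is a counterexample candidate iff it hits
        # every proof, i.e. it satisfies (AB1 v BC2) ^ (AC1) ^ (AC2).
        return all(drops & proof for proof in proofs)

    messages = set().union(*proofs)
    candidates = chain.from_iterable(
        combinations(sorted(messages), k) for k in range(1, len(messages) + 1))
    hits = [set(c) for c in candidates if breaks_all(set(c))]
    print(min(hits, key=len))   # a smallest failure set, e.g. {'AB1', 'AC1', 'AC2'}

In the game described earlier, the chosen failure set drives the next execution: if the good outcome survives, its new lineage contributes more clauses and the adversary tries again.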
  • 157. Molly, the LDFI prototype Molly finds fault-tolerance violations quickly or guarantees that none exist. Molly finds bugs by explaining good outcomes – then it explains the bugs. Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka Certified correct: paxos (synod), Flux, bully leader election, reliable broadcast
  • 158. Commit protocols Problem: Atomically change things Correctness properties: 1. Agreement (All or nothing) 2. Termination (Something)
  • 159. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit
  • 160. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it?
  • 161. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN
  • 162. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN Well I’m gone
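A compressed Python sketch of the message exchange in the 2PC diagrams above, with a crash flag that reproduces the blocking scenario shown on the next slide; the function and parameter names are illustrative.

    def two_phase_commit(agents, votes, coordinator_crashes_after_votes=False):
        # Phase 1: coordinator sends PREPARE; each agent answers with its vote
        # (True means "yes, I can commit").
        collected = {a: votes[a] for a in agents}
        if coordinator_crashes_after_votes:
            # The coordinator dies before announcing a decision: agents that
            # voted yes can neither commit nor abort on their own.
            return {a: "blocked (termination violated)" for a in agents}
        # Phase 2: commit iff every vote was yes, otherwise abort; the same
        # decision goes to every agent (agreement).
        decision = "commit" if all(collected.values()) else "abort"
        return {a: decision for a in agents}

    agents = ["a", "b", "d"]
    yes = {"a": True, "b": True, "d": True}
    print(two_phase_commit(agents, yes))
    print(two_phase_commit(agents, yes, coordinator_crashes_after_votes=True))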
  • 163. Two-phase commit Agent a Agent a Coordinator Agent d 2 2 1 p p p 3 CRASHED 2 v v v Violation: Termination
  • 164. The collaborative termination protocol Basic idea: Agents talk amongst themselves when the coordinator fails. Protocol: On timeout, ask other agents about decision.
  • 165. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req
  • 166. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req Can I kick it? YES YOU CAN ……?
  • 167. 3PC Basic idea: Add a round, a state, and simple failure detectors (timeouts). Protocol: 1. Phase 1: Just like in 2PC – Agent timeout → abort 2. Phase 2: send canCommit, collect acks – Agent timeout → commit 3. Phase 3: Just like phase 2 of 2PC
  • 168. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit
  • 169. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit Timeout → Abort Timeout → Commit
  • 170. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg
  • 171. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision
  • 172. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  • 173. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  • 174. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort Agents A & B decide to commit
  • 175. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w
  • 176. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition
  • 177. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica
  • 178. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write
  • 179. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write Data loss
  • 180. Molly summary Lineage allows us to reason backwards from good outcomes Molly: surgically-targeted fault injection Investment similar to testing Returns similar to formal methods
  • 181. Where we’ve been; where we’re headed 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 182. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 183. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 184. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. (asynchrony X partial failure) = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 185. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 186. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  • 187. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  • 188. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes
  • 189. Remember 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes Composition is the hardest problem
  • 190. A happy crisis Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”