RICON keynote: outwards from the middle of the maze
slides from my RICON keynote

  • USER-CENTRIC
  • OMG pause here. Remember Brewer 2012? Top-down vs bottom-up designs? We had this top-down thing and it was beautiful.
  • It was so beautiful that it didn’t matter that it was somewhat ugly
  • The abstraction was so beautiful, IT DOESN'T MATTER WHAT'S UNDERNEATH. Wait, or does it? When does it?
  • We’ve known for a long time that it is hard to hide the complexities of distribution
  • Focus not on semantics, but on the properties of components: thin interfaces, understandable latency & failure modes. DEV-centric. But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  • FIX ME: Joe's idea: sketch of a castle being filled in, vs bricks. But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  • In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  • Meaning: translation
  • DS are hard because of uncertainty – nondeterminism – which is fundamental to the environment and can "leak" into the results. It's astoundingly difficult to face these demons at the same time – tempting to try to defeat them one at a time.
  • Async isn't a problem: just need to be careful to number messages and interleave correctly. Ignore arrival order. Whoa, this is easy so far.
  • Failure isn't a problem: just do redundant computation and store redundant data. Make more copies than there will be failures. I win.
  • We can't do deterministic interleaving if producers may fail. Nondeterministic message order makes it hard to keep replicas in agreement.
  • To guard against failures, we replicate. NB: asynchrony => replicas might not agree
  • Very similar looking criteria (1 safe 1 live). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions
  • FIX: make it about translation vs. prayer
  • I.e., reorderability, batchability, tolerance to duplication / retry. Now the programmer must map from application invariants to the object API (with richer semantics than read/write).
  • Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  • However, not sufficient to synchronize GC. Perhaps more importantly, not *compositional* -- what guarantees does my app – pieced together from many convergent objects – give? To reason compositionally, we need guarantees about what comes OUT of my objects, and how it transits the app. *** main point to make here: we'd like to reason backwards from the outcomes, at the level of abstraction of the application.
  • We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs).
  • Confluence is compositional: Composing confluent components yields a confluent dataflow
  • All of these components are confluent! Composing confluent components yields a confluent dataflow. But annotations are burdensome.
  • A separate question is choosing a coordination strategy that "fits" the problem without "overpaying." For example, we could establish a global ordering of messages, but that would essentially cost us what linearizable storage cost us. We can solve the GC problem with SEALING: establishing a big barrier; damming the stream.
  • M – a semantic property of code – implies confluence. An appropriately constrained language provides a conservative syntactic test for M.
  • Also note that a data-centric language gives us the dataflow graph automatically, via dependencies (across LOC, modules, processes, nodes, etc.)
  • Try not to use it! Learn how to choose it. Tools help!
  • Start with a hard problem. Hard problem: does my FT protocol work? Harder: is the composition of my components FT?
  • Point: we need to replicate data to both copies of a replica. We need to commit multiple partitions together.
  • Start with a hard problem. Hard problem: does my FT protocol work? Harder: is the composition of my components FT?
  • Examples! 2PC and replication. Properties, etc.
  • Talk about speed too.
  • After all, FT is an end-to-end concern.
  • (synchronous)
  • TALK ABOUT SAT!!!

RICON keynote: outwards from the middle of the maze Presentation Transcript

  • 1. Outwards from the middle of the maze Peter Alvaro UC Berkeley
  • 2. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 3. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 4. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 5. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 6. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
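The four slides above step through Jim Gray's DEBIT_CREDIT example one highlight at a time. For readers skimming the flattened transcript, here is a minimal Python sketch of the same all-or-nothing contract, assuming a hypothetical in-memory store and an illustrative transaction() helper; it shows the shape of the idea, not Gray's original syntax.

    from contextlib import contextmanager

    # Hypothetical in-memory tables standing in for the opaque store.
    accounts = {"acct-1": 100}
    history = []
    cash_drawer = {"teller-1": 0}
    branch_balance = {"branch-1": 0}

    @contextmanager
    def transaction():
        # Sketch of BEGIN_TRANSACTION / COMMIT with abort-on-error:
        # snapshot the state and restore it if the body raises.
        snapshot = (dict(accounts), list(history),
                    dict(cash_drawer), dict(branch_balance))
        try:
            yield
        except Exception:
            accounts.clear(); accounts.update(snapshot[0])
            history[:] = snapshot[1]
            cash_drawer.clear(); cash_drawer.update(snapshot[2])
            branch_balance.clear(); branch_balance.update(snapshot[3])
            raise

    def debit_credit(account, delta, teller, branch):
        with transaction():
            if account not in accounts or accounts[account] + delta < 0:
                return "NEGATIVE RESPONSE"
            accounts[account] += delta
            history.append((account, delta))
            cash_drawer[teller] += delta
            branch_balance[branch] += delta
            return "NEW BALANCE = %d" % accounts[account]

    print(debit_credit("acct-1", -30, "teller-1", "branch-1"))   # NEW BALANCE = 70

The point of the slides is that the application asserts its invariant (balance > 0) once, inside the transaction, and the opaque store upholds it.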
  • 7. The “top-down” ethos
  • 8. The “top-down” ethos
  • 9. The “top-down” ethos
  • 10. The “top-down” ethos
  • 11. The “top-down” ethos
  • 12. The “top-down” ethos
  • 13. Transactions: a holistic contract Write Read Application Opaque store Transactions
  • 14. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 15. Transactions: a holistic contract Assert: balance > 0 Write Read Application Opaque store Transactions
  • 16. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 17. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 18. Incidental complexities • The “Internet.” Searching it. • Cross-datacenter replication schemes • CAP Theorem • Dynamo & MapReduce • “Cloud”
  • 19. Fundamental complexity “[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.” Jim Waldo et al., A Note on Distributed Computing (1994)
  • 20. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  • 21. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  • 22. Are you blithely asserting that transactions aren’t webscale? Some people just want to see the world burn. Those same people want to see the world use inconsistent databases. - Emin Gun Sirer
  • 23. Alternative to top-down design? The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.
  • 24. Alternative: the “bottom-up,” systems ethos
  • 25. The “bottom-up” ethos
  • 26. The “bottom-up” ethos
  • 27. The “bottom-up” ethos
  • 28. The “bottom-up” ethos
  • 29. The “bottom-up” ethos
  • 30. The “bottom-up” ethos
  • 31. The “bottom-up” ethos “‘Tis a fine barn, but sure ‘tis no castle, English”
  • 32. The “bottom-up” ethos Simple, reusable components first. Semantics later. This is how we live now. Question: Do we ever get those application-level guarantees back?
  • 33. Low-level contracts Write Read Application Distributed store KVS
  • 34. Low-level contracts Write Read Application Distributed store KVS
  • 35. Low-level contracts Write Read Application Distributed store KVS R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 36. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 37. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 causal? PRAM? delta? fork/join? red/blue? Release? R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 38. When do contracts compose? Application Distributed service Assert: balance > 0
  • 39. ew, did I get mongo in my riak? Assert: balance > 0
  • 40. Composition is the last hard problem Composing modules is hard enough We must learn how to compose guarantees
  • 41. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 42. Why distributed systems are hard² Asynchrony Partial Failure Fundamental Uncertainty
  • 43. Asynchrony isn't that hard Amelioration: Logical timestamps Deterministic interleaving
  • 44. Partial failure isn't that hard Amelioration: Replication Replay
  • 45. (asynchrony * partial failure) = hard² Logical timestamps Deterministic interleaving Replication Replay
  • 46. (asynchrony * partial failure) = hard² Logical timestamps Deterministic interleaving Replication Replay
  • 47. (asynchrony * partial failure) = hard² Tackling one clown at a time Poor strategy for programming distributed systems Winning strategy for analyzing distributed programs
  • 48. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 49. Distributed consistency Today: A quick summary of some great work.
  • 50. Consider a (distributed) graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 51. Partitioned, for scalability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 52. Replicated, for availability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 53. Deadlock detection Task: Identify strongly-connected components Waits-for graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 54. Garbage collection Task: Identify nodes not reachable from Root. Root Refers-to graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 55. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  • 56. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  • 57. Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Root
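A small Python sketch of the two graph tasks contrasted on the slides above; the toy waits_for and refers_to graphs are invented for illustration. Deadlock detection asks whether a transaction can transitively wait for itself (a cycle); garbage collection asks which nodes are unreachable from Root.

    # Toy graphs, invented for illustration: adjacency lists keyed by node name.
    waits_for = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"], "T4": []}
    refers_to = {"Root": ["T1"], "T1": ["T2"], "T2": [], "T3": ["T4"], "T4": []}

    def reachable(graph, start):
        # Depth-first reachability (includes the start node itself).
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, []))
        return seen

    def deadlocks(graph):
        # Deadlock detection: a transaction is deadlocked iff something it
        # waits for can reach it again (it sits on a waits-for cycle).
        return {t for t in graph
                if any(t in reachable(graph, succ) for succ in graph.get(t, []))}

    def garbage(graph, root):
        # GC safety: never reclaim anything reachable from the root.
        # GC liveness: everything else is orphaned and may be reclaimed.
        return set(graph) - reachable(graph, root)

    print(deadlocks(waits_for))          # {'T1', 'T2', 'T3'}
    print(garbage(refers_to, "Root"))    # {'T3', 'T4'}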
  • 58. Consistency at the extremes Application Language Custom solutions? Flow Object Storage Linearizable key-value store?
  • 59. Consistency at the extremes Application Language Custom solutions? Flow Object Storage Linearizable key-value store?
  • 60. Consistency at the extremes Application Language Custom solutions? Flow Efficient Object Correct Storage Linearizable key-value store?
  • 61. Object-level consistency Capture semantics of data structures that • allow greater concurrency • maintain guarantees (e.g. convergence) Application Language Flow Object Storage
  • 62. Object-level consistency
  • 63. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 64. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 65. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 66. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence Reordering Batching Retry/duplication Tolerant to
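A minimal sketch of the kind of convergent object these slides describe, assuming a grow-only set (G-Set) CRDT; the class and method names are illustrative rather than taken from any particular library. Because merge is set union, it is commutative, associative, and idempotent, which is exactly what makes the object tolerant to reordering, batching, and retry/duplication.

    class GSet:
        """Grow-only set CRDT sketch: state is a set, merge is union."""
        def __init__(self, elems=()):
            self.elems = set(elems)

        def insert(self, x):
            self.elems.add(x)

        def merge(self, other):
            # Union is commutative, associative, and idempotent, so any
            # delivery order (or re-delivery) of replica states converges.
            self.elems |= other.elems

        def read(self):
            return frozenset(self.elems)

    # Two replicas see the same inserts in different orders, with a duplicate...
    a, b = GSet(), GSet()
    a.insert("x"); a.insert("y")
    b.insert("y"); b.insert("x"); b.insert("x")
    a.merge(b); b.merge(a)
    assert a.read() == b.read()   # ...and still converge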
  • 67. Object-level composition? Application Convergent data structures Assert: Graph replicas converge
  • 68. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed Assert: Graph replicas converge
  • 69. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed ? ? Assert: Graph replicas converge
  • 70. Flow-level consistency Application Language Flow Object Storage
  • 71. Flow-level consistency Capture semantics of data in motion • Asynchronous dataflow model • component properties → system-wide guarantees Graph store Transaction manager Transitive closure Deadlock detector
  • 72. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 73. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 74. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 75. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 76. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) =
  • 77. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) { } = { }
  • 78. Confluence is compositional output set = f ∘ g(input set)
  • 79. Confluence is compositional output set = f ∘ g(input set)
  • 80. Confluence is compositional output set = f ∘ g(input set)
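A sketch of confluence as defined on the preceding slides, using hypothetical operators: a confluent component's eventual output set depends only on its input set, not on arrival order, and the composition of confluent components is again confluent.

    from itertools import permutations

    def selector(inputs):
        # Confluent: the output set is a function of the input set (a filter).
        return {x for x in inputs if x % 2 == 0}

    def doubler(inputs):
        # Also confluent: a pointwise map over the input set.
        return {2 * x for x in inputs}

    def composed(inputs):
        # Composing confluent components yields a confluent dataflow (f . g).
        return doubler(selector(inputs))

    def run_streaming(component, stream):
        # Deliver inputs one at a time in the given order and union together
        # everything the component ever emits.
        seen, emitted = set(), set()
        for msg in stream:
            seen.add(msg)
            emitted |= component(seen)
        return frozenset(emitted)

    msgs = [3, 4, 7, 10]
    results = {run_streaming(composed, order) for order in permutations(msgs)}
    assert len(results) == 1   # order-insensitivity: output set = f(input set)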
  • 81. Graph queries as dataflow Graph store Memory allocator Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent
  • 82. Graph queries as dataflow Graph store Memory allocator Confluent Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Coordinate here
  • 83. Coordination: what is that? Strategy 1: Establish a total order Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent
  • 84. Coordination: what is that? Strategy 2: Establish a producer-consumer barrier Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent
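A sketch of Strategy 2, the producer-consumer barrier (what the speaker notes call sealing): the non-confluent consumer simply refuses to act until the confluent producer declares its stream complete. The SealedStream name and API are invented for illustration.

    class SealedStream:
        """A stream that can be dammed: consume only after seal()."""
        def __init__(self):
            self.items, self.sealed = set(), False

        def produce(self, x):
            assert not self.sealed, "no new inputs after the barrier"
            self.items.add(x)

        def seal(self):
            # The barrier: the producer promises its output set is complete.
            self.sealed = True

        def consume(self):
            # The non-confluent stage (e.g. the garbage collector) waits here
            # instead of paying for a global message order.
            assert self.sealed, "coordinate here: wait for the barrier"
            return frozenset(self.items)

    live = SealedStream()
    for node in ("Root", "T1", "T2"):     # confluent upstream: reachable nodes
        live.produce(node)
    live.seal()
    all_nodes = {"Root", "T1", "T2", "T3", "T4"}
    print(all_nodes - live.consume())     # safe to reclaim: {'T3', 'T4'}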
  • 85. Fundamental costs: FT via replication (mostly) free! Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Graph store Transitive closure Deadlock detector Confluent Confluent Confluent
  • 86. Fundamental costs: FT via replication global synchronization! Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Garbage Collector Confluent Not Confluent Confluent Paxos Not Confluent
  • 87. Fundamental costs: FT via replication The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton Garbage Collector Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Confluent Not Confluent Confluent Barrier Not Confluent Barrier
  • 88. Language-level consistency DSLs for distributed programming? • Capture consistency concerns in the type system Application Language Flow Object Storage
  • 89. Language-level consistency CALM Theorem: Monotonic → confluent Conservative, syntactic test for confluence
  • 90. Language-level consistency Deadlock detector Garbage collector
  • 91. Language-level consistency Deadlock detector Garbage collector nonmonotonic
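A sketch of why the deadlock detector passes the CALM test while the garbage collector does not, using set comprehensions as stand-ins for Datalog rules over an invented toy graph. Transitive closure is monotonic: new edges can only add facts. "Not reachable" is nonmonotonic: a node judged garbage can be un-judged by a later edge, which is where coordination (the barrier above) is needed.

    def transitive_closure(edges):
        # Monotonic: adding an edge can only add pairs to the closure,
        # so any arrival order converges to the same answer (CALM).
        closure = set(edges)
        while True:
            new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
            if new <= closure:
                return closure
            closure |= new

    def unreachable(edges, root, nodes):
        # Nonmonotonic: a node judged garbage now may become reachable when
        # one more edge arrives, so the answer needs a sealed input (a barrier).
        reach = {root} | {d for (s, d) in transitive_closure(edges) if s == root}
        return nodes - reach

    nodes = {"Root", "T1", "T2", "T3"}
    print(unreachable({("Root", "T1")}, "Root", nodes))                # {'T2', 'T3'}
    print(unreachable({("Root", "T1"), ("T1", "T2")}, "Root", nodes))  # {'T3'}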
  • 92. Let’s review • Consistency is tolerance to asynchrony • Tricks: – focus on data in motion, not at rest – avoid coordination when possible – choose coordination carefully otherwise (Tricks are great, but tools are better)
  • 93. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 94. Grand challenge: composition Hard problem: Is a given component fault-tolerant? Much harder: Is this system (built up from components) fault-tolerant?
  • 95. Example: Atomic multi-partition update T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Two-phase commit
  • 96. Example: replication T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Reliable broadcast
  • 97. Popular wisdom: don’t reinvent
  • 98. Example: Kafka replication bug Three “correct” components: 1. Primary/backup replication 2. Timeout-based failure detectors 3. Zookeeper One nasty bug: Acknowledged writes are lost
  • 99. A guarantee would be nice Bottom up approach: • use formal methods to verify individual components (e.g. protocols) • Build systems from verified components Shortcomings: • Hard to use • Hard to compose Investment Returns
  • 100. Bottom-up assurances Formal verification Environment Program Correctness Spec
  • 101. Composing bottom-up assurances
  • 102. Composing bottom-up assurances Issue 1: incompatible failure models eg, crash failure vs. omissions Issue 2: Specs do not compose (FT is an end-to-end property) If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess. -- Butler Lampson
  • 103. Composing bottom-up assurances
  • 104. Composing bottom-up assurances
  • 105. Composing bottom-up assurances
  • 106. Top-down “assurances”
  • 107. Top-down “assurances” Testing
  • 108. Top-down “assurances” Fault injection Testing
  • 109. Top-down “assurances” Fault injection Testing
  • 110. End-to-end testing would be nice Top-down approach: • Build a large-scale system • Test the system under faults Shortcomings: • Hard to identify complex bugs • Fundamentally incomplete Investment Returns
  • 111. Lineage-driven fault injection Goal: top-down testing that • finds all of the fault-tolerance bugs, or • certifies that none exist
  • 112. Lineage-driven fault injection Correctness Specification Malevolent sentience Molly
  • 113. Lineage-driven fault injection Molly Correctness Specification Malevolent sentience
  • 114. Lineage-driven fault injection (LDFI) Approach: think backwards from outcomes Question: could a bad thing ever happen? Reframe: • Why did a good thing happen? • What could have gone wrong along the way?
  • 115. Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.
  • 116. The game • Both players agree on a failure model • The programmer provides a protocol • The adversary observes executions and chooses failures for the next execution.
  • 117. Dedalus: it’s about data log(B, “data”)@5 What Where When Some data
  • 118. Dedalus: it's like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload);
  • 119. Dedalus: it's like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload); (Which is like SQL) create view log as select Node, Pload from bcast;
  • 120. Dedalus: it's about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
  • 121. Dedalus: it's about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); State change Natural join (bcast.Node1 == node.Node1) Communication
  • 122. The match Protocol: Reliable broadcast Specification: Pre: A correct process delivers a message m Post: All correct process delivers m Failure Model: (Permanent) crash failures Message loss / partitions
  • 123. Round 1 node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node, Pload)@next :- log(Node, Pload); log(Node, Pload) :- bcast(Node, Pload); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); "An effort" delivery protocol
  • 124. Round 1 in space / time Process b Process a Process c 2 1 2 log log
  • 125. Round 1: Lineage log(B, data)@5
  • 126. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  • 127. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3
  • 128. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2
  • 129. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1; log(A, data)@1
  • 130. An execution is a (fragile) "proof" of an outcome log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 (which required a message from A to B at time 1)
  • 131. Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”
  • 132. Round 1: counterexample Process b Process a Process c 1 2 log (LOST) log The adversary wins!
  • 133. Round 2 Same as Round 1, but A retries. bcast(N, P)@next :- bcast(N, P);
  • 134. Round 2 in spacetime Process b Process a Process c 2 3 4 5 1 2 3 4 2 3 4 5 log log log log log log log log
  • 135. Round 2 log(B, data)@5
  • 136. Round 2 log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  • 137. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
  • 138. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3
  • 139. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2
  • 140. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  • 141. Round 2 Retry provides redundancy in time log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  • 142. Traces are forests of proof trees log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 node(A, B)@1 r3 node(A, B)@2 AB2 r2 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 AB3 r2 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 r1 log(A, data)@4 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 r3 node(A, B)@4 AB4 r2 log(B, data)@5 AB1 ^ AB2 ^ AB3 ^ AB4
  • 143. Traces are forests of proof trees log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 node(A, B)@1 r3 node(A, B)@2 AB2 r2 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 AB3 r2 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 r1 log(A, data)@4 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 r3 node(A, B)@4 AB4 r2 log(B, data)@5 AB1 ^ AB2 ^ AB3 ^ AB4
  • 144. Round 2: counterexample Process b Process a Process c 1 log (LOST) log CRASHED 2 The adversary wins!
  • 145. Round 3 Same as in Round 2, but symmetrical. bcast(N, P)@next :- log(N, P);
  • 146. Round 3 in space / time Process b Process a Process c 2 3 4 5 1 log log 2 3 4 5 2 3 4 5 log log log log log log log log log log log log log log log log log log Redundancy in space and time
  • 147. Round 3 -- lineage log(B, data)@5
  • 148. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4
  • 149. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 Log(B, data)@3 log(A, data)@3 log(C, data)@3
  • 150. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 Log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B,data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  • 151. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 Log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B,data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  • 152. Round 3 The programmer wins!
  • 153. Let’s reflect Fault-tolerance is redundancy in space and time. Best strategy for both players: reason backwards from outcomes using lineage Finding bugs: find a set of failures that “breaks” all derivations Fixing bugs: add additional derivations
  • 154. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. (AB1 ∨ BC2) Disjunction
  • 155. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
  • 156. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
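A sketch of the adversary's job as a CNF problem, solved here by brute force rather than a real SAT solver; the message names AB1, AC1, AC2, and BC2 follow the slides, and everything else is illustrative.

    from itertools import chain, combinations

    # Each proof of the good outcome is the set of messages it relies on;
    # dropping any one of them breaks that proof (a disjunction of drops).
    proofs = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]

    def breaks_all(drops):
        # A set of dropped messages is a counterexample candidate iff it hits
        # every proof, i.e. it satisfies (AB1 v BC2) ^ (AC1) ^ (AC2).
        return all(drops & proof for proof in proofs)

    messages = set().union(*proofs)
    candidates = chain.from_iterable(
        combinations(sorted(messages), k) for k in range(1, len(messages) + 1))
    hits = [set(c) for c in candidates if breaks_all(set(c))]
    print(min(hits, key=len))   # a smallest failure set, e.g. {'AB1', 'AC1', 'AC2'}

In the game described earlier, the chosen failure set drives the next execution: if the good outcome survives, its new lineage contributes more clauses and the adversary tries again.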
  • 157. Molly, the LDFI prototype Molly finds fault-tolerance violations quickly or guarantees that none exist. Molly finds bugs by explaining good outcomes – then it explains the bugs. Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka Certified correct: paxos (synod), Flux, bully leader election, reliable broadcast
  • 158. Commit protocols Problem: Atomically change things Correctness properties: 1. Agreement (All or nothing) 2. Termination (Something)
  • 159. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit
  • 160. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it?
  • 161. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN
  • 162. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN Well I’m gone
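A compressed Python sketch of the message exchange in the 2PC diagrams above, with a crash flag that reproduces the blocking scenario shown on the next slide; the function and parameter names are illustrative.

    def two_phase_commit(agents, votes, coordinator_crashes_after_votes=False):
        # Phase 1: coordinator sends PREPARE; each agent answers with its vote
        # (True means "yes, I can commit").
        collected = {a: votes[a] for a in agents}
        if coordinator_crashes_after_votes:
            # The coordinator dies before announcing a decision: agents that
            # voted yes can neither commit nor abort on their own.
            return {a: "blocked (termination violated)" for a in agents}
        # Phase 2: commit iff every vote was yes, otherwise abort; the same
        # decision goes to every agent (agreement).
        decision = "commit" if all(collected.values()) else "abort"
        return {a: decision for a in agents}

    agents = ["a", "b", "d"]
    yes = {"a": True, "b": True, "d": True}
    print(two_phase_commit(agents, yes))
    print(two_phase_commit(agents, yes, coordinator_crashes_after_votes=True))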
  • 163. Two-phase commit Agent a Agent a Coordinator Agent d 2 2 1 p p p 3 CRASHED 2 v v v Violation: Termination
  • 164. The collaborative termination protocol Basic idea: Agents talk amongst themselves when the coordinator fails. Protocol: On timeout, ask other agents about decision.
  • 165. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req
  • 166. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req Can I kick it? YES YOU CAN ……?
  • 167. 3PC Basic idea: Add a round, a state, and simple failure detectors (timeouts). Protocol: 1. Phase 1: Just like in 2PC – Agent timeout → abort 2. Phase 2: send canCommit, collect acks – Agent timeout → commit 3. Phase 3: Just like phase 2 of 2PC
  • 168. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit
  • 169. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit Timeout → Abort Timeout → Commit
  • 170. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg
  • 171. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision
  • 172. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  • 173. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  • 174. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort Agents A & B decide to commit
  • 175. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w
  • 176. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition
  • 177. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica
  • 178. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write
  • 179. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write Data loss
  • 180. Molly summary Lineage allows us to reason backwards from good outcomes Molly: surgically-targeted fault injection Investment similar to testing Returns similar to formal methods
  • 181. Where we’ve been; where we’re headed 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 182. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 183. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 184. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. (asynchrony X partial failure) = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 185. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 186. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  • 187. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  • 188. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes
  • 189. Remember 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes Composition is the hardest problem
  • 190. A happy crisis Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”