Preparing for distributed system failures using Akka #ScalaMatsuri

Presentation at ScalaMatsuri 2017 (Japanese title: Akkaで分散システムの障害に備える)

  1. Preparing for distributed system failures using Akka. 2017.2.25, Scala Matsuri. Yugo Maede (@yugolf). Copyright © 2017 TIS Inc. All rights reserved.
  2. Who am I? @yugolf. TIS Inc. provides a “Reactive Systems Consulting Service” (https://twitter.com/okapies/status/781439220330164225): supporting PoC projects, reviewing designs, reviewing code, etc.
  3. Today's topics: what are Architectural Safety Measures in a distributed system, and how can they be realized with Akka?
  4. From monolith to microservices: microservices mean distributed systems.
  5. Limits of CPU performance: "Moore's law is dead" means that the era of distributed systems is beginning.
  6. Confronting distributed systems: building large-scale systems requires distributed systems. Without them there is no business success.
  7. Building a distributed system is not easy: increasing the number of servers increases the number of failure points, and a new enemy appears, the network.
  8. Architectural Safety Measures: define Cross-Functional Requirements, such as availability and response time/latency.
  9. Systems based on failure: we need Antifragile Organizations and systems designed on the premise that failures happen.
  10. Architectural Safety Measures needed: timeout, bulkhead, circuit breaker, ...
  11. Akka is here: Akka has tools to deal with distributed system failures.
  12. Akka Actor: an actor processes messages in order of arrival, which makes asynchronous processing simple to implement. (Diagram: participants, a host with a mailbox of $10 messages, and a status of $30.)
  13. Supervisor Hierarchy ("let it crash"): a supervisor supervises its child actors. When a child signals a failure, the supervisor handles it with restart, resume, stop, or escalate.
  14. timeout
  15. Request-response needs a timeout: the response may be slow, or may never come back at all.
  16. Message passing: use tell (!, fire and forget) where possible. When using ask (?), set an appropriate timeout (1 s in the diagram).
  17. Timeout configuration (setting the ask timeout):
 import akka.pattern.{ask, AskTimeoutException}
 import akka.util.Timeout
 import scala.concurrent.duration._
 import scala.util.{Failure, Success}
 import system.dispatcher // ExecutionContext for the Future callbacks
 
 implicit val timeout: Timeout = Timeout(5.seconds)
 val response = kitchen ? KitchenActor.DripCoffee(count)
 
 response.mapTo[OrderCompleted] onComplete {
   case Success(result) =>
     log.info(s"success: ${result.message}")
   case Failure(e: AskTimeoutException) =>
     log.info(s"failure: ${e.getMessage}")
   case Failure(t) =>
     log.info(s"failure: ${t.getMessage}")
 }
  18. If a receiver has a problem: what happens to an ask (timeout 1 s) when something is wrong at the destination?
  19. If a receiver has a problem: the supervisor handles the failure (restart, resume, stop, or escalate) but never reports it back to the sender.
  20. If a receiver has a problem: timeout! No response will come back, so the sender needs a timeout.
  21. Implementation of the ask pattern 1/2 (? and internalAsk):
 def ?(message: Any)(implicit timeout: Timeout, sender: ActorRef = Actor.noSender): Future[Any] =
   internalAsk(message, timeout, sender)
 
 private[pattern] def internalAsk(message: Any, timeout: Timeout, sender: ActorRef): Future[Any] =
   actorSel.anchor match {
     case ref: InternalActorRef ⇒
       if (timeout.duration.length <= 0)
         Future.failed[Any](
           new IllegalArgumentException(s"""Timeout length must not be negative, question not sent to [$actorSel]. Sender[$sender] sent the message of type "${message.getClass.getName}"."""))
       else {
         val a = PromiseActorRef(ref.provider, timeout, targetName = actorSel, message.getClass.getName, sender)
         actorSel.tell(message, a)
         a.result.future
       }
     case _ ⇒ Future.failed[Any](new IllegalArgumentException(s"""Unsupported recipient ActorRef type, question not sent to [$actorSel]. Sender[$sender] sent the message of type "${message.getClass.getName}"."""))
   }
  22. Implementation of the ask pattern 2/2 (akka.pattern.PromiseActorRef): a scheduler is set up, and when the timeout elapses it completes the promise with an AskTimeoutException.
 def apply(provider: ActorRefProvider, timeout: Timeout, targetName: Any, messageClassName: String, sender: ActorRef = Actor.noSender): PromiseActorRef = {
   val result = Promise[Any]()
   val scheduler = provider.guardian.underlying.system.scheduler
   val a = new PromiseActorRef(provider, result, messageClassName)
   implicit val ec = a.internalCallingThreadExecutionContext
   val f = scheduler.scheduleOnce(timeout.duration) {
     result tryComplete Failure(
       new AskTimeoutException(s"""Ask timed out on [$targetName] after [${timeout.duration.toMillis} ms]. Sender[$sender] sent message of type "${a.messageClassName}"."""))
   }
   result.future onComplete { _ ⇒ try a.stop() finally f.cancel() }
   a
 }
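The idea on the two ask-pattern slides (the real reply and a scheduled failure race to complete the same promise, and the loser is ignored) can be sketched without Akka. This is a minimal illustration using only the Scala standard library and a JDK scheduler. The names TimeoutSketch and withTimeout, and the use of java.util.concurrent.TimeoutException in place of AskTimeoutException, are assumptions made for the example and are not Akka's code.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ThreadFactory, TimeUnit, TimeoutException}
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration.FiniteDuration
import scala.util.Failure

object TimeoutSketch {
  // Hypothetical scheduler: one daemon thread whose only job is to fire timeouts.
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "timeout-sketch-scheduler")
        t.setDaemon(true)
        t
      }
    })

  def withTimeout[A](future: Future[A], timeout: FiniteDuration)(implicit ec: ExecutionContext): Future[A] = {
    val result = Promise[A]()
    // Schedule the failure; tryComplete does nothing if the reply already arrived.
    val task = scheduler.schedule(
      new Runnable {
        def run(): Unit = {
          result.tryComplete(Failure(new TimeoutException(s"timed out after ${timeout.toMillis} ms")))
          ()
        }
      },
      timeout.toMillis,
      TimeUnit.MILLISECONDS)
    // The real reply competes for the same promise.
    future.onComplete(r => result.tryComplete(r))
    // Whichever side wins, cancel the pending timeout task.
    result.future.onComplete(_ => task.cancel(false))
    result.future
  }
}
```

As in the slide, the caller only ever sees one Future, and a receiver that never answers surfaces as a failed Future rather than as a hang.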
  23. circuit breaker
  24. A receiver is down: the service you want to call may be down.
  25. Response latency will rise: 100 ms when normal versus 1 s when abnormal (timeout = 1 s). Responses degrade, and the resulting overload makes the performance degradation spread.
  26. Apply a circuit breaker: stop sending requests to a service that is down.
  27. What is a circuit breaker? "Once the failures reach a certain threshold, the circuit breaker trips" and further calls are blocked. https://martinfowler.com/bliki/CircuitBreaker.html
  28. The circuit breaker has three states (http://doc.akka.io/docs/akka/current/common/circuitbreaker.html). Closed: messages can be sent. Open: messages cannot be sent. The third state in the linked Akka documentation is Half-Open, in which a trial call is let through after the reset timeout.
  29. Decrease the latency: stop making pointless requests so that no latency is incurred. Normal: 100 ms. Abnormal (timeout = 1 s): 1 s while the breaker is Closed, x ms while it is Open.
  30. Apply circuit breaker: implementation (http://doc.akka.io/docs/akka/current/common/circuitbreaker.html). After 5 failures the breaker opens and blocks message sending for 1 minute.
 val breaker =
   new CircuitBreaker(
     context.system.scheduler,
     maxFailures = 5,
     callTimeout = 10.seconds,
     resetTimeout = 1.minute).onOpen(notifyMeOnOpen())
 
 def receive = {
   case "dangerousCall" =>
     breaker.withCircuitBreaker(Future(dangerousCall)) pipeTo sender()
 }
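The state transitions behind maxFailures and resetTimeout can be sketched as a small state machine. This is a simplified, single-threaded illustration in plain Scala, not Akka's CircuitBreaker: the names BreakerSketch and SimpleBreaker are hypothetical, calls are synchronous, there is no callTimeout, and the clock is passed in as a function so the behavior can be checked deterministically.

```scala
object BreakerSketch {
  sealed trait State
  case object Closed extends State
  case object Open extends State
  case object HalfOpen extends State

  // Hypothetical breaker; `now` returns the current time in milliseconds.
  final class SimpleBreaker(maxFailures: Int, resetTimeoutMillis: Long, now: () => Long) {
    private var state: State = Closed
    private var failures = 0
    private var openedAt = 0L

    def currentState: State = {
      // After the reset timeout, an Open breaker lets a trial call through.
      if (state == Open && now() - openedAt >= resetTimeoutMillis) state = HalfOpen
      state
    }

    // Runs `call` unless the breaker is Open; Left("open") means the call failed fast.
    def protect[A](call: => A): Either[String, A] =
      currentState match {
        case Open => Left("open")
        case _ =>
          try {
            val a = call
            failures = 0
            state = Closed
            Right(a)
          } catch {
            case scala.util.control.NonFatal(e) =>
              failures += 1
              // A failed trial call, or too many failures, (re)opens the breaker.
              if (state == HalfOpen || failures >= maxFailures) {
                state = Open
                openedAt = now()
              }
              Left(s"failed: ${e.getMessage}")
          }
      }
  }
}
```

While the breaker is Open the caller gets an answer immediately instead of waiting for a timeout, which is the latency reduction the slides describe.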
  31. Blocked threads: blocking operations exhaust the threads, and the latency propagates.
  32. Prevention of propagation: by cutting off the abnormal service, the problem does not propagate upstream.
  33. CAP trade-off. Read: return old information vs. don't return anything. Is it acceptable to return old information? Write: just do my work vs. need to synchronize with others. Is it acceptable to proceed without synchronizing with others? (Diagram labels: cache, push.)
  34. Rate limiting: a rate limiter protects the service from a burst of requests from the same client, e.g. no more than 100 requests in any 3 sec interval.
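One way to read "no more than 100 requests in any 3 sec interval" is a per-client sliding window. The sketch below is an assumption about how such a limiter could look in plain Scala, not something shown in the talk. RateLimitSketch and SlidingWindowLimiter are hypothetical names, and the caller supplies the timestamps.

```scala
import scala.collection.mutable

object RateLimitSketch {
  // Hypothetical limiter: at most `maxRequests` per client in any `windowMillis` interval.
  final class SlidingWindowLimiter(maxRequests: Int, windowMillis: Long) {
    private val accepted = mutable.Map.empty[String, mutable.Queue[Long]]

    // Returns true if the request is allowed. Timestamps must be non-decreasing per client.
    def allow(clientId: String, nowMillis: Long): Boolean = {
      val times = accepted.getOrElseUpdate(clientId, mutable.Queue.empty[Long])
      // Forget accepted requests that are no longer inside the window ending at nowMillis.
      while (times.nonEmpty && times.head <= nowMillis - windowMillis) times.dequeue()
      if (times.size < maxRequests) {
        times.enqueue(nowMillis)
        true
      } else false
    }
  }
}
```

Because the window is tracked per client id, one noisy client is rejected without affecting the others.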
  35. bulkhead
  36. Even if there is damage next door, are you OK? It is bad luck to be affected when an unrelated neighbor goes down.
  37. The bulkhead blocks the damage: put a bulkhead between the actor that blocks threads and the actors that would otherwise be affected.
  38. Isolating the blocking calls to actors: move blocking code into its own actor so that it does not share resources with the others.
 val blockingActor = context.actorOf(
   Props[BlockingActor].withDispatcher("blocking-actor-dispatcher"),
   "blocking-actor")
 
 class BlockingActor extends Actor {
   def receive = {
     case GetCustomer(id) =>
       // calling database
       …
   }
 }
  39. Blocking in a Future: a Future that blocks, Future { // blocking }, starves the dispatcher of threads.
  40. Using the default dispatcher. http://www.slideshare.net/ktoso/zen-of-akka#44
  41. Isolating the blocking Future: run Future { // blocking } on a separate set of threads.
  42. Using a dedicated dispatcher. http://www.slideshare.net/ktoso/zen-of-akka#44
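The same isolation can be illustrated with plain scala.concurrent: blocking Futures are given their own ExecutionContext so they cannot exhaust the threads everything else runs on. This is a sketch under assumptions, not the talk's code. BulkheadSketch and blockingCall are hypothetical names, and the pool size of 4 is arbitrary. In an Akka application a dispatcher defined in configuration would play the role of the pool.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future}

object BulkheadSketch {
  private val counter = new AtomicInteger(0)

  // Daemon threads named blocking-1, blocking-2, ... so the pool is easy to spot.
  private val factory = new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"blocking-${counter.incrementAndGet()}")
      t.setDaemon(true)
      t
    }
  }

  // Hypothetical dedicated pool, reserved for blocking work only.
  val blockingEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4, factory))

  // Runs the blocking body on the dedicated pool and reports which thread ran it.
  def blockingCall[A](body: => A): Future[(String, A)] =
    Future {
      val a = body
      (Thread.currentThread().getName, a)
    }(blockingEc)
}
```

If all four threads are stuck in blocking calls, only further blocking calls queue up. Work scheduled on other execution contexts keeps running.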
  43. CQRS (Command and Query Responsibility Segregation): separate commands (write) from queries (read).
  44. cluster
  45. Hardware will fail: if there are 365 machines that each fail once a year, on average one machine fails every day. Wouldn't a machine break even when it's hosted on the cloud?
  46. Availability of AWS. Example: a site that tracks AWS availability, https://cloudharmony.com/status-of-compute-and-storage-and-cdn-and-dns-for-aws
  47. Preparing for hardware failure: minimize single points of failure, and persist state so that it can be recovered.
  48. Cluster: the members (node1 to node4) monitor each other by sending heartbeats, and detect failures that way.
  49. Recovering state (akka-persistence): state can be restored by replaying the events that were persisted.
  50. The database may be down or overloaded: while the persistence store has not recovered (the db has not started yet), do not retry the replay blindly.
  51. BackoffSupervisor (http://doc.akka.io/docs/akka/current/general/supervision.html#Delayed_restarts_with_the_BackoffSupervisor_pattern): attempts to start the child at increasing intervals of 3, 6, 12, ... seconds.
 val childProps = Props(classOf[EchoActor])
 
 val supervisor = BackoffSupervisor.props(
   Backoff.onStop(
     childProps,
     childName = "myEcho",
     minBackoff = 3.seconds,
     maxBackoff = 30.seconds,
     randomFactor = 0.2 // adds 20% "noise" to vary the intervals slightly
   ))
 
 system.actorOf(supervisor, name = "echoSupervisor")
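The "3, 6, 12, ..." schedule is exponential backoff with a cap and some random noise. The function below is a sketch of that arithmetic in plain Scala, assuming the delay doubles on each attempt, is capped at maxBackoff, and is then stretched by up to randomFactor. BackoffSketch and delay are hypothetical names, and this is not claimed to be the exact formula Akka uses.

```scala
import scala.concurrent.duration._

object BackoffSketch {
  // Delay before restart attempt number `attempt` (0-based).
  // `jitter` is a value in [0, 1) supplied by the caller, for example from a random generator.
  def delay(attempt: Int,
            minBackoff: FiniteDuration,
            maxBackoff: FiniteDuration,
            randomFactor: Double,
            jitter: Double): FiniteDuration = {
    // Double the base delay per attempt, but never exceed maxBackoff.
    val base = math.min(minBackoff.toMillis * math.pow(2, attempt), maxBackoff.toMillis.toDouble)
    // Stretch by up to randomFactor so that many children do not all restart at the same instant.
    (base * (1.0 + jitter * randomFactor)).toLong.millis
  }
}
```

With minBackoff = 3 seconds, maxBackoff = 30 seconds, and no jitter, attempts 0 to 4 wait 3, 6, 12, 24, and 30 seconds.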
  52. split brain resolver
  53. In the case of network partitions: the network between the cluster nodes (node1 to node4) can also be cut.
  54. Using split brain resolver: when a partition splits the nodes (node1 to node5) into Cluster1 and Cluster2, apply a split brain resolver to keep the clusters consistent.
  55. Strategy 1/4, Static Quorum (quorum-size = 3). Which side can survive? The side whose number of nodes is quorum-size or more.
  56. Strategy 2/4, Keep Majority. Which side can survive? The side that has more than 50% of the nodes.
  57. Strategy 3/4, Keep Oldest. Which side can survive? The side that contains the oldest node (node1 in the diagram).
  58. Strategy 4/4, Keep Referee (address = "akka.tcp://system@node1:port"). Which side can survive? The side that contains the given referee node.
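The four survival rules can be written down as one pure function. The model below is a deliberately simplified sketch in plain Scala: SbrSketch, Strategy, and survives are hypothetical names, node identity is just a string, and the tie-breaking and additional options of a real split brain resolver are left out.

```scala
object SbrSketch {
  sealed trait Strategy
  final case class StaticQuorum(quorumSize: Int) extends Strategy
  case object KeepMajority extends Strategy
  final case class KeepOldest(oldest: String) extends Strategy
  final case class KeepReferee(referee: String) extends Strategy

  // `side` is the set of nodes one partition can still reach (itself included);
  // `all` is the full membership before the partition.
  def survives(strategy: Strategy, side: Set[String], all: Set[String]): Boolean =
    strategy match {
      case StaticQuorum(q) => side.size >= q           // quorum-size or more
      case KeepMajority    => side.size * 2 > all.size // strictly more than 50%
      case KeepOldest(o)   => side.contains(o)         // contains the oldest node
      case KeepReferee(r)  => side.contains(r)         // contains the referee node
    }
}
```

For five nodes split into {node1, node2} and {node3, node4, node5}, Static Quorum with quorum-size 3 and Keep Majority keep the larger side, while Keep Oldest and Keep Referee with node1 keep the smaller one.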
  59. SBR is included in Lightbend Reactive Platform: http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html. Also shown: akka-cluster-custom-downing, https://github.com/TanUkkii007/akka-cluster-custom-downing
  60. idempotence
  61. Failed to receive the ack message: if the sender of Order(coffee,1) ("coffee please!") does not receive the ack and resends the message, it becomes a duplicate order.
  62. Idempotence: keep consistency with an idempotent design, e.g. Order(id1, coffee, 1), so that applying the same message multiple times is not harmful.
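A receiver can make Order(id1, coffee, 1) safe to resend by remembering which ids it has already applied. The sketch below shows one possible shape in plain Scala. IdempotenceSketch and OrderBook are hypothetical names, and the in-memory set of seen ids is an assumption that a real system would need to persist.

```scala
object IdempotenceSketch {
  final case class Order(id: String, item: String, quantity: Int)

  // Hypothetical receiver: a resent Order is recognized by its id and not applied twice.
  final class OrderBook {
    private var seen = Set.empty[String]
    private var totals = Map.empty[String, Int]

    // Returns true when the order was applied, false when it was a duplicate.
    def receive(order: Order): Boolean =
      if (seen.contains(order.id)) false
      else {
        seen += order.id
        totals = totals.updated(order.item, totals.getOrElse(order.item, 0) + order.quantity)
        true
      }

    def total(item: String): Int = totals.getOrElse(item, 0)
  }
}
```

The sender can then resend freely whenever an ack is lost, because receiving the same order twice leaves the total unchanged.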
  63. summary
  64. Summary: microservices mean distributed systems; define Cross-Functional Requirements; design for failure. Failures will happen, so accept them.
  65. Summary: timeout, circuit breaker, bulkhead, cluster, backoff, split brain resolver, ... by using Akka. Akka provides a toolkit for dealing with distributed system failures.
  66. Reference materials: Building Microservices; Reactive Design Patterns; Reactive Application Development; Effective Akka; http://akka.io/
  67. Now translating akka.io: translation contributors wanted! Please join us on Gitter. https://gitter.im/akka-ja/akka-doc-ja https://github.com/akka-ja/akka-doc-ja/
  68. THANK YOU