Benchmark and Metrics

The story of how to figure out what to measure, and how to benchmark it. This slide deck explains the idea of benchmarking; it does not cover specific commercial or open-source benchmark tools.

1. Benchmark & Metrics (Yuta Imai)
2. Agenda: 1. Metrics  2. Benchmark
3. Citations
   • This slide deck is based on the stories Robert Barnes told us during his time at AWS: https://www.youtube.com/watch?v=jffB30FRmlY
4. Why benchmark?
   • How long will the current configuration be adequate?
   • Will this platform provide adequate performance, now and in the future?
   • For a specific workload, how does one platform compare to another?
   • What configuration will it take to meet current needs?
   • What size instance will provide the best cost/performance for my application?
   • Are the changes being made to a system going to have the intended impact on the system?
5. Agenda: 1. Metrics  2. Benchmark
6. Metrics
   • When measuring or benchmarking system performance or the business, choosing what to monitor is critical.
   • Does the metric describe your challenge well?
   • Is the metric difficult to hack?
7. Business?
8. Sample case 1: Metrics to monitor the business
   • If you want to monitor how the business is going, which metrics do you monitor? http://www.slideshare.net/TokorotenNakayama/dau-21559783
9. Customer Experience?
10. Sample case 2: Metrics to monitor customer experience
   • If you want to monitor how good the customer experience is, which metrics do you monitor?
11. Percentile
12. Percentile
   • Amazon relies heavily on percentiles.
   • Percentile:
     – Describes the user/customer experience directly. 99.9% = 42ms
13. Percentile
   • Amazon relies heavily on percentiles.
   • Percentile:
     – Describes the user/customer experience directly.
   • Example: with 1,000 samples, "99.9% = 42ms" means 999 of the queries finished within 42ms.
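To make that definition concrete, here is a minimal Perl sketch (Perl is also the language of the automation example near the end of the deck) that computes a nearest-rank percentile from a list of latency samples. The random sample data and the percentile helper are illustrative, not part of the deck.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use POSIX qw(ceil);

    # Nearest-rank percentile: the value that p% of samples fall at or below.
    sub percentile {
        my ($p, @samples) = @_;
        my @sorted = sort { $a <=> $b } @samples;
        # Smallest rank r such that r/N >= p/100; the tiny epsilon guards
        # against floating-point error (e.g. 99.9/100 * 1000).
        my $rank = ceil($p / 100 * @sorted - 1e-9);
        $rank = 1 if $rank < 1;
        $rank = scalar @sorted if $rank > @sorted;
        return $sorted[$rank - 1];
    }

    # 1,000 illustrative latency samples in milliseconds.
    my @latencies_ms = map { 20 + rand(25) } 1 .. 1000;

    my $sum = 0;
    $sum += $_ for @latencies_ms;

    # "p99.9 = 42ms" would mean 999 of the 1,000 requests finished within 42ms.
    printf "p99.9   = %.1f ms\n", percentile(99.9, @latencies_ms);
    printf "average = %.1f ms\n", $sum / @latencies_ms;

The same helper can be pointed at real request logs, including the weekly service-level figures shown a few slides later.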
14. Percentile
   • If you pick the average for your SLA, it does not describe the customer's experience. 99.9% = 42ms, Average = 29ms.
   • In a standard (normal-looking) distribution like this, the average might be OK, but…
15. Percentile
   • Even with a histogram of this shape, the percentile still properly describes the customer experience. 99.9% = 46ms, 99.5% = 44ms, 99% = 41ms
16. Percentile
   • If you pick the average, it does not describe the customer's experience. 99.9% = 50ms, Average = 31ms.
   • In such a distribution, the average does not work well.
17. Percentile
   • Percentile works well for business SLA decisions because it describes the customer's experience. 99.9% = 45ms, 99.5% = 42ms, 99% = 40ms
18. Percentile
   • Percentile works well for business SLA decisions because it describes the customer's experience. 99.9% = 45ms, 99.5% = 42ms, 99% = 40ms
19. Percentile
   • Percentile works well for business SLA decisions because it describes the customer's experience. 99.9% = 45ms, 99.5% = 42ms, 99% = 40ms
   • OK, let's set the business SLA to 40ms at the 99.9th percentile.
20. AS-IS: 99.9% = 45ms, 99.5% = 42ms, 99% = 40ms.  TO-BE: 99.9% = 40ms.
   • If you want to provide latencies of 40ms or lower for 99.9% of queries, you will have to move the distribution to the left.
21. Percentile
   • Percentile is also good for service-level monitoring. 4/1: 99.9% = 42ms
22. Percentile
   • Percentile is also good for service-level monitoring. 4/1: 99.9% = 42ms, 4/7: 99.9% = 44ms
23. Percentile
   • Percentile is also good for service-level monitoring. 4/1: 99.9% = 42ms, 4/7: 99.9% = 44ms, 4/14: 99.9% = 46ms
24. Percentile
   • Percentile is also good for service-level monitoring. 4/1: 99.9% = 42ms, 4/7: 99.9% = 44ms, 4/14: 99.9% = 46ms
   • Did throughput increase? Did data volume increase? Let's start investigating.
25. Metrics: Summary
   • Choose metrics that describe your challenge well.
   • Choose metrics that are NOT hackable!
26. Agenda: 1. Metrics  2. Benchmark
27. The Benchmark Lifecycle (diagram): Start with a Goal → Test Design (design your workload) → Test Configuration (build environment; carefully control changes) → Test Execution (generate load; run a series of controlled experiments) → Test Analysis (measure against goal) → Report
28. The Benchmark Lifecycle (same lifecycle diagram, repeated as a section divider)
29. First…
   • What is "OK"?
     – "Faster" means "infinite".
   • Choose your benchmark.
     – Your application is the best benchmark tool.
30. Hints for defining "OK"
   • "Ensure your design works if scale changes by 10X or 20X, but the right solution for X is often not optimal for 100X." (Jeff Dean, Google)
31. Hints for defining "OK": Sacrificial Architecture
   • "Essentially it means accepting now that in a few years' time you'll (hopefully) need to throw away what you're currently building." (Martin Fowler)
32. Set performance targets
   Target: Achieve adequate performance
   • If no target exists:
     – Use current performance
     – Run experiments to define a baseline
     – Copy from someone else
     – Guess
   • Why set performance targets?
     – To know when you are done
     – Target met, or time to rewrite…
33. Example: Set performance targets
   Total users: 10,000,000; Request rate: 1,000 RPS; Peak rate: 5,000 RPS; Concurrent users: 10,000; Peak users: 50,000

   Transaction           Mix ratio   95th percentile (msec)
   New user sign-up         5%        1500
   Sign-in                 25%        1250
   Catalog search          50%        1000
   Order item              10%        1500
   Check order status      10%        1000
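As a small illustration of the "measure against goal" step, the sketch below compares measured 95th-percentile latencies against targets like the table above. The transaction names and targets come from the slide; the measured numbers are made up.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Targets from the slide: 95th-percentile latency per transaction (msec).
    my %target_p95_ms = (
        'New user sign-up'   => 1500,
        'Sign-in'            => 1250,
        'Catalog search'     => 1000,
        'Order item'         => 1500,
        'Check order status' => 1000,
    );

    # Hypothetical measured p95 values from one test run (msec).
    my %measured_p95_ms = (
        'New user sign-up'   => 1320,
        'Sign-in'            => 1410,
        'Catalog search'     =>  940,
        'Order item'         => 1290,
        'Check order status' =>  980,
    );

    for my $tx (sort keys %target_p95_ms) {
        my $ok = $measured_p95_ms{$tx} <= $target_p95_ms{$tx};
        printf "%-20s measured %4d ms, target %4d ms  [%s]\n",
            $tx, $measured_p95_ms{$tx}, $target_p95_ms{$tx}, $ok ? 'PASS' : 'FAIL';
    }

A report like this tells you immediately whether you are done or whether it is time to keep tuning.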
34. Choose your workloads
   • Select features:
     – Most important
     – Most popular
     – Highest complaints
     – "Worst" performing
   • Define the workload mix (see the sketch below):
     – Ratio of features
     – Typical "users" and what they do
     – Population and distribution of users
       • Random (even distribution)
       • Hotspots
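One way to turn such a mix into a synthetic workload is to pick each simulated request by weighted random choice. A minimal sketch, assuming the mix ratios from the example targets slide; the pick_feature helper is an illustrative name, not from the deck.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Workload mix: feature => share of requests.
    my @mix = (
        [ 'New user sign-up'   => 0.05 ],
        [ 'Sign-in'            => 0.25 ],
        [ 'Catalog search'     => 0.50 ],
        [ 'Order item'         => 0.10 ],
        [ 'Check order status' => 0.10 ],
    );

    # Pick one feature at random, weighted by its share of the mix.
    sub pick_feature {
        my $r = rand();
        for my $entry (@mix) {
            my ($feature, $share) = @$entry;
            return $feature if ($r -= $share) <= 0;
        }
        return $mix[-1][0];    # guard against floating-point rounding
    }

    # Sanity check: the observed shares should converge to the configured mix.
    my %count;
    $count{ pick_feature() }++ for 1 .. 100_000;
    printf "%-20s %5.1f%%\n", $_, 100 * $count{$_} / 100_000 for sort keys %count;

Hotspots can be modeled the same way by skewing which user or item each picked request operates on.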
35. Three ways to use benchmarks
   1. Run a benchmark using your existing application and workloads
   2. Run a standard benchmark
   3. Use published benchmark results
36. 1. Use your existing application
   • Choose which part of the application to test
   • Determine how to generate load
   • Decide how to measure and which metrics to use
   • Design how reports get generated
37. 2. Run a standard benchmark
   • Is the test relevant to your requirements?
   • How does the test map to your application?
   • Be aware that most of them are micro-benchmarks.
38. 2. Run a standard benchmark (continued)
   When you can't use your application, standard benchmarks can help.
   • Standard benchmarks still leave work to be done:
     – Tuning is needed
     – Automation and test execution
     – How are their test results relevant?
     – How is this test implementation relevant?
   • Examples and tips referencing standard benchmarks are not endorsements of those benchmarks.
39. 3. Use published benchmark results
   • What is being measured?
   • Why is it being measured?
   • How is it being measured?
   • How closely does this benchmark resemble my results?
   • How accurate are the reports and citations?
   • Are the results repeatable?
40. Tip: The 4 Rs
   • Relevant
     – The best test is based on your application
   • Recent
     – Out-of-date results are rarely useful
   • Repeatable
     – Is there enough information to repeat the test?
   • Reliable
     – Do you trust the tools, the publisher, and the results?
41. The Benchmark Lifecycle (same lifecycle diagram, repeated as a section divider)
42. How to generate load
   • Humans (don't use humans if you want repeatable, reproducible tests)
     – "Record/playback" traffic
     – Volunteers
     – Mechanical Turk
   • Synthetic load
     – Open source
     – Commercial
       • SOASTA, Neustar, Gomez, Keynote
     – Write your own… (see the sketch below)
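A minimal "write your own" synthetic load generator might look like the following Perl sketch. It assumes LWP::UserAgent is installed and that you replace the placeholder URL with your own endpoint; it issues sequential requests and records per-request latencies, which can be fed into the percentile helper above.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $url      = 'http://example.com/';   # replace with your endpoint
    my $requests = 100;                      # tiny run; real tests need far more
    my $ua       = LWP::UserAgent->new(timeout => 10);

    my @latencies_ms;
    for (1 .. $requests) {
        my $t0  = [gettimeofday];
        my $res = $ua->get($url);
        push @latencies_ms, 1000 * tv_interval($t0);
        warn "request failed: ", $res->status_line, "\n" unless $res->is_success;
    }

    printf "sent %d requests, slowest %.1f ms\n",
        scalar(@latencies_ms), (sort { $b <=> $a } @latencies_ms)[0];

A real generator would run many of these loops in parallel (forked workers or separate instances) to reach the target request rate.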
43. How to measure
   • Load generator metrics
   • Application metrics (end to end)
   • Add instrumentation (see the sketch below)
   • Stopwatch
   • Use log files
     – Note that emitting a lot of logs will itself introduce another workload.
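For "add instrumentation", a lightweight approach is to wrap the code path you care about in a high-resolution timer and log the elapsed time. A sketch; the timed wrapper is an illustrative name, and the body of the timed block is a stand-in for real work.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Run a code block, log how long it took, and pass its return value through.
    sub timed {
        my ($label, $code) = @_;
        my $t0     = [gettimeofday];
        my $result = $code->();
        printf STDERR "%s took %.1f ms\n", $label, 1000 * tv_interval($t0);
        return $result;
    }

    # Usage: wrap the operation you want to measure.
    my $rows = timed('catalog_search', sub {
        select(undef, undef, undef, 0.05);   # stand-in for real work (50 ms sleep)
        return 42;
    });
    print "got $rows rows\n";

Keep the slide's caveat in mind: the instrumentation and its log output add load of their own.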
44. Tips: End-to-end testing
   • You need to understand and trust the tests
     – Sometimes the tools (clients) themselves have bottlenecks
   • Use realistic data
     – Scale
     – Distribution
   • Use ramp-up, steady-state, and ramp-down (see the sketch below)
   • Choose a reasonable test duration
     – Use a scaled-down environment for longer tests, e.g. SLA proof tests.
   • Run multiple tests and calculate the variability
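One way to implement ramp-up, steady state, and ramp-down is to compute the target request rate as a function of elapsed time and have the load generator throttle to it. A minimal sketch with made-up durations and rates.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Target request rate (RPS) at time $t for a ramp-up / steady / ramp-down profile.
    sub target_rps {
        my ($t, $ramp_up, $steady, $ramp_down, $peak_rps) = @_;
        return $peak_rps * $t / $ramp_up  if $t < $ramp_up;
        return $peak_rps                  if $t < $ramp_up + $steady;
        my $remaining = $ramp_up + $steady + $ramp_down - $t;
        return $remaining > 0 ? $peak_rps * $remaining / $ramp_down : 0;
    }

    # Example profile: 60s ramp-up, 300s steady at 1,000 RPS, 60s ramp-down.
    for my $t (0, 30, 60, 200, 360, 390, 420) {
        printf "t=%3ds -> %6.1f RPS\n", $t, target_rps($t, 60, 300, 60, 1000);
    }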
45. Finding bottlenecks
   • Search metrics and logs for clues
   • If there aren't any, add instrumentation
   • Isolate and individually test services and infrastructure
   • Test "categories":
     – Business logic
     – Presentation
     – Compute
     – Memory
     – Disk I/O
     – Network
     – Database
     – Other services
46. Cloud: a good tool for benchmarking
   • Benchmarking is not easy, because building up and tearing down test configurations can be very labor intensive.
   • Benchmarking in the cloud is fast (parallel execution), affordable (pay as you go), scalable, and can be automated!
47. The Benchmark Lifecycle (same lifecycle diagram, repeated as a section divider)
48. In my experience
   • I had to run Sysbench to check whether CPU/memory/I/O performance is consistent within each Amazon EC2 instance type.
   • I spun up 60 instances of each instance type and ran Sysbench…
   • Automatically, of course.
49. To automate perf tests…
   • Create the output/report format first, e.g. one row per condition and one column per result value:

                  Result_Value1  Result_Value2  Result_Value3  Result_Value4  Result_Value5
     Condition1
     Condition2
     Condition3
     Condition4
     Condition5

   • Then write a script to run the tests, like…
50. Automate end-to-end

     # assuming @conditions holds one hashref of parameters per test condition
     foreach my $param (@conditions) {
         write_report(run_ec2(
             $param->{instance_type},
             $param->{image_id},
             $param->{script_to_run},
         ));
     }
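The deck does not show write_report or run_ec2. As a hedged illustration, write_report could simply append one row per condition to a CSV file matching the report layout on the previous slide; the field names and file name below are assumptions.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Hypothetical write_report: append one CSV row per test condition,
    # matching the "conditions as rows, result values as columns" layout.
    sub write_report {
        my ($condition, @result_values) = @_;
        open my $fh, '>>', 'results.csv' or die "cannot open results.csv: $!";
        print {$fh} join(',', $condition, @result_values), "\n";
        close $fh;
    }

    # Example: one row for an m5.large run with five measured values.
    write_report('m5.large', 1320, 1410, 940, 1290, 980);

Keeping the report format fixed up front is what makes the whole loop automatable: every run, whatever the condition, lands in the same table.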
51. Automated, distributed Sysbench against Amazon Aurora (architecture diagram):
   • Slack outgoing webhook (cluster name, # of tasks, commands) → API Gateway → Lambda
   • Lambda calls ECS RunTasks (cluster name, # of tasks, commands as environment variables, output location)
   • ECS spins up containers and runs the tasks against Aurora, outputting STDOUT as a file to S3
   • Lambda reads the file from S3 and emits it to Slack via an incoming webhook
52. Benchmark: Summary
   • Goal?
   • Workload?
   • Load generator? Environment?
   • Make a list of all the tests
   • Run (and automate!)
