Deep Dive on Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service for applications that need consistent, single-digit millisecond latency at any scale. This talk explores DynamoDB capabilities and benefits in detail and discusses how to get the most out of your DynamoDB database. We go over schema design best practices with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We also explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, Streams, and more.

  1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Andreas Chatzakis, AWS Solutions Architecture, 7th July 2016. Deep Dive on Amazon DynamoDB
  2. Objectives • Prepare for success • Large tables & demanding use cases • High performance • Cost optimized • New functionality
  3. Technology adoption and the hype curve
  4. Why NoSQL? SQL: optimized for storage, normalized/relational, ad hoc queries, scales vertically. NoSQL: optimized for scalability, denormalized/hierarchical, instantiated views, scales horizontally.
  5. Scaling efficiently
  6. Scaling: size (gigabytes) and throughput (requests per second)
  7. Partitioning
  8. Partition count by size: # of Partitions (for size) = Table Size in bytes / 10 GB. In the future, these details might change…
  9. Throughput • Write capacity units (WCUs): 1 KB • Read capacity units (RCUs): 4 KB • 1 RCU => 1 strongly consistent read • 1 RCU => 2 eventually consistent reads
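The capacity-unit arithmetic on this slide can be sketched as a small calculator (the function names are my own; the rounding follows the slide's rules: writes are billed in 1 KB units, reads in 4 KB units, and eventually consistent reads cost half):

```python
import math

def wcus_for_write(item_size_kb: float) -> int:
    # One WCU covers a write of up to 1 KB; size rounds up.
    return max(1, math.ceil(item_size_kb))

def rcus_for_read(item_size_kb: float, eventually_consistent: bool = False) -> float:
    # One RCU covers a strongly consistent read of up to 4 KB;
    # an eventually consistent read costs half as much.
    rcus = max(1, math.ceil(item_size_kb / 4))
    return rcus / 2 if eventually_consistent else rcus

print(wcus_for_write(2.5))     # 3: a 2.5 KB item rounds up to 3 write units
print(rcus_for_read(6))        # 2: 6 KB rounds up to two 4 KB read units
print(rcus_for_read(6, True))  # 1.0: eventual consistency halves the cost
```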
  10. Partition count by throughput: # of Partitions (for throughput) = RCUs for reads / 3000 + WCUs for writes / 1000. In the future, these details might change…
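Combining the size formula from slide 8 with this throughput formula gives a rough partition estimate. A minimal sketch (the deck itself warns these internals may change, and `estimate_partitions` is my own illustrative name):

```python
import math

GB = 1024 ** 3

def estimate_partitions(table_size_bytes: int, rcus: int, wcus: int) -> int:
    # Per the slides: one partition per 10 GB of data, and one per
    # 3,000 RCUs + 1,000 WCUs; the larger of the two drivers wins.
    by_size = table_size_bytes / (10 * GB)
    by_throughput = rcus / 3000 + wcus / 1000
    return math.ceil(max(by_size, by_throughput))

# A 16 GB table provisioned at 6,000 RCUs and 2,000 WCUs is throughput-bound:
print(estimate_partitions(16 * GB, 6000, 2000))  # 4
```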
  11. ProvisionedThroughputExceededException
  12. Built-in flexibility for small spikes (chart: provisioned vs. consumed capacity units over time): "save up" unused capacity, then consume the saved-up capacity later.
  13. Burst capacity (chart: provisioned, consumed, and attempted capacity; throttled requests). Burst capacity: 300 seconds of unused throughput (at 1,200 CUs provisioned, 1,200 × 300 = 360,000 CUs). Don't completely depend on burst capacity; provision sufficient throughput.
  14. Throughput per partition: 100,000 RCUs / 50 partitions ≈ 2,000 read capacity units per partition (ProductCatalog table: Partition 1, Partition K, Partition M, Partition 50, each at 2,000 RCUs).
  15. Aim for uniformity in space (which partition keys) and in time (consumed capacity per second).
  16. Examine your traffic pattern: space (heat map of partitions over time)
  17. Hot key issues manifest after you scale: with a few clients and one table partition the hot key goes unnoticed; with many clients and many partitions, the hot partition throttles.
  18. A bad choice for a partition key. Table SummitSessionAttendance, partition key "07-07-2016", range keys "Session Attendee X", "Session Attendee Y": the hash function f(x) sends every write to the same one of Partitions 1-4.
  19. But I have random partition keys! Keys per partition matter, but so do other outliers: frequency (hot keys), size (large objects or collections), and table history (partitions are not merged).
  20. Partition key value vs. uniformity: User ID, where the application has many users and each user has similar activity levels (uniform). Status code, where there are only a few possible status codes (not uniform). Device ID, where each device accesses data at relatively similar intervals (uniform). Device ID, where one device generates a lot more traffic than any other device (not uniform).
  21. What a hot partition problem looks like (chart: provisioned vs. consumed read capacity, with throttled read requests).
  22. Troubleshooting hot partitions: CloudWatch; AWS Support; access logs via ReturnConsumedCapacity (sampling works well). GSIs must also have enough write capacity, and the uniformity requirement applies to them too.
  23. Examine your traffic pattern: time (heat map of partitions over time)
  24. Avoid sudden bursts of read activity (throttling).
  25. Query rather than scan. Query: specify the partition key name, condition on the sort key; cheap with high-cardinality keys. Scan: reads all data, conditions available only through filters; expensive for large tables. (Table layout: Partition, Sort, Attribute1, …, Attribute N.)
  26. When you have to scan a table • Scans are constrained by single-partition throughput • Use parallel scans if the table > 20 GB • Avoid sudden bursts vs. provisioned capacity • Offload to S3, HDFS, Amazon Redshift, Elasticsearch, or a second table
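The parallel-scan advice boils down to issuing one Scan per segment. A hedged sketch that only builds the per-worker request parameters (`Segment` and `TotalSegments` are real Scan parameters; in practice each dict would be passed to a boto3 `scan` call in its own worker thread or process):

```python
def parallel_scan_params(table_name: str, total_segments: int) -> list[dict]:
    # Each worker scans one segment; DynamoDB divides the key space
    # among TotalSegments non-overlapping segments.
    return [
        {"TableName": table_name,
         "Segment": segment,
         "TotalSegments": total_segments}
        for segment in range(total_segments)
    ]

for params in parallel_scan_params("ProductCatalog", 4):
    print(params)
```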
  27. Design patterns & best practices
  28. Product catalog: popular items (read)
  29. Scaling bottlenecks. ProductCatalog table (Partition 1, Partition K, Partition M, Partition 50, each at 2,000 RCUs): shoppers all run SELECT Id, Description, ... FROM ProductCatalog WHERE Id="POPULAR_PRODUCT" for Product A or Product B, hammering one partition.
  30. Cache popular items: the user's SELECT Id, Description, ... FROM ProductCatalog WHERE Id="POPULAR_PRODUCT" is served from a cache in front of DynamoDB (ProductCatalog table, Partitions 1-2).
  31. Real-time voting: write-heavy items
  32. Scaling bottlenecks. Votes table (Partition 1, Partition K, Partition M, Partition N, each at 1,000 WCUs): voters all write to Candidate A or Candidate B. Provision 200,000 WCUs.
  33. Write sharding. Votes table keys: Candidate A_1 through Candidate A_8 and Candidate B_1 through Candidate B_8.
  34. Write sharding: a voter issues UpdateItem "CandidateA_" + rand(0, 10) ADD 1 to Votes, landing on one of the sharded keys (Candidate A_1 to A_8, B_1 to B_8).
  35. Shard aggregation: a periodic process (1) sums the sharded counters and (2) stores the total (Candidate A total: 2.5M) back into the Votes table.
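The sharded write and the periodic aggregation can be simulated locally; a sketch where a plain dict stands in for the Votes table (a real app would issue UpdateItem ... ADD against DynamoDB instead, and NUM_SHARDS is my own illustrative constant):

```python
import random
from collections import defaultdict

NUM_SHARDS = 10

def sharded_key(candidate: str) -> str:
    # Spread one hot key across NUM_SHARDS suffixed keys,
    # mirroring the "CandidateA_" + rand(0, 10) UpdateItem on slide 34.
    return f"{candidate}_{random.randrange(NUM_SHARDS)}"

# Local stand-in for the Votes table; each write is an ADD 1 to Votes.
votes = defaultdict(int)
for _ in range(1000):
    votes[sharded_key("CandidateA")] += 1

# Shard aggregation (slide 35): periodically sum the shards into a total.
total = sum(n for key, n in votes.items() if key.startswith("CandidateA_"))
print(total)  # 1000 votes, now spread across up to 10 partition keys
```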
  36. Your write workload is not horizontally scalable? Consider throughput per partition key; shard write-heavy partition keys; trade off read cost for write scalability.
  37. Cost optimization tips
  38. Auto Scaling • A cost-saving technique • Open-source solutions exist • Set minimums and maximums • Scale up proactively, scale down conservatively • Scale-up time can be from minutes to hours • Implement a circuit breaker
  39. Event logging: storing time series data
  40. A mix of hot and cold data. Current table Events_table: Event_id (partition), Timestamp (sort), Attribute1, …, Attribute N; RCUs = 10,000, WCUs = 10,000. Antipattern: mix of hot and cold data; old data rarely accessed; unbounded data (partition) growth; partition dilution; scan costs increase with table size; deletes of old data are neither trivial nor cheap.
  41. Time series tables, each with Event_id (partition), Timestamp (sort), Attribute1, …, Attribute N. Current (hot) table: Events_table_2015_April, RCUs = 10,000, WCUs = 10,000. Older (cold) tables: Events_table_2015_March, RCUs = 1,000, WCUs = 1; Events_table_2015_February, RCUs = 100, WCUs = 1; Events_table_2015_January, RCUs = 10, WCUs = 1. Don't mix hot and cold data; archive cold data to Amazon S3.
  42. Dealing with time series data: use a table per time period; precreate daily, weekly, or monthly tables; provision the required throughput for the current table; writes go to the current table; turn off (or reduce) throughput for older tables. Cheaper scans, free deletes.
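Routing writes to the current period's table can be sketched as a tiny helper that derives the monthly table name (naming per the Events_table_2015_April example above; the function name is mine):

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def table_for_month(base: str, day: date) -> str:
    # Route a write to the table for its period, matching the
    # Events_table_2015_April naming used on the time-series slide.
    return f"{base}_{day.year}_{MONTHS[day.month - 1]}"

print(table_for_month("Events_table", date(2015, 4, 17)))  # Events_table_2015_April
```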
  43. Multiplayer online gaming: query filters vs. composite key indexes
  44. Hierarchical data structures. Games table (GameId, Date, Host, Opponent, Status): d9bl3, 2014-10-02, David, Alice, DONE; 72f49, 2014-09-30, Alice, Bob, PENDING; o2pnb, 2014-10-08, Bob, Carol, IN_PROGRESS; b932s, 2014-10-03, Carol, Bob, PENDING; ef9ca, 2014-10-03, David, Bob, IN_PROGRESS.
  45. Query for incoming game requests. DynamoDB indexes provide a partition key (hash) and a sort key (range); what about queries with two equalities and a sort? SELECT * FROM Game WHERE Opponent='Bob' (hash) AND Status='PENDING' (?) ORDER BY Date DESC (range)
  46. Approach 1: Query filter. Secondary index with partition key Opponent and sort key Date (Opponent, Date, GameId, Status, Host): Alice, 2014-10-02, d9bl3, DONE, David; Carol, 2014-10-08, o2pnb, IN_PROGRESS, Bob; Bob, 2014-09-30, 72f49, PENDING, Alice; Bob, 2014-10-03, b932s, PENDING, Carol; Bob, 2014-10-03, ef9ca, IN_PROGRESS, David.
  47. Approach 1: Query filter. SELECT * FROM Game WHERE Opponent='Bob' ORDER BY Date DESC FILTER ON Status='PENDING': the IN_PROGRESS row (ef9ca) is still read, then filtered out.
  48. Needle in a haystack (Bob)
  49. Use a query filter when your index isn't entirely selective: it sends back less data "on the wire", simplifies application code, and supports simple SQL-like expressions (AND, OR, NOT, parentheses).
  50. Approach 2: Composite key. Status + Date = StatusDate: DONE + 2014-10-02 = DONE_2014-10-02; IN_PROGRESS + 2014-10-08 = IN_PROGRESS_2014-10-08; IN_PROGRESS + 2014-10-03 = IN_PROGRESS_2014-10-03; PENDING + 2014-09-30 = PENDING_2014-09-30; PENDING + 2014-10-03 = PENDING_2014-10-03.
  51. Approach 2: Composite key. Secondary index with partition key Opponent and sort key StatusDate (Opponent, StatusDate, GameId, Host): Alice, DONE_2014-10-02, d9bl3, David; Carol, IN_PROGRESS_2014-10-08, o2pnb, Bob; Bob, IN_PROGRESS_2014-10-03, ef9ca, David; Bob, PENDING_2014-09-30, 72f49, Alice; Bob, PENDING_2014-10-03, b932s, Carol.
  52. Approach 2: Composite key. SELECT * FROM Game WHERE Opponent='Bob' AND StatusDate BEGINS_WITH 'PENDING' returns exactly the two PENDING rows from the secondary index.
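The composite-key approach is easy to simulate locally. A sketch in which a list of (Opponent, StatusDate) pairs stands in for the secondary index, and Python's `startswith` plays the role of DynamoDB's begins_with key condition:

```python
def status_date(status: str, date: str) -> str:
    # Concatenate the two attributes into the StatusDate sort key.
    return f"{status}_{date}"

# Stand-in for the secondary index rows on the previous slide.
index = [
    ("Alice", status_date("DONE", "2014-10-02")),
    ("Carol", status_date("IN_PROGRESS", "2014-10-08")),
    ("Bob", status_date("IN_PROGRESS", "2014-10-03")),
    ("Bob", status_date("PENDING", "2014-09-30")),
    ("Bob", status_date("PENDING", "2014-10-03")),
]

# WHERE Opponent='Bob' AND StatusDate BEGINS_WITH 'PENDING'
pending = [row for row in index
           if row[0] == "Bob" and row[1].startswith("PENDING")]
print(pending)  # the two PENDING games; nothing is filtered out after the read
```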
  53. Needle in a sorted haystack (Bob)
  54. Sparse indexes. Customer Orders table (CustomerId (partition), OrderId (sort), Total, Date, Open): only order 23462 (CustomerId 1, $300, 2016-07-05) carries the Open = X attribute, so the OpenOrders-GSI (CustomerId (partition), Open (sort), Total, OrderId, Date) contains just that one item.
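Sparse indexes fall out of the fact that an item only appears in a GSI when it has the index's key attributes. A local sketch of the OpenOrders-GSI behavior using the rows from this slide:

```python
orders = [
    {"CustomerId": 1, "OrderId": 234234, "Total": 100, "Date": "2016-07-01"},
    {"CustomerId": 1, "OrderId": 526346, "Total": 10, "Date": "2016-07-02"},
    {"CustomerId": 2, "OrderId": 746346, "Total": 200, "Date": "2016-07-02"},
    {"CustomerId": 1, "OrderId": 23462, "Total": 300, "Date": "2016-07-05", "Open": "X"},
    {"CustomerId": 3, "OrderId": 635245, "Total": 150, "Date": "2016-07-05"},
    {"CustomerId": 4, "OrderId": 245362, "Total": 80, "Date": "2016-07-07"},
]

# An item is only copied into a GSI when it has the index key attribute,
# so a GSI keyed on "Open" contains just the open orders.
open_orders_gsi = [o for o in orders if "Open" in o]
print(open_orders_gsi)  # only order 23462
```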
  55. You want to optimize a query as much as possible: replace filters with indexes; concatenate attributes to form useful secondary index keys (Status + Date); take advantage of sparse indexes.
  56. Messaging app: large items; varied access patterns; filters vs. indexes; M:N modeling (inbox and outbox)
  57. Messaging app, Messages table. Inbox: SELECT * FROM Messages WHERE Recipient='David' LIMIT 50 ORDER BY Date DESC. Outbox: SELECT * FROM Messages WHERE Sender='David' LIMIT 50 ORDER BY Date DESC.
  58. Large and small attributes mixed. Messages table, partition key Recipient, sort key Date (Recipient, Date, Sender, Message): David, 2014-10-02, Bob, …; 48 more messages for David; David, 2014-10-03, Alice, …; Alice, 2014-09-28, Bob, …; Alice, 2014-10-01, Carol, …; many more messages. The inbox query (SELECT * FROM Messages WHERE Recipient='David' LIMIT 50 ORDER BY Date DESC) touches 50 items × 256 KB each, with large message bodies and attachments.
  59. Computing inbox query cost: items evaluated by the query × average item size × conversion ratio × eventually consistent reads = 50 × 256 KB × (1 RCU / 4 KB) × (1 / 2) = 1,600 RCUs, all against one partition key.
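That arithmetic is easy to check in code (a sketch; the helper name is mine, and the model bills a Query on the cumulative size of the items it reads, rounded up to 4 KB units and halved for eventual consistency):

```python
import math

def query_rcus(item_count: int, avg_item_kb: float,
               eventually_consistent: bool = True) -> float:
    # A Query is billed on the cumulative size of the items read,
    # rounded up to 4 KB units, halved for eventually consistent reads.
    rcus = math.ceil(item_count * avg_item_kb / 4)
    return rcus / 2 if eventually_consistent else rcus

print(query_rcus(50, 256))    # 1600.0 RCUs, matching the slide
print(query_rcus(50, 0.128))  # 1.0 RCU for the same 50 items via a slim GSI projection
```

The second call previews the next slide: projecting only small attributes into an Inbox-GSI makes the same 50-item query cost about 1 RCU.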
  60. Separate the bulk data. Inbox-GSI (Recipient, Date, Sender, Subject, MsgId): David, 2014-10-02, Bob, Hi!…, afed; David, 2014-10-03, Alice, RE: The…, 3kf8; Alice, 2014-09-28, Bob, FW: Ok…, 9d2b; Alice, 2014-10-01, Carol, Hi!..., ct7r. Messages table (MsgId, Body): afed, 9d2b, 3kf8, ct7r. David's client (1) queries Inbox-GSI: 1 RCU (50 sequential items at 128 bytes), then (2) BatchGetItems from Messages: 1,600 RCUs (50 separate items at 256 KB).
  61. Inbox GSI: define which attributes to copy into the index.
  62. Outbox: query the Outbox GSI on Sender. SELECT * FROM Messages WHERE Sender='David' LIMIT 50 ORDER BY Date DESC
  63. Messaging app: David's inbox and outbox queries are served by the Inbox and Outbox global secondary indexes over the Messages table.
  64. Querying many large items at once (Inbox, Messages, Outbox)? Reduce one-to-many item sizes; configure secondary index projections; use GSIs to model the M:N relationship between sender and recipient; distribute large items.
  65. Event-driven applications and DynamoDB Streams
  66. DynamoDB Streams • Stream of updates • Asynchronous • Exactly once • Strictly ordered (per item) • Highly durable • Scales with the table • 24-hour lifetime • Sub-second latency
  67. DynamoDB Streams and the Amazon Kinesis Client Library: a DynamoDB client application sends updates to the table (Partitions 1-5); the stream (Shards 1-4) is consumed by a KCL application with one worker per shard.
  68. Cross-region replication. The DynamoDB Streams open-source cross-region replication library replicates from US East (N. Virginia) to replicas in EU (Ireland) and Asia Pacific (Sydney).
  69. DynamoDB Streams and AWS Lambda
  70. Triggers: a Lambda function receives change notifications and can drive derivative tables, Amazon CloudSearch, or Amazon ElastiCache.
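A trigger like the ones on this slide is typically a Lambda function that walks the `Records` array of a DynamoDB Streams event. A minimal sketch with a trimmed sample event in the shape Streams delivers to Lambda (what you do with the keys downstream is left hypothetical):

```python
def handler(event, context):
    # Collect the keys of items that were inserted or modified;
    # a real trigger would update a derivative table, cache, or search index.
    changed_keys = []
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            changed_keys.append(record["dynamodb"]["Keys"])
    return changed_keys

# A trimmed sample event in the shape DynamoDB Streams delivers to Lambda.
sample = {"Records": [
    {"eventName": "INSERT",
     "dynamodb": {"Keys": {"Id": {"S": "123"}}}},
    {"eventName": "REMOVE",
     "dynamodb": {"Keys": {"Id": {"S": "456"}}}},
]}
print(handler(sample, None))  # [{'Id': {'S': '123'}}]
```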
  71. Search your DynamoDB tables
  72. A polyglot data layer
  73. Please remember to rate this session under My Agenda on awssummit.london
