Tuning TCP and NGINX on EC2

Our Sr. Web Operations Engineer, Justin Lintz, goes over some of the parameters we tuned in TCP and NGINX to improve the performance and stability of our systems. These slides complement a two-part post on our engineering blog:

http://engineering.chartbeat.com/2014/01/02/part-1-lessons-learned-tuning-tcp-and-nginx-in-ec2/

http://engineering.chartbeat.com/2014/02/12/part-2-lessons-learned-tuning-tcp-and-nginx-in-ec2/

Speaker notes
  • Record traffic during the US presidential election and the World Cup USA vs. Germany match: 10+ million concurrents. The presidential election at the time was 2x our normal traffic.
  • High packet rate, low bandwidth. The 43-byte response is a small empty image we send; we need it for error-handling purposes on the frontend side, since we can't send an empty response.
  • Reports from users about slowness in sending “pings” to our servers. With slow clients, slowness doesn't really affect our numbers too much as long as the ping arrives in < 5 seconds. We asked for some numbers and saw pings taking around 3 seconds. That number sets off some alarms.
  • These two numbers should raise alarms when you are troubleshooting TCP connections. 3 seconds is the default timeout before retrying a connection; it backs off and retries again after 6 seconds, so 9 seconds total for a connection.
  • Maybe it's on the client side? How do we know it's on us? At the time our Pingdom monitoring didn't show anything unusual; we later learned this is definitely not enough.
  • Especially if you don’t have a good baseline for some metrics, you will end up chasing oddities in graphs that may be completely irrelevant
  • We only graphed the relevant info from netstat -s; there are a ton of metrics that may be useful for debugging other issues, but we started with these since they appeared to be most related to the issue at hand. For example, “fast retransmit”, while relevant, wouldn't indicate a delay of 3 seconds, since it bypasses the timeout. We push these to Ganglia/Graphite (see the snippet below).
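    A quick way to eyeball those counters before wiring up graphs (a minimal sketch, not from the deck; the exact wording of the counters varies by net-tools version):

        # prints lines such as "... times the listen queue of a socket overflowed"
        # and "... segments retransmited" (sic), which map to the keys we graph
        $ netstat -s | egrep -i 'listen|retransmit'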
  • We had to enable logging and discard logs after an hour due to space constraints. Rotation impacts performance; switching to ext4 on the log volume helped, along with not compressing.
  • This confirmed the issues but didn't give us a source, just some symptoms to look into.
  • Two queues: the first holds half-established connections; you can make it large to help with SYN floods, although given today's flood attacks it's probably not much help.
    The second queue holds established connections waiting for your app to pluck off (see the sketch below for inspecting both).
    net.ipv4.tcp_max_syn_backlog = system wide
    net.core.somaxconn = per process
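    A quick way to look at the accept queue per listener (a sketch, not from the deck; on Linux, for sockets in the LISTEN state, Recv-Q is generally the current accept-queue depth and Send-Q the configured backlog):

        # inspect every listening socket's queue depth and backlog
        $ ss -lnt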
  • Still not 100% sure what this controls; from looking at the kernel source, it appears to be this.

    The nginx backlog originally wasn't documented; I had to find it in the source code and from googling.
  • Didn't know about the nginx listen backlog at first. We initially changed the first three values and saw a slight decrease in timeouts and listen queue overflows; it took a bunch of reading until I learned that each application has to set its own backlog queue, and even further research to find what nginx's default value was (see the listen example below).
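    The nginx side has to be set explicitly; a sketch using the value from the "New Values" slide (port 80 is illustrative):

        # nginx asks the kernel for its own accept-queue size;
        # somaxconn silently caps whatever is requested here
        listen 80 backlog=16384;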

  • The kernel will reuse sockets in TIME_WAIT when it can; a socket in the TIME_WAIT state actually doesn't take up any resources.
  • Tweak these if you are sending/receiving large amounts of data and want to improve throughput.
    We changed them, but our per-server throughput is fairly low, so we didn't see any measurable impact.
  • The internet gets this wrong a lot: TIME_WAIT takes up no memory.
  • Everyone on the internet gets this wrong! If you really want to change the TIME_WAIT time, see ip_conntrack_tcp_timeout_time_wait in the ip_conntrack module (a quick way to count TIME_WAIT sockets is shown below).
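    A quick way to count sockets sitting in TIME_WAIT (a sketch, not from the deck):

        # the first line of output is a header, so subtract one
        $ ss -tan state time-wait | wc -l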
  • The pressure relates to the rmem and wmem settings we set earlier.
  • Definitions are wrong and harmful settings are recommended; we've even seen this in a lot of books when searching books.google.com for settings.
  • If you are reading any blog/book that recommends enabling this, run far away
  • Amazing read into why recycle is bad, and why TIME_WAIT exists
  • Allows for more data in flight; if you are serving larger content, you will see nice improvements here.
  • defer saves resources where the handshake occurs but no data is sent, or data is delayed. It leaves nginx free to deal with connections that are already sending a data payload.
  • We set both to on, since we have small payloads.
    tcp_nopush = the application controls how packets are built
    tcp_nodelay = seamless to the developer, it just happens
    multi_accept we have off; given our constant stream of connections, it can overwhelm downstream (see the config sketch below)
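    A sketch of how those choices might look in nginx config (directive placement follows the stock nginx docs; only the on/off choices above come from the talk, everything else is illustrative):

        events {
            multi_accept off;   # the default; a constant stream of connections can overwhelm workers
        }

        http {
            sendfile    on;     # zero-copy, happens in kernel space
            tcp_nopush  on;     # TCP_CORK: let the application fill packets
            tcp_nodelay on;     # disable Nagle on keep-alive connections; small payloads, low latency
            server {
                listen 80 backlog=16384 deferred;  # TCP_DEFER_ACCEPT: only wake nginx when data arrives
            }
        }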
  • Lower CPU utilization as well
  • Previous behavior: if a request hit the ELB in AZ us-east-1d, it would only get routed to instances in that AZ. This change really smoothed out distribution for us.
  • Indicates capacity issues
  • It’s easy to get carried away and tune too many things (premature optimization) or settings which may have little to no effect for you.
  • Don't trust random blogs; they're filled with terrible information, with sysctl settings defined wrong or left extremely vague.
Transcript: Tuning TCP and NGINX on EC2

    1. Tuning TCP and NGINX on EC2
    2. Who are we? Chartbeat measures and monetizes attention on the web. Working with 80% of the top US news sites and global media sites in 50 countries, Chartbeat brings together editors and advertisers to identify in real time the active time an audience consumes articles, videos, paid content, and display advertising.
    3. ● Founded in 2009 ● Hosted on AWS, 400-500 servers depending on time of day ● Around 180k - 220k req/sec ● 6 - 9 million concurrents (chartbeat.com/totaltotal)
    4. Who am I? ● Sr Web Operations Engineer ● Previously worked at ○ Bitly ○ TheStreet.com ○ Promotions.com
    5. Traffic Characteristics ● Every 15 seconds ● 213-byte request size ● 43-byte response size
    6. Problem ● Reports of slowness from some customers ● Taking 3 seconds to send data ● Default Retransmission Timeout (RFC 1122, Section 4.2.3.1): The following values SHOULD be used to initialize the estimation parameters for a new connection: (a) RTT = 0 seconds. (b) RTO = 3 seconds. (The smoothed variance is to be initialized to the value that will result in this RTO.)
    7. (image: flickr wallyg)
    8. (image: flickr oregondot)
    9. Now what? TCPDump + Wireshark confirms retransmissions
    10. DON’T GRAPH ALL THE THINGS ● Graph only relevant metrics ○ you’ll end up with a ton of red herrings
    11. Sources of info ● ss -s ○ summary of socket statistics: TCP: 10678 (estab 2503, closed 8167, orphaned 0, synrecv 0, timewait 8167/0), ports 0 ● netstat -s: "tcp_active_connections_openings", "tcp_connections_aborted_due_to_timeout", "tcp_data_loss_events", "tcp_failed_connection_attempts", "tcp_other_tcp_timeouts", "tcp_passive_connection_openings", "tcp_segments_retransmited", "tcp_segments_send_out", "tcp_syns_to_listen_sockets_dropped", "tcp_times_the_listen_queue_of_a_socket_overflowed"
    12. (image from TCP/IP Illustrated, Volume 1, Second Ed.)
    13. Logster + Graphite https://github.com/etsy/logster Tails logs, generates metrics, and outputs to Graphite or Ganglia
    14. FINDINGS
    15. Sources of info ● netstat -s: "tcp_active_connections_openings", "tcp_connections_aborted_due_to_timeout", "tcp_data_loss_events", "tcp_failed_connection_attempts", "tcp_other_tcp_timeouts", "tcp_passive_connection_openings", "tcp_segments_retransmited", "tcp_segments_send_out", "tcp_syns_to_listen_sockets_dropped", "tcp_times_the_listen_queue_of_a_socket_overflowed" ● Callouts on the slide: “Values > 1, can’t be good” / “Confirmed what we suspected” / “WHUT”
    16. (image)
    17. net.ipv4.tcp_max_syn_backlog ● net.core.somaxconn ● Nginx: listen backlog=#### (Systems Performance: Enterprise and the Cloud by Brendan Gregg, pg. 492)
    18. Insane Defaults ● net.core.netdev_max_backlog = 1000 ○ Per CPU backlog? ○ Network frames ● net.ipv4.tcp_max_syn_backlog = 128 ● net.core.somaxconn = 128 ● nginx listen backlog = 511 ?!? ○ Silently truncated to somaxconn value
    19. New Values ● net.core.netdev_max_backlog = 16384 ● net.ipv4.tcp_max_syn_backlog = 65536 ● net.core.somaxconn = 16384 ● nginx listen backlog = 16384 ○ should be <= somaxconn
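    One way to apply and persist the new values (a sketch; the file name is just a convention):

        # /etc/sysctl.d/10-tcp-tuning.conf
        net.core.netdev_max_backlog = 16384
        net.ipv4.tcp_max_syn_backlog = 65536
        net.core.somaxconn = 16384

        # load without rebooting
        $ sudo sysctl -p /etc/sysctl.d/10-tcp-tuning.conf

    The nginx backlog=16384 still has to be set separately on the listen directive, as noted on slides 18-19.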
    20. Results
    21. Further settings explored ● net.ipv4.tcp_slow_start_after_idle ● net.ipv4.tcp_max_tw_buckets ● net.ipv4.tcp_rmem/wmem ● net.ipv4.tcp_fin_timeout ● net.ipv4.tcp_mem
    22. net.ipv4.tcp_slow_start_after_idle Set to 0 to ensure connections don’t go back to the default window size after being idle too long. Example: HTTP keep-alive
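    The corresponding sysctl, checking the current value first (a sketch):

        $ sysctl net.ipv4.tcp_slow_start_after_idle
        $ sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0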
    23. net.ipv4.tcp_max_tw_buckets Max number of sockets in TIME_WAIT. We actually set this very high, since before we moved instances behind an ELB it was normal to have 200k+ sockets in the TIME_WAIT state. Exceeding this leads to sockets being torn down until under the limit.
    24. net.ipv4.tcp_rmem/wmem Format: min default max (in bytes). The kernel will autotune the number of bytes to use for each socket based on these settings. It will start at default and work between the min and max.
    25. net.ipv4.tcp_fin_timeout The time a connection should spend in the FIN_WAIT_2 state. Default is 60 seconds; lowering this will free memory more quickly and transition the socket to TIME_WAIT. This will NOT reduce the time a socket is in TIME_WAIT, which is set to 2 * MSL (max segment lifetime).
    26. net.ipv4.tcp_fin_timeout continued... In practice the TIME_WAIT interval is hardcoded in the kernel at 60 seconds! https://github.com/torvalds/linux/blob/master/include/net/tcp.h#L115 #define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT state, about 60 seconds */
    27. net.ipv4.tcp_mem Format: low pressure max (in pages!). Below low, the kernel won’t put pressure on sockets to reduce memory usage. Once pressure hits, sockets reduce memory until low is hit. If max is hit, no new sockets.
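    A sketch of both formats side by side (the numbers only illustrate the min/default/max and low/pressure/max layout; tcp_mem defaults are computed from RAM at boot, so don't copy these):

        # min   default  max        (bytes, autotuned per socket)
        net.ipv4.tcp_rmem = 4096 87380 6291456
        net.ipv4.tcp_wmem = 4096 16384 4194304
        # low   pressure  max       (pages, across all TCP sockets)
        net.ipv4.tcp_mem = 190416 253888 380832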
    28. (image)
    29. (image)
    30. net.ipv4.tcp_tw_recycle (DANGEROUS) ● Clients behind NAT/stateful firewalls will get dropped ● *99.99999999% of the time this should never be enabled (*probably 100%, but there may be a valid case out there)
    31. net.ipv4.tcp_tw_reuse ● Makes a safer attempt at freeing sockets in the TIME_WAIT state.
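    The two sysctls side by side (a sketch, on kernels that still expose tcp_tw_recycle):

        net.ipv4.tcp_tw_reuse = 1    # the safer option discussed above
        net.ipv4.tcp_tw_recycle = 0  # leave the dangerous one off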
    32. Recycle vs Reuse Deep Dive http://bit.ly/tcp-time-wait
    33. One last thing… TCP Congestion Window - initcwnd (initial) Starting in kernel 2.6.39, set to 10. Previous default was 3! http://research.google.com/pubs/pub36640.html Older kernel? $ ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 10
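    To verify the route picked up the change (a sketch; initcwnd only shows up in the output once it has been set explicitly):

        $ ip route show default

    Note that an ip route change does not survive a reboot, so it needs to be reapplied at boot.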
    34. NGINX
    35. listen statement ● backlog ○ limited by net.core.somaxconn ● defer ○ TCP_DEFER_ACCEPT - wait until we receive a data packet before passing the socket to the server. Completing the TCP handshake won’t trigger an accept()
    36. server block ● sendfile ○ Saves context switching from userspace on read/write. ○ “zero copy”, happens in kernel space ● tcp_nopush ○ TCP_CORK ○ allows the application to control building of packets, e.g. pack a packet with a full HTTP response ● tcp_nodelay ○ Nagle’s Algorithm ○ Only affects keep-alive connections ● multi_accept ○ Accept all connections on the listen queue at once (careful, can overwhelm workers)
    37. Nagle’s Algorithm (tcp_nodelay) Small payload + need for low latency? Disable it.
    38. HTTP Keep-Alive ● Enabled once behind an ELB ● Given the small payload and 15 seconds between data, it was a waste of resources for us to enable it while exposed directly to the net
    39. HTTP Keep-Alive cont.. ● Also enable on upstream proxies ○ Available since 1.1.4 ○ *cough* had to upgrade Nginx and fix a memory leak dealing with libevent and keepalives before we could get this fully set up
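    The standard pattern for upstream keep-alive, following the nginx keepalive directive docs (the upstream name and address here are placeholders):

        upstream backend {
            server 10.0.0.10:8080;
            keepalive 32;                       # idle keep-alive connections cached per worker
        }

        server {
            location / {
                proxy_pass http://backend;
                proxy_http_version 1.1;         # keep-alive to upstreams requires HTTP/1.1
                proxy_set_header Connection ""; # clear the "Connection: close" nginx would otherwise send
            }
        }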
    40. ELB
    41. Cross-Zone load balancing Ensures requests to each ELB in each AZ go to ALL instances in ALL AZs
    42. Idle Connection Timeout ● Defaults to 60 seconds ● Finally tunable via the API ● Tweak if doing anything long-lived, e.g. WebSockets, or ensure you are sending “pings”
    43. Connection draining “Graceful” removal of a node from the ELB; ensures existing connections can finish instead of a hard cutoff (the old behavior)
    44. Metrics to monitor ● SurgeQueueLength (Not Good) A count of the total number of requests that are pending submission to a registered instance. ● SpilloverCount (BAD) A count of the total number of requests that were rejected due to the queue being full.
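    A sketch of pulling one of those metrics out of CloudWatch with the AWS CLI (the load balancer name and time window are placeholders):

        $ aws cloudwatch get-metric-statistics \
            --namespace AWS/ELB \
            --metric-name SpilloverCount \
            --dimensions Name=LoadBalancerName,Value=my-elb \
            --statistics Sum \
            --period 60 \
            --start-time 2014-06-26T00:00:00Z \
            --end-time 2014-06-26T01:00:00Z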
    45. Conclusions ● The internet is full of lies ● With enough traffic, tweaking system and application defaults becomes necessary ● Find trusted sources (me? maybe?) for settings and test in staging environments ● Measure impact and understand what metrics may be affected by your tweaks ● Don’t get lost in all the sysctl settings ● TCP is complicated
    46. FIN FIN_WAIT_1 FIN_WAIT_2 TIME_WAIT
    47. Resources and References https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt ● man tcp(7)
    48. Additional reading http://engineering.chartbeat.com Full story about our experiences with our architecture and the material discussed in these slides
    49. Questions / Comments? @Lintzston justin@chartbeat.com
