Tuning TCP and NGINX on EC2

Our Sr. Web Operations Engineer, Justin Lintz, goes over some of the parameters we tuned in TCP and NGINX to improve the performance and stability of our systems. These slides complement a two-part post on our engineering blog:

http://engineering.chartbeat.com/2014/01/02/part-1-lessons-learned-tuning-tcp-and-nginx-in-ec2/

http://engineering.chartbeat.com/2014/02/12/part-2-lessons-learned-tuning-tcp-and-nginx-in-ec2/

Speaker notes
  • Record traffic during the US presidential election and the World Cup USA vs. Germany match: 10+ million concurrents. The presidential election at the time was 2x our normal traffic.
  • High packet rate, low bandwidth. The 43-byte response is a small empty image we send; we need it for error-handling purposes on the frontend side, since we can't send an empty response.
  • Reports from users about slowness in sending “pings” to our servers. With slow clients, slowness doesn't really affect our numbers too much as long as the ping arrives in < 5 seconds. We asked for some numbers and saw pings taking around 3 seconds. That number sets off some alarms.
  • These two numbers should raise alarms when you are troubleshooting TCP connections. 3 seconds is the default timeout before retrying a connection; it backs off and retries again after 6 seconds, so 9 seconds total for a connection.
  • Maybe it's on the client side? How do we know it's on us? At the time our Pingdom monitoring didn't show anything unusual; we later learned this is definitely not enough.
  • Especially if you don’t have a good baseline for some metrics, you will end up chasing oddities in graphs that may be completely irrelevant
  • We only graphed the relevant info from netstat -s; there are a ton of metrics that may be useful for debugging other issues, but we started with these since they appeared to be most related to the issue at hand. For example, “fast retransmit”, while relevant, wouldn't indicate a delay of 3 seconds, since it bypasses the timeout. We push these to Ganglia/Graphite (see the snippet below).
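    A quick way to eyeball those counters before wiring up graphs (a minimal sketch, not from the deck; the exact wording of the counters varies by net-tools version):

        # prints lines such as "... times the listen queue of a socket overflowed"
        # and "... segments retransmited" (sic), which map to the keys we graph
        $ netstat -s | egrep -i 'listen|retransmit'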
  • We had to enable logging and discard logs after an hour due to space constraints. Rotation impacts performance; switching to ext4 on the log volume helped, along with not compressing.
  • This confirmed the issues but didn't give us a source, just some symptoms to look into.
  • Two queues: the first holds half-established connections; you can make it large to help with SYN floods, although given today's flood attacks it's probably not much help.
    The second queue holds established connections waiting for your app to pluck off (see the sketch below for inspecting both).
    net.ipv4.tcp_max_syn_backlog = system wide
    net.core.somaxconn = per process
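    A quick way to look at the accept queue per listener (a sketch, not from the deck; on Linux, for sockets in the LISTEN state, Recv-Q is generally the current accept-queue depth and Send-Q the configured backlog):

        # inspect every listening socket's queue depth and backlog
        $ ss -lnt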
  • Still not 100% sure what this controls; from looking at the kernel source, it appears to be this.

    The nginx backlog originally wasn't documented; I had to find it in the source code and from googling.
  • Didn't know about the nginx listen backlog at first. We initially changed the first three values and saw a slight decrease in timeouts and listen queue overflows; it took a bunch of reading until I learned that each application has to set its own backlog queue, and even further research to find what nginx's default value was (see the listen example below).
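    The nginx side has to be set explicitly; a sketch using the value from the "New Values" slide (port 80 is illustrative):

        # nginx asks the kernel for its own accept-queue size;
        # somaxconn silently caps whatever is requested here
        listen 80 backlog=16384;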

  • The kernel will reuse sockets in TIME_WAIT when it can; a socket in the TIME_WAIT state actually doesn't take up any resources.
  • Tweak these if you are sending/receiving large amounts of data and want to improve throughput.
    We changed them, but our per-server throughput is fairly low, so we didn't see any measurable impact.
  • The internet gets this wrong a lot: TIME_WAIT takes up no memory.
  • Everyone on the internet gets this wrong! If you really want to change the TIME_WAIT time, see ip_conntrack_tcp_timeout_time_wait in the ip_conntrack module (a quick way to count TIME_WAIT sockets is shown below).
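    A quick way to count sockets sitting in TIME_WAIT (a sketch, not from the deck):

        # the first line of output is a header, so subtract one
        $ ss -tan state time-wait | wc -l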
  • The pressure relates to the rmem and wmem settings we set earlier.
  • Definitions are wrong and harmful settings are recommended; we've even seen this in a lot of books when searching books.google.com for settings.
  • If you are reading any blog/book that recommends enabling this, run far away
  • Amazing read into why recycle is bad, and why TIME_WAIT exists
  • Allows for more data in flight; if you are serving larger content, you will see nice improvements here.
  • defer saves resources where the handshake occurs but no data is sent, or data is delayed. It leaves nginx free to deal with connections that are already sending a data payload.
  • We set both to on, since we have small payloads.
    tcp_nopush = the application controls how packets are built
    tcp_nodelay = seamless to the developer, it just happens
    multi_accept we have off; given our constant stream of connections, it can overwhelm downstream (see the config sketch below)
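    A sketch of how those choices might look in nginx config (directive placement follows the stock nginx docs; only the on/off choices above come from the talk, everything else is illustrative):

        events {
            multi_accept off;   # the default; a constant stream of connections can overwhelm workers
        }

        http {
            sendfile    on;     # zero-copy, happens in kernel space
            tcp_nopush  on;     # TCP_CORK: let the application fill packets
            tcp_nodelay on;     # disable Nagle on keep-alive connections; small payloads, low latency
            server {
                listen 80 backlog=16384 deferred;  # TCP_DEFER_ACCEPT: only wake nginx when data arrives
            }
        }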
  • Lower CPU utilization as well
  • Previous behavior: if a request hit the ELB in AZ us-east-1d, it would only get routed to instances in that AZ. This change really smoothed out distribution for us.
  • Indicates capacity issues
  • It’s easy to get carried away and tune too many things (premature optimization) or settings which may have little to no effect for you.
  • Don't trust random blogs; they're filled with terrible information, with sysctl settings defined wrong or left extremely vague.
Transcript: Tuning TCP and NGINX on EC2

    1. Tuning TCP and NGINX on EC2
    2. Who are we? Chartbeat measures and monetizes attention on the web. Working with 80% of the top US news sites and global media sites in 50 countries, Chartbeat brings together editors and advertisers to identify in real time the active time an audience consumes articles, videos, paid content, and display advertising.
    3. ● Founded in 2009 ● Hosted on AWS, 400-500 servers depending on time of day ● Around 180k - 220k req/sec ● 6 - 9 million concurrents (chartbeat.com/totaltotal)
    4. Who am I? ● Sr Web Operations Engineer ● Previously worked at ○ Bitly ○ TheStreet.com ○ Promotions.com
    5. Traffic Characteristics ● Every 15 seconds ● 213-byte request size ● 43-byte response size
    6. Problem ● Reports of slowness from some customers ● Taking 3 seconds to send data ● Default Retransmission Timeout (RFC 1122, Section 4.2.3.1): The following values SHOULD be used to initialize the estimation parameters for a new connection: (a) RTT = 0 seconds. (b) RTO = 3 seconds. (The smoothed variance is to be initialized to the value that will result in this RTO.)
    7. (image: flickr wallyg)
    8. (image: flickr oregondot)
    9. Now what? TCPDump + Wireshark confirms retransmissions
    10. DON’T GRAPH ALL THE THINGS ● Graph only relevant metrics ○ you’ll end up with a ton of red herrings
    11. Sources of info ● ss -s ○ summary of socket statistics: TCP: 10678 (estab 2503, closed 8167, orphaned 0, synrecv 0, timewait 8167/0), ports 0 ● netstat -s: "tcp_active_connections_openings", "tcp_connections_aborted_due_to_timeout", "tcp_data_loss_events", "tcp_failed_connection_attempts", "tcp_other_tcp_timeouts", "tcp_passive_connection_openings", "tcp_segments_retransmited", "tcp_segments_send_out", "tcp_syns_to_listen_sockets_dropped", "tcp_times_the_listen_queue_of_a_socket_overflowed"
    12. (image from TCP/IP Illustrated, Volume 1, Second Ed.)
    13. Logster + Graphite https://github.com/etsy/logster Tails logs, generates metrics, and outputs to Graphite or Ganglia
    14. FINDINGS
    15. Sources of info ● netstat -s: "tcp_active_connections_openings", "tcp_connections_aborted_due_to_timeout", "tcp_data_loss_events", "tcp_failed_connection_attempts", "tcp_other_tcp_timeouts", "tcp_passive_connection_openings", "tcp_segments_retransmited", "tcp_segments_send_out", "tcp_syns_to_listen_sockets_dropped", "tcp_times_the_listen_queue_of_a_socket_overflowed" ● Callouts on the slide: “Values > 1, can’t be good” / “Confirmed what we suspected” / “WHUT”
    16. (image)
    17. net.ipv4.tcp_max_syn_backlog ● net.core.somaxconn ● Nginx: listen backlog=#### (Systems Performance: Enterprise and the Cloud by Brendan Gregg, pg. 492)
    18. Insane Defaults ● net.core.netdev_max_backlog = 1000 ○ Per CPU backlog? ○ Network frames ● net.ipv4.tcp_max_syn_backlog = 128 ● net.core.somaxconn = 128 ● nginx listen backlog = 511 ?!? ○ Silently truncated to somaxconn value
    19. New Values ● net.core.netdev_max_backlog = 16384 ● net.ipv4.tcp_max_syn_backlog = 65536 ● net.core.somaxconn = 16384 ● nginx listen backlog = 16384 ○ should be <= somaxconn
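    One way to apply and persist the new values (a sketch; the file name is just a convention):

        # /etc/sysctl.d/10-tcp-tuning.conf
        net.core.netdev_max_backlog = 16384
        net.ipv4.tcp_max_syn_backlog = 65536
        net.core.somaxconn = 16384

        # load without rebooting
        $ sudo sysctl -p /etc/sysctl.d/10-tcp-tuning.conf

    The nginx backlog=16384 still has to be set separately on the listen directive, as noted on slides 18-19.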
    20. Results
    21. Further settings explored ● net.ipv4.tcp_slow_start_after_idle ● net.ipv4.tcp_max_tw_buckets ● net.ipv4.tcp_rmem/wmem ● net.ipv4.tcp_fin_timeout ● net.ipv4.tcp_mem
    22. net.ipv4.tcp_slow_start_after_idle Set to 0 to ensure connections don’t go back to the default window size after being idle too long. Example: HTTP keep-alive
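    The corresponding sysctl, checking the current value first (a sketch):

        $ sysctl net.ipv4.tcp_slow_start_after_idle
        $ sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0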
    23. net.ipv4.tcp_max_tw_buckets Max number of sockets in TIME_WAIT. We actually set this very high, since before we moved instances behind an ELB it was normal to have 200k+ sockets in the TIME_WAIT state. Exceeding this leads to sockets being torn down until under the limit.
    24. net.ipv4.tcp_rmem/wmem Format: min default max (in bytes). The kernel will autotune the number of bytes to use for each socket based on these settings. It will start at default and work between the min and max.
    25. net.ipv4.tcp_fin_timeout The time a connection should spend in the FIN_WAIT_2 state. Default is 60 seconds; lowering this will free memory more quickly and transition the socket to TIME_WAIT. This will NOT reduce the time a socket is in TIME_WAIT, which is set to 2 * MSL (max segment lifetime).
    26. net.ipv4.tcp_fin_timeout continued... In practice the TIME_WAIT interval is hardcoded in the kernel at 60 seconds! https://github.com/torvalds/linux/blob/master/include/net/tcp.h#L115 #define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT state, about 60 seconds */
    27. net.ipv4.tcp_mem Format: low pressure max (in pages!). Below low, the kernel won’t put pressure on sockets to reduce memory usage. Once pressure hits, sockets reduce memory until low is hit. If max is hit, no new sockets.
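    A sketch of both formats side by side (the numbers only illustrate the min/default/max and low/pressure/max layout; tcp_mem defaults are computed from RAM at boot, so don't copy these):

        # min   default  max        (bytes, autotuned per socket)
        net.ipv4.tcp_rmem = 4096 87380 6291456
        net.ipv4.tcp_wmem = 4096 16384 4194304
        # low   pressure  max       (pages, across all TCP sockets)
        net.ipv4.tcp_mem = 190416 253888 380832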
    28. (image)
    29. (image)
    30. net.ipv4.tcp_tw_recycle (DANGEROUS) ● Clients behind NAT/stateful firewalls will get dropped ● *99.99999999% of the time this should never be enabled (*probably 100%, but there may be a valid case out there)
    31. net.ipv4.tcp_tw_reuse ● Makes a safer attempt at freeing sockets in the TIME_WAIT state.
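    The two sysctls side by side (a sketch, on kernels that still expose tcp_tw_recycle):

        net.ipv4.tcp_tw_reuse = 1    # the safer option discussed above
        net.ipv4.tcp_tw_recycle = 0  # leave the dangerous one off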
    32. Recycle vs Reuse Deep Dive http://bit.ly/tcp-time-wait
    33. One last thing… TCP Congestion Window - initcwnd (initial) Starting in kernel 2.6.39, set to 10. Previous default was 3! http://research.google.com/pubs/pub36640.html Older kernel? $ ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 10
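    To verify the route picked up the change (a sketch; initcwnd only shows up in the output once it has been set explicitly):

        $ ip route show default

    Note that an ip route change does not survive a reboot, so it needs to be reapplied at boot.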
    34. NGINX
    35. listen statement ● backlog ○ limited by net.core.somaxconn ● defer ○ TCP_DEFER_ACCEPT - wait until we receive a data packet before passing the socket to the server. Completing the TCP handshake won’t trigger an accept()
    36. server block ● sendfile ○ Saves context switching from userspace on read/write. ○ “zero copy”, happens in kernel space ● tcp_nopush ○ TCP_CORK ○ allows the application to control building of packets, e.g. pack a packet with a full HTTP response ● tcp_nodelay ○ Nagle’s Algorithm ○ Only affects keep-alive connections ● multi_accept ○ Accept all connections on the listen queue at once (careful, can overwhelm workers)
    37. Nagle’s Algorithm (tcp_nodelay) Small payload + need for low latency? Disable it.
    38. HTTP Keep-Alive ● Enabled once behind an ELB ● Given the small payload and 15 seconds between data, it was a waste of resources for us to enable it while exposed directly to the net
    39. HTTP Keep-Alive cont.. ● Also enable on upstream proxies ○ Available since 1.1.4 ○ *cough* had to upgrade Nginx and fix a memory leak dealing with libevent and keepalives before we could get this fully set up
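    The standard pattern for upstream keep-alive, following the nginx keepalive directive docs (the upstream name and address here are placeholders):

        upstream backend {
            server 10.0.0.10:8080;
            keepalive 32;                       # idle keep-alive connections cached per worker
        }

        server {
            location / {
                proxy_pass http://backend;
                proxy_http_version 1.1;         # keep-alive to upstreams requires HTTP/1.1
                proxy_set_header Connection ""; # clear the "Connection: close" nginx would otherwise send
            }
        }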
    40. ELB
    41. Cross-Zone load balancing Ensures requests to each ELB in each AZ go to ALL instances in ALL AZs
    42. Idle Connection Timeout ● Defaults to 60 seconds ● Finally tunable via the API ● Tweak if doing anything long-lived, e.g. WebSockets, or ensure you are sending “pings”
    43. Connection draining “Graceful” removal of a node from the ELB; ensures existing connections can finish instead of a hard cutoff (the old behavior)
    44. Metrics to monitor ● SurgeQueueLength (Not Good) A count of the total number of requests that are pending submission to a registered instance. ● SpilloverCount (BAD) A count of the total number of requests that were rejected due to the queue being full.
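    A sketch of pulling one of those metrics out of CloudWatch with the AWS CLI (the load balancer name and time window are placeholders):

        $ aws cloudwatch get-metric-statistics \
            --namespace AWS/ELB \
            --metric-name SpilloverCount \
            --dimensions Name=LoadBalancerName,Value=my-elb \
            --statistics Sum \
            --period 60 \
            --start-time 2014-06-26T00:00:00Z \
            --end-time 2014-06-26T01:00:00Z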
    45. Conclusions ● The internet is full of lies ● With enough traffic, tweaking system and application defaults becomes necessary ● Find trusted sources (me? maybe?) for settings and test in staging environments ● Measure impact and understand what metrics may be affected by your tweaks ● Don’t get lost in all the sysctl settings ● TCP is complicated
    46. FIN FIN_WAIT_1 FIN_WAIT_2 TIME_WAIT
    47. Resources and References https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt ● man tcp(7)
    48. Additional reading http://engineering.chartbeat.com Full story about our experiences with our architecture and the material discussed in these slides
    49. Questions / Comments? @Lintzston justin@chartbeat.com
