At SigOpt, we provide a reliable, scalable service for enterprises and academics. In order to provide uptime that meets the demanding needs of our customers, we use AWS’ “Classic” Elastic Load Balancer (ELB) to distribute incoming requests among several AWS instances. These AWS instances host Nginx reverse proxies which pass requests to our SigOpt API services.
Infrastructure diagram for SigOpt API staging environment. During this investigation, the number of registered instances was 2 (n=2).
In April, we noticed that a small number of API requests were timing out, returning 504 (Gateway Timeout) errors to users. Requests will sometimes fail for reasons out of our control, but we wanted to make sure we were doing everything on our end to ensure a flawless experience with our product. This post will describe how we were able to get to the bottom of these mysterious 504 errors and what others can do to prevent them in the future.
ELB 5XX errors per day (representing less than 0.01% of total requests per day).
These timeout errors were hard to diagnose since the associated requests did not show up as errors in our API service logs. The 504s would be served almost immediately after requests were issued and our ELB logs indicated that these requests never reached a registered instance for processing:
Example log from ELB
According to AWS, there are two main causes of ELB 504s:
- The registered instances take too long to handle the request.
- The registered instances close the connections to the ELB prematurely.
From what I had observed so far, the second situation seemed more likely:
- There was no surge in CPU/memory load or request length on the instances when the 504 errors occurred.
- The ELB would respond with a 504 almost immediately, much faster than the ELB’s idle timeout and the processing time for an average request.
- The timeouts were much more frequent than service or machine restarts.
- Failed requests often couldn’t be traced from the load balancers to the corresponding backend service.
The second case can occur when a server running on a registered instance is incorrectly configured. AWS recommends setting all keepalive timeout configurations to be greater than the idle timeout. This enables ELBs to keep connections with registered instances open and reuse them for multiple requests, allowing for faster response times.
If the service on a registered instance has a keepalive timeout shorter than the ELB’s idle timeout, the registered instance will attempt to close the established connection with the ELB when the keepalive timeout is reached. The ELB will remain unaware of this closed connection and proceed as if the connection is still open. When the next request is received, the ELB will attempt to forward it to the registered instance which will not respond due to the closed connection, causing the ELB to return a 504 error.
Simplified connection diagram for a request where a misconfigured keepalive timeout causes a 504 error (excluding any latency/delay)
Investigating TCP Connections
Given the observations above, the most likely cause of the ELB 504 errors is that the Nginx proxy servers, hosted on our registered instances, are prematurely closing connections to the ELB.
A quick look over our Nginx configurations showed that the keepalive connections were set to 75s. This is longer than our configured ELB idle timeout of 60 seconds. As mentioned above, AWS’s recommendations state that the ELB timeout should be less than the keepalive timeout to avoid issues. Since our ELB idle timeout is less than our keepalive timeout, the connections between the Nginx proxies and the ELB should remain open for at least the ELB’s idle timeout. The ELB should be the one opening and closing the connections and we shouldn’t be experiencing these 504s.
To find out which side was ending the TCP connection – ELB or Nginx – I started monitoring the TCP connections on one of the registered instances in our staging environment. I recorded the TCP activity on the machine and saved it to a packet capture file for inspection:
sudo tcpdump tcp -Alvvv -s 0 port not 22 -w 504-investigation.pcap
After copying the packet capture to my machine, I inspected its contents with wireshark. Sorting the packets by similar “Info” – information in the packet data – it appeared that the instance (0.0.0.0) was sending Close Notify packets to the ELB nodes (184.108.40.206 and 220.127.116.11) every 10 seconds, indicating that it was preparing to close the connection:
Close Notify alerts sent by instance from packet capture with obfuscated IPs
This would be followed by the instance sending a TCP reset (RST) to the ELB node. One situation where the instance might send a TCP RST to the ELB is if the instance was preparing to close the connection, but did not have the opportunity to close it cleanly due to the arrival of an unexpected TCP packet sent from the ELB. This could happen if the ELB assumes that the connection is still valid and continues to send packets after the instance is already preparing to close the connection.
TCP RSTs from packet capture with obfuscated IPs
This investigation confirmed that:
- The instance (rather than the ELB) was the one terminating the TCP connections.
- TCP connection termination and re-establishment was happening more frequently than the ELB timeout.
Although the Nginx server’s keepalive_timeout was properly configured to be greater than the ELB’s idle timeout, this strange behavior warranted another look at the Nginx configurations.
Experimenting With Nginx Configurations
To see how these settings affect the TCP connections, I increased each setting independently to match the Nginx keepalive_timeout (75s), and analyzed new packet captures.
Increasing the client_body_timeout to 75s did not yield any changes. Since this timeout is only set between successive reads, this would require that the ELB take greater than 10 seconds to transmit request data to the registered instance. Since our body transfer doesn’t typically take that long, this configuration shouldn’t affect TCP connectivity.
After increasing client_header_timeout to 75s, Nginx was no longer sending TCP RST packets at 10s intervals. Instead, each of the ELB nodes gracefully closed (and reopened) the connections every 60s:
Close Notify alerts sent by ELB from packet capture with obfuscated IPs
Although changing the client_body_timeout from 10s to 75s did not affect the TCP connections, updating the client_header_timeout did. Nginx defines the client_header_timeout as being measured from the beginning of the TCP connection to the end of the header transmission. Since AWS ELBs preemptively open connections regardless of whether or not a request has been received, it is possible that no request (and therefore no header) is transmitted to the instance within 10s. This would cause Nginx to send a 408 error (request timeout) to the ELB node, resulting in the termination of the connection.
Validating TCP Keepalive Behaviour
Although the data from the tcpdump was relatively consistent, there could still be some irregularities in connection cadence. Incoming requests to the ELB would reset the idle timeout counter, causing connections to stay open for longer than 60s. Due to these irregularities, it would be difficult to make conclusions based off of the packet captures alone.
In order to confirm that Nginx was behaving the way the packet captures seemed to indicate, I decided to set up a TCP connection between my machine and a registered instance, bypassing the ELB, and measure how long the instance would take to close the connection.
First, I set up the instance to check for TCP connections to my machine (18.104.22.168), polling every one second:
sudo netstat -pntc | grep 22.214.171.124
From my local machine, I opened a TCP connection to the instance (not the ELB):
openssl s_client \
-connect 0.0.0.0:443 \
This opens a connection that times out after a default of 300s, greater than any of our hypothesized keepalive times (10s or 75s). With this configuration, the instance should close the connection after 10 seconds if client_header_timeout is set to 10s, or 75 seconds if it is set to 75s.
Using the original Nginx configurations (client_header_timeout = 10s), the connection remained established for only 10 seconds before being closed by the instance:
Output from TCP connection polling with a 1s polling interval
After updating the client_header_timeout to 75s, TCP connections were kept open for 75s rather than 10s. This confirmed that the new configurations were performing as expected.
For final validation, the client_header_timeout was reconfigured to 75s and deployed to our production Nginx servers (on 05/07). This yielded a significant decrease in 5XX errors, with the count per day dropping to 0!
0 ELB 5XX Errors per day after fix (05/07)
Each ELB node will preemptively open connections to registered instances, even if there are no incoming requests. This means that all relevant backend timeouts (not just explicit keepalive timeouts!) must be larger than the ELB’s idle timeout.
Explicitly configuring Nginx with a client_header_timeout value larger than our ELB’s idle timeout was sufficient in eliminating the 504 errors. The default value for the client_header_timeout is 60s, which is equal to the default ELB idle timeout. Therefore, even if this setting was not included in the Nginx configurations, it could still cause timeout errors. It is important to check the default values for relevant timeout configurations in addition to explicitly configured ones and ensure that they are all larger than your ELB’s idle timeout.