TCP Congestion: Routers direct many inputs to one output. What's a router to do? Put the packets in a queue; since the queue is bounded, the overflow packets get dropped. If TCP senders start resending the lost packets, the network only gets more congested. Quoting Jim Roskind, the Internet is an “Equal Opportunity destroyer of packets”. Solution: shrink the send window to 50% of the aggregate flow when congestion is detected.
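A minimal sketch of that multiplicative-decrease idea, assuming a simplified TCP-like sender; the constants and the loop are illustrative, not real kernel behaviour:

```python
# Minimal sketch of TCP-style congestion control: additive increase,
# multiplicative decrease (AIMD). Illustrative only, not real TCP internals.

def next_window(cwnd: float, loss_detected: bool,
                increase: float = 1.0, decrease_factor: float = 0.5) -> float:
    """Return the next congestion window size, in segments."""
    if loss_detected:
        # Congestion signal (dropped packet): cut the send window to 50%.
        return max(1.0, cwnd * decrease_factor)
    # No congestion: probe for more bandwidth, one segment per round trip.
    return cwnd + increase

cwnd = 10.0
for rtt, loss in enumerate([False, False, True, False, False]):
    cwnd = next_window(cwnd, loss)
    print(f"RTT {rtt}: cwnd = {cwnd}")
```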
App Overload: When demand exceeds capacity, say 110 requests instead of 100, the app queues the excess, but users click the refresh button and effective demand climbs to 220 and beyond. Meanwhile the server is answering queries whose users have already hit reload, so effective work is zero even though the system is running at 100%. Solution: shed load, throw away half the requests that arrive rather than let the overload compound. One blunt trick is to reboot the server, which flushes all the queues and resets the backlog.
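A sketch of that "throw away half" idea as probabilistic load shedding in front of a bounded queue; the queue size, high-water mark, and function names are hypothetical:

```python
import queue
import random

# Illustrative load shedding: a bounded queue plus probabilistic shedding once
# the backlog passes a high-water mark. Sizes and names are hypothetical.

work_queue: "queue.Queue[str]" = queue.Queue(maxsize=200)
HIGH_WATER = 100  # start shedding well before the queue is completely full

def accept(request_id: str) -> bool:
    """Return True if the request was queued, False if it was shed."""
    if work_queue.qsize() >= HIGH_WATER and random.random() < 0.5:
        return False  # shed roughly half the arrivals; fail fast (e.g. HTTP 503)
    try:
        work_queue.put_nowait(request_id)
        return True
    except queue.Full:
        return False  # hard limit reached: shed everything beyond it
```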
Retry Storm: Imagine a chain of services calling each other and the database stops responding. The service nearest the database retries, and eventually the calling service above it retries too. “One retry doubles the load,” and the doubling compounds down the chain: with four services, the retries at the three layers above the database add up to 8 times the load on the DB.
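A back-of-the-envelope model of that amplification, assuming every retrying layer makes a fixed number of attempts against the layer below:

```python
# Back-of-the-envelope retry amplification: if every retrying layer makes
# `attempts_per_layer` tries against the layer below, the load multiplies
# at each layer of the call chain.

def amplification(attempts_per_layer: int, retrying_layers: int) -> int:
    """Load multiplier seen at the bottom of the chain (e.g. the database)."""
    return attempts_per_layer ** retrying_layers

# One retry (two attempts) at each of the three layers above the DB -> 8x load.
print(amplification(attempts_per_layer=2, retrying_layers=3))  # 8
```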
So the culprits: queues and retries?
HTTP Keep-alive: Reusing one TCP connection for multiple requests. The header Connection: keep-alive asks the server to keep the TCP connection open after it has sent its response, so that the client can send further requests on it. With pipelining the client does not even wait for the first response: it appends the second request directly after the first, and when the server replies with a second response on the same connection, it appends it to the first response.
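A raw-socket sketch of the idea: two requests are written back to back on one kept-alive connection before any response is read. example.com is just a placeholder host, and many servers ignore or limit pipelining, so treat this as an illustration rather than a recommended client:

```python
import socket

HOST = "example.com"  # placeholder host for illustration
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)

with socket.create_connection((HOST, 80)) as sock:
    # Append the second request directly after the first, before reading anything.
    sock.sendall((request + request).encode("ascii"))
    # The server appends its second response after the first, in order.
    sock.settimeout(2.0)
    data = b""
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    except socket.timeout:
        pass

print(data.count(b"HTTP/1.1 200"), "responses received on one connection")
```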
Solutions: Diligent code reviews and testing to find hidden queues, since an unbounded queue means unbounded latency. Reduce retries to distressed services. Be willing to fail/throttle some requests rather than spiral into collapse; failing fast is better than failing slowly with mounting latency. If you do retry, share/gossip that fact to all callers so they can back off too. Exponential backoff with several retries does not work against an already distressed service; one way to bound retries instead is sketched below.
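One way to bound retries is a client-side retry budget: retries draw from a shared token bucket, so a distressed downstream sees at most a capped amount of extra load. This is a sketch under assumed rates, not any particular library's API:

```python
import random
import time

class RetryBudget:
    """Token bucket that caps how many retries all callers may issue."""

    def __init__(self, tokens_per_second: float = 1.0, burst: float = 10.0):
        self.rate = tokens_per_second
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow_retry(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # budget exhausted: fail fast instead of piling on

budget = RetryBudget()

def call_with_one_retry(fn):
    """At most one retry per call, and only while the shared budget allows it."""
    try:
        return fn()
    except Exception:
        if not budget.allow_retry():
            raise
        time.sleep(random.uniform(0.0, 0.2))  # small jittered backoff
        return fn()
```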
Detection: Use CloudWatch metrics such as NetworkIn, NetworkOut, and NetworkPacketsIn/Out to form a baseline, then create alarm thresholds. Application Load Balancers can detect HTTP 500 responses and mark an instance as anomalous, and the Auto Scaling group can add servers as needed. Bring these metrics into a dashboard for easy visualization; a sketch of one such alarm follows.
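A sketch of such an alarm with boto3; the instance id, threshold, and SNS topic are placeholders you would replace with values derived from your own baseline:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when NetworkIn stays well above the observed baseline.
cloudwatch.put_metric_alarm(
    AlarmName="networkin-above-baseline",
    Namespace="AWS/EC2",
    MetricName="NetworkIn",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                      # 5-minute datapoints
    EvaluationPeriods=3,             # three consecutive breaches before alarming
    Threshold=500_000_000,           # bytes per period, taken from the baseline
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```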
Avoid Overload: Keep malicious traffic out (CloudFront, Shield Advanced for DDoS, and WAF to reject malicious requests); ask users to back off early (WAF can throttle requests and send custom error codes back, explicitly discouraging refreshes); and throttle upstream (using SQS, but the queue depth needs to be monitored and the producer told to stop, probably via a Lambda, as sketched below).
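A sketch of that watchdog as a Lambda handler; the queue URL, threshold, and the SSM parameter used as the "stop producing" flag are assumptions for illustration:

```python
import boto3

sqs = boto3.client("sqs")
ssm = boto3.client("ssm")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"  # placeholder
MAX_BACKLOG = 10_000  # hypothetical threshold

def handler(event, context):
    # Read the approximate queue depth.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    # Flip a flag the producer checks before publishing more work.
    ssm.put_parameter(
        Name="/ingest/producer-paused",
        Value="true" if backlog > MAX_BACKLOG else "false",
        Type="String",
        Overwrite=True,
    )
    return {"backlog": backlog, "paused": backlog > MAX_BACKLOG}
```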
Testing: Crush testing; if the system caches, make sure the requests are disparate so the cache does not absorb the load (use tools like iPerf or the AWS Fault Injection Service). When the alarms were raised, did the system inform the right parties, and how many times?
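A tiny illustration of making the load disparate so a cache cannot absorb it; the URL and query parameter are placeholders, and a real crush test would use a proper load tool:

```python
import concurrent.futures
import urllib.error
import urllib.request
import uuid

BASE_URL = "https://test.example.com/search?q="  # placeholder endpoint

def one_request(_):
    url = BASE_URL + uuid.uuid4().hex   # unique key per request -> cache miss
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return 0  # connection failure or timeout

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    statuses = list(pool.map(one_request, range(1000)))

print(sum(1 for s in statuses if s >= 500 or s == 0), "errors out of", len(statuses))
```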
Batch the requests, sort them, remove duplicates, and then update the database; rinse and repeat. When batching, make sure one bad request does not poison the whole batch (see the sketch below). Test the system above the expected load to explore stateful collapse (crush test), and see how quickly latency recovers.
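A sketch of batch-sort-dedupe-apply with per-item isolation so one malformed request cannot poison the batch; apply_to_database stands in for a real writer and is hypothetical:

```python
def apply_to_database(item: str) -> None:
    """Hypothetical stand-in for the real database write."""
    if not item:
        raise ValueError("malformed request")
    # ... real write would go here ...

def process_batch(requests: list[str]) -> tuple[int, list[str]]:
    batch = sorted(set(requests))          # sort and remove duplicates
    applied, poisoned = 0, []
    for item in batch:
        try:
            apply_to_database(item)        # apply items one by one
            applied += 1
        except Exception:
            poisoned.append(item)          # quarantine it; don't fail the batch
    return applied, poisoned

print(process_batch(["b", "a", "a", "", "c"]))  # then rinse and repeat on the next batch
```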
Availability vs Efficiency
Excess capacity is traditionally added to avoid overload, since teams don't want to operate near maximum load, but efficiency requires minimizing capacity.