Frequent calls to DescribeInstances cause SQS backup and late notifications #494

`aws-node-termination-handler` seems to have the following logic (in `pkg/monitor/sqsevent/sqs-monitor.go`):

- Pull five messages from the SQS queue every 2 seconds with a 20-second-per-message visibility timeout
- Attempt to call the `ec2.DescribeInstances` API once per message (as part of the `retrieveNodeName` function). If the SQS queue is consistently full, this is 2.5 calls to `DescribeInstances` per second per queue.
- If `DescribeInstances` hits a rate limit, do not remove the message from the queue; instead leave it there to be retried 20 seconds later
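The per-cycle behaviour described above can be sketched roughly as follows. This is an illustration of my reading of the logic, not the actual code in `sqs-monitor.go`: the `Message` type, `errThrottled`, `processBatch`, and the `resolve` callback (standing in for `retrieveNodeName` and its `DescribeInstances` call) are all hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical stand-in for an SQS message that carries an EC2 instance ID.
type Message struct {
	ID         string
	InstanceID string
}

// errThrottled is a hypothetical stand-in for an EC2 API rate-limit error.
var errThrottled = errors.New("throttled")

// processBatch models one polling cycle: one resolve call per message.
// Messages whose lookup succeeds are marked for deletion; messages whose
// lookup fails stay on the queue and reappear after the visibility timeout.
func processBatch(batch []Message, resolve func(instanceID string) (string, error)) (toDelete, toRetry []string) {
	for _, m := range batch {
		if _, err := resolve(m.InstanceID); err != nil {
			toRetry = append(toRetry, m.ID)
			continue
		}
		toDelete = append(toDelete, m.ID)
	}
	return toDelete, toRetry
}

func main() {
	calls := 0
	// Simulated lookup that starts throttling after three calls.
	resolve := func(instanceID string) (string, error) {
		calls++
		if calls > 3 {
			return "", errThrottled
		}
		return "node-for-" + instanceID, nil
	}

	batch := []Message{
		{"m1", "i-aaa"}, {"m2", "i-bbb"}, {"m3", "i-ccc"}, {"m4", "i-ddd"}, {"m5", "i-eee"},
	}
	toDelete, toRetry := processBatch(batch, resolve)
	fmt.Println("delete:", toDelete)
	fmt.Println("retry after visibility timeout:", toRetry)
}
```

The point of the sketch is that every message costs one API call, and a throttled call costs a full visibility timeout before that message is looked at again.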

We are now operating at sufficient scale that `DescribeInstances` often hits API rate limits. I'm not sure those limits are published, so I can't tell whether the 2.5 calls per second from each `aws-node-termination-handler` are a significant cause of our rate limiting or merely one contributor. But once rate limiting begins, we observe the following failure mode:

- `aws-node-termination-handler` will never pull more than 5 messages every 2 seconds.
- If it hits a rate limit the message stays on the queue to be redelivered 20 seconds later.
- The queue gets backed up. We consistently observe 40-50 messages in flight on our busiest SQS queues, though never more than 50 (note that 5 messages * 20 seconds / 2 seconds = 50 messages). There is often a backup of hundreds of messages and the wait times are 50 seconds to 300 seconds or more.
- At this point the majority of messages are handled so late that the instances are already gone from EC2. The "good" news is that, when `DescribeInstances` manages to get through to the API, the message for an already-terminated instance gets deleted from the backed-up queue. The bad news is that it's way too late to warn Kubernetes by that point, so we are experiencing sudden unexpected failures of Kubernetes pods.
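For reference, the numbers above fall out of simple arithmetic on the three settings (5 messages per pull, a pull every 2 seconds, a 20-second visibility timeout). The helper names below are made up for illustration, and the drain time is only a lower bound that assumes every lookup succeeds on the first try:

```go
package main

import "fmt"

// maxInFlight is the most messages that can be invisible at once:
// each pull hides batchSize messages for visibilitySec seconds.
func maxInFlight(batchSize, visibilitySec, pollIntervalSec int) int {
	return batchSize * visibilitySec / pollIntervalSec
}

// describeCallsPerSecond is the DescribeInstances call rate when every pull is full.
func describeCallsPerSecond(batchSize int, pollIntervalSec float64) float64 {
	return float64(batchSize) / pollIntervalSec
}

// minDrainSeconds is the best-case time to work through a backlog,
// assuming no message ever needs a retry.
func minDrainSeconds(backlog, batchSize, pollIntervalSec int) int {
	return backlog * pollIntervalSec / batchSize
}

func main() {
	fmt.Println(maxInFlight(5, 20, 2))        // 50
	fmt.Println(describeCallsPerSecond(5, 2)) // 2.5
	fmt.Println(minDrainSeconds(500, 5, 2))   // 200
}
```

So a backlog of a few hundred messages already implies waits of minutes, even before any retries caused by throttling are counted.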

I can think of a couple of ways to change this code to help avoid this problem:

- Stop making so many calls to `DescribeInstances`. The purpose of these calls is to translate EC2 instance IDs to node names. Instance IDs are said to be unique (and are certainly vanishingly unlikely to be reused in rapid succession) and private DNS names are not mutable, so this mapping cannot change and can be cached -- we can periodically call `DescribeInstances` for all instances and cache the results.
- Make the number of SQS messages retrieved per 2-second cycle into a configurable parameter that can be tuned along with the number of workers. I'm not even sure why we fetch so few messages each time, except as a means of throttling the rate of `DescribeInstances` API calls, which could be managed differently as suggested above.
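To make the first suggestion concrete, here is a minimal sketch of an instance-ID-to-node-name cache. None of these names exist in `aws-node-termination-handler`; `NodeNameCache`, `Refresh`, and `Get` are hypothetical, and the `lookup` callback stands in for a single-instance `DescribeInstances` call used only on a cache miss. `Refresh` would be fed from a periodic bulk listing of all instances.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeNameCache maps EC2 instance IDs to node names (hypothetical type).
type NodeNameCache struct {
	mu     sync.RWMutex
	names  map[string]string
	lookup func(instanceID string) (string, error) // fallback used on a miss
}

func NewNodeNameCache(lookup func(instanceID string) (string, error)) *NodeNameCache {
	return &NodeNameCache{names: make(map[string]string), lookup: lookup}
}

// Refresh merges the result of a bulk listing into the cache.
// Existing entries are kept, so names of recently terminated instances survive.
func (c *NodeNameCache) Refresh(bulk map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, name := range bulk {
		c.names[id] = name
	}
}

// Get returns the cached node name, calling lookup only when the ID is unknown.
func (c *NodeNameCache) Get(instanceID string) (string, error) {
	c.mu.RLock()
	name, ok := c.names[instanceID]
	c.mu.RUnlock()
	if ok {
		return name, nil
	}
	name, err := c.lookup(instanceID)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.names[instanceID] = name
	c.mu.Unlock()
	return name, nil
}

func main() {
	apiCalls := 0
	cache := NewNodeNameCache(func(id string) (string, error) {
		apiCalls++ // each call here would be one DescribeInstances request
		return "node-for-" + id, nil
	})
	cache.Refresh(map[string]string{"i-aaa": "ip-10-0-0-1.ec2.internal"})

	for _, id := range []string{"i-aaa", "i-bbb", "i-bbb", "i-aaa"} {
		name, _ := cache.Get(id)
		fmt.Println(id, "->", name)
	}
	fmt.Println("fallback API calls:", apiCalls)
}
```

With a cache like this, per-message API calls would only happen for instances launched since the last refresh. The sketch does not evict old entries, so a real version would need some bound on memory.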
