Closed

Labels: Priority: Low, Type: Enhancement, stale

Description
`aws-node-termination-handler` seems to have the following logic (in `pkg/monitor/sqsevent/sqs-monitor.go`), sketched in code below:

- Pull five messages from the SQS queue every 2 seconds, with a 20-second-per-message visibility timeout.
- Attempt to call the `ec2.DescribeInstances` API once per message (as part of the `retrieveNodeName` function). If the SQS queue is consistently full, this is 2.5 calls to `DescribeInstances` per second per queue.
- If `DescribeInstances` hits a rate limit, do not remove the message from the queue; instead leave it there to be retried 20 seconds later.
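For context, here is a rough sketch of that polling pattern using the aws-sdk-go v1 `sqs` and `ec2` clients. The queue URL and the `parseInstanceID` helper are illustrative placeholders, not the actual code in `sqs-monitor.go`:

```go
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// parseInstanceID stands in for extracting the instance ID from the
// EventBridge JSON in the message body.
func parseInstanceID(msg *sqs.Message) string {
	return aws.StringValue(msg.Body) // placeholder
}

func main() {
	sess := session.Must(session.NewSession())
	sqsClient := sqs.New(sess)
	ec2Client := ec2.New(sess)
	queueURL := "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue" // placeholder

	for range time.Tick(2 * time.Second) {
		// At most 5 messages per poll; each received message is hidden
		// from other consumers for 20 seconds.
		out, err := sqsClient.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(5),
			VisibilityTimeout:   aws.Int64(20),
		})
		if err != nil {
			log.Println("receive error:", err)
			continue
		}
		for _, msg := range out.Messages {
			// One DescribeInstances call per message, only to map the
			// instance ID to a node name (private DNS name).
			resp, err := ec2Client.DescribeInstances(&ec2.DescribeInstancesInput{
				InstanceIds: []*string{aws.String(parseInstanceID(msg))},
			})
			if err != nil {
				// On a throttling error the message is not deleted, so it
				// reappears once the 20-second visibility timeout expires.
				log.Println("DescribeInstances error:", err)
				continue
			}
			if len(resp.Reservations) > 0 && len(resp.Reservations[0].Instances) > 0 {
				nodeName := aws.StringValue(resp.Reservations[0].Instances[0].PrivateDnsName)
				log.Println("would cordon/drain node:", nodeName)
				// ...then delete the message from the queue.
			}
		}
	}
}
```

With a consistently full queue, this structure caps throughput at 2.5 messages per second and couples every message to one `DescribeInstances` call, which is the coupling behind the failure mode described below.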
We are now operating at sufficient scale that `DescribeInstances` is often hitting API rate limits -- I'm not sure those limits are published, so I'm not sure whether the 2.5 calls per second from each `aws-node-termination-handler` are a significant cause of our rate limiting, or merely one contributor. But once rate limiting begins we observe the following failure mode:

- `aws-node-termination-handler` will never pull more than 5 messages every 2 seconds.
- If it hits a rate limit, the message stays on the queue to be redelivered 20 seconds later.
- The queue gets backed up. We consistently observe 40-50 messages in flight on our busiest SQS queues, though never more than 50 (note that 5 messages * 20 seconds / 2 seconds = 50 messages). There is often a backlog of hundreds of messages, and wait times run from 50 seconds to 300 seconds or more.
- At this point the majority of messages are handled so late that the instances are already gone from EC2. The "good" news is that, when `DescribeInstances` manages to get through to the API, the message for an already-terminated instance gets deleted from the backed-up queue. The bad news is that it's far too late to warn Kubernetes by that point, so we are seeing sudden, unexpected failures of Kubernetes pods.
I can think of a couple of ways to change this code to help avoid the problem:

- Stop making so many calls to `DescribeInstances`. The purpose of these calls is to translate EC2 instance IDs into node names. Instance IDs are said to be unique (and are certainly vanishingly unlikely to be reused in rapid succession), and private DNS names are not mutable, so this mapping cannot change and can be cached -- we could periodically call `DescribeInstances` for all instances and cache the results (see the sketch after this list).
- Make the number of SQS messages retrieved per 2-second cycle a configurable parameter that can be tuned along with the number of workers. I'm not even sure why we fetch so few messages each time, except as a means of throttling the rate of `DescribeInstances` API calls, which could be managed differently, as suggested above.
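To illustrate the caching idea, here is a minimal sketch assuming an aws-sdk-go v1 EC2 client; the `NodeNameCache` type, its refresh interval, and the method names are hypothetical, not part of the existing code:

```go
package nodecache

import (
	"sync"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// NodeNameCache maps EC2 instance IDs to private DNS names (node names).
// Instance IDs are effectively never reused and private DNS names are
// immutable, so entries never need to be invalidated, only added.
type NodeNameCache struct {
	mu    sync.RWMutex
	ec2   ec2iface.EC2API
	names map[string]string
}

func NewNodeNameCache(client ec2iface.EC2API) *NodeNameCache {
	return &NodeNameCache{ec2: client, names: map[string]string{}}
}

// Refresh pages through DescribeInstances for all instances once and
// stores the results, so the per-message hot path never hits the API.
func (c *NodeNameCache) Refresh() error {
	return c.ec2.DescribeInstancesPages(&ec2.DescribeInstancesInput{},
		func(page *ec2.DescribeInstancesOutput, _ bool) bool {
			c.mu.Lock()
			defer c.mu.Unlock()
			for _, r := range page.Reservations {
				for _, inst := range r.Instances {
					c.names[aws.StringValue(inst.InstanceId)] = aws.StringValue(inst.PrivateDnsName)
				}
			}
			return true // keep paging
		})
}

// NodeName returns the cached node name for an instance ID, if known.
func (c *NodeNameCache) NodeName(instanceID string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	name, ok := c.names[instanceID]
	return name, ok
}

// RefreshEvery runs Refresh on a fixed interval in the background.
func (c *NodeNameCache) RefreshEvery(interval time.Duration) {
	go func() {
		for range time.Tick(interval) {
			// A failed refresh just means slightly staler data; the next
			// tick will try again.
			_ = c.Refresh()
		}
	}()
}
```

The message handler would then consult `NodeName(instanceID)` first and fall back to a direct `DescribeInstances` call only on a cache miss, which bounds the API call rate by the refresh interval rather than by the message rate.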