Frequent calls to DescribeInstances cause SQS backup and late notifications #494

@mechanical-fish

aws-node-termination-handler seems to have the following logic (in pkg/monitor/sqsevent/sqs-monitor.go):

  • Pull five messages from the SQS queue every 2 seconds with a 20-second-per-message visibility timeout
  • Attempt to call the ec2.DescribeInstances API once per message (as part of the retrieveNodeName function). If the SQS queue is consistently full, this is 2.5 calls to DescribeInstances per second per queue.
  • If DescribeInstances hits a rate limit, do not remove the message from the queue; instead leave it there to be retried 20 seconds later

We are now operating at sufficient scale that DescribeInstances often hits API rate limits. I'm not sure those limits are published, so I can't tell whether the 2.5 calls per second from each aws-node-termination-handler are a significant cause of our rate limiting or merely one contributor. But once rate limiting begins, we observe the following failure mode:

  • aws-node-termination-handler will never pull more than 5 messages every 2 seconds.
  • If it hits a rate limit the message stays on the queue to be redelivered 20 seconds later.
  • The queue gets backed up. We consistently observe 40-50 messages in flight on our busiest SQS queues, though never more than 50 (note that 5 messages * 20 seconds / 2 seconds = 50 messages). There is often a backup of hundreds of messages and the wait times are 50 seconds to 300 seconds or more.
  • At this point the majority of messages are handled so late that the instances are gone from EC2. The "good" news is that, when DescribeInstances manages to get through to the API, the message for an already-terminated instance gets deleted from the backed-up queue. The bad news is that it's way too late to warn Kubernetes by that point, so we are experiencing sudden unexpected failure of Kube pods.

I can think of a couple of ways to change this code to help avoid this problem:

  • Stop making so many calls to DescribeInstances. The purpose of these calls is to translate EC2 instance IDs to node names. Instance IDs are said to be unique (and are certainly vanishingly unlikely to be reused in rapid succession) and private DNS names are not mutable, so this mapping cannot change and can be cached -- we can periodically call DescribeInstances for all instances and cache the results.
  • Make the number of SQS messages retrieved per 2-second cycle into a configurable parameter that can be tuned along with the number of workers. I'm not even sure why we fetch so few messages each time, except as a means of throttling the rate of DescribeInstances API calls, which could be managed differently as suggested above.

Labels: Priority: Low, Type: Enhancement, stale