Describe the bug
NTH in Queue Processor mode isn't able to respond to Instance State Change Events correctly.
After following the setup described for Queue Processor mode, with EventBridge rules for `aws.ec2` "EC2 Instance State-change Notification" and `aws.autoscaling` "EC2 Instance-terminate Lifecycle Action", my understanding is that the following scenarios should happen:
1. When a node is terminated via the EC2 console, the EventBridge rule should fire and add a message to the SQS queue, triggering the node being drained before it is terminated.
2. When a node is terminated via a scale-in event (through ASG autoscaling or the `aws autoscaling terminate-instance-in-auto-scaling-group` CLI command), the EventBridge rule should fire and add a message to the SQS queue, likewise triggering the node being drained before it is terminated.
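For reference, the two EventBridge rules above match on event patterns along these lines (abridged sketches; the actual rule names and SQS targets are deployment-specific):

```
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"]
}
```

```
{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance-terminate Lifecycle Action"]
}
```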
In practice, only scenario 2 works as expected. In scenario 1, I'm seeing that the drain is scheduled and begins, but the node can (and generally does) terminate before it has been drained successfully.
With scenario 2, my understanding is that because the terminate action is triggered via the Auto Scaling API, the lifecycle hook kicks in, which prevents the node from terminating until either the hook times out or NTH completes the lifecycle action. This grace period (assuming pods are able to evict within it) means the node is only terminated once it has been drained.
However, as the lifecycle hook is only triggered by termination requests made via the Auto Scaling API, the same behaviour is not seen when terminating through the EC2 console. In that case, nodes are frequently terminated before they have been fully drained.
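The asymmetry between the two event types can be illustrated with a toy sketch (this is not NTH's actual implementation; `classify_event` is a hypothetical helper, and the payloads are abridged from the documented EventBridge event shapes):

```python
import json

def classify_event(message_body: str):
    """Classify an EventBridge event delivered to the NTH SQS queue.

    Returns (instance_id, can_delay_termination). Only the autoscaling
    lifecycle event pauses the instance in Terminating:Wait, so only it
    gives NTH a way to hold the instance until the drain completes.
    """
    event = json.loads(message_body)
    if event["source"] == "aws.ec2":
        # State-change notification: purely informational; EC2 keeps
        # shutting the instance down regardless of what NTH does.
        return event["detail"]["instance-id"], False
    if event["source"] == "aws.autoscaling":
        # Lifecycle action: the instance waits until
        # CompleteLifecycleAction is called or the hook times out.
        return event["detail"]["EC2InstanceId"], True
    raise ValueError(f"unexpected source: {event['source']}")

# Abridged example payloads.
state_change = json.dumps({
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0",
               "state": "shutting-down"},
})
lifecycle = json.dumps({
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {"EC2InstanceId": "i-0123456789abcdef0",
               "LifecycleHookName": "my-termination-hook",
               "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING"},
})

print(classify_event(state_change))  # → ('i-0123456789abcdef0', False)
print(classify_event(lifecycle))     # → ('i-0123456789abcdef0', True)
```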
Steps to reproduce
1. Deploy and configure NTH in queue processor mode with EventBridge rules to monitor autoscaling events and instance state change events.
2. Terminate a node through the EC2 console and monitor nodes and pods being terminated.
3. Scale down a node via autoscaling (or the AWS CLI) and monitor nodes and pods being terminated.
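The two termination paths can be exercised from the AWS CLI roughly as follows (the instance ID is a placeholder):

```
# Scenario 1: plain EC2 termination -- no lifecycle hook is involved
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

# Scenario 2: termination via the Auto Scaling API -- the terminate
# lifecycle hook fires and holds the instance while it drains
aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id i-0123456789abcdef0 \
    --should-decrement-desired-capacity
```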
Expected outcome
In both scenarios, the node should finish draining before it is terminated.
Environment
- NTH App Version: 1.20.0
- NTH Mode (IMDS/Queue processor): Queue processor
- OS/Arch: bottlerocket
- Kubernetes version: 1.24
- Installation method: helm