Skip to content

Conversation

trutx
Copy link
Contributor

@trutx trutx commented Oct 18, 2022

Issue #704

Description of changes:

When working in Queue processor mode all interruption events have kind SQS_TERMINATE. This is wrong because this value reflects the operation mode or the kind of monitor handling the event and not the actual event kind. Besides SQS_TERMINATE is a pretty inaccurate and confusing term. However, when working in IMDS mode each event Kind field contains the real AWS event kind.

In this PR I'm fixing this inconsistency by making every InterruptionEvent.Kind field show the actual event kind and not the monitor kind that handled it. I'm also adding a new Monitor field that will hold the monitor kind for each event. An AWS rebalance recommendation event (or any other) is the same event regardless of where NTH gets it from (IMDS or SQS), so for the sake of consistency now all events in both operation modes will have the same Kind value and event kinds in SQS mode won't be masked behind a vague SQS_TERMINATE term.

Sample log lines (anonymized):

Before

2022/10/14 08:14:27 INF Adding new event to the event store event={"AutoScalingGroupName":"my-asg","Description":"Rebalance recommendation event received. Instance i-aaaaaaaaaaaaaaaaaa will be cordoned at 2022-10-14 08:14:26 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"rebalance-recommendation-event-65646564353331632d306265352d396161352d623aaaaaaaa366664313162636632613233","InProgress":false,"InstanceID":"i-aaaaaaaaaaaaaaaaa","IsManaged":true,"Kind":"SQS_TERMINATE","NodeLabels":null,"NodeName":"ip-1-2-3-4.us-east-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-east-2c/i-aaaaaaaaaaaaaaaaa","StartTime":"2022-10-14T08:14:26Z","State":""}

After

2022/10/18 04:07:57 INF Adding new event to the event store event={"AutoScalingGroupName":"my-asg","Description":"Rebalance recommendation event received. Instance i-xxxxxxxxxxxxxxxxx will be cordoned at 2022-10-18 04:07:56 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"rebalance-recommendation-event-62386430343663302d323831352d633731312d646232312d373938316562623935633466","InProgress":false,"InstanceID":"i-xxxxxxxxxxxxxxxxx","IsManaged":true,"Kind":"REBALANCE_RECOMMENDATION","Monitor":"SQS_MONITOR","NodeLabels":null,"NodeName":"ip-1-2-3-4.us-east-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-east-2c/i-xxxxxxxxxxxxxxxxx","StartTime":"2022-10-18T04:07:56Z","State":""}

This PR doesn't change the basic NTH behavior but it does change its output and this might break monitoring systems based on such output. I.e. systems watching SQSTermination Kubernetes events or parsing log lines in search for "Kind":"SQS_TERMINATE" will have to be updated.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@trutx trutx requested a review from a team as a code owner October 18, 2022 09:28
@snay2
Copy link
Contributor

snay2 commented Nov 1, 2022

I like the direction of these changes, as they give cluster operators a clearer picture of what NTH is actually doing. However, we will have to be clear about what changes will be made: as you pointed out, customers with monitoring systems configured to look for specific strings would have to update their monitoring after upgrading.

It may be possible to make the logging changes backwards compatible by leaving the existing event_type and kind fields intact and adding new fields for the more specific event types and monitor types. However, the k8s events only have a single "reason" field, so it's not possible to make it entirely backwards compatible, even if we include both the old and the new strings in that field. For these reasons, I'm leaning toward making this change but calling out the differences explicitly as part of a minor version bump.

Potential description of the logging changes to be included in the release notes:


This release modifies several Debug, Info, and Warn logs, as well as Kuberentes events emitted by NTH. These changes primarily affect NTH when running in Queue Processor mode. These changes improve your observability about what NTH is doing when responding to events via SQS. If your monitoring system is configured to look for any of the specific strings in the tables below, you may need to modify your configuration to use the updated strings instead.

Changes to logs when starting up

  1. Remove event_type field from the Info log when starting a monitor; replace with monitor_type field, with new values. See Table 1.
  2. Remove event_type field from the Warn log when a monitor fails to start; replace with monitor_type field, with new values. See Table 1.

Changes to logs when processing an event

  1. New monitor field in the Info log. See Table 1.
  2. Potentially change value of kind field in the Info log, if running Queue Processor mode. See Table 2.
  3. Potentially change the "reason" field in the k8s event if running Queue Processor mode. See Table 3.

Changes to logs when receiving an SQS message

  1. Include the specific event type instead of SQS_TERMINATE in the Debug log if running Queue Processor mode. See Table 2.

Tables of changed values

Table 1: Monitor types

Previous New
REBALANCE_RECOMMENDATION REBALANCE_RECOMMENDATION_MONITOR
SCHEDULED_EVENT SCHEDULED_EVENT_MONITOR
SPOT_ITN SPOT_ITN_MONITOR
SQS_TERMINATE SQS_MONITOR

Table 2: Event types

Previous New
REBALANCE_RECOMMENDATION REBALANCE_RECOMMENDATION
SCHEDULED_EVENT SCHEDULED_EVENT
SPOT_ITN SPOT_ITN
SQS_TERMINATE REBALANCE_RECOMMENDATION SCHEDULED_EVENT SPOT_ITN STATE_CHANGE ASG_LIFECYCLE

Table 3: Event reasons

Previous reason New reason
RebalanceRecommendation RebalanceRecommendation
ScheduledEvent ScheduledEvent
SpotInterruption SpotInterruption
SQSTermination RebalanceRecommendation ScheduledEvent SpotInterruption StateChange ASGLifecycle

@trutx
Copy link
Contributor Author

trutx commented Nov 2, 2022

Yes, like you said there is no way around a breaking change, so given we're doing it let's just group all breaking changes in one release bump.

The release notes and tables you added look great to me, I would only make the table 2 and 3 titles explicitly clear so there is no confusion about what events we are talking about in each case. Something like Table 2: AWS event types and Table 3: Kubernetes event reasons. For the same reason in the text above I would add an AWS in one sentence, replacing when responding to events via SQS with when responding to AWS events via SQS.

Copy link
Contributor

@snay2 snay2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing with my team, we’ve decided we need to prioritize backwards compatibility as much as possible. We propose the following changes:

  • Keep any existing fields in debug, info, or warn logs the same
  • Where the previous information was incorrect or incomplete, add a new field with an appropriate name to convey the new information.

This way, customers who currently rely on specific values in specific fields of the log messages will continue to receive those, but customers who want the new values can add them into their monitoring configurations. We don’t have any way of knowing which customers are using the existing fields for monitoring, and we also don’t have a reliable way to warn customers about the change before they upgrade, especially if they don’t religiously follow our release notes.

So even though your proposal creates more accurate logs, I’m asking you to change it in the following ways:

  1. Keep the Kind property as-is. That is, it will basically indicate the monitor type, not the event type.
  2. Instead of a new property called Monitor, create a new property called AWSEventDetails (open to suggestions on the name, but the idea is that it has a more detailed, more accurate description of the actual AWS event). This stays the same for the IMDS-based monitors, but the SQS-based monitor will specify the actual AWS event type here.
  3. dAssociated logging changes to reflect this (described in inline comments).

All that said, I think we should do logging right in v2 (currently in alpha) so we don’t have this backwards compatibility debt in the future. @cjerad is working on that and has been part of my discussions internally, so I’ll make sure he’s in the loop.

Thank you and your team for all your hard work on this so far! Sorry about the churn.

"github.com/aws/aws-node-termination-handler/pkg/node"
)

const (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that you centralized the event types here--keep this!

type InterruptionEvent struct {
EventID string
Kind string
Monitor string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this AWSEventDetails (open to suggestions on the name if you disagree). Because the Kind field basically contained the monitor type in the past, let's keep that behavior even though the name is sort of wrong. This new field will contain the accurate description of the actual AWS event (e.g., AWSEventDetails = "SPOT_ITN" and Kind = "SQS_TERMINATE").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, if it's only meant to contain the AWS event type I'd call this field AWSEventType or AWSEventKind, but that's just a suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I like AWSEventType.


// Kind denotes the kind of event that is processed
// Kind denotes the kind of monitor
func (m SQSMonitor) Kind() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm opening to renaming this method to MonitorKind() so that it's more clear internally that we are referring to the monitor type, not the event type. Similar change on the IMDS-based monitors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sounds good. Will do.

Str("monitor", event.Monitor).
Str("node-name", event.NodeName).
Str("instance-id", event.InstanceID).
Str("provider-id", event.ProviderID).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On line 252 when we emit the Kubernetes event, we need to keep the "reason" argument the same as it was before (observability.GetReasonForKind(event.Kind)).

However, let's add another argument at the end, after event.Description that includes the AWSEventDetails. I'd suggest we use the "screaming snake case" names without conversion, but I'm open to doing a conversion like GetReasonForKind() does right now if you disagree.

Copy link
Contributor Author

@trutx trutx Nov 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, but I have to respectfully say I'm against this one. In all other cases the old and wrong field, data, log line or whatever can be kept and coexist with a new and fixed one. In this particular case like you pointed out there's only one reason, so either we use the right or the wrong reason. Kubernetes event reasons are the cornerstone of event-based Kubernetes monitoring, and we're choosing to leave it wrong. I understand backwards compatibility but in this case here we're not talking about an enhancement or a new feature but a fix, and I think not fixing things that are known to be wrong is not the best way to go.

That said, I think the best option is to leave the recorder.Emit() call in 252 as is and instead modify the Description field generation in all monitors. I.e., we would change line 93 in spot-itn-monitor.go to something like

Description:  fmt.Sprintf("Spot ITN received. Instance will be interrupted at %s. AWS event type: %s \n", instanceAction.Time, monitor.SpotITNKind),

or maybe

Description:  fmt.Sprintf("Spot ITN received (%s). Instance will be interrupted at %s \n", monitor.SpotITNKind, instanceAction.Time),

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you also intending to keep the return values of observability.GetReasonForKind() as they were before, such that the only change to the argument values in recorder.Emit() would be this updated Description field? If so, I am in favor of that plan.

I like your first suggestion (AWS event type: ... at the end of the Description).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we are to keep the old and wrong reason there is no other option than reverting the changes in GetReasonForKind() so they return the old values.

log.Info().
Str("event-id", event.EventID).
Str("kind", event.Kind).
Str("monitor", event.Monitor).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call this aws-event-details or perhaps event-type-details, and have it reflect the underlying AWS event. kind will continue to refer to the monitor type as it did in the past, to preserve backwards compatibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just call this aws-event-type or aws-event-kind. IMO adding details here is confusing because the field only contains one single detail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Let's go with aws-event-type.

for _, fn := range monitoringFns {
go func(monitor monitor.Monitor) {
log.Info().Str("event_type", monitor.Kind()).Msg("Started monitoring for events")
log.Info().Str("monitor_type", monitor.Kind()).Msg("Started monitoring for events")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the original field name event_type to preserve backwards compatibility, even though that's technically not correct. I'm open to adding another field named monitor_type as well, to make the meaning more clear, but we need to keep the existing field name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you want to keep the event_type field but change its value? Old values here were consts like RebalanceRecommendationKind = "REBALANCE_RECOMMENDATION" while new ones are consts like RebalanceRecommendationMonitorKind = "REBALANCE_RECOMMENDATION_MONITOR". With that we are keeping the field name but changing its value so we're effectively introducing breaking changes anyway when it comes to people parsing such fields (I guess they are also parsing the field value, not just the mere presence of the field with any value).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I meant that we should keep the original value, too. So event_type should be REBALANCE_RECOMMENDATION, SCHEDULED_EVENT, SPOT_ITN, or SQS_TERMINATE, rather than the newer *_MONITOR values returned by MonitorKind().

@trutx
Copy link
Contributor Author

trutx commented Nov 9, 2022

I have started working on the requested changes but after a while I realized they ultimately mean undoing the entire PR. One change leads to another and in the end everything stays almost the same. If we're to keep backwards compatibility and keep all log messages and event reasons like they are today there's really not much to be done here. Centralizing the event kinds in the monitor package makes no sense now and renaming the Kind() method to MonitorKind() doesn't solve much because it is both used to refer to monitor and event kinds and it also returns a const value that applies both to monitors and events, so by fixing the former we're breaking the latter.

The only new thing is the addition of the AWSEventType field and appending it to the event.Description, and even this could be a breaking change if people's monitoring/log parsing is expecting the description as it is today. For example, in this spot ITN description

Description:  fmt.Sprintf("Spot ITN received. Instance will be interrupted at %s \n", instanceAction.Time),

they could be reading the last part of the string with something like

descriptionParts := strings.Split(event.Description, " ")
interruptionTime := descriptionParts[len(descriptionParts)-1]

expecting a time which would be used somewhere else.

All in all, I think this PR should be closed. Even if we introduced the small changes above we wouldn't be fixing #704.

As always, thoughts welcome.

@cjerad
Copy link
Contributor

cjerad commented Nov 14, 2022

Hi trutx, thanks for your patience and working with us.

snay2 and I think another path is to introduce the concept of a select-able log format version. "Version 1" would correspond to the existing log format and messages, and a new "version 2" would contain your changes. The log format version would be specified via Helm config, with a default of version 1. This would make the change in format "opt-in".

I have implemented this feature, and can add it to this PR if you enable the Allow edits and access to secrets by maintainers checkbox.

Let me know your thoughts.

@trutx
Copy link
Contributor Author

trutx commented Nov 16, 2022

Hey, sorry about the silence. I can't find the checkbox you mention, maybe some repo settings need to be changed?

In any case I'm switching jobs and will be leaving New Relic this Friday, but I am on holidays so it means I effectively left already, hence the silence. I will lose my perms on the newrelic-forks org and the repo this PR comes from, and I don't find comfortable allowing edits to a PR coming from a repo owned by a company I'm just about to leave.

Feel free to use this PR and the changes in it at will. Probably the best move is that you guys open a PR in the upstream repo with all my changes and the feature you just implemented. Or else I can check if I can create a new PR from another fork, either mine or Spotify's.

@cjerad
Copy link
Contributor

cjerad commented Nov 17, 2022

@trutx Thanks for the update. I've opened a new PR (#725) that incorporates both changes (and includes you as author). So I'll close out this PR.

@cjerad cjerad closed this Nov 17, 2022
@trutx
Copy link
Contributor Author

trutx commented Nov 17, 2022

Sounds great, thanks for doing that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants