✨ Add support for checking Machine conditions in MachineHealthCheck #12827
Conversation
Force-pushed af68d6e to 7cb44c3
Force-pushed ab19424 to c6a7148
/test pull-cluster-api-e2e-main
Resolved review threads (outdated): internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go; test/e2e/data/infrastructure-docker/main/cluster-template-kcp-remediation/mhc.yaml; test/e2e/data/infrastructure-docker/main/clusterclass-quick-start.yaml; test/e2e/data/infrastructure-docker/v1.11/clusterclass-quick-start.yaml
Resolved review thread: internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go
Force-pushed 424114f to 5ee7d25
Resolved review thread (outdated): internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
/test pull-cluster-api-e2e-main
Quick update on the failing main tests:

--- FAIL: TestMachineHealthCheck_Reconcile (441.16s)
--- FAIL: TestHealthCheckTargets (0.00s)

Currently, I am trying to fix this by relaxing the custom matcher to be order-insensitive and subset-based, and by standardizing the combined unhealthy message to include the prefix. If you have any other suggestions, I am open to them, thank you.
I think gomega should already not care about the order if we use the right matcher. I also thought we have a matcher that allows ignoring timestamps for condition comparisons (grep for HaveSameStateOf, maybe it's useful).
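For reference, a minimal self-contained sketch of an order-insensitive condition comparison using plain gomega. This is not the repo's HaveSameStateOf matcher; the package name, test name, helper, and example conditions are made up for illustration only:

```go
package machinehealthcheck_test

import (
	"testing"

	. "github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stripVolatileFields drops fields such as LastTransitionTime so the comparison
// only looks at the state of each condition (type, status, reason, message).
func stripVolatileFields(in []metav1.Condition) []metav1.Condition {
	out := make([]metav1.Condition, 0, len(in))
	for _, c := range in {
		out = append(out, metav1.Condition{Type: c.Type, Status: c.Status, Reason: c.Reason, Message: c.Message})
	}
	return out
}

func TestConditionsRegardlessOfOrder(t *testing.T) {
	g := NewWithT(t)

	actual := []metav1.Condition{
		{Type: "B", Status: metav1.ConditionFalse, Reason: "Bad", LastTransitionTime: metav1.Now()},
		{Type: "A", Status: metav1.ConditionTrue, Reason: "Good", LastTransitionTime: metav1.Now()},
	}
	expected := []metav1.Condition{
		{Type: "A", Status: metav1.ConditionTrue, Reason: "Good"},
		{Type: "B", Status: metav1.ConditionFalse, Reason: "Bad"},
	}

	// ConsistOf is order-insensitive: only the set of expected conditions matters.
	g.Expect(stripVolatileFields(actual)).To(ConsistOf(expected))
}
```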
I have found the root cause of the failures in TestMachineHealthCheck_Reconcile and provided a few suggestions for the computation of the condition (see comments).
Unfortunately, I have also found another issue in the needsRemediation func that probably needs a bigger refactor.
The current logic in needsRemediation sort of assumes that checks were only applied to nodes, so e.g. it returns immediately if the node is not showing up at startup, or if the node has been deleted at a later stage:
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 134 to 183 in f96b742
```go
	if t.Node == nil {
		if timeoutForMachineToHaveNode == disabledNodeStartupTimeout {
			// Startup timeout is disabled so no need to go any further.
			// No node yet to check conditions, can return early here.
			return false, 0
		}
		controlPlaneInitialized := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition)
		clusterInfraReady := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition)
		machineInfraReady := conditions.GetLastTransitionTime(t.Machine, clusterv1.MachineInfrastructureReadyCondition)
		machineCreationTime := t.Machine.CreationTimestamp.Time
		// Use the latest of the following timestamps.
		comparisonTime := machineCreationTime
		logger.V(5).Info("Determining comparison time",
			"machineCreationTime", machineCreationTime,
			"clusterInfraReadyTime", clusterInfraReady,
			"controlPlaneInitializedTime", controlPlaneInitialized,
			"machineInfraReadyTime", machineInfraReady,
		)
		if conditions.IsTrue(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition) && controlPlaneInitialized != nil && controlPlaneInitialized.After(comparisonTime) {
			comparisonTime = controlPlaneInitialized.Time
		}
		if conditions.IsTrue(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition) && clusterInfraReady != nil && clusterInfraReady.After(comparisonTime) {
			comparisonTime = clusterInfraReady.Time
		}
		if conditions.IsTrue(t.Machine, clusterv1.MachineInfrastructureReadyCondition) && machineInfraReady != nil && machineInfraReady.After(comparisonTime) {
			comparisonTime = machineInfraReady.Time
		}
		logger.V(5).Info("Using comparison time", "time", comparisonTime)
		timeoutDuration := timeoutForMachineToHaveNode.Duration
		if comparisonTime.Add(timeoutForMachineToHaveNode.Duration).Before(now) {
			v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeStartupTimeoutV1Beta1Reason, clusterv1.ConditionSeverityWarning, "Node failed to report startup in %s", timeoutDuration)
			logger.V(3).Info("Target is unhealthy: machine has no node", "duration", timeoutDuration)
			conditions.Set(t.Machine, metav1.Condition{
				Type:    clusterv1.MachineHealthCheckSucceededCondition,
				Status:  metav1.ConditionFalse,
				Reason:  clusterv1.MachineHealthCheckNodeStartupTimeoutReason,
				Message: fmt.Sprintf("Health check failed: Node failed to report startup in %s", timeoutDuration),
			})
			return true, time.Duration(0)
		}
		durationUnhealthy := now.Sub(comparisonTime)
		nextCheck := timeoutDuration - durationUnhealthy + time.Second
		return false, nextCheck
	}
```
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 105 to 116 in f96b742
```go
	if t.nodeMissing {
		logger.V(3).Info("Target is unhealthy: node is missing")
		v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeNotFoundV1Beta1Reason, clusterv1.ConditionSeverityWarning, "")
		conditions.Set(t.Machine, metav1.Condition{
			Type:    clusterv1.MachineHealthCheckSucceededCondition,
			Status:  metav1.ConditionFalse,
			Reason:  clusterv1.MachineHealthCheckNodeDeletedReason,
			Message: fmt.Sprintf("Health check failed: Node %s has been deleted", t.Machine.Status.NodeRef.Name),
		})
		return true, time.Duration(0)
	}
```
While this code structure worked well when only node conditions were checked at the end of the func, it does not work well with the addition of the machine condition check at the end of the func.
More specifically, I think we should find a way to always check machine conditions, not only when the node exists / the func doesn't hit the two if branches highlighted above, as in the current implementation.
Additionally, we should make sure that messages from machine conditions and from node conditions are merged in all possible scenarios:
- when node is not showing up at startup
- when the node has been deleted at a later stage
- when the node exists (which is the only scenario covered in the current change set)
I will try to come up with some ideas to solve this problem, but of course suggestions are more than welcome.
Resolved review threads (outdated): 4 on internal/controllers/machinehealthcheck/machinehealthcheck_targets.go; 1 on internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go
/test pull-cluster-api-test-main
@fabriziopandini Thanks for the detailed feedback! You're absolutely right about the inconsistent behavior. I've now refactored it accordingly; the changes I made are reflected in the latest commits.
Let me know what you think about the refactor.
@sbueringer, thanks for another round of feedback on the conversion; hopefully all your suggestions are incorporated now.
/test pull-cluster-api-e2e-main
fabriziopandini left a comment
Thanks @furkatgofurov7 for this iteration!
I'm wondering if we can further simplify the code/improve readability by using two sub functions, one for machineChecks and the other for nodeChecks.
The resulting needsRemediation will look like:
```go
func (t *healthCheckTarget) needsRemediation(logger logr.Logger, timeoutForMachineToHaveNode metav1.Duration) (bool, time.Duration) {
	// checks for HasRemediateMachineAnnotation, ClusterControlPlaneInitializedCondition, ClusterInfrastructureReadyCondition
	...

	// Check machine conditions
	unhealthyMachineMessages, nextMachineCheck := t.machineChecks(logger)

	// Check node conditions
	nodeConditionReason, nodeV1beta1ConditionReason, unhealthyNodeMessages, nextNodeCheck := t.nodeChecks(logger, timeoutForMachineToHaveNode)

	// Combine results and set conditions
	...
}
```
Another benefit of this code structure is that condition management is implemented in only one place.
In case it can help, this is a commit where I experimented a little bit with this idea.
wdyt?
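For what it's worth, here is a small self-contained sketch of what the "Combine results and set conditions" step in the skeleton above could look like. The function name, reason strings, and message separator are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// combineUnhealthyMessages sketches the "Combine results and set conditions" step:
// one reason is picked, but the messages from machine checks and node checks are
// always merged into a single condition message.
func combineUnhealthyMessages(machineMessages, nodeMessages []string) (reason, message string, needsRemediation bool) {
	all := append(append([]string{}, machineMessages...), nodeMessages...)
	if len(all) == 0 {
		return "", "", false
	}
	// Placeholder reasons; a real implementation would pick from the clusterv1
	// reason constants depending on which checks failed.
	reason = "UnhealthyMachine"
	if len(nodeMessages) > 0 {
		reason = "UnhealthyNode"
	}
	return reason, "Health check failed: " + strings.Join(all, "; "), true
}

func main() {
	reason, message, unhealthy := combineUnhealthyMessages(
		[]string{"Condition EtcdPodHealthy on Machine is False"},
		[]string{"Condition Ready on Node is Unknown"},
	)
	fmt.Println(unhealthy, reason, message)
}
```

The main point of the sketch is that, regardless of which checks fired, a single place decides the reason and builds the combined message.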
Resolved review threads (outdated): 2 on internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
/test pull-cluster-api-e2e-main
MachineHealthCheck currently only allows checking Node conditions to validate if a machine is healthy. However, machine conditions capture state that does not exist on nodes; for example, control plane machine conditions such as EtcdPodHealthy and SchedulerPodHealthy can indicate whether a control plane machine has been created correctly. Adding support for Machine conditions enables us to perform remediation during control plane upgrades. This PR introduces a new field as part of MachineHealthCheckChecks: `UnhealthyMachineConditions`. This mirrors the behavior of `UnhealthyNodeConditions`, but the MachineHealthCheck controller will instead check the machine conditions. This reimplements and extends earlier work originally proposed in the previous PR 12275. Co-authored-by: Justin Miron <[email protected]> Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
…iation() method: if both a node condition and a machine condition are unhealthy, pick one reason but combine all the messages. Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Refactors `needsRemediation`; specifically, the following changes were made:
- Move machine condition evaluation to always execute first, regardless of node state
- Ensure machine conditions are checked in ALL scenarios:
  * When the node is missing (t.nodeMissing)
  * When the node hasn't appeared yet (t.Node == nil)
  * When the node exists (t.Node != nil)
- Consistently merge node and machine condition messages in all failure scenarios
- Maintain backward compatibility with existing condition message formats
- Use appropriate condition reasons based on which conditions are unhealthy
Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
…ns: one for machineChecks and the other for nodeChecks. Another benefit of this code structure is that condition management is implemented in only one place. Co-authored-by: Fabrizio Pandini Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
… reasons Signed-off-by: Furkat Gofurov <[email protected]>
Looks like I am hitting #12334 in e2e tests here?
Force-pushed c45f0fe to 4c1004f
/test pull-cluster-api-e2e-main

1 similar comment

/test pull-cluster-api-e2e-main
sbueringer left a comment
Just a few minor findings
Resolved review threads (outdated): test/infrastructure/docker/templates/clusterclass-in-memory.yaml; internal/controllers/machinehealthcheck/machinehealthcheck_targets.go; internal/controllers/machinehealthcheck/machinehealthcheck_targets_test.go
Signed-off-by: Furkat Gofurov <[email protected]>
Thank you very much!!!

/lgtm
LGTM label has been added.

Git tree hash: 4ead6b65f629a47eeda3ee75246b77679a6ddc5a
/test pull-cluster-api-e2e-main
Nice!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
MachineHealthCheck currently only allows checking Node conditions to validate if a machine is healthy. However, machine conditions capture state that does not exist on nodes; for example, control plane machine conditions such as EtcdPodHealthy and SchedulerPodHealthy can indicate whether a control plane machine has been created correctly.
Adding support for Machine conditions enables us to perform remediation during control plane upgrades.
This PR introduces a new field as part of the MachineHealthCheckChecks:
- `UnhealthyMachineConditions`

This will mirror the behavior of `UnhealthyNodeConditions`, but the MachineHealthCheck controller will instead check the machine conditions. This reimplements and extends the work originally proposed by @justinmir in PR #12275.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

Fixes: #5450
Part of #12291
Label(s) to be applied
/kind feature
/area machinehealthcheck
Notes for Reviewers
Additional notes on what has changed in this PR, to give a general idea of what this change is trying to achieve.
MHC related tests:
We updated the tests to validate the new MachineHealthCheck code paths for UnhealthyMachineConditions:
- `internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go`: Added a test case with `UnhealthyMachineConditions` to verify machine condition evaluation
- `internal/controllers/machinehealthcheck/machinehealthcheck_targets_test.go`: Added unit tests verifying machines need remediation based on machine conditions
- Tests where both `UnhealthyNodeConditions` and `UnhealthyMachineConditions` are configured, to ensure they work together correctly

Core Logic Refactor:

Modified `needsRemediation()` in `internal/controllers/machinehealthcheck/machinehealthcheck_targets.go`.

CEL Validation for UnhealthyMachineConditions:

- CEL validation on the `UnhealthyMachineCondition.Type` field restricts the condition types `Ready`, `Available`, `HealthCheckSucceeded`, `OwnerRemediated`, and `ExternallyRemediated`
- Tests were added in `internal/webhooks/test/machinehealthcheck_test.go` to verify CEL validation enforces the restriction.
UnhealthyMachineCondition.TypefieldReady,Available,HealthCheckSucceeded,OwnerRemediated,ExternallyRemediatedinternal/webhooks/test/machinehealthcheck_test.goto verify CEL validation enforces the restriction.