✨ Add support for checking Machine conditions in MachineHealthCheck #12827
Conversation
Force-pushed af68d6e to 7cb44c3
Force-pushed ab19424 to c6a7148
/test pull-cluster-api-e2e-main
Resolved review threads (outdated): internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go; test/e2e/data/infrastructure-docker/main/cluster-template-kcp-remediation/mhc.yaml; test/e2e/data/infrastructure-docker/main/clusterclass-quick-start.yaml; test/e2e/data/infrastructure-docker/v1.11/clusterclass-quick-start.yaml
Resolved review thread: internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go
Force-pushed 424114f to 5ee7d25
Resolved review thread (outdated): internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
/test pull-cluster-api-e2e-main
Quick update on the failing main tests:

--- FAIL: TestMachineHealthCheck_Reconcile (441.16s)
--- FAIL: TestHealthCheckTargets (0.00s)

Currently, I am trying to fix this by relaxing the custom matcher to be order-insensitive and subset-based, and by standardizing the combined unhealthy message to include the prefix. If you have any other suggestions, I am open to them, thank you.
I think gomega should already not care about the order if we use the right matcher. I also thought we have a matcher that allows ignoring timestamps for condition comparisons (grep for HaveSameStateOf, maybe it's useful).
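For reference, a minimal self-contained sketch of an order-insensitive condition comparison using plain gomega. This is not the repo's HaveSameStateOf matcher; the package name, test name, helper, and example conditions are made up for illustration only:

```go
package machinehealthcheck_test

import (
	"testing"

	. "github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stripVolatileFields drops fields such as LastTransitionTime so the comparison
// only looks at the state of each condition (type, status, reason, message).
func stripVolatileFields(in []metav1.Condition) []metav1.Condition {
	out := make([]metav1.Condition, 0, len(in))
	for _, c := range in {
		out = append(out, metav1.Condition{Type: c.Type, Status: c.Status, Reason: c.Reason, Message: c.Message})
	}
	return out
}

func TestConditionsRegardlessOfOrder(t *testing.T) {
	g := NewWithT(t)

	actual := []metav1.Condition{
		{Type: "B", Status: metav1.ConditionFalse, Reason: "Bad", LastTransitionTime: metav1.Now()},
		{Type: "A", Status: metav1.ConditionTrue, Reason: "Good", LastTransitionTime: metav1.Now()},
	}
	expected := []metav1.Condition{
		{Type: "A", Status: metav1.ConditionTrue, Reason: "Good"},
		{Type: "B", Status: metav1.ConditionFalse, Reason: "Bad"},
	}

	// ConsistOf is order-insensitive: only the set of expected conditions matters.
	g.Expect(stripVolatileFields(actual)).To(ConsistOf(expected))
}
```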
I have found the root cause of the failures in TestMachineHealthCheck_Reconcile and provided a few suggestions for the computation of the condition (see comments).
Unfortunately, I have also found another issue in the needsRemediation func that probably needs a bigger refactor.
The current logic in needsRemediation sort of assumes that checks were only applied to nodes, so e.g. it returns immediately if the node is not showing up at startup, or if the node has been deleted at a later stage:
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 134 to 183 in f96b742
```go
	if t.Node == nil {
		if timeoutForMachineToHaveNode == disabledNodeStartupTimeout {
			// Startup timeout is disabled so no need to go any further.
			// No node yet to check conditions, can return early here.
			return false, 0
		}
		controlPlaneInitialized := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition)
		clusterInfraReady := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition)
		machineInfraReady := conditions.GetLastTransitionTime(t.Machine, clusterv1.MachineInfrastructureReadyCondition)
		machineCreationTime := t.Machine.CreationTimestamp.Time
		// Use the latest of the following timestamps.
		comparisonTime := machineCreationTime
		logger.V(5).Info("Determining comparison time",
			"machineCreationTime", machineCreationTime,
			"clusterInfraReadyTime", clusterInfraReady,
			"controlPlaneInitializedTime", controlPlaneInitialized,
			"machineInfraReadyTime", machineInfraReady,
		)
		if conditions.IsTrue(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition) && controlPlaneInitialized != nil && controlPlaneInitialized.After(comparisonTime) {
			comparisonTime = controlPlaneInitialized.Time
		}
		if conditions.IsTrue(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition) && clusterInfraReady != nil && clusterInfraReady.After(comparisonTime) {
			comparisonTime = clusterInfraReady.Time
		}
		if conditions.IsTrue(t.Machine, clusterv1.MachineInfrastructureReadyCondition) && machineInfraReady != nil && machineInfraReady.After(comparisonTime) {
			comparisonTime = machineInfraReady.Time
		}
		logger.V(5).Info("Using comparison time", "time", comparisonTime)
		timeoutDuration := timeoutForMachineToHaveNode.Duration
		if comparisonTime.Add(timeoutForMachineToHaveNode.Duration).Before(now) {
			v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeStartupTimeoutV1Beta1Reason, clusterv1.ConditionSeverityWarning, "Node failed to report startup in %s", timeoutDuration)
			logger.V(3).Info("Target is unhealthy: machine has no node", "duration", timeoutDuration)
			conditions.Set(t.Machine, metav1.Condition{
				Type:    clusterv1.MachineHealthCheckSucceededCondition,
				Status:  metav1.ConditionFalse,
				Reason:  clusterv1.MachineHealthCheckNodeStartupTimeoutReason,
				Message: fmt.Sprintf("Health check failed: Node failed to report startup in %s", timeoutDuration),
			})
			return true, time.Duration(0)
		}
		durationUnhealthy := now.Sub(comparisonTime)
		nextCheck := timeoutDuration - durationUnhealthy + time.Second
		return false, nextCheck
	}
```
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 105 to 116 in f96b742
```go
	if t.nodeMissing {
		logger.V(3).Info("Target is unhealthy: node is missing")
		v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeNotFoundV1Beta1Reason, clusterv1.ConditionSeverityWarning, "")
		conditions.Set(t.Machine, metav1.Condition{
			Type:    clusterv1.MachineHealthCheckSucceededCondition,
			Status:  metav1.ConditionFalse,
			Reason:  clusterv1.MachineHealthCheckNodeDeletedReason,
			Message: fmt.Sprintf("Health check failed: Node %s has been deleted", t.Machine.Status.NodeRef.Name),
		})
		return true, time.Duration(0)
	}
```
While this code structure worked well when only node conditions were checked at the end of the func, it does not work well with the addition of the machine condition check at the end of the func.
More specifically, I think we should find a way to always check machine conditions, not only when the node exists / the func doesn't hit the two if branches highlighted above, as in the current implementation.
Additionally, we should make sure that messages from machine conditions and from node conditions are merged in all possible scenarios:
- when node is not showing up at startup
- when the node has been deleted at a later stage
- when the node exists (which is the only scenario covered in the current change set)
I will try to come up with some ideas to solve this problem, but of course suggestions are more than welcome.
Resolved review threads (outdated): 4 on internal/controllers/machinehealthcheck/machinehealthcheck_targets.go; 1 on internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go
/test pull-cluster-api-test-main
@fabriziopandini Thanks for the detailed feedback! You're absolutely right about the inconsistent behavior. I've now refactored it accordingly; the changes I made are reflected in the latest commits.
Let me know what you think about the refactor.
@sbueringer, thanks for another round of feedback on the conversion; hopefully all your suggestions are incorporated now.
/test pull-cluster-api-e2e-main
fabriziopandini left a comment
Thanks @furkatgofurov7 for this iteration!
I'm wondering if we can further simplify the code/improve readability by using two sub functions, one for machineChecks and the other for nodeChecks.
The resulting needsRemediation will look like:
```go
func (t *healthCheckTarget) needsRemediation(logger logr.Logger, timeoutForMachineToHaveNode metav1.Duration) (bool, time.Duration) {
	// checks for HasRemediateMachineAnnotation, ClusterControlPlaneInitializedCondition, ClusterInfrastructureReadyCondition
	...

	// Check machine conditions
	unhealthyMachineMessages, nextMachineCheck := t.machineChecks(logger)

	// Check node conditions
	nodeConditionReason, nodeV1beta1ConditionReason, unhealthyNodeMessages, nextNodeCheck := t.nodeChecks(logger, timeoutForMachineToHaveNode)

	// Combine results and set conditions
	...
}
```
Another benefit of this code structure is that condition management is implemented in only one place.
In case it can help, this is a commit where I experimented a little bit with this idea.
wdyt?
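For what it's worth, here is a small self-contained sketch of what the "Combine results and set conditions" step in the skeleton above could look like. The function name, reason strings, and message separator are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// combineUnhealthyMessages sketches the "Combine results and set conditions" step:
// one reason is picked, but the messages from machine checks and node checks are
// always merged into a single condition message.
func combineUnhealthyMessages(machineMessages, nodeMessages []string) (reason, message string, needsRemediation bool) {
	all := append(append([]string{}, machineMessages...), nodeMessages...)
	if len(all) == 0 {
		return "", "", false
	}
	// Placeholder reasons; a real implementation would pick from the clusterv1
	// reason constants depending on which checks failed.
	reason = "UnhealthyMachine"
	if len(nodeMessages) > 0 {
		reason = "UnhealthyNode"
	}
	return reason, "Health check failed: " + strings.Join(all, "; "), true
}

func main() {
	reason, message, unhealthy := combineUnhealthyMessages(
		[]string{"Condition EtcdPodHealthy on Machine is False"},
		[]string{"Condition Ready on Node is Unknown"},
	)
	fmt.Println(unhealthy, reason, message)
}
```

The main point of the sketch is that, regardless of which checks fired, a single place decides the reason and builds the combined message.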
Resolved review threads (outdated): 2 on internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
/test pull-cluster-api-e2e-main
MachineHealthCheck currently only allows checking Node conditions to validate if a machine is healthy. However, machine conditions capture state that does not exist on nodes; for example, control plane machine conditions such as EtcdPodHealthy and SchedulerPodHealthy can indicate whether a control plane machine has been created correctly. Adding support for Machine conditions enables us to perform remediation during control plane upgrades. This PR introduces a new field as part of MachineHealthCheckChecks: `UnhealthyMachineConditions`. This mirrors the behavior of `UnhealthyNodeConditions`, but the MachineHealthCheck controller will instead check the machine conditions. This reimplements and extends earlier work originally proposed in the previous PR 12275. Co-authored-by: Justin Miron <[email protected]> Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
…iation() method: if both a node condition and a machine condition are unhealthy, pick one reason but combine all the messages. Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Refactors `needsRemediation`; specifically, the following changes were made:
- Move machine condition evaluation to always execute first, regardless of node state
- Ensure machine conditions are checked in ALL scenarios:
  * When the node is missing (t.nodeMissing)
  * When the node hasn't appeared yet (t.Node == nil)
  * When the node exists (t.Node != nil)
- Consistently merge node and machine condition messages in all failure scenarios
- Maintain backward compatibility with existing condition message formats
- Use appropriate condition reasons based on which conditions are unhealthy
Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
…ns: one for machineChecks and the other for nodeChecks. Another benefit of this code structure is that condition management is implemented in only one place. Co-authored-by: Fabrizio Pandini Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
… reasons Signed-off-by: Furkat Gofurov <[email protected]>
Looks like I am hitting #12334 in e2e tests here?
Force-pushed c45f0fe to 4c1004f
/test pull-cluster-api-e2e-main

1 similar comment

/test pull-cluster-api-e2e-main
sbueringer left a comment
Just a few minor findings
Resolved review threads (outdated): test/infrastructure/docker/templates/clusterclass-in-memory.yaml; internal/controllers/machinehealthcheck/machinehealthcheck_targets.go; internal/controllers/machinehealthcheck/machinehealthcheck_targets_test.go
Signed-off-by: Furkat Gofurov <[email protected]>
Thank you very much!!!

/lgtm
LGTM label has been added.

Git tree hash: 4ead6b65f629a47eeda3ee75246b77679a6ddc5a
/test pull-cluster-api-e2e-main
Nice!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
MachineHealthCheck currently only allows checking Node conditions to validate if a machine is healthy. However, machine conditions capture state that does not exist on nodes; for example, control plane machine conditions such as EtcdPodHealthy and SchedulerPodHealthy can indicate whether a control plane machine has been created correctly.
Adding support for Machine conditions enables us to perform remediation during control plane upgrades.
This PR introduces a new field as part of the MachineHealthCheckChecks:
- `UnhealthyMachineConditions`

This will mirror the behavior of `UnhealthyNodeConditions`, but the MachineHealthCheck controller will instead check the machine conditions. This reimplements and extends the work originally proposed by @justinmir in PR #12275.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

Fixes: #5450
Part of #12291
Label(s) to be applied
/kind feature
/area machinehealthcheck
Notes for Reviewers
Additional notes on what has changed in this PR, to give a general idea of what this change is trying to achieve.
MHC related tests:
We updated the tests to validate the new MachineHealthCheck code paths for UnhealthyMachineConditions:
- `internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go`: Added a test case with `UnhealthyMachineConditions` to verify machine condition evaluation
- `internal/controllers/machinehealthcheck/machinehealthcheck_targets_test.go`: Added unit tests verifying machines need remediation based on machine conditions
- Tests where both `UnhealthyNodeConditions` and `UnhealthyMachineConditions` are configured, to ensure they work together correctly

Core Logic Refactor:

Modified `needsRemediation()` in `internal/controllers/machinehealthcheck/machinehealthcheck_targets.go`.

CEL Validation for UnhealthyMachineConditions:

- CEL validation on the `UnhealthyMachineCondition.Type` field restricts the condition types `Ready`, `Available`, `HealthCheckSucceeded`, `OwnerRemediated`, and `ExternallyRemediated`
- Tests were added in `internal/webhooks/test/machinehealthcheck_test.go` to verify CEL validation enforces the restriction.
UnhealthyMachineCondition.TypefieldReady,Available,HealthCheckSucceeded,OwnerRemediated,ExternallyRemediatedinternal/webhooks/test/machinehealthcheck_test.goto verify CEL validation enforces the restriction.