
Conversation

@furkatgofurov7 (Member) commented Oct 7, 2025

What this PR does / why we need it:
MachineHealthCheck currently only allows checking Node conditions to validate whether a machine is healthy. However, Machine conditions capture state that does not exist on Nodes, for example control plane machine conditions such as EtcdPodHealthy and SchedulerPodHealthy, which can indicate whether a control plane machine has been created correctly.

Adding support for Machine conditions enables us to perform remediation during control plane upgrades.

This PR introduces a new field as part of MachineHealthCheckChecks:

  • UnhealthyMachineConditions

It mirrors the behavior of UnhealthyNodeConditions, but the MachineHealthCheck controller checks Machine conditions instead.
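
For illustration only, a rough Go sketch of how the new field sits next to the existing one; the type and field names below just mirror the UnhealthyNodeConditions pattern described above and are simplified assumptions, not copied from the actual API change:

package example // illustrative package, not the real CAPI API package

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MachineHealthCheckChecks sketch: the new UnhealthyMachineConditions slice sits next to
// the existing UnhealthyNodeConditions and is evaluated against Machine conditions instead
// of Node conditions.
type MachineHealthCheckChecks struct {
	UnhealthyNodeConditions    []UnhealthyCondition `json:"unhealthyNodeConditions,omitempty"`
	UnhealthyMachineConditions []UnhealthyCondition `json:"unhealthyMachineConditions,omitempty"`
}

// UnhealthyCondition is a simplified stand-in for the real per-condition check type.
type UnhealthyCondition struct {
	Type           string                 `json:"type"`           // e.g. "EtcdPodHealthy" for a Machine condition
	Status         metav1.ConditionStatus `json:"status"`         // the status treated as unhealthy, e.g. "False" or "Unknown"
	TimeoutSeconds int32                  `json:"timeoutSeconds"` // how long the condition must stay in that status before remediation
}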

This reimplements and extends the work originally proposed by @justinmir in PR #12275.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes: #5450
Part of #12291

Label(s) to be applied
/kind feature
/area machinehealthcheck

Notes for Reviewers
Additional notes on what has changed in this PR, to give reviewers a general idea of what this change is trying to achieve.

MHC related tests:
We updated the tests to validate the new MachineHealthCheck code paths for UnhealthyMachineConditions:

  • internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go: Added a test case with UnhealthyMachineConditions to verify machine condition evaluation
  • internal/controllers/machinehealthcheck/machinehealthcheck_targets_test.go: Added unit tests verifying machines need remediation based on machine conditions
  • Added test coverage for scenarios where both UnhealthyNodeConditions and UnhealthyMachineConditions are configured to ensure they work together correctly

Core Logic Refactor:
Modified needsRemediation() in internal/controllers/machinehealthcheck/machinehealthcheck_targets.go to (a short illustrative sketch follows this list):

  • Always evaluate machine conditions first, regardless of node state
  • Ensure machine conditions are checked in ALL scenarios (node missing, startup timeout, node exists)
  • Consistently merge machine and node condition messages across all failure scenarios
  • Maintain backward compatibility with existing condition message formats
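
A minimal sketch, using assumed helper names rather than the actual needsRemediation code, of the evaluation order and message merging described in the list above: machine-condition messages are computed unconditionally and then merged with whatever node-related messages apply.

package example // illustrative only

import "strings"

// evaluateTarget illustrates the refactored flow: machine-condition messages are gathered
// first, regardless of node state, and then merged with node-related messages
// (node deleted, node startup timeout, or unhealthy node conditions) into one message.
func evaluateTarget(machineMessages, nodeMessages []string) (needsRemediation bool, message string) {
	all := append(append([]string{}, machineMessages...), nodeMessages...)
	if len(all) == 0 {
		return false, ""
	}
	// All failure scenarios share the same prefix and separator, e.g.
	// "Health check failed: Node foo has been deleted; Condition EtcdPodHealthy on Machine is reporting status False".
	return true, "Health check failed: " + strings.Join(all, "; ")
}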

CEL Validation for UnhealthyMachineConditions (an illustrative marker sketch follows this list):

  • Added CEL validation rule to UnhealthyMachineCondition.Type field
  • Disallowed condition types: Ready, Available, HealthCheckSucceeded, OwnerRemediated, ExternallyRemediated
  • Added envtest-based integration tests in internal/webhooks/test/machinehealthcheck_test.go to verify CEL validation enforces the restriction.
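
As a hedged illustration of the restriction described above, a kubebuilder CEL marker of roughly this shape could enforce it; the exact rule text, error message, and type definition in the PR may differ.

package example // illustrative only: names and exact wording are assumptions, not the merged code

type UnhealthyMachineCondition struct {
	// Reject the reserved condition types listed above.
	// +kubebuilder:validation:XValidation:rule="!(self in ['Ready', 'Available', 'HealthCheckSucceeded', 'OwnerRemediated', 'ExternallyRemediated'])",message="type must not be a reserved condition type"
	Type string `json:"type"`
}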

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/feature Categorizes issue or PR as related to a new feature. area/machinehealthcheck Issues or PRs related to machinehealthchecks cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 7, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 7, 2025
@furkatgofurov7 furkatgofurov7 force-pushed the unhealthyMachineConditions-check-mhc branch from af68d6e to 7cb44c3 Compare October 7, 2025 21:24
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Oct 7, 2025
@furkatgofurov7 furkatgofurov7 force-pushed the unhealthyMachineConditions-check-mhc branch 2 times, most recently from ab19424 to c6a7148 Compare October 7, 2025 22:08
@furkatgofurov7 (Member, Author)

/test pull-cluster-api-e2e-main

@furkatgofurov7 furkatgofurov7 force-pushed the unhealthyMachineConditions-check-mhc branch 3 times, most recently from 424114f to 5ee7d25 Compare October 13, 2025 18:21
@sbueringer (Member)

/test pull-cluster-api-e2e-main

@furkatgofurov7 (Member, Author) commented Oct 14, 2025

Quick update on failing main tests:

  1. They are failing now; the likely reason is that the condition matcher is too strict (it compares order/length and dynamic fields like ObservedGeneration/LastTransitionTime). Example failure from a local run:
--- FAIL: TestMachineHealthCheck_Reconcile (441.16s)
machinehealthcheck_controller_test.go:232:  
Timed out after 5.001s.  
expected  
    &v1beta2.MachineHealthCheckStatus{Conditions:[]v1.Condition{v1.Condition{Type:"RemediationAllowed", Status:"True", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.October, 14, 0, 40, 34, 0, time.Local), Reason:"RemediationAllowed", Message:""}, v1.Condition{Type:"Paused", Status:"False", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.October, 14, 0, 40, 34, 0, time.Local), Reason:"NotPaused", Message:""}}, ExpectedMachines:(*int32)(0x14000c9498c), CurrentHealthy:(*int32)(0x14000c94990), RemediationsAllowed:(*int32)(0x14000c94994), ObservedGeneration:1, Targets:[]string{"test-mhc-machine-pm82p", "test-mhc-machine-v74dz"}, Deprecated:(*v1beta2.MachineHealthCheckDeprecatedStatus)(0x140008227d0)}  
to match  
    &v1beta2.MachineHealthCheckStatus{Conditions:[]v1.Condition{v1.Condition{Type:"RemediationAllowed", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Reason:"RemediationAllowed", Message:""}, v1.Condition{Type:"Paused", Status:"False", ObservedGeneration:0, LastTransitionTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Reason:"NotPaused", Message:""}}, ExpectedMachines:(*int32)(0x14001fa10ac), CurrentHealthy:(*int32)(0x14001fa10b0), RemediationsAllowed:(*int32)(0x14001fa10b4), ObservedGeneration:1, Targets:[]string{"test-mhc-machine-pm82p", "test-mhc-machine-v74dz"}, Deprecated:(*v1beta2.MachineHealthCheckDeprecatedStatus)(0x14001d9e6d0)}  

machinehealthcheck_controller_test.go:366:  
Timed out after 30.001s.  
expected  
    &v1beta2.MachineHealthCheckStatus{Conditions:[]v1.Condition{v1.Condition{Type:"RemediationAllowed", Status:"True", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.October, 14, 0, 40, 40, 0, time.Local), Reason:"RemediationAllowed", Message:""}, v1.Condition{Type:"Paused", Status:"False", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.October, 14, 0, 40, 40, 0, time.Local), Reason:"NotPaused", Message:""}}, ExpectedMachines:(*int32)(0x14001ebd99c), CurrentHealthy:(*int32)(0x14001ebd9a0), RemediationsAllowed:(*int32)(0x14001ebd9a4), ObservedGeneration:1, Targets:[]string{"test-mhc-machine-9mctz", "test-mhc-machine-jdf2x"}, Deprecated:(*v1beta2.MachineHealthCheckDeprecatedStatus)(0x1400061eb08)}  
to match  
    &v1beta2.MachineHealthCheckStatus{Conditions:[]v1.Condition{v1.Condition{Type:"RemediationAllowed", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Reason:"RemediationAllowed", Message:""}}, ExpectedMachines:(*int32)(0x14001ebc1cc), CurrentHealthy:(*int32)(0x14001ebc1d0), RemediationsAllowed:(*int32)(0x14001ebc1d4), ObservedGeneration:1, Targets:[]string{"test-mhc-machine-9mctz", "test-mhc-machine-jdf2x"}, Deprecated:(*v1beta2.MachineHealthCheckDeprecatedStatus)(0x1400061e240)}
  2. Aggregated unhealthy messages didn’t consistently include the “Health check failed:” prefix:
--- FAIL: TestHealthCheckTargets (0.00s)
machinehealthcheck_targets_test.go:636:  
Expected  
<[]v1.Condition | len:1, cap:1>:  
[
  {
    Type: "HealthCheckSucceeded",
    Status: "False",
    ObservedGeneration: 0,
    LastTransitionTime: { Time: 0001-01-01T00:00:00Z },
    Reason: "UnhealthyNode",
    Message: "Condition Ready on Node is reporting status Unknown for more than 5m0s",
  },
]  

to contain elements  
<[]v1.Condition | len:1, cap:1>:  
[
  {
    Type: "HealthCheckSucceeded",
    Status: "False",
    ObservedGeneration: 0,
    LastTransitionTime: { Time: 0001-01-01T00:00:00Z },
    Reason: "UnhealthyNode",
    Message: "Health check failed: Condition Ready on Node is reporting status Unknown for more than 5m0s",
  },
]  

the missing elements were  
<[]v1.Condition | len:1, cap:1>:  
[
  {
    Type: "HealthCheckSucceeded",
    Status: "False",
    ObservedGeneration: 0,
    LastTransitionTime: { Time: 0001-01-01T00:00:00Z },
    Reason: "UnhealthyNode",
    Message: "Health check failed: Condition Ready on Node is reporting status Unknown for more than 5m0s",
  },
]

Currently, I am trying to fix this by relaxing the custom matcher to be order-insensitive and subset-based, and by standardizing the combined unhealthy message to include the prefix. However, if you have any other suggestions, I am open to them, thank you.

@sbueringer (Member) commented Oct 14, 2025

Currently, I am trying to fix this by relaxing the custom matcher to be order-insensitive and subset-based, and by standardizing the combined unhealthy message to include the prefix. However, if you have any other suggestions, I am open to them, thank you.

I think gomega should already not care about the order if we use the right matcher

I also thought we have some matcher that allows ignoring timestamps for condition comparisons (grep for HaveSameStateOf, maybe it's useful)
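
For reference, a small self-contained sketch of the order-insensitive Gomega matchers this refers to; the HaveSameStateOf helper mentioned above only appears in a comment here, since its exact package and signature are not shown in this thread.

package example_test

import (
	"testing"

	. "github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TestOrderInsensitiveConditions shows that ConsistOf matches the full set in any order
// and ContainElements matches a subset in any order. Dynamic fields such as
// LastTransitionTime/ObservedGeneration still need to be zeroed out or compared with a
// state-aware matcher (e.g. the HaveSameStateOf helper mentioned above) before an
// Equal-based comparison succeeds.
func TestOrderInsensitiveConditions(t *testing.T) {
	g := NewWithT(t)

	got := []metav1.Condition{
		{Type: "Paused", Status: metav1.ConditionFalse, Reason: "NotPaused"},
		{Type: "RemediationAllowed", Status: metav1.ConditionTrue, Reason: "RemediationAllowed"},
	}

	// Full set, any order.
	g.Expect(got).To(ConsistOf(
		metav1.Condition{Type: "RemediationAllowed", Status: metav1.ConditionTrue, Reason: "RemediationAllowed"},
		metav1.Condition{Type: "Paused", Status: metav1.ConditionFalse, Reason: "NotPaused"},
	))

	// Subset, any order.
	g.Expect(got).To(ContainElements(
		metav1.Condition{Type: "RemediationAllowed", Status: metav1.ConditionTrue, Reason: "RemediationAllowed"},
	))
}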

@fabriziopandini (Member) left a comment

I have found the root cause of the failures in TestMachineHealthCheck_Reconcile and provided a few suggestions for the computation of the condition (see comments).

Unfortunately I have also found another issue in the needsRemediation func, which probably needs a bigger refactor.

The current logic in needsRemediation assumes that checks are only applied to nodes, so e.g. it returns immediately if the node does not show up at startup, or if the node has been deleted at a later stage:

if t.Node == nil {
	if timeoutForMachineToHaveNode == disabledNodeStartupTimeout {
		// Startup timeout is disabled so no need to go any further.
		// No node yet to check conditions, can return early here.
		return false, 0
	}

	controlPlaneInitialized := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition)
	clusterInfraReady := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition)
	machineInfraReady := conditions.GetLastTransitionTime(t.Machine, clusterv1.MachineInfrastructureReadyCondition)
	machineCreationTime := t.Machine.CreationTimestamp.Time

	// Use the latest of the following timestamps.
	comparisonTime := machineCreationTime
	logger.V(5).Info("Determining comparison time",
		"machineCreationTime", machineCreationTime,
		"clusterInfraReadyTime", clusterInfraReady,
		"controlPlaneInitializedTime", controlPlaneInitialized,
		"machineInfraReadyTime", machineInfraReady,
	)
	if conditions.IsTrue(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition) && controlPlaneInitialized != nil && controlPlaneInitialized.After(comparisonTime) {
		comparisonTime = controlPlaneInitialized.Time
	}
	if conditions.IsTrue(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition) && clusterInfraReady != nil && clusterInfraReady.After(comparisonTime) {
		comparisonTime = clusterInfraReady.Time
	}
	if conditions.IsTrue(t.Machine, clusterv1.MachineInfrastructureReadyCondition) && machineInfraReady != nil && machineInfraReady.After(comparisonTime) {
		comparisonTime = machineInfraReady.Time
	}
	logger.V(5).Info("Using comparison time", "time", comparisonTime)

	timeoutDuration := timeoutForMachineToHaveNode.Duration
	if comparisonTime.Add(timeoutForMachineToHaveNode.Duration).Before(now) {
		v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeStartupTimeoutV1Beta1Reason, clusterv1.ConditionSeverityWarning, "Node failed to report startup in %s", timeoutDuration)
		logger.V(3).Info("Target is unhealthy: machine has no node", "duration", timeoutDuration)

		conditions.Set(t.Machine, metav1.Condition{
			Type:    clusterv1.MachineHealthCheckSucceededCondition,
			Status:  metav1.ConditionFalse,
			Reason:  clusterv1.MachineHealthCheckNodeStartupTimeoutReason,
			Message: fmt.Sprintf("Health check failed: Node failed to report startup in %s", timeoutDuration),
		})
		return true, time.Duration(0)
	}

	durationUnhealthy := now.Sub(comparisonTime)
	nextCheck := timeoutDuration - durationUnhealthy + time.Second

	return false, nextCheck
}

if t.nodeMissing {
	logger.V(3).Info("Target is unhealthy: node is missing")
	v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeNotFoundV1Beta1Reason, clusterv1.ConditionSeverityWarning, "")

	conditions.Set(t.Machine, metav1.Condition{
		Type:    clusterv1.MachineHealthCheckSucceededCondition,
		Status:  metav1.ConditionFalse,
		Reason:  clusterv1.MachineHealthCheckNodeDeletedReason,
		Message: fmt.Sprintf("Health check failed: Node %s has been deleted", t.Machine.Status.NodeRef.Name),
	})
	return true, time.Duration(0)
}

While this code structure worked well when only node conditions were checked at the end of the func, it does not work well now that the check for machine conditions has been added there.

More specifically, I think we should find a way to always check machine conditions, not only when the node exists / the func doesn't hit the two if branches highlighted above, as in the current implementation.

Additionally, we should make sure that messages from machine conditions and from node conditions are merged in all possible scenarios:

  • when node is not showing up at startup
  • when the node has been deleted at a later stage
  • when the node exists (which is the only scenario covered in the current change set)

I will try to come up with some ideas to solve this problem, but of course suggestions are more than welcome.

@furkatgofurov7 (Member, Author)

/test pull-cluster-api-test-main

@furkatgofurov7 (Member, Author) commented Oct 17, 2025

@fabriziopandini Thanks for the detailed feedback! You're absolutely right about the inconsistent behavior. I've now refactored the needsRemediation function to address all the concerns you raised.

Changes I made:

  1. Always evaluate Machine conditions first
    Machine conditions are now evaluated before any node-related logic, to ensure they're checked in ALL scenarios:
  • When the node is missing (t.nodeMissing)
  • When the node hasn't appeared yet (t.Node == nil)
  • When the node exists (t.Node != nil)
  2. Consistent message merging in ALL scenarios
    Error messages from both machine and node conditions are now properly merged in every failure scenario:
  • Node missing: "Node X has been deleted; Condition Y on Machine is reporting status Z"
  • Node startup timeout: "Node failed to report startup in 5m; Condition Y on Machine is reporting status Z"
  • Node exists: "Condition A on Node is reporting status B; Condition Y on Machine is reporting status Z"
  3. No more early returns bypassing machine evaluation
    The problematic early return paths now occur after machine conditions are evaluated, not before.

Let me know what you think about the refactor.

@furkatgofurov7 (Member, Author)

@sbueringer, thanks for another round of feedback on the conversion; all your suggestions should hopefully be incorporated now.

@furkatgofurov7 (Member, Author)

/test pull-cluster-api-e2e-main

@fabriziopandini (Member) left a comment

Thanks @furkatgofurov7 for this iteration!

I'm wondering if we can further simplify the code/improve readability by using two sub-functions, one for machineChecks and the other for nodeChecks.

The resulting needsRemediation would look like:

func (t *healthCheckTarget) needsRemediation(logger logr.Logger, timeoutForMachineToHaveNode metav1.Duration) (bool, time.Duration) {
	// checks for HasRemediateMachineAnnotation, ClusterControlPlaneInitializedCondition, ClusterInfrastructureReadyCondition
        ...

	// Check machine conditions
	unhealthyMachineMessages, nextMachineCheck := t.machineChecks(logger)

	// Check node conditions
	nodeConditionReason, nodeV1beta1ConditionReason, unhealthyNodeMessages, nextNodeCheck := t.nodeChecks(logger, timeoutForMachineToHaveNode)

	// Combine results and set conditions
	...
}

Another benefit of this code structure is that condition management is implemented in only one place.

In case it can help, this is a commit where I experimented a little bit with this idea.

wdyt?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 27, 2025
@furkatgofurov7 (Member, Author)

/test pull-cluster-api-e2e-main

furkatgofurov7 and others added 10 commits October 27, 2025 14:48
MachineHealthCheck currently only allows checking Node conditions to
validate if a machine is healthy. However, machine conditions capture
conditions that do not exist on nodes, for example, control plane node
conditions such as EtcdPodHealthy, SchedulerPodHealthy that can indicate
if a controlplane machine has been created correctly.

Adding support for Machine conditions enables us to perform remediation
during control plane upgrades.

This PR introduces a new field as part of the MachineHealthCheckChecks:
  - `UnhealthyMachineConditions`

This will mirror the behavior of `UnhealthyNodeConditions` but the
MachineHealthCheck controller will instead check the machine conditions.

This reimplements and extends earlier work originally proposed in a previous PR 12275.

Co-authored-by: Justin Miron <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
…iation() method

If both a node condition and machine condition are unhealthy, pick one reason but
combine all the messages

Signed-off-by: Furkat Gofurov <[email protected]>
Refactors `needsRemediation`, specifically following changes were made:
- Move machine condition evaluation to always execute first, regardless of node state
- Ensure machine conditions are checked in ALL scenarios:
  * When node is missing (t.nodeMissing)
  * When node hasn't appeared yet (t.Node == nil)
  * When node exists (t.Node != nil)
- Consistently merge node and machine condition messages in all failure scenarios
- Maintain backward compatibility with existing condition message formats
- Use appropriate condition reasons based on which conditions are unhealthy

Signed-off-by: Furkat Gofurov <[email protected]>
…ns: one for machineChecks and the other for nodeChecks.

Another benefit of this code struct, is that condition management is implemented only in one place.

Co-authored-by: Fabrizio Pandini
Signed-off-by: Furkat Gofurov <[email protected]>
@furkatgofurov7 (Member, Author) commented Oct 27, 2025

Looks like I am hitting #12334 in e2e tests here?

capi-e2e: [It] When following the Cluster API quick-start with v1beta1 ClusterClass [ClusterClass] Should create a workload cluster [ClusterClass] (3m54s)
{Failed after 13.722s. resourceVersions didn't stay stable The function passed to Consistently failed at /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:59 with:  Detected objects with changed resourceVersion  Object with changed resourceVersion KubeadmControlPlane/quick-start-i8nppc/quick-start-v5eonr-cp-gr9xz:   strings.Join({   	... // 6037 identical bytes   	"Replicas: {}\n        f:version: {}\n    manager: manager\n    oper",   	"ation: Update\n    subresource: status\n    time: \"2025-10-27T11:1", - 	"5:24", + 	"7:17",   	"Z\"\n  name: quick-start-v5eonr-cp-gr9xz\n  namespace: quick-start-",   	"i8nppc\n  ownerReferences:\n  - apiVersion: cluster.x-k8s.io/v1bet",   	"a2\n    blockOwnerDeletion: true\n    controller: true\n    kind: C",   	"luster\n    name: quick-start-v5eonr\n    uid: dc1be33f-0f0d-494a-",   	"a05c-49d99e2e956b\n  resourceVersion: \"2", - 	"6533", + 	"7270",   	"\"\n  uid: 366545c5-f6a2-441e-87ad-fd56f479464e\nspec:\n  kubeadmCon",   	"figSpec:\n    clusterConfiguration:\n      apiServer:\n        cert",   	... // 1156 identical bytes   	"    type: RollingUpdate\n  version: v1.34.0\nstatus:\n  availableRe",   	"plicas: 1\n  conditions:\n  - lastTransitionTime: \"2025-10-27T11:1", - 	"5:24", + 	"7:16",   	"Z\"\n    message: \"\"\n    observedGeneration: 1\n    reason: Availab",   	"le\n    status: \"True\"\n    type: Available\n  - lastTransitionTime",   	... // 206 identical bytes   	" observedGeneration: 1\n    reason: Initialized\n    status: \"True",   	"\"\n    type: Initialized\n  - lastTransitionTime: \"2025-10-27T11:1", - 	"5:24", + 	"7:16",   	"Z\"\n    message: \"\"\n    observedGeneration: 1\n    reason: Healthy",   	"\n    status: \"True\"\n    type: EtcdClusterHealthy\n  - lastTransit",   	... // 691 identical bytes   	" observedGeneration: 1\n    reason: NotScalingUp\n    status: \"Fal",   	"se\"\n    type: ScalingUp\n  - lastTransitionTime: \"2025-10-27T11:1", - 	"5:24", + 	"7:16",   	"Z\"\n    message: \"\"\n    observedGeneration: 1\n    reason: Ready\n ",   	"   status: \"True\"\n    type: MachinesReady\n  - lastTransitionTime",   	... // 815 identical bytes   	"10-27T11:15:08Z\"\n        status: \"True\"\n        type: ControlPla",   	"neComponentsHealthy\n      - lastTransitionTime: \"2025-10-27T11:1", - 	"5:24", + 	"7:16",   	"Z\"\n        status: \"True\"\n        type: EtcdClusterHealthy\n     ",   	" - lastTransitionTime: \"2025-10-27T11:13:56Z\"\n        status: \"T",   	... // 552 identical bytes   }, "")

......

@furkatgofurov7 furkatgofurov7 force-pushed the unhealthyMachineConditions-check-mhc branch from c45f0fe to 4c1004f Compare October 27, 2025 14:32
@furkatgofurov7 (Member, Author)

/test pull-cluster-api-e2e-main

1 similar comment
@furkatgofurov7 (Member, Author)

/test pull-cluster-api-e2e-main

@sbueringer (Member) left a comment

Just a few minor findings

Signed-off-by: Furkat Gofurov <[email protected]>
@sbueringer (Member)

Thank you very much!!!

/lgtm
/assign @fabriziopandini

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 28, 2025
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 4ead6b65f629a47eeda3ee75246b77679a6ddc5a

@sbueringer (Member)

/test pull-cluster-api-e2e-main

@fabriziopandini (Member)

Nice!
/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 28, 2025
@k8s-ci-robot k8s-ci-robot merged commit f068d1c into kubernetes-sigs:main Oct 28, 2025
26 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.12 milestone Oct 28, 2025
@furkatgofurov7 furkatgofurov7 deleted the unhealthyMachineConditions-check-mhc branch October 28, 2025 19:54


Development

Successfully merging this pull request may close these issues.

MHC should provide support for checking Machine conditions

4 participants