
Conversation

@clobrano
Contributor

Add three new test cases to validate etcd cluster recovery from cold boot scenarios reached through different graceful/ungraceful shutdown combinations:

  • Cold boot from double GNS: both nodes gracefully shut down simultaneously, then both restart (full cluster cold boot)
  • Cold boot from sequential GNS: first node gracefully shut down, then second node gracefully shut down, then both restart
  • Cold boot from mixed GNS/UGNS: first node gracefully shut down, surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested because in TNF clusters an ungracefully shut down node is recovered quickly, leaving no window in which to wait and then gracefully shut down the second node. The double UGNS scenario is already covered by existing tests.
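The three shutdown sequences above can be sketched as table-driven test data. This is an illustrative Go sketch only; the scenario names, the `shutdownStep` type, and the node names are hypothetical, not the actual e2e suite (which drives real power actions via Ginkgo):

```go
package main

import "fmt"

// shutdownStep models one node shutdown in a cold-boot scenario.
// concurrent marks a shutdown issued together with the previous step's,
// distinguishing the simultaneous double-GNS case from the sequential one.
type shutdownStep struct {
	node       string
	graceful   bool
	concurrent bool
}

// coldBootScenarios mirrors the three cases described in the PR body.
var coldBootScenarios = map[string][]shutdownStep{
	"double GNS": {
		{node: "master-0", graceful: true},
		{node: "master-1", graceful: true, concurrent: true},
	},
	"sequential GNS": {
		{node: "master-0", graceful: true},
		{node: "master-1", graceful: true},
	},
	"mixed GNS/UGNS": {
		{node: "master-0", graceful: true},
		{node: "master-1", graceful: false}, // surviving node shut down ungracefully
	},
}

func main() {
	for name, steps := range coldBootScenarios {
		fmt.Printf("%s: %d shutdown steps, then both nodes restart\n", name, len(steps))
	}
}
```

Each scenario ends the same way (both nodes restart into a full-cluster cold boot); only the path into the powered-off state differs.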

openshift-ci bot requested review from eggfoobar and qJkee on October 21, 2025 16:04
@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano
Once this PR has been reviewed and has the lgtm label, please assign jeff-roche for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jaypoulz
Contributor

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@jaypoulz: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@jaypoulz
Contributor

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c4ec1d0-ae9a-11f0-95ed-ad8d5e8a115f-0

@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 27, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/df063eb0-b31c-11f0-9588-ce4096893980-0

@clobrano
Contributor Author

Rebasing this to get #30385

/hold

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Oct 27, 2025
clobrano force-pushed the tnf-e2e-cold-boot-from-mixed-gns-ungns branch from c653990 to 9083969 on October 29, 2025 08:38
Add three new test cases to validate etcd cluster recovery from cold
boot scenarios reached through different graceful/ungraceful shutdown
combinations:

- Cold boot from double GNS: both nodes gracefully shut down
  simultaneously, then both restart (full cluster cold boot)
- Cold boot from sequential GNS: first node gracefully shut down, then
  second node gracefully shut down, then both restart
- Cold boot from mixed GNS/UGNS: first node gracefully shut down,
  surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested
because in TNF clusters an ungracefully shut down node is recovered
quickly, leaving no window in which to wait and then gracefully shut
down the second node. The double UGNS scenario is already covered by
existing tests.
clobrano force-pushed the tnf-e2e-cold-boot-from-mixed-gns-ungns branch from 9083969 to b6e1384 on October 29, 2025 08:41
@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 29, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/92aaaa10-b4d4-11f0-8ae3-147c7322b463-0

@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 29, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f41854d0-b503-11f0-9c7a-52e63142ca96-0

Change BeforeEach health checks to skip tests instead of failing them
when the cluster is not in a healthy state at the start of the test.

Previously, the etcd recovery tests would fail if the cluster was not
healthy before the test started. This is problematic because these
tests are designed to validate recovery from intentional disruptions,
not to debug pre-existing cluster issues.

Changes:
- Extract health validation functions to common.go for reusability
- Add skipIfClusterIsNotHealthy() to consolidate all health checks
- Implement internal retry logic in health check functions with timeouts
- Add ensureEtcdHasTwoVotingMembers() to validate membership state
- Skip tests early if cluster is degraded, pods aren't running, or
  members are unhealthy

This ensures tests only run when the cluster is in a known-good state,
reducing false failures due to pre-existing issues while maintaining
test coverage for actual recovery scenarios.
@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 30, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b0181f70-b5be-11f0-9b3d-143ce616f56d-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 30, 2025

@clobrano: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-microshift-serial | f918765 | link | true | /test e2e-aws-ovn-microshift-serial |
| ci/prow/e2e-vsphere-ovn | f918765 | link | true | /test e2e-vsphere-ovn |
| ci/prow/e2e-gcp-ovn-upgrade | f918765 | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | f918765 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-aws-ovn-microshift | f918765 | link | true | /test e2e-aws-ovn-microshift |
| ci/prow/e2e-aws-ovn-fips | f918765 | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-gcp-ovn | f918765 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-aws-ovn-serial-2of2 | f918765 | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/e2e-vsphere-ovn-upi | f918765 | link | true | /test e2e-vsphere-ovn-upi |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@clobrano
Contributor Author

payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 31, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/878d1590-b629-11f0-97f1-f0b4b63689fd-0

@openshift-trt

openshift-trt bot commented Oct 31, 2025

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: f918765

| Job Name | New Test Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial | High - "[sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance]" is a new test that failed 1 time(s) against the current commit |
| pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial | High - "[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]" is a new test that failed 1 time(s) against the current commit |
| pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2 | High - "[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]" is a new test that failed 1 time(s) against the current commit |

New tests seen in this PR at sha: f918765

  • "[sig-api-machinery] Garbage collector should keep the rc around until all its pods are deleted if the deleteOptions says so [Serial] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Garbage collector should not delete dependents that have both valid owner and owner that's waiting for dependents to be deleted [Serial] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Garbage collector should orphan pods created by rc if delete options say so [Serial] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should apply a finalizer to a Namespace [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should apply an update to a Namespace [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should apply changes to a namespace status [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should ensure that all pods are removed when a namespace is deleted [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should ensure that all services are removed when a namespace is deleted [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should patch a Namespace [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] ControllerRevision [Serial] should manage the lifecycle of a ControllerRevision [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should list and delete a collection of DaemonSets [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should retry creating failed daemon pods [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance]" [Total: 2, Pass: 1, Fail: 1, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should run and stop complex daemon [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should run and stop simple daemon [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]" [Total: 2, Pass: 0, Fail: 2, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should verify changes to a daemon set status [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-node] NoExecuteTaintManager Multiple Pods [Serial] evicts pods with minTolerationSeconds [Disruptive] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-node] NoExecuteTaintManager Single Pod [Serial] removing taint cancels eviction [Disruptive] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • (...showing 20 of 29 tests)

clobrano changed the title from "NO JIRA: TNF add etcd cold boot recovery tests from graceful node shutdown" to "OCPEDGE-1788: TNF add etcd cold boot recovery tests from graceful node shutdown" on Oct 31, 2025
@openshift-ci-robot

openshift-ci-robot commented Oct 31, 2025

@clobrano: This pull request references OCPEDGE-1788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Add three new test cases to validate etcd cluster recovery from cold boot scenarios reached through different graceful/ungraceful shutdown combinations:

  • Cold boot from double GNS: both nodes gracefully shut down simultaneously, then both restart (full cluster cold boot)
  • Cold boot from sequential GNS: first node gracefully shut down, then second node gracefully shut down, then both restart
  • Cold boot from mixed GNS/UGNS: first node gracefully shut down, surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested because in TNF clusters, an ungracefully shut down node is quickly recovered, preventing the ability to wait and gracefully shut down the second node later. The double UGNS scenario is already covered by existing tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Oct 31, 2025