
Conversation

@clobrano
Contributor

Add three new test cases to validate etcd cluster recovery from cold boot scenarios reached through different graceful/ungraceful shutdown combinations:

  • Cold boot from double GNS: both nodes gracefully shut down simultaneously, then both restart (full cluster cold boot)
  • Cold boot from sequential GNS: first node gracefully shut down, then second node gracefully shut down, then both restart
  • Cold boot from mixed GNS/UGNS: first node gracefully shut down, surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested because in TNF clusters an ungracefully shut down node is recovered quickly, leaving no window in which to wait and then gracefully shut down the second node. The double UGNS scenario is already covered by existing tests.
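The three shutdown sequences above can be sketched as table-driven test data. This is an illustrative Go sketch only; the scenario names, the `shutdownStep` type, and the node names are hypothetical, not the actual e2e suite (which drives real power actions via Ginkgo):

```go
package main

import "fmt"

// shutdownStep models one node shutdown in a cold-boot scenario.
// concurrent marks a shutdown issued together with the previous step's,
// distinguishing the simultaneous double-GNS case from the sequential one.
type shutdownStep struct {
	node       string
	graceful   bool
	concurrent bool
}

// coldBootScenarios mirrors the three cases described in the PR body.
var coldBootScenarios = map[string][]shutdownStep{
	"double GNS": {
		{node: "master-0", graceful: true},
		{node: "master-1", graceful: true, concurrent: true},
	},
	"sequential GNS": {
		{node: "master-0", graceful: true},
		{node: "master-1", graceful: true},
	},
	"mixed GNS/UGNS": {
		{node: "master-0", graceful: true},
		{node: "master-1", graceful: false}, // surviving node shut down ungracefully
	},
}

func main() {
	for name, steps := range coldBootScenarios {
		fmt.Printf("%s: %d shutdown steps, then both nodes restart\n", name, len(steps))
	}
}
```

Each scenario ends the same way (both nodes restart into a full-cluster cold boot); only the path into the powered-off state differs.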

openshift-ci bot requested review from eggfoobar and qJkee on October 21, 2025 16:04
@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano
Once this PR has been reviewed and has the lgtm label, please assign jeff-roche for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jaypoulz
Contributor

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@jaypoulz: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@jaypoulz
Contributor

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c4ec1d0-ae9a-11f0-95ed-ad8d5e8a115f-0

@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 27, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/df063eb0-b31c-11f0-9588-ce4096893980-0

@clobrano
Contributor Author

Rebasing this to get #30385

/hold

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Oct 27, 2025
clobrano force-pushed the tnf-e2e-cold-boot-from-mixed-gns-ungns branch from c653990 to 9083969 on October 29, 2025 08:38
Add three new test cases to validate etcd cluster recovery from cold
boot scenarios reached through different graceful/ungraceful shutdown
combinations:

- Cold boot from double GNS: both nodes gracefully shut down
  simultaneously, then both restart (full cluster cold boot)
- Cold boot from sequential GNS: first node gracefully shut down, then
  second node gracefully shut down, then both restart
- Cold boot from mixed GNS/UGNS: first node gracefully shut down,
  surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested
because in TNF clusters an ungracefully shut down node is recovered
quickly, leaving no window in which to wait and then gracefully shut
down the second node. The double UGNS scenario is already covered by
existing tests.
clobrano force-pushed the tnf-e2e-cold-boot-from-mixed-gns-ungns branch from 9083969 to b6e1384 on October 29, 2025 08:41
@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 29, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/92aaaa10-b4d4-11f0-8ae3-147c7322b463-0

@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 29, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f41854d0-b503-11f0-9c7a-52e63142ca96-0

Change BeforeEach health checks to skip tests instead of failing them
when the cluster is not in a healthy state at the start of the test.

Previously, the etcd recovery tests would fail if the cluster was not
healthy before the test started. This is problematic because these
tests are designed to validate recovery from intentional disruptions,
not to debug pre-existing cluster issues.

Changes:
- Extract health validation functions to common.go for reusability
- Add skipIfClusterIsNotHealthy() to consolidate all health checks
- Implement internal retry logic in health check functions with timeouts
- Add ensureEtcdHasTwoVotingMembers() to validate membership state
- Skip tests early if cluster is degraded, pods aren't running, or
  members are unhealthy

This ensures tests only run when the cluster is in a known-good state,
reducing false failures due to pre-existing issues while maintaining
test coverage for actual recovery scenarios.
@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 30, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b0181f70-b5be-11f0-9b3d-143ce616f56d-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 30, 2025

@clobrano: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-microshift-serial | f918765 | link | true | /test e2e-aws-ovn-microshift-serial |
| ci/prow/e2e-vsphere-ovn | f918765 | link | true | /test e2e-vsphere-ovn |
| ci/prow/e2e-gcp-ovn-upgrade | f918765 | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | f918765 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-aws-ovn-microshift | f918765 | link | true | /test e2e-aws-ovn-microshift |
| ci/prow/e2e-aws-ovn-fips | f918765 | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-gcp-ovn | f918765 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-aws-ovn-serial-2of2 | f918765 | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/e2e-vsphere-ovn-upi | f918765 | link | true | /test e2e-vsphere-ovn-upi |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@clobrano
Contributor Author

payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Oct 31, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/878d1590-b629-11f0-97f1-f0b4b63689fd-0

@openshift-trt

openshift-trt bot commented Oct 31, 2025

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: f918765

| Job Name | New Test Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial | High - "[sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance]" is a new test that failed 1 time(s) against the current commit |
| pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial | High - "[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]" is a new test that failed 1 time(s) against the current commit |
| pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2 | High - "[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]" is a new test that failed 1 time(s) against the current commit |

New tests seen in this PR at sha: f918765

  • "[sig-api-machinery] Garbage collector should keep the rc around until all its pods are deleted if the deleteOptions says so [Serial] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Garbage collector should not delete dependents that have both valid owner and owner that's waiting for dependents to be deleted [Serial] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Garbage collector should orphan pods created by rc if delete options say so [Serial] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should apply a finalizer to a Namespace [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should apply an update to a Namespace [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should apply changes to a namespace status [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should ensure that all pods are removed when a namespace is deleted [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should ensure that all services are removed when a namespace is deleted [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-api-machinery] Namespaces [Serial] should patch a Namespace [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] ControllerRevision [Serial] should manage the lifecycle of a ControllerRevision [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should list and delete a collection of DaemonSets [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should retry creating failed daemon pods [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance]" [Total: 2, Pass: 1, Fail: 1, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should run and stop complex daemon [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should run and stop simple daemon [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]" [Total: 2, Pass: 0, Fail: 2, Flake: 0]
  • "[sig-apps] Daemon set [Serial] should verify changes to a daemon set status [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-node] NoExecuteTaintManager Multiple Pods [Serial] evicts pods with minTolerationSeconds [Disruptive] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-node] NoExecuteTaintManager Single Pod [Serial] removing taint cancels eviction [Disruptive] [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • "[sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run [Conformance]" [Total: 2, Pass: 2, Fail: 0, Flake: 0]
  • (...showing 20 of 29 tests)

clobrano changed the title from "NO JIRA: TNF add etcd cold boot recovery tests from graceful node shutdown" to "OCPEDGE-1788: TNF add etcd cold boot recovery tests from graceful node shutdown" on Oct 31, 2025
@openshift-ci-robot

openshift-ci-robot commented Oct 31, 2025

@clobrano: This pull request references OCPEDGE-1788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Add three new test cases to validate etcd cluster recovery from cold boot scenarios reached through different graceful/ungraceful shutdown combinations:

  • Cold boot from double GNS: both nodes gracefully shut down simultaneously, then both restart (full cluster cold boot)
  • Cold boot from sequential GNS: first node gracefully shut down, then second node gracefully shut down, then both restart
  • Cold boot from mixed GNS/UGNS: first node gracefully shut down, surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested because in TNF clusters, an ungracefully shut down node is quickly recovered, preventing the ability to wait and gracefully shut down the second node later. The double UGNS scenario is already covered by existing tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Oct 31, 2025