OCPEDGE-1788: TNF add etcd cold boot recovery tests from graceful node shutdown #30404
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery

@jaypoulz: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c4ec1d0-ae9a-11f0-95ed-ad8d5e8a115f-0

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/df063eb0-b31c-11f0-9588-ce4096893980-0
Rebasing this to get #30385

/hold

Force-pushed from c653990 to 9083969
Add three new test cases to validate etcd cluster recovery from cold boot scenarios reached through different graceful/ungraceful shutdown combinations:

- Cold boot from double GNS: both nodes gracefully shut down simultaneously, then both restart (full cluster cold boot)
- Cold boot from sequential GNS: first node gracefully shut down, then second node gracefully shut down, then both restart
- Cold boot from mixed GNS/UGNS: first node gracefully shut down, surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested because in TNF clusters an ungracefully shut down node is recovered quickly, which makes it impossible to wait and then gracefully shut down the second node. The double UGNS scenario is already covered by existing tests.
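For context, a minimal Ginkgo sketch of what the double-GNS case could look like. The helpers gracefulShutdown, powerOn, and checkEtcdTwoVotingMembers, the node names, and the timeouts are hypothetical placeholders, not names confirmed by the diff; they are stubbed here only so the sketch compiles.

```go
package tnf

import (
	"context"
	"time"

	g "github.com/onsi/ginkgo/v2"
	o "github.com/onsi/gomega"
)

var _ = g.Describe("[sig-etcd] TNF cold boot recovery", func() {
	g.It("recovers etcd after a double graceful node shutdown", func(ctx context.Context) {
		nodes := []string{"master-0", "master-1"} // placeholder node names

		// Gracefully shut both nodes down at (roughly) the same time.
		for _, n := range nodes {
			o.Expect(gracefulShutdown(ctx, n)).To(o.Succeed())
		}

		// Power both nodes back on: a full-cluster cold boot.
		for _, n := range nodes {
			o.Expect(powerOn(ctx, n)).To(o.Succeed())
		}

		// Recovery criterion: etcd comes back with two voting members.
		o.Eventually(func() error {
			return checkEtcdTwoVotingMembers(ctx)
		}).WithTimeout(30 * time.Minute).WithPolling(30 * time.Second).Should(o.Succeed())
	})
})

// Hypothetical helpers, stubbed so the sketch compiles; the real tests
// presumably drive node power state through the cluster's BMCs and query
// the etcd member list via the operator or etcdctl.
func gracefulShutdown(ctx context.Context, node string) error { return nil }
func powerOn(ctx context.Context, node string) error          { return nil }
func checkEtcdTwoVotingMembers(ctx context.Context) error     { return nil }
```

The sequential-GNS and mixed GNS/UGNS cases would differ only in the shutdown phase: shutting the nodes down one after another, or swapping the second graceful shutdown for an ungraceful power-off.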
Force-pushed from 9083969 to b6e1384
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/92aaaa10-b4d4-11f0-8ae3-147c7322b463-0

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f41854d0-b503-11f0-9c7a-52e63142ca96-0
Change BeforeEach health checks to skip tests instead of failing them when the cluster is not in a healthy state at the start of the test.

Previously, the etcd recovery tests would fail if the cluster was not healthy before the test started. This is problematic because these tests are designed to validate recovery from intentional disruptions, not to debug pre-existing cluster issues.

Changes:
- Extract health validation functions to common.go for reusability
- Add skipIfClusterIsNotHealthy() to consolidate all health checks
- Implement internal retry logic in health check functions with timeouts
- Add ensureEtcdHasTwoVotingMembers() to validate membership state
- Skip tests early if the cluster is degraded, pods aren't running, or members are unhealthy

This ensures tests only run when the cluster is in a known-good state, reducing false failures due to pre-existing issues while maintaining test coverage for actual recovery scenarios.
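A minimal sketch of the skip-instead-of-fail pattern described above. skipIfClusterIsNotHealthy() and ensureEtcdHasTwoVotingMembers() are named in the commit message, but their signatures and the other two checks are guesses, stubbed here only so the sketch compiles.

```go
package tnf

import (
	"context"
	"fmt"

	g "github.com/onsi/ginkgo/v2"
)

// skipIfClusterIsNotHealthy consolidates the pre-test health checks and
// calls ginkgo.Skip, rather than failing, when the cluster is already in a
// bad state, so the recovery tests only ever start from a known-good state.
func skipIfClusterIsNotHealthy(ctx context.Context) {
	checks := []struct {
		name  string
		check func(context.Context) error
	}{
		{"cluster operators not degraded", ensureClusterOperatorsNotDegraded},
		{"etcd pods running", ensureEtcdPodsRunning},
		{"etcd has two voting members", ensureEtcdHasTwoVotingMembers},
	}
	for _, c := range checks {
		// Each check is assumed to retry internally with its own timeout,
		// as described in the commit message above.
		if err := c.check(ctx); err != nil {
			g.Skip(fmt.Sprintf("cluster not healthy before test (%s): %v", c.name, err))
		}
	}
}

// Hypothetical checks, stubbed so the sketch compiles; the PR implements
// these in common.go with internal retry loops and timeouts.
func ensureClusterOperatorsNotDegraded(ctx context.Context) error { return nil }
func ensureEtcdPodsRunning(ctx context.Context) error             { return nil }
func ensureEtcdHasTwoVotingMembers(ctx context.Context) error     { return nil }
```

Wired into the suite, this would run from a setup node, e.g. g.BeforeEach(func(ctx context.Context) { skipIfClusterIsNotHealthy(ctx) }), so an unhealthy cluster yields a skipped spec rather than a spurious failure.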
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b0181f70-b5be-11f0-9b3d-143ce616f56d-0
@clobrano: The following tests failed, say /retest to rerun all failed tests:

payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

Full PR test history. Your PR dashboard.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/878d1590-b629-11f0-97f1-f0b4b63689fd-0
Risk analysis has seen new tests most likely introduced by this PR.

New Test Risks for sha: f918765
New tests seen in this PR at sha: f918765
@clobrano: This pull request references OCPEDGE-1788, which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.