[CI] Add more ported distributed cases #2082
base: main
0d9b54f to 85fa6f1
Please split the test scope into a CI scope and a nightly full scope.
inputs:
ut_name:
required: true
required: true
required: false
ze = xpu_list[i+1];
} else {
ze = i;
if [ "${{ inputs.ut_name }}" == "xpu_distributed" ];then
Are there any assumptions here? Can we detect the topology directly and dynamically on the test node?
Please consider the scenarios below:
- No Xelink group: return failed
- 1 Xelink group: launch 1 worker
- 2 Xelink groups: launch 2 workers
- ...
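The scenarios above could be sketched roughly as follows. This is only an illustration, not the workflow's actual logic: `plan_workers` and its `xelink_groups` argument are hypothetical names, and the sketch assumes the groups have already been detected on the node by some other means and are passed in as lists of device indices. The `gw0`, `gw1`, ... keys follow the worker naming used by pytest-xdist.

```python
from typing import Dict, List


def plan_workers(xelink_groups: List[List[int]]) -> Dict[str, str]:
    """Assign one worker to each detected Xelink group.

    `xelink_groups` is a hypothetical input: one list of device indices
    per group, produced by whatever topology detection runs on the node.
    Returns a mapping from worker name to a comma-separated device list.
    """
    # Ignore empty groups so they never get a worker of their own.
    groups = [group for group in xelink_groups if group]
    if not groups:
        # No Xelink group: fail loudly rather than fall back to a guess.
        raise RuntimeError("no Xelink group detected on this node")
    # One worker per group: 1 group -> 1 worker, 2 groups -> 2 workers, ...
    return {
        f"gw{index}": ",".join(str(device) for device in sorted(group))
        for index, group in enumerate(groups)
    }
```

With this shape, the number of workers to launch is simply `len(plan_workers(groups))`, so nothing about the node layout needs to be hard-coded in the workflow.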
.github/workflows/_linux_ut.yml
Outdated
runner:
runs-on: ${{ inputs.runner }}
name: get-runner
name: get-runner
Why do we have this change?
bd63535 to 61d9eef
61d9eef to 7d62aaa
f8b4450 to 63799f6
63799f6 to 1bcbce2
This PR first adds the cases for CI; the nightly test will be enabled in another PR.
This PR adds more ported distributed cases to the torch-xpu-ops CI and adds pytest-xdist for the distributed UT.
With 2 work groups, the distributed UT time will increase to 1h22min.
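As a rough illustration of how a pytest-xdist worker might pick its own work group, the sketch below reads the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets in each worker process (`gw0`, `gw1`, ...). The helper name `affinity_for_worker` and the `masks` argument are hypothetical and not taken from this PR; exporting the result as `ZE_AFFINITY_MASK` is likewise an assumption about how devices would be restricted per worker.

```python
import os
import re
from typing import Mapping, Sequence


def affinity_for_worker(
    masks: Sequence[str], env: Mapping[str, str] = os.environ
) -> str:
    """Return the device mask for the current pytest-xdist worker.

    `masks` is a hypothetical per-group list such as ["0,1,2,3", "4,5,6,7"].
    Outside of xdist the variable is unset, so the first group is used.
    """
    worker = env.get("PYTEST_XDIST_WORKER", "gw0")
    match = re.fullmatch(r"gw(\d+)", worker)
    if match is None:
        raise ValueError(f"unexpected worker id: {worker!r}")
    index = int(match.group(1))
    if index >= len(masks):
        # More workers than groups would force two workers to share devices.
        raise ValueError(f"worker {worker} has no work group assigned")
    return masks[index]
```

A `conftest.py` could, for example, set `os.environ["ZE_AFFINITY_MASK"] = affinity_for_worker(masks)` before the device runtime is initialized, with the run started as `pytest -n 2` so that the worker count matches the 2 work groups.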
disable_e2e
disable_ut