-
Notifications
You must be signed in to change notification settings - Fork 315
[GB200][CLI] Fix ultraserver instance support #6944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Fix describe_capacity_block_status api call, filter cbr-id by cr-id - Add UltraserverCapacityBlockSizeValidator to validate capacity block sizes against allowed values - Move validation logic from cdk_builder_utils to dedicated validator in ec2_validators - Add caching mechanism for describe_capacity_block_status API calls in EC2 client - Update process_ultraserver_capacity_block_sizes to focus on processing - Fix test cases
cli/src/pcluster/aws/ec2.py
Outdated
- TotalUnavailableCapacity: Number of unavailable instances | ||
""" | ||
kwargs = {} | ||
# If no specific capacity_block_ids are requested, use cache key based on filters and max_results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Code Style] We can move this comment to the pydoc of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Done
|
||
@AWSExceptionHandler.handle_client_exception | ||
def describe_capacity_block_status( | ||
def describe_capacity_block_status( # noqa: C901 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Code Style][Not Blocking] The check C901 alerts when a function is getting too complicated. Why not splitting the logic in two functions instead of bypassing the check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's a completed api call, it's complicated, yes, but don't like other cases, putting the whole logic in the same function here, I think it increases the readability. I don't want to split it.
cli/src/pcluster/aws/ec2.py
Outdated
statuses.extend(page.get("CapacityBlockStatuses", [])) | ||
|
||
return statuses | ||
result = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Code Style] The outcome of this function is a list of statuses. Both branches of this logic return a list of statuses, so we should define the array of results before the if..else branch.
This improves also the readability because it's immediately clear that even if we have different branches of logic they all contributes to the same type of result.
Examples:
function returnArray():
statuses = []
if condition:
statuses.extend(....)
else:
statuses.extend(....)
return statuses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion. Done.
inactive_blocks = [] | ||
unhealthy_details = [] | ||
|
||
for cb_id in capacity_reservation_ids: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CodeStyle] We needed this PR because we got confused by the differences between capacity reservation ids and capacity block ids. We should prevent source of confusion around it in our code, even small ones. Example: here we are iterating over a list of CR ids, but the element is named cb_id
, which makes refers to CB ids. Can you please adapt the names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
try: | ||
# Try to get capacity block status first | ||
statuses = AWSApi.instance().ec2.describe_capacity_block_status([cb_id]) | ||
statuses = AWSApi.instance().ec2.describe_capacity_block_status() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BLOCKING][Logic] Why should we call describe_capacity_block_status()
withing the for loop?
We used to call describe_capacity_block_status(cb_id)
, for every cb_id
. which motivated the loop.
Now we are calling describe_capacity_block_status()
. So we are making here redundant calls to the same function. Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for catching, fixed
if size is not None: | ||
if size not in allowed_sizes_list: | ||
invalid_capacity_blocks.append( | ||
f"{status.get('CapacityBlockId')} (size: {size}, allowed: {allowed_sizes_list})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[KUDOS:UX Best Practice | Responsive to Feedback] Good user experience here. With this logic we guide the user in better identifying all the specific blocks causing the issue. Thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!!
], | ||
}, | ||
], | ||
"The following capacity blocks have invalid block sizes: cbr-456 \\(size: 16, allowed: \\[9, 18\\]\\).", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[KUDOS:TestBestPractice | Responsive to Feedback] Adding ids to test cases is a very good practice to leverage uni testing as a better way to document behaviors and troubleshoot them in case of failures. Thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!!
…ity_block_status api
Description of changes
Tests
Checklist
develop
add the branch name as prefix in the PR title (e.g.[release-3.6]
).Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.