@vishesh92
Member

@vishesh92 vishesh92 commented Jun 10, 2024

Description

Fixes #7202 #9110

This PR doesn't completely solve the issue. Completely resolving it would require sacrificing the speed at which VMs can be deployed in parallel.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov

codecov bot commented Jun 10, 2024

Codecov Report

Attention: Patch coverage is 11.59420% with 61 lines in your changes missing coverage. Please review.

Project coverage is 15.67%. Comparing base (3f2761e) to head (120ad26).
Report is 90 commits behind head on 4.19.

Files                                                  Patch %  Lines
...cloudstack/affinity/HostAntiAffinityProcessor.java  0.00%    43 Missing ⚠️
...che/cloudstack/affinity/HostAffinityProcessor.java  44.44%   9 Missing and 1 partial ⚠️
.../cloudstack/affinity/dao/AffinityGroupDaoImpl.java  0.00%    8 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19    #9201      +/-   ##
============================================
+ Coverage     14.96%   15.67%   +0.71%     
- Complexity    10993    10997       +4     
============================================
  Files          5373     5010     -363     
  Lines        469248   439976   -29272     
  Branches      58782    54849    -3933     
============================================
- Hits          70210    68967    -1243     
+ Misses       391266   363377   -27889     
+ Partials       7772     7632     -140     
Flag Coverage Δ
uitests ?
unittests 15.67% <11.59%> (+<0.01%) ⬆️

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9842

@vishesh92
Member Author

@blueorangutan test keepEnv

@blueorangutan

@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-10401)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43304 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9201-t10401-kvm-centos7.zip
Smoke tests completed. 130 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test                                 Result   Time (s)  Test File
test_01_redundant_vpc_site2site_vpn  Failure  387.83    test_vpc_vpn.py

@weizhouapache
Member

@vishesh92
Have you tested it?

@vishesh92
Member Author

@vishesh92 have you tested it ?

I tested with the simulator. I was able to reproduce and fix the issue for strict anti-affinity.

@weizhouapache
Member

@vishesh92 have you tested it ?

I tested with simulator. I was able to reproduce and fix the issue for strict anti affinity.

Thanks @vishesh92
I will create a testing env to test it.

@weizhouapache
Member

@vishesh92
I tested host affinity:

    # Base deploy command; the IDs are specific to this test environment
    command="cmk deploy virtualmachine zoneid=5a6557c9-b1e0-4fbf-bca4-7f376a96df65 templateid=344e6034-28ae-11ef-8462-1e00c200016f serviceofferingid=2eb32c51-d604-472b-895e-019f31a2f146 networkids=39323a57-1e12-4f3d-b030-36ef7e850e22"

    affinitygroup=$(cmk create affinitygroup type='host affinity' name="test-$(uuidgen)")
    affinitygroupid=$(echo "$affinitygroup" | jq -r '.affinitygroup.id')

    # create 2 vms in parallel, then wait for both background jobs to finish
    for i in $(seq 1 2); do $command affinitygroupids="$affinitygroupid" & done
    wait

I tried twice. The first time it worked; the second time it did not.

image

@vishesh92
Member Author

@vishesh92 I tested host affinity:

    # Base deploy command; the IDs are specific to this test environment
    command="cmk deploy virtualmachine zoneid=5a6557c9-b1e0-4fbf-bca4-7f376a96df65 templateid=344e6034-28ae-11ef-8462-1e00c200016f serviceofferingid=2eb32c51-d604-472b-895e-019f31a2f146 networkids=39323a57-1e12-4f3d-b030-36ef7e850e22"

    affinitygroup=$(cmk create affinitygroup type='host affinity' name="test-$(uuidgen)")
    affinitygroupid=$(echo "$affinitygroup" | jq -r '.affinitygroup.id')

    # create 2 vms in parallel, then wait for both background jobs to finish
    for i in $(seq 1 2); do $command affinitygroupids="$affinitygroupid" & done
    wait

I tried twice. The first time it worked; the second time it did not.

Yes. This will reduce the occurrence of failures but not completely resolve the issue.

@vishesh92 vishesh92 added this to the 4.19.1.0 milestone Jun 12, 2024
Member

@rohityadavcloud rohityadavcloud left a comment

LGTM, but I can't confirm the stability of the feature/testing, or whether the current smoke tests cover these changes.

@weizhouapache
Member

@vishesh92
I had a look at the VM deployment process. Would it make sense to lock the method below?

public DeployDestination planDeployment(VirtualMachineProfile vmProfile, DeploymentPlan plan, ExcludeList avoids, DeploymentPlanner planner)
throws InsufficientServerCapacityException, AffinityConflictException {

That is, rename it to planDeploymentInternal and create a new planDeployment that wraps it using Transaction?

@weizhouapache
Member

@vishesh92 I had a look at the VM deployment process. Would it make sense to lock the method below?

public DeployDestination planDeployment(VirtualMachineProfile vmProfile, DeploymentPlan plan, ExcludeList avoids, DeploymentPlanner planner)
throws InsufficientServerCapacityException, AffinityConflictException {

That is, rename it to planDeploymentInternal and create a new planDeployment that wraps it using Transaction?

I am also wondering which steps should be locked. Maybe everything until the host is determined?

maybe we could introduce some mechanism, for example

The problem is that it could lead to high latency if the platform is large.
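
To make the locking idea above concrete, here is a minimal sketch, assuming one lock per affinity group rather than a single global lock, so that deployments in unrelated groups can still be planned in parallel. It is not CloudStack code: the class AntiAffinityPlannerSketch, its in-memory reservedHosts map, and the simplified planDeployment/planDeploymentInternal signatures are all hypothetical (the real method takes a VirtualMachineProfile, DeploymentPlan, ExcludeList and DeploymentPlanner). It also uses a plain ReentrantLock instead of the Transaction approach mentioned above, which only serializes threads inside one JVM; a setup with several management servers would need a shared lock.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: none of these names exist in CloudStack.
class AntiAffinityPlannerSketch {
    // One lock per affinity group, so unrelated groups are not serialized.
    private final Map<Long, ReentrantLock> groupLocks = new ConcurrentHashMap<>();
    // Hosts already taken per group; stands in for the state kept in the database.
    private final Map<Long, Set<String>> reservedHosts = new ConcurrentHashMap<>();
    private final List<String> hosts;

    AntiAffinityPlannerSketch(List<String> hosts) {
        this.hosts = new ArrayList<>(hosts);
    }

    // Locked wrapper: planning for one group runs one request at a time.
    String planDeployment(long affinityGroupId) {
        ReentrantLock lock = groupLocks.computeIfAbsent(affinityGroupId, id -> new ReentrantLock());
        lock.lock();
        try {
            return planDeploymentInternal(affinityGroupId);
        } finally {
            lock.unlock();
        }
    }

    // Check-then-reserve step; only safe while the group lock is held.
    private String planDeploymentInternal(long affinityGroupId) {
        Set<String> used = reservedHosts.computeIfAbsent(affinityGroupId, id -> new HashSet<>());
        for (String host : hosts) {
            if (used.add(host)) {
                return host; // first host not yet used by this group
            }
        }
        return null; // no host left that satisfies strict anti-affinity
    }
}
```

With two hosts, two calls for the same group return different hosts and a third returns null. The latency concern still applies: every deployment in the same group waits for the previous one to finish planning.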

@rohityadavcloud
Member

@vishesh92 can you review and advise further (PR marked waiting-for-author)?

@vishesh92
Member Author

@vishesh92 can you review and advise further (PR marked waiting-for-author)?

@rohityadavcloud I am running the capc e2e test for affinity group. Let me share the results once the job is complete.

@vishesh92
Member Author

@weizhouapache @sureshanaparti I ran the e2e test 5 times on an env with this patch. Here are the results:
2 - passed
2 - timed out
1 - failed

@DaanHoogland
Contributor

@sureshanaparti, will we try to merge this in 4.19.1?

@vishesh92
Member Author

@DaanHoogland This PR is targeted for 4.19.2

Contributor

@kiranchavala kiranchavala left a comment

LGTM, tested with the following steps:

1. Create an affinity group of type "host anti-affinity (Strict)".

2. Deploy 2 VMs without any affinity group attached.

3. Check the VMs' host (they can be deployed on the same host).

Note: Make sure the VMs are deployed on a single host in the cluster.

4. Stop the VMs.

5. Change the affinity group of the VMs to "host anti-affinity (Strict)".

Note: Make sure the affinity group is attached to the VMs.

6. Start the VMs (make sure to uncheck the option to start on the last host).

7. The VMs got deployed on different hosts.

@DaanHoogland DaanHoogland merged commit c98f1b8 into apache:4.19 Aug 12, 2024
@DaanHoogland DaanHoogland deleted the fix-affinity-group-parallel-deploy branch August 12, 2024 12:02
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Aug 22, 2024

Development

Successfully merging this pull request may close these issues.

VMs with same host-affinity group are deployed to different hosts
host anti-affinity is not working

7 participants