Skip to content

Conversation

@wido
Copy link
Contributor

@wido wido commented Aug 8, 2018

Description

The KVM Agent had two mechanisms for reporting its capabilities
and memory to the Management Server.

On startup it would ask libvirt the amount of Memory the Host has
and subtract and add the reserved and overcommit memory.

When the HostStats were however reported to the Management Server
these two configured values on the Agent were no longer reported
in the statistics thus showing all the available memory in the
Agent/Host to the Management Server.

This commit unifies this by using the same logic on Agent Startup
and during statistics reporting.

memory=3069636608, reservedMemory=1073741824

This was reported by a 4GB Hypervisor with this setting:

host.reserved.mem.mb=1024

The GUI (thus API) would then show:

Memory Total 2.86 GB

This way the Agent properly 'lies' to the Management Server about its
capabilities in terms of Memory.

This is very helpful if you want to overprovision or undercommit machines
for various reasons.

Overcommitting can be done when KSM or ZSwap or a fast SWAP device is
installed in the machine.

Underprovisioning is done when the Host might run other tasks then a KVM
hypervisor, for example when it runs in a hyperconverged setup with Ceph.

In addition internally many values have been changed from a Double to a Long
and also store the amount of bytes instead of Kilobytes.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

How Has This Been Tested?

I tested this on a CloudStack 4.12 test setup running on my local system.

The KVM Agent will report the amount of memory it has minus any reservations. This way the Mgmt server will never know the exact amount of memory the Agent has.

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
    Testing
  • I have added tests to cover my changes.
  • All relevant new and existing integration tests have passed.
  • A full integration testsuite with all test that can run on my environment has passed.

@dhlaluku
Copy link
Contributor

dhlaluku commented Aug 8, 2018

@blueorangutan package

import java.util.ArrayList;
import java.util.List;

public class HostInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this class to KVMHostInfo, that way it will be more obvious what this class is expected of?


public class HostInfo {

private static final Logger s_logger = Logger.getLogger(HostInfo.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the s_, instead using LOG or LOGGER?

@rohityadavcloud
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2237

@rohityadavcloud
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@wido wido force-pushed the kvm-memory-calculation branch from 16f0739 to 4cb5c9f Compare August 9, 2018 08:54
@wido
Copy link
Contributor Author

wido commented Aug 9, 2018

@rhtyd Done, I've renamed the class and the logger variable. How does this look like?

@wido
Copy link
Contributor Author

wido commented Aug 9, 2018

Ping @RPDiep

@blueorangutan
Copy link

Trillian test result (tid-2925)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31642 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2795-t2925-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermitten failure detected: /marvin/tests/smoke/test_snapshots.py
Intermitten failure detected: /marvin/tests/smoke/test_templates.py
Intermitten failure detected: /marvin/tests/smoke/test_usage.py
Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermitten failure detected: /marvin/tests/smoke/test_volumes.py
Intermitten failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 60 look OK, 9 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_add_primary_storage_disabled_host Error 0.57 test_primary_storage.py
test_01_primary_storage_nfs Error 0.08 test_primary_storage.py
ContextSuite context=TestStorageTags>:setup Error 0.15 test_primary_storage.py
test_03_vpc_privategw_restart_vpc_cleanup Failure 1104.41 test_privategw_acl.py
test_02_list_snapshots_with_removed_data_store Error 1.11 test_snapshots.py
test_04_extract_template Failure 128.24 test_templates.py
ContextSuite context=TestISOUsage>:setup Error 0.00 test_usage.py
test_01_secure_vm_migration Error 51.90 test_vm_life_cycle.py
test_02_unsecure_vm_migration Error 122.96 test_vm_life_cycle.py
test_08_migrate_vm Error 14.66 test_vm_life_cycle.py
test_06_download_detached_volume Failure 137.53 test_volumes.py
test_01_cancel_host_maintenace_with_no_migration_jobs Failure 6.25 test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs Error 3.25 test_host_maintenance.py
test_hostha_enable_ha_when_host_in_maintenance Error 3.41 test_hostha_kvm.py

@dhlaluku
Copy link
Contributor

dhlaluku commented Aug 9, 2018

@blueorangutan package

@blueorangutan
Copy link

@dhlaluku a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2238

@dhlaluku
Copy link
Contributor

dhlaluku commented Aug 9, 2018

@blueorangutan test

@blueorangutan
Copy link

@dhlaluku a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-2927)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32376 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2795-t2927-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermitten failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermitten failure detected: /marvin/tests/smoke/test_snapshots.py
Intermitten failure detected: /marvin/tests/smoke/test_templates.py
Intermitten failure detected: /marvin/tests/smoke/test_usage.py
Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermitten failure detected: /marvin/tests/smoke/test_volumes.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 60 look OK, 9 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_add_primary_storage_disabled_host Error 0.56 test_primary_storage.py
test_01_primary_storage_nfs Error 0.10 test_primary_storage.py
ContextSuite context=TestStorageTags>:setup Error 0.18 test_primary_storage.py
test_03_vpc_privategw_restart_vpc_cleanup Failure 1148.03 test_privategw_acl.py
test_02_list_snapshots_with_removed_data_store Error 1.12 test_snapshots.py
test_04_extract_template Failure 128.28 test_templates.py
ContextSuite context=TestISOUsage>:setup Error 0.00 test_usage.py
test_01_secure_vm_migration Error 87.67 test_vm_life_cycle.py
test_02_unsecure_vm_migration Error 68.02 test_vm_life_cycle.py
test_08_migrate_vm Error 14.66 test_vm_life_cycle.py
test_06_download_detached_volume Failure 137.59 test_volumes.py
test_01_cancel_host_maintenace_with_no_migration_jobs Failure 6.28 test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs Error 2.25 test_host_maintenance.py
test_hostha_enable_ha_when_host_in_maintenance Error 4.47 test_hostha_kvm.py

final LibvirtCapXMLParser parser = new LibvirtCapXMLParser();
parser.parseCapabilitiesXML(conn.getCapabilities());
final ArrayList<String> oss = parser.getGuestOsType();
for (final String s : oss) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doing all of this process if you only report the host supporting HVM? I mean, KVM does not run if the CPU does not support HVM. Therefore, all KVM hosts will always have HVM support, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they will. But what I did is take the original code and use it. I didn't modify the logic behind the capabilities.

I would make that a different PR if we want to do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you are only moving the old code around, but why not remove useless code now?

I mean, why would we keep a code that does not aggregate value, and can be misleading to newcomers?

@rohityadavcloud rohityadavcloud added this to the 4.12.0.0 milestone Aug 14, 2018
@GabrielBrascher
Copy link
Member

@rhtyd do you know if the failing tests are false positives? I have been testing some actions such as VM migrations and adding/removing hosts. I had no problem when using primary storage with nfs and a cluster with KVM servers.

@dhlaluku
Copy link
Contributor

@GabrielBrascher most of the failing tests have already been addressed and merged into master with this PRs #2763 #2756 #2754

@wido have this changes been rebased against the latest master?

@wido wido force-pushed the kvm-memory-calculation branch from 4cb5c9f to bc90f30 Compare August 16, 2018 14:10
@wido
Copy link
Contributor Author

wido commented Aug 16, 2018

@dhlaluku I've rebased against master

@dhlaluku
Copy link
Contributor

@wido thank you
@blueorangutan package

@blueorangutan
Copy link

@dhlaluku a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2256

@blueorangutan
Copy link

Packaging result: ✖centos6 ✖centos7 ✖debian. JID-2517

@dhlaluku
Copy link
Contributor

dhlaluku commented Jan 8, 2019

@blueorangutan package

@blueorangutan
Copy link

@dhlaluku a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@rohityadavcloud
Copy link
Member

@wido can you check and fix build failures, see job 1 of travis

@blueorangutan
Copy link

Packaging result: ✖centos6 ✖centos7 ✖debian. JID-2518

@wido wido force-pushed the kvm-memory-calculation branch from 48482b0 to 004e785 Compare January 8, 2019 15:49
@wido
Copy link
Contributor Author

wido commented Jan 8, 2019

I just force pushed. This seems to be a thing (Unit Test) which fails on Travis, but works on my local system.

Could be a problem where tests are running inside a Container?

@wido wido force-pushed the kvm-memory-calculation branch from 004e785 to 31ad338 Compare January 9, 2019 13:37
The KVM Agent had two mechanisms for reporting its capabilities
and memory to the Management Server.

On startup it would ask libvirt the amount of Memory the Host has
and subtract and add the reserved and overcommit memory.

When the HostStats were however reported to the Management Server
these two configured values on the Agent were no longer reported
in the statistics thus showing all the available memory in the
Agent/Host to the Management Server.

This commit unifies this by using the same logic on Agent Startup
and during statistics reporting.

  memory=3069636608, reservedMemory=1073741824

This was reported by a 4GB Hypervisor with this setting:

  host.reserved.mem.mb=1024

The GUI (thus API) would then show:

  Memory Total	2.86 GB

This way the Agent properly 'lies' to the Management Server about its
capabilities in terms of Memory.

This is very helpful if you want to overprovision or undercommit machines
for various reasons.

Overcommitting can be done when KSM or ZSwap or a fast SWAP device is
installed in the machine.

Underprovisioning is done when the Host might run other tasks then a KVM
hypervisor, for example when it runs in a hyperconverged setup with Ceph.

In addition internally many values have been changed from a Double to a Long
and also store the amount of bytes instead of Kilobytes.

Signed-off-by: Wido den Hollander <[email protected]>
@wido wido force-pushed the kvm-memory-calculation branch from 31ad338 to e58f4b1 Compare January 16, 2019 21:43
@wido
Copy link
Contributor Author

wido commented Jan 16, 2019

I just rebased against master and everything seems fine to me.

Are there any objections on this PR?

@GabrielBrascher
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2551

@GabrielBrascher
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@GabrielBrascher a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@rafaelweingartner
Copy link
Member

@wido we need to check the integration test results before being able to merge it.
Thanks for the PR.

@blueorangutan
Copy link

Trillian test result (tid-3333)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 26958 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2795-t3333-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermittent failure detected: /marvin/tests/smoke/test_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 66 look OK, 4 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_add_primary_storage_disabled_host Error 0.87 test_primary_storage.py
test_01_primary_storage_nfs Error 0.15 test_primary_storage.py
ContextSuite context=TestStorageTags>:setup Error 0.28 test_primary_storage.py
test_02_list_snapshots_with_removed_data_store Error 1.17 test_snapshots.py
test_11_migrate_vm Error 79.50 test_vm_life_cycle.py
test_14_secure_to_secure_vm_migration Error 56.54 test_vm_life_cycle.py
test_01_cancel_host_maintenace_with_no_migration_jobs Failure 0.13 test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs Error 2.34 test_host_maintenance.py

@rohityadavcloud
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-3335)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 21178 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2795-t3335-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermittent failure detected: /marvin/tests/smoke/test_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Smoke tests completed. 66 look OK, 4 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_add_primary_storage_disabled_host Error 0.71 test_primary_storage.py
test_01_primary_storage_nfs Error 0.12 test_primary_storage.py
ContextSuite context=TestStorageTags>:setup Error 0.24 test_primary_storage.py
test_02_list_snapshots_with_removed_data_store Error 1.16 test_snapshots.py
test_11_migrate_vm Error 16.91 test_vm_life_cycle.py
test_14_secure_to_secure_vm_migration Error 70.87 test_vm_life_cycle.py
test_01_cancel_host_maintenace_with_no_migration_jobs Failure 0.14 test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs Error 2.39 test_host_maintenance.py

@GabrielBrascher
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2555

@GabrielBrascher
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@GabrielBrascher a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-3338)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 21808 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2795-t3338-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Smoke tests completed. 69 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_redundant_vpc_site2site_vpn Failure 206.83 test_vpc_vpn.py

@GabrielBrascher GabrielBrascher merged commit c496c84 into apache:master Jan 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants