Conversation

@DaanHoogland (Contributor) commented Jun 13, 2018

Description

When querying for a storage pool to create a config drive on, only the pools of volumes that are in the right state should be considered.
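
As a rough sketch of the idea (not the PR's actual diff), the "right state" check can be thought of as a predicate on the volume; the excluded states here are an assumption taken from the review discussion further down:

```java
// Hypothetical predicate: a volume that is being destroyed or expunged no
// longer has a pool that should host a config drive. Assumes CloudStack's
// Volume interface and its State enum.
private static boolean hasUsablePool(final Volume volume) {
    return volume.getState() != Volume.State.Destroy
            && volume.getState() != Volume.State.Expunging;
}
```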

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

GitHub Issue/PRs

Screenshots (if appropriate):

How Has This Been Tested?

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
Testing
  • I have added tests to cover my changes.
  • All relevant new and existing integration tests have passed.
  • A full integration test suite with all tests that can run in my environment has passed.

```java
        dataStore = _dataStoreMgr.getDataStore(volumes.get(0).getPoolId(), DataStoreRole.Primary);
    }
    final List<VolumeVO> volumes = _volumeDao.findByInstanceAndType(profile.getVirtualMachine().getId(), Volume.Type.ROOT);
    if (volumes != null && volumes.size() > 0) {
```
@rafaelweingartner (Member) commented Jun 14, 2018

You can (should?) use CollectionUtils here though.
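
For reference, the suggestion amounts to replacing the manual guard with Apache Commons Collections; the import path below assumes commons-collections 3:

```java
import org.apache.commons.collections.CollectionUtils;

// before: manual null-and-empty check
if (volumes != null && volumes.size() > 0) {
    // ...
}

// after: the same guard, stated as intent (also null-safe)
if (CollectionUtils.isNotEmpty(volumes)) {
    // ...
}
```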

```java
case UploadError:
case UploadAbandoned:
default:
    break;
```
Member

All above cases can be removed?

@DaanHoogland (Contributor, Author)

Yes, but these are here for clarity.
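
For readers following along, the point of keeping the "redundant" labels looks like this (illustrative sketch; same behaviour as default alone):

```java
// Listing the terminal states explicitly, even though they fall through to
// the same no-op as default, documents that they were considered rather
// than overlooked.
switch (volume.getState()) {
    case UploadError:
    case UploadAbandoned:
    default:
        break;
}
```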

@PaulAngus (Member)

Attempting to deploy a VM with a config drive on primary storage generates the following error (deployment works fine on secondary storage):

Asking ConfigDrive to release NicProfile[8-4-acef4b56-a86d-4519-affd-068869458240-null-null
2018-06-19 11:50:06,512 WARN  [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Unable to release some network resources.
java.lang.NullPointerException
  at com.cloud.network.element.ConfigDriveNetworkElement.deleteConfigDriveIso(ConfigDriveNetworkElement.java:423)

2018-06-19 11:50:06,007 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Deployment found  - P0=VM[User|i-2-4-VM], P0=Dest[Zone(Id)-Pod(Id)-Cluster(Id)-Host(Id)-Storage(Volume(Id|Type-->Pool(Id))] : Dest[Zone(1)-Pod(1)-Cluster(1)-Host(1)-Storage(Volume(6|ROOT-->Pool(2))]
2018-06-19 11:50:06,027 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) VM state transitted from :Starting to Starting with event: OperationRetryvm's original host id: null new host id: 1 host id before state transition: null
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Hosts's actual total CPU: 6900 and CPU after applying overprovisioning: 13800
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) We are allocating VM, increasing the used capacity of this host:1
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Current Used CPU: 1500 , Free CPU:12300 ,Requested CPU: 500
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Current Used RAM: 2147483648 , Free RAM:5368172544 ,Requested RAM: 536870912
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) CPU STATS after allocation: for host: 1, old used: 1500, old reserved: 0, actual total: 6900, total with overprovisioning: 13800; new used:2000, reserved:0; requested cpu:500,alloc_from_last:false
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) RAM STATS after allocation: for host: 1, old used: 2147483648, old reserved: 0, total: 7515656192; new used: 2684354560, reserved: 0; requested mem: 536870912,alloc_from_last:false
2018-06-19 11:50:06,040 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Network id=204 is already implemented
2018-06-19 11:50:06,100 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Service SecurityGroup is not supported in the network id=204
2018-06-19 11:50:06,106 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Changing active number of nics for network id=204 on 1
2018-06-19 11:50:06,114 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Asking ConfigDrive to prepare for Nic[8-4-acef4b56-a86d-4519-affd-068869458240-null]
2018-06-19 11:50:06,131 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Cleaning up resources for the vm VM[User|i-2-4-VM] in Starting state
2018-06-19 11:50:06,137 DEBUG [c.c.a.t.Request] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Seq 1-5003499186008621072: Sending  { Cmd , MgmtId: 7020325439565, via: 1(pr2709-t2766-kvm-centos7-kvm1), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"checkBeforeCleanup":false,"forceStop":false,"volumesToDisconnect":[],"vmName":"i-2-4-VM","executeInSequence":false,"wait":0}}] }
2018-06-19 11:50:06,484 DEBUG [c.c.a.t.Request] (AgentManager-Handler-3:null) (logid:) Seq 1-5003499186008621072: Processing:  { Ans: , MgmtId: 7020325439565, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.StopAnswer":{"result":true,"wait":0}}] }
2018-06-19 11:50:06,484 DEBUG [c.c.a.t.Request] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Seq 1-5003499186008621072: Received:  { Ans: , MgmtId: 7020325439565, via: 1(pr2709-t2766-kvm-centos7-kvm1), Ver: v1, Flags: 10, { StopAnswer } }
2018-06-19 11:50:06,493 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Service SecurityGroup is not supported in the network id=204
2018-06-19 11:50:06,496 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Changing active number of nics for network id=204 on -1
2018-06-19 11:50:06,506 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Asking ConfigDrive to release NicProfile[8-4-acef4b56-a86d-4519-affd-068869458240-null-null
2018-06-19 11:50:06,512 WARN  [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Unable to release some network resources.
java.lang.NullPointerException
  at com.cloud.network.element.ConfigDriveNetworkElement.deleteConfigDriveIso(ConfigDriveNetworkElement.java:423)
  at com.cloud.network.element.ConfigDriveNetworkElement.release(ConfigDriveNetworkElement.java:147)
  at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.releaseNic(NetworkOrchestrator.java:1919)
  at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.release(NetworkOrchestrator.java:1845)
  at com.cloud.vm.VirtualMachineManagerImpl.cleanup(VirtualMachineManagerImpl.java:1567)
  at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:1272)
  at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:4930)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:107)
  at com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:5093)
  at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:102)
  at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:581)
  at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
  at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
  at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:529)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2018-06-19 11:50:06,513 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Successfully cleanued up resources for the vm VM[User|i-2-4-VM] in Starting state
2018-06-19 11:50:06,523 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) VM state transitted from :Starting to Stopped with event: OperationFailedvm's original host id: null new host id: null host id before state transition: 1

```java
    }
    final List<VolumeVO> volumes = _volumeDao.findByInstanceAndType(profile.getVirtualMachine().getId(), Volume.Type.ROOT);
    if (CollectionUtils.isNotEmpty(volumes)) {
        dataStore = pickDataStoreFromVolumes(volumes);
```
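
A minimal sketch of what a helper like pickDataStoreFromVolumes might do, assuming the element class's _dataStoreMgr field; the accepted state set is an assumption here, and the PR's actual implementation is in the diff:

```java
// Sketch only: return the primary data store of the first root volume in a
// usable state, or null if none qualifies.
private DataStore pickDataStoreFromVolumes(final List<VolumeVO> volumes) {
    for (final VolumeVO volume : volumes) {
        if (volume.getState() == Volume.State.Ready) {
            return _dataStoreMgr.getDataStore(volume.getPoolId(), DataStoreRole.Primary);
        }
    }
    return null; // caller must handle the no-usable-volume case
}
```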
Member

We may need this code: when a VM is freshly deployed/started, the dest can have storage disks which are not yet allocated but exist only in the destination plan; therefore we scan those planned disks to find the root disk and then find the datastore for it. Note that at this stage those planned disks may not have been persisted in the db.
If an existing (say, stopped) VM is started, the destination plan may not have them, so we simply query the volumes for the VM (which should be available from the db). The sketch below illustrates the two paths.

I think the NPE is hit in the case of a freshly started VM because the planned volume may not have been allocated and persisted in the db. IMO, this PR is not blocking the RC, and the refactoring PR can be targeted for 4.11.2.0.
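
Put as code, the two paths described above could look roughly like this (a sketch, assuming DeployDestination exposes the planned disk-to-pool map via getStorageForDisks() as CloudStack did at the time):

```java
DataStore dataStore = null;

// Path 1: freshly deployed VM - the root disk may exist only in the
// destination plan, not yet persisted in the db.
if (dest.getStorageForDisks() != null) {
    for (final Map.Entry<Volume, StoragePool> entry : dest.getStorageForDisks().entrySet()) {
        if (entry.getKey().getVolumeType() == Volume.Type.ROOT) {
            dataStore = _dataStoreMgr.getDataStore(entry.getValue().getId(), DataStoreRole.Primary);
            break;
        }
    }
}

// Path 2: existing (e.g. stopped) VM - the root volume is in the db.
if (dataStore == null) {
    final List<VolumeVO> volumes = _volumeDao.findByInstanceAndType(
            profile.getVirtualMachine().getId(), Volume.Type.ROOT);
    if (CollectionUtils.isNotEmpty(volumes)) {
        dataStore = pickDataStoreFromVolumes(volumes);
    }
}
```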

@DaanHoogland (Contributor, Author)

I agree that this is not blocking, but it needs to be fixed, as it does not take into account that volumes may be in the DESTROYED or EXPUNGING state. I'll be looking at a fix today.

Member

Okay @DaanHoogland, but if the root volume is destroyed or expunging, then the VM is probably destroyed/expunged as well (probably a corner case).

@DaanHoogland (Contributor, Author)

@rhtyd I agree. I am really wondering if this is a red herring or a needed fix.

@rohityadavcloud (Member)

@DaanHoogland I've tested and reviewed the PR, please see my notes in the diff section.

@DaanHoogland (Contributor, Author)

Whether this works or not, we could do with an integration test for it.

@DaanHoogland changed the title from "only ask for the root volume, removing extensive query" to "check volumes for state when retrieving pool for configDrive creation" Jun 21, 2018
@DaanHoogland force-pushed the configDriveOnlyRootVolume branch from b05c16d to 0e04e69 on June 25, 2018 10:44
@rohityadavcloud modified the milestones: 4.11.1.0, 4.11.2.0 Jun 28, 2018
```java
    return dataStore;
}

private DataStore savelyPickExistingRootVolumeDataStore(VirtualMachineProfile profile, DataStore dataStore) {
```
Contributor

typo? 'safely'

@rohityadavcloud (Member)

@borisstoyanov can you advise if it's passing manual config drive testing, wrt primary+secondary storage for hosting config drives? /cc @DaanHoogland - can you address any outstanding issues?

@apache deleted 10 comments from blueorangutan Jul 16, 2018
@DaanHoogland (Contributor, Author)

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2174

@borisstoyanov (Contributor)

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2848)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37645 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2709-t2848-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dhcphosts.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_router_nics.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 61 look OK, 6 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestDeployVirtioSCSIVM>:setup | Error | 0.00 | test_deploy_virtio_scsi_vm.py |
| test_02_vpc_privategw_static_routes | Failure | 168.08 | test_privategw_acl.py |
| test_03_vpc_privategw_restart_vpc_cleanup | Failure | 1201.46 | test_privategw_acl.py |
| test_02_unsecure_vm_migration | Error | 73.04 | test_vm_life_cycle.py |
| test_03_secured_to_nonsecured_vm_migration | Error | 52.72 | test_vm_life_cycle.py |
| test_02_redundant_VPC_default_routes | Failure | 219.17 | test_vpc_redundant.py |
| test_02_VPC_default_routes | Failure | 141.50 | test_vpc_router_nics.py |
| test_hostha_enable_ha_when_host_in_maintenance | Error | 2.45 | test_hostha_kvm.py |

@borisstoyanov (Contributor) left a comment

LGTM, based on test results and manual verification

@DaanHoogland merged commit 38d0274 into apache:4.11 Jul 18, 2018
DaanHoogland pushed a commit that referenced this pull request Jul 20, 2018
* 4.11:
  register template kvm context ui fix (#2757)
  check volumes for state when retrieving pool for configDrive creation (#2709)
borisstoyanov pushed a commit to shapeblue/cloudstack that referenced this pull request Jul 23, 2018
…apache#2709)

* only ask for the root volume, removing extensive query

* better name