Conversation

@DaanHoogland (Contributor) commented Jun 13, 2018

Description

When querying for a storage pool to create a config drive on, only the pools of volumes that are in the right state should be considered.
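
As a rough sketch of the idea (not the PR's actual diff), the "right state" check can be thought of as a predicate on the volume; the excluded states here are an assumption taken from the review discussion further down:

```java
// Hypothetical predicate: a volume that is being destroyed or expunged no
// longer has a pool that should host a config drive. Assumes CloudStack's
// Volume interface and its State enum.
private static boolean hasUsablePool(final Volume volume) {
    return volume.getState() != Volume.State.Destroy
            && volume.getState() != Volume.State.Expunging;
}
```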

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

GitHub Issue/PRs

Screenshots (if appropriate):

How Has This Been Tested?

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
Testing
  • I have added tests to cover my changes.
  • All relevant new and existing integration tests have passed.
  • A full integration test suite with all tests that can run in my environment has passed.

```java
        dataStore = _dataStoreMgr.getDataStore(volumes.get(0).getPoolId(), DataStoreRole.Primary);
    }
    final List<VolumeVO> volumes = _volumeDao.findByInstanceAndType(profile.getVirtualMachine().getId(), Volume.Type.ROOT);
    if (volumes != null && volumes.size() > 0) {
```
@rafaelweingartner (Member) commented Jun 14, 2018

You can (should?) use CollectionUtils here though.
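
For reference, the suggestion amounts to replacing the manual guard with Apache Commons Collections; the import path below assumes commons-collections 3:

```java
import org.apache.commons.collections.CollectionUtils;

// before: manual null-and-empty check
if (volumes != null && volumes.size() > 0) {
    // ...
}

// after: the same guard, stated as intent (also null-safe)
if (CollectionUtils.isNotEmpty(volumes)) {
    // ...
}
```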

```java
case UploadError:
case UploadAbandoned:
default:
    break;
```
Member

All above cases can be removed?

@DaanHoogland (Contributor, Author)

Yes, but these are here for clarity.
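
For readers following along, the point of keeping the "redundant" labels looks like this (illustrative sketch; same behaviour as default alone):

```java
// Listing the terminal states explicitly, even though they fall through to
// the same no-op as default, documents that they were considered rather
// than overlooked.
switch (volume.getState()) {
    case UploadError:
    case UploadAbandoned:
    default:
        break;
}
```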

@PaulAngus (Member)

Attempting to deploy a VM with a config drive on primary storage generates the following error (deployment works fine on secondary storage):

Asking ConfigDrive to release NicProfile[8-4-acef4b56-a86d-4519-affd-068869458240-null-null
2018-06-19 11:50:06,512 WARN  [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Unable to release some network resources.
java.lang.NullPointerException
  at com.cloud.network.element.ConfigDriveNetworkElement.deleteConfigDriveIso(ConfigDriveNetworkElement.java:423)

2018-06-19 11:50:06,007 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Deployment found  - P0=VM[User|i-2-4-VM], P0=Dest[Zone(Id)-Pod(Id)-Cluster(Id)-Host(Id)-Storage(Volume(Id|Type-->Pool(Id))] : Dest[Zone(1)-Pod(1)-Cluster(1)-Host(1)-Storage(Volume(6|ROOT-->Pool(2))]
2018-06-19 11:50:06,027 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) VM state transitted from :Starting to Starting with event: OperationRetryvm's original host id: null new host id: 1 host id before state transition: null
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Hosts's actual total CPU: 6900 and CPU after applying overprovisioning: 13800
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) We are allocating VM, increasing the used capacity of this host:1
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Current Used CPU: 1500 , Free CPU:12300 ,Requested CPU: 500
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Current Used RAM: 2147483648 , Free RAM:5368172544 ,Requested RAM: 536870912
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) CPU STATS after allocation: for host: 1, old used: 1500, old reserved: 0, actual total: 6900, total with overprovisioning: 13800; new used:2000, reserved:0; requested cpu:500,alloc_from_last:false
2018-06-19 11:50:06,033 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) RAM STATS after allocation: for host: 1, old used: 2147483648, old reserved: 0, total: 7515656192; new used: 2684354560, reserved: 0; requested mem: 536870912,alloc_from_last:false
2018-06-19 11:50:06,040 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Network id=204 is already implemented
2018-06-19 11:50:06,100 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Service SecurityGroup is not supported in the network id=204
2018-06-19 11:50:06,106 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Changing active number of nics for network id=204 on 1
2018-06-19 11:50:06,114 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Asking ConfigDrive to prepare for Nic[8-4-acef4b56-a86d-4519-affd-068869458240-null]
2018-06-19 11:50:06,131 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Cleaning up resources for the vm VM[User|i-2-4-VM] in Starting state
2018-06-19 11:50:06,137 DEBUG [c.c.a.t.Request] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Seq 1-5003499186008621072: Sending  { Cmd , MgmtId: 7020325439565, via: 1(pr2709-t2766-kvm-centos7-kvm1), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"checkBeforeCleanup":false,"forceStop":false,"volumesToDisconnect":[],"vmName":"i-2-4-VM","executeInSequence":false,"wait":0}}] }
2018-06-19 11:50:06,484 DEBUG [c.c.a.t.Request] (AgentManager-Handler-3:null) (logid:) Seq 1-5003499186008621072: Processing:  { Ans: , MgmtId: 7020325439565, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.StopAnswer":{"result":true,"wait":0}}] }
2018-06-19 11:50:06,484 DEBUG [c.c.a.t.Request] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Seq 1-5003499186008621072: Received:  { Ans: , MgmtId: 7020325439565, via: 1(pr2709-t2766-kvm-centos7-kvm1), Ver: v1, Flags: 10, { StopAnswer } }
2018-06-19 11:50:06,493 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Service SecurityGroup is not supported in the network id=204
2018-06-19 11:50:06,496 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Changing active number of nics for network id=204 on -1
2018-06-19 11:50:06,506 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Asking ConfigDrive to release NicProfile[8-4-acef4b56-a86d-4519-affd-068869458240-null-null
2018-06-19 11:50:06,512 WARN  [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Unable to release some network resources.
java.lang.NullPointerException
  at com.cloud.network.element.ConfigDriveNetworkElement.deleteConfigDriveIso(ConfigDriveNetworkElement.java:423)
  at com.cloud.network.element.ConfigDriveNetworkElement.release(ConfigDriveNetworkElement.java:147)
  at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.releaseNic(NetworkOrchestrator.java:1919)
  at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.release(NetworkOrchestrator.java:1845)
  at com.cloud.vm.VirtualMachineManagerImpl.cleanup(VirtualMachineManagerImpl.java:1567)
  at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:1272)
  at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:4930)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:107)
  at com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:5093)
  at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:102)
  at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:581)
  at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
  at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
  at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:529)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2018-06-19 11:50:06,513 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) Successfully cleanued up resources for the vm VM[User|i-2-4-VM] in Starting state
2018-06-19 11:50:06,523 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-1:ctx-5b7ddba6 job-41/job-42 ctx-ef262bd0) (logid:1f612ec8) VM state transitted from :Starting to Stopped with event: OperationFailedvm's original host id: null new host id: null host id before state transition: 1

```java
    }
    final List<VolumeVO> volumes = _volumeDao.findByInstanceAndType(profile.getVirtualMachine().getId(), Volume.Type.ROOT);
    if (CollectionUtils.isNotEmpty(volumes)) {
        dataStore = pickDataStoreFromVolumes(volumes);
```
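
A minimal sketch of what a helper like pickDataStoreFromVolumes might do, assuming the element class's _dataStoreMgr field; the accepted state set is an assumption here, and the PR's actual implementation is in the diff:

```java
// Sketch only: return the primary data store of the first root volume in a
// usable state, or null if none qualifies.
private DataStore pickDataStoreFromVolumes(final List<VolumeVO> volumes) {
    for (final VolumeVO volume : volumes) {
        if (volume.getState() == Volume.State.Ready) {
            return _dataStoreMgr.getDataStore(volume.getPoolId(), DataStoreRole.Primary);
        }
    }
    return null; // caller must handle the no-usable-volume case
}
```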
Member

We may need this code: when a VM is freshly deployed/started, the dest can have storage disks which are not yet allocated but exist only in the destination plan; therefore we scan those planned disks to find the root disk and then find the datastore for it. Note that at this stage those planned disks may not have been persisted in the db.
If an existing (say, stopped) VM is started, the destination plan may not have them, so we simply query the volumes for the VM (which should be available from the db). The sketch below illustrates the two paths.

I think the NPE is hit in the case of a freshly started VM because the planned volume may not have been allocated and persisted in the db. IMO, this PR is not blocking the RC, and the refactoring PR can be targeted for 4.11.2.0.
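
Put as code, the two paths described above could look roughly like this (a sketch, assuming DeployDestination exposes the planned disk-to-pool map via getStorageForDisks() as CloudStack did at the time):

```java
DataStore dataStore = null;

// Path 1: freshly deployed VM - the root disk may exist only in the
// destination plan, not yet persisted in the db.
if (dest.getStorageForDisks() != null) {
    for (final Map.Entry<Volume, StoragePool> entry : dest.getStorageForDisks().entrySet()) {
        if (entry.getKey().getVolumeType() == Volume.Type.ROOT) {
            dataStore = _dataStoreMgr.getDataStore(entry.getValue().getId(), DataStoreRole.Primary);
            break;
        }
    }
}

// Path 2: existing (e.g. stopped) VM - the root volume is in the db.
if (dataStore == null) {
    final List<VolumeVO> volumes = _volumeDao.findByInstanceAndType(
            profile.getVirtualMachine().getId(), Volume.Type.ROOT);
    if (CollectionUtils.isNotEmpty(volumes)) {
        dataStore = pickDataStoreFromVolumes(volumes);
    }
}
```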

@DaanHoogland (Contributor, Author)

I agree that this is not blocking, but it needs to be fixed, as it does not take into account that volumes may be in the DESTROYED or EXPUNGING state. I'll be looking at a fix today.

Member

Okay @DaanHoogland, but if the root volume is destroyed or expunging, then the VM is probably destroyed/expunged as well (probably a corner case).

@DaanHoogland (Contributor, Author)

@rhtyd I agree. I am really wondering if this is a red herring or a needed fix.

@rohityadavcloud (Member)

@DaanHoogland I've tested and reviewed the PR, please see my notes in the diff section.

@DaanHoogland (Contributor, Author)

Whether this works or not, we could do with an integration test for it.

@DaanHoogland changed the title from "only ask for the root volume, removing extensive query" to "check volumes for state when retrieving pool for configDrive creation" Jun 21, 2018
@DaanHoogland force-pushed the configDriveOnlyRootVolume branch from b05c16d to 0e04e69 on June 25, 2018 10:44
@rohityadavcloud modified the milestones: 4.11.1.0, 4.11.2.0 Jun 28, 2018
```java
    return dataStore;
}

private DataStore savelyPickExistingRootVolumeDataStore(VirtualMachineProfile profile, DataStore dataStore) {
```
Contributor

typo? 'safely'

@rohityadavcloud (Member)

@borisstoyanov can you advise if it's passing manual config drive testing, wrt primary+secondary storage for hosting config drives? /cc @DaanHoogland - can you address any outstanding issues?

@apache deleted 10 comments from blueorangutan Jul 16, 2018
@DaanHoogland (Contributor, Author)

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2174

@borisstoyanov (Contributor)

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2848)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37645 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2709-t2848-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dhcphosts.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_router_nics.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 61 look OK, 6 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestDeployVirtioSCSIVM>:setup | Error | 0.00 | test_deploy_virtio_scsi_vm.py |
| test_02_vpc_privategw_static_routes | Failure | 168.08 | test_privategw_acl.py |
| test_03_vpc_privategw_restart_vpc_cleanup | Failure | 1201.46 | test_privategw_acl.py |
| test_02_unsecure_vm_migration | Error | 73.04 | test_vm_life_cycle.py |
| test_03_secured_to_nonsecured_vm_migration | Error | 52.72 | test_vm_life_cycle.py |
| test_02_redundant_VPC_default_routes | Failure | 219.17 | test_vpc_redundant.py |
| test_02_VPC_default_routes | Failure | 141.50 | test_vpc_router_nics.py |
| test_hostha_enable_ha_when_host_in_maintenance | Error | 2.45 | test_hostha_kvm.py |

@borisstoyanov (Contributor) left a comment

LGTM, based on test results and manual verification

@DaanHoogland merged commit 38d0274 into apache:4.11 Jul 18, 2018
DaanHoogland pushed a commit that referenced this pull request Jul 20, 2018
* 4.11:
  register template kvm context ui fix (#2757)
  check volumes for state when retrieving pool for configDrive creation (#2709)
borisstoyanov pushed a commit to shapeblue/cloudstack that referenced this pull request Jul 23, 2018
…apache#2709)

* only ask for the root volume, removing extensive query

* better name