
Conversation

@Slair1 (Contributor) commented Jun 25, 2018

Rebase of #2472 onto 4.11

If the KVM heartbeat file can't be written to, the host is rebooted, taking down all VMs running on it. The code does retry 5 times before the reboot, but there is no delay between the retries, so they are effectively 5 simultaneous attempts, which doesn't help. Standard SAN storage HA operations or a quick network blip could cause this reboot to occur.

Some discussions on the dev mailing list revealed that some people are just commenting out the reboot line in their version of the CloudStack source.

A better option would be to have it sleep between tries so there aren't 5 nearly simultaneous attempts. Also, instead of rebooting, the cloudstack-agent could simply be stopped on the host. This causes alerts to be issued, and if the host stays disconnected long enough, depending on the HA code in use, VM HA can handle the host failure.

The built-in reboot of the host seemed drastic.
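For illustration, a minimal sketch of the proposed behaviour (sleep between heartbeat attempts, then stop the agent instead of rebooting). The class and method names and the retry/interval values here are placeholders, not the actual KVMHAMonitor code:

```java
// Illustrative sketch only; not the actual KVMHAMonitor code.
// writeHeartbeat() and stopAgent() are hypothetical placeholders.
public class HeartbeatCheckSketch {

    private static final int MAX_ATTEMPTS = 5;
    private static final long RETRY_SLEEP_MS = 10_000L; // pause between attempts

    public void checkHeartbeat() {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (writeHeartbeat()) {
                return; // storage is reachable, nothing more to do
            }
            if (attempt < MAX_ATTEMPTS) {
                try {
                    // Give a SAN HA failover or a short network blip time to clear
                    // instead of firing all retries back to back.
                    Thread.sleep(RETRY_SLEEP_MS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
        // All attempts failed: stop the agent rather than rebooting the host,
        // so alerts fire and VM HA (where configured) can handle the failure.
        stopAgent();
    }

    private boolean writeHeartbeat() {
        return true; // placeholder for running kvmheartbeat.sh against the pool
    }

    private void stopAgent() {
        // placeholder for stopping the cloudstack-agent service on the host
    }
}
```

In this sketch the failure path never reboots the host; the admin or VM HA decides what happens next.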

@Slair1 (Contributor, Author) commented Jun 25, 2018

Looks like there is an additional issue open (#2657) and a related PR (#2658).

@borisstoyanov (Contributor)

@blueorangutan package

@blueorangutan

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2154

@borisstoyanov (Contributor)

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2827)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30394 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2722-t2827-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 63 look OK, 4 have error(s)
Only failed test results are shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
ContextSuite context=TestDeployVirtioSCSIVM>:setup | Error | 0.00 | test_deploy_virtio_scsi_vm.py
test_03_vpc_privategw_restart_vpc_cleanup | Failure | 1116.27 | test_privategw_acl.py
test_05_rvpc_multi_tiers | Failure | 319.98 | test_vpc_redundant.py
test_05_rvpc_multi_tiers | Error | 341.79 | test_vpc_redundant.py
test_hostha_enable_ha_when_host_in_maintenance | Error | 2.44 | test_hostha_kvm.py

@borisstoyanov (Contributor)

@Slair1 why not master instead of 4.11?
4.11.1 just went out.

@Slair1 (Contributor, Author) commented Jun 27, 2018

@borisstoyanov I didn't know whether there was going to be another 4.11.x release or not. If we prefer, I can rebase onto master.

@DaanHoogland (Contributor)

@Slair1 this is a bug fix. It can go into 4.11, and if we merge it we'll also merge it forward to master.

@borisstoyanov (Contributor) left a comment:

LGTM

@rohityadavcloud (Member) left a comment:

LGTM. I'm only cautious about the change from restart/reboot to stopping of the agent - this may cause a potential regression or unknown side effects.

@DaanHoogland DaanHoogland merged commit 023dcec into apache:4.11 Aug 20, 2018
@csquire (Contributor) commented Oct 8, 2018

This PR doesn't seem to completely fix the problem (or maybe this is a completely new problem). We installed the RC release with this PR on a test system and were able to get the KVM host marked as Down by using iptables to drop outgoing requests to NFS. My investigation shows that the line `storage = conn.storagePoolLookupByUUIDString(uuid);` blocks indefinitely. So kvmheartbeat.sh is never executed, a host investigation is started, the host with blocked NFS is marked as Down, and finally all VMs on that host are rescheduled, resulting in duplicate VMs.

I pulled a thread dump and found the KVMHAMonitor thread hangs here until NFS is unblocked; I haven't dug any deeper yet.

   java.lang.Thread.State: RUNNABLE
        at com.sun.jna.Native.invokePointer(Native Method)
        at com.sun.jna.Function.invokePointer(Function.java:470)
        at com.sun.jna.Function.invoke(Function.java:404)
        at com.sun.jna.Function.invoke(Function.java:315)
        at com.sun.jna.Library$Handler.invoke(Library.java:212)
        at com.sun.proxy.$Proxy3.virStoragePoolLookupByUUIDString(Unknown Source)
        at org.libvirt.Connect.storagePoolLookupByUUIDString(Unknown Source)
        at com.cloud.hypervisor.kvm.resource.KVMHAMonitor$Monitor.runInContext(KVMHAMonitor.java:95)
        - locked <1afb3370> (a java.util.concurrent.ConcurrentHashMap)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None
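One generic way to keep the heartbeat thread from hanging forever on a call like this is to run the lookup on a worker thread and bound it with a timeout. A minimal sketch under that assumption (the wrapper class and the timeout handling are hypothetical, not existing agent code; only the org.libvirt call matches the trace above):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.libvirt.Connect;
import org.libvirt.StoragePool;

// Sketch: bound the blocking storagePoolLookupByUUIDString() call so a dead
// NFS server surfaces as a heartbeat failure instead of an indefinite hang.
public class PoolLookupWithTimeout {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public StoragePool lookupOrNull(Connect conn, String uuid, long timeoutSeconds) {
        Callable<StoragePool> lookup = () -> conn.storagePoolLookupByUUIDString(uuid);
        Future<StoragePool> future = executor.submit(lookup);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // The native libvirt call may still be blocked; cancel is best effort.
            future.cancel(true);
            return null; // caller treats null as "pool unreachable"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (Exception e) {
            return null; // e.g. LibvirtException wrapped in an ExecutionException
        }
    }
}
```

Even with a guard like this, the blocked worker thread only goes away once libvirt returns, so it treats the symptom rather than the underlying NFS hang.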

@rohityadavcloud (Member)

Good find @csquire, @Slair1 any comments? /cc @PaulAngus @borisstoyanov @DaanHoogland

@csquire (Contributor) commented Oct 8, 2018

Sorry, I misspoke in my last comment (edited to make it correct). The blocked host doesn't reboot; it just gets marked as Down, and the VMs are actually still running on it when duplicate VMs get provisioned. Maybe it's a completely separate issue, but it will still prevent us from using 4.11 in production. EDIT: Actually, it looks like this may have been present before 4.11.

@Slair1 (Contributor, Author) commented Oct 8, 2018

@csquire The VMs should eventually crash on their own since they won't have access to their disks. In our environment, when a host gets marked as Down we get alerted and an engineer can take a look at what is going on. Yes, we also rely on VM HA to decide what to do with the problem host. We make sure lockd is enabled so we don't get into a split-brain situation.

https://libvirt.org/locking-lockd.html
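For context, enabling lockd typically amounts to selecting it as QEMU's lock manager in libvirt's configuration and running the virtlockd service; a minimal sketch (exact lockspace settings are deployment-specific, see the page above):

```
# /etc/libvirt/qemu.conf
lock_manager = "lockd"
```

With leases acquired on the disk images, a second host cannot open the same disk read-write, which is what guards against the split-brain scenario mentioned above.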

@csquire (Contributor) commented Oct 8, 2018

@Slair1 Thanks for the feedback, I will run more tests. The behavior is still incorrect, so I went ahead and submitted issue #2890.

@somejfn commented Oct 17, 2018

@Slair1 Testing with virtlockd, if the host that crashed can't come back, I had to manually clear the locks for VM HA to fire and restart the VM. At that point it's the same as disabling VM HA and manually restarting VMs after a KVM host crash. How is your setup different?

@Slair1 (Contributor, Author) commented Oct 17, 2018

@somejfn Maybe because I also have PR #2474 installed?

@rohityadavcloud (Member)

I shouldn't have taken this on word alone; I tested and could not reproduce the behaviour @csquire reported. I think we should revert the change and let the KVM host reset. VM disk corruption is worse than VM downtime.

rohityadavcloud added a commit to shapeblue/cloudstack that referenced this pull request Oct 30, 2018
On actual testing, I could see that the kvmheartbeat.sh script fails on NFS
server failure and only stops the agent. Any HA VMs could be launched
on different hosts, and recovery of the NFS server could lead to a state
where an HA-enabled VM runs on two hosts, which can potentially cause
disk corruption. In most cases, VM disk corruption will be worse than
VM downtime. I've kept the sleep interval between check rounds but
reduced it to 10s. The change in behaviour was introduced in apache#2722.

Signed-off-by: Rohit Yadav <[email protected]>
@PaulAngus (Member)

I'll try to summarise the scenario so that we're all trying to fix the same thing...

  1. A host cannot write to one of its primary storage pools.
  2. Some of the VMs on that host are on that pool, so their disks have gone read-only, but the VMs are still running.
  3. BUT there may be VMs on other primary storage pools that are absolutely fine.

@PaulAngus (Member)

IMHO I'd say that if a VM on that storage is marked as HA-enabled, it should be powered off and restarted somewhere else, and if it isn't HA-enabled, we shouldn't do anything with the running VM (as it's for the user of the VM to deal with it).
In either case we should probably set the host to 'alert' so an admin can see it and do something about it.
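For illustration, a rough sketch of that policy; the class, the VmInfo type, and the helper methods are hypothetical names, not existing CloudStack APIs:

```java
import java.util.List;

// Hypothetical sketch of the policy described above; none of these names
// correspond to existing CloudStack APIs.
public class ReadOnlyPoolPolicySketch {

    interface VmInfo {
        boolean isHaEnabled();
    }

    public void handleUnwritablePool(String hostId, List<VmInfo> vmsOnFailedPool) {
        for (VmInfo vm : vmsOnFailedPool) {
            if (vm.isHaEnabled()) {
                // HA-enabled VM on the failed pool: stop it here, restart it elsewhere.
                powerOffAndRestartElsewhere(vm);
            }
            // Non-HA VMs are left alone; their owners decide what to do.
        }
        // VMs on the host's other, healthy pools are untouched.
        // In either case the host is flagged so an admin can investigate.
        setHostToAlert(hostId);
    }

    private void powerOffAndRestartElsewhere(VmInfo vm) { /* placeholder */ }

    private void setHostToAlert(String hostId) { /* placeholder */ }
}
```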

rohityadavcloud added a commit that referenced this pull request Oct 30, 2018
On actual testing, I could see that the kvmheartbeat.sh script fails on NFS
server failure and only stops the agent. Any HA VMs could be launched
on different hosts, and recovery of the NFS server could lead to a state
where an HA-enabled VM runs on two hosts, which can potentially cause
disk corruption. In most cases, VM disk corruption will be worse than
VM downtime. I've kept the sleep interval between check rounds but
reduced it to 10s. The change in behaviour was introduced in #2722.

Signed-off-by: Rohit Yadav <[email protected]>
@somejfn commented Oct 30, 2018 via email

@somejfn commented Oct 30, 2018 via email

bernardodemarco pushed a commit to scclouds/cloudstack that referenced this pull request Jul 16, 2025
Fix for the initial dashboard in the projects view

Closes apache#2722

See merge request scclouds/scclouds!1192