CLOUDSTACK-10310 Fix KVM reboot on storage issue #2722
Conversation
@blueorangutan package
@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.
Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2154
@blueorangutan test
@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests.
Trillian test result (tid-2827)
@Slair1 why not master instead of 4.11?
@borisstoyanov I didn't know if there was going to be another 4.11.x release or not. If we prefer, I can rebase to master?
@Slair1 this is a bug fix. It can go into 4.11, and if we merge it we also merge forward to master.
borisstoyanov
left a comment
LGTM
rohityadavcloud
left a comment
LGTM. I'm only cautious about the change from restarting/rebooting to stopping the agent - this may cause a potential regression or unknown side effects.
This PR doesn't seem to completely fix the problem (or maybe this is a completely new problem). We installed the RC release with this PR on a test system and are able to get the KVM host marked as Down. I pulled a thread dump and found the KVMHAMonitor thread will hang here until NFS is unblocked; I didn't dig any deeper yet, though.
Good find @csquire, @Slair1 any comments? /cc @PaulAngus @borisstoyanov @DaanHoogland
Sorry, I misspoke in my last comment (edited to make it correct). The blocked host doesn't reboot, it just gets marked as Down.
@csquire The VMs should eventually crash themselves since they won't have access to their disks. In our environment when a host gets marked as down we get alerted and an engineer can take a look at what is going on. Yes, we rely on VM HA to decide what to do with the problem host also. We make sure lockd is enabled so we don't get a split-brain situation.
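For context, a minimal sketch of the virtlockd setup being referred to, assuming a stock libvirt install on CentOS 7; exact paths and service handling can differ per distro, so treat this as an illustration rather than the configuration actually used here.

```bash
# Illustration only: enable libvirt's lock daemon so each disk image can be
# opened by at most one QEMU process, which is what guards against the
# split-brain scenario discussed above. Verify paths/services on your distro.
echo 'lock_manager = "lockd"' >> /etc/libvirt/qemu.conf
systemctl enable --now virtlockd
systemctl restart libvirtd
```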
@Slair1 Testing with virtlockd, if the host that crashed can't come back, I had to manually clear the locks for VM HA to fire and restart the VM. At that point it's the same as disabling VM HA and manually restarting VMs after a KVM host crash. How is your setup different?
I shouldn't have taken this on word alone; I tested and could not reproduce the behaviour @csquire reported. I think we should revert the change and let the KVM host reset. VM disk corruption is worse than VM downtime.
On actual testing, I could see that the kvmheartbeat.sh script fails on NFS server failure and only stops the agent. Any HA VMs could be launched on different hosts, and recovery of the NFS server could lead to a state where an HA-enabled VM runs on two hosts, potentially causing disk corruption. In most cases, VM disk corruption will be worse than VM downtime. I've kept the sleep interval between checks/rounds but reduced it to 10s. The change in behaviour was introduced in apache#2722. Signed-off-by: Rohit Yadav <[email protected]>
I'll try to summarise the scenario so that we're all trying to fix the same thing...
IMHO I'd say if a VM on that storage is marked as HA-enabled, it should be powered off and restarted somewhere else, and if it isn't HA-enabled, we shouldn't do anything with the running VM (as it's for the user of the VM to deal with it); in either case we should probably set the host to 'alert' so that an admin can see it and do something about it.
One precision about #2... With primary storage on NFS hard mounts, VMs don't go read-only (tested with OL5/6/7) and will resume writing to disk once the NFS server becomes available again, even after a 25-minute outage.
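For illustration only (not part of the original reply): a hard NFS mount along these lines is what makes guest I/O block and retry during an outage instead of erroring out; the server name and paths below are made up.

```bash
# Hypothetical example of mounting NFS primary storage with the "hard" option
# (the NFS default). During an NFS outage, writes from the guests block and
# retry rather than fail, so guest filesystems need not remount read-only.
mount -t nfs -o hard,nfsvers=3 nfs.example.com:/export/primary /mnt/primary
```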
…On Tue, Oct 30, 2018 at 5:27 AM Paul Angus wrote:
I'll try to summarise the scenario so that we're all trying to fix the
same thing...
1. A host cannot write to one of its primary storage pools.
2. Some of the VMs on that host are on that pool, so their disks have gone read-only, but the VMs are still running.
3. BUT there may be VMs on other primary storage pools that are absolutely fine.
One problem is that while NFS is unavailable, you won't be able to destroy the VM... libvirt will just hang. So if you attempt to destroy the VM and start a new one, when the NFS service comes back online you will get the duplicate VM. That's why I would rather just wait for the NFS issue to go away rather than fire VM-HA in that case.
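To make the hang concrete, a small illustration of my own (not from this PR): bounding the destroy call with timeout(1) shows that libvirt simply blocks while the NFS backend is unreachable; the domain name is a placeholder.

```bash
# Illustration only: while NFS-backed storage is unreachable, "virsh destroy"
# can block indefinitely. Wrapping it in timeout(1) surfaces the hang instead
# of leaving the shell stuck. "i-2-123-VM" is a made-up domain name.
timeout 60 virsh destroy i-2-123-VM \
  || echo "destroy did not complete - libvirt appears blocked on the NFS outage"
```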
…On Tue, Oct 30, 2018 at 5:32 AM Paul Angus wrote:
IMHO I'd say if a VM on that storage is marked as ha-enabled, it should be powered off and restarted somewhere else, and if it isn't HA enabled, we shouldn't do anything with the running VM (as it's for the user of the VM to deal with it); in either case we should probably set the host to 'alert' so that an admin can see it and do something about it.
Fix for the initial dashboard in the projects view. Closes apache#2722. See merge request scclouds/scclouds!1192
Rebase of #2472 onto 4.11
If the KVM heartbeat file can't be written to, the host is rebooted, taking down all VMs running on it. The code does retry 5 times before the reboot, but there is no delay between the retries, so they are 5 nearly simultaneous attempts, which doesn't help. Standard SAN storage HA operations or a quick network blip could cause this reboot to occur.
Some discussions on the dev mailing list revealed that some people are just commenting out the reboot line in their version of the CloudStack source.
A better option would be to have it sleep between tries so they aren't 5 nearly simultaneous attempts. Plus, instead of rebooting, the cloudstack-agent could just be stopped on the host. This will cause alerts to be issued, and if the host is disconnected long enough, depending on the HA code in use, VM HA could handle the host failure.
The built-in reboot of the host seemed drastic.
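To make the proposed behaviour concrete, here is a rough shell sketch based purely on the description above; the script path, retry count, and sleep interval are placeholders and not taken from the actual patch.

```bash
#!/bin/sh
# Sketch of the behaviour described in this PR: retry the heartbeat write with
# a pause between attempts, and stop the cloudstack-agent instead of rebooting
# the host when every attempt fails. Values and paths are illustrative.
RETRIES=5
SLEEP_SECS=10

write_heartbeat() {
    # Stand-in for the real kvmheartbeat.sh invocation and its arguments.
    /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh "$@"
}

attempt=1
while [ "$attempt" -le "$RETRIES" ]; do
    if write_heartbeat; then
        exit 0                    # heartbeat written, nothing more to do
    fi
    attempt=$((attempt + 1))
    sleep "$SLEEP_SECS"           # let a SAN failover or network blip clear
done

# All retries failed: stopping the agent raises alerts and lets VM HA (if
# configured) deal with the host, instead of the drastic built-in reboot.
systemctl stop cloudstack-agent
```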