
Conversation

@Slair1 (Contributor) commented Jun 25, 2018

Rebase of #2472 onto 4.11

If the KVM heartbeat file can't be written to, the host is rebooted, taking down all VMs running on it. The code does retry 5 times before the reboot, but there is no delay between the retries, so they are effectively 5 simultaneous attempts, which doesn't help. Standard SAN storage HA operations or a quick network blip could cause this reboot to occur.

Some discussions on the dev mailing list revealed that some people are just commenting out the reboot line in their version of the CloudStack source.

A better option would be to have it sleep between tries so there aren't 5 nearly simultaneous attempts. Also, instead of rebooting, the cloudstack-agent could simply be stopped on the host. This causes alerts to be issued, and if the host stays disconnected long enough, depending on the HA code in use, VM HA can handle the host failure.

The built-in reboot of the host seemed drastic.
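For illustration, a minimal sketch of the proposed behaviour (sleep between heartbeat attempts, then stop the agent instead of rebooting). The class and method names and the retry/interval values here are placeholders, not the actual KVMHAMonitor code:

```java
// Illustrative sketch only; not the actual KVMHAMonitor code.
// writeHeartbeat() and stopAgent() are hypothetical placeholders.
public class HeartbeatCheckSketch {

    private static final int MAX_ATTEMPTS = 5;
    private static final long RETRY_SLEEP_MS = 10_000L; // pause between attempts

    public void checkHeartbeat() {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (writeHeartbeat()) {
                return; // storage is reachable, nothing more to do
            }
            if (attempt < MAX_ATTEMPTS) {
                try {
                    // Give a SAN HA failover or a short network blip time to clear
                    // instead of firing all retries back to back.
                    Thread.sleep(RETRY_SLEEP_MS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
        // All attempts failed: stop the agent rather than rebooting the host,
        // so alerts fire and VM HA (where configured) can handle the failure.
        stopAgent();
    }

    private boolean writeHeartbeat() {
        return true; // placeholder for running kvmheartbeat.sh against the pool
    }

    private void stopAgent() {
        // placeholder for stopping the cloudstack-agent service on the host
    }
}
```

In this sketch the failure path never reboots the host; the admin or VM HA decides what happens next.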

@Slair1 (Contributor, Author) commented Jun 25, 2018

Looks like there is an additional issue open (#2657) and a related PR (#2658).

@borisstoyanov (Contributor)

@blueorangutan package

@blueorangutan

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2154

@borisstoyanov (Contributor)

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2827)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30394 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2722-t2827-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 63 look OK, 4 have error(s)
Only failed test results are shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
ContextSuite context=TestDeployVirtioSCSIVM>:setup | Error | 0.00 | test_deploy_virtio_scsi_vm.py
test_03_vpc_privategw_restart_vpc_cleanup | Failure | 1116.27 | test_privategw_acl.py
test_05_rvpc_multi_tiers | Failure | 319.98 | test_vpc_redundant.py
test_05_rvpc_multi_tiers | Error | 341.79 | test_vpc_redundant.py
test_hostha_enable_ha_when_host_in_maintenance | Error | 2.44 | test_hostha_kvm.py

@borisstoyanov (Contributor)

@Slair1 why not master instead of 4.11?
4.11.1 just went out.

@Slair1 (Contributor, Author) commented Jun 27, 2018

@borisstoyanov I didn't know whether there was going to be another 4.11.x release or not. If we prefer, I can rebase onto master.

@DaanHoogland (Contributor)

@Slair1 this is a bug fix. It can go into 4.11, and if we merge it we'll also merge it forward to master.

@borisstoyanov (Contributor) left a comment:

LGTM

@rohityadavcloud (Member) left a comment:

LGTM. I'm only cautious about the change from restart/reboot to stopping of the agent - this may cause a potential regression or unknown side effects.

@DaanHoogland DaanHoogland merged commit 023dcec into apache:4.11 Aug 20, 2018
@csquire (Contributor) commented Oct 8, 2018

This PR doesn't seem to completely fix the problem (or maybe this is a completely new problem). We installed the RC release with this PR on a test system and were able to get the KVM host marked as Down by using iptables to drop outgoing requests to NFS. My investigation shows that the line `storage = conn.storagePoolLookupByUUIDString(uuid);` blocks indefinitely. So kvmheartbeat.sh is never executed, a host investigation is started, the host with blocked NFS is marked as Down, and finally all VMs on that host are rescheduled, resulting in duplicate VMs.

I pulled a thread dump and found the KVMHAMonitor thread hangs here until NFS is unblocked; I haven't dug any deeper yet.

   java.lang.Thread.State: RUNNABLE
        at com.sun.jna.Native.invokePointer(Native Method)
        at com.sun.jna.Function.invokePointer(Function.java:470)
        at com.sun.jna.Function.invoke(Function.java:404)
        at com.sun.jna.Function.invoke(Function.java:315)
        at com.sun.jna.Library$Handler.invoke(Library.java:212)
        at com.sun.proxy.$Proxy3.virStoragePoolLookupByUUIDString(Unknown Source)
        at org.libvirt.Connect.storagePoolLookupByUUIDString(Unknown Source)
        at com.cloud.hypervisor.kvm.resource.KVMHAMonitor$Monitor.runInContext(KVMHAMonitor.java:95)
        - locked <1afb3370> (a java.util.concurrent.ConcurrentHashMap)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None
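One generic way to keep the heartbeat thread from hanging forever on a call like this is to run the lookup on a worker thread and bound it with a timeout. A minimal sketch under that assumption (the wrapper class and the timeout handling are hypothetical, not existing agent code; only the org.libvirt call matches the trace above):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.libvirt.Connect;
import org.libvirt.StoragePool;

// Sketch: bound the blocking storagePoolLookupByUUIDString() call so a dead
// NFS server surfaces as a heartbeat failure instead of an indefinite hang.
public class PoolLookupWithTimeout {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public StoragePool lookupOrNull(Connect conn, String uuid, long timeoutSeconds) {
        Callable<StoragePool> lookup = () -> conn.storagePoolLookupByUUIDString(uuid);
        Future<StoragePool> future = executor.submit(lookup);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // The native libvirt call may still be blocked; cancel is best effort.
            future.cancel(true);
            return null; // caller treats null as "pool unreachable"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (Exception e) {
            return null; // e.g. LibvirtException wrapped in an ExecutionException
        }
    }
}
```

Even with a guard like this, the blocked worker thread only goes away once libvirt returns, so it treats the symptom rather than the underlying NFS hang.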

@rohityadavcloud (Member)

Good find @csquire, @Slair1 any comments? /cc @PaulAngus @borisstoyanov @DaanHoogland

@csquire (Contributor) commented Oct 8, 2018

Sorry, I misspoke in my last comment (edited to make it correct). The blocked host doesn't reboot; it just gets marked as Down, and the VMs are actually still running on it when duplicate VMs get provisioned. Maybe it's a completely separate issue, but it will still prevent us from using 4.11 in production. EDIT: Actually, it looks like this may have been present before 4.11.

@Slair1 (Contributor, Author) commented Oct 8, 2018

@csquire The VMs should eventually crash on their own since they won't have access to their disks. In our environment, when a host gets marked as Down we get alerted and an engineer can take a look at what is going on. Yes, we also rely on VM HA to decide what to do with the problem host. We make sure lockd is enabled so we don't get into a split-brain situation.

https://libvirt.org/locking-lockd.html
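For context, enabling lockd typically amounts to selecting it as QEMU's lock manager in libvirt's configuration and running the virtlockd service; a minimal sketch (exact lockspace settings are deployment-specific, see the page above):

```
# /etc/libvirt/qemu.conf
lock_manager = "lockd"
```

With leases acquired on the disk images, a second host cannot open the same disk read-write, which is what guards against the split-brain scenario mentioned above.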

@csquire (Contributor) commented Oct 8, 2018

@Slair1 Thanks for the feedback, I will run more tests. The behavior is still incorrect, so I went ahead and submitted issue #2890.

@somejfn commented Oct 17, 2018

@Slair1 Testing with virtlockd, if the host that crashed can't come back, I had to manually clear the locks for VM HA to fire and restart the VM. At that point it's the same as disabling VM HA and manually restarting VMs after a KVM host crash. How is your setup different?

@Slair1 (Contributor, Author) commented Oct 17, 2018

@somejfn Maybe because I also have PR #2474 installed?

@rohityadavcloud (Member)

I shouldn't have taken this on word alone; I tested and could not reproduce the behaviour @csquire reported. I think we should revert the change and let the KVM host reset. VM disk corruption is worse than VM downtime.

rohityadavcloud added a commit to shapeblue/cloudstack that referenced this pull request Oct 30, 2018
On actual testing, I could see that the kvmheartbeat.sh script fails on NFS
server failure and only stops the agent. Any HA VMs could be launched
on different hosts, and recovery of the NFS server could lead to a state
where an HA-enabled VM runs on two hosts, which can potentially cause
disk corruption. In most cases, VM disk corruption will be worse than
VM downtime. I've kept the sleep interval between check rounds but
reduced it to 10s. The change in behaviour was introduced in apache#2722.

Signed-off-by: Rohit Yadav <[email protected]>
@PaulAngus (Member)

I'll try to summarise the scenario so that we're all trying to fix the same thing...

  1. A host cannot write to one of its primary storage pools.
  2. Some of the VMs on that host are on that pool, so their disks have gone read-only, but the VMs are still running.
  3. BUT there may be VMs on other primary storage pools that are absolutely fine.

@PaulAngus (Member)

IMHO I'd say that if a VM on that storage is marked as HA-enabled, it should be powered off and restarted somewhere else, and if it isn't HA-enabled, we shouldn't do anything with the running VM (as it's for the user of the VM to deal with it).
In either case we should probably set the host to 'alert' so an admin can see it and do something about it.
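For illustration, a rough sketch of that policy; the class, the VmInfo type, and the helper methods are hypothetical names, not existing CloudStack APIs:

```java
import java.util.List;

// Hypothetical sketch of the policy described above; none of these names
// correspond to existing CloudStack APIs.
public class ReadOnlyPoolPolicySketch {

    interface VmInfo {
        boolean isHaEnabled();
    }

    public void handleUnwritablePool(String hostId, List<VmInfo> vmsOnFailedPool) {
        for (VmInfo vm : vmsOnFailedPool) {
            if (vm.isHaEnabled()) {
                // HA-enabled VM on the failed pool: stop it here, restart it elsewhere.
                powerOffAndRestartElsewhere(vm);
            }
            // Non-HA VMs are left alone; their owners decide what to do.
        }
        // VMs on the host's other, healthy pools are untouched.
        // In either case the host is flagged so an admin can investigate.
        setHostToAlert(hostId);
    }

    private void powerOffAndRestartElsewhere(VmInfo vm) { /* placeholder */ }

    private void setHostToAlert(String hostId) { /* placeholder */ }
}
```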

rohityadavcloud added a commit that referenced this pull request Oct 30, 2018
On actual testing, I could see that the kvmheartbeat.sh script fails on NFS
server failure and only stops the agent. Any HA VMs could be launched
on different hosts, and recovery of the NFS server could lead to a state
where an HA-enabled VM runs on two hosts, which can potentially cause
disk corruption. In most cases, VM disk corruption will be worse than
VM downtime. I've kept the sleep interval between check rounds but
reduced it to 10s. The change in behaviour was introduced in #2722.

Signed-off-by: Rohit Yadav <[email protected]>
@somejfn commented Oct 30, 2018 via email

@somejfn commented Oct 30, 2018 via email

bernardodemarco pushed a commit to scclouds/cloudstack that referenced this pull request Jul 16, 2025
Fix for the initial dashboard in the projects view

Closes apache#2722

See merge request scclouds/scclouds!1192