
Conversation

@wido (Contributor) commented Feb 9, 2016

It could happen that when multiple instances are starting at the same
time on a KVM host the Agent spawns multiple instances of security_group.py
which both try to modify iptables/ebtables rules.

This fails, with one of the two processes erroring out.

The instance is still started, but it doesn't have any IP connectivity due
to the failed programming of the security groups.

This modification lets the script acquire an exclusive lock on a file so that
only one instance of the script talks to iptables/ebtables at once.

Other instances of the script will poll every 500ms to see if they can
obtain the lock, and otherwise execute anyway after 15 seconds.
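
As an illustration only, the behaviour described above could be sketched roughly like this in Python; the lock path comes from the log output quoted later in this thread, the helper names are made up, and flock is used because that is what the discussion below eventually settles on:

import fcntl
import logging
import time

# Path as seen in the agent logs quoted later in this thread
LOCK_FILE = "/var/lock/cloudstack_security_group.lock"
MAX_ATTEMPTS = 30        # 30 attempts x 500ms = 15 seconds
RETRY_INTERVAL = 0.5     # poll every 500ms

def obtain_lock():
    """Try to take an exclusive, non-blocking lock on LOCK_FILE.
    Returns the open handle on success, or None if another process holds it."""
    handle = open(LOCK_FILE, 'w')
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return handle
    except IOError:
        handle.close()
        return None

def run_with_lock(work):
    """Poll for the lock every 500ms; after 15 seconds, run anyway."""
    handle = None
    for _ in range(MAX_ATTEMPTS):
        handle = obtain_lock()
        if handle is not None:
            break
        logging.info("Lock on %s is being held by other process. "
                     "Waiting for release.", LOCK_FILE)
        time.sleep(RETRY_INTERVAL)
    try:
        work()   # e.g. program the iptables/ebtables rules
    finally:
        if handle is not None:
            handle.close()   # releases the lock; it is also released on process exit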

@jlk commented Feb 9, 2016

Is 10 seconds enough time to wait for the lock to release?

Also, you might want to have the loop print a status message that it's waiting for the lock to release...

@wido (Contributor, Author) commented Feb 10, 2016

@jlk What I've seen is that it all happens in a 2 to 3 second window. Usually the security group script doesn't run for a very long time, so that made me think 10 seconds was enough.

For safety I'll bump it to 15 seconds, wait 500ms between attempts, and log while waiting for the lock to be released.

@wido force-pushed the security-group-lock branch from 7e93f33 to e0656c1 on February 10, 2016 08:08
@wido (Contributor, Author) commented Feb 10, 2016

I updated the code and tested it. We came across this mainly when Terraform was being used.

Just checked the logs of two hypervisors:

2016-02-10 10:00:33,519 - Lock on /var/lock/cloudstack_security_group.lock is being held by other process. Waiting for release.
2016-02-10 09:58:39,187 - Lock on /var/lock/cloudstack_security_group.lock is being held by other process. Waiting for release.

They only had to wait one cycle of 500ms before the lock was released.

@jlk commented Feb 10, 2016

Cool - LGTM!

@wido (Contributor, Author) commented Feb 10, 2016

To update: we ran more tests on our production platform, spread out over 60 hypervisors, and saw these lock messages pop up on various occasions.

Internally we use Terraform a lot and that seems to hammer the API.

global lock_handle
lock_handle = open(path, 'w')
try:
    fcntl.lockf(lock_handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
A Member commented on this diff:

Please rename the method to reflect what it's doing; file_is_locked sounds like we're checking if a lock exists, whereas it's actually trying to create a lock.

Any chance you can use a with pattern, such as with lock: do-stuff, instead of the if-based flow?
And how about we have self-expiring locks, like here: http://stackoverflow.com/questions/5255220/fcntl-flock-how-to-implement-a-timeout ?
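
For reference, a 'with'-style, self-expiring lock along the lines of that Stack Overflow answer could look roughly like the sketch below; locked() and program_rules() are made-up names and this is not code from the PR:

import fcntl
import signal
from contextlib import contextmanager

@contextmanager
def locked(path, timeout=15):
    """Acquire an exclusive flock on path, giving up after timeout seconds."""
    def _timed_out(signum, frame):
        raise RuntimeError("could not lock %s within %d seconds" % (path, timeout))

    handle = open(path, 'w')
    old_handler = signal.signal(signal.SIGALRM, _timed_out)
    signal.alarm(timeout)
    try:
        # Blocking flock; interrupted by SIGALRM if the timeout expires
        fcntl.flock(handle, fcntl.LOCK_EX)
    except Exception:
        handle.close()
        raise
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)
    try:
        yield handle
    finally:
        handle.close()   # closing the handle releases the lock

# usage sketch:
# with locked("/var/lock/cloudstack_security_group.lock"):
#     program_rules()   # hypothetical: the iptables/ebtables work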

@wido force-pushed the security-group-lock branch 2 times, most recently from f6629f8 to d0df0c3, on April 21, 2016 07:41
@wido (Contributor, Author) commented Apr 26, 2016

Can you take another look @rhtyd ?

It works on our systems in production

cmd = args[0]
logging.debug("Executing command: " + str(cmd))

for i in range(0, 30):
A Member commented on this diff:

this can be refactored separately as an acquire lock method that may optionally accept a timeout
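
A sketch of such a refactor might look like this; the function name and signature are illustrative, not the merged code:

import fcntl
import time

def acquire_lock(path, timeout=None, interval=0.5):
    """Take an exclusive lock on path, optionally retrying every
    `interval` seconds until `timeout` seconds have passed.
    Returns the open handle on success, or None on failure/timeout."""
    handle = open(path, 'w')
    deadline = None if timeout is None else time.time() + timeout
    while True:
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return handle
        except IOError:
            if deadline is None or time.time() >= deadline:
                handle.close()
                return None
            time.sleep(interval)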

@rohityadavcloud (Member) commented

@wido in general this is alright; I've left a few comments that can help us improve this.
For master, it looks like we'll need to build/publish a new 4.9+ systemvm template -- in that case, would you prefer to include a library that we can use here, like filelock (https://pypi.python.org/pypi/filelock/), or something else?
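
For comparison, a filelock-based variant could look roughly like this sketch (not part of this PR):

from filelock import FileLock, Timeout

lock = FileLock("/var/lock/cloudstack_security_group.lock")
try:
    with lock.acquire(timeout=15):
        pass   # program the iptables/ebtables rules here
except Timeout:
    # lock not obtained within 15 seconds; decide whether to run anyway
    pass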

@wido force-pushed the security-group-lock branch from d0df0c3 to b94f636 on April 26, 2016 19:59
@wido (Contributor, Author) commented Apr 26, 2016

@rhtyd Changed it to a with statement to be more Pythonic.

I'd rather not use an external lib, since with Basic Networking and KVM this runs on the hypervisor.

@rohityadavcloud (Member) commented

@wido thanks, can you comment on how an acquired lock is released, or whether it's not critical to release the lock at all?

@wido (Contributor, Author) commented Apr 27, 2016

@rhtyd The lock does not have to be released. It is released when the script exits, which is usually within 100ms or so.

@rohityadavcloud (Member) commented Apr 27, 2016

@wido I tested your original PR and the current updated PR, the locking never happened on my test env (Ubuntu x64).

I read the manpage and found that when you open a file twice, flock treats both FDs independently, while fcntl (the lockf usage) treats them as the same FD. This means that when two processes run the Python script at the same time, no locking would actually occur; we don't get the advisory exclusive lock we want.

I wrote a test program to demonstrate this: (for testing you may replace flock with lockf and see for yourself that it does not lock at all)
https://gist.github.com/rhtyd/5d9983e7d58311ba8e517eefd931a507

So, my suggestion is to:

  • use flock instead of lockf; assuming (as is likely) the file lives on a local filesystem, we can use flock with no issues
  • use the previously used plain open instead of with open, since the with open syntax closes the file when the statements under it return (see the sketch after this list)
  • at the end, optionally call lock_file.close(); otherwise, on exit, the file should be GC-ed and closed
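
Taken together, those three suggestions amount to something like the following sketch (illustrative only, not the exact patch):

import fcntl

LOCK_FILE = "/var/lock/cloudstack_security_group.lock"

# Plain open() so the handle (and the lock with it) stays alive for the
# rest of the script instead of being closed when a 'with' block ends
lock_file = open(LOCK_FILE, 'w')
try:
    # flock() instead of lockf(): FDs opened by separate processes
    # contend for the same advisory lock on a local filesystem
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    have_lock = True
except IOError:
    have_lock = False

# ... program the iptables/ebtables rules ...

# Optional: release explicitly; otherwise the lock goes away when the
# process exits and the handle is closed/GC-ed
lock_file.close()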

@wido (Contributor, Author) commented Apr 28, 2016

@rhtyd Hmm, that's odd. My logs showed that the lock prevented the script from running twice and our problems were also solved with this.

Before, we had race conditions in iptables programming; that is resolved with this patched version.

It could happen that when multiple instances are starting at the same
time on a KVM host the Agent spawns multiple instances of security_group.py
which both try to modify iptables/ebtables rules.

This fails, with one of the two processes erroring out.

The instance is still started, but it doesn't have any IP connectivity due
to the failed programming of the security groups.

This modification lets the script acquire an exclusive lock on a file so that
only one instance of the script talks to iptables/ebtables at once.

Other instances of the script will poll every 500ms to see if they can
obtain the lock, and otherwise execute anyway after 15 seconds.

The lock will be released as soon as the script exits, which is usually within
a few hundred ms.
@wido force-pushed the security-group-lock branch from b94f636 to 26becef on April 28, 2016 08:16
@wido (Contributor, Author) commented Apr 28, 2016

@rhtyd: I tested and you are partially right. My test script: https://gist.github.com/wido/a236e9f065cef85b0497990e3d4798e6

When using the 'with open() as X' syntax it doesn't work. My gist, however, does work if I run it twice:

In terminal #1:

wido@wido-desktop:~/Desktop$ ./lock-test.py 
Obtained lock

In terminal #2:

wido@wido-desktop:~/Desktop$ ./lock-test.py 
Failed to obtain lock!
wido@wido-desktop:~/Desktop$

I updated the PR again with this change.
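
For readers without access to the gist, a minimal test along those lines (not the gist itself; the /tmp path is made up) could look like:

#!/usr/bin/env python
import fcntl
import time

# Run this in two terminals: the first prints "Obtained lock" and holds it,
# the second should print "Failed to obtain lock!"
handle = open("/tmp/lock-test.lock", 'w')
try:
    fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("Obtained lock")
    time.sleep(60)   # hold the lock so a second invocation can observe it
except IOError:
    print("Failed to obtain lock!")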

@rohityadavcloud (Member) commented

@wido +1

@rohityadavcloud (Member) commented

reviewed updated PR, LGTM
cc @swill

@swill (Contributor) commented May 2, 2016

@wido is this code currently deployed and working in your prod environment? @rhtyd any suggestions for validating this code before merge? Have you tested this?

@rohityadavcloud (Member) commented

@swill I've validated the lock logic in the gist above and tested the exclusiveness of the critical code that runs when a lock is acquired. Security groups are implemented using ebtables (rules on bridges), therefore this script runs on the hypervisor host (KVM in this case, invoked by the agent) -- the changes ensure that no two executions of this script are allowed at the same time.

As for prod testing, @wido can you comment on testing the changes in this PR? Thanks.

@wido (Contributor, Author) commented May 2, 2016

@swill and @rhtyd: The code in the PR is currently in production at PCextreme and running fine.

With a 'with' statement in Python I couldn't get a proper lock. But that's not really a functional change; it's merely a more Pythonic way of writing code.

@swill (Contributor) commented May 2, 2016

I am pretty comfortable with this because @wido is already using it. I don't have an SG environment set up right now for testing, so I would have to build one specifically to test this. Do you guys agree this is ready to merge?

@rohityadavcloud (Member) commented

@swill no objection, let's merge this

tag:mergeready

@asfgit merged commit 26becef into apache:master on May 4, 2016
asfgit pushed a commit that referenced this pull request May 4, 2016
kvm: Aqcuire lock when running security group Python script

It could happen that when multiple instances are starting at the same
time on a KVM host the Agent spawns multiple instances of security_group.py
which both try to modify iptables/ebtables rules.

This fails, with one of the two processes erroring out.

The instance is still started, but it doesn't have any IP connectivity due
to the failed programming of the security groups.

This modification lets the script acquire an exclusive lock on a file so that
only one instance of the script talks to iptables/ebtables at once.

Other instances of the script will poll every 500ms to see if they can
obtain the lock, and otherwise execute anyway after 15 seconds.

* pr/1408:
  kvm: Aqcuire lock when running security group Python script

Signed-off-by: Will Stevens <[email protected]>