-
Notifications
You must be signed in to change notification settings - Fork 1.2k
kvm: Aqcuire lock when running security group Python script #1408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Is 10 seconds enough time to wait for the lock to release? Also, might want to have the loop print status that it's waiting for lock to release... |
|
@jlk What I've seen is that it all happens in a 2 to 3 second window. Usually the security group script doesn't run for a very long time, so that made me think 10 seconds was enough. For safety reasons I'll bump it to 15 seconds and wait 500ms between each operation and log while waiting for the lock to be released. |
7e93f33 to
e0656c1
Compare
|
I updated the code and tested it. We came across this mainly when Terraform was being used. Just checked the logs of two hypervisors: 2016-02-10 10:00:33,519 - Lock on /var/lock/cloudstack_security_group.lock is being held by other process. Waiting for release. 2016-02-10 09:58:39,187 - Lock on /var/lock/cloudstack_security_group.lock is being held by other process. Waiting for release. They only had to wait one cycle of 500ms before the lock was released. |
|
Cool - LGTM! |
|
To update, we ran more tests on our production platform spread out over 60 hypervisors we saw these lock message pop up in various occasions. Internally we use Terraform a lot and that seems to hammer on the API |
scripts/vm/network/security_group.py
Outdated
| global lock_handle | ||
| lock_handle = open(path, 'w') | ||
| try: | ||
| fcntl.lockf(lock_handle, fcntl.LOCK_EX | fcntl.LOCK_NB) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename the method to reflect what it's doing; file_is_locked sounds like we're checking if a lock exists, whereas it's actually trying to create a lock.
Any change you can use a with pattern, such as -- with (lock): do-stuff instead of if do this?
And how about we have a self-expiring locks, like here: http://stackoverflow.com/questions/5255220/fcntl-flock-how-to-implement-a-timeout ?
f6629f8 to
d0df0c3
Compare
|
Can you take another look @rhtyd ? It works on our systems in production |
| cmd = args[0] | ||
| logging.debug("Executing command: " + str(cmd)) | ||
|
|
||
| for i in range(0, 30): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be refactored separately as an acquire lock method that may optionally accept a timeout
|
@wido in general alright, I've left few comments that can help us improve this |
d0df0c3 to
b94f636
Compare
|
@rhtyd Changed it to a with statement for Pythonic. I'd rather not use a external lib since with Basic Networking and KVM this runs on the hypervisor. |
|
@wido thanks, can you comment on how an acquired lock is released or it's not critical to release the lock at all? |
|
@rhtyd The lock does not have to be released. It is done when the script exists. This is usually within 100ms or so. |
|
@wido I tested your original PR and the current updated PR, the locking never happened on my test env (Ubuntu x64). I read the manpage, and found that when you open a file twice I wrote a test program to demonstrate this: (for testing you may replace flock with lockf and see for yourself that it does not lock at all) So, my suggestion is to:
|
|
@rhtyd Hmm, that's odd. My logs showed that the lock prevented the script from running twice and our problems were also solved with this. Before we had race conditions on iptables programming, that is resolved with this patched version. |
It could happen that when multiple instances are starting at the same time on a KVM host the Agent spawns multiple instances of security_group.py which both try to modify iptables/ebtables rules. This fails with on of the two processes failing. The instance is still started, but it doesn't have any IP connectivity due to the failed programming of the security groups. This modification lets the script aqcuire a exclusive lock on a file so that only one instance of the scripts talks to iptables/ebtables at once. Other instances of the script which start will poll every 500ms if they can obtain the lock and otherwise execute anyway after 15 seconds. The lock will be released as soon as the script exists, which is usually within a few hundred ms.
b94f636 to
26becef
Compare
|
@rhtyd: I tested and you are partially right. My test script: https://gist.github.com/wido/a236e9f065cef85b0497990e3d4798e6 When using the 'with open() as X' syntax it doesn't work. My Gist however does if I run it twice: In terminal #1: wido@wido-desktop:~/Desktop$ ./lock-test.py Obtained lock In terminal #2: wido@wido-desktop:~/Desktop$ ./lock-test.py Failed to obtain lock! wido@wido-desktop:~/Desktop$ I updated the PR again with this change. |
|
@wido +1 |
|
reviewed updated PR, LGTM |
|
@wido is this code currently deployed and working in your prod environment? @rhtyd any suggestions for validating this code before merge? Have you tested this? |
|
@swill I've validated the lock logic in a gist above, I've tested the elusiveness of critical code that runs when a lock is acquired; security groups is implemented using ebtables (rules on bridges) therefore this script is run at hypervisor host (kvm in this case, by the agent) -- the changes ensure that no two executions of this script are allowed at the same time as for prod testing, @wido can you comment on testing the changes in this PR, thanks |
|
@swill and @rhtyd: The code in the PR is currently in production at PCextreme and running fine. With a 'with' statement in Python I couldn't get a proper lock. But that's not really a functional change, it's merely a more Python-way of writing code. |
|
I am pretty comfortable with this because @wido is already using this. I don't have an SG environment setup right now for testing, so I would have to build one specifically to test this. Do you guys agree this is ready to merge? |
|
@swill no objection, let's merge this tag:mergeready |
kvm: Aqcuire lock when running security group Python scriptIt could happen that when multiple instances are starting at the same time on a KVM host the Agent spawns multiple instances of security_group.py which both try to modify iptables/ebtables rules. This fails with on of the two processes failing. The instance is still started, but it doesn't have any IP connectivity due to the failed programming of the security groups. This modification lets the script aqcuire a exclusive lock on a file so that only one instance of the scripts talks to iptables/ebtables at once. Other instances of the script which start will poll every 500ms if they can obtain the lock and otherwise execute anyway after 15 seconds. * pr/1408: kvm: Aqcuire lock when running security group Python script Signed-off-by: Will Stevens <[email protected]>
It could happen that when multiple instances are starting at the same
time on a KVM host the Agent spawns multiple instances of security_group.py
which both try to modify iptables/ebtables rules.
This fails with on of the two processes failing.
The instance is still started, but it doesn't have any IP connectivity due
to the failed programming of the security groups.
This modification lets the script aqcuire a exclusive lock on a file so that
only one instance of the scripts talks to iptables/ebtables at once.
Other instances of the script which start will poll every 500ms if they can
obtain the lock and otherwise execute anyway after 15 seconds.