
Conversation

@wido (Contributor) commented Feb 9, 2016

It could happen that when multiple instances are starting at the same
time on a KVM host the Agent spawns multiple instances of security_group.py
which both try to modify iptables/ebtables rules.

This fails, with one of the two processes erroring out.

The instance is still started, but it doesn't have any IP connectivity due
to the failed programming of the security groups.

This modification lets the script acquire an exclusive lock on a file so that
only one instance of the script talks to iptables/ebtables at once.

Other instances of the script will poll every 500ms to see if they can
obtain the lock, and otherwise execute anyway after 15 seconds.
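
As an illustration only, the behaviour described above could be sketched roughly like this in Python; the lock path comes from the log output quoted later in this thread, the helper names are made up, and flock is used because that is what the discussion below eventually settles on:

import fcntl
import logging
import time

# Path as seen in the agent logs quoted later in this thread
LOCK_FILE = "/var/lock/cloudstack_security_group.lock"
MAX_ATTEMPTS = 30        # 30 attempts x 500ms = 15 seconds
RETRY_INTERVAL = 0.5     # poll every 500ms

def obtain_lock():
    """Try to take an exclusive, non-blocking lock on LOCK_FILE.
    Returns the open handle on success, or None if another process holds it."""
    handle = open(LOCK_FILE, 'w')
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return handle
    except IOError:
        handle.close()
        return None

def run_with_lock(work):
    """Poll for the lock every 500ms; after 15 seconds, run anyway."""
    handle = None
    for _ in range(MAX_ATTEMPTS):
        handle = obtain_lock()
        if handle is not None:
            break
        logging.info("Lock on %s is being held by other process. "
                     "Waiting for release.", LOCK_FILE)
        time.sleep(RETRY_INTERVAL)
    try:
        work()   # e.g. program the iptables/ebtables rules
    finally:
        if handle is not None:
            handle.close()   # releases the lock; it is also released on process exit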

@jlk commented Feb 9, 2016

Is 10 seconds enough time to wait for the lock to release?

Also, you might want to have the loop print a status message that it's waiting for the lock to release...

@wido (Contributor, Author) commented Feb 10, 2016

@jlk What I've seen is that it all happens in a 2 to 3 second window. Usually the security group script doesn't run for a very long time, so that made me think 10 seconds was enough.

For safety I'll bump it to 15 seconds, wait 500ms between attempts, and log while waiting for the lock to be released.

@wido force-pushed the security-group-lock branch from 7e93f33 to e0656c1 on February 10, 2016 08:08
@wido (Contributor, Author) commented Feb 10, 2016

I updated the code and tested it. We came across this mainly when Terraform was being used.

Just checked the logs of two hypervisors:

2016-02-10 10:00:33,519 - Lock on /var/lock/cloudstack_security_group.lock is being held by other process. Waiting for release.
2016-02-10 09:58:39,187 - Lock on /var/lock/cloudstack_security_group.lock is being held by other process. Waiting for release.

They only had to wait one cycle of 500ms before the lock was released.

@jlk commented Feb 10, 2016

Cool - LGTM!

@wido (Contributor, Author) commented Feb 10, 2016

To update: we ran more tests on our production platform, spread out over 60 hypervisors, and saw these lock messages pop up on various occasions.

Internally we use Terraform a lot and that seems to hammer the API.

global lock_handle
lock_handle = open(path, 'w')
try:
    fcntl.lockf(lock_handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
A Member commented on this diff:

Please rename the method to reflect what it's doing; file_is_locked sounds like we're checking if a lock exists, whereas it's actually trying to create a lock.

Any chance you can use a with pattern, such as with lock: do-stuff, instead of the if-based flow?
And how about we have self-expiring locks, like here: http://stackoverflow.com/questions/5255220/fcntl-flock-how-to-implement-a-timeout ?
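
For reference, a 'with'-style, self-expiring lock along the lines of that Stack Overflow answer could look roughly like the sketch below; locked() and program_rules() are made-up names and this is not code from the PR:

import fcntl
import signal
from contextlib import contextmanager

@contextmanager
def locked(path, timeout=15):
    """Acquire an exclusive flock on path, giving up after timeout seconds."""
    def _timed_out(signum, frame):
        raise RuntimeError("could not lock %s within %d seconds" % (path, timeout))

    handle = open(path, 'w')
    old_handler = signal.signal(signal.SIGALRM, _timed_out)
    signal.alarm(timeout)
    try:
        # Blocking flock; interrupted by SIGALRM if the timeout expires
        fcntl.flock(handle, fcntl.LOCK_EX)
    except Exception:
        handle.close()
        raise
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)
    try:
        yield handle
    finally:
        handle.close()   # closing the handle releases the lock

# usage sketch:
# with locked("/var/lock/cloudstack_security_group.lock"):
#     program_rules()   # hypothetical: the iptables/ebtables work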

@wido force-pushed the security-group-lock branch 2 times, most recently from f6629f8 to d0df0c3, on April 21, 2016 07:41
@wido (Contributor, Author) commented Apr 26, 2016

Can you take another look @rhtyd ?

It works on our systems in production

cmd = args[0]
logging.debug("Executing command: " + str(cmd))

for i in range(0, 30):
A Member commented on this diff:

this can be refactored separately as an acquire lock method that may optionally accept a timeout
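
A sketch of such a refactor might look like this; the function name and signature are illustrative, not the merged code:

import fcntl
import time

def acquire_lock(path, timeout=None, interval=0.5):
    """Take an exclusive lock on path, optionally retrying every
    `interval` seconds until `timeout` seconds have passed.
    Returns the open handle on success, or None on failure/timeout."""
    handle = open(path, 'w')
    deadline = None if timeout is None else time.time() + timeout
    while True:
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return handle
        except IOError:
            if deadline is None or time.time() >= deadline:
                handle.close()
                return None
            time.sleep(interval)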

@rohityadavcloud (Member) commented

@wido in general this is alright; I've left a few comments that can help us improve this.
For master, it looks like we'll need to build/publish a new 4.9+ systemvm template -- in that case, would you prefer to include a library that we can use here, like filelock (https://pypi.python.org/pypi/filelock/), or something else?
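
For comparison, a filelock-based variant could look roughly like this sketch (not part of this PR):

from filelock import FileLock, Timeout

lock = FileLock("/var/lock/cloudstack_security_group.lock")
try:
    with lock.acquire(timeout=15):
        pass   # program the iptables/ebtables rules here
except Timeout:
    # lock not obtained within 15 seconds; decide whether to run anyway
    pass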

@wido force-pushed the security-group-lock branch from d0df0c3 to b94f636 on April 26, 2016 19:59
@wido (Contributor, Author) commented Apr 26, 2016

@rhtyd Changed it to a with statement to be more Pythonic.

I'd rather not use an external lib, since with Basic Networking and KVM this runs on the hypervisor.

@rohityadavcloud (Member) commented

@wido thanks, can you comment on how an acquired lock is released, or whether it's not critical to release the lock at all?

@wido (Contributor, Author) commented Apr 27, 2016

@rhtyd The lock does not have to be released. It is released when the script exits, which is usually within 100ms or so.

@rohityadavcloud (Member) commented Apr 27, 2016

@wido I tested your original PR and the current updated PR, the locking never happened on my test env (Ubuntu x64).

I read the manpage and found that when you open a file twice, flock treats both FDs independently, while fcntl (the lockf usage) treats them as the same FD. This means that when two processes run the Python script at the same time, no locking would actually occur; we don't get the advisory exclusive lock we want.

I wrote a test program to demonstrate this: (for testing you may replace flock with lockf and see for yourself that it does not lock at all)
https://gist.github.com/rhtyd/5d9983e7d58311ba8e517eefd931a507

So, my suggestion is to:

  • use flock instead of lockf; assuming (as is likely) the file lives on a local filesystem, we can use flock with no issues
  • use the previously used plain open instead of with open, since the with open syntax closes the file when the statements under it return (see the sketch after this list)
  • at the end, optionally call lock_file.close(); otherwise, on exit, the file should be GC-ed and closed
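
Taken together, those three suggestions amount to something like the following sketch (illustrative only, not the exact patch):

import fcntl

LOCK_FILE = "/var/lock/cloudstack_security_group.lock"

# Plain open() so the handle (and the lock with it) stays alive for the
# rest of the script instead of being closed when a 'with' block ends
lock_file = open(LOCK_FILE, 'w')
try:
    # flock() instead of lockf(): FDs opened by separate processes
    # contend for the same advisory lock on a local filesystem
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    have_lock = True
except IOError:
    have_lock = False

# ... program the iptables/ebtables rules ...

# Optional: release explicitly; otherwise the lock goes away when the
# process exits and the handle is closed/GC-ed
lock_file.close()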

@wido (Contributor, Author) commented Apr 28, 2016

@rhtyd Hmm, that's odd. My logs showed that the lock prevented the script from running twice and our problems were also solved with this.

Before, we had race conditions in iptables programming; that is resolved with this patched version.

It could happen that when multiple instances are starting at the same
time on a KVM host the Agent spawns multiple instances of security_group.py
which both try to modify iptables/ebtables rules.

This fails, with one of the two processes erroring out.

The instance is still started, but it doesn't have any IP connectivity due
to the failed programming of the security groups.

This modification lets the script acquire an exclusive lock on a file so that
only one instance of the script talks to iptables/ebtables at once.

Other instances of the script will poll every 500ms to see if they can
obtain the lock, and otherwise execute anyway after 15 seconds.

The lock will be released as soon as the script exits, which is usually within
a few hundred ms.
@wido force-pushed the security-group-lock branch from b94f636 to 26becef on April 28, 2016 08:16
@wido (Contributor, Author) commented Apr 28, 2016

@rhtyd: I tested and you are partially right. My test script: https://gist.github.com/wido/a236e9f065cef85b0497990e3d4798e6

When using the 'with open() as X' syntax it doesn't work. My gist, however, does work if I run it twice:

In terminal #1:

wido@wido-desktop:~/Desktop$ ./lock-test.py 
Obtained lock

In terminal #2:

wido@wido-desktop:~/Desktop$ ./lock-test.py 
Failed to obtain lock!
wido@wido-desktop:~/Desktop$

I updated the PR again with this change.
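
For readers without access to the gist, a minimal test along those lines (not the gist itself; the /tmp path is made up) could look like:

#!/usr/bin/env python
import fcntl
import time

# Run this in two terminals: the first prints "Obtained lock" and holds it,
# the second should print "Failed to obtain lock!"
handle = open("/tmp/lock-test.lock", 'w')
try:
    fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("Obtained lock")
    time.sleep(60)   # hold the lock so a second invocation can observe it
except IOError:
    print("Failed to obtain lock!")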

@rohityadavcloud (Member) commented

@wido +1

@rohityadavcloud (Member) commented

reviewed updated PR, LGTM
cc @swill

@swill (Contributor) commented May 2, 2016

@wido is this code currently deployed and working in your prod environment? @rhtyd any suggestions for validating this code before merge? Have you tested this?

@rohityadavcloud (Member) commented

@swill I've validated the lock logic in the gist above and tested the exclusiveness of the critical code that runs when a lock is acquired. Security groups are implemented using ebtables (rules on bridges), therefore this script runs on the hypervisor host (KVM in this case, invoked by the agent) -- the changes ensure that no two executions of this script are allowed at the same time.

As for prod testing, @wido can you comment on testing the changes in this PR? Thanks.

@wido (Contributor, Author) commented May 2, 2016

@swill and @rhtyd: The code in the PR is currently in production at PCextreme and running fine.

With a 'with' statement in Python I couldn't get a proper lock. But that's not really a functional change; it's merely a more Pythonic way of writing code.

@swill (Contributor) commented May 2, 2016

I am pretty comfortable with this because @wido is already using it. I don't have an SG environment set up right now for testing, so I would have to build one specifically to test this. Do you guys agree this is ready to merge?

@rohityadavcloud (Member) commented

@swill no objection, let's merge this

tag:mergeready

@asfgit merged commit 26becef into apache:master on May 4, 2016
asfgit pushed a commit that referenced this pull request May 4, 2016
kvm: Aqcuire lock when running security group Python script

It could happen that when multiple instances are starting at the same
time on a KVM host the Agent spawns multiple instances of security_group.py
which both try to modify iptables/ebtables rules.

This fails, with one of the two processes erroring out.

The instance is still started, but it doesn't have any IP connectivity due
to the failed programming of the security groups.

This modification lets the script acquire an exclusive lock on a file so that
only one instance of the script talks to iptables/ebtables at once.

Other instances of the script will poll every 500ms to see if they can
obtain the lock, and otherwise execute anyway after 15 seconds.

* pr/1408:
  kvm: Aqcuire lock when running security group Python script

Signed-off-by: Will Stevens <[email protected]>