Fix for Finalizers not running for Pods already terminating on start #972
Conversation
This is because when the Operator starts, all objects are Adds, including those already in the Terminating state.
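To illustrate the failure mode, here is a minimal sketch (the class and helper names are hypothetical, not the actual KubeOps internals): an Added event whose metadata already carries a deletionTimestamp has to be routed to the finalizer path, otherwise the object stays stuck in Terminating.

```csharp
using System.Threading.Tasks;
using k8s.Models;

// Hypothetical controller sketch; names are illustrative only.
public class PodController
{
    // Called for every Added watch event. When the operator (re)starts,
    // every existing object arrives as Added, including objects that are
    // already Terminating (deletionTimestamp set).
    public async Task OnAddedAsync(V1Pod pod)
    {
        if (pod.Metadata.DeletionTimestamp != null)
        {
            // Route to the same finalizer handling used for Modified events,
            // otherwise the pod is never released from Terminating.
            await RunFinalizersAsync(pod);
            return;
        }

        await ReconcileAsync(pod);
    }

    private Task RunFinalizersAsync(V1Pod pod) => Task.CompletedTask; // placeholder
    private Task ReconcileAsync(V1Pod pod) => Task.CompletedTask;     // placeholder
}
```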
Can this be properly externalized into a method instead of copy paste?
Thanks for the collaboration @DaneRushtonVista. When you say "copied over", can it be properly engineered to reduce clutter and maintain high code quality?
Thanks for the feedback. I have moved the duplicated Finalizer handling up to remove the duplicate code, and retained the original Added vs. Modified handling for the non-terminating case, as there are sufficient differences between them.
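A rough sketch of the shape described here, assuming a single entry point per watch event (all names, including the finalizer identifier, are hypothetical and not the actual KubeOps code): the terminating check is hoisted above the Added/Modified branching so both event types share one finalizer path, while non-terminating events keep their separate handling.

```csharp
using System.Threading.Tasks;
using k8s;
using k8s.Models;

// Hypothetical sketch of the refactored handler; not the actual KubeOps code.
public class PodEventHandler
{
    private const string FinalizerName = "example.com/my-finalizer"; // assumed name

    public async Task OnEventAsync(WatchEventType eventType, V1Pod pod)
    {
        // Shared path: a deletionTimestamp means the pod is Terminating,
        // regardless of whether we observed it as Added or Modified.
        if (pod.Metadata.DeletionTimestamp != null &&
            pod.Metadata.Finalizers?.Contains(FinalizerName) == true)
        {
            await RunFinalizersAsync(pod);
            return;
        }

        // Non-terminating events keep their original, separate handling.
        switch (eventType)
        {
            case WatchEventType.Added:
                await HandleAddedAsync(pod);
                break;
            case WatchEventType.Modified:
                await HandleModifiedAsync(pod);
                break;
        }
    }

    private Task RunFinalizersAsync(V1Pod pod) => Task.CompletedTask;  // placeholder
    private Task HandleAddedAsync(V1Pod pod) => Task.CompletedTask;    // placeholder
    private Task HandleModifiedAsync(V1Pod pod) => Task.CompletedTask; // placeholder
}
```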
@buehler @DaneRushtonVista My understanding of the Kubernetes API events is that the Added event should be skipped instead of running the finalizer, because the Deleted event will be fetched by the watcher afterwards. The operator fetches all events that have occurred since a point in time, either defined by a resource version (if known) or without one (receiving the latest batch held by the Kubernetes API server). So the event chain of Added and Deleted will be received. Running the finalizer from within the Added event therefore seems inappropriate in my opinion; the upcoming Deleted event would then be skipped because the entity is no longer found.
Hey @kimpenhaus |
About API events, my understanding (and maybe I slightly misunderstood what @kimpenhaus said) is that events are not sent about happenings in the past. So if a resource with a finalizer gets deleted while the operator is not running, Kubernetes will set a deletion timestamp.
@ralf-cestusio No native speaker here :) but reading the Kubernetes docs, I understand that there is a historical list of events (which is, admittedly, not long 😄).
At least within this short period it should be fine to skip this event, since we would receive the Deleted event after that. But to be honest, then I don't understand the main problem: how could those CRDs, for which we receive an Added event containing a deletion timestamp, have a finalizer attached that points to this operator when it was not added by the operator? Is that added by convention by a third party? Why is the assumption to run the finalizer and not to skip it?
@kimpenhaus Yes, I think we have the same non-native-speaker problem. In my work I deal a lot with worst-case edge cases, so that is the problem I try to solve, not the happy path. If we manage to fall into the short list of historical events, all would be good, because we would be on the happy path (but in my understanding that only really happens when resuming a watch after a short outage; a new watch always starts at the latest resource version). The problem is that operators can be down for extended periods (they should not be, but they could be), so the goal is to not leave the system in an unrecoverable broken state.
@ralf-cestusio Does this actually mean that if an operator connects to the API server by watching events, giving no resource version to start from, it will receive an Added event for all existing entities? That would be the only reasonable explanation, in my opinion, for why you would receive an Added event with a deletion timestamp for an entity you already added a long time ago. It's a pity that the docs are not clear on that, or at least I haven't found it. 😄 If it behaves like that, my concerns were wrong and I agree it is correct to call the finalizer. Shouldn't I be able to see the events by shutting down the operator, waiting for at least five minutes, and starting it again?
@kimpenhaus I agree that this is horrendously documented (and it also depends on how the watcher is implemented). But my experience with this operator shows that I get Added events for all existing CRDs of that kind. Maybe someone else has tested this properly; I just know that when I start the operator in debug I get bombarded by Added events :) e.g. example log
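For anyone who wants to check this themselves, here is a minimal sketch using the C# KubernetesClient library (the exact API surface may differ slightly between client versions): it starts a watch with no resource version and prints every event together with the deletion timestamp, which should show Added events for every pod that already exists.

```csharp
using System;
using k8s;
using k8s.Models;

var config = KubernetesClientConfiguration.BuildDefaultConfig();
var client = new Kubernetes(config);

// Start a watch without a resourceVersion; existing pods should arrive
// as Added events, including any that are already Terminating.
var podListResp = client.CoreV1.ListNamespacedPodWithHttpMessagesAsync(
    "default", watch: true);

using var watcher = podListResp.Watch<V1Pod, V1PodList>((type, pod) =>
{
    Console.WriteLine(
        $"{type} {pod.Metadata.Name} deletionTimestamp={pod.Metadata.DeletionTimestamp}");
});

Console.ReadLine(); // keep the watch open until Enter is pressed
```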
The scenario I have, which resulted in me writing this PR, is:
I have observed in my test cluster (where we shut down workloads overnight) that if the Operator shuts down first, the Terminating pods with the finalizer are left hanging even once the Operator restarts the following morning.
This is because the first time the Operator sees a Pod, it treats it as an Added event and doesn't check whether it was already in the Terminating state.
I have copied what was being done on the Modified event type so that the Finalizers get run.