Unit Tests for KubernetesClusterSchedulerBackend #459
Conversation
Thank you for this 👍
@varunkatta I changed some things in the scheduler backend related to #244. Some of them are style things and variable renames, but there was a case I identified where we missed a call we should have made.
Integration test failure is legit from #454.
This is ready for review, but it is not strictly complete. I'm certain that some corner cases are still missing from the scheduler backend tests. We also don't yet unit test the components that were factored out in #454 and #452. I'll probably revisit these, but I would like to keep these PRs small so that we can incrementally expand the cases covered by these tests.
```scala
import KubernetesClusterSchedulerBackend._
```
```scala
private val EXECUTOR_ID_COUNTER = new AtomicLong(0L)
```
moved this here so there's less static state?
Correct - it's much more difficult to unit test if the counter is global, because between different tests one needs to know what the counter is set to.
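As a sketch of why (class and method names here are hypothetical, not the actual Spark code): when the counter is instance state, each test constructs a fresh backend and can assert on executor IDs deterministically, instead of depending on whatever a shared global counter was left at by an earlier test.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch: an instance-level counter resets with every new
// backend instance, so tests never inherit state from prior tests.
class BackendSketch {
  private val EXECUTOR_ID_COUNTER = new AtomicLong(0L)

  // Each call hands out the next executor ID for this instance only.
  def nextExecutorId(): String = EXECUTOR_ID_COUNTER.incrementAndGet().toString
}
```

With a global (companion-object) counter, the second backend constructed in a test suite would start handing out IDs wherever the first left off, making assertions order-dependent.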
```scala
def deleteExecutorFromApiAndDataStructures(executorId: String): Unit = {
```
Suggestion: rename to `deleteExecutorFromClusterAndDataStructures`.
```scala
val exitReason = ExecutorExited(getExecutorExitStatus(pod), exitCausedByApp = false,
  "Pod " + pod.getMetadata.getName + " deleted or lost.")
failedPods.put(pod.getMetadata.getName, exitReason)
val alreadyReleased = isPodAlreadyReleased(pod)
```
Suggestion: inline `alreadyReleased`.
```scala
executorsToRemove.add(executorId)
RUNNING_EXECUTOR_PODS_LOCK.synchronized {
  runningExecutorsToPods.get(executorId).foreach { pod =>
    disconnectedPodsByExecutorIdPendingRemoval(executorId) = pod
```
can you explain a bit why it's safe to remove the executor directly here, rather than going through the executorsToRemove set first?
We don't remove the executor directly here. Some of the logic has changed and variables were renamed, so `executorsToRemove` no longer exists in that exact form anyway.
Here we are marking the executor as disconnected and the allocator thread will clean it up once the exit reason is known. I believe this is the same semantics as before.
Yup. No change in semantics; the marking is just done after verifying that the executor is still running/tracked.
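To make the ordering concrete, here is a minimal sketch (identifiers mirror the snippet above, but the enclosing object and the pod's type are hypothetical stand-ins): an executor is marked as disconnected only while it is still present in the running map, and the actual removal is left to the allocator thread once an exit reason is known.

```scala
import scala.collection.mutable

// Hypothetical sketch of the disconnect path discussed above.
object DisconnectSketch {
  private val RUNNING_EXECUTOR_PODS_LOCK = new Object
  // executor id -> pod name (stands in for the real pod object)
  val runningExecutorsToPods = mutable.Map[String, String]()
  val disconnectedPodsByExecutorIdPendingRemoval = mutable.Map[String, String]()

  def markDisconnected(executorId: String): Unit =
    RUNNING_EXECUTOR_PODS_LOCK.synchronized {
      // Mark only if the executor is still running/tracked; the allocator
      // thread cleans it up later, once the exit reason is known.
      runningExecutorsToPods.get(executorId).foreach { pod =>
        disconnectedPodsByExecutorIdPendingRemoval(executorId) = pod
      }
    }
}
```

The `foreach` on the `Option` is what enforces the "verify before marking" step: an executor that is no longer tracked simply falls through without mutating any state.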
Assigning @varunkatta and @foxish to verify correctness of the nuanced but important logic changes done here.
```scala
val exitReason = ExecutorExited(getExecutorExitStatus(pod), exitCausedByApp = false,
  "Pod " + pod.getMetadata.getName + " deleted or lost.")
failedPods.put(pod.getMetadata.getName, exitReason)
val exitMessage = if (isPodAlreadyReleased(pod)) {
```
Thanks for this change. It makes this consistent with the ErroredPod handling case.
```scala
val reasonCheckCount = executorReasonCheckAttemptCounts.getOrElse(executorId, 0)
if (reasonCheckCount >= MAX_EXECUTOR_LOST_REASON_CHECKS) {
  removeExecutor(executorId, SlaveLost("Executor lost for unknown reasons."))
  deleteExecutorFromClusterAndDataStructures(executorId)
```
+1 for the change. Thanks for catching the omission here.
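The surrounding logic, as a hedged sketch (the class and the callback parameters are hypothetical stand-ins for the real backend methods): each lookup that fails to find an exit reason bumps a per-executor counter, and once the cap is reached the executor is removed and, with this fix, the backend's own bookkeeping is cleaned up as well.

```scala
import scala.collection.mutable

// Hypothetical sketch of the capped retry loop around unknown exit reasons.
class LostReasonSketch(
    maxChecks: Int,
    removeExecutor: String => Unit,
    deleteFromClusterAndDataStructures: String => Unit) {
  private val executorReasonCheckAttemptCounts = mutable.Map[String, Int]()

  def onNoExitReasonFound(executorId: String): Unit = {
    val reasonCheckCount = executorReasonCheckAttemptCounts.getOrElse(executorId, 0)
    if (reasonCheckCount >= maxChecks) {
      removeExecutor(executorId)
      // The omission fixed in this PR: also drop the backend's own state,
      // not just the Spark-side executor registration.
      deleteFromClusterAndDataStructures(executorId)
      executorReasonCheckAttemptCounts.remove(executorId)
    } else {
      executorReasonCheckAttemptCounts(executorId) = reasonCheckCount + 1
    }
  }
}
```

Without the second call, the executor would disappear from Spark's view while the backend kept stale entries for it in its tracking maps.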
```
@@ -0,0 +1,383 @@
/*
```
Huge thanks for this. This unit test is one of the most important additions for keeping the cluster backend healthy, correct, and maintainable.
varunkatta left a comment:
Changes look good. They make the code more readable. Most importantly, this change fixes a subtle bug and adds unit tests, which is a huge win! Thanks Matt!
Rebased
rerun unit tests please
ash211 left a comment:
One minor nit that might not be worth a change, but otherwise I'm really happy to see tests coming into this class!
```scala
val allocatorExecutor = ThreadUtils
  .newDaemonSingleThreadScheduledExecutor("kubernetes-pod-allocator")
val requestExecutorsService = ThreadUtils.newDaemonCachedThreadPool(
  "kubernetes-request-executors")
```
Did you mean to change this name from kubernetes-executor-requests?
Not particularly.
OK, reverted it -- will merge when builds are green!
It's merged -- was there more you thought should be done here?
I've seen some problems myself with executor recovery, but haven't dug into why yet. Please do open new issues with any observations of bad behavior!
…#459)
* Start unit tests for the scheduler backend.
* More tests for the scheduler backend.
* Unit tests and possible preemptive corrections to failover logic.
* Address PR comments.
* Resolve merge conflicts. Move MiB change to ExecutorPodFactory.
* Revert accidental thread pool name change
Requires #454, which in turn requires #452.