[SPARK-14930][SPARK-13693] Fix race condition in CheckpointWriter.stop() #12712
`CheckpointWriter.stop()` is prone to a race condition: if one thread calls `stop()` right as a checkpoint write task begins to execute, that write task may become blocked when trying to access `fs`, the shared Hadoop FileSystem, since both the `fs` getter and the `stop` method synchronize on the same lock. Here's a thread-dump excerpt which illustrates the problem:

We can fix this problem by having `stop` and `fs` be synchronized on different locks: the synchronization on `stop` only needs to guard against multiple threads calling `stop` at the same time, whereas the synchronization on `fs` is only necessary for cross-thread visibility. There's only ever a single active checkpoint writer thread at a time, so we don't need to guard against concurrent access to `fs`. Thus, `fs` can simply become a `@volatile var`, similar to `lastCheckpointTime`.

This change should fix SPARK-13693, a flaky `MapWithStateSuite` test suite which has recently been failing several times per day. It also results in a huge test speedup: prior to this patch, `MapWithStateSuite` took about 80 seconds to run, whereas it now runs in less than 10 seconds. The `streaming` project's tests as a whole now run in ~220 seconds, versus ~354 seconds before.

/cc @zsxwing and @tdas for review.
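
For reviewers who want the shape of the change without reading the diff, here is a minimal sketch of the locking scheme described above. It is not the actual Spark code: `SketchCheckpointWriter` and `FakeFileSystem` are hypothetical stand-ins, and the single-thread executor and 10-second wait are assumptions made for illustration.

```scala
import java.util.concurrent.{Callable, Executors, Future, TimeUnit}

// Hypothetical stand-in for the shared Hadoop FileSystem handle.
final class FakeFileSystem {
  def write(path: String): String = s"wrote $path"
}

// Simplified model of a checkpoint writer; names and timeouts are assumptions.
class SketchCheckpointWriter {
  // A single worker thread, so at most one write task touches `fs` at a time.
  private val executor = Executors.newSingleThreadExecutor()

  @volatile private var stopped = false

  // No lock on `fs`: @volatile is enough for cross-thread visibility,
  // so a write task never has to wait for the monitor held by stop().
  @volatile private var fs: FakeFileSystem = null

  def write(path: String): Future[String] = {
    executor.submit(new Callable[String] {
      override def call(): String = {
        // Lazily create the handle on the writer thread.
        if (fs == null) fs = new FakeFileSystem
        fs.write(path)
      }
    })
  }

  // synchronized only guards against concurrent callers of stop().
  // Returns true if all pending writes finished before the timeout.
  def stop(): Boolean = synchronized {
    if (stopped) {
      true
    } else {
      executor.shutdown()
      val terminated = executor.awaitTermination(10, TimeUnit.SECONDS)
      if (!terminated) executor.shutdownNow()
      stopped = true
      terminated
    }
  }
}
```

In this sketch, if access to `fs` were instead wrapped in `synchronized` on the same object, a task starting just after `stop()` took the lock could not proceed until `stop()` gave up waiting, which is the blocking pattern described above.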