[SPARK-21942][CORE] Fix DiskBlockManager crashing when a root local folder has been externally deleted #19154
What changes were proposed in this pull request?
The problem:

`DiskBlockManager` has a notion of "scratch" local folder(s), which can be configured via the `spark.local.dir` option and which default to the system's `/tmp`. The hierarchy is two-level, e.g. `/blockmgr-XXX.../YY`, where the `YY` part is a hash bit, used to spread files evenly.

The function `DiskBlockManager.getFile` expects the top-level directories (`blockmgr-XXX...`) to always exist (they are created once, when the Spark context is first created); otherwise it fails with an error.

However, this may not always be the case, in particular when the default `/tmp` folder is used: on certain operating systems it can be cleaned on a regular basis (e.g. once per day via a system cron job).

The symptom is that after a process using Spark has been running for a while (a few days), it may no longer be able to load files, because the top-level scratch directories are gone and `DiskBlockManager.getFile` crashes.

The change/mitigation is simple: use `File.mkdirs` instead of `File.mkdir` inside `getFile`, so that the full path is created there, which handles the case where the parent directory no longer exists.

How was this patch tested?
I have added a falsifying unit test inside `DiskBlockManagerSuite`, which fails without this patch and passes with it.
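The difference between `File.mkdir` and `File.mkdirs` that the fix relies on can be illustrated with a minimal standalone sketch. This is not the code from the patch or from `DiskBlockManagerSuite`: the class `MkdirSketch`, the helper `getFile`, the method `survivesRootDeletion`, the sub-directory name `0a`, and the `blockmgr-sketch` prefix are all hypothetical names chosen for illustration. The sketch only assumes the documented `java.io.File` behavior that `mkdir` creates a single directory level, while `mkdirs` also creates missing parents.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

final class MkdirSketch {

    // Hypothetical stand-in for a two-level lookup: <root>/<subDir>/<name>.
    // createParents selects File.mkdirs (true) or File.mkdir (false).
    static File getFile(File root, String subDir, String name, boolean createParents)
            throws IOException {
        File dir = new File(root, subDir);
        if (!dir.exists()) {
            boolean created = createParents ? dir.mkdirs() : dir.mkdir();
            if (!created && !dir.isDirectory()) {
                throw new IOException("Could not create directory " + dir);
            }
        }
        return new File(dir, name);
    }

    // Creates a temporary root, deletes it to simulate an external cleanup
    // (e.g. a cron job wiping /tmp), then checks whether the lookup still works.
    static boolean survivesRootDeletion(boolean createParents) {
        File root;
        try {
            root = Files.createTempDirectory("blockmgr-sketch").toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!root.delete()) {
            throw new IllegalStateException("Could not delete " + root);
        }
        try {
            File f = getFile(root, "0a", "block", createParents);
            return f.getParentFile().isDirectory();
        } catch (IOException e) {
            return false;
        } finally {
            // Best-effort cleanup of anything the lookup created.
            new File(root, "0a").delete();
            root.delete();
        }
    }

    public static void main(String[] args) {
        System.out.println("mkdir survives root deletion:  " + survivesRootDeletion(false));
        System.out.println("mkdirs survives root deletion: " + survivesRootDeletion(true));
    }
}
```

With the root directory removed, the `mkdir` variant is expected to fail because the parent is missing, whereas the `mkdirs` variant recreates the whole path.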