
Conversation

rshest

@rshest rshest commented Sep 7, 2017

What changes were proposed in this pull request?

The problem:

DiskBlockManager has a notion of "scratch" local folder(s), which can be configured via the spark.local.dir option and which default to the system's /tmp. The hierarchy is two levels deep, e.g. /blockmgr-XXX.../YY, where the YY part is a hash-derived subdirectory name used to spread files evenly.
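
For context, here is a simplified sketch of how the two-level path is derived (the hashing details below are illustrative, not the exact Spark source):

```scala
import java.io.File

// Illustrative sketch of the layout used by DiskBlockManager.getFile:
// a filename is hashed to pick a root dir (blockmgr-XXX...) and a
// subdirectory (YY) under it, spreading files evenly across both levels.
def resolvePath(localDirs: Array[File], subDirsPerLocalDir: Int, filename: String): File = {
  val hash = filename.hashCode & Int.MaxValue
  val dirId = hash % localDirs.length
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
  val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
  new File(subDir, filename)
}
```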

The DiskBlockManager.getFile function expects the top-level directories (blockmgr-XXX...) to always exist (they are created once, when the Spark context is first created); otherwise it fails with a message like:

... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY

However, this is not always the case, in particular with the default /tmp folder: on some operating systems it is cleaned on a regular basis (e.g. once per day via a system cron job).

The symptom is that after the process using Spark has been running for a while (a few days), it can no longer create or access block files, since the top-level scratch directories are gone and DiskBlockManager.getFile crashes.

The change/mitigation is simple: use File.mkdirs instead of File.mkdir inside getFile, so that the full path is created there, which handles the case where the parent directory no longer exists.
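
A rough sketch of what the relevant part of getFile looks like with the change (not the verbatim diff; the surrounding caching logic is omitted):

```scala
import java.io.{File, IOException}

// Before: subDir.mkdir() only creates the leaf YY directory and fails if the
// blockmgr-XXX parent has been wiped. With mkdirs() the full path is recreated.
def ensureSubDir(subDir: File): File = {
  if (!subDir.exists() && !subDir.mkdirs()) {  // was: subDir.mkdir()
    throw new IOException(s"Failed to create local dir in $subDir")
  }
  subDir
}
```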

How was this patch tested?

I have added a unit test to DiskBlockManagerSuite that reproduces the failure and passes with this patch.
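
Roughly, the test does something along these lines (a hypothetical sketch; diskBlockManager, TestBlockId and Utils.deleteRecursively are the existing suite field and Spark helpers, but the exact test body may differ):

```scala
test("getFile succeeds after a root local dir has been externally deleted") {
  // Simulate the OS cleaning /tmp: the blockmgr-XXX roots are gone, and no
  // subdirectories have been created yet.
  diskBlockManager.localDirs.foreach(Utils.deleteRecursively)
  // Without the fix this throws "Failed to create local dir in ...", because
  // getFile only calls mkdir on the leaf subdirectory; with mkdirs the whole
  // path is recreated on demand.
  val file = diskBlockManager.getFile(new TestBlockId("test"))
  assert(file.getParentFile.exists())
}
```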

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Sep 7, 2017

I don't think it's reasonable to handle the case where people arbitrarily delete data out from under Spark. This case may be easy to fix; others won't be. This also isn't how changes are proposed: http://spark.apache.org/contributing.html

@rshest
Author

rshest commented Sep 7, 2017

Please note that it's not people deleting files, it's the operating system doing this automatically inside the /tmp folder. Given enough time, this happens with high probability.

I did read the guidelines above before submitting the PR, and I believe I went through all the steps aside from creating the JIRA issue (I had trouble logging into the system for some reason). Could you please point me to what else needs to be done, exactly? Thanks!

@rshest rshest changed the title Fix DiskBlockManager crashing when a root local folder has been externally deleted [CORE] Fix DiskBlockManager crashing when a root local folder has been externally deleted Sep 7, 2017
@rshest rshest changed the title [CORE] Fix DiskBlockManager crashing when a root local folder has been externally deleted [SPARK-21942][CORE] Fix DiskBlockManager crashing when a root local folder has been externally deleted Sep 7, 2017
@rshest
Author

rshest commented Sep 7, 2017

I have managed to create the JIRA issue and updated the pull request's title accordingly:
https://issues.apache.org/jira/browse/SPARK-21942

Since it does not look like this change is going to be accepted, for posterity and for people who might come here with a similar problem, the suggested workarounds are (per Sean's comment on the JIRA issue):

  • manually configure your scratch directory (spark.local.dir) to point somewhere other than /tmp (see the sketch after this list)
  • stop your system from cleaning up your temp folder
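
For example, the scratch location can be set programmatically (or equivalently in conf/spark-defaults.conf); the path below is just an illustration, any directory not subject to automatic cleanup will do:

```scala
import org.apache.spark.SparkConf

// Point the scratch space away from /tmp so cron/tmpwatch cannot wipe it.
// "/var/spark-scratch" is an example path, not a recommendation.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.local.dir", "/var/spark-scratch")
```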

srowen added a commit to srowen/spark that referenced this pull request Sep 12, 2017
@srowen srowen mentioned this pull request Sep 12, 2017
@asfgit asfgit closed this in dd88fa3 Sep 13, 2017