- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28.9k
[SPARK-18372][SQL][Branch-1.6].Staging directory fail to be removed #15819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Can you add some documentation? The current code is very difficult to follow. | 
| do you have a unit test to reproduce this bug? | 
| Actually, I do not have the unit test, but the code list below (same as we The related code would be this way: Our customer and ourself also have manually reproduced this bug for spark For the unit test, because we do not know how to find the hive directory The solution is that we reuse three functions in the 2.0.2 to create the On Wed, Nov 9, 2016 at 10:26 PM, Wenchen Fan [email protected] 
 | 
| val rand: Random = new Random | ||
| val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS") | ||
| val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong) | ||
| return executionId | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the return statement in scala code be removed please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @fidato13 this is ok, since the part of this code is reused from spark 2.0.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@merlintang  Can we take this opportunity to rectify at other places as well, Adding a return statement at the end of a simple method where no complex control flows are introduced would rather make it seem like java style coding. Check on the below link for Scala style guide for reference:
https://github.com/databricks/scala-style-guide#return-statements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I will fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cheers.
| "Cannot create staging directory '" + dir.toString + "': " + e.getMessage, e) | ||
|  | ||
| } | ||
| return dir | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the return statement in scala code be removed please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your comment, I will update this push it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
| @cloud-fan @rxin can you review this code? since several customers are complaining about the hive generated empty staging files in the HDFS. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite see how this removes the staging dir. Just the deleteOnExit? does it need this complexity then?
| private def executionId: String = { | ||
| val rand: Random = new Random | ||
| val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS") | ||
| val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why all this -- just us a UUID? you also have a redundant return and types here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is. I am working on this way because I want to code is exactly the same as the spark 2.0.x version.
| } | ||
| catch { | ||
| case e: IOException => | ||
| throw new RuntimeException( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use RuntimeException; why even handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can find the reason that we use this code is because (1) the old version need to use the hive package to create the staging directory, in the hive code, this staging directory is storied in a hash map, and then these staging directories would be removed when the session is closed. however, our spark code do not trigger the hive session close, then, these directories will not be removed. (2) you can find the pushed code just simulate the hive way to create the staging directory inside the spark rather than based on the hive. Then, the staging directory will be removed. (3) I will fix the return type issue, thanks for your comments @srowen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost all the codes in this PR are copied from the existing master. This PR is just for branch 1.6
| yes, exactly. This path is only for spark 1.x. what i proposed here is that
we need to use the code of spark 2.0.x o fix the bug of spark 1.x. you can
see this message from the my previous replies. I do not want to change the
code, since it will make the 1.x and 2.x in great different.… On Sun, Dec 4, 2016 at 10:08 AM, Xiao Li ***@***.***> wrote:
 ***@***.**** commented on this pull request.
 ------------------------------
 In sql/hive/src/main/scala/org/apache/spark/sql/hive/
 execution/InsertIntoHiveTable.scala
 <#15819>:
 > +      } else {
 +        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
 +      }
 +    val dir: Path =
 +      fs.makeQualified(
 +        new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID))
 +    logDebug("Created staging dir = " + dir + " for path = " + inputPath)
 +    try {
 +      if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
 +        throw new IllegalStateException("Cannot create staging directory  '" + dir.toString + "'")
 +      }
 +      fs.deleteOnExit(dir)
 +    }
 +    catch {
 +      case e: IOException =>
 +        throw new RuntimeException(
 Almost all the codes in this PR are copied from the existing master. This
 PR is just for branch 1.6
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#15819>, or mute the thread
 <https://github.com/notifications/unsubscribe-auth/ABXY-aaIs7Wx6ha3mvqrTVIxehcxGkaYks5rEwGKgaJpZM4KtFSt>
 .
 | 
| @merlintang Could you please add [Branch-1.6] in your PR title? | 
| it is updated.… On Sun, Dec 4, 2016 at 11:23 AM, Xiao Li ***@***.***> wrote:
 @merlintang <https://github.com/merlintang> Could you please add
 [Branch-1.6] in your PR title?
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#15819 (comment)>, or mute
 the thread
 <https://github.com/notifications/unsubscribe-auth/ABXY-Q749lhH4-ePuwIlqR_-AjMhdlDIks5rExNAgaJpZM4KtFSt>
 .
 | 
| OK so the problem becomes, do we want to backport this to 1.6? cc @rxin | 
| If it is a bug fix and low risk, sure. | 
| this bug is related to 1.5.x as well as 1.6.x.  please backport to 1.5.x as
well.… On Sun, Dec 4, 2016 at 6:20 PM, Reynold Xin ***@***.***> wrote:
 If it is a bug fix and low risk, sure.
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#15819 (comment)>, or mute
 the thread
 <https://github.com/notifications/unsubscribe-auth/ABXY-fX25g3sjKumkkWJPXjq1Wq2jMqvks5rE3T8gaJpZM4KtFSt>
 .
 | 
| We have stopped making new releases for 1.5 so it makes no sense to backport. | 
| Ok.… On Sun, Dec 4, 2016 at 6:25 PM, Reynold Xin ***@***.***> wrote:
 We have stopped making new releases for 1.5 so it makes no sense to
 backport.
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#15819 (comment)>, or mute
 the thread
 <https://github.com/notifications/unsubscribe-auth/ABXY-VjuhfvucqSwiSitncO_gIX_7G-wks5rE3YogaJpZM4KtFSt>
 .
 | 
| ok @merlintang can you find out which PR adds these codes to 2.0? Then other people can know what we are backporting in this PR | 
| @cloud-fan  this is related to this PR in the 2.0.x | 
| I'm using spark 2.0.2 I got a really big hive-stage folder. | 
| @lichenglin Could you post the layout of that staging folder? | 
| here is some result for  I run a sql like  Can I delete the folders manually?? | 
| do you exit the spark shell ?  I have tested on this, and this staging file
would be removed after we exit the spark shell under spark 2.0.x.
meanwhile, the staging file are used for hive to write data, and if one
hive insert data fail in the middle, the staging file could be used.… On Tue, Dec 6, 2016 at 5:09 PM, lichenglin ***@***.***> wrote:
 here is some result for du -h --max-depth=1 .
 3.3G ./.hive-staging_hive_2016-12-06_18-17-48_899_1400956608265117052-5
 13G ./.hive-staging_hive_2016-12-06_15-43-35_928_6647980494630196053-5
 8.6G ./.hive-staging_hive_2016-12-06_17-05-51_951_8422682528744006964-5
 9.7G ./.hive-staging_hive_2016-12-06_17-14-44_748_6947381677226271245-5
 9.2G ./day=2016-12-01
 8.5G ./day=2016-11-19
 I run a sql like insert overwrite db.table partition(day='2016-12-06')
 select * from tmpview everyday
 each sql create a "hive-staging folder".
 Can I delete the folders manually??
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#15819 (comment)>, or mute
 the thread
 <https://github.com/notifications/unsubscribe-auth/ABXY-cCebx3piETzStocxtvovCRPX6Ukks5rFgdYgaJpZM4KtFSt>
 .
 | 
| In fact,I'm using zeppelin to run sql. | 
| @lichenglin Another PR #16134 is trying to delete the staging directory and the temporary data files (which is pretty big in your case) after each insert. | 
| @gatorsmile what is going on this patch? this is a backport code, thus, can you merge this patch into 1.6.x ? more than one users are running into this issue in the spark 1.6.x. | 
| The current fix does not resolve the issue when users hitting abnormal termination of JVM. In addition, if the JVM does not stop, these temporary files could consume a lot of spaces. Thus, I think #16134 needs to be added too. This is just my opinion. Also need to get the feedbacks from the other Committers. | 
| yea, I think we should backport a complete staging dir cleanup functionality to 1.6, let's wait for #16134 | 
| … On Tue, Dec 13, 2016 at 12:18 AM, Wenchen Fan ***@***.***> wrote:
 yea, I think we should backport a complete staging dir cleanup
 functionality to 1.6, let's wait for #16134
 <#16134>
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#15819 (comment)>, or mute
 the thread
 <https://github.com/notifications/unsubscribe-auth/ABXY-VSxRlOyt2H4ySKmNJm4j4q5facoks5rHlS8gaJpZM4KtFSt>
 .
 | 
| retest this please | 
| val rand: Random = new Random | ||
| val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS") | ||
| val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong) | ||
| return executionId | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| |location '${tmpDir.toURI.toString}' | ||
| """.stripMargin) | ||
|  | ||
| sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can create a temporary view, instead of creating another table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the temporary view supported in the 1.6.x? I just used the hivecontext to create the view, but it does not work. because this is small test case, the created table here would be ok. please advise. thanks so much, Tao.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 1.6, the function is registerTempTable. The name was changed in 2.0 to temp view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Xiao, I have created a dataframe, then create registerTempTable as following.
val df = sqlContext.createDataFrame((1 to 2).map(i => (i, "a"))).toDF("key", "value")
df.select("value").repartition(1).registerTempTable("tbl")
it can work, but it looks like fuzzy. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the following line?
Seq((1, "a")).toDF("key", "value").registerTempTable("tbl")BTW, I am Xiao Li. : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You just want one column. Then, you can do it by
Seq(Tuple1("a")).toDF("value").registerTempTable("tbl")There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry Xiao, since one of my best friend is Tao. :). Sorry. It is updated. Thanks again.
| Test build #70785 has finished for PR 15819 at commit  
 | 
| @gatorsmile can you retest the patch, then we can merge. Sorry to ping you multiple times since several users are asking this. | 
| retest this please | 
| withTable("tab", "tbl") { | ||
| sqlContext.sql( | ||
| s""" | ||
| |CREATE TABLE tab(c1 string) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: two spaces -> one space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, it is updated.
| val rand: Random = new Random | ||
| val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS") | ||
| val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong) | ||
| executionId | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: an indent issue. Please remove one more space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! thanks xiao.
| Test build #70907 has finished for PR 15819 at commit  
 | 
| |location '${tmpDir.toURI.toString}' | ||
| """.stripMargin) | ||
|  | ||
| import sqlContext.implicits._ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: move this import to line 231.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| retest this please | 
| Test build #70908 has finished for PR 15819 at commit  
 | 
| Weird... Not sure why the build failed. The build works in my local environment. cc @srowen @JoshRosen | 
| retest this please | 
| Test build #70964 has finished for PR 15819 at commit  
 | 
| } | ||
|  | ||
| test(s"$version: Delete the temporary staging directory and files after each insert") { | ||
| import sqlContext.implicits._ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us roll back to the way you did in the last run, instead of using the temp table. I am not sure whether this trigger the build issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, xiao, I have reverted that and test locally.
| retest this please | 
| LGTM pending test | 
| Test build #70990 has finished for PR 15819 at commit  
 | 
## What changes were proposed in this pull request? This fix is related to be bug: https://issues.apache.org/jira/browse/SPARK-18372 . The insertIntoHiveTable would generate a .staging directory, but this directory fail to be removed in the end. This is backport from spark 2.0.x code, and is related to PR #12770 ## How was this patch tested? manual tests Author: Mingjie Tang <mtanghortonworks.com> Author: Mingjie Tang <[email protected]> Author: Mingjie Tang <[email protected]> Closes #15819 from merlintang/branch-1.6.
| Thanks! Merging to 1.6 | 
| @merlintang Can you close this PR? Thanks! | 
| Many thanks, Xiao. I learnt lots. | 
## What changes were proposed in this pull request? This fix is related to be bug: https://issues.apache.org/jira/browse/SPARK-18372 . The insertIntoHiveTable would generate a .staging directory, but this directory fail to be removed in the end. This is backport from spark 2.0.x code, and is related to PR apache#12770 ## How was this patch tested? manual tests Author: Mingjie Tang <mtanghortonworks.com> Author: Mingjie Tang <[email protected]> Author: Mingjie Tang <[email protected]> Closes apache#15819 from merlintang/branch-1.6. (cherry picked from commit 2303887)
## What changes were proposed in this pull request? This fix is related to be bug: https://issues.apache.org/jira/browse/SPARK-18372 . The insertIntoHiveTable would generate a .staging directory, but this directory fail to be removed in the end. This is backport from spark 2.0.x code, and is related to PR apache#12770 ## How was this patch tested? manual tests Author: Mingjie Tang <mtanghortonworks.com> Author: Mingjie Tang <[email protected]> Author: Mingjie Tang <[email protected]> Closes apache#15819 from merlintang/branch-1.6.
What changes were proposed in this pull request?
This fix is related to be bug: https://issues.apache.org/jira/browse/SPARK-18372 .
The insertIntoHiveTable would generate a .staging directory, but this directory fail to be removed in the end.
This is backport from spark 2.0.x code, and is related to PR #12770
How was this patch tested?
manual tests
Author: Mingjie Tang [email protected]