[SPARK-17850][Core] Add a flag to ignore corrupt files (branch 1.6) #15454
Conversation
```diff
  try {
    finished = !reader.nextKeyValue
  } catch {
+   case _: EOFException if ignoreCorruptFiles => finished = true
```
This is a behavior change to NewHadoopRDD, which may surprise existing 1.6 users.
Yeah, I am slightly worried about this change of behavior too.
Though I think it should be fine.
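For readers following the thread, here is a minimal, self-contained sketch of the guarded-catch pattern under discussion. `IgnoreCorruptSketch` and `readNextKeyValue` are stand-ins for the real NewHadoopRDD internals, not the actual Spark source; only the control flow is the point.

```scala
import java.io.EOFException

object IgnoreCorruptSketch {
  // Would be read from spark.files.ignoreCorruptFiles (default: true).
  val ignoreCorruptFiles = true
  var finished = false

  // Stand-in for reader.nextKeyValue; throws as a truncated file would.
  def readNextKeyValue(): Boolean = throw new EOFException("truncated file")

  def main(args: Array[String]): Unit = {
    try {
      finished = !readNextKeyValue()
    } catch {
      // Flag on: a corrupt/truncated file just ends the iteration early.
      // Flag off: the EOFException propagates and fails the task.
      case _: EOFException if ignoreCorruptFiles => finished = true
    }
    println(s"finished = $finished") // prints "finished = true"
  }
}
```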
```diff
  } catch {
-   case eof: EOFException =>
-     finished = true
+   case _: EOFException if ignoreCorruptFiles => finished = true
```
I didn't use IOException, to keep the default behavior the same as before.
sounds good
The question here is whether HadoopRDD and NewHadoopRDD should be consistent. If so, we have to break the current behavior of one of them.
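To make the trade-off concrete, a hedged before/after sketch of the HadoopRDD catch clause; the object and helper names are hypothetical, not Spark code:

```scala
import java.io.EOFException

object CatchClauseSketch {
  // Old HadoopRDD behavior: an EOFException is always swallowed.
  def oldReadFinished(read: () => Boolean): Boolean =
    try !read() catch { case _: EOFException => true }

  // Patched behavior: swallowed only when the flag is on. Catching the
  // narrow EOFException (rather than IOException) with a default of true
  // keeps the default behavior identical to the old code.
  def newReadFinished(read: () => Boolean, ignoreCorruptFiles: Boolean): Boolean =
    try !read() catch { case _: EOFException if ignoreCorruptFiles => true }
}
```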
Test build #66853 has finished for PR 15454 at commit
LGTM
Since I have not yet heard complaints about this for 1.6, and this change may break some users' jobs, I'm going to close it now.
## What changes were proposed in this pull request?

This is the patch for 1.6. It only adds the Spark conf `spark.files.ignoreCorruptFiles` because SQL just uses HadoopRDD directly in 1.6. `spark.files.ignoreCorruptFiles` is `true` by default.

## How was this patch tested?

The added test.
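A usage sketch, for example in spark-shell: the conf key is the one this patch adds, while the app name and input path are placeholders. Disabling the flag restores fail-fast behavior on corrupt input.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Usage sketch for the new conf; app name and path are placeholders.
val conf = new SparkConf()
  .setAppName("ignore-corrupt-files-demo")
  .set("spark.files.ignoreCorruptFiles", "false") // default is "true"
val sc = new SparkContext(conf)

// With the flag off, a truncated .gz file fails the job with an
// EOFException; with the default of true it is treated as end-of-file.
val lines = sc.textFile("hdfs:///data/maybe-corrupt/*.gz")
println(lines.count())
sc.stop()
```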