[SPARK-45678][CORE] Cover BufferReleasingInputStream.available/reset under tryOrFetchFailedException #43543

viirya · 2023-10-26T21:00:30Z

What changes were proposed in this pull request?

This patch wraps BufferReleasingInputStream.available and reset in tryOrFetchFailedException, so that an IOException thrown during an available or reset call is rethrown as a FetchFailedException.

Why are the changes needed?

We encountered a shuffle data corruption issue:

Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:504)
at org.xerial.snappy.Snappy.uncompress(Snappy.java:543)
at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:450)
at org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:497)
at org.apache.spark.storage.BufferReleasingInputStream.available(ShuffleBlockFetcherIterator.scala:1356)

Spark shuffle can detect corruption for a few stream operations such as read and skip: an IOException like the one in the stack trace is rethrown as a FetchFailedException, which causes the failed shuffle task to be retried. In this stack trace, however, the failing call is available, which is not covered by the mechanism. As a result, no retry happened and the Spark application simply failed.

Because available and reset can also involve data decompression and throw IOException, they should be checked the same way read and skip are.
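The wrapping idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Spark code: `SketchFetchFailedException`, `GuardedInputStream`, `CorruptStream`, `tryOrFetchFailed` and `Demo` are hypothetical names standing in for `FetchFailedException`, `BufferReleasingInputStream` and `tryOrFetchFailedException`, and the real helper in `ShuffleBlockFetcherIterator.scala` may do more than convert the exception.

```scala
import java.io.{ByteArrayInputStream, IOException, InputStream}

// Hypothetical stand-in for Spark's FetchFailedException; not the real class.
final class SketchFetchFailedException(message: String, cause: Throwable)
  extends RuntimeException(message, cause)

// Hypothetical stand-in for BufferReleasingInputStream: every delegated call
// that may touch (and decompress) the underlying data goes through one helper.
final class GuardedInputStream(delegate: InputStream) extends InputStream {

  // Runs `block` and converts an IOException into the retryable exception type.
  private def tryOrFetchFailed[T](block: => T): T =
    try block
    catch {
      case e: IOException =>
        throw new SketchFetchFailedException("possible shuffle block corruption", e)
    }

  override def read(): Int = tryOrFetchFailed(delegate.read())
  override def skip(n: Long): Long = tryOrFetchFailed(delegate.skip(n))

  // The two operations this PR adds to the covered set.
  override def available(): Int = tryOrFetchFailed(delegate.available())
  override def reset(): Unit = tryOrFetchFailed(delegate.reset())

  override def close(): Unit = delegate.close()
}

// A stream whose available() fails the way a corrupted Snappy chunk would.
final class CorruptStream extends InputStream {
  override def read(): Int = -1
  override def available(): Int = throw new IOException("FAILED_TO_UNCOMPRESS(5)")
}

object Demo {
  def main(args: Array[String]): Unit = {
    // A healthy stream passes through the wrapper unchanged.
    val healthy = new GuardedInputStream(new ByteArrayInputStream(Array[Byte](1, 2, 3)))
    println(healthy.available())

    // A failing available() surfaces as the retryable exception type.
    val guarded = new GuardedInputStream(new CorruptStream)
    val outcome =
      try { guarded.available(); "no error" }
      catch { case e: SketchFetchFailedException => s"converted: ${e.getCause.getMessage}" }
    println(outcome)
  }
}
```

Routing every delegated call through one helper keeps the conversion in a single place, so covering a further stream operation is a one-line change.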

Does this PR introduce any user-facing change?

Yes. Data corruption during an available or reset call now causes a FetchFailedException, which can be retried, instead of an IOException. This matches the existing behavior of read and skip.

How was this patch tested?

Added a test.

Was this patch authored or co-authored using generative AI tooling?

No

viirya · 2023-10-27T01:54:19Z

cc @dongjoon-hyun @sunchao

mridulm

Looks good to me.
For completeness' sake, do you want to do it for reset as well? We don't use it right now, though.

viirya · 2023-10-27T05:31:46Z

For completeness' sake, do you want to do it for reset as well? We don't use it right now, though.

Ah, I missed it. reset could possibly throw IOException too.

Thanks @mridulm.

mridulm

Thanks for fixing this @viirya !
Given CI is still running, please feel free to merge it once green :-)

viirya · 2023-10-27T05:58:57Z

Thank you @mridulm :)

sunchao

LGTM

…under tryOrFetchFailedException

### What changes were proposed in this pull request?

This patch proposes to wrap `BufferReleasingInputStream.available/reset` under `tryOrFetchFailedException`. So `IOException` during `available`/`reset` call will be rethrown as `FetchFailedException`.

### Why are the changes needed?

We have encountered shuffle data corruption issue:

```
Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:504)
at org.xerial.snappy.Snappy.uncompress(Snappy.java:543)
at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:450)
at org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:497)
at org.apache.spark.storage.BufferReleasingInputStream.available(ShuffleBlockFetcherIterator.scala:1356)
```

Spark shuffle has capacity to detect corruption for a few stream op like `read` and `skip`, such `IOException` in the stack trace will be rethrown as `FetchFailedException` that will re-try the failed shuffle task. But in the stack trace it is `available` that is not covered by the mechanism. So no-retry has been happened and the Spark application just failed.

As the `available`/`reset` op will also involve data decompression and throw `IOException`, we should be able to check it like `read` and `skip` do.

### Does this PR introduce _any_ user-facing change?

Yes. Data corruption during `available`/`reset` op is now causing `FetchFailedException` like `read` and `skip` that can be retried instead of `IOException`.

### How was this patch tested?

Added test.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43543 from viirya/add_available.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Chao Sun <[email protected]>

sunchao · 2023-10-28T02:24:53Z

Merged to master/branch-3.4/branch-3.5. Thanks @viirya @mridulm !

viirya · 2023-10-28T02:26:43Z

Thank you @mridulm @sunchao !

dongjoon-hyun · 2023-10-28T03:05:16Z

+1, late LGTM.

viirya · 2023-10-28T04:15:20Z

Thank you @dongjoon-hyun !

Cover BufferReleasingInputStream.available under tryOrFetchFailedException

bef28ff

viirya changed the title ~~SPARK-45678]Cover BufferReleasingInputStream.available under tryOrFetchFailedException~~ [SPARK-45678][CORE] Cover BufferReleasingInputStream.available under tryOrFetchFailedException Oct 26, 2023

github-actions bot added the CORE label Oct 26, 2023

Add test

f9b0559

mridulm approved these changes Oct 27, 2023

For review

455aeaa

viirya changed the title ~~[SPARK-45678][CORE] Cover BufferReleasingInputStream.available under tryOrFetchFailedException~~ [SPARK-45678][CORE] Cover BufferReleasingInputStream.available/reset under tryOrFetchFailedException Oct 27, 2023

mridulm approved these changes Oct 27, 2023

sunchao approved these changes Oct 28, 2023

sunchao closed this in 57e73da Oct 28, 2023
