Skip to content

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Dec 3, 2021

What changes were proposed in this pull request?

This pr update default spark.io.compression.codec to zstd.

Why are the changes needed?

To workaround Stream is corrupted issue:

org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:830)
	at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:926)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.sparkproject.guava.io.ByteStreams.read(ByteStreams.java:899)
	at org.sparkproject.guava.io.ByteStreams.readFully(ByteStreams.java:733)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:494)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:50)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:730)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:255)
	at org.apache.spark.sql.execution.SortExecBase.$anonfun$doExecute$1(SortExec.scala:266)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:913)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:913)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:388)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:315)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:129)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:486)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1379)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:489)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted
	at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:259)
	at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
	at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:922)
	... 32 more

Please see SPARK-18105 for more details.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

@HyukjinKwon
Copy link
Member

cc @dongjoon-hyun FYI

@wangyum
Copy link
Member Author

wangyum commented Dec 3, 2021

image

@SparkQA
Copy link

SparkQA commented Dec 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50373/

@SparkQA
Copy link

SparkQA commented Dec 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50373/

@SparkQA
Copy link

SparkQA commented Dec 3, 2021

Test build #145898 has finished for PR 34798 at commit ce477f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Dec 3, 2021

Just for my context - is this default change to match other default changes we already made?
Is the idea that lz4 support has a bug, so default to zstd?
or both?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Could you keep the original JIRA ID instead of filing a new one, @wangyum ?
  1. Also, we need a test case if this claims for a bug fix.

@wangyum
Copy link
Member Author

wangyum commented Dec 5, 2021

Thank you all. The root cause may be a hardware issue.

@wangyum
Copy link
Member Author

wangyum commented Dec 7, 2021

It turns out that Stream is corrupted is a hardware problem. Zstd has similar issues.

@wangyum wangyum closed this Dec 7, 2021
@wangyum wangyum deleted the SPARK-37535 branch December 7, 2021 10:29
@dongjoon-hyun
Copy link
Member

Thank you for the info, @wangyum .

@sleep1661
Copy link
Contributor

@wangyum @dongjoon-hyun Why root cause may be a hardware issue, could you provide more information about this. Finally, how to fix

@pan3793
Copy link
Member

pan3793 commented Mar 31, 2022

@sleep1661 You can get more information at #32385 (comment), and #32385 (comment) may be a direction to fix it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants