[SPARK-4105] retry the fetch or stage if shuffle block is corrupt #15923
Conversation
logDebug("Number of requests in flight " + reqsInFlight)
}
case _ =>
var result: FetchResult = null
add documentation explaining what's going on here.
BTW, is there a way to refactor this function so it is testable? I do worry some of the logic here won't be tested at all.
Test build #68813 has finished for PR 15923 at commit
Test build #68870 has finished for PR 15923 at commit
@JoshRosen @zsxwing Could you help to review this one?
input = streamWrapper(blockId, in)
// Only copy the stream if it's wrapped by compression or encryption, also the size of
// block is small (the decompressed block is smaller than maxBytesInFlight)
Does this issue only happen for small blocks? Otherwise, only checking small blocks doesn't seem very helpful. Why not add a shuffle block checksum instead? Then we could just check the compressed block and retry.
The purpose of this PR is to reduce the chance that a job fails because of network/disk corruption, without introducing other regressions (OOM). Typically, shuffle blocks are small, so we can still fetch in parallel even with the maxBytesInFlight limit. For the few large blocks (for example, from data skew), we don't check for corruption for now (at least it's not worse than before).
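To make the idea concrete, here is a small self-contained sketch (not the PR's actual code; the object name, the gzip codec, the block contents, and the 48 MB threshold are all stand-ins): eagerly draining the decompressing stream turns a corrupt small block into an IOException right away, which is the signal used to re-fetch.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, IOException, InputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object SmallBlockCheckSketch {
  // Fully drain a (decompressing) stream into memory so that corruption fails fast
  // with an IOException here, instead of surfacing later inside the shuffle reader.
  def eagerlyDecompress(wrapped: InputStream): InputStream = {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](8192)
    var n = wrapped.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = wrapped.read(buf)
    }
    wrapped.close()
    new ByteArrayInputStream(out.toByteArray)
  }

  def main(args: Array[String]): Unit = {
    // Compress a small "block", then flip one byte in the middle to simulate corruption.
    val raw = Array.fill[Byte](4096)(42.toByte)
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    gz.write(raw)
    gz.close()
    val block = bos.toByteArray
    block(block.length / 2) = (block(block.length / 2) ^ 0xFF).toByte

    val maxBytesInFlight = 48L * 1024 * 1024
    try {
      // Only small blocks are checked eagerly, to bound the extra memory used.
      if (block.length < maxBytesInFlight) {
        eagerlyDecompress(new GZIPInputStream(new ByteArrayInputStream(block)))
      }
      println("block decompressed cleanly")
    } catch {
      case e: IOException => println(s"corruption detected early, would re-fetch: $e")
    }
  }
}
```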
I tried to add checksums for shuffle blocks in #15894, but that has much more complexity and overhead, so I'm in favor of this lighter approach.
Once we start explicitly managing the memory and support spilling, will it be safe to do this for large blocks, too?
@JoshRosen I think so.
I left a couple of comments regarding cleanup of decompression buffers and logging of exceptions.
private[spark]
final class ShuffleBlockFetcherIterator(
    context: TaskContext,
    shuffleClient: ShuffleClient,
Could you update the Scaladoc to document the two new parameters here? I understand what streamWrapper means from context but it might be useful for new readers of this code.
done
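For reference, a rough sketch of what that Scaladoc could say; the wording is illustrative only, and the second parameter name follows the rename suggested later in this review rather than the merged text.

```scala
/**
 * An iterator that fetches shuffle blocks and hands them to the reader as
 * (BlockId, InputStream) pairs.
 *
 * @param streamWrapper A function to wrap the returned input stream, e.g. applying
 *                      decompression and/or decryption via the SerializerManager.
 * @param detectCorrupt Whether to eagerly decompress small, wrapped blocks so that a
 *                      corrupt block is detected immediately and re-fetched (at most
 *                      once) instead of failing later inside the shuffle reader.
 */
```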
// TODO: manage the memory used here, and spill it into disk in case of OOM.
Utils.copyStream(input, out)
out.close()
input = out.toChunkedByteBuffer.toInputStream(true)
Could you put dispose = true here to make the boolean parameter clearer?
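Concretely, the suggestion is a named argument so the call site documents itself; a minimal before/after sketch of the quoted line:

```scala
// Before: the bare boolean doesn't say what it controls.
input = out.toChunkedByteBuffer.toInputStream(true)

// After: the named argument makes it clear the backing buffer is disposed
// once the stream has been consumed.
input = out.toChunkedByteBuffer.toInputStream(dispose = true)
```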
    || corruptedBlocks.contains(blockId)) {
  throwFetchFailedException(blockId, address, e)
} else {
  logWarning(s"got an corrupted block $blockId from $address, fetch again")
Can we log the IOException here? It looks like the exception isn't logged or rethrown from this branch and I think we'll need that information to help debug problems here.
I think the IOException would be set as the cause of the FetchFailedException.
@lins05 It's already set as the cause for FetchFailedException.
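To illustrate that point in isolation (the class and names below are stand-ins, not the PR's code): when the fetch finally gives up, the original IOException is attached as the cause of the thrown fetch-failure exception, so its message and stack trace still reach the logs even though the retry branch only warns.

```scala
import java.io.IOException

object CauseChainSketch {
  // Stand-in for Spark's FetchFailedException: the IOException rides along as the cause.
  final class FetchFailedDemo(blockId: String, address: String, cause: IOException)
    extends Exception(s"Failed to fetch $blockId from $address", cause)

  def main(args: Array[String]): Unit = {
    val io = new IOException("corrupt compressed stream")
    try {
      throw new FetchFailedDemo("shuffle_2_11_275", "host-a:7337", io)
    } catch {
      case f: FetchFailedDemo =>
        // The cause is preserved, so logging the fetch failure also surfaces the IOException.
        println(s"${f.getMessage}; cause = ${f.getCause}")
    }
  }
}
```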
// Decompress the whole block at once to detect any corruption, which could increase
// the memory usage and potentially increase the chance of OOM.
// TODO: manage the memory used here, and spill it into disk in case of OOM.
Utils.copyStream(input, out)
Do we need to close the input stream here? There might be resources in the decompressor which need to be freed.
done
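A minimal sketch of the cleanup being asked for, assuming nothing about the PR beyond the lines quoted above (the helper name and the copy callback are hypothetical): close the wrapped stream in a finally block so the decompressor's buffers are released even if the copy throws.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}

object CloseAfterCopySketch {
  // Hypothetical helper mirroring the suggestion: always close the wrapped input
  // stream so decompression resources are freed, even if the copy throws midway.
  def copyAndClose(input: InputStream, out: OutputStream)
                  (copy: (InputStream, OutputStream) => Unit): Unit = {
    try {
      copy(input, out) // may throw IOException on a corrupt block
    } finally {
      input.close()    // released on both the success and failure paths
      out.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val in = new ByteArrayInputStream("hello".getBytes("UTF-8"))
    val out = new ByteArrayOutputStream()
    copyAndClose(in, out) { (i, o) =>
      val buf = new Array[Byte](1024)
      var n = i.read(buf)
      while (n != -1) { o.write(buf, 0, n); n = i.read(buf) }
    }
    println(out.toString("UTF-8")) // "hello"
  }
}
```

If I recall its signature correctly, Spark's own Utils.copyStream also takes a closeStreams flag that can serve the same purpose.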
  serializerManager.wrapStream(blockId, inputStream)
}
SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
SparkEnv.get.conf.getBoolean("spark.shuffle.tryDecompress", true))
nit: maybe detectCorrupt is slightly better than tryDecompress?
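For illustration, with the suggested rename the read would look roughly like this (key name per the suggestion; the default of true is an assumption):

```scala
import org.apache.spark.SparkConf

object DetectCorruptConfigSketch extends App {
  val conf = new SparkConf()
  // With the suggested name, the intent is obvious at the call site.
  val detectCorrupt = conf.getBoolean("spark.shuffle.detectCorrupt", true)
  println(s"spark.shuffle.detectCorrupt = $detectCorrupt")
}
```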
import org.apache.spark.network.shuffle.{BlockFetchingListener, ShuffleClient}
import org.apache.spark.shuffle.FetchFailedException
import org.apache.spark.util.Utils
import org.apache.spark.util.io.{ChunkedByteBufferInputStream, ChunkedByteBufferOutputStream}
Seems ChunkedByteBufferInputStream is not used here.
/** Current number of requests in flight */
private[this] var reqsInFlight = 0

/** The blocks that can't be decompressed successfully */
What about adding more explanation, for example:
/**
 * The blocks that can't be decompressed successfully.
 * It is used to guarantee that we retry at most once for those corrupted blocks.
 */
I manually tested this patch with a job that usually failed because of a corrupt stream. As the logging showed, the shuffle fetcher got some corrupt blocks for partition 275; it retried once, and the task finally succeeded. But the retry cannot protect all the tasks: some failed with FetchFailed, so the stage was retried.
Stage 26 succeeded after being retried twice. Another observation is that all the corruption happened on only 2 of the 26 nodes, and a few broadcast blocks were also corrupt on them. It seems that the corruption happens on the receiving (fetcher) side of the network. I will update the patch to address the comments.
Test build #69261 has finished for PR 15923 at commit
Test build #3443 has started for PR 15923 at commit
Test build #3444 has started for PR 15923 at commit
Test build #69422 has finished for PR 15923 at commit
Test build #3448 has finished for PR 15923 at commit
cc @zsxwing @JoshRosen does this look good?
ping @JoshRosen
This looks good overall, but one nit: it looks like we don't have any test coverage for the case where
@JoshRosen Added a test for
Test build #69825 has finished for PR 15923 at commit
Test build #3475 has finished for PR 15923 at commit
Test build #3480 has started for PR 15923 at commit
Test build #3481 has finished for PR 15923 at commit
LGTM
Merging to master.
var result: FetchResult = null
var input: InputStream = null
// Take the next fetched result and try to decompress it to detect data corruption,
// then fetch it one more time if it's corrupt, throw FailureFetchResult if the second fetch
@davies Could you elaborate a bit here? In my mind TCP provides pretty robust data transfer, which means that if there is an error, the data was written to disk corrupted, and fetching it one more time won't help.
We have observed in production a few failures related to this on virtualized environments. It is entirely possible there is a bug in the underlying networking stack, or a bug in Spark's networking stack. Either way, this eliminates those issues.
@fathersson The checksum in TCP is only 16 bits, which is not strong enough for heavy traffic; DFS and other systems with heavy TCP traffic usually add another application-level checksum. Adding to @rxin's point, we did see this retry help in production to work around temporary corruption.
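For illustration, this is roughly what an application-level checksum looks like: a minimal sketch using CRC32 from the JDK. It is not what this PR implements (the checksum approach was explored separately in #15894), and all the names are made up for the example.

```scala
import java.util.zip.CRC32

object BlockChecksumSketch {
  // Compute a 32-bit CRC over a block; the sender would store or ship this value and
  // the receiver would recompute it to detect corruption before decompressing.
  def crc32(block: Array[Byte]): Long = {
    val crc = new CRC32()
    crc.update(block, 0, block.length)
    crc.getValue
  }

  def main(args: Array[String]): Unit = {
    val block = Array.tabulate[Byte](1024)(i => (i % 251).toByte)
    val expected = crc32(block)
    block(500) = (block(500) ^ 0x01).toByte // simulate a single-bit flip in transit
    println(s"corruption detected: ${crc32(block) != expected}") // true
  }
}
```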
Is netty/shuffle data being compressed using the Snappy algorithm by default? If so, it might be a good idea to enable checksum checking at the Netty level too?
https://netty.io/4.0/api/io/netty/handler/codec/compression/SnappyFramedDecoder.html
Note that by default, validation of the checksum header in each chunk is DISABLED for performance improvements. If performance is less of an issue, or if you would prefer the safety that checksum validation brings, please use the SnappyFramedDecoder(boolean) constructor with the argument set to true.
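For completeness, enabling that validation would look roughly like the following (a hedged sketch based on the linked Javadoc; as the reply below notes, Spark does not use Netty's Snappy codec, so this is purely illustrative):

```scala
import io.netty.channel.ChannelInitializer
import io.netty.channel.socket.SocketChannel
import io.netty.handler.codec.compression.{SnappyFramedDecoder, SnappyFramedEncoder}

// Sketch: a Netty pipeline with checksum validation turned on in the decoder,
// using the boolean constructor documented in the quoted Javadoc.
class SnappyChecksumInitializer extends ChannelInitializer[SocketChannel] {
  override def initChannel(ch: SocketChannel): Unit = {
    ch.pipeline()
      .addLast(new SnappyFramedEncoder())
      .addLast(new SnappyFramedDecoder(true)) // validate checksums
  }
}
```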
@Tagar Spark doesn't use Netty's Snappy compression.
[SPARK-4105] retry the fetch or stage if shuffle block is corrupt (backport from upstream master to databricks branch 2.1). Author: Davies Liu <davies@databricks.com> Closes apache#15923 from davies/detect_corrupt. Author: Davies Liu <[email protected]> Closes apache#159 from ericl/sc-5362.
What changes were proposed in this pull request?
There is an outstanding issue that has existed for a long time: sometimes the shuffle blocks are corrupt and can't be decompressed. We recently hit this in three different workloads; sometimes we can reproduce it on every try, sometimes we can't. I also found that when the corruption happened, the beginning and end of the blocks were correct and the corruption was in the middle. In one case the string of a block id was corrupted by one character. It seems very likely that the corruption is introduced by some weird machine/hardware, and the 16-bit checksum in TCP is not strong enough to identify all of the corruption.
Unfortunately, Spark does not have checksums for shuffle blocks or broadcast, so the job will fail if any corruption happens in a shuffle block read from disk or in a broadcast block sent over the network. This PR tries to detect the corruption after fetching shuffle blocks by decompressing them, because most of the compression codecs already have checksums built in. It will retry the block, or fail with a FetchFailure, so the previous stage can be retried on different (still random) machines.
Checksums for broadcast will be added in another PR.
How was this patch tested?
Added unit tests
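Tying the description together, below is a hedged, self-contained sketch of the fetcher-side policy only (the object, method, and threshold names are illustrative; the real logic lives in ShuffleBlockFetcherIterator): small wrapped blocks are eagerly checked, the first corrupt copy of a block is re-fetched, and a second corrupt copy raises a fetch failure so the previous stage can be resubmitted.

```scala
import java.io.IOException
import scala.collection.mutable

object CorruptionRetryPolicySketch {
  sealed trait Outcome
  case object Pass extends Outcome        // handed to the reader without the eager check
  case object Ok extends Outcome          // eagerly decompressed, no corruption found
  case object Refetch extends Outcome     // first corrupt copy: fetch this block again
  case object FetchFailed extends Outcome // second corrupt copy: fail so the stage retries

  private val corruptedBlocks = mutable.HashSet[String]()
  private val maxBytesInFlight = 48L * 1024 * 1024

  /** `check` fully decompresses the fetched bytes and throws IOException on corruption. */
  def onBlockFetched(blockId: String, size: Long, wrapped: Boolean)
                    (check: () => Unit): Outcome = {
    if (!wrapped || size >= maxBytesInFlight) {
      Pass // not compressed/encrypted, or too large to buffer safely in memory
    } else {
      try { check(); Ok }
      catch {
        case _: IOException =>
          // Remember the block id; a repeat offender is promoted to a fetch failure.
          if (corruptedBlocks.add(blockId)) Refetch else FetchFailed
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val corrupt = () => throw new IOException("corrupt compressed stream")
    println(onBlockFetched("shuffle_0_1_2", 1 << 20, wrapped = true)(corrupt))  // Refetch
    println(onBlockFetched("shuffle_0_1_2", 1 << 20, wrapped = true)(corrupt))  // FetchFailed
    println(onBlockFetched("shuffle_0_3_2", 1 << 20, wrapped = true)(() => ())) // Ok
  }
}
```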