[SPARK-27677][Core] Serve local disk persisted blocks by the external service after releasing executor by dynamic allocation #24499
Conversation
FYI, I didn't look at the tests yet.
The one thing that I noticed is that the logic here is based on whether the storage level says the block may be stored on disk, not whether the block is actually stored on disk.
Wouldn't this break (as in you'd lose cached data) if you have MEMORY_AND_DISK persistence, but a particular block is currently just sitting in memory?
...work-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
core/src/main/scala/org/apache/spark/network/BlockTransferClientSync.scala
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
Test build #105037 has finished for PR 24499 at commit
@vanzin Thanks for the review! Yes, the storage level is used in two contexts:
I have checked and in
Test build #105041 has finished for PR 24499 at commit
Test build #105045 has finished for PR 24499 at commit
@srowen Could it be something off with the Java style tests? That error reported in the 2nd commit was even present in my 1st commit (a space before the
Test build #105048 has finished for PR 24499 at commit
...ork-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
core/src/main/scala/org/apache/spark/network/SyncBlockTransferClient.scala
Test build #105069 has finished for PR 24499 at commit
Test build #105089 has finished for PR 24499 at commit
core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
core/src/test/scala/org/apache/spark/ExternalShuffleServiceSuite.scala
core/src/test/scala/org/apache/spark/storage/BlockManagerInfoSuite.scala
Test build #105095 has finished for PR 24499 at commit
The last commit was tested manually (in standalone mode): And there was no block recalculation:
Test build #105104 has finished for PR 24499 at commit
Test build #105109 has finished for PR 24499 at commit
Jenkins retest this please. The failure is unrelated; locally it was running fine (although with python2.7):
Jenkins retest this please
Test build #105114 has finished for PR 24499 at commit
Just some small things.
...ork-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
core/src/test/scala/org/apache/spark/storage/BlockManagerInfoSuite.scala
Test build #105161 has finished for PR 24499 at commit
just a really brief review so far
core/src/test/scala/org/apache/spark/storage/BlockManagerInfoSuite.scala
Test build #105217 has finished for PR 24499 at commit
Test build #105226 has finished for PR 24499 at commit
still need to go through tests, but implementation makes sense to me
...src/test/java/org/apache/spark/network/shuffle/CleanupNonShuffleServiceServedFilesSuite.java
core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
...etwork-common/src/test/java/org/apache/spark/network/server/OneForOneStreamManagerSuite.java
Jenkins, retest this please
Test build #105232 has finished for PR 24499 at commit
I think it is important to mention why, in the previous commit, the Line 103 in dfeeda2
Without this setting the runtime of the test with the corrupt file (ExternalShuffleIntegrationSuite#testFetchCorruptRddBlock) increases dramatically (with 0 retries it takes only 0.2 seconds, but with 3 retries it goes up to 15 seconds). I think it is because the error is detected at a very deep level within Netty and the channel is closed right here: Lines 131 to 133 in cc7aea0
So not the quick
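(For context, a hedged sketch of what disabling the retries could look like in a test `SparkConf`. The exact config key touched by the commit is elided above, so `spark.shuffle.io.maxRetries` is only an assumption here; with the default of 3 retries and a 5 second retry wait it would be consistent with the roughly 15 seconds observed above.)
~~~scala
import org.apache.spark.SparkConf

// Assumed key: spark.shuffle.io.maxRetries (the comment above elides the exact
// setting). With 0 retries a corrupt disk-persisted block fails fast instead of
// going through the default retry/wait cycle.
val testConf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "0")
~~~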
Moving the discussion about file deletion and null buffers to the top-level so it doesn't get folded on code updates:
Just to understand what you're protecting against here -- is there an expected path where the file gets deleted? Or are you just trying to have the behavior be a little more understandable when something bad happens on the host and the file goes missing?
There is no code path that I know of where the files are deleted. I am just trying to make this as robust as possible (considering, of course, the price of this robustness), and I would like to know in advance how it will behave if something similar happens. This is why I added the new test.
Force-pushed from cac5d71 to e3adc05.
Some more nits. I need to do another more careful pass on the whole patch.
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/Constant.java
...src/test/java/org/apache/spark/network/shuffle/CleanupNonShuffleServiceServedFilesSuite.java
...n/network-shuffle/src/test/java/org/apache/spark/network/shuffle/TestShuffleDataContext.java
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala
Test build #105638 has finished for PR 24499 at commit
Test build #105680 has finished for PR 24499 at commit
Jenkins retest this please
Jenkins retest this please
Looks good, just a small test nit (and a revert of a previous suggestion).
...ork-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
...src/test/java/org/apache/spark/network/shuffle/CleanupNonShuffleServiceServedFilesSuite.java
core/src/test/scala/org/apache/spark/ExternalShuffleServiceSuite.scala
core/src/test/scala/org/apache/spark/storage/BlockManagerInfoSuite.scala
Test build #105700 has finished for PR 24499 at commit
Test build #105720 has finished for PR 24499 at commit
Alright, good to go. Merging to master.
public boolean equals(Object other) {
  if (other != null && other instanceof BlocksRemoved) {
    BlocksRemoved o = (BlocksRemoved) other;
    return Objects.equal(numRemovedBlocks, o.numRemovedBlocks);
why do we compare 2 ints with Objects.equal?
IDEs are usually good at generating equals and hashCode for java classes, maybe we can use the generated version.
You are right (I followed the pattern used within the same package like in ExecutorShuffleInfo).
I can open a minor PR fixing these two, or I can add this tiny change to my next PR, which might be opened next week or the week after. That one is about avoiding the network when fetching shuffle blocks from the block manager running on the same host, so it is only loosely related.
Which one do you prefer?
both are fine, this is really trivial
… service after releasing executor by dynamic allocation
An executor which has persisted blocks is not considered to be idle, and therefore is not released by dynamic allocation after the regular timeout `spark.dynamicAllocation.executorIdleTimeout`; a separate configuration, `spark.dynamicAllocation.cachedExecutorIdleTimeout`, applies instead and defaults to `Integer.MAX_VALUE`. This is because releasing the executor also means losing the persisted blocks (as the metadata for the individual blocks, called `BlockInfo`, is kept in memory), and when the RDD is referenced later on these lost blocks will have to be recomputed.
On the other hand, keeping executors around for too long without any task to work on is also a waste of resources (as executors are reserved for the application by the resource manager).
This PR focuses on the first part of SPARK-25888: it extends the external shuffle service with the capability to serve RDD blocks which are persisted to the local disk store by the executors. Moreover, when this feature is enabled by setting the `spark.shuffle.service.fetch.rdd.enabled` config to true and a block is reported to be persisted to disk, the external shuffle service instance running on the same host as the executor is also registered (along with the reporting block manager) as a possible location for fetching it.
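(As an illustration only, a minimal `SparkConf` sketch of how the feature would be switched on together with dynamic allocation; the timeout values are just examples taken from the manual test below.)
~~~scala
import org.apache.spark.SparkConf

// Sketch: enable the external shuffle service, let it serve disk-persisted
// RDD blocks, and bound how long executors holding only cached blocks are kept.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.fetch.rdd.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "30s")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "90s")
~~~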
Some explanation of the decisions made during development:
- The location list for fetching a block was already randomized, but the order of the groups (same host, same rack, others) was kept. In this PR the order of the groups is still kept and the external shuffle service is added to the end of each group.
- No `BlockManagerInfo` is introduced for the external shuffle service; only a lightweight solution is taken: a hash map from `BlockId` to `BlockStatus`. A type alias would make the source more readable, but I know it is discouraged. On the other hand, a new class wrapping this hash map would introduce unnecessary indirection (a rough sketch of this bookkeeping follows the list below).
- When this feature is on, the cleanup triggered during the removal of executors (which is handled in `ExternalShuffleBlockResolver`) is modified to keep the disk-persisted RDD blocks. This cleanup is triggered in standalone mode when the `spark.storage.cleanupFilesAfterExecutorExit` config is set.
- The unpersisting of an RDD is extended to use the external shuffle service for disk-persisted RDD blocks when the original executor which created the blocks has already been released. New block transport messages are introduced to support this: `RemoveBlocks` and `BlocksRemoved`.
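(A rough sketch of that lightweight bookkeeping, with made-up names and not the actual code in `BlockManagerMasterEndpoint`.)
~~~scala
import scala.collection.mutable
import org.apache.spark.storage.{BlockId, BlockStatus, RDDBlockId, StorageLevel}

// Hypothetical sketch: a plain hash map per external shuffle service location,
// tracking the status of blocks it can serve (no wrapper class, no type alias).
val blocksServedByShuffleService = new mutable.HashMap[BlockId, BlockStatus]

// When an executor reports an RDD block persisted to its local disk, the
// co-located shuffle service is recorded as a possible source for it too.
val blockId: BlockId = RDDBlockId(rddId = 0, splitIndex = 3)
blocksServedByShuffleService(blockId) =
  BlockStatus(StorageLevel.DISK_ONLY, memSize = 0L, diskSize = 1024L)

// Only blocks that actually have bytes on disk can be served by the service.
val fetchableBlocks = blocksServedByShuffleService.collect {
  case (id, status) if status.diskSize > 0 => id
}
~~~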
Here the complete use case is tested in `ExternalShuffleServiceSuite` by the test "SPARK-25888: using external shuffle service fetching disk persisted blocks", with a tiny difference: here the executor is killed manually, which makes the test a bit faster than waiting for the idle timeout.
`ExternalShuffleBlockHandlerSuite` tests the fetching of the RDD blocks via the external shuffle service.
`BlockManagerInfoSuite` is a new suite. As the `BlockManagerInfo` behaviour depends very much on whether the external shuffle service is enabled or not, all the tests are executed both with and without it.
`BlockManagerSuite` tests the sorting of the block locations.
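(To make that ordering concrete, a simplified sketch using plain strings instead of `BlockManagerId` — not the code under test.)
~~~scala
import scala.util.Random

// Simplified sketch of the location ordering: executors are shuffled within
// their group, the group order (same host, same rack, others) is preserved,
// and the external shuffle service entry (marked "/ess" here) is moved to the
// end of its group.
def sortLocations(
    sameHost: Seq[String],
    sameRack: Seq[String],
    others: Seq[String]): Seq[String] = {
  def order(group: Seq[String]): Seq[String] = {
    val (shuffleServices, executors) = group.partition(_.endsWith("/ess"))
    Random.shuffle(executors) ++ shuffleServices
  }
  order(sameHost) ++ order(sameRack) ++ order(others)
}

// e.g. sortLocations(Seq("exec-1", "host-a/ess"), Seq("exec-2"), Nil)
// always keeps the same-host group first, ending with "host-a/ess".
~~~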
The Spark app used for the manual test on YARN was:
~~~scala
package com.mycompany
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
object TestAppDiskOnlyLevel {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("test-app")
println("Attila: START")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(0 until 100, 10)
.map { i =>
println(s"Attila: calculate first rdd i=$i")
Thread.sleep(1000)
i
}
rdd.persist(StorageLevel.DISK_ONLY)
rdd.count()
println("Attila: First RDD is processed, waiting for 60 sec")
Thread.sleep(60 * 1000)
println("Attila: Num executors must be 0 as executorIdleTimeout is way over")
val rdd2 = sc.parallelize(0 until 10, 1)
.map(i => (i, 1))
.persist(StorageLevel.DISK_ONLY)
rdd2.count()
println("Attila: Second RDD with one partition (only one executors must be alive)")
// reduce runs as user code to detect the empty seq (empty blocks)
println("Calling collect on the first RDD: " + rdd.collect().reduce(_ + _))
println("Attila: STOP")
}
}
~~~
I submitted it with the following configuration:
~~~bash
spark-submit --master yarn \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=30 \
--conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90 \
--class com.mycompany.TestAppDiskOnlyLevel dyn_alloc_demo-core_2.11-0.1.0-SNAPSHOT-jar-with-dependencies.jar
~~~
Checked the result by filtering for the side effect of the task calculations:
~~~bash
[userserver ~]$ yarn logs -applicationId application_1556299359453_0001 | grep "Attila: calculate" | wc -l
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
19/04/26 10:31:59 INFO client.RMProxy: Connecting to ResourceManager at apiros-1.gce.company.com/172.31.115.165:8032
100
~~~
So there were only 100 task executions and not 200 (which would be the case if the blocks of the first RDD had been recomputed for the final `collect()`).
Moreover, from the submit/launcher log we can see the executors were really stopped in the meantime (see "new total is 0" before the last line):
~~~
[userserver ~]$ grep "Attila: Num executors must be 0" -B 2 spark-submit.log
19/04/26 10:24:27 INFO cluster.YarnScheduler: Executor 9 on apiros-3.gce.company.com killed by driver.
19/04/26 10:24:27 INFO spark.ExecutorAllocationManager: Existing executor 9 has been removed (new total is 0)
Attila: Num executors must be 0 as executorIdleTimeout is way over
~~~
[Full spark submit log](https://github.com/attilapiros/spark/files/3122465/spark-submit.log)
I have also run a test after changing the `DISK_ONLY` storage level to `MEMORY_ONLY` for the first RDD. After this change, no executor was removed during the 60 seconds of waiting.
Closes apache#24499 from attilapiros/SPARK-25888-final.
Authored-by: “attilapiros” <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
… service after releasing executor by dynamic allocation
Closes apache#24499 from attilapiros/SPARK-25888-final.
Authored-by: “attilapiros” <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit e9f3f62)
What changes were proposed in this pull request?
Problem statement
An executor which has persisted blocks is not considered to be idle, and therefore is not released by dynamic allocation after the regular timeout `spark.dynamicAllocation.executorIdleTimeout`; a separate configuration, `spark.dynamicAllocation.cachedExecutorIdleTimeout`, applies instead and defaults to `Integer.MAX_VALUE`. This is because releasing the executor also means losing the persisted blocks (as the metadata for the individual blocks, called `BlockInfo`, is kept in memory), and when the RDD is referenced later on these lost blocks will have to be recomputed.
On the other hand, keeping executors around for too long without any task to work on is also a waste of resources (as executors are reserved for the application by the resource manager).
Solution
This PR focuses on the first part of SPARK-25888: it extends the external shuffle service with the capability to serve RDD blocks which are persisted to the local disk store by the executors. Moreover, when this feature is enabled by setting the `spark.shuffle.service.fetch.rdd.enabled` config to true and a block is reported to be persisted to disk, the external shuffle service instance running on the same host as the executor is also registered (along with the reporting block manager) as a possible location for fetching it.
Some implementation details
Some explanation of the decisions made during development:
- The location list for fetching a block was already randomized, but the order of the groups (same host, same rack, others) was kept. In this PR the order of the groups is still kept and the external shuffle service is added to the end of each group.
- No `BlockManagerInfo` is introduced for the external shuffle service; only a lightweight solution is taken: a hash map from `BlockId` to `BlockStatus`. A type alias would make the source more readable, but I know it is discouraged. On the other hand, a new class wrapping this hash map would introduce unnecessary indirection.
- When this feature is on, the cleanup triggered during the removal of executors (which is handled in `ExternalShuffleBlockResolver`) is modified to keep the disk-persisted RDD blocks. This cleanup is triggered in standalone mode when the `spark.storage.cleanupFilesAfterExecutorExit` config is set.
- The unpersisting of an RDD is extended to use the external shuffle service for disk-persisted RDD blocks when the original executor which created the blocks has already been released. New block transport messages are introduced to support this: `RemoveBlocks` and `BlocksRemoved`.
How was this patch tested?
Unit tests
ExternalShuffleServiceSuite
Here the complete use case is tested by the test "SPARK-25888: using external shuffle service fetching disk persisted blocks", with a tiny difference: here the executor is killed manually, which makes the test a bit faster than waiting for the idle timeout.
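(A hedged outline of that manual-kill flow, not the suite's actual code; the executor id below is a placeholder, a real test would look it up.)
~~~scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Hypothetical outline: persist to disk, drop the executor manually instead of
// waiting for the idle timeout, then read the blocks back through the external
// shuffle service rather than recomputing them.
def diskBlocksSurviveExecutorLoss(sc: SparkContext): Unit = {
  val rdd = sc.parallelize(1 to 100, 4).persist(StorageLevel.DISK_ONLY)
  assert(rdd.count() == 100)      // materialize the disk-persisted blocks

  sc.killExecutors(Seq("1"))      // "1" is a placeholder executor id

  assert(rdd.count() == 100)      // now served via the external shuffle service
}
~~~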
ExternalShuffleBlockHandlerSuite
Tests the fetching of the RDD blocks via the external shuffle service.
BlockManagerInfoSuite
This is a new suite. As the `BlockManagerInfo` behaviour depends very much on whether the external shuffle service is enabled or not, all the tests are executed both with and without it.
BlockManagerSuite
Tests the sorting of the block locations.
Manually on YARN
Spark App was:
I submitted it with the following configuration:
Checked the result by filtering for the side effect of the task calculations:
So there were only 100 task executions and not 200 (which would be the case if the blocks of the first RDD had been recomputed for the final `collect()`).
Moreover, from the submit/launcher log we can see the executors were really stopped in the meantime (see "new total is 0" before the last line):
[Full spark submit log](https://github.com/attilapiros/spark/files/3122465/spark-submit.log)
I have also run a test after changing the `DISK_ONLY` storage level to `MEMORY_ONLY` for the first RDD. After this change, no executor was removed during the 60 seconds of waiting.