[DCOS-45840] CNI support for shuffle jobs #44

akirillov · 2019-01-10T01:04:11Z

What changes were proposed in this pull request?

Previously, when a virtual network was specified via spark.mesos.network.name, executors were instructed to bind to the 0.0.0.0 address. That address was then advertised to other executors, which led to failures during shuffle jobs. Now, when a virtual network is enabled, the responsibility for discovering the bind address is moved to the executor, which allows it to bind to an address within that network.
The Docker containerizer is fixed to allow running in a virtual/overlay network by providing additional properties to the DockerInfo builder.
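
The bind-address change described above can be illustrated with a minimal sketch. This is not the code from this PR: `BindAddressSketch`, `resolveBindAddress`, and its parameters are hypothetical names chosen for illustration, and the fallback to the local host address is an assumption. The only details taken from this PR are that a virtual network is selected via spark.mesos.network.name and that the executor binds to SPARK_LOCAL_IP instead of 0.0.0.0.

```scala
import java.net.InetAddress

// Hypothetical sketch, not the implementation from this PR.
object BindAddressSketch {

  /** Returns the address an executor should bind to and advertise.
    *
    * @param virtualNetwork value of spark.mesos.network.name, if set
    * @param env            environment variables visible to the executor
    * @param agentHostname  hostname taken from the Mesos offer
    */
  def resolveBindAddress(
      virtualNetwork: Option[String],
      env: Map[String, String],
      agentHostname: String): String = {
    virtualNetwork match {
      // Virtual network enabled: the executor discovers its own address
      // instead of being told to bind to 0.0.0.0.
      case Some(_) =>
        env.get("SPARK_LOCAL_IP")
          .filter(_.nonEmpty)
          // Assumed fallback when SPARK_LOCAL_IP is not set.
          .getOrElse(InetAddress.getLocalHost.getHostAddress)
      // No virtual network: keep using the hostname from the offer.
      case None => agentHostname
    }
  }
}
```

With a virtual network set and SPARK_LOCAL_IP present in the environment, the function returns that IP, so peers fetching shuffle blocks are given a routable address rather than the wildcard address.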

How was this patch tested?

Unit tests were added to validate the changes proposed in this PR.
Integration tests from mesosphere/spark-build cover both the Docker and Mesos containerizers.

samvantran

The code looks good to me but the tests fail in some interesting ways.

It looks like the driver cannot bind to the localhost IP here, and I noticed that both SparkUI and ServerConnector still bind to 0.0.0.0.

samvantran · 2019-01-10T18:50:41Z

...esos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala

nit: add space b/w foreach {

thanks, fixed it

akirillov · 2019-01-10T19:47:36Z

@samvantran it looks like the master branch of mesosphere/spark-build is used for tests because the failing tests in this build were renamed and e.g. test_cni doesn't exist anymore. Was the branch mapping logic applied to these builds to use the similarly named branch from mesosphere/spark-build? Otherwise, I can trigger the rebuild pointing to a matching branch.

samvantran · 2019-01-10T20:09:15Z

Looks like a bug in the CI code: https://teamcity.mesosphere.io/viewLog.html?buildId=1459230&buildTypeId=DataServices_Spark_PR_Spark_master_strict&tab=buildLog&state=&expand=none#_state=534,525

cutting for git branch doesn't work as intended. I'll fix

mpereira · 2019-01-15T17:16:58Z

Re-triggered the failed CI test.

…ad of 0.0.0.0 to properly advertise executors during shuffle (#44)

* Support for DSCOS_SERVICE_ACCOUNT_CREDENTIAL environment variable in MesosClusterScheduler
* File Based Secrets support
* [SPARK-723][SPARK-740] Add Metrics to Dispatcher and Driver
  - Counters: The total number of times that submissions have entered states
  - Timers: The duration from submit or launch until a submission entered a given state
  - Histogram: The retry counts at time of retry
* Fixes to handling finished drivers
  - Rename 'failed' case to 'exception'
  - When a driver is 'finished', record its final MesosTaskState
  - Fix naming consistency after seeing how they look in practice
* Register "finished" counters up-front
  Otherwise their values are never published.
* [SPARK-692] Added spark.mesos.executor.gpus to specify the number of Executor GPUs
* [SPARK-23941][MESOS] Mesos task failed on specific spark app name (#33)
  * [SPARK-23941][MESOS] Mesos task failed on specific spark app name
    Port from SPARK#21014
    ** edit: not a direct port from upstream Spark. Changes were needed because we saw PySpark jobs fail to launch when 1) run with docker and 2) including --py-files
    ==============
  * Shell escape only appName, mainClass, default and driverConf
    Specifically, we do not want to shell-escape the --py-files. What we've seen IRL is that for spark jobs that use docker images coupled w/ python files, the $MESOS_SANDBOX path is escaped and results in FileNotFoundErrors during py4j.SparkSession.getOrCreate
* [DCOS-39150][SPARK] Support unique Executor IDs in cluster managers (#36)
  Using incremental integers as Executor IDs leads to a situation when Spark Executors launched by different Drivers have same IDs. This leads to a situation when Mesos Task IDs for multiple Spark Executors are the same too. This PR prepends UUID unique for a CoarseGrainedSchedulerBackend instance to numeric ID thus allowing to distinguish Executors belonging to different drivers.
  This PR reverts commit ebe3c7f "[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…)"
* Upgrade of Hadoop, ZooKeeper, and Jackson libraries to fix CVEs. Updates for JSON-related tests. (#43)
  List of upgrades for 3rd-party libraries having CVEs:
  - Hadoop: 2.7.3 -> 2.7.7. Fixes: CVE-2016-6811, CVE-2017-3166, CVE-2017-3162, CVE-2018-8009
  - Jackson 2.6.5 -> 2.9.6. Fixes: CVE-2017-15095, CVE-2017-17485, CVE-2017-7525, CVE-2018-7489, CVE-2016-3720
  - ZooKeeper 3.4.6 -> 3.4.13 (https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html)
  # Conflicts:
  # dev/deps/spark-deps-hadoop-2.6
  # dev/deps/spark-deps-hadoop-2.7
  # dev/deps/spark-deps-hadoop-3.1
  # pom.xml
* CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP instead of 0.0.0.0 to properly advertise executors during shuffle (#44)
* Spark Dispatcher support for launching applications in the same virtual network by default (#45)
* [DCOS-46585] Fix supervised driver retry logic for outdated tasks (#46)
  This commit fixes a bug where `--supervised` drivers would relaunch after receiving an outdated status update from a restarted/crashed agent even if they had already been relaunched and running elsewhere. In those scenarios, previous logic would cause two identical jobs to be running and ZK state would only have a record of the latest one effectively orphaning the 1st job.
* Revert "[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults."
  This reverts commit 1024875.
  The change introduced in the reverted commit is breaking:
  - breaks semantics of `spark.master.rest.enabled` which belongs to Spark Standalone Master only but not to SparkSubmit
  - reverts the default behavior for Spark Standalone from REST to legacy RPC
  - contains misleading messages in `require` assertion blocks
  - prevents users from running jobs without specifying `spark.master.rest.enabled`
* [DCOS-49020] Specify user in CommandInfo for Spark Driver launched on Mesos (#49)
* [DCOS-40974] Mesos checkpointing support for Spark Drivers (#51)
* [DCOS-51158] Improved Task ID assignment for Executor tasks (#52)
* [DCOS-51454] Remove irrelevant Mesos REPL test (#54)
* [DCOS-51453] Added Hadoop 2.9 profile (#53)
* [DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the Driver when running on Mesos (#55)
* Refactoring of metrics naming to add mesos semantics and avoid clashing with existing Spark metrics (#58)
* [DCOS-34549] Mesos label NPE fix (#60)

akirillov requested review from imaxxs, mpereira, samvantran, soumasish and vishnu2kmohan January 10, 2019 01:04

akirillov mentioned this pull request Jan 10, 2019

[WIP][DCOS-37643] Use bootstrap to get Executor IP in Executor command #31

Closed

samvantran reviewed Jan 10, 2019

CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP inste…

0587099

…ad of 0.0.0.0 to properly advertise executors during shuffle

akirillov force-pushed the DCOS-45840-CNI-support-for-shuffle-jobs branch from 8d09427 to 0587099 Compare January 10, 2019 19:42

akirillov merged this pull request into custom-branch-2.3.x Jan 15, 2019

akirillov added a commit that referenced this pull request Jan 16, 2019

CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP inste…

78d6c37

…ad of 0.0.0.0 to properly advertise executors during shuffle (#44)

akirillov added a commit that referenced this pull request Feb 8, 2019

CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP inste…

cec7c65

…ad of 0.0.0.0 to properly advertise executors during shuffle (#44)

vishnu2kmohan deleted the DCOS-45840-CNI-support-for-shuffle-jobs branch February 19, 2019 19:17

alembiewski pushed a commit that referenced this pull request Jun 12, 2019

CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP inste…

b86608b

…ad of 0.0.0.0 to properly advertise executors during shuffle (#44)

alembiewski mentioned this pull request Jun 13, 2019

[DCOS-54813] Base tech update from 2.4.0 to 2.4.3 #62

Merged

akirillov restored the DCOS-45840-CNI-support-for-shuffle-jobs branch August 16, 2019 21:12

akirillov deleted the DCOS-45840-CNI-support-for-shuffle-jobs branch August 16, 2019 21:58
