[DCOS-51158] Improved Task ID assignment for Executor tasks #52

akirillov · 2019-04-02T23:22:01Z

What changes were proposed in this pull request?

Bugfix for Spark Executor ID being different from Mesos Task ID
Bugfix: thread-safe sequential task IDs for Executors

How was this patch tested?

unit tests from this repo
Integration tests from mesosphere/spark-build

Release notes

[Bugfix] Fixed bug for Spark Executor ID being different from Mesos Task ID

akirillov · 2019-04-04T00:49:00Z

Closing this PR due to incompatibility with DCOS CLI. Whenever we try to use scheduler task ID as a part of child task IDs we’re going to hit this CLI issue:

dcos task log --completed --lines=1000 driver-20190404003245-0001
found more than one task with the same name: [
'ae94906c-af9c-4cb2-8976-8be248c401cf-0009-driver-20190404003245-0001-1', 
'ae94906c-af9c-4cb2-8976-8be248c401cf-0009-driver-20190404003245-0001-2', 
'ae94906c-af9c-4cb2-8976-8be248c401cf-0009-driver-20190404003245-0001-0', 
'driver-20190404003245-0001']. 

Please provide a unique task name.

akirillov · 2019-04-04T01:02:25Z

Reopened the PR due to Executor ID and thread safety bugfixes

samvantran

LGTM!

* Support for DSCOS_SERVICE_ACCOUNT_CREDENTIAL environment variable in MesosClusterScheduler * File Based Secrets support * [SPARK-723][SPARK-740] Add Metrics to Dispatcher and Driver - Counters: The total number of times that submissions have entered states - Timers: The duration from submit or launch until a submission entered a given state - Histogram: The retry counts at time of retry * Fixes to handling finished drivers - Rename 'failed' case to 'exception' - When a driver is 'finished', record its final MesosTaskState - Fix naming consistency after seeing how they look in practice * Register "finished" counters up-front Otherwise their values are never published. * [SPARK-692] Added spark.mesos.executor.gpus to specify the number of Executor CPUs * [SPARK-23941][MESOS] Mesos task failed on specific spark app name (#33) * [SPARK-23941][MESOS] Mesos task failed on specific spark app name Port from SPARK#21014 ** edit: not a direct port from upstream Spark. Changes were needed because we saw PySpark jobs fail to launch when 1) run with docker and 2) including --py-files ============== * Shell escape only appName, mainClass, default and driverConf Specifically, we do not want to shell-escape the --py-files. What we've seen IRL is that for spark jobs that use docker images coupled w/ python files, the $MESOS_SANDBOX path is escaped and results in FileNotFoundErrors during py4j.SparkSession.getOrCreate * [DCOS-39150][SPARK] Support unique Executor IDs in cluster managers (#36) Using incremental integers as Executor IDs leads to a situation when Spark Executors launched by different Drivers have same IDs. This leads to a situation when Mesos Task IDs for multiple Spark Executors are the same too. This PR prepends UUID unique for a CoarseGrainedSchedulerBackend instance to numeric ID thus allowing to distinguish Executors belonging to different drivers. This PR reverts commit ebe3c7f "[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…)" * Upgrade of Hadoop, ZooKeeper, and Jackson libraries to fix CVEs. Updates for JSON-related tests. (#43) List of upgrades for 3rd-party libraries having CVEs: - Hadoop: 2.7.3 -> 2.7.7. Fixes: CVE-2016-6811, CVE-2017-3166, CVE-2017-3162, CVE-2018-8009 - Jackson 2.6.5 -> 2.9.6. Fixes: CVE-2017-15095, CVE-2017-17485, CVE-2017-7525, CVE-2018-7489, CVE-2016-3720 - ZooKeeper 3.4.6 -> 3.4.13 (https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html) # Conflicts: # dev/deps/spark-deps-hadoop-2.6 # dev/deps/spark-deps-hadoop-2.7 # dev/deps/spark-deps-hadoop-3.1 # pom.xml * CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP instead of 0.0.0.0 to properly advertise executors during shuffle (#44) * Spark Dispatcher support for launching applications in the same virtual network by default (#45) * [DCOS-46585] Fix supervised driver retry logic for outdated tasks (#46) This commit fixes a bug where `--supervised` drivers would relaunch after receiving an outdated status update from a restarted/crashed agent even if they had already been relaunched and running elsewhere. In those scenarios, previous logic would cause two identical jobs to be running and ZK state would only have a record of the latest one effectively orphaning the 1st job. * Revert "[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults." This reverts commit 1024875. The change introduced in the reverted commit is breaking: - breaks semantics of `spark.master.rest.enabled` which belongs to Spark Standalone Master only but not to SparkSubmit - reverts the default behavior for Spark Standalone from REST to legacy RPC - contains misleading messages in `require` assertion blocks - prevents users from running jobs without specifying `spark.master.rest.enabled` * [DCOS-49020] Specify user in CommandInfo for Spark Driver launched on Mesos (#49) * [DCOS-40974] Mesos checkpointing support for Spark Drivers (#51) * [DCOS-51158] Improved Task ID assignment for Executor tasks (#52) * [DCOS-51454] Remove irrelevant Mesos REPL test (#54) * [DCOS-51453] Added Hadoop 2.9 profile (#53) * [DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the Driver when running on Mesos (#55) * Refactoring of metrics naming to add mesos semantics and avoid clashing with existing Spark metrics (#58) * [DCOS-34549] Mesos label NPE fix (#60)

* Support for DSCOS_SERVICE_ACCOUNT_CREDENTIAL environment variable in MesosClusterScheduler * File Based Secrets support * [SPARK-723][SPARK-740] Add Metrics to Dispatcher and Driver - Counters: The total number of times that submissions have entered states - Timers: The duration from submit or launch until a submission entered a given state - Histogram: The retry counts at time of retry * Fixes to handling finished drivers - Rename 'failed' case to 'exception' - When a driver is 'finished', record its final MesosTaskState - Fix naming consistency after seeing how they look in practice * Register "finished" counters up-front Otherwise their values are never published. * [SPARK-692] Added spark.mesos.executor.gpus to specify the number of Executor CPUs * [SPARK-23941][MESOS] Mesos task failed on specific spark app name (#33) * [SPARK-23941][MESOS] Mesos task failed on specific spark app name Port from SPARK#21014 ** edit: not a direct port from upstream Spark. Changes were needed because we saw PySpark jobs fail to launch when 1) run with docker and 2) including --py-files ============== * Shell escape only appName, mainClass, default and driverConf Specifically, we do not want to shell-escape the --py-files. What we've seen IRL is that for spark jobs that use docker images coupled w/ python files, the $MESOS_SANDBOX path is escaped and results in FileNotFoundErrors during py4j.SparkSession.getOrCreate * [DCOS-39150][SPARK] Support unique Executor IDs in cluster managers (#36) Using incremental integers as Executor IDs leads to a situation when Spark Executors launched by different Drivers have same IDs. This leads to a situation when Mesos Task IDs for multiple Spark Executors are the same too. This PR prepends UUID unique for a CoarseGrainedSchedulerBackend instance to numeric ID thus allowing to distinguish Executors belonging to different drivers. This PR reverts commit ebe3c7f "[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…)" * Upgrade of Hadoop, ZooKeeper, and Jackson libraries to fix CVEs. Updates for JSON-related tests. (#43) List of upgrades for 3rd-party libraries having CVEs: - Hadoop: 2.7.3 -> 2.7.7. Fixes: CVE-2016-6811, CVE-2017-3166, CVE-2017-3162, CVE-2018-8009 - Jackson 2.6.5 -> 2.9.6. Fixes: CVE-2017-15095, CVE-2017-17485, CVE-2017-7525, CVE-2018-7489, CVE-2016-3720 - ZooKeeper 3.4.6 -> 3.4.13 (https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html) * CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP instead of 0.0.0.0 to properly advertise executors during shuffle (#44) * Spark Dispatcher support for launching applications in the same virtual network by default (#45) * [DCOS-46585] Fix supervised driver retry logic for outdated tasks (#46) This commit fixes a bug where `--supervised` drivers would relaunch after receiving an outdated status update from a restarted/crashed agent even if they had already been relaunched and running elsewhere. In those scenarios, previous logic would cause two identical jobs to be running and ZK state would only have a record of the latest one effectively orphaning the 1st job. * Revert "[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults." This reverts commit 1024875. The change introduced in the reverted commit is breaking: - breaks semantics of `spark.master.rest.enabled` which belongs to Spark Standalone Master only but not to SparkSubmit - reverts the default behavior for Spark Standalone from REST to legacy RPC - contains misleading messages in `require` assertion blocks - prevents users from running jobs without specifying `spark.master.rest.enabled` * [DCOS-49020] Specify user in CommandInfo for Spark Driver launched on Mesos (#49) * [DCOS-40974] Mesos checkpointing support for Spark Drivers (#51) * [DCOS-51158] Improved Task ID assignment for Executor tasks (#52) * [DCOS-51454] Remove irrelevant Mesos REPL test (#54) * [DCOS-51453] Added Hadoop 2.9 profile (#53) * [DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the Driver when running on Mesos (#55) * Refactoring of metrics naming to add mesos semantics and avoid clashing with existing Spark metrics (#58) * [DCOS-34549] Mesos label NPE fix (#60)

akirillov requested a review from samvantran April 2, 2019 23:22

akirillov force-pushed the DCOS-51158-improved-task-id-generation-and-assignment branch from a0ec2fe to e037ff9 Compare April 4, 2019 00:46

akirillov closed this Apr 4, 2019

akirillov reopened this Apr 4, 2019

akirillov force-pushed the DCOS-51158-improved-task-id-generation-and-assignment branch from e037ff9 to d0a19d9 Compare April 4, 2019 01:01

[DCOS-51158] Improved Task ID assignment for Executor tasks

ac97db0

akirillov force-pushed the DCOS-51158-improved-task-id-generation-and-assignment branch from d0a19d9 to ac97db0 Compare April 4, 2019 17:08

samvantran approved these changes Apr 4, 2019

View reviewed changes

akirillov merged commit 7794d38 into custom-branch-2.4.x Apr 4, 2019

vishnu2kmohan deleted the DCOS-51158-improved-task-id-generation-and-assignment branch May 14, 2019 09:17

alembiewski pushed a commit that referenced this pull request Jun 12, 2019

[DCOS-51158] Improved Task ID assignment for Executor tasks (#52)

d91f193

alembiewski mentioned this pull request Jun 13, 2019

[DCOS-54813] Base tech update from 2.4.0 to 2.4.3 #62

Merged

This was referenced Oct 18, 2020

[DCOS-51158] Improved Task ID assignment for Executor tasks #83

Closed

[DCOS-39150][SPARK] Support unique Executor IDs in cluster managers #82

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[DCOS-51158] Improved Task ID assignment for Executor tasks #52

[DCOS-51158] Improved Task ID assignment for Executor tasks #52

Uh oh!

akirillov commented Apr 2, 2019 •

edited

Loading

Uh oh!

akirillov commented Apr 4, 2019

Uh oh!

akirillov commented Apr 4, 2019

Uh oh!

samvantran left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[DCOS-51158] Improved Task ID assignment for Executor tasks #52

[DCOS-51158] Improved Task ID assignment for Executor tasks #52

Uh oh!

Conversation

akirillov commented Apr 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Release notes

Uh oh!

akirillov commented Apr 4, 2019

Uh oh!

akirillov commented Apr 4, 2019

Uh oh!

samvantran left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

akirillov commented Apr 2, 2019 •

edited

Loading