This repository was archived by the owner on Oct 23, 2024. It is now read-only.

Conversation


@akirillov akirillov commented Oct 11, 2018

Using incremental integers as Executor IDs leads to a situation where Spark Executors launched by different Drivers have the same IDs. When running on DC/OS, this means that Mesos Task IDs for Spark Executors can be duplicated as well. This change introduces unique identifiers for both CoarseGrainedSchedulerBackend and YarnAllocator, which are prepended to the numeric ID, making it possible to distinguish Executors belonging to different Drivers.

This PR reverts commit ebe3c7f ("initialize executorIdCounter after ApplicationMaster killed for max n…)"), related to JIRA issue SPARK-12864, because it addressed a similar but YARN-specific issue.
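The ID scheme described above can be sketched as follows. This is a minimal illustration in Java rather than Spark's Scala, and `ExecutorIdGenerator` is a hypothetical name, not the class used in the PR: each scheduler-backend instance gets a random UUID prefix, so executor IDs from different drivers cannot collide even though each counter restarts at zero.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

public class ExecutorIdGenerator {
    // One random prefix per scheduler-backend instance (i.e. per driver).
    private final String prefix = UUID.randomUUID().toString();
    private final AtomicInteger counter = new AtomicInteger(0);

    // IDs look like "<uuid>-1", "<uuid>-2", ... instead of bare "1", "2", ...
    public String nextId() {
        return prefix + "-" + counter.incrementAndGet();
    }

    public static void main(String[] args) {
        ExecutorIdGenerator driverA = new ExecutorIdGenerator();
        ExecutorIdGenerator driverB = new ExecutorIdGenerator();
        // Both drivers hand out counter value 1, but the full IDs still differ.
        System.out.println(driverA.nextId());
        System.out.println(driverB.nextId());
    }
}
```

Because the Mesos Task ID is derived from the executor ID, making the executor ID globally unique also makes the Task ID unique cluster-wide.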

How was this patch tested?

The change is verified by a unit test and external integration test.


@akirillov akirillov force-pushed the DCOS-39150-cluster-wide-unique-ids-for-spark-executors branch from d1f5e86 to 9457d6f on October 11, 2018 21:36
@akirillov akirillov requested review from elezar, imaxxs, soumasish and vishnu2kmohan and removed request for imaxxs and vishnu2kmohan October 11, 2018 21:37

@samvantran samvantran left a comment


LGTM, but you'll need to fix the broken test, as it looks for the old `taskId=0`.

Using incremental integers as Executor IDs leads to a situation where Spark Executors launched by different Drivers have the same IDs.
As a result, Mesos Task IDs for multiple Spark Executors can be identical too. This PR prepends a UUID, unique per
CoarseGrainedSchedulerBackend instance, to the numeric ID, making it possible to distinguish Executors belonging to different Drivers.

This PR reverts commit ebe3c7f "[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…)"
@akirillov akirillov force-pushed the DCOS-39150-cluster-wide-unique-ids-for-spark-executors branch from 9457d6f to e81871c on October 13, 2018 00:39
@akirillov akirillov merged commit fa0ad77 into custom-branch-2.2.1-X Oct 16, 2018
@akirillov akirillov deleted the DCOS-39150-cluster-wide-unique-ids-for-spark-executors branch October 19, 2018 16:45
akirillov added a commit that referenced this pull request Oct 22, 2018
akirillov added a commit that referenced this pull request Feb 8, 2019
alembiewski pushed a commit that referenced this pull request Jun 12, 2019
alembiewski added a commit that referenced this pull request Aug 19, 2019
* Support for DSCOS_SERVICE_ACCOUNT_CREDENTIAL environment variable in MesosClusterScheduler

* File Based Secrets support

* [SPARK-723][SPARK-740] Add Metrics to Dispatcher and Driver

- Counters: The total number of times that submissions have entered states
- Timers: The duration from submit or launch until a submission entered a given state
- Histogram: The retry counts at time of retry

* Fixes to handling finished drivers

- Rename 'failed' case to 'exception'
- When a driver is 'finished', record its final MesosTaskState
- Fix naming consistency after seeing how they look in practice

* Register "finished" counters up-front

Otherwise their values are never published.

* [SPARK-692] Added spark.mesos.executor.gpus to specify the number of Executor GPUs

* [SPARK-23941][MESOS] Mesos task failed on specific spark app name (#33)

* [SPARK-23941][MESOS] Mesos task failed on specific spark app name

Port from SPARK#21014

** edit: not a direct port from upstream Spark. Changes were needed because we saw PySpark jobs fail to launch when they were 1) run with Docker and 2) launched with --py-files

==============

* Shell escape only appName, mainClass, default and driverConf

Specifically, we do not want to shell-escape the --py-files. What we've
seen in practice is that for Spark jobs that use Docker images together with Python
files, the $MESOS_SANDBOX path is escaped, which results in
FileNotFoundErrors during py4j.SparkSession.getOrCreate

* [DCOS-39150][SPARK] Support unique Executor IDs in cluster managers (#36)

Using incremental integers as Executor IDs leads to a situation where Spark Executors launched by different Drivers have the same IDs. As a result, Mesos Task IDs for multiple Spark Executors can be identical too. This PR prepends a UUID, unique per CoarseGrainedSchedulerBackend instance, to the numeric ID, making it possible to distinguish Executors belonging to different Drivers.

This PR reverts commit ebe3c7f "[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…)"

* Upgrade of Hadoop, ZooKeeper, and Jackson libraries to fix CVEs. Updates for JSON-related tests. (#43)

List of upgrades for 3rd-party libraries having CVEs:

- Hadoop: 2.7.3 -> 2.7.7. Fixes: CVE-2016-6811, CVE-2017-3166, CVE-2017-3162, CVE-2018-8009
- Jackson 2.6.5 -> 2.9.6. Fixes: CVE-2017-15095, CVE-2017-17485, CVE-2017-7525, CVE-2018-7489, CVE-2016-3720
- ZooKeeper 3.4.6 -> 3.4.13 (https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html)

# Conflicts:
#	dev/deps/spark-deps-hadoop-2.6
#	dev/deps/spark-deps-hadoop-2.7
#	dev/deps/spark-deps-hadoop-3.1
#	pom.xml

* CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP instead of 0.0.0.0 to properly advertise executors during shuffle (#44)

* Spark Dispatcher support for launching applications in the same virtual network by default (#45)

* [DCOS-46585] Fix supervised driver retry logic for outdated tasks (#46)

This commit fixes a bug where `--supervised` drivers would relaunch after receiving an outdated status update from a restarted/crashed agent, even if they had already been relaunched and were running elsewhere. In those scenarios, the previous logic would leave two identical jobs running, while the ZooKeeper state would only have a record of the latest one, effectively orphaning the first job.

* Revert "[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults."

This reverts commit 1024875.

The change introduced in the reverted commit is breaking:
- breaks semantics of `spark.master.rest.enabled` which belongs to Spark Standalone Master only but not to SparkSubmit
- reverts the default behavior for Spark Standalone from REST to legacy RPC
- contains misleading messages in `require` assertion blocks
- prevents users from running jobs without specifying `spark.master.rest.enabled`

* [DCOS-49020] Specify user in CommandInfo for Spark Driver launched on Mesos (#49)

* [DCOS-40974] Mesos checkpointing support for Spark Drivers (#51)

* [DCOS-51158] Improved Task ID assignment for Executor tasks (#52)

* [DCOS-51454] Remove irrelevant Mesos REPL test (#54)

* [DCOS-51453] Added Hadoop 2.9 profile (#53)

* [DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the Driver when running on Mesos (#55)

* Refactoring of metrics naming to add mesos semantics and avoid clashing with existing Spark metrics (#58)

* [DCOS-34549] Mesos label NPE fix (#60)
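The selective shell-escaping change noted above (escape only appName, mainClass, defaults, and driver conf, but never --py-files) hinges on how single-quoting interacts with variable expansion. Below is a hedged sketch of that behavior; `shellEscape` is a hypothetical helper written for illustration, not the exact function used in spark-submit:

```java
public class EscapeDemo {
    // Wrap a string in single quotes for the shell; embedded single
    // quotes become the standard '"'"' dance. A single-quoted string is
    // passed through literally: the shell performs NO variable expansion.
    static String shellEscape(String s) {
        return "'" + s.replace("'", "'\"'\"'") + "'";
    }

    public static void main(String[] args) {
        String pyFiles = "$MESOS_SANDBOX/deps.zip";
        // Escaped, the shell would look for a literal "$MESOS_SANDBOX"
        // directory (hence the FileNotFoundErrors), so --py-files must be
        // passed through unescaped to let $MESOS_SANDBOX expand.
        System.out.println(shellEscape(pyFiles));
    }
}
```

This is why escaping is applied per-argument rather than to the whole command line: arguments that must reach the shell with expandable variables intact are left alone.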
rpalaznik pushed a commit that referenced this pull request Feb 24, 2020
farhan5900 pushed a commit that referenced this pull request Aug 7, 2020