[DCOS-54813] Base tech update from 2.4.0 to 2.4.3 #62

alembiewski · 2019-06-13T10:27:53Z

What changes were proposed in this pull request?

Spark version updated to 2.4.3

Cherry-picked commits:

[DCOS-34549] Mesos label NPE fix ([DCOS-34549] Mesos label NPE fix #60)
Refactoring of metrics naming to add mesos semantics and avoid clashing with existing Spark metrics (Improved naming for Mesos metrics and metric sources #58)
[DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the Driver when running on Mesos ([DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the Driver when running on Mesos #55)
[DCOS-51453] Added Hadoop 2.9 profile ([DCOS-51453][HADOOP] Add Hadoop 2.9 profile #53)
[DCOS-51454] Remove irrelevant Mesos REPL test ([DCOS-51454] Remove irrelevant Mesos REPL test #54)
[DCOS-51158] Improved Task ID assignment for Executor tasks ([DCOS-51158] Improved Task ID assignment for Executor tasks #52)
[DCOS-40974] Mesos checkpointing support for Spark Drivers ([DCOS-40974] Mesos checkpointing support for Spark Drivers #51)
[DCOS-49020] Specify user in CommandInfo for Spark Driver launched on Mesos ([DCOS-49020] Specify user in CommandInfo for Spark Driver launched on Mesos #49)
Revert "[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults."
[DCOS-46585] Fix supervised driver retry logic for outdated tasks ([DCOS-46585] Fix supervised driver retry logic for outdated tasks #46)
Spark Dispatcher support for launching applications in the same virtual network by default ([DCOS-46369] Spark Dispatcher should run Spark Drivers and Executors in the same virtual network if specified #45)
CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP instead of 0.0.0.0 to properly advertise executors during shuffle ([DCOS-45840] CNI support for shuffle jobs #44)
Upgrade of Hadoop, ZooKeeper, and Jackson libraries to fix CVEs. Updates for JSON-related tests. ([DCOS-45820] Upgrade of Hadoop, ZooKeeper, and Jackson libraries to fix CVEs #43)
[DCOS-39150][SPARK] Support unique Executor IDs in cluster managers ([DCOS-39150][SPARK] Support unique Executor IDs in cluster managers #36)
[SPARK-23941][MESOS] Mesos task failed on specific spark app name ([SPARK-23941][MESOS] Mesos task failed on specific spark app name #33)
[SPARK-692] Added spark.mesos.executor.gpus to specify the number of Executor CPUs
[SPARK-723][SPARK-740] Add Metrics to Dispatcher and Driver
File Based Secrets support
Support for DSCOS_SERVICE_ACCOUNT_CREDENTIAL environment variable in MesosClusterScheduler

How was this patch tested?

integration tests;
unit test (spark-243-maven-test-log.log)

…MesosClusterScheduler

- Counters: The total number of times that submissions have entered states - Timers: The duration from submit or launch until a submission entered a given state - Histogram: The retry counts at time of retry * Fixes to handling finished drivers - Rename 'failed' case to 'exception' - When a driver is 'finished', record its final MesosTaskState - Fix naming consistency after seeing how they look in practice * Register "finished" counters up-front Otherwise their values are never published.

…Executor CPUs

* [SPARK-23941][MESOS] Mesos task failed on specific spark app name Port from SPARK#21014 ** edit: not a direct port from upstream Spark. Changes were needed because we saw PySpark jobs fail to launch when 1) run with docker and 2) including --py-files ============== * Shell escape only appName, mainClass, default and driverConf Specifically, we do not want to shell-escape the --py-files. What we've seen IRL is that for spark jobs that use docker images coupled w/ python files, the $MESOS_SANDBOX path is escaped and results in FileNotFoundErrors during py4j.SparkSession.getOrCreate

) Using incremental integers as Executor IDs leads to a situation when Spark Executors launched by different Drivers have same IDs. This leads to a situation when Mesos Task IDs for multiple Spark Executors are the same too. This PR prepends UUID unique for a CoarseGrainedSchedulerBackend instance to numeric ID thus allowing to distinguish Executors belonging to different drivers. This PR reverts commit ebe3c7f "[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…)"

…tes for JSON-related tests. (#43) List of upgrades for 3rd-party libraries having CVEs: - Hadoop: 2.7.3 -> 2.7.7. Fixes: CVE-2016-6811, CVE-2017-3166, CVE-2017-3162, CVE-2018-8009 - Jackson 2.6.5 -> 2.9.6. Fixes: CVE-2017-15095, CVE-2017-17485, CVE-2017-7525, CVE-2018-7489, CVE-2016-3720 - ZooKeeper 3.4.6 -> 3.4.13 (https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html) # Conflicts: # dev/deps/spark-deps-hadoop-2.6 # dev/deps/spark-deps-hadoop-2.7 # dev/deps/spark-deps-hadoop-3.1 # pom.xml

…ad of 0.0.0.0 to properly advertise executors during shuffle (#44)

…al network by default (#45)

This commit fixes a bug where `--supervised` drivers would relaunch after receiving an outdated status update from a restarted/crashed agent even if they had already been relaunched and running elsewhere. In those scenarios, previous logic would cause two identical jobs to be running and ZK state would only have a record of the latest one effectively orphaning the 1st job.

…faults." This reverts commit 1024875. The change introduced in the reverted commit is breaking: - breaks semantics of `spark.master.rest.enabled` which belongs to Spark Standalone Master only but not to SparkSubmit - reverts the default behavior for Spark Standalone from REST to legacy RPC - contains misleading messages in `require` assertion blocks - prevents users from running jobs without specifying `spark.master.rest.enabled`

… Mesos (#49)

…river when running on Mesos (#55)

…ng with existing Spark metrics (#58)

samvantran

Nice work @alembiewski! Thanks for adding the logs for unit tests and I can see all the ITs passed too. You've added the more recent commits to the master list of cherry picks to include for every bump?

@akirillov we should start that process for upstreaming as it looks like the number is starting to increase in our fork

alembiewski · 2019-06-13T19:32:07Z

Thanks @samvantran! Yes, the list in the PR's description includes all the recent commits.

master list of cherry picks to include for every bump

Just to check, do we have a special docs, like wiki page, that should be updated as well?

akirillov

Thanks, @alembiewski!

Although the fact that both unit and integration tests have passed serves as good evidence that everything works, it's really hard to provide a review for 270 commits and 400+ changed files. What I'd suggest: can you please try creating a copy of the upstream 2.4.3 release branch, name it custom-branch-2.4.3 (no x) and create a PR against it so that the PR contains only cherry-picks from Mesosphere?

This way the review is possible as we apply only Mesosphere changes on top of the upstream release (in fact this is the same case here but there's a chance we could have missed something because the actual review is so complicated). As all the merges and fixes have already taken place in the scope of this PR, this shouldn't be a huge problem.

WDYT?
cc @samvantran

akirillov · 2019-08-14T15:16:09Z

Thanks for updating the PR @alembiewski. Cherry-picked commits and the PR overall LGTM. I re-ran tests and CI looks good, however, unit tests fail on the same test (ran the full suite twice):

KafkaSourceStressForDontFailOnDataLossSuite:
- stress test for failOnDataLoss=false *** FAILED ***
  org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 8a8114de-e47a-453e-954e-71056152e4db, runId = 94c165ac-8ba9-49ad-bb90-09d3869bcda4] terminated with exception: Timeout of 3000ms expired before the position for partition failOnDataLoss-3-0 could be determined
  at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
  at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
  ...
  Cause: org.apache.kafka.common.errors.TimeoutException: Timeout of 3000ms expired before the position for partition failOnDataLoss-3-0 could be determined

Please find a build log attached: spark_mvn_build.log

alembiewski · 2019-08-15T13:41:14Z

Hmm, I would say, that it's quite strange, I ran them today on a private AWS instance, following the instructions from our wiki, and they seem to pass. The box's description below:

Instance type: m5.xlarge
Availability zone: us-west-2a
AMI ID: ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5)

Built from the proper commit:

ubuntu@ip-172-31-18-40:~/spark$ git rev-parse HEAD
b5a9366f693e8c2ef363680948c06c43cb423950

Build log: mvn.log

Based on exception type, could it be related to some network issues?

akirillov

After rerunning the tests in a different AWS region and instance type, everything looks good!

Build log attached: spark-build-mvn.log

* Support for DSCOS_SERVICE_ACCOUNT_CREDENTIAL environment variable in MesosClusterScheduler * File Based Secrets support * [SPARK-723][SPARK-740] Add Metrics to Dispatcher and Driver - Counters: The total number of times that submissions have entered states - Timers: The duration from submit or launch until a submission entered a given state - Histogram: The retry counts at time of retry * Fixes to handling finished drivers - Rename 'failed' case to 'exception' - When a driver is 'finished', record its final MesosTaskState - Fix naming consistency after seeing how they look in practice * Register "finished" counters up-front Otherwise their values are never published. * [SPARK-692] Added spark.mesos.executor.gpus to specify the number of Executor CPUs * [SPARK-23941][MESOS] Mesos task failed on specific spark app name (#33) * [SPARK-23941][MESOS] Mesos task failed on specific spark app name Port from SPARK#21014 ** edit: not a direct port from upstream Spark. Changes were needed because we saw PySpark jobs fail to launch when 1) run with docker and 2) including --py-files ============== * Shell escape only appName, mainClass, default and driverConf Specifically, we do not want to shell-escape the --py-files. What we've seen IRL is that for spark jobs that use docker images coupled w/ python files, the $MESOS_SANDBOX path is escaped and results in FileNotFoundErrors during py4j.SparkSession.getOrCreate * [DCOS-39150][SPARK] Support unique Executor IDs in cluster managers (#36) Using incremental integers as Executor IDs leads to a situation when Spark Executors launched by different Drivers have same IDs. This leads to a situation when Mesos Task IDs for multiple Spark Executors are the same too. This PR prepends UUID unique for a CoarseGrainedSchedulerBackend instance to numeric ID thus allowing to distinguish Executors belonging to different drivers. This PR reverts commit ebe3c7f "[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…)" * Upgrade of Hadoop, ZooKeeper, and Jackson libraries to fix CVEs. Updates for JSON-related tests. (#43) List of upgrades for 3rd-party libraries having CVEs: - Hadoop: 2.7.3 -> 2.7.7. Fixes: CVE-2016-6811, CVE-2017-3166, CVE-2017-3162, CVE-2018-8009 - Jackson 2.6.5 -> 2.9.6. Fixes: CVE-2017-15095, CVE-2017-17485, CVE-2017-7525, CVE-2018-7489, CVE-2016-3720 - ZooKeeper 3.4.6 -> 3.4.13 (https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html) * CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP instead of 0.0.0.0 to properly advertise executors during shuffle (#44) * Spark Dispatcher support for launching applications in the same virtual network by default (#45) * [DCOS-46585] Fix supervised driver retry logic for outdated tasks (#46) This commit fixes a bug where `--supervised` drivers would relaunch after receiving an outdated status update from a restarted/crashed agent even if they had already been relaunched and running elsewhere. In those scenarios, previous logic would cause two identical jobs to be running and ZK state would only have a record of the latest one effectively orphaning the 1st job. * Revert "[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults." This reverts commit 1024875. The change introduced in the reverted commit is breaking: - breaks semantics of `spark.master.rest.enabled` which belongs to Spark Standalone Master only but not to SparkSubmit - reverts the default behavior for Spark Standalone from REST to legacy RPC - contains misleading messages in `require` assertion blocks - prevents users from running jobs without specifying `spark.master.rest.enabled` * [DCOS-49020] Specify user in CommandInfo for Spark Driver launched on Mesos (#49) * [DCOS-40974] Mesos checkpointing support for Spark Drivers (#51) * [DCOS-51158] Improved Task ID assignment for Executor tasks (#52) * [DCOS-51454] Remove irrelevant Mesos REPL test (#54) * [DCOS-51453] Added Hadoop 2.9 profile (#53) * [DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the Driver when running on Mesos (#55) * Refactoring of metrics naming to add mesos semantics and avoid clashing with existing Spark metrics (#58) * [DCOS-34549] Mesos label NPE fix (#60)

* [DCOS-54813] Base tech update from 2.4.0 to 2.4.3 (#62) * [DCOS-52207][Spark] Make Mesos Agent Blacklisting behavior configurable and more tolerant of failures. (#63) * [DCOS-58386] Node draining support for supervised drivers; Mesos Java bump to 1.9.0 (#65) * [DCOS-58389] Role propagation and enforcement support for Mesos Dispatcher (#66) * DCOS-57560: Suppress/revive support for Mesos (#67) * D2IQ-64778: Unregister progress listeners before stopping executor threads. * [SPARK-32675][MESOS] --py-files option is appended without passing value for it ### What changes were proposed in this pull request? The PR checks for the emptiness of `--py-files` value and uses it only if it is not empty. ### Why are the changes needed? There is a bug in Mesos cluster mode REST Submission API. It is using `--py-files` option without specifying any value for the conf `spark.submit.pyFiles` by the user. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? * Submitting an application to a Mesos cluster: `curl -X POST http://localhost:7077/v1/submissions/create --header "Content-Type:application/json" --data '{ "action": "CreateSubmissionRequest", "appResource": "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar", "clientSparkVersion": "3.0.0", "appArgs": ["30"], "environmentVariables": {}, "mainClass": "org.apache.spark.examples.SparkPi", "sparkProperties": { "spark.jars": "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar", "spark.driver.supervise": "false", "spark.executor.memory": "512m", "spark.driver.memory": "512m", "spark.submit.deployMode": "cluster", "spark.app.name": "SparkPi", "spark.master": "mesos://localhost:5050" }}'` * It should be able to pick the correct class and run the job successfully. Closes apache#29499 from farhan5900/SPARK-32675. Authored-by: farhan5900 <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 8749f2e) * Applied all the remaining custom commits * Removes duplicate org.scala-lang.modules dependency * Removes duplicate gpus method * Adds second parameter of deprecated annotation * Fix typemismatch * Removes extra shellEscape function calls * Adds env variable for file based auth * Remove generic shell escape and put it in specific places * fix ambiguous import * fix shell escaping for `--conf` options * fix option type in test * Fixes scalastyle and unit test issues Co-authored-by: Alexander Lembiewski <[email protected]> Co-authored-by: Farhan <[email protected]> Co-authored-by: Anton Kirillov <[email protected]> Co-authored-by: Anton Kirillov <[email protected]> Co-authored-by: Roman Palaznik <[email protected]> Co-authored-by: Roman Palaznik <[email protected]>

Michael Gummelt and others added 19 commits June 12, 2019 12:59

Support for DSCOS_SERVICE_ACCOUNT_CREDENTIAL environment variable in …

203ca29

…MesosClusterScheduler

File Based Secrets support

d97b04e

[SPARK-692] Added spark.mesos.executor.gpus to specify the number of …

fcfe237

…Executor CPUs

CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP inste…

b86608b

…ad of 0.0.0.0 to properly advertise executors during shuffle (#44)

Spark Dispatcher support for launching applications in the same virtu…

5165aad

…al network by default (#45)

[DCOS-49020] Specify user in CommandInfo for Spark Driver launched on…

b52222e

… Mesos (#49)

[DCOS-40974] Mesos checkpointing support for Spark Drivers (#51)

f4d7757

[DCOS-51158] Improved Task ID assignment for Executor tasks (#52)

d91f193

[DCOS-51454] Remove irrelevant Mesos REPL test (#54)

74dd009

[DCOS-51453] Added Hadoop 2.9 profile (#53)

333f532

[DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the D…

ebab0c0

…river when running on Mesos (#55)

Refactoring of metrics naming to add mesos semantics and avoid clashi…

f220863

…ng with existing Spark metrics (#58)

[DCOS-34549] Mesos label NPE fix (#60)

b5a9366

alembiewski requested review from akirillov, farhan5900, rpalaznik and samvantran June 13, 2019 10:27

samvantran approved these changes Jun 13, 2019

View reviewed changes

akirillov reviewed Aug 12, 2019

View reviewed changes

alembiewski changed the base branch from custom-branch-2.4.x to custom-branch-2.4.3 August 13, 2019 09:47

akirillov changed the title ~~DCOS-54813 - Base tech update from 2.4.0 to 2.4.3~~ [DCOS-54813] Base tech update from 2.4.0 to 2.4.3 Aug 13, 2019

akirillov approved these changes Aug 16, 2019

View reviewed changes

alembiewski merged commit 401fd7b into custom-branch-2.4.3 Aug 19, 2019

alexeygorobets pushed a commit that referenced this pull request Oct 13, 2020

[DCOS-54813] Base tech update from 2.4.0 to 2.4.3 (#62)

b2b67cd

alexeygorobets mentioned this pull request Oct 13, 2020

Merge custom commits 3.0.1 #69

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[DCOS-54813] Base tech update from 2.4.0 to 2.4.3 #62

[DCOS-54813] Base tech update from 2.4.0 to 2.4.3 #62

Uh oh!

alembiewski commented Jun 13, 2019

Uh oh!

samvantran left a comment

Uh oh!

alembiewski commented Jun 13, 2019 •

edited

Loading

Uh oh!

akirillov left a comment

Uh oh!

akirillov commented Aug 14, 2019

Uh oh!

alembiewski commented Aug 15, 2019

Uh oh!

akirillov left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

[DCOS-54813] Base tech update from 2.4.0 to 2.4.3 #62

[DCOS-54813] Base tech update from 2.4.0 to 2.4.3 #62

Uh oh!

Conversation

alembiewski commented Jun 13, 2019

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

samvantran left a comment

Choose a reason for hiding this comment

Uh oh!

alembiewski commented Jun 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

akirillov left a comment

Choose a reason for hiding this comment

Uh oh!

akirillov commented Aug 14, 2019

Uh oh!

alembiewski commented Aug 15, 2019

Uh oh!

akirillov left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

alembiewski commented Jun 13, 2019 •

edited

Loading