[SPARK-10971][SPARKR] RRunner should allow setting path to Rscript. #9179
Conversation
Thanks for working on this. The changes look fine to me, but someone more familiar with the R side should review too. Is there any documentation for the R configs that should be updated for this?
Test build #43974 has finished for PR 9179 at commit
+1 on a documentation update, otherwise this won't be discoverable. I'll defer to a reviewer with more R experience for final sign-off and merge.
Yeah, this can go under the SparkR section at http://spark.apache.org/docs/latest/configuration.html#sparkr
Can this actually happen? I thought we only used RRunner when launching non-shell jobs (i.e. when executing scripts).
Yes, we only use RRunner when launching non-shell jobs (i.e. when executing scripts). But R scripts can be executed in either client mode or cluster mode. In client mode, the R script is executed on the local host, while in cluster mode it is executed on a selected worker node.
Does SPARKR_DRIVER_R map to spark.sparkr.r.driver.command?
@davies can comment. I believe in PySpark the PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON value is passed from the driver to the executors.
@felixcheung, no. SPARKR_DRIVER_R is for the R shell executable on the local host, while "spark.sparkr.r.driver.command" is intended for the Rscript script engine on the local host.
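To make the distinction concrete, here is a minimal sketch of the two lookups described above. The helper names and the fallback defaults (`R` and `Rscript`) are illustrative assumptions, not Spark's actual implementation.

```python
# Illustrative sketch: contrasts the env-var lookup (SPARKR_DRIVER_R, which
# locates the interactive R shell) with the Spark conf lookup
# ("spark.sparkr.r.driver.command", for the Rscript engine in client mode).
# Helper names and defaults are assumptions, not Spark's actual code.

def r_shell_executable(env):
    # SPARKR_DRIVER_R is an environment variable read on the local host.
    return env.get("SPARKR_DRIVER_R", "R")

def rscript_driver_executable(conf):
    # "spark.sparkr.r.driver.command" is a Spark conf option.
    return conf.get("spark.sparkr.r.driver.command", "Rscript")
```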
There seems to be some inconsistency in the configurations: some are Spark conf options, while another is an environment variable. I don't have a strong opinion on this and will try to keep them as-is for backward compatibility.
Will update the doc once this new conf option is agreed on.
@felixcheung, you are correct: the value of PYSPARK_PYTHON is passed to workers. In SparkR, "spark.sparkr.r.command" is a Spark conf option and will be passed to workers. They are passed to workers in different ways.
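The PySpark behavior referenced here can be sketched as follows. The defaults mirror those quoted elsewhere in this thread (PYSPARK_PYTHON defaults to `python`; PYSPARK_DRIVER_PYTHON falls back to PYSPARK_PYTHON), but the helper itself is a hypothetical illustration, not Spark's code.

```python
def pyspark_executable(env, for_driver):
    # Workers always use PYSPARK_PYTHON; the driver may override it with
    # PYSPARK_DRIVER_PYTHON. Defaults follow the description in this thread;
    # the helper itself is a hypothetical sketch, not Spark's actual code.
    python = env.get("PYSPARK_PYTHON", "python")
    if for_driver:
        return env.get("PYSPARK_DRIVER_PYTHON", python)
    return python
```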
Added a safeguard for the case where 'spark.submit.deployMode' may be missing. This is for consistency with SPARK-10711.
Got it. I don't know what the plan is for env vars vs. Spark conf, but maybe we want consistency here.
Test build #44030 has finished for PR 9179 at commit
FWIW, Spark config variables are preferred to environment variables when we are adding something new.
@sun-rui Code change looks pretty good. Could we also make the docs change in this same PR, or were you planning on a separate PR?
@shivaram, I will update the documentation.
@shivaram, documentation updated.
LGTM. Thanks @sun-rui -- Will merge after Jenkins passes.
Test build #44130 has finished for PR 9179 at commit
@shivaram, some thoughts on the naming consistency of the options. In this PR, the names of the two options to be documented are: I am thinking of renaming them to There is a backward compatibility issue with renaming "spark.sparkr.r.command" to "spark.r.rscript", but since it was not documented before, I think it won't cost too much. If you have no strong preference on this, you can merge this PR.
Hmm, let's use
+1 on
@shivaram, good idea; agreed.
Test build #44277 has finished for PR 9179 at commit
Thanks @sun-rui -- LGTM. Merging this |
Add a new Spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes. The existing Spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes, for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).

BTW, the [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate the R shell on the local host.

For your information, PySpark has two environment variables serving a similar purpose:

PYSPARK_PYTHON: Python binary executable to use for PySpark in both driver and workers (default is `python`).
PYSPARK_DRIVER_PYTHON: Python binary executable to use for PySpark in the driver only (default is PYSPARK_PYTHON).

PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the Python executable for a Python script.

Author: Sun Rui <[email protected]>

Closes #9179 from sun-rui/SPARK-10971.

(cherry picked from commit 2462dbc)
Signed-off-by: Shivaram Venkataraman <[email protected]>
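Putting the description above together, the executable selection might be sketched like this. The `role` parameter and the fallback from "spark.sparkr.r.driver.command" to "spark.sparkr.r.command" are assumptions for illustration; this is not the actual RRunner code.

```python
def sparkr_executable(conf, role):
    # Sketch of the selection described in the commit message above; not the
    # actual RRunner code. `role` is "driver" or "worker" (a hypothetical
    # parameter). Guard against a missing "spark.submit.deployMode" by
    # defaulting to client mode (cf. SPARK-10711).
    deploy_mode = conf.get("spark.submit.deployMode", "client")
    if role == "driver" and deploy_mode == "client":
        # Client mode: "spark.sparkr.r.driver.command" configures the
        # driver-side Rscript engine; the fallback to "spark.sparkr.r.command"
        # is an assumption for this sketch.
        return conf.get("spark.sparkr.r.driver.command",
                        conf.get("spark.sparkr.r.command", "Rscript"))
    # Cluster-mode driver and all workers use "spark.sparkr.r.command".
    return conf.get("spark.sparkr.r.command", "Rscript")
```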
I believe I've noticed a problem with this fix. You fixed
@msannell, good catch. Thanks! I will fix it. |