[SPARK-10971][SPARKR] RRunner should allow setting path to Rscript. #9179
Conversation
Thanks for working on this. The changes look fine to me, but someone more familiar with the R side should review too. Is there any documentation for the R configs that should be updated for this?
Test build #43974 has finished for PR 9179 at commit
+1 on a documentation update, otherwise this won't be discoverable. I'll defer to a reviewer with more R experience for final sign-off and merge.
Yeah, this can go under the SparkR section at http://spark.apache.org/docs/latest/configuration.html#sparkr
Can this actually happen? I thought we only used RRunner when launching non-shell jobs (i.e. when executing scripts).
Yes, we only use RRunner when launching non-shell jobs (i.e. when executing scripts). But R scripts can be executed in either client mode or cluster mode. In client mode, the R script is executed on the local host, while in cluster mode it is executed on a selected worker node.
Does SPARKR_DRIVER_R map to spark.sparkr.r.driver.command?
@davies can comment. I believe in PySpark the PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON value is passed from the driver to the executors.
@felixcheung, no. SPARKR_DRIVER_R is for the R shell executable on the local host, while "spark.sparkr.r.driver.command" is intended for the Rscript script engine on the local host.
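To make the distinction concrete, here is a minimal sketch of the two lookups described above. The helper names and the fallback defaults (`R` and `Rscript`) are illustrative assumptions, not Spark's actual implementation.

```python
# Illustrative sketch: contrasts the env-var lookup (SPARKR_DRIVER_R, which
# locates the interactive R shell) with the Spark conf lookup
# ("spark.sparkr.r.driver.command", for the Rscript engine in client mode).
# Helper names and defaults are assumptions, not Spark's actual code.

def r_shell_executable(env):
    # SPARKR_DRIVER_R is an environment variable read on the local host.
    return env.get("SPARKR_DRIVER_R", "R")

def rscript_driver_executable(conf):
    # "spark.sparkr.r.driver.command" is a Spark conf option.
    return conf.get("spark.sparkr.r.driver.command", "Rscript")
```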
There seems to be some inconsistency in the configurations: some are Spark conf options, while another is an environment variable. I don't have a strong opinion on this and will try to keep them as-is for backward compatibility.
Will update the doc once this new conf option is agreed on.
@felixcheung, you are correct: the value of PYSPARK_PYTHON is passed to workers. In SparkR, "spark.sparkr.r.command" is a Spark conf option and will be passed to workers. They are passed to workers in different ways.
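The PySpark behavior referenced here can be sketched as follows. The defaults mirror those quoted elsewhere in this thread (PYSPARK_PYTHON defaults to `python`; PYSPARK_DRIVER_PYTHON falls back to PYSPARK_PYTHON), but the helper itself is a hypothetical illustration, not Spark's code.

```python
def pyspark_executable(env, for_driver):
    # Workers always use PYSPARK_PYTHON; the driver may override it with
    # PYSPARK_DRIVER_PYTHON. Defaults follow the description in this thread;
    # the helper itself is a hypothetical sketch, not Spark's actual code.
    python = env.get("PYSPARK_PYTHON", "python")
    if for_driver:
        return env.get("PYSPARK_DRIVER_PYTHON", python)
    return python
```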
Added a safeguard for the case where 'spark.submit.deployMode' may be missing. This is for consistency with SPARK-10711.
Got it. I don't know what the plan is for env vars vs. Spark conf, but maybe we want consistency here.
Test build #44030 has finished for PR 9179 at commit
FWIW, Spark config variables are preferred to environment variables when we are adding something new.
@sun-rui Code change looks pretty good. Could we also make the docs change in this same PR, or were you planning on a separate PR?
@shivaram, I will update the documentation.
@shivaram, documentation updated.
LGTM. Thanks @sun-rui -- Will merge after Jenkins passes.
Test build #44130 has finished for PR 9179 at commit
@shivaram, some thoughts on the naming consistency of the options. In this PR, the names of the two options to be documented are: I am thinking of renaming them to There is a backward compatibility issue with renaming "spark.sparkr.r.command" to "spark.r.rscript", but since it was not documented before, I think it won't cost too much. If you have no strong preference on this, you can merge this PR.
Hmm, let's use
+1 on
@shivaram, good idea; agreed.
Test build #44277 has finished for PR 9179 at commit
Thanks @sun-rui -- LGTM. Merging this |
Add a new Spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes. The existing Spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes, for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).

BTW, the [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate the R shell on the local host.

For your information, PySpark has two environment variables serving a similar purpose:

PYSPARK_PYTHON: Python binary executable to use for PySpark in both driver and workers (default is `python`).
PYSPARK_DRIVER_PYTHON: Python binary executable to use for PySpark in the driver only (default is PYSPARK_PYTHON).

PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the Python executable for a Python script.

Author: Sun Rui <[email protected]>

Closes #9179 from sun-rui/SPARK-10971.

(cherry picked from commit 2462dbc)
Signed-off-by: Shivaram Venkataraman <[email protected]>
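Putting the description above together, the executable selection might be sketched like this. The `role` parameter and the fallback from "spark.sparkr.r.driver.command" to "spark.sparkr.r.command" are assumptions for illustration; this is not the actual RRunner code.

```python
def sparkr_executable(conf, role):
    # Sketch of the selection described in the commit message above; not the
    # actual RRunner code. `role` is "driver" or "worker" (a hypothetical
    # parameter). Guard against a missing "spark.submit.deployMode" by
    # defaulting to client mode (cf. SPARK-10711).
    deploy_mode = conf.get("spark.submit.deployMode", "client")
    if role == "driver" and deploy_mode == "client":
        # Client mode: "spark.sparkr.r.driver.command" configures the
        # driver-side Rscript engine; the fallback to "spark.sparkr.r.command"
        # is an assumption for this sketch.
        return conf.get("spark.sparkr.r.driver.command",
                        conf.get("spark.sparkr.r.command", "Rscript"))
    # Cluster-mode driver and all workers use "spark.sparkr.r.command".
    return conf.get("spark.sparkr.r.command", "Rscript")
```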
I believe I've noticed a problem with this fix. You fixed
@msannell, good catch. Thanks! I will fix it. |