# R on Spark

SparkR is an R package that provides a light-weight frontend to use Spark from R.

### SparkR development

#### Build Spark

Build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn) and include the `-Psparkr` profile to build the R package. For example, to use the default Hadoop versions you can run
```
  build/mvn -DskipTests -Psparkr package
```
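If you need to build against a specific Hadoop version rather than the default, the usual Spark Hadoop build flags can be added to the same command; for example (the Hadoop profile and version below are only illustrative)
```
  build/mvn -DskipTests -Psparkr -Phadoop-2.4 -Dhadoop.version=2.4.0 package
```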

#### Running sparkR

You can start using SparkR by launching the SparkR shell with

    ./bin/sparkR

The `sparkR` script automatically creates a SparkContext, running Spark in local mode by default. To specify the Spark master of a cluster for the automatically created SparkContext, you can run

    ./bin/sparkR --master "local[2]"

To set other options such as the driver memory or executor memory, you can pass the [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html) arguments to `./bin/sparkR`.
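For example, to give the driver more memory you could use spark-submit's `--driver-memory` flag (the `2g` value below is only illustrative)

    ./bin/sparkR --driver-memory 2g --master "local[2]"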

#### Using SparkR from RStudio

If you wish to use SparkR from RStudio or other R frontends, you will need to set some environment variables that point SparkR to your Spark installation. For example:
```
# Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/shivaram/spark")
# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local")
```
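To check that the setup works, you can try a small computation with the SparkContext created above. The sketch below assumes the SparkR DataFrame API (`sparkRSQL.init` and `createDataFrame`) is available in your build, and uses R's built-in `faithful` dataset
```
# Create a SQLContext from the existing SparkContext
sqlContext <- sparkRSQL.init(sc)
# Convert a local R data.frame into a distributed DataFrame
df <- createDataFrame(sqlContext, faithful)
# Print the first few rows
head(df)
```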

#### Making changes to SparkR

The [instructions](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) for making contributions to Spark also apply to SparkR.
If you have only made R file changes (i.e., no Scala changes), you can simply re-install the R package using `R/install-dev.sh` and test your changes.
Once you have made your changes, please include unit tests for them and run the existing unit tests using the `run-tests.sh` script as described below.
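For example, a typical R-only change cycle might look like this (assuming you run it from the top of the Spark source tree)
```
# Rebuild and install the SparkR package after editing R sources
R/install-dev.sh
# Run the SparkR unit tests
R/run-tests.sh
```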
#### Generating documentation

The SparkR documentation (Rd files and HTML files) is not a part of the source repository. To generate it, run the script `R/create-docs.sh`. This script uses `devtools` and `knitr` to generate the docs, and these packages need to be installed on the machine before using the script.
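For example, you could install the two prerequisite packages from CRAN before running the script
```
# Install the packages used by R/create-docs.sh
install.packages(c("devtools", "knitr"), repos = "http://cran.us.r-project.org")
```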
### Examples and unit tests

SparkR comes with several sample programs in the `examples/src/main/r` directory.
To run one of them, use `./bin/sparkR <filename> <args>`. For example:

    ./bin/sparkR examples/src/main/r/pi.R local[2]

You can also run the unit tests for SparkR (you will need to install the [testthat](http://cran.r-project.org/web/packages/testthat/index.html) package first) by running:

    R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
    ./R/run-tests.sh

### Running on YARN

The `./bin/spark-submit` and `./bin/sparkR` scripts can also be used to submit jobs to YARN clusters. You will need to set the YARN configuration directory before doing so. For example, on CDH you can run
```
export YARN_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit --master yarn examples/src/main/r/pi.R 4
```