Commit f5d3355

Merge pull request apache#218 from davies/merge
[SPARKR-225] Merge master into sparkr-sql branch
2 parents: 3139325 + 70f620c

23 files changed: 1219 additions, 568 deletions


BUILDING.md

Lines changed: 3 additions & 6 deletions
```diff
@@ -7,11 +7,8 @@ include Rtools and R in `PATH`.
 2. Install
 [JDK7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html) and set
 `JAVA_HOME` in the system environment variables.
-3. Install `rJava` using `install.packages(rJava)`. If rJava fails to load due to missing jvm.dll,
-you will need to add the directory containing jvm.dll to `PATH`. See this [stackoverflow post](http://stackoverflow.com/a/7604469)
-for more details.
-4. Download and install [Maven](http://maven.apache.org/download.html). Also include the `bin`
+3. Download and install [Maven](http://maven.apache.org/download.html). Also include the `bin`
 directory in Maven in `PATH`.
-5. Get SparkR source code either using [`git`](http://git-scm.com/downloads) or by downloading a
+4. Get SparkR source code either using [`git`](http://git-scm.com/downloads) or by downloading a
 source zip from github.
-6. Open a command shell (`cmd`) in the SparkR directory and run `install-dev.bat`
+5. Open a command shell (`cmd`) in the SparkR directory and run `install-dev.bat`
```

README.md

Lines changed: 46 additions & 0 deletions
````diff
@@ -46,6 +46,22 @@ the environment variable `USE_MAVEN=1`. For example
 If you are building SparkR from behind a proxy, you can [setup maven](https://maven.apache.org/guides/mini/guide-proxies.html) to use the right proxy
 server.
 
+#### Building from source from GitHub
+
+Run the following within R to pull source code from GitHub and build locally. It is possible
+to specify build dependencies by starting R with environment values:
+
+1. Start R
+```
+SPARK_VERSION=1.2.0 SPARK_HADOOP_VERSION=2.5.0 R
+```
+
+2. Run install_github
+```
+library(devtools)
+install_github("repo/SparkR-pkg", ref="branchname", subdir="pkg")
+```
+*note: replace repo and branchname*
 
 ## Running sparkR
 If you have cloned and built SparkR, you can start using it by launching the SparkR
@@ -110,17 +126,47 @@ cd SparkR-pkg/
 USE_YARN=1 SPARK_YARN_VERSION=2.4.0 SPARK_HADOOP_VERSION=2.4.0 ./install-dev.sh
 ```
 
+Alternatively, install_github can be used (on CDH in this case):
+
+```
+# assume the devtools package is installed via install.packages("devtools")
+USE_YARN=1 SPARK_VERSION=1.1.0 SPARK_YARN_VERSION=2.5.0-cdh5.3.0 SPARK_HADOOP_VERSION=2.5.0-cdh5.3.0 R
+```
+Then within R,
+```
+library(devtools)
+install_github("amplab-extras/SparkR-pkg", ref="master", subdir="pkg")
+```
+
 Before launching an application, make sure each worker node has a local copy of `lib/SparkR/sparkr-assembly-0.1.jar`. With a cluster launched with the `spark-ec2` script, do:
 ```
 ~/spark-ec2/copy-dir ~/SparkR-pkg
 ```
+Or run the above installation steps on all worker nodes.
 
 Finally, when launching an application, the environment variable `YARN_CONF_DIR` needs to be set to the directory which contains the client-side configuration files for the Hadoop cluster (with a cluster launched with `spark-ec2`, this defaults to `/root/ephemeral-hdfs/conf/`):
 ```
 YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ MASTER=yarn-client ./sparkR
 YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ ./sparkR examples/pi.R yarn-client
 ```
 
+## Running on a cluster using sparkR-submit
+
+sparkR-submit is a script introduced to facilitate submission of SparkR jobs to a Spark-supported cluster (e.g. Standalone, Mesos, YARN).
+It supports the same command-line parameters as [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html). SPARK_HOME and JAVA_HOME must be defined.
+
+On YARN, YARN_CONF_DIR must be defined. sparkR-submit supports [YARN deploy modes](http://spark.apache.org/docs/latest/running-on-yarn.html): yarn-client and yarn-cluster.
+
+sparkR-submit is installed with the SparkR package. By default, it can be found under the default library (the [`library`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html) subdirectory of R_HOME).
+
+For example, to run on YARN (CDH 5.3.0),
+```
+export SPARK_HOME=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark
+export YARN_CONF_DIR=/etc/hadoop/conf
+export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
+/usr/lib64/R/library/SparkR/sparkR-submit --master yarn-client examples/pi.R yarn-client 4
+```
+
 ## Report Issues/Feedback
 
 For better tracking and collaboration, issues and TODO items are reported to a dedicated [SparkR JIRA](https://sparkr.atlassian.net/browse/SPARKR/).
````

pkg/DESCRIPTION

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ Suggests:
1515
Description: R frontend for Spark
1616
License: Apache License (== 2.0)
1717
Collate:
18+
'generics.R'
1819
'jobj.R'
1920
'SQLTypes.R'
2021
'RDD.R'

pkg/NAMESPACE

Lines changed: 5 additions & 1 deletion
```diff
@@ -6,6 +6,7 @@ exportMethods(
   "aggregateRDD",
   "cache",
   "checkpoint",
+  "coalesce",
   "cogroup",
   "collect",
   "collectAsMap",
@@ -48,6 +49,7 @@ exportMethods(
   "reduce",
   "reduceByKey",
   "reduceByKeyLocally",
+  "repartition",
   "rightOuterJoin",
   "sampleRDD",
   "saveAsTextFile",
@@ -61,7 +63,9 @@ exportMethods(
   "unionRDD",
   "unpersist",
   "value",
-  "values"
+  "values",
+  "zipWithIndex",
+  "zipWithUniqueId"
 )
 
 # S3 methods exported
```
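The newly exported `zipWithIndex` and `zipWithUniqueId` mirror the corresponding Scala RDD operations. A minimal sketch of how they might be called from this SparkR package (assuming a local SparkContext; the `sparkR.init`, `parallelize`, and `collect` names follow the SparkR RDD API, and the commented results are illustrative, not verified output):

```r
library(SparkR)
sc <- sparkR.init(master = "local")

rdd <- parallelize(sc, c("a", "b", "c"), 2L)

# zipWithIndex pairs each element with its 0-based position across the RDD;
# it may trigger a job to count elements in earlier partitions
indexed <- zipWithIndex(rdd)
collect(indexed)  # pairs such as list("a", 0), list("b", 1), ...

# zipWithUniqueId assigns ids that are unique but not necessarily
# consecutive, computed from partition index alone (no extra job)
uniq <- zipWithUniqueId(rdd)
collect(uniq)
```

The design trade-off is the usual one from Spark core: `zipWithUniqueId` avoids a pass over the data, at the cost of gaps in the id sequence.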

pkg/R/DataFrame.R

Lines changed: 0 additions & 4 deletions
```diff
@@ -393,10 +393,6 @@ setMethod("unpersist",
 #' df <- jsonFile(sqlCtx, path)
 #' newDF <- repartition(df, 2L)
 #'}
-setGeneric("repartition", function(x, numPartitions) { standardGeneric("repartition") })
-
-#' @rdname repartition
-#' @export
 setMethod("repartition",
           signature(x = "DataFrame", numPartitions = "numeric"),
           function(x, numPartitions) {
```
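The `setGeneric` for `repartition` is removed from DataFrame.R because the DESCRIPTION change adds a `'generics.R'` file at the front of the Collate order, so generics are declared once before any file defines methods on them. A minimal sketch of the resulting split (illustrative; the method body here is an assumption about the JVM-backed implementation, not the actual file contents):

```r
# generics.R -- declare the generic exactly once, collated first
setGeneric("repartition", function(x, numPartitions) {
  standardGeneric("repartition")
})

# DataFrame.R -- only the DataFrame method remains
setMethod("repartition",
          signature(x = "DataFrame", numPartitions = "numeric"),
          function(x, numPartitions) {
            # hypothetical sketch: delegate to the JVM-side
            # DataFrame's repartition via the stored Java object ref
            sdf <- callJMethod(x@sdf, "repartition", as.integer(numPartitions))
            dataFrame(sdf)
          })
```

Keeping generics in one file avoids the "creating a new generic" warnings and masking problems that arise when several collated files each call `setGeneric` for the same name.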
