Commit f5d3355

Merge pull request apache#218 from davies/merge
[SPARKR-225] Merge master into sparkr-sql branch
2 parents: 3139325 + 70f620c

23 files changed: 1219 additions, 568 deletions


BUILDING.md

Lines changed: 3 additions & 6 deletions
```diff
@@ -7,11 +7,8 @@ include Rtools and R in `PATH`.
 2. Install
 [JDK7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html) and set
 `JAVA_HOME` in the system environment variables.
-3. Install `rJava` using `install.packages(rJava)`. If rJava fails to load due to missing jvm.dll,
-you will need to add the directory containing jvm.dll to `PATH`. See this [stackoverflow post](http://stackoverflow.com/a/7604469)
-for more details.
-4. Download and install [Maven](http://maven.apache.org/download.html). Also include the `bin`
+3. Download and install [Maven](http://maven.apache.org/download.html). Also include the `bin`
 directory in Maven in `PATH`.
-5. Get SparkR source code either using [`git`](http://git-scm.com/downloads) or by downloading a
+4. Get SparkR source code either using [`git`](http://git-scm.com/downloads) or by downloading a
 source zip from github.
-6. Open a command shell (`cmd`) in the SparkR directory and run `install-dev.bat`
+5. Open a command shell (`cmd`) in the SparkR directory and run `install-dev.bat`
```

README.md

Lines changed: 46 additions & 0 deletions
````diff
@@ -46,6 +46,22 @@ the environment variable `USE_MAVEN=1`. For example
 If you are building SparkR from behind a proxy, you can [setup maven](https://maven.apache.org/guides/mini/guide-proxies.html) to use the right proxy
 server.
 
+#### Building from source from GitHub
+
+Run the following within R to pull source code from GitHub and build locally. It is possible
+to specify build dependencies by starting R with environment values:
+
+1. Start R
+```
+SPARK_VERSION=1.2.0 SPARK_HADOOP_VERSION=2.5.0 R
+```
+
+2. Run install_github
+```
+library(devtools)
+install_github("repo/SparkR-pkg", ref="branchname", subdir="pkg")
+```
+*note: replace repo and branchname*
 
 ## Running sparkR
 If you have cloned and built SparkR, you can start using it by launching the SparkR
@@ -110,17 +126,47 @@ cd SparkR-pkg/
 USE_YARN=1 SPARK_YARN_VERSION=2.4.0 SPARK_HADOOP_VERSION=2.4.0 ./install-dev.sh
 ```
 
+Alternatively, install_github can be used (on CDH in this case):
+
+```
+# assume the devtools package is installed via install.packages("devtools")
+USE_YARN=1 SPARK_VERSION=1.1.0 SPARK_YARN_VERSION=2.5.0-cdh5.3.0 SPARK_HADOOP_VERSION=2.5.0-cdh5.3.0 R
+```
+Then within R,
+```
+library(devtools)
+install_github("amplab-extras/SparkR-pkg", ref="master", subdir="pkg")
+```
+
 Before launching an application, make sure each worker node has a local copy of `lib/SparkR/sparkr-assembly-0.1.jar`. With a cluster launched with the `spark-ec2` script, do:
 ```
 ~/spark-ec2/copy-dir ~/SparkR-pkg
 ```
+Or run the above installation steps on all worker nodes.
 
 Finally, when launching an application, the environment variable `YARN_CONF_DIR` needs to be set to the directory which contains the client-side configuration files for the Hadoop cluster (with a cluster launched with `spark-ec2`, this defaults to `/root/ephemeral-hdfs/conf/`):
 ```
 YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ MASTER=yarn-client ./sparkR
 YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ ./sparkR examples/pi.R yarn-client
 ```
 
+## Running on a cluster using sparkR-submit
+
+sparkR-submit is a script introduced to facilitate submission of SparkR jobs to a Spark-supported cluster (e.g. Standalone, Mesos, YARN).
+It supports the same command-line parameters as [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html). SPARK_HOME and JAVA_HOME must be defined.
+
+On YARN, YARN_CONF_DIR must be defined. sparkR-submit supports [YARN deploy modes](http://spark.apache.org/docs/latest/running-on-yarn.html): yarn-client and yarn-cluster.
+
+sparkR-submit is installed with the SparkR package. By default, it can be found under the default library (the [`library`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html) subdirectory of R_HOME).
+
+For example, to run on YARN (CDH 5.3.0),
+```
+export SPARK_HOME=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark
+export YARN_CONF_DIR=/etc/hadoop/conf
+export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
+/usr/lib64/R/library/SparkR/sparkR-submit --master yarn-client examples/pi.R yarn-client 4
+```
+
 ## Report Issues/Feedback
 
 For better tracking and collaboration, issues and TODO items are reported to a dedicated [SparkR JIRA](https://sparkr.atlassian.net/browse/SPARKR/).
````

pkg/DESCRIPTION

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ Suggests:
1515
Description: R frontend for Spark
1616
License: Apache License (== 2.0)
1717
Collate:
18+
'generics.R'
1819
'jobj.R'
1920
'SQLTypes.R'
2021
'RDD.R'

pkg/NAMESPACE

Lines changed: 5 additions & 1 deletion
```diff
@@ -6,6 +6,7 @@ exportMethods(
   "aggregateRDD",
   "cache",
   "checkpoint",
+  "coalesce",
   "cogroup",
   "collect",
   "collectAsMap",
@@ -48,6 +49,7 @@ exportMethods(
   "reduce",
   "reduceByKey",
   "reduceByKeyLocally",
+  "repartition",
   "rightOuterJoin",
   "sampleRDD",
   "saveAsTextFile",
@@ -61,7 +63,9 @@ exportMethods(
   "unionRDD",
   "unpersist",
   "value",
-  "values"
+  "values",
+  "zipWithIndex",
+  "zipWithUniqueId"
 )
 
 # S3 methods exported
```
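The newly exported `zipWithIndex` and `zipWithUniqueId` mirror the corresponding Scala RDD operations. A minimal sketch of how they might be called from this SparkR package (assuming a local SparkContext; the `sparkR.init`, `parallelize`, and `collect` names follow the SparkR RDD API, and the commented results are illustrative, not verified output):

```r
library(SparkR)
sc <- sparkR.init(master = "local")

rdd <- parallelize(sc, c("a", "b", "c"), 2L)

# zipWithIndex pairs each element with its 0-based position across the RDD;
# it may trigger a job to count elements in earlier partitions
indexed <- zipWithIndex(rdd)
collect(indexed)  # pairs such as list("a", 0), list("b", 1), ...

# zipWithUniqueId assigns ids that are unique but not necessarily
# consecutive, computed from partition index alone (no extra job)
uniq <- zipWithUniqueId(rdd)
collect(uniq)
```

The design trade-off is the usual one from Spark core: `zipWithUniqueId` avoids a pass over the data, at the cost of gaps in the id sequence.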

pkg/R/DataFrame.R

Lines changed: 0 additions & 4 deletions
```diff
@@ -393,10 +393,6 @@ setMethod("unpersist",
 #' df <- jsonFile(sqlCtx, path)
 #' newDF <- repartition(df, 2L)
 #'}
-setGeneric("repartition", function(x, numPartitions) { standardGeneric("repartition") })
-
-#' @rdname repartition
-#' @export
 setMethod("repartition",
           signature(x = "DataFrame", numPartitions = "numeric"),
           function(x, numPartitions) {
```
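The `setGeneric` for `repartition` is removed from DataFrame.R because the DESCRIPTION change adds a `'generics.R'` file at the front of the Collate order, so generics are declared once before any file defines methods on them. A minimal sketch of the resulting split (illustrative; the method body here is an assumption about the JVM-backed implementation, not the actual file contents):

```r
# generics.R -- declare the generic exactly once, collated first
setGeneric("repartition", function(x, numPartitions) {
  standardGeneric("repartition")
})

# DataFrame.R -- only the DataFrame method remains
setMethod("repartition",
          signature(x = "DataFrame", numPartitions = "numeric"),
          function(x, numPartitions) {
            # hypothetical sketch: delegate to the JVM-side
            # DataFrame's repartition via the stored Java object ref
            sdf <- callJMethod(x@sdf, "repartition", as.integer(numPartitions))
            dataFrame(sdf)
          })
```

Keeping generics in one file avoids the "creating a new generic" warnings and masking problems that arise when several collated files each call `setGeneric` for the same name.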
