[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example
Add
- R vignettes
- R programming guide
- SS programming guide
- R example
Also disable spark.als in vignettes for now since it's failing (SPARK-20402).

Tested manually.
Author: Felix Cheung <[email protected]>
Closes #17814 from felixcheung/rdocss.
(cherry picked from commit b8302cc)
Signed-off-by: Felix Cheung <[email protected]>

R/pkg/vignettes/sparkr-vignettes.Rmd (72 additions, 5 deletions)
@@ -182,7 +182,7 @@ head(df)
### Data Sources
-SparkR supports operating on a variety of data sources through the `SparkDataFrame` interface. You can check the Spark SQL programming guide for more [specific options](https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options) that are available for the built-in data sources.
+SparkR supports operating on a variety of data sources through the `SparkDataFrame` interface. You can check the Spark SQL Programming Guide for more [specific options](https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options) that are available for the built-in data sources.

The general method for creating `SparkDataFrame` from data sources is `read.df`. This method takes in the path for the file to load and the type of data source, and the currently active Spark Session will be used automatically. SparkR supports reading CSV, JSON and Parquet files natively and through Spark Packages you can find data source connectors for popular file formats like Avro. These packages can be added with `sparkPackages` parameter when initializing SparkSession using `sparkR.session`.
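
As an illustration of the `read.df` pattern described above, a minimal sketch of loading a JSON file; the file path and the inspection calls are illustrative assumptions, not taken from the vignette:

```r
# Start (or reuse) a Spark session; sparkPackages is only needed for
# external data source connectors such as Avro
sparkR.session()

# Load a JSON data source into a SparkDataFrame; the second argument names the source type
people <- read.df("path/to/people.json", "json")

# Inspect the inferred schema and a few rows
printSchema(people)
head(people)
```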
-You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with Hive support and more details can be found in the [SQL programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`).
+You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with Hive support and more details can be found in the [SQL Programming Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`).

```{r, eval=FALSE}
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
```
Survival analysis studies the expected duration of time until an event happens, and often the relationship with risk factors or treatment taken on the subject. In contrast to standard regression analysis, survival modeling has to deal with special characteristics in the data including non-negative survival time and censoring.

Accelerated Failure Time (AFT) model is a parametric survival model for censored data that assumes the effect of a covariate is to accelerate or decelerate the life course of an event by some constant. For more information, refer to the Wikipedia page [AFT Model](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) and the references there. Different from a [Proportional Hazards Model](https://en.wikipedia.org/wiki/Proportional_hazards_model) designed for the same purpose, the AFT model is easier to parallelize because each instance contributes to the objective function independently.

```{r, warning=FALSE}
library(survival)
ovarianDF <- createDataFrame(ovarian)
```
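
The excerpt above stops before the model-fitting step; a minimal sketch of what an AFT fit on `ovarianDF` with `spark.survreg` could look like (the formula and column choices are illustrative assumptions):

```r
# Fit an accelerated failure time (AFT) survival regression model
aftModel <- spark.survreg(ovarianDF, Surv(futime, fustat) ~ ecog_ps + rx)

# Inspect the fitted coefficients and score the training data
summary(aftModel)
head(predict(aftModel, ovarianDF))
```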
@@ -887,15 +888,15 @@ perplexity
There are multiple options that can be configured in `spark.als`, including `rank`, `reg`, `nonnegative`. For a complete list, refer to the help file.
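
For context, a minimal sketch of a `spark.als` call using the options named above; the toy ratings data and option values are illustrative assumptions (the vignette keeps its own ALS example disabled while SPARK-20402 is open):

```r
# Toy explicit-feedback ratings with the default column names
ratings <- createDataFrame(data.frame(
  user   = c(0, 0, 1, 1, 2),
  item   = c(0, 1, 1, 2, 2),
  rating = c(4.0, 2.0, 3.0, 4.0, 5.0)
))

# Fit ALS, setting a few of the configurable options
alsModel <- spark.als(ratings, ratingCol = "rating", userCol = "user", itemCol = "item",
                      rank = 10, reg = 0.1, nonnegative = TRUE)

summary(alsModel)
head(predict(alsModel, ratings))
```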
SparkR supports the Structured Streaming API (experimental).

You can check the Structured Streaming Programming Guide for [an introduction](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#programming-model) to its programming model and basic concepts.

### Simple Source and Sink

Spark has a few built-in input sources. As an example, to test with a socket source reading text into words and displaying the computed word counts:

```{r, eval=FALSE}
# Create DataFrame representing the stream of input lines from connection
lines <- read.stream("socket", host = hostname, port = port)

# Split the lines into words
words <- selectExpr(lines, "explode(split(value, ' ')) as word")

# Generate running word count
wordCounts <- count(groupBy(words, "word"))

# Start running the query that prints the running counts to the console
```
It is simple to read data from Kafka. For more information, see [Input Sources](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources) supported by Structured Streaming.

```{r, eval=FALSE}
keyvalue <- selectExpr(topic, "CAST(key AS STRING)", "CAST(value AS STRING)")
```
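
The `topic` streaming `SparkDataFrame` referenced above is defined outside this excerpt; a minimal sketch of the corresponding Kafka read, with placeholder server and topic names:

```r
# Subscribe to a Kafka topic as a streaming source
topic <- read.stream("kafka",
                     kafka.bootstrap.servers = "host1:port1,host2:port2",
                     subscribe = "topic1")
```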
### Operations and Sinks

Most of the common operations on `SparkDataFrame` are supported for streaming, including selection, projection, and aggregation. Once you have defined the final result, to start the streaming computation, you will call the `write.stream` method setting a sink and `outputMode`.

A streaming `SparkDataFrame` can be written for debugging to the console, to a temporary in-memory table, or for further processing in a fault-tolerant manner to a File Sink in different formats.
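
A minimal sketch of these sinks for the word-count stream built earlier; the console sink uses complete output mode for the aggregation, while the file sink writes the un-aggregated `words` stream (output and checkpoint paths are placeholder values):

```r
# Debugging sink: print the complete running counts to the console
query <- write.stream(wordCounts, "console", outputMode = "complete")

# Fault-tolerant file sink: append the un-aggregated words as Parquet files
fileQuery <- write.stream(words, "parquet",
                          path = "/tmp/words-out",
                          checkpointLocation = "/tmp/words-cp",
                          outputMode = "append")

# Stop the continuous queries when finished
stopQuery(query)
stopQuery(fileQuery)
```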

docs/sparkr.md (4 additions, 0 deletions)
@@ -559,6 +559,10 @@ The following example shows how to save/load a MLlib model by SparkR.
</tr>
</table>

# Structured Streaming

SparkR supports the Structured Streaming API (experimental). Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. For more information see the R API on the [Structured Streaming Programming Guide](structured-streaming-programming-guide.html).

# R Function Name Conflicts

When loading and attaching a new package in R, it is possible to have a name [conflict](https://stat.ethz.ch/R-manual/R-devel/library/base/html/library.html), where a