Commit 4dc42e9

Added monitoring and other documentation in the streaming guide.
1 parent 14c6564 commit 4dc42e9

3 files changed: +49 -11 lines changed

docs/configuration.md

Lines changed: 1 addition & 1 deletion
@@ -472,7 +472,7 @@ Apart from these, the following properties are also available, and may be useful
 <td>spark.streaming.blockInterval</td>
 <td>200</td>
 <td>
-Duration (milliseconds) of how long to batch new objects coming from network receivers used
+Duration (milliseconds) of how long to batch new objects coming from receivers used
 in Spark Streaming.
 </td>
 </tr>

docs/streaming-custom-receivers.md

Lines changed: 1 addition & 1 deletion
@@ -193,7 +193,7 @@ The full source code is in the example [JavaCustomReceiver.java](https://github.



-### Actor-based Receivers
+### Implementing and Using a Custom Actor-based Receiver

 Custom [Akka Actors](http://doc.akka.io/docs/akka/2.2.4/scala/actors.html) can also be used to
 receive data. The [ActorHelper](api/scala/index.html#org.apache.spark.streaming.receiver.ActorHelper)

docs/streaming-programming-guide.md

Lines changed: 47 additions & 9 deletions
@@ -853,6 +853,32 @@ For DStreams that must be checkpointed (that is, DStreams created by `updateStat
 `reduceByKeyAndWindow` with inverse function), the checkpoint interval of the DStream is by
 default set to a multiple of the DStream's sliding interval such that it is at least 10 seconds.

+## Deployment and Monitoring
+A Spark Streaming application is deployed on a cluster in the same way as any other Spark application.
+Please refer to the [deployment guide](cluster-overview.html) for more details.
+
+Beyond Spark's [monitoring capabilities](monitoring.html), there are additional capabilities
+specific to Spark Streaming. When a StreamingContext is used, the
+[Spark web UI](monitoring.html#web-interfaces) shows
+an additional `Streaming` tab with statistics about running receivers (whether
+receivers are active, number of records received, receiver errors, etc.)
+and completed batches (batch processing times, queueing delays, etc.). This can be used to
+monitor the progress of the streaming application.
+
+Two metrics in the UI are particularly important:
+*Processing Time* and *Scheduling Delay* (under *Batch Processing Statistics*). The first is the
+time to process each batch of data, and the second is the time a batch waits in a queue
+for the processing of previous batches to finish. If the batch processing time is consistently more
+than the batch interval and/or the queueing delay keeps increasing, the system is
+not able to process the batches as fast as they are being generated and is falling behind.
+In that case, consider
+[reducing](#reducing-the-processing-time-of-each-batch) the batch processing time.
+
+The progress of a Spark Streaming program can also be monitored using the
+[StreamingListener](api/scala/index.html#org.apache.spark.scheduler.StreamingListener) interface,
+which allows you to get receiver status and processing times. Note that this is a developer API
+and it is likely to be improved upon (i.e., more information reported) in the future.
+
 ***************************************************************************************************

 # Performance Tuning
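
Review note on the hunk above: the "falling behind" rule described for *Processing Time* and *Scheduling Delay* can be sketched as a small check. This is only an illustration in plain Python, not part of the commit and not a Spark API. The function name `is_falling_behind` and its inputs (lists of per-batch values in milliseconds, as they might be read off the `Streaming` tab) are hypothetical.

```python
def is_falling_behind(batch_interval_ms, processing_times_ms, scheduling_delays_ms):
    """Hypothetical helper: apply the two warning signs to recent per-batch samples."""
    # Sign 1: every sampled batch took longer to process than the batch interval,
    # i.e. the processing time is consistently more than the batch interval.
    too_slow = bool(processing_times_ms) and all(
        t > batch_interval_ms for t in processing_times_ms
    )
    # Sign 2: the scheduling (queueing) delay grows from each batch to the next.
    delay_growing = len(scheduling_delays_ms) >= 2 and all(
        later > earlier
        for earlier, later in zip(scheduling_delays_ms, scheduling_delays_ms[1:])
    )
    return too_slow or delay_growing
```

With a 1000 ms batch interval, samples such as processing times `[1200, 1300, 1250]` would be flagged, while `[400, 450, 420]` with a flat scheduling delay would not. The strict "every sample" test is one possible reading of "consistently"; a real check might tolerate occasional outliers.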
@@ -874,7 +900,27 @@ There are a number of optimizations that can be done in Spark to minimize the pr
 each batch. These have been discussed in detail in [Tuning Guide](tuning.html). This section
 highlights some of the most important ones.

-### Level of Parallelism
+### Level of Parallelism in Data Receiving
+Since the receiver of each input stream (other than file stream) runs on a single worker, it often
+proves to be the bottleneck in increasing the throughput. Consider receiving the data
+in parallel through multiple receivers. This can be done by creating multiple input streams and
+configuring them to receive different partitions of the data stream from the data source(s).
+For example, a single Kafka stream receiving two topics of data can be split into two
+Kafka streams receiving one topic each. This would run two receivers on two workers, thus allowing
+data to be received in parallel, and increasing overall throughput.
+
+Another parameter that should be considered is the receiver's blocking interval. For most receivers,
+the received data is coalesced together into large blocks of data before being stored in Spark's memory.
+The number of blocks in each batch determines the number of tasks that will be used to process
+the received data in a map-like transformation. This blocking interval is determined by the
+[configuration parameter](configuration.html) `spark.streaming.blockInterval` and the default value
+is 200 milliseconds.
+
+If it is infeasible to parallelize the receiving using multiple input streams / receivers, it is sometimes beneficial to explicitly repartition the input data stream
+(using `inputStream.repartition(<number of partitions>)`) to distribute the received
+data across all the machines in the cluster before further processing.
+
+### Level of Parallelism in Data Processing
 Cluster resources may be under-utilized if the number of parallel tasks used in any stage of the
 computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
 and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
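
Review note on the hunk above: the relationship between the blocking interval and the number of map-like tasks can be made concrete with a little arithmetic. The sketch below is plain Python, not Spark code. The helper name `blocks_per_batch`, the 2-second batch interval, and the assumption that each receiver produces one block per block interval are all illustrative; only the 200 ms default comes from the text.

```python
def blocks_per_batch(batch_interval_ms, block_interval_ms=200, num_receivers=1):
    """Hypothetical helper: approximate blocks (and hence tasks) per batch."""
    # Assumes each receiver emits one block per block interval,
    # so a batch holds batch_interval / block_interval blocks per receiver.
    return num_receivers * (batch_interval_ms // block_interval_ms)


# Assumed 2-second batches with the default 200 ms block interval.
default_tasks = blocks_per_batch(2000)
# A shorter (assumed) 50 ms block interval yields more, smaller blocks.
finer_tasks = blocks_per_batch(2000, block_interval_ms=50)
# Two receivers, e.g. one Kafka stream per topic, double the block count.
two_receiver_tasks = blocks_per_batch(2000, num_receivers=2)
```

Under these assumptions, lowering `spark.streaming.blockInterval` or adding receivers raises the number of tasks available for the first map-like stage. Very small blocks add per-task overhead, so the setting is a trade-off rather than a value to minimize.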
@@ -947,14 +993,6 @@ Hence it is necessary to set the delay to at least the value of the largest wind
 in the Spark Streaming application. If this delay is set too low, the application will throw an
 exception saying so.

-## Monitoring
-Besides Spark's in-built [monitoring capabilities](monitoring.html),
-the progress of a Spark Streaming program can also be monitored using the [StreamingListener]
-(api/scala/index.html#org.apache.spark.scheduler.StreamingListener) interface,
-which allows you to get statistics of batch processing times, queueing delays,
-and total end-to-end delays. Note that this is still an experimental API and it is likely to be
-improved upon (i.e., more information reported) in the future.
-
 ## Memory Tuning
 Tuning the memory usage and GC behavior of Spark applications has been discussed in great detail
 in the [Tuning Guide](tuning.html). It is recommended that you read that. In this section,
