There are a number of optimizations that can be done in Spark to minimize the processing time of
each batch. These have been discussed in detail in the [Tuning Guide](tuning.html). This section
highlights some of the most important ones.

### Level of Parallelism in Data Receiving
Since the receiver of each input stream (other than file stream) runs on a single worker, often
that proves to be the bottleneck in increasing the throughput. Consider receiving the data
in parallel through multiple receivers. This can be done by creating two input streams and
configuring them to receive different partitions of the data stream from the data source(s).
For example, a single Kafka stream receiving two topics of data can be split into two
Kafka streams receiving one topic each. This would run two receivers on two workers, thus allowing
data to be received in parallel, and increasing overall throughput.
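
As a rough sketch (assuming the receiver-based Kafka API, an existing `StreamingContext` named `ssc`, and placeholder values for the ZooKeeper quorum, consumer group, and topic names), this could look like:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Two input streams, one per topic, so that two receivers run in parallel
// on two different workers instead of one receiver handling both topics.
val stream1 = KafkaUtils.createStream(ssc, "zkhost:2181", "consumer-group", Map("topic1" -> 1))
val stream2 = KafkaUtils.createStream(ssc, "zkhost:2181", "consumer-group", Map("topic2" -> 1))

// Union the two streams so the rest of the pipeline sees a single DStream.
val unifiedStream = stream1.union(stream2)
```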

Another parameter that should be considered is the receiver's block interval. For most receivers,
the received data is coalesced together into large blocks of data before storing inside Spark's memory.
The number of blocks in each batch determines the number of tasks that will be used to process
the received data in a map-like transformation. This block interval is determined by the
[configuration parameter](configuration.html) `spark.streaming.blockInterval`, and the default value
is 200 milliseconds.
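
For instance, a minimal sketch of the arithmetic, assuming a 2-second batch interval and that the property takes a value in milliseconds:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// With a 2 s batch interval and a 100 ms block interval, each receiver
// produces 2000 ms / 100 ms = 20 blocks per batch, i.e. about 20 map
// tasks per receiver, versus 2000 ms / 200 ms = 10 with the default.
val conf = new SparkConf()
  .setMaster("local[4]")              // illustrative local master
  .setAppName("BlockIntervalExample") // illustrative app name
  .set("spark.streaming.blockInterval", "100")
val ssc = new StreamingContext(conf, Seconds(2))
```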

If it is infeasible to parallelize the receiving using multiple input streams / receivers, it is
sometimes beneficial to explicitly repartition the input data stream
(using `inputStream.repartition(<number of partitions>)`) to distribute the received
data across all the machines in the cluster before further processing.
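
A short sketch, reusing the hypothetical `unifiedStream` from above (the partition count of 10 is purely illustrative):

```scala
// Spread the received blocks across the cluster before further processing,
// so all machines participate instead of only the receivers' workers.
val repartitionedStream = unifiedStream.repartition(10)

// Subsequent map-like transformations now run with 10 tasks per batch.
val valueLengths = repartitionedStream.map { case (_, value) => value.length }
```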

### Level of Parallelism in Data Processing
Cluster resources may be under-utilized if the number of parallel tasks used in any stage of the
computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
parallelism as an argument (see the `PairDStreamFunctions` documentation), or set the
[config property](configuration.html) `spark.default.parallelism` to change the default.
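
A minimal sketch (the pair DStream `pairs` and the parallelism value of 16 are illustrative, not recommendations):

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations

// Assume pairs: DStream[(String, Int)], e.g. built with words.map(w => (w, 1)).
// Pass an explicit number of parallel tasks to the shuffle-based operations
// instead of relying on the default of 8.
val counts = pairs.reduceByKey((a: Int, b: Int) => a + b, 16)

// The windowed variant takes the same hint: a 30-second window sliding
// every 10 seconds, reduced with 16 parallel tasks.
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10), 16)
```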

Hence it is necessary to set the delay to at least the value of the largest window operation used
in the Spark Streaming application. If this delay is set too low, the application will throw an
exception saying so.
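
As an illustrative sketch (assuming the delay in question is the one controlled by `spark.cleaner.ttl`, in seconds, and a largest window of 10 minutes in a hypothetical application):

```scala
import org.apache.spark.SparkConf

// Largest window in this hypothetical application: 10 minutes (600 s).
// The cleanup delay must be at least that; 1200 s leaves extra headroom
// so data still needed by window operations is not cleared prematurely.
val conf = new SparkConf().set("spark.cleaner.ttl", "1200")
```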