Commit fe4e3d5: Add more docs for Hadoop Configuration
1 parent: a878660

2 files changed: +45 -2 lines

core/src/main/scala/org/apache/spark/SparkContext.scala (17 additions, 2 deletions)
@@ -242,7 +242,11 @@ class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging {
   // the bound port to the cluster manager properly
   ui.foreach(_.bind())

-  /** A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse. */
+  /** A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse.
+   *
+   * '''Note:''' As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
+   * plan to set some global configurations for all Hadoop RDDs.
+   */
   val hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(conf)

   val startTime = System.currentTimeMillis()
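The note above concerns the single, context-wide configuration object. A minimal sketch of the intended usage, assuming a running SparkContext `sc`; the property key and path below are illustrative, not from the commit:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hadoop-conf-demo"))

// Intended use per the note: set options that should apply globally, to
// every Hadoop RDD, once and up front.
sc.hadoopConfiguration.set("io.file.buffer.size", "65536")

// Anti-pattern per the note: tweaking sc.hadoopConfiguration between jobs
// to customize a single RDD, since all Hadoop RDDs share this one object.
val lines = sc.textFile("hdfs:///data/input.txt")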
@@ -630,7 +634,10 @@ class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging {
   * necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable),
   * using the older MapReduce API (`org.apache.hadoop.mapred`).
   *
-  * @param conf JobConf for setting up the dataset
+  * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+  *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+  *             sure you won't modify the conf. A safe approach is always creating a new conf for
+  *             a new RDD.
   * @param inputFormatClass Class of the InputFormat
   * @param keyClass Class of the keys
   * @param valueClass Class of the values
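Given that caveat, a safe-usage sketch for the old-API `hadoopRDD` might look like the following; the helper and paths are hypothetical, but the pattern (a fresh JobConf per RDD, never mutated after use) is what the new doc recommends:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Hypothetical helper: build a fresh JobConf for each RDD instead of
// mutating one shared instance after it has been broadcast.
def newInputConf(path: String): JobConf = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  FileInputFormat.setInputPaths(jobConf, path)
  jobConf
}

val rddA = sc.hadoopRDD(newInputConf("hdfs:///data/a"),
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val rddB = sc.hadoopRDD(newInputConf("hdfs:///data/b"),
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])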
@@ -756,6 +763,14 @@ class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging {
   * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
   * and extra configuration options to pass to the input format.
   *
+  * @param conf Configuration for setting up the dataset. Note: This will be put into a Broadcast.
+  *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+  *             sure you won't modify the conf. A safe approach is always creating a new conf for
+  *             a new RDD.
+  * @param fClass Class of the InputFormat
+  * @param kClass Class of the keys
+  * @param vClass Class of the values
+  *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD will create many references to the same object.
   * If you plan to directly cache Hadoop writable objects, you should first copy them using
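The same rule applies to this new-API variant. A sketch, again assuming `sc` is in scope; the record-delimiter option and path are illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the context-wide configuration rather than mutating it: the copy
// inherits the global settings but can safely carry per-RDD options.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")

val records = sc.newAPIHadoopFile("hdfs:///data/records.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)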

core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala (28 additions, 0 deletions)
@@ -387,6 +387,15 @@ class JavaSparkContext(val sc: SparkContext)
   * other necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable,
   * etc).
   *
+  * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+  *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+  *             sure you won't modify the conf. A safe approach is always creating a new conf for
+  *             a new RDD.
+  * @param inputFormatClass Class of the InputFormat
+  * @param keyClass Class of the keys
+  * @param valueClass Class of the values
+  * @param minPartitions Minimum number of Hadoop Splits to generate.
+  *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD will create many references to the same object.
   * If you plan to directly cache Hadoop writable objects, you should first copy them using
@@ -409,6 +418,14 @@ class JavaSparkContext(val sc: SparkContext)
   * Get an RDD for a Hadoop-readable dataset from a Hadooop JobConf giving its InputFormat and any
   * other necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable,
   *
+  * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+  *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+  *             sure you won't modify the conf. A safe approach is always creating a new conf for
+  *             a new RDD.
+  * @param inputFormatClass Class of the InputFormat
+  * @param keyClass Class of the keys
+  * @param valueClass Class of the values
+  *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD will create many references to the same object.
   * If you plan to directly cache Hadoop writable objects, you should first copy them using
@@ -490,6 +507,14 @@ class JavaSparkContext(val sc: SparkContext)
   * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
   * and extra configuration options to pass to the input format.
   *
+  * @param conf Configuration for setting up the dataset. Note: This will be put into a Broadcast.
+  *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+  *             sure you won't modify the conf. A safe approach is always creating a new conf for
+  *             a new RDD.
+  * @param fClass Class of the InputFormat
+  * @param kClass Class of the keys
+  * @param vClass Class of the values
+  *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD will create many references to the same object.
   * If you plan to directly cache Hadoop writable objects, you should first copy them using
@@ -689,6 +714,9 @@ class JavaSparkContext(val sc: SparkContext)

   /**
    * Returns the Hadoop configuration used for the Hadoop code (e.g. file systems) we reuse.
+   *
+   * '''Note:''' As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
+   * plan to set some global configurations for all Hadoop RDDs.
    */
   def hadoopConfiguration(): Configuration = {
     sc.hadoopConfiguration
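The Java-facing accessor returns the same shared object, so the caveat carries over unchanged. A short sketch of that fact, with an illustrative property key:

import org.apache.spark.api.java.JavaSparkContext

val jsc = JavaSparkContext.fromSparkContext(sc)
// hadoopConfiguration() hands back the shared, context-wide object, not a
// copy, so any set(...) here affects every Hadoop RDD created afterwards.
jsc.hadoopConfiguration().set("io.file.buffer.size", "131072")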
