@@ -275,9 +275,7 @@ We describe operations on distributed datasets later on.
 **Note:** *In this guide, we'll often use the concise Java 8 lambda syntax to specify Java functions, but
 in older versions of Java you can implement the interfaces in the
 [org.apache.spark.api.java.function](api/java/org/apache/spark/api/java/function/package-summary.html) package.
-For example, for the `reduce` above, we could create a
-[Function2](api/java/org/apache/spark/api/java/function/Function2.html) that adds two numbers.
-We describe [writing functions in Java](#java-functions) in more detail below.*
+We describe [passing functions to Spark](#passing-functions-to-spark) in more detail below.*
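+For example, a sketch of that older style for a `reduce` that adds two numbers, written against the
+[Function2](api/java/org/apache/spark/api/java/function/Function2.html) interface (`lineLengths` here
+is assumed to be a `JavaRDD<Integer>`, as in the examples below):
+
+{% highlight java %}
+// Anonymous-class equivalent of lineLengths.reduce((a, b) -> a + b)
+int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
+  public Integer call(Integer a, Integer b) {
+    return a + b;
+  }
+});
+{% endhighlight %}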
 
 </div>
 
@@ -409,7 +407,7 @@ By default, each transformed RDD may be recomputed each time you run an action o
 
 <div class="codetabs">
 
-<div data-lang="scala" markdown="1">
+<div data-lang="scala" markdown="1">
 
 To illustrate RDD basics, consider the simple program below:
 
@@ -435,7 +433,71 @@ lineLengths.persist()
 
 which would cause it to be saved in memory after the first time it is computed.
 
-<h4 id="scala-functions">Passing Functions in Scala</h4>
+</div>
+
+<div data-lang="java" markdown="1">
+
+To illustrate RDD basics, consider the simple program below:
+
+{% highlight java %}
+JavaRDD<String> lines = sc.textFile("data.txt");
+JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
+int totalLength = lineLengths.reduce((a, b) -> a + b);
+{% endhighlight %}
+
+The first line defines a base RDD from an external file. This dataset is not loaded in memory or
+otherwise acted on: `lines` is merely a pointer to the file.
+The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
+is *not* immediately computed, due to laziness.
+Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
+to run on separate machines, and each machine runs both its part of the map and a local reduction,
+returning only its answer to the driver program.
+
+If we also wanted to use `lineLengths` again later, we could add:
+
+{% highlight java %}
+lineLengths.persist(StorageLevel.MEMORY_ONLY());
+{% endhighlight %}
+
+which would cause it to be saved in memory after the first time it is computed.
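+As a shorthand, `cache()` persists an RDD at the default in-memory storage level, so a minimal
+equivalent sketch is:
+
+{% highlight java %}
+// Same effect as persist(StorageLevel.MEMORY_ONLY())
+lineLengths.cache();
+{% endhighlight %}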
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+To illustrate RDD basics, consider the simple program below:
+
+{% highlight python %}
+lines = sc.textFile("data.txt")
+lineLengths = lines.map(lambda s: len(s))
+totalLength = lineLengths.reduce(lambda a, b: a + b)
+{% endhighlight %}
+
+The first line defines a base RDD from an external file. This dataset is not loaded in memory or
+otherwise acted on: `lines` is merely a pointer to the file.
+The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
+is *not* immediately computed, due to laziness.
+Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
+to run on separate machines, and each machine runs both its part of the map and a local reduction,
+returning only its answer to the driver program.
+
+If we also wanted to use `lineLengths` again later, we could add:
+
+{% highlight python %}
+lineLengths.persist()
+{% endhighlight %}
+
+which would cause it to be saved in memory after the first time it is computed.
+
+</div>
+
+</div>
+
+### Passing Functions to Spark
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
 
 Spark's API relies heavily on passing functions in the driver program to run on the cluster.
 There are two recommended ways to do this:
@@ -491,32 +553,6 @@ def doStuff(rdd: RDD[String]): RDD[String] = {
 
 <div data-lang="java" markdown="1">
 
-To illustrate RDD basics, consider the simple program below:
-
-{% highlight java %}
-JavaRDD<String> lines = sc.textFile("data.txt");
-JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
-int totalLength = lineLengths.reduce((a, b) -> a + b);
-{% endhighlight %}
-
-The first line defines a base RDD from an external file. This dataset is not loaded in memory or
-otherwise acted on: `lines` is merely a pointer to the file.
-The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
-is *not* immediately computed, due to laziness.
-Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
-to run on separate machines, and each machine runs both its part of the map and a local reduction,
-returning only its answer to the driver program.
-
-If we also wanted to use `lineLengths` again later, we could add:
-
-{% highlight java %}
-lineLengths.persist();
-{% endhighlight %}
-
-which would cause it to be saved in memory after the first time it is computed.
-
-<h4 id="java-functions">Passing Functions in Java</h4>
-
 Spark's API relies heavily on passing functions in the driver program to run on the cluster.
 In Java, functions are represented by classes implementing the interfaces in the
 [org.apache.spark.api.java.function](api/java/org/apache/spark/api/java/function/package-summary.html) package.
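+For example, a sketch of the earlier `map` written against the
+[Function](api/java/org/apache/spark/api/java/function/Function.html) interface instead of a lambda
+(assuming `lines` is a `JavaRDD<String>` as above):
+
+{% highlight java %}
+JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
+  public Integer call(String s) {
+    return s.length();
+  }
+});
+{% endhighlight %}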
@@ -563,32 +599,6 @@ for other languages.
 
 <div data-lang="python" markdown="1">
 
-To illustrate RDD basics, consider the simple program below:
-
-{% highlight python %}
-lines = sc.textFile("data.txt")
-lineLengths = lines.map(lambda s: len(s))
-totalLength = lineLengths.reduce(lambda a, b: a + b)
-{% endhighlight %}
-
-The first line defines a base RDD from an external file. This dataset is not loaded in memory or
-otherwise acted on: `lines` is merely a pointer to the file.
-The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
-is *not* immediately computed, due to laziness.
-Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
-to run on separate machines, and each machine runs both its part of the map and a local reduction,
-returning only its answer to the driver program.
-
-If we also wanted to use `lineLengths` again later, we could add:
-
-{% highlight scala %}
-lineLengths.persist()
-{% endhighlight %}
-
-which would cause it to be saved in memory after the first time it is computed.
-
-<h4 id="python-functions">Passing Functions in Python</h4>
-
 Spark's API relies heavily on passing functions in the driver program to run on the cluster.
 There are three recommended ways to do this:
 