Commit 318d2c9 ("tweaks")
Parent: 1c81477

1 file changed: 67 additions, 57 deletions

docs/programming-guide.md

Lines changed: 67 additions & 57 deletions
@@ -275,9 +275,7 @@ We describe operations on distributed datasets later on.
 **Note:** *In this guide, we'll often use the concise Java 8 lambda syntax to specify Java functions, but
 in older versions of Java you can implement the interfaces in the
 [org.apache.spark.api.java.function](api/java/org/apache/spark/api/java/function/package-summary.html) package.
-For example, for the `reduce` above, we could create a
-[Function2](api/java/org/apache/spark/api/java/function/Function2.html) that adds two numbers.
-We describe [writing functions in Java](#java-functions) in more detail below.*
+We describe [passing functions to Spark](#passing-functions-to-spark) in more detail below.*

 </div>

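As an aside, a minimal sketch of the pre-Java-8 form that the removed sentence alludes to: the same `reduce` written against the `Function2` interface instead of a lambda. This snippet is not part of this commit; it assumes a `JavaRDD<Integer>` named `lineLengths`, as in the examples added later in this diff.

{% highlight java %}
// Not from this commit: an anonymous class implementing
// org.apache.spark.api.java.function.Function2, equivalent to the lambda (a, b) -> a + b.
int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) {
    return a + b;
  }
});
{% endhighlight %}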
@@ -409,7 +407,7 @@ By default, each transformed RDD may be recomputed each time you run an action o

 <div class="codetabs">

-<div data-lang="scala" markdown="1">
+<div data-lang="scala" markdown="1">

 To illustrate RDD basics, consider the simple program below:

@@ -435,7 +433,71 @@ lineLengths.persist()

 which would cause it to be saved in memory after the first time it is computed.

-<h4 id="scala-functions">Passing Functions in Scala</h4>
+</div>
+
+<div data-lang="java" markdown="1">
+
+To illustrate RDD basics, consider the simple program below:
+
+{% highlight java %}
+JavaRDD<String> lines = sc.textFile("data.txt");
+JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
+int totalLength = lineLengths.reduce((a, b) -> a + b);
+{% endhighlight %}
+
+The first line defines a base RDD from an external file. This dataset is not loaded in memory or
+otherwise acted on: `lines` is merely a pointer to the file.
+The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
+is *not* immediately computed, due to laziness.
+Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
+to run on separate machines, and each machine runs both its part of the map and a local reduction,
+returning only its answer to the driver program.
+
+If we also wanted to use `lineLengths` again later, we could add:
+
+{% highlight java %}
+lineLengths.persist();
+{% endhighlight %}
+
+which would cause it to be saved in memory after the first time it is computed.
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+To illustrate RDD basics, consider the simple program below:
+
+{% highlight python %}
+lines = sc.textFile("data.txt")
+lineLengths = lines.map(lambda s: len(s))
+totalLength = lineLengths.reduce(lambda a, b: a + b)
+{% endhighlight %}
+
+The first line defines a base RDD from an external file. This dataset is not loaded in memory or
+otherwise acted on: `lines` is merely a pointer to the file.
+The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
+is *not* immediately computed, due to laziness.
+Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
+to run on separate machines, and each machine runs both its part of the map and a local reduction,
+returning only its answer to the driver program.
+
+If we also wanted to use `lineLengths` again later, we could add:
+
+{% highlight python %}
+lineLengths.persist()
+{% endhighlight %}
+
+which would cause it to be saved in memory after the first time it is computed.
+
+</div>
+
+</div>
+
+### Passing Functions to Spark
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">

 Spark's API relies heavily on passing functions in the driver program to run on the cluster.
 There are two recommended ways to do this:
@@ -491,32 +553,6 @@ def doStuff(rdd: RDD[String]): RDD[String] = {

 <div data-lang="java" markdown="1">

-To illustrate RDD basics, consider the simple program below:
-
-{% highlight java %}
-JavaRDD<String> lines = sc.textFile("data.txt");
-JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
-int totalLength = lineLengths.reduce((a, b) -> a + b);
-{% endhighlight %}
-
-The first line defines a base RDD from an external file. This dataset is not loaded in memory or
-otherwise acted on: `lines` is merely a pointer to the file.
-The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
-is *not* immediately computed, due to laziness.
-Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
-to run on separate machines, and each machine runs both its part of the map and a local reduction,
-returning only its answer to the driver program.
-
-If we also wanted to use `lineLengths` again later, we could add:
-
-{% highlight java %}
-lineLengths.persist();
-{% endhighlight %}
-
-which would cause it to be saved in memory after the first time it is computed.
-
-<h4 id="java-functions">Passing Functions in Java</h4>
-
 Spark's API relies heavily on passing functions in the driver program to run on the cluster.
 In Java, functions are represented by classes implementing the interfaces in the
 [org.apache.spark.api.java.function](api/java/org/apache/spark/api/java/function/package-summary.html) package.
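For context, a minimal sketch of what implementing one of those interfaces can look like when passed to `map`. It is illustrative only and not part of this diff; it reuses the `sc` and `data.txt` names from the earlier example.

{% highlight java %}
// Illustrative only: a named class implementing
// org.apache.spark.api.java.function.Function, passed to map instead of a lambda.
class GetLength implements Function<String, Integer> {
  public Integer call(String s) {
    return s.length();
  }
}

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
{% endhighlight %}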
@@ -563,32 +599,6 @@ for other languages.

 <div data-lang="python" markdown="1">

-To illustrate RDD basics, consider the simple program below:
-
-{% highlight python %}
-lines = sc.textFile("data.txt")
-lineLengths = lines.map(lambda s: len(s))
-totalLength = lineLengths.reduce(lambda a, b: a + b)
-{% endhighlight %}
-
-The first line defines a base RDD from an external file. This dataset is not loaded in memory or
-otherwise acted on: `lines` is merely a pointer to the file.
-The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
-is *not* immediately computed, due to laziness.
-Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
-to run on separate machines, and each machine runs both its part of the map and a local reduction,
-returning only its answer to the driver program.
-
-If we also wanted to use `lineLengths` again later, we could add:
-
-{% highlight scala %}
-lineLengths.persist()
-{% endhighlight %}
-
-which would cause it to be saved in memory after the first time it is computed.
-
-<h4 id="python-functions">Passing Functions in Python</h4>
-
 Spark's API relies heavily on passing functions in the driver program to run on the cluster.
 There are three recommended ways to do this:

