
Commit ad9bdc2

Use labeled points and predictOnValues in examples

1 parent 77dbd3f commit ad9bdc2

2 files changed: 12 additions, 6 deletions


docs/mllib-clustering.md

Lines changed: 3 additions & 2 deletions
@@ -180,6 +180,7 @@ First we import the neccessary classes.
 {% highlight scala %}
 
 import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.clustering.StreamingKMeans
 
 {% endhighlight %}
@@ -189,7 +190,7 @@ Then we make an input stream of vectors for training, as well as one for testing
 {% highlight scala %}
 
 val trainingData = ssc.textFileStream("/training/data/dir").map(Vectors.parse)
-val testData = ssc.textFileStream("/testing/data/dir").map(Vectors.parse)
+val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)
 
 {% endhighlight %}
 
@@ -211,7 +212,7 @@ Now register the streams for training and testing and start the job, printing th
 {% highlight scala %}
 
 model.trainOn(trainingData)
-model.predictOn(testData).print()
+model.predictOnValues(testData).print()
 
 ssc.start()
 ssc.awaitTermination()
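The change above is the core of the commit: the test stream is parsed as labeled points, and `predictOn` (which returns bare cluster assignments) is replaced by `predictOnValues` (which works on a key/value stream and keeps each record's key alongside its predicted cluster). A minimal hedged sketch of the idea, assuming an already-constructed `ssc` (StreamingContext) and trained `model` as in the surrounding docs snippet — the path is illustrative:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.clustering.StreamingKMeans

// Each test row "(y,[x1,...,xn])" becomes a LabeledPoint with a label y
// and a feature vector [x1,...,xn].
val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)

// predictOnValues expects a DStream of (key, Vector) pairs; pairing the
// label with the features makes the output a stream of
// (label, predictedClusterIndex) tuples, so predictions stay associated
// with their identifiers.
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
```

The explicit `.map(lp => (lp.label, lp.features))` pairing is the form used in the example file below; whether the shorter `predictOnValues(testData)` call in the docs hunk compiles against that signature depends on the Spark version, so treat this sketch as the safe spelling.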

examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeans.scala

Lines changed: 9 additions & 4 deletions
@@ -18,6 +18,7 @@
 package org.apache.spark.examples.mllib
 
 import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.clustering.StreamingKMeans
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}
@@ -27,9 +28,13 @@ import org.apache.spark.streaming.{Seconds, StreamingContext}
  * on another stream, where the data streams arrive as text files
  * into two different directories.
  *
- * The rows of the text files must be vector data in the form
+ * The rows of the training text files must be vector data in the form
  * `[x1,x2,x3,...,xn]`
- * Where n is the number of dimensions. n must be the same for train and test.
+ * Where n is the number of dimensions.
+ *
+ * The rows of the test text files must be labeled data in the form
+ * `(y,[x1,x2,x3,...,xn])`
+ * Where y is some identifier. n must be the same for train and test.
  *
  * Usage: StreamingKmeans <trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>
  *
@@ -57,15 +62,15 @@ object StreamingKMeans {
     val ssc = new StreamingContext(conf, Seconds(args(2).toLong))
 
     val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
-    val testData = ssc.textFileStream(args(1)).map(Vectors.parse)
+    val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)
 
     val model = new StreamingKMeans()
       .setK(args(3).toInt)
       .setDecayFactor(1.0)
       .setRandomCenters(args(4).toInt)
 
     model.trainOn(trainingData)
-    model.predictOn(testData).print()
+    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
 
     ssc.start()
     ssc.awaitTermination()
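For reference, the `(y,[x1,x2,x3,...,xn])` format described in the updated scaladoc is the text form that `LabeledPoint.parse` consumes. A small hedged sketch, with illustrative values not taken from the commit:

```scala
import org.apache.spark.mllib.regression.LabeledPoint

// A test-file row: label 1.0, three-dimensional feature vector.
val lp = LabeledPoint.parse("(1.0,[2.0,3.0,4.0])")

// Splitting the point into (key, vector) yields exactly the pair shape
// that predictOnValues expects, so the label survives into the output.
val keyed = (lp.label, lp.features)
```

Because the label is only used as an identifier key here, it does not influence clustering; it just lets you line up each prediction with the point it came from when the results print.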
