
Commit 96df929

ankurdave authored and JoshRosen committed
[SPARK-3190] Avoid overflow in VertexRDD.count()
VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting partition sizes to Longs before summing them.

The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion):

```scala
val pairs = sc.parallelize(0L until 500L).map(_ * 10000000)
  .flatMap(start => start until (start + 10000000)).map(x => (x, x))
VertexRDD(pairs).count()
```

Author: Ankur Dave <[email protected]>

Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits:

641f468 [Ankur Dave] Avoid overflow in VertexRDD.count()
1 parent: 3901245 · commit: 96df929

File tree

1 file changed: +1, −1 lines changed


graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -108,7 +108,7 @@ class VertexRDD[@specialized VD: ClassTag](
 
   /** The number of vertices in the RDD. */
   override def count(): Long = {
-    partitionsRDD.map(_.size).reduce(_ + _)
+    partitionsRDD.map(_.size.toLong).reduce(_ + _)
   }
 
   /**
```
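The root cause is ordinary JVM `Int` arithmetic: each partition's `size` is an `Int`, and summing `Int`s silently wraps past 2^31 − 1, so the fix widens each size to `Long` before the reduce. The wraparound is easy to reproduce outside Spark; here is a minimal sketch in plain Scala, where `partitionSizes` and `OverflowDemo` are hypothetical names standing in for the real partition sizes:

```scala
object OverflowDemo {
  def main(args: Array[String]): Unit = {
    // Five hypothetical partitions of one billion vertices each;
    // the true total (5000000000) exceeds Int.MaxValue (2147483647).
    val partitionSizes: Seq[Int] = Seq.fill(5)(1000000000)

    // Pre-fix behavior: summing Ints wraps around.
    val intCount = partitionSizes.reduce(_ + _)
    println(intCount) // 705032704, not 5000000000

    // Post-fix behavior: widen each size to Long before summing.
    val longCount = partitionSizes.map(_.toLong).reduce(_ + _)
    println(longCount) // 5000000000
  }
}
```

With different partition-size mixes the wrapped `Int` sum can also come out negative, which matches the -10000000 result reported in the commit message.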
