
Commit 96df929

ankurdave authored and JoshRosen committed
[SPARK-3190] Avoid overflow in VertexRDD.count()
VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting partition sizes to Longs before summing them.

The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion):

```scala
val pairs = sc.parallelize(0L until 500L).map(_ * 10000000)
  .flatMap(start => start until (start + 10000000)).map(x => (x, x))
VertexRDD(pairs).count()
```

Author: Ankur Dave <[email protected]>

Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits:

641f468 [Ankur Dave] Avoid overflow in VertexRDD.count()
1 parent: 3901245 · commit: 96df929

File tree

1 file changed: +1, −1 lines changed


graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -108,7 +108,7 @@ class VertexRDD[@specialized VD: ClassTag](
 
   /** The number of vertices in the RDD. */
   override def count(): Long = {
-    partitionsRDD.map(_.size).reduce(_ + _)
+    partitionsRDD.map(_.size.toLong).reduce(_ + _)
   }
 
   /**
```
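The root cause is ordinary JVM `Int` arithmetic: each partition's `size` is an `Int`, and summing `Int`s silently wraps past 2^31 − 1, so the fix widens each size to `Long` before the reduce. The wraparound is easy to reproduce outside Spark; here is a minimal sketch in plain Scala, where `partitionSizes` and `OverflowDemo` are hypothetical names standing in for the real partition sizes:

```scala
object OverflowDemo {
  def main(args: Array[String]): Unit = {
    // Five hypothetical partitions of one billion vertices each;
    // the true total (5000000000) exceeds Int.MaxValue (2147483647).
    val partitionSizes: Seq[Int] = Seq.fill(5)(1000000000)

    // Pre-fix behavior: summing Ints wraps around.
    val intCount = partitionSizes.reduce(_ + _)
    println(intCount) // 705032704, not 5000000000

    // Post-fix behavior: widen each size to Long before summing.
    val longCount = partitionSizes.map(_.toLong).reduce(_ + _)
    println(longCount) // 5000000000
  }
}
```

With different partition-size mixes the wrapped `Int` sum can also come out negative, which matches the -10000000 result reported in the commit message.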
