Commit 6568d2c

added repartition function to python API.
1 parent a1cd185 commit 6568d2c

File tree

1 file changed: +13 −0 lines


python/pyspark/rdd.py

Lines changed: 13 additions & 0 deletions
```diff
@@ -970,6 +970,19 @@ def keyBy(self, f):
         """
         return self.map(lambda x: (f(x), x))
 
+    def repartition(self, numPartitions):
+        """
+        Return a new RDD that has exactly numPartitions partitions.
+
+        Can increase or decrease the level of parallelism in this RDD. Internally, this uses
+        a shuffle to redistribute data.
+        >>> sc.parallelize([1, 2, 3, 4, 5]).repartition(10).count()
+        If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
+        which can avoid performing a shuffle.
+        """
+        jrdd = self._jrdd.repartition(numPartitions)
+        return RDD(jrdd, self.ctx, self._jrdd_deserializer)
+
     # TODO: `lookup` is disabled because we can't make direct comparisons based
     # on the key; we need to compare the hash of the key to the hash of the
     # keys in the pairs. This could be an expensive operation, since those
```
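Note that, as committed, the doctest line has no expected-output line, so `doctest` would treat the following prose sentence as the expected output and the example would fail; `count()` on a five-element RDD returns 5 regardless of how many partitions it is spread across. Without a Spark cluster at hand, what `repartition` does can be sketched in plain Python (hypothetical helper name, and a simple round-robin deal as a stand-in for the actual shuffle):

```python
def repartition(partitions, num_partitions):
    # Stand-in for RDD.repartition: the "shuffle" collects every element,
    # then deals them out round-robin into exactly num_partitions new partitions.
    elements = [x for part in partitions for x in part]
    new_parts = [[] for _ in range(num_partitions)]
    for i, x in enumerate(elements):
        new_parts[i % num_partitions].append(x)
    return new_parts

# Mirrors the doctest: 5 elements repartitioned to 10 partitions still count 5.
parts = repartition([[1, 2, 3], [4, 5]], 10)
count = sum(len(p) for p in parts)
```

Here `len(parts)` is 10 and `count` is 5, with some partitions left empty — increasing partition count never changes the element count, which is why the doctest's real output would be `5`.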

0 commit comments
