
Conversation

@patrick-nicholson

What changes were proposed in this pull request?

In my experience, pushing pandas.DataFrames into pyspark.DataFrames very quickly runs up against size issues. These can usually be remedied by changing configuration parameters (e.g., spark.rpc.message.maxSize), but it is much more convenient to change the level of parallelization used during RDD creation. This option is already available in sparkContext.parallelize; this pull request exposes it to sparkSession.createDataFrame.
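For illustration, usage with this patch would look roughly like the sketch below (spark is an existing SparkSession; the DataFrame contents and the slice count are placeholders):

import pandas as pd

pandas_df = pd.DataFrame({"x": range(1000000), "y": 1.0})

# With the proposed keyword, the level of parallelization is set at creation
# time instead of tuning spark.rpc.message.maxSize or parallelizing by hand.
df = spark.createDataFrame(pandas_df, numSlices=100)
df.rdd.getNumPartitions()  # 100 under the proposed change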

How was this patch tested?

I have been using a patch implementing this change for a while. I'm only exposing a keyword argument used by an underlying function to the user.

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented May 9, 2017

Test build #76698 has finished for PR 17926 at commit c9a6348.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 9, 2017

Test build #76700 has finished for PR 17926 at commit 6ef9fdd.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 9, 2017

Test build #76701 has finished for PR 17926 at commit 4a9d58d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

This seems to add functionality rather than being a trivial fix. I think we need a JIRA.

@HyukjinKwon
Member

I think this is a rather niche case and we can work around it by parallelizing outside:

>>> df = spark.createDataFrame(spark.sparkContext.parallelize([[1],[2],[3],[4],[5]], numSlices=5))
>>> df.rdd.getNumPartitions()
5

Also, this looks like it only applies when the data is not an RDD. I think this is confusing if a user sets the option and it does not take effect in some cases, unless the user reads the documentation.

@srowen
Member

srowen commented May 10, 2017

Does this cause any incompatibility with existing code?

@HyukjinKwon
Member

I don't think so (this is Python ...), for both positional and keyword arguments. (If the new numSlices were added in the middle of the argument list it would break positional calls, but this one adds it at the end.)
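To make that concrete, a quick sketch (assuming the current 2.x signature createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True); the values are placeholders):

# Existing positional calls keep their meaning because the new argument is appended last.
spark.createDataFrame(data, schema, 0.5)           # 0.5 still binds to samplingRatio
spark.createDataFrame(data, schema, numSlices=8)   # the new option is opt-in by keyword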

@patrick-nicholson
Author

patrick-nicholson commented May 10, 2017

This seems to add functionality rather than being a trivial fix. I think we need a JIRA.

It's up to you. All I'm doing is passing a keyword argument from one preexisting public method to another. I don't view that as adding functionality, but I am not the arbiter of such things.

I think this is a rather niche case and we can work around it by parallelizing outside.

It has been a rather common case for me, since I often work with pandas.DataFrames of millions of rows and many columns of mixed types (where numeric columns are implicitly numpy types rather than base Python types). It can be worked around outside by manually performing the steps inside createDataFrame:

df = spark.createDataFrame(
    spark.sparkContext.parallelize(
        [r.tolist() for r in pandas_df.to_records(index=False)], numSlices=5),
    schema=[str(_) for _ in pandas_df.columns])

Again, I don't see the proposed change as adding any functionality, just exposing machinery already in place for distributing Python data to an RDD in a consistent way for convenience.

Also, this looks like it only applies when the data is not an RDD. I think this is confusing if a user sets the option and it does not take effect in some cases, unless the user reads the documentation.

Given that RDD and local data are necessarily different and that createDataFrame already has separate code paths for RDD and local Python data, I don't know how this can be avoided.

@HyukjinKwon
Member

No, it is not virtually the same before and after (and we would also need a regression test), so it needs a JIRA - see http://spark.apache.org/contributing.html. Adding a parameter to createDataFrame adds functionality to this API that did not exist before.

As you said, this can be done in a single line like that; you could just make a wrapper function for it on the application side in a few lines:

def createDataFrame(data, numSlices, **kwargs):
    return spark.createDataFrame(
        spark.sparkContext.parallelize(data, numSlices=numSlices), **kwargs)
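
For instance, calling that wrapper might look like this (the sample data, slice count, and column names are placeholders):

df = createDataFrame([(1, "a"), (2, "b"), (3, "c")], numSlices=3, schema=["id", "label"])
df.rdd.getNumPartitions()  # 3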

I am not sure it is worth adding this parameter. It looks like there is potential for confusion among users, and the workaround is so easy.

@gatorsmile
Member

How about adding this workaround to the function description of createDataFrame for now? In the future, we can change the interface if more people need this.

Thanks!
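
One possible shape for such a note in the createDataFrame docstring (illustrative wording only, not a final proposal):

.. note:: To control the number of partitions of the resulting DataFrame
    when the data is local, parallelize it first, e.g.
    spark.createDataFrame(sc.parallelize(data, numSlices=10), schema)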

@felixcheung
Member

FYI we added numPartitions in R - but that's primarily because we don't have sc.parallelize
https://github.com/apache/spark/blob/master/R/pkg/R/SQLContext.R#L190
