[MINOR][SQL][PYSPARK] Allow user to specify numSlices in SparkSession.createDataFrame #17926
Conversation
ok to test
Test build #76698 has finished for PR 17926 at commit
Test build #76700 has finished for PR 17926 at commit
Test build #76701 has finished for PR 17926 at commit
It seems this adds functionality and is not a trivial fix. I think we need a JIRA.
I think this is a rather niche case and we can work around it by parallelizing outside:
>>> df = spark.createDataFrame(spark.sparkContext.parallelize([[1],[2],[3],[4],[5]], numSlices=5))
>>> df.rdd.getNumPartitions()
5
Also, this appears to apply only when the data is not already an RDD. I think it would be confusing if a user sets this option and it has no effect in some cases, unless the user reads the documentation.
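For reference, a small runnable sketch of the distinction above; the local[4] master and the x column name are only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Data passed as a pre-partitioned RDD: createDataFrame keeps that partitioning,
# so a numSlices option on createDataFrame would have nothing to control here.
rdd = spark.sparkContext.parallelize([[1], [2], [3], [4], [5]], numSlices=5)
print(spark.createDataFrame(rdd, schema=["x"]).rdd.getNumPartitions())  # 5

# Data passed as a local list: the number of partitions falls back to the default
# parallelism, which is the case the proposed numSlices argument would control.
print(spark.createDataFrame([[1], [2], [3], [4], [5]], schema=["x"]).rdd.getNumPartitions())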
Does this cause any incompatibility with existing code?
I don't think so (this is Python ...) for both positional and keyword arguments. (If the new
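A brief sketch of the compatibility point, using a simplified signature rather than the real one:

# Hypothetical before/after: appending a defaulted keyword argument keeps every
# existing positional and keyword call site working unchanged.
def createDataFrame_old(data, schema=None, samplingRatio=None, verifySchema=True):
    ...

def createDataFrame_new(data, schema=None, samplingRatio=None, verifySchema=True, numSlices=None):
    ...

# A call that was valid before, e.g. createDataFrame_old(rows, schema=["x"]),
# remains valid afterwards; only callers who pass numSlices opt in to the new behavior.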
It's up to you. All I'm doing is passing a keyword argument from one preexisting public method to another. I don't view that as adding functionality, but I am not the arbiter of such things.
It has been a rather common case for me since I'm often working with pandas.DataFrames. Again, I don't see the proposed change as adding any functionality, just exposing machinery already in place for distributing Python data to an RDD.
Given that
No, it is not virtually the same before/after (and we also need a regression test). So it needs a JIRA - see http://spark.apache.org/contributing.html. Adding a parameter to createDataFrame ... As you said, this can be done in a single line, so you could just make a wrapper function for it on the application side in a few lines:
def createDataFrame(data, numSlices, **kwargs):
    return spark.createDataFrame(
        spark.sparkContext.parallelize(data, numSlices=numSlices), **kwargs)
I am not sure it is worth adding this parameter. There is potential for confusing users, and the workaround looks easy.
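For what it's worth, a quick usage sketch of that wrapper; the data and the schema column x are made up, and spark is an existing SparkSession:

df = createDataFrame([[1], [2], [3], [4], [5]], numSlices=5, schema=["x"])
print(df.rdd.getNumPartitions())  # 5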
How about adding this workaround to the function description of createDataFrame? Thanks!
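A rough sketch of what such a docstring note might look like (the wording is illustrative, not the text that was actually added):

# Possible addition to the createDataFrame docstring:
#
#     .. note:: The number of partitions is not exposed directly here. To control
#         it, parallelize the data first, e.g.
#         ``spark.createDataFrame(sc.parallelize(data, numSlices=10))``.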
FYI we added
What changes were proposed in this pull request?
In my experience, pushing pandas.DataFrames to pyspark.DataFrames will very quickly run up against size issues. These can usually be remedied by changing configuration parameters (e.g., spark.rpc.message.maxSize), but it is much more convenient to change the level of parallelization used during RDD creation. This option is available in sparkContext.parallelize. This pull request exposes it to sparkSession.createDataFrame.
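A minimal sketch of the two remedies mentioned above; the size limit, column name, and slice count are made-up values for illustration:

from pyspark.sql import SparkSession
import pandas as pd

# Remedy 1: raise the RPC message size limit (value in MiB) before the session starts.
spark = (SparkSession.builder
         .config("spark.rpc.message.maxSize", "512")
         .getOrCreate())

# Remedy 2: control the parallelization yourself by building the RDD explicitly.
pdf = pd.DataFrame({"x": range(100000)})
rows = list(pdf.itertuples(index=False, name=None))
rdd = spark.sparkContext.parallelize(rows, numSlices=64)
df = spark.createDataFrame(rdd, schema=list(pdf.columns))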
How was this patch tested?
I have been using a patch implementing this change for a while. It only exposes to the user a keyword argument already accepted by an underlying function.