[SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions) #37662
Conversation
Would you like to keep the existing format for the PR description?

I think we should also check if

Fixed

Thanks, done
itholic
left a comment
Looks pretty good otherwise.
>>> df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
>>> df_small = spark.range(3)
>>> df_b = broadcast(df_small)
>>> df.join(df_b, df.value == df_small.id).show()
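Broadcast-join semantics can be sketched in plain Python (a hedged, Spark-free illustration of what the hint requests, not what Spark actually executes): the small side is materialized as a hash table shipped to every task, and the large side probes it. The names `small_ids` and `large_values` are illustrative only.

```python
# Plain-Python sketch of a broadcast (map-side) hash join.
small_ids = set(range(3))       # stands in for df_small = spark.range(3)
large_values = [1, 2, 3, 3, 4]  # stands in for df.value

# Build side is the broadcast (small) relation; probe side is the large one.
joined = [(v, v) for v in large_values if v in small_ids]
print(joined)  # [(1, 1), (2, 2)]
```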
What about using `explain(True)` to explicitly show that broadcast is used as the strategy for the `ResolvedHint`?
>>> df.join(df_b, df.value == df_small.id).explain(True)
== Parsed Logical Plan ==
Join Inner, (cast(value#267 as bigint) = id#269L)
:- LogicalRDD [value#267], false
+- ResolvedHint (strategy=broadcast)
+- Range (0, 3, step=1, splits=Some(16))
== Analyzed Logical Plan ==
value: int, id: bigint
Join Inner, (cast(value#267 as bigint) = id#269L)
:- LogicalRDD [value#267], false
+- ResolvedHint (strategy=broadcast)
+- Range (0, 3, step=1, splits=Some(16))
== Optimized Logical Plan ==
Join Inner, (cast(value#267 as bigint) = id#269L), rightHint=(strategy=broadcast)
:- Filter isnotnull(value#267)
: +- LogicalRDD [value#267], false
+- Range (0, 3, step=1, splits=Some(16))
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [cast(value#267 as bigint)], [id#269L], Inner, BuildRight, false
:- Filter isnotnull(value#267)
: +- Scan ExistingRDD[value#267]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [plan_id=164]
+- Range (0, 3, step=1, splits=16)
The problem is that those expression IDs change from run to run; these docstring examples are executed during tests/validation, so hardcoding them would break the builds.
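One plan-agnostic way around run-dependent IDs is doctest's `ELLIPSIS` directive, which lets `...` match the parts that vary between runs. A minimal, Spark-free sketch (the `fake_plan` function and its plan text are hypothetical stand-ins):

```python
import doctest

def fake_plan():
    """Simulate plan text whose expression IDs vary between runs.

    >>> print(fake_plan())  # doctest: +ELLIPSIS
    Join Inner, (cast(value#... as bigint) = id#...L)
    """
    return "Join Inner, (cast(value#267 as bigint) = id#269L)"

# The ellipses absorb the unstable IDs, so the doctest passes regardless
# of which numbers the current run produces.
result = doctest.testmod()
print(result.failed)  # 0
```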
python/pyspark/sql/functions.py
Outdated
>>> df.agg(count_distinct(df.age, df.name).alias('c')).collect()
[Row(c=2)]
>>> df.agg(count_distinct("age", "name").alias('c')).collect()
[Row(c=2)]
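For reference, `count_distinct` counts distinct tuples of its arguments, skipping rows where any argument is null. A plain-Python sketch of that semantics (the sample data is assumed, not taken from the PR):

```python
# (age, name) pairs; the duplicate and the pair differing in name
# illustrate tuple-wise distinctness.
rows = [(2, "Alice"), (2, "Alice"), (5, "Bob")]

# Drop tuples containing a null (None), then count distinct tuples.
distinct = {t for t in rows if all(v is not None for v in t)}
print(len(distinct))  # 2, matching Row(c=2) in the example above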
Maybe we can just remove the existing example, since we now have a better one?
Removed
python/pyspark/sql/functions.py
Outdated
|  Bob|         5|
+-----+----------+
>>> df.groupby("name").agg(first("age", True)).orderBy("name").show()
Can we add a short description for this example of why `ignorenulls` is set to `True` here?
Added
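The effect of `ignorenulls=True` can be sketched in plain Python (a hedged illustration of the semantics, not Spark's implementation; `first_value` and the sample data are invented for the sketch):

```python
def first_value(values, ignorenulls=False):
    """Return the first element, optionally skipping None (null) entries."""
    for v in values:
        if v is not None or not ignorenulls:
            return v
    return None

ages = [None, 9, 5]             # one group's ages in row order
print(first_value(ages))        # None: the literal first row wins
print(first_value(ages, True))  # 9: nulls are skipped
```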
python/pyspark/sql/functions.py
Outdated
Examples
--------
>>> df.cube("name").agg(grouping_id(), sum("age")).orderBy("name").show()
Here also, can we just remove the existing one, since we now have an improved one?
Done
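For context, `grouping_id()` returns a bit vector whose i-th bit is 1 when the i-th grouping column is aggregated away in the current grouping set. A plain-Python sketch of that encoding (the function name and inputs are illustrative, not Spark API):

```python
def grouping_id(is_grouped_flags):
    # Bit i is 0 if column i participates in this grouping set,
    # 1 if it is rolled up (aggregated away).
    gid = 0
    for is_grouped in is_grouped_flags:
        gid = (gid << 1) | (0 if is_grouped else 1)
    return gid

print(grouping_id([True]))         # 0: row grouped by "name"
print(grouping_id([False]))        # 1: the cube's grand-total row
print(grouping_id([True, False]))  # 1: grouped by col 0 only
```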
python/pyspark/sql/functions.py
Outdated
|  Bob|        5|
+-----+---------+
>>> df.groupby("name").agg(last("age", True)).orderBy("name").show()
Here also, can we add a brief description of why we're setting `ignorenulls` to `True`?
Done
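As with `first`, the `last(..., ignorenulls=True)` semantics can be sketched in plain Python (a hedged illustration only; `last_value` and the data are invented for the sketch):

```python
def last_value(values, ignorenulls=False):
    """Return the last element, optionally skipping None (null) entries."""
    for v in reversed(values):
        if v is not None or not ignorenulls:
            return v
    return None

ages = [9, 5, None]            # one group's ages in row order
print(last_value(ages))        # None: the literal last row wins
print(last_value(ages, True))  # 5: nulls are skipped
```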
@HyukjinKwon FYI
HyukjinKwon
left a comment
LGTM from a cursory look
Can one of the admins verify this patch?
Merged to master.
…ples self-contained (part 5, ~28 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686). In this PR I mainly covered missing parts in the docstrings, adding some more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions

Closes #37786 from khalidmammadov/docstrings_funcs_part_5.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ples self-contained (part 6, ~50 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786). In this PR I mainly covered missing parts in the docstrings, adding some more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37797 from khalidmammadov/docstrings_funcs_part_6.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…ples self-contained (part 7, ~30 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797). In this PR I mainly covered missing parts in the docstrings, adding some more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37850 from khalidmammadov/docstrings_funcs_part_7.

Lead-authored-by: Khalid Mammadov <[email protected]>
Co-authored-by: khalidmammadov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ples self-contained (FINAL)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797, #37850). In this PR I mainly covered missing parts in the docstrings, adding some more examples where needed. I have also made all examples self-explanatory by providing the DataFrame creation command where it was missing, for clarity to the user. This should complete "my take" on `functions.py` docstrings & example improvements.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37988 from khalidmammadov/docstrings_funcs_part_8.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?
Docstring improvements

### Why are the changes needed?
To help users understand the PySpark API

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
./python/run-tests --testnames pyspark.sql.functions
./dev/lint-python
```