
Conversation

@khalidmammadov (Contributor) commented Aug 25, 2022

What changes were proposed in this pull request?

Docstring improvements

Why are the changes needed?

To help users understand the PySpark API

Does this PR introduce any user-facing change?

Yes, documentation

How was this patch tested?

./python/run-tests --testnames pyspark.sql.functions
./dev/lint-python

@itholic (Contributor) commented Aug 26, 2022

Would you like to keep the existing format for the PR description?

@itholic (Contributor) commented Aug 26, 2022

I think we should also check that `dev/lint-python` passes, not only `python/run-tests --testnames pyspark.sql.functions`.

@khalidmammadov (Contributor, Author) replied:

> Would you like to keep the existing format for the PR description?

Fixed

@khalidmammadov (Contributor, Author) replied:

> I think we should also check that `dev/lint-python` passes, not only `python/run-tests --testnames pyspark.sql.functions`.

Thanks, done

@itholic (Contributor) left a review:

Looks pretty good otherwise.

>>> df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
>>> df_small = spark.range(3)
>>> df_b = broadcast(df_small)
>>> df.join(df_b, df.value == df_small.id).show()
itholic (Contributor) commented:

What about using `explain(True)` to explicitly show that broadcast is used as the strategy in `ResolvedHint`?

>>> df.join(df_b, df.value == df_small.id).explain(True)
== Parsed Logical Plan ==
Join Inner, (cast(value#267 as bigint) = id#269L)
:- LogicalRDD [value#267], false
+- ResolvedHint (strategy=broadcast)
   +- Range (0, 3, step=1, splits=Some(16))

== Analyzed Logical Plan ==
value: int, id: bigint
Join Inner, (cast(value#267 as bigint) = id#269L)
:- LogicalRDD [value#267], false
+- ResolvedHint (strategy=broadcast)
   +- Range (0, 3, step=1, splits=Some(16))

== Optimized Logical Plan ==
Join Inner, (cast(value#267 as bigint) = id#269L), rightHint=(strategy=broadcast)
:- Filter isnotnull(value#267)
:  +- LogicalRDD [value#267], false
+- Range (0, 3, step=1, splits=Some(16))

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [cast(value#267 as bigint)], [id#269L], Inner, BuildRight, false
   :- Filter isnotnull(value#267)
   :  +- Scan ExistingRDD[value#267]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [plan_id=164]
      +- Range (0, 3, step=1, splits=16)

khalidmammadov (Contributor, Author) replied:

The problem is that those IDs change from run to run; these docstring examples are executed during tests/validation, so hardcoding the IDs would break the builds.
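One common workaround for that (a sketch, assuming a stable substring match is enough for the check) is to capture the `explain(True)` output and assert on the hint text rather than on the full plan, since expression IDs like `value#267` vary between runs. The `plan_contains` helper below is hypothetical, not part of PySpark:

```python
import contextlib
import io

def plan_contains(df, fragment: str) -> bool:
    """Capture the output of df.explain(True) and check for a substring.

    Matching a stable fragment such as "ResolvedHint (strategy=broadcast)"
    avoids brittle comparisons against run-dependent expression IDs
    like value#267.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain(True)  # explain() prints the plan to stdout
    return fragment in buf.getvalue()
```

With a real DataFrame this would read e.g. `plan_contains(df.join(df_b, df.value == df_small.id), "strategy=broadcast")`.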

Comment on lines 2549 to 2553
>>> df.agg(count_distinct(df.age, df.name).alias('c')).collect()
[Row(c=2)]
>>> df.agg(count_distinct("age", "name").alias('c')).collect()
[Row(c=2)]
itholic (Contributor) commented:

Maybe we can just remove the existing example, since we now have a better one?

khalidmammadov (Contributor, Author) replied:

Removed
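For reference, the semantics the example documents can be sketched in plain Python (an illustration only, not Spark's implementation): `count(DISTINCT a, b)` counts distinct column combinations, and a row is skipped when any of the listed columns is NULL.

```python
def count_distinct_rows(rows):
    """Count distinct tuples, excluding any tuple containing a None,
    mirroring SQL's COUNT(DISTINCT ...) null handling."""
    return len({t for t in rows if all(v is not None for v in t)})

# Two distinct non-null combinations; the (None, "Eve") row is skipped.
rows = [(2, "Alice"), (2, "Alice"), (5, "Bob"), (None, "Eve")]
print(count_distinct_rows(rows))  # 2
```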

| Bob| 5|
+-----+----------+
>>> df.groupby("name").agg(first("age", True)).orderBy("name").show()
itholic (Contributor) commented:

Can we add a short description to this example explaining why `ignorenulls` is set to `True` here?

khalidmammadov (Contributor, Author) replied:

Added
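The behavior that description needs to convey can be sketched in plain Python (illustrative only, not Spark's implementation): with `ignorenulls=True`, `first` returns the first non-null value in each group instead of a possibly-null first value. Note that without an explicit ordering, `first` in Spark is non-deterministic.

```python
def first_value(values, ignorenulls=False):
    """Return the first value, or the first non-null value when
    ignorenulls is True; None if no qualifying value exists."""
    for v in values:
        if v is not None or not ignorenulls:
            return v
    return None

print(first_value([None, 5, 3]))        # None: the literal first value
print(first_value([None, 5, 3], True))  # 5: the first non-null value
```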

Examples
--------
>>> df.cube("name").agg(grouping_id(), sum("age")).orderBy("name").show()
itholic (Contributor) commented:

Here too, can we just remove the existing example, since we now have an improved one?

khalidmammadov (Contributor, Author) replied:

Done
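As background for the `grouping_id` example: per the Spark documentation, `grouping_id()` packs one bit per grouping column, `(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)`, where `grouping(c)` is 1 when the column is aggregated away in the current grouping set. A small sketch of that bit-packing (not Spark code):

```python
def grouping_id_bits(flags):
    """Combine per-column grouping flags (1 = column aggregated away in
    this grouping set, 0 = column present) into a single grouping id."""
    gid = 0
    for flag in flags:
        gid = (gid << 1) | flag
    return gid

# For cube("name") over a single grouping column:
print(grouping_id_bits([0]))  # 0: row grouped by name
print(grouping_id_bits([1]))  # 1: name aggregated away (grand-total row)
```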

| Bob| 5|
+-----+---------+
>>> df.groupby("name").agg(last("age", True)).orderBy("name").show()
itholic (Contributor) commented:

Here too, can we add a short description of why `ignorenulls` is set to `True`?

khalidmammadov (Contributor, Author) replied:

Done
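As with `first`, the behavior being described can be sketched in plain Python (illustrative only, not Spark's implementation): `last` with `ignorenulls=True` returns the last non-null value per group, and without an ordering the result is non-deterministic in Spark.

```python
def last_value(values, ignorenulls=False):
    """Return the last value, or the last non-null value when
    ignorenulls is True; None if no qualifying value exists."""
    for v in reversed(list(values)):
        if v is not None or not ignorenulls:
            return v
    return None

print(last_value([5, 3, None]))        # None: the literal last value
print(last_value([5, 3, None], True))  # 3: the last non-null value
```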

@khalidmammadov (Contributor, Author) commented:

@HyukjinKwon FYI

@HyukjinKwon (Member) left a review:

LGTM from a cursory look

@AmplabJenkins commented:

Can one of the admins verify this patch?

@HyukjinKwon (Member) commented:

Merged to master.

HyukjinKwon pushed a commit that referenced this pull request Sep 5, 2022
…ples self-contained (part 5, ~28 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions

Closes #37786 from khalidmammadov/docstrings_funcs_part_5.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
srowen pushed a commit that referenced this pull request Sep 8, 2022
…ples self-contained (part 6, ~50 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

### Why are the changes needed?

To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?

Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37797 from khalidmammadov/docstrings_funcs_part_6.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Sep 19, 2022
…ples self-contained (part 7, ~30 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37850 from khalidmammadov/docstrings_funcs_part_7.

Lead-authored-by: Khalid Mammadov <[email protected]>
Co-authored-by: khalidmammadov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Sep 20, 2022
…ples self-contained (part 7, ~30 functions)

srowen pushed a commit that referenced this pull request Sep 25, 2022
…ples self-contained (FINAL)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797, #37850)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

I have also made all examples self-explanatory by providing the DataFrame creation commands where they were missing, for clarity.

This should complete "my take" on `functions.py` docstring and example improvements.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37988 from khalidmammadov/docstrings_funcs_part_8.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Sep 25, 2022
…ples self-contained (FINAL)

a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 30, 2022
…ples self-contained (FINAL)
