
Conversation

@khalidmammadov (Contributor) commented Aug 25, 2022

What changes were proposed in this pull request?

Docstring improvements

Why are the changes needed?

To help users understand the PySpark API

Does this PR introduce any user-facing change?

Yes, documentation

How was this patch tested?

./python/run-tests --testnames pyspark.sql.functions
./dev/lint-python

@itholic (Contributor) commented Aug 26, 2022

Would you like to keep the existing format for the PR description?

@itholic (Contributor) commented Aug 26, 2022

I think we should also check that `dev/lint-python` passes, not only `python/run-tests --testnames pyspark.sql.functions`.

@khalidmammadov (Contributor, Author) replied:

> Would you like to keep the existing format for the PR description?

Fixed

@khalidmammadov (Contributor, Author) replied:

> I think we should also check that `dev/lint-python` passes, not only `python/run-tests --testnames pyspark.sql.functions`.

Thanks, done

@itholic (Contributor) left a review:

Looks pretty good otherwise.

>>> df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
>>> df_small = spark.range(3)
>>> df_b = broadcast(df_small)
>>> df.join(df_b, df.value == df_small.id).show()
itholic (Contributor) commented:

What about using `explain(True)` to explicitly show that broadcast is used as the strategy in `ResolvedHint`?

>>> df.join(df_b, df.value == df_small.id).explain(True)
== Parsed Logical Plan ==
Join Inner, (cast(value#267 as bigint) = id#269L)
:- LogicalRDD [value#267], false
+- ResolvedHint (strategy=broadcast)
   +- Range (0, 3, step=1, splits=Some(16))

== Analyzed Logical Plan ==
value: int, id: bigint
Join Inner, (cast(value#267 as bigint) = id#269L)
:- LogicalRDD [value#267], false
+- ResolvedHint (strategy=broadcast)
   +- Range (0, 3, step=1, splits=Some(16))

== Optimized Logical Plan ==
Join Inner, (cast(value#267 as bigint) = id#269L), rightHint=(strategy=broadcast)
:- Filter isnotnull(value#267)
:  +- LogicalRDD [value#267], false
+- Range (0, 3, step=1, splits=Some(16))

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [cast(value#267 as bigint)], [id#269L], Inner, BuildRight, false
   :- Filter isnotnull(value#267)
   :  +- Scan ExistingRDD[value#267]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [plan_id=164]
      +- Range (0, 3, step=1, splits=16)

khalidmammadov (Contributor, Author) replied:

The problem is that those IDs change from run to run; these docstring examples are executed during tests/validation, so hardcoding the IDs would break the builds.
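One common workaround for that (a sketch, assuming a stable substring match is enough for the check) is to capture the `explain(True)` output and assert on the hint text rather than on the full plan, since expression IDs like `value#267` vary between runs. The `plan_contains` helper below is hypothetical, not part of PySpark:

```python
import contextlib
import io

def plan_contains(df, fragment: str) -> bool:
    """Capture the output of df.explain(True) and check for a substring.

    Matching a stable fragment such as "ResolvedHint (strategy=broadcast)"
    avoids brittle comparisons against run-dependent expression IDs
    like value#267.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain(True)  # explain() prints the plan to stdout
    return fragment in buf.getvalue()
```

With a real DataFrame this would read e.g. `plan_contains(df.join(df_b, df.value == df_small.id), "strategy=broadcast")`.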

Comment on lines 2549 to 2553
>>> df.agg(count_distinct(df.age, df.name).alias('c')).collect()
[Row(c=2)]
>>> df.agg(count_distinct("age", "name").alias('c')).collect()
[Row(c=2)]
itholic (Contributor) commented:

Maybe we can just remove the existing example, since we now have a better one?

khalidmammadov (Contributor, Author) replied:

Removed
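For reference, the semantics the example documents can be sketched in plain Python (an illustration only, not Spark's implementation): `count(DISTINCT a, b)` counts distinct column combinations, and a row is skipped when any of the listed columns is NULL.

```python
def count_distinct_rows(rows):
    """Count distinct tuples, excluding any tuple containing a None,
    mirroring SQL's COUNT(DISTINCT ...) null handling."""
    return len({t for t in rows if all(v is not None for v in t)})

# Two distinct non-null combinations; the (None, "Eve") row is skipped.
rows = [(2, "Alice"), (2, "Alice"), (5, "Bob"), (None, "Eve")]
print(count_distinct_rows(rows))  # 2
```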

| Bob| 5|
+-----+----------+
>>> df.groupby("name").agg(first("age", True)).orderBy("name").show()
itholic (Contributor) commented:

Can we add a short description to this example explaining why `ignorenulls` is set to `True` here?

khalidmammadov (Contributor, Author) replied:

Added
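The behavior that description needs to convey can be sketched in plain Python (illustrative only, not Spark's implementation): with `ignorenulls=True`, `first` returns the first non-null value in each group instead of a possibly-null first value. Note that without an explicit ordering, `first` in Spark is non-deterministic.

```python
def first_value(values, ignorenulls=False):
    """Return the first value, or the first non-null value when
    ignorenulls is True; None if no qualifying value exists."""
    for v in values:
        if v is not None or not ignorenulls:
            return v
    return None

print(first_value([None, 5, 3]))        # None: the literal first value
print(first_value([None, 5, 3], True))  # 5: the first non-null value
```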

Examples
--------
>>> df.cube("name").agg(grouping_id(), sum("age")).orderBy("name").show()
itholic (Contributor) commented:

Here too, can we just remove the existing example, since we now have an improved one?

khalidmammadov (Contributor, Author) replied:

Done
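As background for the `grouping_id` example: per the Spark documentation, `grouping_id()` packs one bit per grouping column, `(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)`, where `grouping(c)` is 1 when the column is aggregated away in the current grouping set. A small sketch of that bit-packing (not Spark code):

```python
def grouping_id_bits(flags):
    """Combine per-column grouping flags (1 = column aggregated away in
    this grouping set, 0 = column present) into a single grouping id."""
    gid = 0
    for flag in flags:
        gid = (gid << 1) | flag
    return gid

# For cube("name") over a single grouping column:
print(grouping_id_bits([0]))  # 0: row grouped by name
print(grouping_id_bits([1]))  # 1: name aggregated away (grand-total row)
```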

| Bob| 5|
+-----+---------+
>>> df.groupby("name").agg(last("age", True)).orderBy("name").show()
itholic (Contributor) commented:

Here too, can we add a short description of why `ignorenulls` is set to `True`?

khalidmammadov (Contributor, Author) replied:

Done
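As with `first`, the behavior being described can be sketched in plain Python (illustrative only, not Spark's implementation): `last` with `ignorenulls=True` returns the last non-null value per group, and without an ordering the result is non-deterministic in Spark.

```python
def last_value(values, ignorenulls=False):
    """Return the last value, or the last non-null value when
    ignorenulls is True; None if no qualifying value exists."""
    for v in reversed(list(values)):
        if v is not None or not ignorenulls:
            return v
    return None

print(last_value([5, 3, None]))        # None: the literal last value
print(last_value([5, 3, None], True))  # 3: the last non-null value
```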

@khalidmammadov (Contributor, Author) commented:

@HyukjinKwon FYI

@HyukjinKwon (Member) left a review:

LGTM from a cursory look

@AmplabJenkins commented:

Can one of the admins verify this patch?

@HyukjinKwon (Member) commented:

Merged to master.

HyukjinKwon pushed a commit that referenced this pull request Sep 5, 2022
…ples self-contained (part 5, ~28 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions

Closes #37786 from khalidmammadov/docstrings_funcs_part_5.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
srowen pushed a commit that referenced this pull request Sep 8, 2022
…ples self-contained (part 6, ~50 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

### Why are the changes needed?

To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?

Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37797 from khalidmammadov/docstrings_funcs_part_6.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Sep 19, 2022
…ples self-contained (part 7, ~30 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37850 from khalidmammadov/docstrings_funcs_part_7.

Lead-authored-by: Khalid Mammadov <[email protected]>
Co-authored-by: khalidmammadov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Sep 20, 2022
…ples self-contained (part 7, ~30 functions)

srowen pushed a commit that referenced this pull request Sep 25, 2022
…ples self-contained (FINAL)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797, #37850)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

I have also made all examples self-explanatory by providing the DataFrame creation commands where they were missing, for clarity.

This should complete "my take" on `functions.py` docstring and example improvements.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37988 from khalidmammadov/docstrings_funcs_part_8.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Sep 25, 2022
…ples self-contained (FINAL)

a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 30, 2022
…ples self-contained (FINAL)
