[SPARK-40922][PYTHON] Document multiple path support in `pyspark.pandas.read_csv` #38399

soxofaan · 2022-10-26T13:58:51Z

What changes were proposed in this pull request?

As discussed in https://issues.apache.org/jira/browse/SPARK-40922:

The `path` argument of `pyspark.pandas.read_csv(path, ...)` currently has the type annotation `str` and is documented as
  path : str
      The path string storing the CSV file to be read.
The implementation, however, uses `pyspark.sql.DataFrameReader.csv(path, ...)`, which does support multiple paths:
   path : str or list
       string, or list of strings, for input path(s),
       or RDD of Strings storing CSV rows.

This PR updates the type annotation and documentation of the `path` argument of `pyspark.pandas.read_csv`.
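
As an illustration of the behavior this PR documents, the sketch below writes two small CSV files and passes both paths to `pyspark.pandas.read_csv` in a single call. The helper names (`write_sample_csvs`, `read_combined`), the file names, and the column names are made up for this example and are not part of the PR. `read_combined` assumes a working Spark installation, so `pyspark.pandas` is imported lazily inside it.

```python
import csv
from pathlib import Path
from typing import List


def write_sample_csvs(directory: Path) -> List[str]:
    """Write two small CSV files sharing one header and return their paths."""
    # Hypothetical sample data, used only for this illustration.
    rows_per_file = {
        "data-01.csv": [("a", 1), ("b", 2)],
        "data-02.csv": [("c", 3)],
    }
    paths = []
    for name, rows in rows_per_file.items():
        target = directory / name
        with target.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["key", "value"])
            writer.writerows(rows)
        paths.append(str(target))
    return paths


def read_combined(paths: List[str]):
    """Load several CSV files as one pandas-on-Spark DataFrame."""
    # Imported here so that write_sample_csvs works without Spark installed.
    import pyspark.pandas as ps

    # A list of paths is passed where a single string was documented before.
    return ps.read_csv(paths)
```

With Spark available, `read_combined(write_sample_csvs(Path(tmp_dir)))` would be expected to yield a single DataFrame holding the rows of both files, where `tmp_dir` is any writable directory.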

Why are the changes needed?

Loading multiple CSV files at once is a useful feature and should be documented.

Does this PR introduce any user-facing change?

It documents an existing feature.

How was this patch tested?

No need for tests (so far): only type annotations and docstrings were changed.

AmplabJenkins · 2022-10-26T19:21:31Z

Can one of the admins verify this patch?

HyukjinKwon · 2022-10-27T01:06:31Z

python/pyspark/pandas/namespace.py

-    path : str
-        The path string storing the CSV file to be read.
+    path : str or list
+        path(s) of the CSV file(s) to be read.


Can we add this example to the docstring?

Yeah, let's add at least one example to the docstring.

HyukjinKwon · 2022-10-27T01:06:41Z

cc @itholic

itholic · 2022-10-27T01:49:34Z

Looks good except #38399 (comment)

soxofaan · 2022-10-27T07:36:56Z

I added an example.

FYI: while looking around in the code, I suspect that support for multiple paths is also present in other `read_*` functions (like `read_orc`, `read_json`, and probably some others), but I haven't experimented with that yet.

HyukjinKwon · 2022-10-27T10:53:43Z

Merged to master.

…as.read_csv`

### What changes were proposed in this pull request?

as discussed in https://issues.apache.org/jira/browse/SPARK-40922:

> The path argument of `pyspark.pandas.read_csv(path, ...)` currently has type annotation `str` and is documented as
>
>   path : str
>       The path string storing the CSV file to be read.
>
> The implementation however uses `pyspark.sql.DataFrameReader.csv(path, ...)` which does support multiple paths:
>
>   path : str or list
>       string, or list of strings, for input path(s),
>       or RDD of Strings storing CSV rows.

This PR updates the type annotation and documentation of `path` argument of `pyspark.pandas.read_csv`

### Why are the changes needed?

Loading multiple CSV files at once is a useful feature to have and should be documented

### Does this PR introduce _any_ user-facing change?

it documents and existing feature

### How was this patch tested?

No need for tests (so far): only type annotations and docblocks were changed

Closes apache#38399 from soxofaan/SPARK-40922-pyspark-pandas-read-csv-multiple-paths.

Authored-by: Stefaan Lippens <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

SPARK-40922 document multiple path support in pyspark.pandas.read_csv

d8911b4

github-actions bot added CORE PANDAS API ON SPARK PYTHON labels Oct 26, 2022

HyukjinKwon changed the title ~~[SPARK-40922][PYTHON] document multiple path support in pyspark.pandas.read_csv~~ [SPARK-40922][PYTHON] Document multiple path support in pyspark.pandas.read_csv Oct 27, 2022

HyukjinKwon reviewed Oct 27, 2022

fixup! SPARK-40922 document multiple path support in `pyspark.pandas.read_csv`

e09bd33

HyukjinKwon approved these changes Oct 27, 2022

HyukjinKwon closed this in 5b5eb23 Oct 27, 2022
