@soxofaan
Contributor

What changes were proposed in this pull request?

As discussed in https://issues.apache.org/jira/browse/SPARK-40922:

The path argument of pyspark.pandas.read_csv(path, ...) currently has type annotation str and is documented as

  path : str
      The path string storing the CSV file to be read.

The implementation however uses pyspark.sql.DataFrameReader.csv(path, ...) which does support multiple paths:

   path : str or list
       string, or list of strings, for input path(s),
       or RDD of Strings storing CSV rows.

This PR updates the type annotation and documentation of the path argument of pyspark.pandas.read_csv.
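
As a rough illustration of the annotation change described above, the sketch below widens a `path` parameter from `str` to `Union[str, List[str]]`. Note that `read_csv_stub` is a hypothetical stand-in written for this example, not the actual pyspark.pandas source; it only shows how one signature can accept either a single path or a list of paths.

```python
from typing import List, Union


def read_csv_stub(path: Union[str, List[str]]) -> List[str]:
    """Hypothetical stand-in for a reader whose ``path`` accepts one or many paths.

    Parameters
    ----------
    path : str or list
        Path(s) of the CSV file(s) to be read.
    """
    # Normalize to a list so downstream code can treat both cases uniformly.
    return [path] if isinstance(path, str) else list(path)
```

Calling `read_csv_stub("a.csv")` and `read_csv_stub(["a.csv", "b.csv"])` both type-check under the widened annotation, which is the point of the change.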

Why are the changes needed?

Loading multiple CSV files at once is a useful feature and should be documented.

Does this PR introduce any user-facing change?

It documents an existing feature.

How was this patch tested?

No need for tests (so far): only type annotations and docstrings were changed.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon changed the title from "[SPARK-40922][PYTHON] document multiple path support in pyspark.pandas.read_csv" to "[SPARK-40922][PYTHON] Document multiple path support in pyspark.pandas.read_csv" on Oct 27, 2022

-    path : str
-        The path string storing the CSV file to be read.
+    path : str or list
+        path(s) of the CSV file(s) to be read.

Member

Can we add this example to the docstring?

Contributor

Yeah, let's add at least one example to the docstring.

@HyukjinKwon
Member

cc @itholic

@itholic
Contributor

itholic commented Oct 27, 2022

Looks good except #38399 (comment)

@soxofaan
Contributor Author

I added an example.

FYI: while looking around in the code, I suspect that support for multiple paths is also present in other read_* functions (like read_orc, read_json and probably some others too), but I haven't experimented with that yet.
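
For readers who want to try the multiple-path behaviour discussed in this PR, here is a minimal sketch. The helper names `write_sample_csvs` and `read_all`, the file names, and the sample rows are all invented for this illustration; only `pyspark.pandas.read_csv` with a list of paths comes from the PR itself. The Spark import is done lazily inside `read_all`, so the file-writing helper can be used without a Spark installation.

```python
import os
import tempfile
from typing import List


def write_sample_csvs(directory: str) -> List[str]:
    """Write two small CSV files with the same header and return their paths."""
    paths = []
    for name, rows in [("first.csv", ["1,a", "2,b"]), ("second.csv", ["3,c"])]:
        file_path = os.path.join(directory, name)
        with open(file_path, "w") as f:
            f.write("\n".join(["id,label"] + rows) + "\n")
        paths.append(file_path)
    return paths


def read_all(paths: List[str]):
    """Read several CSV files into one pandas-on-Spark DataFrame."""
    # Imported lazily: this line needs pyspark (and a working Spark setup).
    import pyspark.pandas as ps

    # Passing a list of paths is the behaviour this PR documents.
    return ps.read_csv(paths)
```

With pyspark installed, `read_all(write_sample_csvs(tempfile.mkdtemp()))` should give a single DataFrame containing the rows of both files; the row order across files is not something this sketch relies on.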

@HyukjinKwon
Member

Merged to master.

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…as.read_csv`

### What changes were proposed in this pull request?

as discussed in https://issues.apache.org/jira/browse/SPARK-40922:

> The path argument of `pyspark.pandas.read_csv(path, ...)` currently has type annotation `str` and is documented as
>
>       path : str
>           The path string storing the CSV file to be read.
>
> The implementation however uses `pyspark.sql.DataFrameReader.csv(path, ...)` which does support multiple paths:
>
>        path : str or list
>            string, or list of strings, for input path(s),
>            or RDD of Strings storing CSV rows.
>

This PR updates the type annotation and documentation of the `path` argument of `pyspark.pandas.read_csv`

### Why are the changes needed?

Loading multiple CSV files at once is a useful feature to have and should be documented

### Does this PR introduce _any_ user-facing change?
it documents an existing feature

### How was this patch tested?
No need for tests (so far): only type annotations and docblocks were changed

Closes apache#38399 from soxofaan/SPARK-40922-pyspark-pandas-read-csv-multiple-paths.

Authored-by: Stefaan Lippens <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>