[SPARK-13174][SparkR] Add read.csv and write.csv for SparkR #11457
Conversation
Test build #52282 has finished for PR 11457 at commit

Test build #52289 has finished for PR 11457 at commit
  namesOfMasked <- c("describe", "cov", "filter", "lag", "na.omit", "predict", "sd", "var",
                     "colnames", "colnames<-", "intersect", "rank", "rbind", "sample", "subset",
-                    "summary", "transform", "drop")
+                    "summary", "transform", "drop", "read.csv", "write.csv")
@felixcheung @sun-rui (First of all, sorry, I am not very fluent in R.)
I am not sure why it says these functions are being masked. I added them here anyway and all tests pass, but I am still wondering why this happens and whether it is correct. When I change the function names to something else the warning goes away, yet there appear to be no such functions to mask.
It looks like it masks the functions called read.csv and write.csv. The console output was:
The following object is masked from ‘package:testthat’:
    describe
The following objects are masked from ‘package:stats’:
    cov, filter, lag, na.omit, predict, sd, var
The following objects are masked from ‘package:utils’:
    read.csv, write.csv
The following objects are masked from ‘package:base’:
    colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
    summary, transform
Would you please give me some advice on this?
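To illustrate what the masking warning above means, here is a rough Python analogy (hypothetical names, not SparkR code): when a newly defined object has the same name as one from an already-imported namespace, the new definition shadows the old one, and the original stays reachable only through its qualified name, much like utils::read.csv in R.

```python
import csv
from csv import reader  # brings the standard reader into this namespace

# Redefining the same name "masks" the imported one, much like attaching
# SparkR masks utils::read.csv in an R session.
def reader(lines):
    return ["sparkr-style:" + line for line in lines]

masked_call = reader(["a,b"])                # the new definition wins
qualified_call = list(csv.reader(["a,b"]))   # original still reachable via csv.reader
```

The masked function is not gone; callers simply have to qualify it, which is exactly the inconvenience discussed below for base R users.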
There are existing read.csv()/read.csv2() and write.csv()/write.csv2() functions in base R. It would be great if we could implement these functions in SparkR with the same signatures as in base R, so that users can call them the same way in both SparkR and the base packages. The options supported in the base R signatures of these functions also seem to be supported by spark-csv.
@sun-rui Thanks!
To oversimplify: the new function masks the other implementation when the name conflicts but the parameter signature is not exactly the same. So please see whether you can change the signature to match, as sun-rui suggests.
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
@felixcheung Actually, I have been looking at that link and working on this; however, I realised that we cannot use exactly the same parameters, because spark-csv does not support all of them. I tried to add some of the possible parameters (locally), but it ended up as an incomplete version.
This leads me to think that this PR might have to be merged first and the signatures dealt with later (also partly because the CSV options are not confirmed yet and not fully merged yet; see SPARK-12420).
Otherwise, would you maybe create another PR based on this one? I think it would take me quite a while to figure out, since this is only my second piece of R code (sorry).
This is likely an issue, though - base R's read.csv is very commonly used, and a name conflict like the one detected here makes the base version inaccessible whenever the SparkR package is loaded.
I can probably help sort out the right signature - I'll try to check this out in the next few days.
@felixcheung Thanks.
I think it's better to match the arguments of read.csv and not mask the local read.csv. One thing to note is that we don't need to support all of the options; if some of the flags we don't support are set, we can throw a warning or an error.
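The pattern described here - keep the full reference signature, but warn or error when an unsupported option is actually set - can be sketched in Python (hypothetical option names; this is not the SparkR implementation):

```python
import warnings

# Options assumed, for illustration only, to exist in the reference
# signature but not be supported by the backend.
_UNSUPPORTED = {"colClasses", "fill"}

def read_csv(path, header=True, sep=",", **options):
    """Mirror a richer reference signature, warning on unsupported flags."""
    bad = _UNSUPPORTED & options.keys()
    if bad:
        warnings.warn("ignoring unsupported options: %s" % sorted(bad))
    kept = {k: v for k, v in options.items() if k not in _UNSUPPORTED}
    # A real implementation would hand `kept` to the CSV backend;
    # here we just return the resolved configuration.
    return {"path": path, "header": header, "sep": sep, **kept}
```

With this shape, a call such as read_csv("data.csv", colClasses="character") warns and drops the flag instead of silently accepting it, so callers using the base signature still get predictable behaviour.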
@felixcheung Could you maybe give me the link to the PR if you open one? I will close this one after checking that another PR is open.
Test build #52291 has finished for PR 11457 at commit
I am closing this, but it would be great to let me know if you have a new PR for this.
What changes were proposed in this pull request?
This PR adds read.csv and write.csv for SparkR. They were already added for Scala, Java and Python. This PR is basically similar to #10348.
How was this patch tested?
Running R/run_tests.sh; Jenkins will also test this.