Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR adds read.csv and write.csv for SparkR; they have already been added for Scala, Java, and Python.

This PR is essentially similar to #10348.
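For illustration, the intended usage would be something along these lines (a hypothetical sketch in the SparkR 1.x style, where a SQLContext is initialized first; the exact signatures are what this PR is working out and may differ):

```r
# Hypothetical usage sketch of the proposed SparkR API (1.x era).
# Function signatures here are illustrative assumptions, not this PR's code.
library(SparkR)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

df <- read.csv(sqlContext, "cars.csv")   # returns a SparkR DataFrame
write.csv(df, "cars-out")                # writes the DataFrame back out as CSV
```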

How was this patch tested?

Running R/run_tests.sh will test this, and Jenkins will run it as well.

@SparkQA

SparkQA commented Mar 2, 2016

Test build #52282 has finished for PR 11457 at commit 67cd332.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2016

Test build #52289 has finished for PR 11457 at commit f601748.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 namesOfMasked <- c("describe", "cov", "filter", "lag", "na.omit", "predict", "sd", "var",
                    "colnames", "colnames<-", "intersect", "rank", "rbind", "sample", "subset",
-                   "summary", "transform", "drop")
+                   "summary", "transform", "drop", "read.csv", "write.csv")
Member Author


@felixcheung @sun-rui (First of all, sorry, I am not very fluent in R.)

I am not sure why it says these functions are being masked. I added them to this list anyway and all tests pass, but I am still wondering why this happens and whether it is correct. When I rename the functions, the warning goes away, but then there are no existing functions to mask.

It looks like SparkR masks the existing functions named read.csv and write.csv. The console output is below:

The following object is masked from ‘package:testthat’:

    describe

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:utils’:

    read.csv, write.csv

The following objects are masked from ‘package:base’:

    colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
    summary, transform

Could you please give me some advice on this?

Contributor


There are existing read.csv()/read.csv2() and write.csv()/write.csv2() functions in R's utils package. It would be great if we could implement these functions in SparkR with the same signatures as in the base distribution, so users can call them the same way in both SparkR and base R. It also seems that the options supported by the signatures of these functions are supported in spark-csv.
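As a rough illustration of this suggestion (hypothetical code, not part of this PR), a SparkR read.csv could mirror the leading parameters of utils::read.csv and translate them into spark-csv options via the SparkR 1.x read.df() API:

```r
# Hypothetical sketch only: mirror the first few utils::read.csv parameters
# and map them onto the spark-csv data source. Assumes a SparkR 1.x session
# with `sqlContext` already initialized.
setGeneric("read.csv", function(file, ...) standardGeneric("read.csv"))

setMethod("read.csv", signature(file = "character"),
          function(file, header = TRUE, sep = ",", quote = "\"", ...) {
            read.df(sqlContext, file, source = "com.databricks.spark.csv",
                    header = tolower(as.character(header)),
                    delimiter = sep, quote = quote)
          })
```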

Member Author


@sun-rui Thanks!

Member


To oversimplify: when a name conflicts but the parameter signature is not exactly the same, the new definition masks the other implementation. So please see whether you can change the signature to match, as sun-rui suggests.
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
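A toy base-R illustration of the masking (nothing here is SparkR-specific; the environment and function body are made up for the demo): attaching anything that defines read.csv places it ahead of package:utils on the search path, which produces exactly the warning quoted above.

```r
# Toy demo of masking: an attached environment that defines read.csv sits
# before package:utils on the search path, so unqualified calls hit it first.
e <- new.env()
e$read.csv <- function(file, ...) paste("toy reader for", file)

attach(e, name = "toy")   # R warns: 'read.csv' is masked from 'package:utils'
stopifnot(identical(read.csv("f.csv"), "toy reader for f.csv"))

f <- utils::read.csv      # the masked original stays reachable when qualified
detach("toy")
```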

Member Author


@felixcheung Actually, I have been looking into that link and working on this; however, I realised it cannot use exactly the same parameters, since spark-csv does not support all of them. I tried adding some of the possible parameters locally, but it ended up as an incomplete version.

This leads me to think this PR might have to be merged first and refined later (partly because the CSV options are not confirmed and not fully merged yet; see SPARK-12420).

Member Author


Otherwise, would you perhaps create another PR based on this one? I think it would take me quite a while to figure out, since this is only my second piece of R code (sorry).

Member


This is likely an issue, though: base R's read.csv is very commonly used, and a name conflict like the one detected here will mask the base version whenever the SparkR package is loaded.
I can probably help sort out the right signature; I'll try to look into this in the next few days.

Member Author


@felixcheung Thanks.

Contributor


I think it's better to match the arguments of read.csv rather than mask the local read.csv. One thing to note is that we don't need to support all the options; if any flags we don't support are set, we can throw a warning or error.
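That suggestion might look something like this (a hypothetical sketch, not this PR's code): keep the utils::read.csv argument names, but warn or stop on flags the Spark CSV data source cannot honor. The `dec` and `comment.char` handling below are illustrative assumptions, as is the use of the SparkR 1.x read.df() API.

```r
# Hypothetical sketch: accept utils::read.csv-style arguments, but refuse
# (or warn about) options the spark-csv data source does not support.
# Assumes a SparkR 1.x session with `sqlContext` already initialized.
read.csv <- function(file, header = TRUE, sep = ",", quote = "\"",
                     dec = ".", comment.char = "", ...) {
  if (!identical(dec, ".")) {
    stop("'dec' is not supported by the Spark CSV data source")
  }
  if (nzchar(comment.char)) {
    warning("'comment.char' is ignored by the Spark CSV data source")
  }
  read.df(sqlContext, file, source = "com.databricks.spark.csv",
          header = tolower(as.character(header)),
          delimiter = sep, quote = quote)
}
```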

Member Author


@felixcheung Could you give me the link to the PR once you open it? I will close this one after confirming that the other PR is open.

@SparkQA

SparkQA commented Mar 2, 2016

Test build #52291 has finished for PR 11457 at commit 8bb455e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

I am closing this, but it would be great if you could let me know when you open a new PR for this.
