[SPARK-13174][SparkR] Add read.csv and write.csv for SparkR #11457
Conversation
Test build #52282 has finished for PR 11457 at commit

Test build #52289 has finished for PR 11457 at commit
  namesOfMasked <- c("describe", "cov", "filter", "lag", "na.omit", "predict", "sd", "var",
                     "colnames", "colnames<-", "intersect", "rank", "rbind", "sample", "subset",
-                    "summary", "transform", "drop")
+                    "summary", "transform", "drop", "read.csv", "write.csv")
@felixcheung @sun-rui (First of all, sorry, I am not very fluent in R.)
I am not sure why it says these functions are being masked. I added them here anyway and all tests pass, but I am still wondering why this happens and whether it is correct. When I change the function names to something else the warning goes away, yet there appear to be no such functions to mask.
It looks like it masks the functions called read.csv and write.csv. The console output was:
The following object is masked from ‘package:testthat’:
    describe
The following objects are masked from ‘package:stats’:
    cov, filter, lag, na.omit, predict, sd, var
The following objects are masked from ‘package:utils’:
    read.csv, write.csv
The following objects are masked from ‘package:base’:
    colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
    summary, transform
Would you please give me some advice on this?
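To illustrate what the masking warning above means, here is a rough Python analogy (hypothetical names, not SparkR code): when a newly defined object has the same name as one from an already-imported namespace, the new definition shadows the old one, and the original stays reachable only through its qualified name, much like utils::read.csv in R.

```python
import csv
from csv import reader  # brings the standard reader into this namespace

# Redefining the same name "masks" the imported one, much like attaching
# SparkR masks utils::read.csv in an R session.
def reader(lines):
    return ["sparkr-style:" + line for line in lines]

masked_call = reader(["a,b"])                # the new definition wins
qualified_call = list(csv.reader(["a,b"]))   # original still reachable via csv.reader
```

The masked function is not gone; callers simply have to qualify it, which is exactly the inconvenience discussed below for base R users.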
There are existing read.csv()/read.csv2() and write.csv()/write.csv2() functions in base R. It would be great if we could implement these functions in SparkR with the same signatures as in base R, so that users can call them the same way in both SparkR and the base packages. The options supported in the base R signatures of these functions also seem to be supported by spark-csv.
@sun-rui Thanks!
To oversimplify: the new function masks the other implementation when the name conflicts but the parameter signature is not exactly the same. So please see whether you can change the signature to match, as sun-rui suggests.
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
@felixcheung Actually, I have been looking at that link and working on this; however, I realised that we cannot use exactly the same parameters, because spark-csv does not support all of them. I tried to add some of the possible parameters (locally), but it ended up as an incomplete version.
This leads me to think that this PR might have to be merged first and the signatures dealt with later (also partly because the CSV options are not confirmed yet and not fully merged yet; see SPARK-12420).
Otherwise, would you maybe create another PR based on this one? I think it would take me quite a while to figure out, since this is only my second piece of R code (sorry).
This is likely an issue, though - base R's read.csv is very commonly used, and a name conflict like the one detected here makes the base version inaccessible whenever the SparkR package is loaded.
I can probably help sort out the right signature - I'll try to check this out in the next few days.
@felixcheung Thanks.
I think it's better to match the arguments of read.csv and not mask the local read.csv. One thing to note is that we don't need to support all of the options; if some of the flags we don't support are set, we can throw a warning or an error.
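The pattern described here - keep the full reference signature, but warn or error when an unsupported option is actually set - can be sketched in Python (hypothetical option names; this is not the SparkR implementation):

```python
import warnings

# Options assumed, for illustration only, to exist in the reference
# signature but not be supported by the backend.
_UNSUPPORTED = {"colClasses", "fill"}

def read_csv(path, header=True, sep=",", **options):
    """Mirror a richer reference signature, warning on unsupported flags."""
    bad = _UNSUPPORTED & options.keys()
    if bad:
        warnings.warn("ignoring unsupported options: %s" % sorted(bad))
    kept = {k: v for k, v in options.items() if k not in _UNSUPPORTED}
    # A real implementation would hand `kept` to the CSV backend;
    # here we just return the resolved configuration.
    return {"path": path, "header": header, "sep": sep, **kept}
```

With this shape, a call such as read_csv("data.csv", colClasses="character") warns and drops the flag instead of silently accepting it, so callers using the base signature still get predictable behaviour.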
@felixcheung Could you maybe give me the link to the PR if you open one? I will close this one after checking that another PR is open.
Test build #52291 has finished for PR 11457 at commit
I am closing this, but it would be great to let me know if you have a new PR for this.
What changes were proposed in this pull request?
This PR adds read.csv and write.csv for SparkR. They were already added for Scala, Java and Python. This PR is basically similar to #10348.
How was this patch tested?
Running R/run_tests.sh; Jenkins will also test this.