[SPARK-20208][R][DOCS] Document R fpGrowth support #17557
Test build #75588 has finished for PR 17557 at commit
Test build #75593 has started for PR 17557 at commit
Test build #75602 has finished for PR 17557 at commit
Test build #75605 has finished for PR 17557 at commit
R/pkg/vignettes/sparkr-vignettes.Rmd
Outdated
thanks! - I'd prefer an example with real data...
What do you mean by "real"? Something human readable (e.g. milk, bread, butter) or some standard pattern mining dataset? If the former one then it is not a problem. If the latter one I am not aware of any dataset which would be safe enough on the license side.
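To make the "human readable" option concrete, here is a minimal sketch in plain Python (not SparkR, and not the dataset used in this PR). The transactions, the function names `frequent_itemsets` and `association_rules`, and the thresholds are all hypothetical. It enumerates candidates by brute force rather than building an FP-tree, so it only illustrates what support and confidence mean on a tiny basket dataset.

```python
from itertools import combinations

# Hypothetical, human-readable transactions (an assumption, not data from the PR).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets whose support >= min_support.

    Brute-force enumeration, used here only for illustration.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            # Count the transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[frozenset(candidate)] = count
                found = True
        if not found:
            # If no itemset of this size is frequent, no larger one can be.
            break
    return result


def association_rules(itemsets, min_confidence):
    """Return (antecedent, consequent, confidence) with a single-item consequent."""
    rules = []
    for itemset, count in itemsets.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # Every subset of a frequent itemset is frequent, so the lookup is safe.
            confidence = count / itemsets[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.7)

for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "=>", consequent, confidence)
```

A dataset this small is readable but, as noted in the reply, it really is coded in a few lines, which is the trade-off under discussion.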
something that is not coded in 3 lines ;)
reading from a file if we could - if there isn't any dataset that we can license to use, can we anonymize an existing one?
> something that is not coded in 3 lines ;)
That's for sure :) For now I am more trying to figure out how to present this to make it useful. For ML guide we can safely reuse data/mllib but I don't think we can do the same with vignette unless we bring sample_fpgrowth.txt as a package data.
Test build #75633 has finished for PR 17557 at commit
@felixcheung For vignette I used a bit larger synthetic dataset which should show all the features implemented by |
R/pkg/vignettes/sparkr-vignettes.Rmd
Outdated
perhaps it's slightly less clear, since there are 3 references to "items" (or really, just the SparkDataFrame and its column name), which "items" L923 is referring to?
I like the approach you have there
https://github.com/apache/spark/pull/17557/files#diff-1d0d34d8ea18a9340f0a02c6befe6269R30
@felixcheung Updated.
BTW There is a JIRA tracking SQL functions parity, isn't there?
do you mean making sure we have all the SQL functions in R?
we don't, actually, since it's an evolving task - there are constantly new functions being added.
I think you are referring to split - yes we should probably add that in R too
nit: could you please rename the dataframe to df like the other example you have too?
I was pretty sure I've seen one :) split and array. There are of course name conflicts involved (spark.array?) but it would be really useful to have these.
that's possible but I'm fairly certain there are still quite a few functions we have missed over the year that are not in JIRA.
feel free to add them - would appreciate your help.
agree array is a bit tricky - I'd rather not diverge, for consistency with the array_contains function and so on, but I can see spark.array might be an approach. Or perhaps array_col?
Or maybe create_array (like PySpark create_map)?
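For context on why `split` came up: the items column fed to a frequent-pattern model is an array per row, so a delimited string column has to be turned into arrays first. The sketch below shows that step in plain Python only; `split_items` and `raw_rows` are hypothetical names, and the trimming and de-duplication are additions for illustration, not something a SQL `split` does by itself.

```python
def split_items(raw, sep=","):
    """Split a delimited string into unique, non-empty items in first-seen order."""
    seen = []
    for token in raw.split(sep):
        item = token.strip()
        # Skip empty tokens and repeated items so each transaction lists an item once.
        if item and item not in seen:
            seen.append(item)
    return seen


# Hypothetical raw rows, one delimited string per transaction.
raw_rows = ["milk,bread", "bread, butter,bread", ""]
items_column = [split_items(row) for row in raw_rows]
print(items_column)
```

An `array`-style constructor (whatever name is eventually chosen) would cover the other direction, building the array from several existing columns instead of from one delimited string.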
Test build #75854 has finished for PR 17557 at commit
Test build #75859 has finished for PR 17557 at commit
felixcheung
left a comment
LGTM except one issue
R/pkg/vignettes/sparkr-vignettes.Rmd
Outdated
oops. missed this - this should be {r}
- Vignettes. - Programming guide. - Code example.
Test build #75893 has finished for PR 17557 at commit
## What changes were proposed in this pull request?

Document fpGrowth in:

- vignettes
- programming guide
- code example

## How was this patch tested?

Manual tests.

Author: zero323 <[email protected]>

Closes #17557 from zero323/SPARK-20208.

(cherry picked from commit 702d85a)
Signed-off-by: Felix Cheung <[email protected]>
merged to master/2.2
Thanks @felixcheung!