Implement Arrow #1611
huzzah!

@javierluraschi would you be interested in doing a write-up for the Apache Arrow blog about this work, including all the benchmark results?

@wesm yes, for sure. However, I don't consider this work complete, mostly because of arrow_data.R#L21: I'm currently turning off arrow for the unsupported data types. We have dates almost figured out, but nested data is also missing. I'm also investigating larger copy/collect use cases by tweaking batches. So we could write a "preliminary results" post on your blog mentioning these caveats and the current state of this work, or we could wait until we push everything to CRAN, which is probably a couple of months away, or do both posts. What's your take?

I recommend a blog post much sooner, as a means of also drumming up community involvement.

@wesm Makes sense. How do I send you a blog post?

You can do it as a pull request to the site/ directory in the Arrow repo.

@wesm here is a draft post: apache/arrow#3001
Support for Apache Arrow in sparklyr.

Benchmarks

For completeness, adding sparkR, which gets initialized as:

Copying

Collecting

Running this benchmark with 10^6 entries shows improvements under arrow.

spark_apply()

Notice that JIT was turned off since it adds a bit of overhead in spark_apply() for this particular example. Here is a detailed comparison between JIT enabled/disabled with arrow:

Here is a profile measuring time spent while running spark_apply(). Loading arrow seems to take 260 ms, which could be worth investigating further at some point:

Comparing with scala:

Tests

From the Travis run performance results, we can compare execution against arrow as follows:
Overall, arrow tests execute faster than the sparklyr serializer. Travis tests use only small datasets, but they help ensure that unnecessary overhead is not being introduced.