[SparkR-239] buildSchema and field functions #235
Conversation
Instead of using a list[list[list[]]], use specific constructors for schema and field objects.
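To make the contrast concrete, here is a minimal sketch of the idea in Python (used here only because it is easy to show self-contained; the PR itself is R, and the names `struct_field`/`struct_type` are hypothetical stand-ins for the R constructors, not SparkR's actual code):

```python
# Hypothetical sketch: field/schema constructors that validate their
# arguments, versus the old "list of lists" encoding.

VALID_TYPES = {"string", "integer", "double", "boolean"}  # illustrative subset

def struct_field(name, data_type, nullable=True):
    """Build one schema field, checking argument types up front."""
    if not isinstance(name, str):
        raise TypeError("field name must be a string")
    if data_type not in VALID_TYPES:
        raise ValueError(f"unknown type: {data_type!r}")
    return {"name": name, "type": data_type, "nullable": nullable}

def struct_type(*fields):
    """Build a schema from one or more fields."""
    if not fields:
        raise ValueError("a schema needs at least one field")
    return {"fields": list(fields)}

# Old style: nothing is checked until the JVM rejects it.
old_schema = [["name", "string"], ["age", "integer"]]

# New style: errors surface immediately on the client side.
new_schema = struct_type(
    struct_field("name", "string"),
    struct_field("age", "integer"),
)
```

With constructors, a typo in a type name or a missing argument fails fast with a clear message instead of producing a malformed nested list.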
@shivaram Any opinions on this?
@cafreeman Sorry for the delay -- I'll take a look at this today. @davies Could you also take a look?
@cafreeman Sorry for the delay in looking at this -- I think this is a good idea and it definitely improves usability. One thing I was wondering is if we should just create a new file called On that thought, I was also wondering if we should name things more consistently, i.e. calling it @davies Will be good if you can look at the
@shivaram Agreed on the naming. I'll need to refactor the existing
Refactored `structType` and `structField` so that they can be used to create schemas from R for use with `createDataFrame`. Moved everything to `schema.R`. Added new methods to `SQLUtils.scala` for handling `StructType` and `StructField` on the JVM side.
Refactored to use the new `structType` and `structField` functions.
The new version takes a `StructType` from R and creates a DataFrame. Commented out the `tojson` version since we don't currently use it.
Updated `NAMESPACE`, `DESCRIPTION`, and unit tests for new schema functions. Deleted `SQLTypes.R` since everything has been moved to `schema.R`.
Alright, just pushed an overhaul of the schema functions.
I've commented out anything related to the
Currently, we can specify the schema manually like this:
Does this new API sound much better than the current one? Another thing we need to do is make sure the RDD has the right types for the provided schema, and perform the proper type conversion.
I think the new API is preferable because there's a defined structure for going about creating DataFrames. It uses the same pattern as the original method; the difference is that you're now calling a function that has defined arguments, does error checking, and is consistent with the current API (i.e. The goal was to have less guesswork on the part of the user when it came to specifying schemas and making sure all the essential pieces were there.
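As a hedged sketch of the kind of client-side checking this enables (Python for illustration; `validate_rows` and the dict schema representation are hypothetical, not SparkR's API), one could verify each row against the declared schema before anything reaches the JVM, which also speaks to the type-conversion concern raised above:

```python
# Hypothetical sketch: check each row's values against the declared field
# types early, so bad data fails on the client rather than inside the JVM.

PY_TYPES = {"string": str, "integer": int, "double": float, "boolean": bool}

def validate_rows(rows, schema):
    """Raise early if any row value disagrees with the schema."""
    fields = schema["fields"]
    for i, row in enumerate(rows):
        if len(row) != len(fields):
            raise ValueError(f"row {i} has {len(row)} values, expected {len(fields)}")
        for value, field in zip(row, fields):
            if value is None and not field["nullable"]:
                raise ValueError(f"row {i}: {field['name']} may not be null")
            if value is not None and not isinstance(value, PY_TYPES[field["type"]]):
                raise TypeError(f"row {i}: {field['name']} expected {field['type']}")

schema = {"fields": [
    {"name": "name", "type": "string", "nullable": True},
    {"name": "age", "type": "integer", "nullable": True},
]}
validate_rows([("Alice", 1), ("Bob", 2)], schema)  # passes silently
```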
That makes sense. Then do we need APIs for all the types (including the nested ones, ArrayType and MapType)? In Python, these are called developer APIs and should not be called frequently by end users, so I'm still wondering whether we really need the effort to improve usability.
I think having Like, we don't expect the end-user to be trying to create Arrays (for example) on the JVM from the R side, so we probably don't need to fully expose them like we are for
@cafreeman Sorry for the delay. Will the map type look something like
@shivaram To be honest, I'm not entirely sure yet. I think @davies was pointing out that getting rid of the JSON option means needing to more explicitly support Map and Array types. While supporting them directly on the R side would mean that we'd have more API elements, I don't think it's the end of the world to have map and array type constructors similar to what I did for structType here. The map type will come in handy for some of the MLlib stuff I'm working on right now, actually, so I could definitely add API support for maps and arrays as well.
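For reference, such constructors for the nested types discussed above might look like the following (Python sketch; the names `array_type`/`map_type` and the representation are assumptions for illustration, not the eventual SparkR API):

```python
# Hypothetical sketch: constructors for nested types, mirroring the
# validate-before-the-JVM approach of the field/schema constructors.

ATOMIC_TYPES = {"string", "integer", "double", "boolean"}  # illustrative subset

def _check(t, what):
    if t not in ATOMIC_TYPES:
        raise ValueError(f"unsupported {what} type: {t!r}")

def array_type(element_type):
    """An array whose elements all share one atomic type."""
    _check(element_type, "element")
    return {"kind": "array", "element": element_type}

def map_type(key_type, value_type):
    """A map with atomic key and value types."""
    _check(key_type, "key")
    _check(value_type, "value")
    return {"kind": "map", "key": key_type, "value": value_type}

tags = array_type("string")
scores = map_type("string", "double")
```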
Alright. Since the code diff here is pretty small, let's try this approach and see how it goes. We already have an issue open for nested types [1] and we can revisit this at that point. @cafreeman Can you bring this up to date with the sparkr-sql branch? I'll do one more pass of review now.
A one-line comment at the top with a description of what is in this file would be good.
Good call. Fixing that now.
Similarly feel free to delete this now
Alright, addressed the remaining comments from @shivaram.
LGTM. Merging this into sparkr-sql. @davies Could you include this in your PR of merging things from here to Spark?
[SparkR-239] `buildSchema` and `field` functions
Added an API for building schemas when converting an `RDD` to a `DataFrame`. The old approach simply relied on a series of nested lists, which resulted in minimal opportunity for error checking or straightforward UX. This new approach is very similar, but by formalizing the fields and schema into S3 objects with constructors, we're able to specify the arguments that need to be present and check incoming argument types before hitting the JVM.