Conversation

cafreeman

Added an API for building schemas when converting an RDD to a DataFrame. The old approach relied on a series of nested lists, which left little room for error checking and made for an unclear user experience. The new approach is very similar, but by formalizing fields and schemas as S3 objects with constructors, we can specify which arguments must be present and check incoming argument types before hitting the JVM.
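
For illustration, a minimal sketch of the proposed constructors (exact signatures are assumed here; `sqlCtx` and `rdd` are placeholder names):

# Sketch: build a schema from field constructors instead of raw nested lists.
schema <- buildSchema(
  field("name", "string", TRUE),
  field("age", "integer", TRUE)
)
df <- createDataFrame(sqlCtx, rdd, schema)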

cafreeman added 3 commits March 29, 2015 19:50
Instead of using a list[list[list[]]], use specific constructors for schema and field objects.
@cafreeman

@shivaram Any opinions on this?


shivaram commented Apr 1, 2015

@cafreeman Sorry for the delay -- I'll take a look at this today

@davies Could you also take a look ?


shivaram commented Apr 6, 2015

@cafreeman Sorry for the delay in looking at this -- I think this is a good idea and it definitely improves usability. One thing I was wondering is whether we should just create a new file called `schema.R` which has the definitions for `field`, `struct`, `tojson`, and `buildSchema`.

On that note, I was also wondering if we should name things more consistently, i.e. calling it `structField` to keep it consistent with the Scala naming conventions.

@davies It would be good if you could look at the `tojson` changes to check that logic.


davies commented Apr 6, 2015

The `tojson` changes look good to me.

@cafreeman

@shivaram Agreed on the naming. I'll need to refactor the existing `SQLTypes.R` code so that we can use `structType` and `structField` to create the R constructs as well as reference the `jobj` versions. Working on that right now.

cafreeman added 4 commits April 8, 2015 14:09
Refactored `structType` and `structField` so that they can be used to create schemas from R for use with `createDataFrame`.

Moved everything to `schema.R`

Added new methods to `SQLUtils.scala` for handling `StructType` and `StructField` on the JVM side
Refactored to use the new `structType` and `structField` functions.
The new version takes a `StructType` from R and creates a DataFrame.

Commented out the `tojson` version since we don't currently use it.
Updated `NAMESPACE`, `DESCRIPTION`, and unit tests for new schema functions.

Deleted `SQLTypes.R` since everything has been moved to `schema.R`.
@cafreeman

Alright, just pushed an overhaul of the schema functions. `structType` and `structField` can now be used to read existing objects as well as create new instances of each object on the JVM and in SparkR.

`createDataFrame` now uses `structType` instead of `buildSchema`. `buildSchema` and `field` have both been deleted.

`infer_type` and `createDataFrame` now use `structType` and do not use `tojson`.
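
A quick sketch of the refactored flow (`sqlCtx` and `rdd` are placeholder names):

# Sketch: build a schema with the refactored constructors, then create a DataFrame.
schema <- structType(
  structField("name", "string", TRUE),
  structField("age", "integer", TRUE)
)
df <- createDataFrame(sqlCtx, rdd, schema)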

I've commented out everything related to the `tojson` utility function, since none of the DataFrame/schema functions actually use it anymore. Is there any reason to keep it around, or can we go ahead and delete all the `tojson` stuff?

@shivaram @davies It'd be great to get your opinions.


davies commented Apr 9, 2015

Currently, we can specify the schema manually like this:

# The schema is specified as nested lists.
schema <- list(type="struct", fields=list(
  list(name="name", type="string", nullable=TRUE),
  list(name="age", type="integer", nullable=TRUE)
))

Does this new API sound much better than the current one?

Another thing we need to do is make sure the RDD has the right types to match the provided schema, and perform proper type conversion.

@cafreeman

I think the new API is preferable because there's a defined structure for creating DataFrames. It uses the same pattern as the original method; the difference is that you're now calling a function that has defined arguments and error checking, and that is consistent with the existing API (i.e. `StructType` and `StructField`).

The goal was to have less guesswork on the part of the user when it came to specifying schemas and making sure all the essential pieces were there.
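
For instance, a mistyped type name can be caught on the R side before anything reaches the JVM (the error text below is purely illustrative, not the actual message):

# Hypothetical: the constructor can reject an unsupported type immediately.
structField("age", "interger", TRUE)
# Error: unsupported type (exact message illustrative)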


davies commented Apr 9, 2015

That makes sense. Then do we need APIs for all the types (including nested ones, ArrayType and MapType)?

In Python, these are considered developer APIs and shouldn't be called frequently by end users, so I'm still wondering whether we really need to spend the effort to improve their usability.

@cafreeman

I think having `structType` and `structField` makes sense, since the user would actually interact with these directly, but I don't know if we need to fully expose the rest of the Spark SQL types in the same way.

For example, we don't expect the end user to be creating arrays on the JVM from the R side, so we probably don't need to expose those the way we do `structType`. Is that what you're saying?

@shivaram

@cafreeman Sorry for the delay.
I like the new API as well, and it seems cleaner to use a jobj instead of JSON. However, I am still trying to understand what this means for ArrayType and MapType.

Will the map type look something like MapType(key="integer", value="string")? And in terms of nesting, I guess it will just involve passing a jobj as the type instead of just strings?

@cafreeman

@shivaram To be honest, I'm not entirely sure yet. I think @davies was pointing out that getting rid of the JSON option means needing to more explicitly support Map and Array types.

While supporting them directly on the R side would mean that we'd have more API elements, I don't think it's the end of the world to have map and array type constructors similar to what I did for structType here.

The map type will come in handy for some of the MLLib stuff I'm working on right now, actually, so I could definitely add API support for maps and arrays as well.
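
Hypothetically, such constructors could mirror the existing ones; nothing below is implemented in this PR, and the names and signatures are illustrative only:

# Illustrative only -- one possible shape for complex-type constructors:
# mapType(keyType = "string", valueType = "integer")
# arrayType(elementType = "double", containsNull = TRUE)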

@shivaram

Alright. Since the code diff here is pretty small, let's try this approach and see how it goes. We already have an issue open for nested types [1], and we can revisit this at that point.

@cafreeman Can you bring this up to date with the sparkr-sql branch? I'll do one more pass of review now

[1] https://issues.apache.org/jira/browse/SPARK-6819

Review comment (Contributor):

A one-line comment at the top with a description of what is in this file would be good

Reply (Contributor Author):

Good call. Fixing that now.

Review comment (Contributor):

Similarly feel free to delete this now

@cafreeman

Alright, addressed the remaining comments from @shivaram.

@shivaram

LGTM. Merging this into sparkr-sql.

@davies Could you include this in your PR that merges changes from here into Spark?
Also, we should update the schema section of apache/spark#5442 to reflect this.

shivaram added a commit that referenced this pull request Apr 10, 2015
[SparkR-239] `buildSchema` and `field` functions
@shivaram shivaram merged commit 6731fb8 into amplab-extras:sparkr-sql Apr 10, 2015