Conversation

cafreeman

Added an API for building schemas when converting an RDD to a DataFrame. The old approach relied on a series of nested lists, which left little room for error checking and made for an unclear user experience. The new approach is very similar, but by formalizing fields and schemas as S3 objects with constructors, we can specify which arguments must be present and check incoming argument types before hitting the JVM.
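
For illustration, a minimal sketch of the proposed constructors (exact signatures are assumed here; `sqlCtx` and `rdd` are placeholder names):

# Sketch: build a schema from field constructors instead of raw nested lists.
schema <- buildSchema(
  field("name", "string", TRUE),
  field("age", "integer", TRUE)
)
df <- createDataFrame(sqlCtx, rdd, schema)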

cafreeman added 3 commits March 29, 2015 19:50
Instead of using a list[list[list[]]], use specific constructors for schema and field objects.
@cafreeman

@shivaram Any opinions on this?


shivaram commented Apr 1, 2015

@cafreeman Sorry for the delay -- I'll take a look at this today

@davies Could you also take a look ?


shivaram commented Apr 6, 2015

@cafreeman Sorry for the delay in looking at this -- I think this is a good idea and it definitely improves usability. One thing I was wondering is whether we should just create a new file called `schema.R` which has the definitions for `field`, `struct`, `tojson`, and `buildSchema`.

On that note, I was also wondering if we should name things more consistently, i.e. calling it `structField` to keep it consistent with the Scala naming conventions.

@davies It would be good if you could look at the `tojson` changes to check that logic.


davies commented Apr 6, 2015

The `tojson` changes look good to me.

@cafreeman

@shivaram Agreed on the naming. I'll need to refactor the existing `SQLTypes.R` code so that we can use `structType` and `structField` to create the R constructs as well as reference the `jobj` versions. Working on that right now.

cafreeman added 4 commits April 8, 2015 14:09
Refactored `structType` and `structField` so that they can be used to create schemas from R for use with `createDataFrame`.

Moved everything to `schema.R`

Added new methods to `SQLUtils.scala` for handling `StructType` and `StructField` on the JVM side
Refactored to use the new `structType` and `structField` functions.
The new version takes a `StructType` from R and creates a DataFrame.

Commented out the `tojson` version since we don't currently use it.
Updated `NAMESPACE`, `DESCRIPTION`, and unit tests for new schema functions.

Deleted `SQLTypes.R` since everything has been moved to `schema.R`.
@cafreeman

Alright, just pushed an overhaul of the schema functions. `structType` and `structField` can now be used to read existing objects as well as create new instances of each object on the JVM and in SparkR.

`createDataFrame` now uses `structType` instead of `buildSchema`. `buildSchema` and `field` have both been deleted.

`infer_type` and `createDataFrame` now use `structType` and do not use `tojson`.
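
A quick sketch of the refactored flow (`sqlCtx` and `rdd` are placeholder names):

# Sketch: build a schema with the refactored constructors, then create a DataFrame.
schema <- structType(
  structField("name", "string", TRUE),
  structField("age", "integer", TRUE)
)
df <- createDataFrame(sqlCtx, rdd, schema)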

I've commented out everything related to the `tojson` utility function, since none of the DataFrame/schema functions actually use it anymore. Is there any reason to keep it around, or can we go ahead and delete all the `tojson` stuff?

@shivaram @davies It'd be great to get your opinions.


davies commented Apr 9, 2015

Currently, we can specify the schema manually like this:

# The schema is specified as nested lists.
schema <- list(type="struct", fields=list(
  list(name="name", type="string", nullable=TRUE),
  list(name="age", type="integer", nullable=TRUE)
))

Does this new API sound much better than the current one?

Another thing we need to do is make sure the RDD has the right types to match the provided schema, and perform proper type conversion.

@cafreeman

I think the new API is preferable because there's a defined structure for creating DataFrames. It uses the same pattern as the original method; the difference is that you're now calling a function that has defined arguments and error checking, and that is consistent with the existing API (i.e. `StructType` and `StructField`).

The goal was to have less guesswork on the part of the user when it came to specifying schemas and making sure all the essential pieces were there.
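
For instance, a mistyped type name can be caught on the R side before anything reaches the JVM (the error text below is purely illustrative, not the actual message):

# Hypothetical: the constructor can reject an unsupported type immediately.
structField("age", "interger", TRUE)
# Error: unsupported type (exact message illustrative)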


davies commented Apr 9, 2015

That makes sense. Then do we need APIs for all the types (including nested ones, ArrayType and MapType)?

In Python, these are considered developer APIs and shouldn't be called frequently by end users, so I'm still wondering whether we really need to spend the effort to improve their usability.

@cafreeman

I think having `structType` and `structField` makes sense, since the user would actually interact with these directly, but I don't know if we need to fully expose the rest of the Spark SQL types in the same way.

For example, we don't expect the end user to be creating arrays on the JVM from the R side, so we probably don't need to expose those the way we do `structType`. Is that what you're saying?

@shivaram

@cafreeman Sorry for the delay.
I like the new API as well, and it seems cleaner to use a jobj instead of JSON. However, I am still trying to understand what this means for ArrayType and MapType.

Will the map type look something like MapType(key="integer", value="string")? And in terms of nesting, I guess it will just involve passing a jobj as the type instead of just strings?

@cafreeman

@shivaram To be honest, I'm not entirely sure yet. I think @davies was pointing out that getting rid of the JSON option means needing to more explicitly support Map and Array types.

While supporting them directly on the R side would mean that we'd have more API elements, I don't think it's the end of the world to have map and array type constructors similar to what I did for structType here.

The map type will come in handy for some of the MLLib stuff I'm working on right now, actually, so I could definitely add API support for maps and arrays as well.
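
Hypothetically, such constructors could mirror the existing ones; nothing below is implemented in this PR, and the names and signatures are illustrative only:

# Illustrative only -- one possible shape for complex-type constructors:
# mapType(keyType = "string", valueType = "integer")
# arrayType(elementType = "double", containsNull = TRUE)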

@shivaram

Alright. Since the code diff here is pretty small, let's try this approach and see how it goes. We already have an issue open for nested types [1], and we can revisit this at that point.

@cafreeman Can you bring this up to date with the sparkr-sql branch? I'll do one more pass of review now

[1] https://issues.apache.org/jira/browse/SPARK-6819

Review comment (Contributor):

A one-line comment at the top with a description of what is in this file would be good

Reply (Contributor Author):

Good call. Fixing that now.

Review comment (Contributor):

Similarly feel free to delete this now

@cafreeman

Alright, addressed the remaining comments from @shivaram.

@shivaram

LGTM. Merging this into sparkr-sql.

@davies Could you include this in your PR that merges changes from here into Spark?
Also, we should update the schema section of apache/spark#5442 to reflect this.

shivaram added a commit that referenced this pull request Apr 10, 2015
[SparkR-239] `buildSchema` and `field` functions
@shivaram shivaram merged commit 6731fb8 into amplab-extras:sparkr-sql Apr 10, 2015