Conversation


@vgankidi vgankidi commented Jun 27, 2019

What changes were proposed in this pull request?

This adds a materialize API to DataFrames. materialize is a Spark action that lets the user explicitly tell Spark that the DataFrame has already been computed. Once a DataFrame is materialized, Spark skips all stages prior to the materialize point when the DataFrame is reused later on.
Please refer to SPARK-28188 for the rationale behind adding the materialize API.
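
For illustration, a rough sketch of how the proposed API might be used (the input path and column names are made up; materialize is the method proposed in this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("materialize-example").getOrCreate()
import spark.implicits._

// Illustrative input path and columns; any DataFrame with an expensive
// upstream computation would do.
val events = spark.read.parquet("/data/events")
val counts = events.groupBy("user_id").count()   // shuffle boundary

// Proposed API: run the plan up to this point once, so later uses of
// `materialized` pick up from the last shuffle instead of recomputing
// the whole query.
val materialized = counts.materialize()

materialized.filter($"count" > 10).show()
materialized.orderBy($"count".desc).show()
```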

How was this patch tested?

Tested manually

@srowen
Member

srowen commented Jun 30, 2019

I don't think we should add this. It's already very common to call .count() or .mapPartitions with a no-op to do this. I do think there are use cases for proactively materializing, but it's overused too.
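
For context, a minimal sketch of the existing workarounds mentioned above, assuming an existing DataFrame df:

```scala
// Common workaround today: force computation with an action whose result
// is thrown away.
df.count()

// Or walk every partition of the underlying RDD with a no-op.
df.rdd.foreachPartition(_ => ())
```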

@danielcweeks

@srowen I think the point here is to provide an explicit way of materializing data so that it can be reused in later stages as opposed to relying on side-effects of operations that produce the same result.

This is a really common issue, so having a clear expression of the intent really helps to disambiguate the logic.

*/
def cache(): this.type = persist()

def materialize(): RDD[T] = withScope{
Member

Please fix the PR title. This is an RDD API, not a DataFrame API.

Contributor

This PR also includes Dataset.materialize, which calls this RDD version.
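
Based on the excerpt above and the runJob discussion later in this thread, a hedged sketch of what such an RDD-level action could look like; this is an assumption about the shape of the change, not the PR's actual code:

```scala
import org.apache.spark.rdd.RDD

// Assumed shape of the RDD-level action: run a no-op job that consumes
// every partition so all upstream stages are computed, then return the
// same RDD so callers keep building on the already-computed lineage.
def materialize[T](rdd: RDD[T]): RDD[T] = {
  rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => iter.foreach(_ => ()))
  rdd
}
```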

@dongjoon-hyun
Member

I agree with @srowen, but since this is a legitimate community request, ping @mateiz, @rxin, @gatorsmile, @cloud-fan, too.

@dongjoon-hyun dongjoon-hyun changed the title SPARK-28188 Materialize Dataframe API [SPARK-28188] Materialize Dataframe API Jul 3, 2019
@rxin
Contributor

rxin commented Jul 3, 2019

Did I miss something? How does "runJob" materialize the query plan / rdd?

@rdblue
Contributor

rdblue commented Jul 3, 2019

@rxin, this runs the query up to the point where materialize is called. The underlying RDD can then pick up from the last shuffle the next time it is used. This works better than caching in most cases when using dynamic allocation because executors are not sitting idle, but work can be resumed and shared across queries. We could rename the method if that would be more clear.

@srowen, I've seen this suggested on the dev list a few times and I think it is a good idea to add it. There is no guarantee that count does the same thing -- it could be optimized -- and it is a little tricky to get this to work with the Dataset API. This version creates a new DataFrame from the underlying RDD so that the work is reused from the last shuffle, instead of allowing the planner to re-optimize with later changes (usually projections) and discard the intermediate result. We have found this really useful for better control over the planner, as well as to cache data using the shuffle system.
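
A rough sketch of the hand-rolled version of this pattern as it can be done today, assuming the plan contains at least one shuffle (the helper name is made up):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hand-rolled approximation: convert to an RDD, run it once so the shuffle
// output exists, then wrap the same RDD in a new DataFrame so later queries
// build on the computed lineage instead of re-planning the original query.
def materializeByHand(spark: SparkSession, df: DataFrame): DataFrame = {
  val rdd = df.rdd      // fixes the physical plan up to this point
  rdd.count()           // action: computes all stages, populating shuffle files
  spark.createDataFrame(rdd, df.schema)
}
```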

@rxin
Contributor

rxin commented Jul 4, 2019

Got it. But the name is plain wrong because the function doesn't materialize the DataFrame. As a matter of fact, if you run this "materialize" on a query plan without a shuffle, it becomes a completely useless job that wastes a lot of resources.

I've seen a different use case myself: I want to measure the execution time of a query plan without materializing the data, invoking any I/O, or adding any other overhead. What I ended up doing was implementing a simple data sink that doesn't write anything.

Looks like that can be used here as well?

@cloud-fan
Contributor

+1 for the no-op sink, I think it should be the same as the no-op runJob here.

@vgankidi
Author

vgankidi commented Jul 4, 2019

@rxin Yes, materializing a query plan without a shuffle just wastes resources. We usually recommend repartitioning the DataFrame before invoking materialize to add a shuffle. I think this can be used for your use case of measuring the execution time of a query plan. What name would you suggest for this API?
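
A tiny sketch of that recommendation, assuming an existing DataFrame df; the partition count is arbitrary and materialize is the API proposed in this PR:

```scala
// The repartition introduces a shuffle, so the materialized work can
// actually be picked up from shuffle files later.
val materialized = df.repartition(64).materialize()
```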

@rxin
Contributor

rxin commented Jul 4, 2019

Something like writing to a no-op sink ...

@vgankidi
Author

Updated the PR with the change in name.

@cloud-fan
Contributor

It's simple to write to the noop sink: df.write.format("noop").save. Why do we need this extra public API?
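
For reference, a minimal sketch of that noop-sink pattern, assuming a Spark version that ships the built-in noop format and an existing DataFrame df; mode("overwrite") sidesteps the default ErrorIfExists save-mode check:

```scala
// The whole query runs, but nothing is written anywhere; useful for timing
// a plan without materializing output.
df.write.format("noop").mode("overwrite").save()
```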

@felixcheung
Member

felixcheung commented Jul 15, 2019

> @rxin, this runs the query up to the point where materialize is called. The underlying RDD can then pick up from the last shuffle the next time it is used. This works better than caching in most cases when using dynamic allocation because executors are not sitting idle, but work can be resumed and shared across queries. We could rename the method if that would be more clear.
>
> @srowen, I've seen this suggested on the dev list a few times and I think it is a good idea to add it. There is no guarantee that count does the same thing -- it could be optimized -- and it is a little tricky to get this to work with the Dataset API. This version creates a new DataFrame from the underlying RDD so that the work is reused from the last shuffle, instead of allowing the planner to re-optimize with later changes (usually projections) and discard the intermediate result. We have found this really useful for better control over the planner, as well as to cache data using the shuffle system.

I have to agree with this - I've seen count() or cache() misused too many times, and too many times people have had to go back, clean up, and remove all the calls to count(). So much so that I'm planning to write an optimizer rule to remove them. I'm only partly kidding.

Maybe this isn't the right API for it, and that's OK; let's improve it then and make a good suggestion to the community/contributor.

I'm not sure df.write.format("noop").save is a good suggestion for the general Spark user.

@rdblue
Contributor

rdblue commented Jul 15, 2019

I think this should be an action, not a sink. A no-op sink is just another way to misuse existing APIs for a different purpose. And worse, a noop sink doesn't actually accomplish the goal: this call returns a DataFrame that will reuse the data stored in shuffle servers. A noop sink would not work for DataFrames because you have to get Spark to reuse the same underlying RDD that has already been run.

@cloud-fan
Contributor

cloud-fan commented Jul 16, 2019

I've spent more time understanding the use case, and I think table cache would be a better choice here:

  1. Disk vs. memory: you can set the storage level to disk-only with more than one copy, which is more reliable than shuffle files.
  2. Shuffle service: it supports RDD blocks as well, thanks to [SPARK-27677][Core] Serve local disk persisted blocks by the external service after releasing executor by dynamic allocation #24499.

In addition, the table cache has more advantages:

  1. It can work for any DataFrame, even without shuffles.
  2. Other queries can benefit from the table cache automatically.

You do have a point that table cache is lazy, but we can add a flag to control it, e.g. def cache(eager: Boolean = false).
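
A short sketch of this alternative with today's API, assuming an existing DataFrame df; the count() stands in for the proposed eager flag, which does not exist yet:

```scala
import org.apache.spark.storage.StorageLevel

// Persist to disk-only with two replicas, then trigger eager
// materialization with a cheap action.
val cached = df.persist(StorageLevel.DISK_ONLY_2)
cached.count()

// Later reuse reads the persisted blocks instead of recomputing the plan.
cached.show()
```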

@rdblue
Contributor

rdblue commented Jul 16, 2019

@cloud-fan, I wasn't aware of #24499. Serving cache blocks from the shuffle service sounds like it could be a good alternative solution. I like that it would serve whole blocks instead of shards. We would definitely need a way to eagerly cache.

@AmplabJenkins

Can one of the admins verify this patch?

@dongjoon-hyun
Member

Hi, all.
Given the previous discussion and the long inactivity, I'll close this PR.
@vgankidi, feel free to reopen it if you have a different opinion.
