[SPARK-9743] [SQL] Fixes JSONRelation refreshing #8035
Conversation
cc @yhuai
Is this the same as the refresh invoked in DataSourceStrategy? Every time we build the RDD, we need to refresh the file status.
Actually, I've updated #8023 by simply removing all of the refresh calls in DataSourceStrategy; InsertIntoHadoopFsRelation will update the file status implicitly while writing the data.
The refresh added in DataSourceStrategy also affects Parquet and ORC.
I haven't finished reviewing #8023, let's have an offline discussion about it tomorrow :)
Why does calling refresh here work? Does inputPaths contain all file paths, or just directory paths?
inputPaths contains the "base" input paths, while the inputPaths argument of the other buildScan method in JSONRelation contains FileStatuses of the leaf files under the base input paths, which are retrieved from the FileStatusCache in HadoopFsRelation.
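For reference, a rough sketch of the two buildScan variants being compared; the signatures are abbreviated from the Spark 1.5 sources, so treat this as illustrative rather than the exact code:

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Abbreviated sketch of the two buildScan variants under discussion.
abstract class HadoopFsRelationSketch {
  // Variant fed from the FileStatusCache: receives FileStatuses of the
  // leaf files already discovered under the base input paths.
  def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter],
      inputFiles: Array[FileStatus]): RDD[Row]

  // Variant that only receives the "base" input paths as strings, so the
  // actual file listing happens while the RDD is built.
  def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter],
      inputPaths: Array[String]): RDD[Row]
}
```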
What issue? That cached metadata can be out-of-date when users add data directly? Because refreshing ORC and Parquet tables' metadata is expensive, I think that is expected behavior (users need to call refresh explicitly).
OK, but it seems that only ParquetRelation will refresh the metadata; for JSON and ORC, only the file status will be refreshed.
The problem here is that once the table is cached, even if refresh() is called implicitly, we still will not get the refreshed data.
For JSON, we do not have any metadata to cache other than the file status, right?
"once the table cached" => you mean after we call CACHE TABLE? I do see that we drop the cached logical plan from our cache manager. Can you try it and see if we drop the cached table data?
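To make the scenario concrete, here is a hedged sketch of the flow being debated. It mirrors the spirit of InsertSuite."save directly to the path of a JSON table", not the exact test; `sqlContext`, `path`, and `newData` are assumed to be in scope:

```scala
// Hypothetical repro sketch (Spark 1.5-era API); assumes an in-scope
// SQLContext `sqlContext`, a directory `path`, and a DataFrame `newData`.
val df = sqlContext.read.json(path)        // temporary table over `path`
df.registerTempTable("jt")
sqlContext.sql("CACHE TABLE jt")           // materialize the table data

// Another writer saves directly to the table path.
newData.write.mode("overwrite").json(path)

// Question under discussion: after this, does a scan of `jt` pick up the
// new files, and does uncaching/refreshing drop the stale cached data?
sqlContext.table("jt").count()
```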
I don't mean that we need to refresh the schema/metadata, but that we should fetch the latest files under the table path.
Since the files under the table path may be changed by other applications, I think we have to refresh the file status before scanning the files.
Perhaps we can move the Parquet metadata refreshing out of HadoopFsRelation.refresh.
Update: I mean out of ParquetRelation.refresh().
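As a hypothetical illustration of that split (the names here are invented, not Spark's actual classes):

```scala
// Hypothetical sketch of the proposed split: cheap file-status listing is
// refreshed before every scan, while expensive format-specific metadata
// (e.g. Parquet footers) is only refreshed on an explicit user request.
trait RelationRefresh {
  // Cheap: re-list the leaf files under the base paths.
  def refreshFileStatuses(): Unit

  // Expensive: re-read format metadata such as Parquet footers; left to
  // an explicit REFRESH TABLE call.
  def refreshFormatMetadata(): Unit

  // Always pick up files written by other applications before scanning.
  final def beforeScan(): Unit = refreshFileStatuses()
}
```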
Test build #40176 timed out for PR 8035 at commit
test this please
When we create RDDs for a partitioned JSON table, are we using this broadcastedConf?
No, because we're not using SqlNewHadoopRDD here. This method was originally marked as final in HadoopFsRelation and wasn't supposed to be overridden in concrete data source implementations.
Then, if we read a table with lots of partitions, the time spent on planning the query will be pretty long, right?
Let's file a follow-up JIRA to reuse this broadcastedConf, so we will not broadcast the Hadoop conf for every RDD created for every partition.
BTW, do we have a JIRA for investigating the use of SerializableConfiguration instead of this shared broadcast conf for 1.6?
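For context, the broadcast-once pattern under discussion looks roughly like this. Spark's real SerializableConfiguration is a private utility class, so a minimal stand-in is reproduced here:

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

// Minimal stand-in for Spark's internal SerializableConfiguration (the
// real one is private[spark]); it makes a Hadoop Configuration
// Java-serializable so it can be broadcast once and reused by many RDDs.
class SerializableConfiguration(@transient var value: Configuration)
  extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out) // Configuration is a Writable and serializes itself
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

// Usage sketch: broadcast the conf once per relation instead of once per
// RDD (assumes an in-scope SparkContext `sc`); tasks then read
// broadcastedConf.value.value.
//   val broadcastedConf =
//     sc.broadcast(new SerializableConfiguration(sc.hadoopConfiguration))
```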
I think you mean #7696
Test build #40200 has finished for PR 8035 at commit
@marmbrus Yeah, thanks for the correction, updated.
LGTM. Merging to master and branch 1.5.
PR #7696 added two `HadoopFsRelation.refresh()` calls ([this][1] and [this][2]) in `DataSourceStrategy` to make the test case `InsertSuite.save directly to the path of a JSON table` pass. However, this forces every `HadoopFsRelation` table scan to do a refresh, which can be super expensive for tables with a large number of partitions.

The reason why the original test case fails without the `refresh()` calls is that the old JSON relation builds the base RDD from the input paths, while `HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON relation, we can create a temporary table based on a path, write data directly to that path, and then read the newly written data without refreshing the table. This is no longer true for `HadoopFsRelation`.

This PR removes those two expensive refresh calls and moves the refresh into `JSONRelation` to fix this issue. We might want to update the `HadoopFsRelation` interface to provide better support for this use case.

[1]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L63
[2]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L91

Author: Cheng Lian <[email protected]>

Closes #8035 from liancheng/spark-9743/fix-json-relation-refreshing and squashes the following commits:

ec1957d [Cheng Lian] Fixes JSONRelation refreshing

(cherry picked from commit e3fef0f)
Signed-off-by: Yin Huai <[email protected]>
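The shape of the change in `JSONRelation` is roughly as follows; this is a sketch based on the description above with an abbreviated signature, so see commit ec1957d for the exact diff:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter
import org.apache.spark.util.SerializableConfiguration

// Inside JSONRelation (sketch): refresh the cached file listing right
// before building the scan RDD, instead of refreshing on every
// HadoopFsRelation scan in DataSourceStrategy.
override private[sql] def buildScan(
    requiredColumns: Array[String],
    filters: Array[Filter],
    inputPaths: Array[String],
    broadcastedConf: Broadcast[SerializableConfiguration]): RDD[Row] = {
  // Re-list the files under the base input paths so data written
  // directly to the table path by other writers becomes visible.
  refresh()
  super.buildScan(requiredColumns, filters, inputPaths, broadcastedConf)
}
```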