
Conversation

@liancheng (Contributor)

PR #7696 added two HadoopFsRelation.refresh() calls (this, and this) in DataSourceStrategy to make the test case InsertSuite.save directly to the path of a JSON table pass. However, this forces every HadoopFsRelation table scan to do a refresh, which can be super expensive for tables with a large number of partitions.

The reason the original test case fails without the refresh() calls is that the old JSON relation builds the base RDD from the input paths, while HadoopFsRelation provides FileStatuses of leaf files. With the old JSON relation, we can create a temporary table based on a path, write data to that path, and then read the newly written data without refreshing the table. This is no longer true for HadoopFsRelation.

This PR removes those two expensive refresh calls and moves the refresh into JSONRelation to fix this issue. We might want to update the HadoopFsRelation interface to provide better support for this use case.
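
As a rough sketch of the shape of the fix, using simplified, hypothetical signatures (the actual HadoopFsRelation and JSONRelation interfaces carry more parameters, such as required columns and filters):

```scala
// Simplified, hypothetical sketch of the change described above; the
// actual Spark 1.5 interfaces take more parameters than shown here.
abstract class HadoopFsRelationSketch(val basePaths: Array[String]) {
  // Re-lists leaf files under basePaths and repopulates the FileStatus cache.
  def refresh(): Unit = { /* re-list leaf files */ }
}

class JsonRelationSketch(basePaths: Array[String])
  extends HadoopFsRelationSketch(basePaths) {

  // The JSON relation builds its base RDD from the *base* input paths, so
  // refreshing here, rather than unconditionally in DataSourceStrategy,
  // keeps newly written files visible without slowing down Parquet/ORC scans.
  def buildScan(requiredColumns: Array[String]): Unit = {
    refresh()
    // ... build the base RDD over basePaths ...
  }
}
```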

@liancheng (Contributor Author)

cc @yhuai

@liancheng (Contributor Author)

cc @chenghao-intel

Contributor

Is that the same as the refresh() invoked in DataSourceStrategy? Every time we build the RDD, we need to refresh the file status.

Actually, I've updated #8023 by simply removing all of the refresh() calls in DataSourceStrategy; InsertIntoHadoopFsRelation will update the file status implicitly while writing the data.

Contributor Author

The refresh added in DataSourceStrategy also affects Parquet and ORC.

Contributor Author

I haven't finished reviewing #8023; let's have an offline discussion about it tomorrow :)

Contributor

Why does calling refresh() here work? Does inputPaths contain all file paths or directory paths?

Contributor Author

inputPaths contains the "base" input paths, while the inputPaths parameter of the other buildScan method in JSONRelation contains the FileStatuses of leaf files under the base input paths, which are retrieved from the FileStatusCache in HadoopFsRelation.
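
To illustrate the distinction, a hedged sketch with simplified, hypothetical signatures (not the exact Spark 1.5 overloads):

```scala
import org.apache.hadoop.fs.FileStatus

// Hypothetical sketch of the two buildScan variants being contrasted.
trait BuildScanVariants {
  // Receives the "base" input paths (e.g. the table's root directories);
  // a refresh() before this scan can re-discover newly written files.
  def buildScan(inputPaths: Array[String]): Unit

  // Receives FileStatuses of leaf files taken from HadoopFsRelation's
  // FileStatusCache; these are stale until refresh() re-lists the files.
  def buildScan(inputFiles: Array[FileStatus]): Unit
}
```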

Contributor

What issue? That cached metadata can be out of date when users add data directly? Because refreshing ORC and Parquet tables' metadata is expensive, I think that is an expected behavior (users need to call refresh explicitly).

Contributor

OK, but it seems only ParquetRelation refreshes the metadata; for JSON and ORC, only the file status is refreshed.

The problem here is that once the table is cached, even if people call refresh() explicitly, we still won't get the refreshed data.

Contributor

For JSON, we do not have any metadata to cache other than the file status, right?

"once the table cached" => you mean after we call cache table? I do see we drop the cached logical plan from our cache manager. Can you try it and see if we drop the cached table data?
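
A hedged walkthrough of the scenario in question, using public Spark 1.x APIs; the path and table name are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CacheRefreshSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("cache-refresh-sketch"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical path and table name, for illustration only.
    val df = sqlContext.read.json("/tmp/events")
    df.registerTempTable("events")
    sqlContext.cacheTable("events") // caches the table *data* (columnar cache)

    // ... suppose another application now writes new JSON files under /tmp/events ...

    // Refreshing the relation re-lists files, but the cached data is a separate
    // layer; uncaching forces the next scan to read from the source again.
    sqlContext.uncacheTable("events")
    sqlContext.table("events").show()
  }
}
```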

Contributor

I don't mean we need to refresh the schema / metadata, but rather fetch the latest files under the table path.

Contributor

Since the files under the table path may have been changed by other applications, I think we have to refresh the file status before scanning the files.

Probably we can move the Parquet metadata refreshing out of HadoopFsRelation.refresh().
Update: I mean out of ParquetRelation.refresh().
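
A sketch of the split being suggested, using hypothetical method names; the point is that re-listing files is much cheaper than re-reading Parquet footers:

```scala
// Hypothetical split of ParquetRelation.refresh() into two steps.
class ParquetRelationSketch {
  // Relatively cheap: re-list leaf files so the scan sees files added by
  // other applications (what a JSON/ORC refresh amounts to).
  def refreshFileStatuses(): Unit = { /* re-list leaf files */ }

  // Expensive: re-read Parquet footers to rebuild schema and metadata.
  def refreshParquetMetadata(): Unit = { /* re-read footers */ }

  // The common refresh() path only re-lists files; the costly metadata
  // refresh becomes a separate, explicit step.
  def refresh(): Unit = refreshFileStatuses()
}
```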

@SparkQA

SparkQA commented Aug 7, 2015

Test build #40176 timed out for PR 8035 at commit ec1957d after a configured wait of 175m.

@yhuai (Contributor)

yhuai commented Aug 7, 2015

test this please

Contributor

When we create RDDs for a partitioned JSON table, are we using this broadcastedConf?

Contributor Author

No, 'cause we're not using SqlNewHadoopRDD here. This method was originally marked as final in HadoopFsRelation and wasn't supposed to be overridden in concrete data source implementations.

Contributor

Then, if we read a table with lots of partitions, the time spent on planning the query will be pretty long, right?

Contributor

Let's file a follow-up JIRA to reuse this broadcastedConf, so we will not broadcast the Hadoop conf for every RDD created for every partition.

Contributor

BTW, do we have a JIRA for investigating using SerializableConfiguration instead of this shared broadcast conf for 1.6?
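
For context, SerializableConfiguration refers to a Serializable wrapper around Hadoop's non-serializable Configuration, letting each task carry the conf in its closure instead of depending on a shared broadcast. A sketch of what such a wrapper looks like (not Spark's exact internal class):

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

// Sketch of a Serializable wrapper for Hadoop's Configuration (which is
// not Serializable itself); Spark keeps an internal class along these lines.
class SerializableConfigurationSketch(@transient var value: Configuration)
  extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out) // Configuration implements Hadoop's Writable
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}
```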

@marmbrus (Contributor)

marmbrus commented Aug 7, 2015

I think you mean #7696

@SparkQA

SparkQA commented Aug 8, 2015

Test build #40200 has finished for PR 8035 at commit ec1957d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor Author)

@marmbrus Yeah, thanks for the correction, updated.

@yhuai (Contributor)

yhuai commented Aug 10, 2015

LGTM. Merging to master and branch 1.5.

asfgit pushed a commit that referenced this pull request Aug 10, 2015
PR #7696 added two `HadoopFsRelation.refresh()` calls ([this] [1], and [this] [2]) in `DataSourceStrategy` to make the test case `InsertSuite.save directly to the path of a JSON table` pass. However, this forces every `HadoopFsRelation` table scan to do a refresh, which can be super expensive for tables with a large number of partitions.

The reason the original test case fails without the `refresh()` calls is that the old JSON relation builds the base RDD from the input paths, while `HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON relation, we can create a temporary table based on a path, write data to that path, and then read the newly written data without refreshing the table. This is no longer true for `HadoopFsRelation`.

This PR removes those two expensive refresh calls and moves the refresh into `JSONRelation` to fix this issue. We might want to update the `HadoopFsRelation` interface to provide better support for this use case.

[1]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L63
[2]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L91

Author: Cheng Lian <[email protected]>

Closes #8035 from liancheng/spark-9743/fix-json-relation-refreshing and squashes the following commits:

ec1957d [Cheng Lian] Fixes JSONRelation refreshing

(cherry picked from commit e3fef0f)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in e3fef0f Aug 10, 2015
@liancheng liancheng deleted the spark-9743/fix-json-relation-refreshing branch August 10, 2015 18:00
CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
