[SPARK-9743] [SQL] Fixes JSONRelation refreshing #8035
Conversation
cc @yhuai
Is this the same as the refresh invoked in DataSourceStrategy? Every time we build the RDD, we need to refresh the file status.
Actually, I've updated #8023 by simply removing all of the refresh calls in DataSourceStrategy; InsertIntoHadoopFsRelation will update the file status implicitly while writing the data.
The refresh added in DataSourceStrategy also affects Parquet and ORC.
I haven't finished reviewing #8023, let's have an offline discussion about it tomorrow :)
Why does calling refresh here work? Does inputPaths contain all file paths, or just directory paths?
inputPaths contains the "base" input paths, while the inputPaths argument of the other buildScan method in JSONRelation contains FileStatuses of the leaf files under the base input paths, which are retrieved from the FileStatusCache in HadoopFsRelation.
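For reference, a rough sketch of the two buildScan variants being compared; the signatures are abbreviated from the Spark 1.5 sources, so treat this as illustrative rather than the exact code:

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Abbreviated sketch of the two buildScan variants under discussion.
abstract class HadoopFsRelationSketch {
  // Variant fed from the FileStatusCache: receives FileStatuses of the
  // leaf files already discovered under the base input paths.
  def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter],
      inputFiles: Array[FileStatus]): RDD[Row]

  // Variant that only receives the "base" input paths as strings, so the
  // actual file listing happens while the RDD is built.
  def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter],
      inputPaths: Array[String]): RDD[Row]
}
```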
What issue? That cached metadata can be out-of-date when users add data directly? Because refreshing ORC and Parquet tables' metadata is expensive, I think that is expected behavior (users need to call refresh explicitly).
OK, but it seems that only ParquetRelation will refresh the metadata; for JSON and ORC, only the file status will be refreshed.
The problem here is that once the table is cached, even if refresh() is called implicitly, we still will not get the refreshed data.
For JSON, we do not have any metadata to cache other than the file status, right?
"once the table cached" => you mean after we call CACHE TABLE? I do see that we drop the cached logical plan from our cache manager. Can you try it and see if we drop the cached table data?
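To make the scenario concrete, here is a hedged sketch of the flow being debated. It mirrors the spirit of InsertSuite."save directly to the path of a JSON table", not the exact test; `sqlContext`, `path`, and `newData` are assumed to be in scope:

```scala
// Hypothetical repro sketch (Spark 1.5-era API); assumes an in-scope
// SQLContext `sqlContext`, a directory `path`, and a DataFrame `newData`.
val df = sqlContext.read.json(path)        // temporary table over `path`
df.registerTempTable("jt")
sqlContext.sql("CACHE TABLE jt")           // materialize the table data

// Another writer saves directly to the table path.
newData.write.mode("overwrite").json(path)

// Question under discussion: after this, does a scan of `jt` pick up the
// new files, and does uncaching/refreshing drop the stale cached data?
sqlContext.table("jt").count()
```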
I don't mean that we need to refresh the schema/metadata, but that we should fetch the latest files under the table path.
Since the files under the table path may be changed by other applications, I think we have to refresh the file status before scanning the files.
Perhaps we can move the Parquet metadata refreshing out of HadoopFsRelation.refresh.
Update: I mean out of ParquetRelation.refresh().
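As a hypothetical illustration of that split (the names here are invented, not Spark's actual classes):

```scala
// Hypothetical sketch of the proposed split: cheap file-status listing is
// refreshed before every scan, while expensive format-specific metadata
// (e.g. Parquet footers) is only refreshed on an explicit user request.
trait RelationRefresh {
  // Cheap: re-list the leaf files under the base paths.
  def refreshFileStatuses(): Unit

  // Expensive: re-read format metadata such as Parquet footers; left to
  // an explicit REFRESH TABLE call.
  def refreshFormatMetadata(): Unit

  // Always pick up files written by other applications before scanning.
  final def beforeScan(): Unit = refreshFileStatuses()
}
```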
Test build #40176 timed out for PR 8035 at commit
test this please
When we create RDDs for a partitioned JSON table, are we using this broadcastedConf?
No, because we're not using SqlNewHadoopRDD here. This method was originally marked as final in HadoopFsRelation and wasn't supposed to be overridden in concrete data source implementations.
Then, if we read a table with lots of partitions, the time spent on planning the query will be pretty long, right?
Let's file a follow-up JIRA to reuse this broadcastedConf, so we will not broadcast the Hadoop conf for every RDD created for every partition.
BTW, do we have a JIRA for investigating the use of SerializableConfiguration instead of this shared broadcast conf for 1.6?
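For context, the broadcast-once pattern under discussion looks roughly like this. Spark's real SerializableConfiguration is a private utility class, so a minimal stand-in is reproduced here:

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

// Minimal stand-in for Spark's internal SerializableConfiguration (the
// real one is private[spark]); it makes a Hadoop Configuration
// Java-serializable so it can be broadcast once and reused by many RDDs.
class SerializableConfiguration(@transient var value: Configuration)
  extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out) // Configuration is a Writable and serializes itself
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

// Usage sketch: broadcast the conf once per relation instead of once per
// RDD (assumes an in-scope SparkContext `sc`); tasks then read
// broadcastedConf.value.value.
//   val broadcastedConf =
//     sc.broadcast(new SerializableConfiguration(sc.hadoopConfiguration))
```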
I think you mean #7696
Test build #40200 has finished for PR 8035 at commit
@marmbrus Yeah, thanks for the correction, updated.
LGTM. Merging to master and branch 1.5.
PR #7696 added two `HadoopFsRelation.refresh()` calls ([this][1] and [this][2]) in `DataSourceStrategy` to make the test case `InsertSuite.save directly to the path of a JSON table` pass. However, this forces every `HadoopFsRelation` table scan to do a refresh, which can be super expensive for tables with a large number of partitions.

The reason why the original test case fails without the `refresh()` calls is that the old JSON relation builds the base RDD from the input paths, while `HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON relation, we can create a temporary table based on a path, write data directly to that path, and then read the newly written data without refreshing the table. This is no longer true for `HadoopFsRelation`.

This PR removes those two expensive refresh calls and moves the refresh into `JSONRelation` to fix this issue. We might want to update the `HadoopFsRelation` interface to provide better support for this use case.

[1]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L63
[2]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L91

Author: Cheng Lian <[email protected]>

Closes #8035 from liancheng/spark-9743/fix-json-relation-refreshing and squashes the following commits:

ec1957d [Cheng Lian] Fixes JSONRelation refreshing

(cherry picked from commit e3fef0f)
Signed-off-by: Yin Huai <[email protected]>
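The shape of the change in `JSONRelation` is roughly as follows; this is a sketch based on the description above with an abbreviated signature, so see commit ec1957d for the exact diff:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter
import org.apache.spark.util.SerializableConfiguration

// Inside JSONRelation (sketch): refresh the cached file listing right
// before building the scan RDD, instead of refreshing on every
// HadoopFsRelation scan in DataSourceStrategy.
override private[sql] def buildScan(
    requiredColumns: Array[String],
    filters: Array[Filter],
    inputPaths: Array[String],
    broadcastedConf: Broadcast[SerializableConfiguration]): RDD[Row] = {
  // Re-list the files under the base input paths so data written
  // directly to the table path by other writers becomes visible.
  refresh()
  super.buildScan(requiredColumns, filters, inputPaths, broadcastedConf)
}
```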