[SPARK-19905][SQL] Bring back Dataset.inputFiles for Hive SerDe tables #17247
cc @cloud-fan
case fr: FileRelation =>
  fr.inputFiles
case r: CatalogRelation if DDLUtils.isHiveTable(r.tableMeta) =>
  r.tableMeta.storage.locationUri.map { _.toString }.toArray
Intentionally didn't use r.tableMeta.location here, for safety.
test("SPARK-19905: Hive SerDe table input paths") {
  withTable("spark_19905") {
    spark.range(10).createOrReplaceTempView("spark_19905_view")
withTempView
case fr: FileRelation =>
  fr.inputFiles
case r: CatalogRelation if DDLUtils.isHiveTable(r.tableMeta) =>
  r.tableMeta.storage.locationUri.map { _.toString }.toArray
nit: xx.map(_.toString)
LGTM
withTable("spark_19905") {
  spark.range(10).createOrReplaceTempView("spark_19905_view")
  sql("CREATE TABLE spark_19905 STORED AS RCFILE AS SELECT * FROM spark_19905_view")
  assert(spark.table("spark_19905").inputFiles.nonEmpty)
Also try sql("SELECT input_file_name() FROM spark_19905")?
input_file_name and Dataset.inputFiles are different code paths.
LGTM
Test build #74332 has finished for PR 17247 at commit
Test build #74336 has finished for PR 17247 at commit
thanks, merging to master!
What changes were proposed in this pull request?
Dataset.inputFiles works by matching FileRelations in the query plan. In Spark 2.1, Hive SerDe tables are represented by MetastoreRelation, which inherits from FileRelation. However, in Spark 2.2, Hive SerDe tables are now represented by CatalogRelation, which doesn't inherit from FileRelation anymore, due to the unification of Hive SerDe tables and data source tables. This change breaks Dataset.inputFiles for Hive SerDe tables.
This PR tries to fix this issue by explicitly matching CatalogRelations that are Hive SerDe tables in Dataset.inputFiles. Note that we can't make CatalogRelation inherit from FileRelation since not all CatalogRelations are file based (e.g., JDBC data source tables).
How was this patch tested?
New test case added in HiveDDLSuite.
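The matching described under the proposed changes can be illustrated without Spark. The following is a minimal, self-contained sketch, not the PR's code: Plan, FileBased, FileScan, CatalogScan, TableMeta, Union, and isHiveTable are simplified hypothetical stand-ins for Spark's plan nodes, FileRelation, CatalogRelation, and DDLUtils.isHiveTable, and the "hive" and "jdbc" provider strings are assumptions made for the example.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
trait Plan
trait FileBased { def inputFiles: Array[String] }

// A relation that is always backed by files (plays the role of FileRelation).
final case class FileScan(paths: Seq[String]) extends Plan with FileBased {
  def inputFiles: Array[String] = paths.toArray
}

// Catalog metadata: the provider name and an optional storage location.
final case class TableMeta(provider: String, locationUri: Option[String])

// A catalog-backed relation that is NOT necessarily file based.
final case class CatalogScan(tableMeta: TableMeta) extends Plan

// A node with children, so the plan is a tree rather than a single leaf.
final case class Union(children: Seq[Plan]) extends Plan

object InputFilesSketch {
  // Assumption: a table counts as a Hive SerDe table when its provider is "hive".
  def isHiveTable(meta: TableMeta): Boolean = meta.provider == "hive"

  // Collect the leaf nodes of the plan tree.
  def leaves(plan: Plan): Seq[Plan] = plan match {
    case Union(children) => children.flatMap(leaves)
    case leaf            => Seq(leaf)
  }

  // File-based relations report their own files; Hive tables contribute their
  // storage location if one is set; everything else contributes nothing.
  def inputFiles(plan: Plan): Array[String] =
    leaves(plan).flatMap {
      case fr: FileBased                          => fr.inputFiles.toSeq
      case CatalogScan(meta) if isHiveTable(meta) => meta.locationUri.toSeq
      case _                                      => Seq.empty[String]
    }.distinct.toArray
}
```

In this sketch a JDBC-style CatalogScan falls through to the last case, which mirrors why the description rules out making every catalog relation file based. Taking the location as an Option means a table without a location yields no paths instead of failing.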