
Conversation

@liancheng
Contributor

What changes were proposed in this pull request?

Dataset.inputFiles works by matching FileRelations in the query plan. In Spark 2.1, Hive SerDe tables are represented by MetastoreRelation, which inherits from FileRelation. In Spark 2.2, however, Hive SerDe tables are represented by CatalogRelation, which no longer inherits from FileRelation because Hive SerDe tables and data source tables have been unified. This change breaks Dataset.inputFiles for Hive SerDe tables.

This PR tries to fix this issue by explicitly matching CatalogRelations that are Hive SerDe tables in Dataset.inputFiles. Note that we can't make CatalogRelation inherit from FileRelation since not all CatalogRelations are file based (e.g., JDBC data source tables).
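In rough terms, the fix adds a second case to the plan match inside Dataset.inputFiles. The sketch below is illustrative only: the two cases come from the diff, while the surrounding collect over the optimized plan and the de-duplication step are assumed rather than quoted from the actual source.

def inputFiles: Array[String] = {
  val files: Seq[String] = queryExecution.optimizedPlan.collect {
    // Data source tables and other file-based relations: unchanged path.
    case fr: FileRelation =>
      fr.inputFiles
    // Hive SerDe tables are CatalogRelations in Spark 2.2, so match them
    // explicitly and fall back to the table's storage location.
    case r: CatalogRelation if DDLUtils.isHiveTable(r.tableMeta) =>
      r.tableMeta.storage.locationUri.map(_.toString).toArray
  }.flatten
  files.toSet.toArray
}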

How was this patch tested?

New test case added in HiveDDLSuite.

@liancheng
Contributor Author

cc @cloud-fan

case fr: FileRelation =>
  fr.inputFiles
case r: CatalogRelation if DDLUtils.isHiveTable(r.tableMeta) =>
  r.tableMeta.storage.locationUri.map { _.toString }.toArray
Contributor Author


Didn't use r.tableMeta.location here intentionally, for safety.
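Presumably the concern is that CatalogTable.location unwraps the Option and is expected to fail when a table carries no location, whereas going through storage.locationUri simply yields an empty array. A minimal contrast, with the behaviour of location treated as an assumption:

// Yields Array() when the table has no location set.
val inputPaths: Array[String] =
  r.tableMeta.storage.locationUri.map(_.toString).toArray

// The stricter accessor would be expected to throw for a location-less table:
// val path: String = r.tableMeta.location.toString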


test("SPARK-19905: Hive SerDe table input paths") {
withTable("spark_19905") {
spark.range(10).createOrReplaceTempView("spark_19905_view")
Contributor


withTempView
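That is, the temporary view should also be scoped so it is dropped after the test. A sketch of the test with that suggestion applied, reusing the body shown in this thread:

test("SPARK-19905: Hive SerDe table input paths") {
  withTable("spark_19905") {
    withTempView("spark_19905_view") {
      spark.range(10).createOrReplaceTempView("spark_19905_view")
      sql("CREATE TABLE spark_19905 STORED AS RCFILE AS SELECT * FROM spark_19905_view")
      assert(spark.table("spark_19905").inputFiles.nonEmpty)
    }
  }
}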

case fr: FileRelation =>
  fr.inputFiles
case r: CatalogRelation if DDLUtils.isHiveTable(r.tableMeta) =>
  r.tableMeta.storage.locationUri.map { _.toString }.toArray
Contributor


nit: xx.map(_.toString)

@cloud-fan
Contributor

LGTM

withTable("spark_19905") {
spark.range(10).createOrReplaceTempView("spark_19905_view")
sql("CREATE TABLE spark_19905 STORED AS RCFILE AS SELECT * FROM spark_19905_view")
assert(spark.table("spark_19905").inputFiles.nonEmpty)
Member


Also try sql("SELECT input_file_name() FROM spark_19905")?

Contributor Author


input_file_name and Dataset.inputFiles are different code paths.
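For reference, the suggested check would look roughly like the sketch below; it goes through the per-row input_file_name() expression, not the plan inspection done by Dataset.inputFiles, so it would not cover the change in this PR.

// Hypothetical extra check along the reviewer's suggestion. input_file_name()
// is evaluated per row during the scan, so it exercises a different code path
// than Dataset.inputFiles, which only inspects the logical plan.
val fileNames = sql("SELECT DISTINCT input_file_name() FROM spark_19905")
  .collect()
  .map(_.getString(0))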

@gatorsmile
Member

LGTM

@SparkQA

SparkQA commented Mar 10, 2017

Test build #74332 has finished for PR 17247 at commit 3e0abc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 10, 2017

Test build #74336 has finished for PR 17247 at commit 7e24047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

asfgit closed this in ffee4f1 on Mar 10, 2017