
Conversation

@sameeragarwal
Member

What changes were proposed in this pull request?

One of the common usability problems around reading data in Spark (particularly CSV) is that there can often be a conflict between different readers on the classpath.

As an example, if someone launches a Spark 2.x shell with the spark-csv package on the classpath, Spark currently fails in an extremely unfriendly way (see https://github.com/databricks/spark-csv/issues/367):

./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
scala> val df = spark.read.csv("/foo/bar.csv")
java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
  at scala.sys.package$.error(package.scala:27)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
  ... 48 elided

This patch proposes a simple fix: always map the default input data source formats to the built-in classes that ship with Spark:

./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
scala> val df = spark.read.csv("/foo/bar.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string]
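
The core of the change is the lookup order: built-in short names are resolved to Spark's own classes before any classpath-wide scan, so an external package registering "csv" can no longer cause an ambiguity error. A minimal, self-contained sketch of the idea (map contents are illustrative, not the exact patch):

// Sketch only: built-in short names win before any classpath scan.
val builtinShortNamesMap: Map[String, String] = Map(
  "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

// Stand-in for Spark's real backward-compatibility map.
val backwardCompatibilityMap: Map[String, String] = Map(
  "com.databricks.spark.csv" ->
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

def lookupDataSource(provider: String): Class[_] = {
  val resolved = builtinShortNamesMap.getOrElse(provider,
    backwardCompatibilityMap.getOrElse(provider, provider))
  // The real implementation also consults a ServiceLoader for
  // DataSourceRegister implementations; Class.forName suffices here.
  Class.forName(resolved)
}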

How was this patch tested?

Existing tests.

@sameeragarwal
Member Author

cc @cloud-fan @HyukjinKwon thoughts?

@HyukjinKwon
Member

Ah, so does this always give Spark's internal data sources higher precedence instead of failing fast? I support the idea, but we might need to print a warning if multiple sources are detected for the same identifier (I don't think that is recommended...). Let me check today, as best I can, whether anything is missing.

@SparkQA

SparkQA commented May 4, 2017

Test build #76427 has finished for PR 17847 at commit 1af4675.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

It looks good to me except for a few comments. Probably, adding CSV only might be good enough, though. One additional comment: it might be nicer if we printed a warning when a data source is detected whose short name collides with an internal data source, if that can be done easily (a rough sketch follows).
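
A rough sketch of what such a warning could look like (a hypothetical helper, not part of this patch), piggybacking on the same ServiceLoader scan that data source resolution already performs:

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Hypothetical helper: after a short name has been resolved to a built-in
// class, log which external providers registered the same identifier and
// are therefore being shadowed.
def warnOnShadowedProviders(shortName: String, loader: ClassLoader): Unit = {
  val shadowed = ServiceLoader.load(classOf[DataSourceRegister], loader)
    .asScala
    .filter(_.shortName().equalsIgnoreCase(shortName))
    .filterNot(_.getClass.getName.startsWith("org.apache.spark"))
  if (shadowed.nonEmpty) {
    // logWarning would come from Spark's internal Logging trait.
    logWarning(s"Multiple sources found for $shortName; using the built-in " +
      "Spark implementation and ignoring: " +
      shadowed.map(_.getClass.getName).mkString(", "))
  }
}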

I tested the possible cases I could think of after manually building, as below:

./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0

positive cases:

scala> spark.range(1).write.format("csv").save("/tmp/abc")

scala> spark.range(1).write.format("com.databricks.spark.csv").save("/tmp/abc1")

scala> spark.range(1).write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").save("/tmp/abc2")

negative cases:

scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelation").save("/tmp/abc3")
java.lang.InstantiationException: com.databricks.spark.csv.CsvRelation
...
scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelatio").save("/tmp/abc3")
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv.CsvRelatio. Please find packages at http://spark.apache.org/third-party-projects.html
...

I also tested after manually changing the short name as below, to reproduce the case where an external data source registers the same name:

class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {

-  override def shortName(): String = "csv"
+  override def shortName(): String = "xml"

and

./bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
scala> spark.range(1).write.format("xml").save("/tmp/abc3")
java.lang.RuntimeException: Multiple sources found for xml (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.xml.DefaultSource15), please specify the fully qualified class name.
...

  def lookupDataSource(provider: String): Class[_] = {
-   val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+   val provider1 = builtinShortNamesMap.getOrElse(provider,
+     backwardCompatibilityMap.getOrElse(provider, provider))
Member
Should we maybe combine builtinShortNamesMap and backwardCompatibilityMap and use a single getOrElse? The current version seems a bit confusing to read.
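
For example (allSourceNamesMap is a name I am making up, not from the patch):

// Hypothetical merged map: with ++, entries from builtinShortNamesMap win
// on any key clash, preserving built-in-first precedence.
private val allSourceNamesMap: Map[String, String] =
  backwardCompatibilityMap ++ builtinShortNamesMap

def lookupDataSource(provider: String): Class[_] = {
  val provider1 = allSourceNamesMap.getOrElse(provider, provider)
  Class.forName(provider1)  // rest of the existing resolution logic elided
}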

@HyukjinKwon
Member
May 4, 2017

And I guess these lookups should be case-insensitive for short names.

 ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
scala> spark.range(1).write.format("Csv").save("/tmp/abc")
java.lang.RuntimeException: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
  at scala.sys.package$.error(package.scala:27)
  ...
 ./bin/spark-shell
scala> spark.range(1).write.format("Csv").save("/tmp/abc1")

Contributor

Yeah, short names should be case-insensitive.
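
A minimal sketch of case-insensitive resolution (illustrative only): keep the map keys lower-case by construction and normalize the incoming name once before the lookup.

import java.util.Locale

// Keys are lower-case by construction, so "Csv", "CSV" and "csv" all hit
// the same built-in entry.
val builtinShortNamesMap: Map[String, String] = Map(
  "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

def resolveBuiltin(provider: String): Option[String] =
  builtinShortNamesMap.get(provider.toLowerCase(Locale.ROOT))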

"com.databricks.spark.csv" -> csv
  )

+ private val builtinShortNamesMap: Map[String, String] = Map(
Member

It would probably be nicer to explain why this is needed, with a small comment on why the short names of internal data sources should be mapped to fully qualified class names.
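
For example, a comment along these lines (wording is mine, not from the patch):

// Spark's built-in data sources are mapped from their short names to fully
// qualified class names so that they always resolve to the classes shipped
// with Spark, even when an external package (e.g. spark-csv) registers the
// same short name on the classpath.
private val builtinShortNamesMap: Map[String, String] = Map(
  "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")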

@HyukjinKwon
Member

@sameeragarwal BTW, should we add hive, kafka, socket, text and console too?

@sameeragarwal
Member Author

I'm closing this in favor of #17916
