
Conversation

@sameeragarwal
Member

What changes were proposed in this pull request?

One of the common usability problems around reading data in Spark (particularly CSV) is that there can often be a conflict between different readers on the classpath.

As an example, if someone launches a Spark 2.x shell with the spark-csv package on the classpath, Spark currently fails in an extremely unfriendly way (see https://github.com/databricks/spark-csv/issues/367):

./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
scala> val df = spark.read.csv("/foo/bar.csv")
java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
  at scala.sys.package$.error(package.scala:27)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
  ... 48 elided

This patch proposes a simple fix: always map the default input data source formats to the built-in classes that ship with Spark:

./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
scala> val df = spark.read.csv("/foo/bar.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string]
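
The core of the change is the lookup order: built-in short names are resolved to Spark's own classes before any classpath-wide scan, so an external package registering "csv" can no longer cause an ambiguity error. A minimal, self-contained sketch of the idea (map contents are illustrative, not the exact patch):

// Sketch only: built-in short names win before any classpath scan.
val builtinShortNamesMap: Map[String, String] = Map(
  "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

// Stand-in for Spark's real backward-compatibility map.
val backwardCompatibilityMap: Map[String, String] = Map(
  "com.databricks.spark.csv" ->
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

def lookupDataSource(provider: String): Class[_] = {
  val resolved = builtinShortNamesMap.getOrElse(provider,
    backwardCompatibilityMap.getOrElse(provider, provider))
  // The real implementation also consults a ServiceLoader for
  // DataSourceRegister implementations; Class.forName suffices here.
  Class.forName(resolved)
}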

How was this patch tested?

Existing tests.

@sameeragarwal
Member Author

cc @cloud-fan @HyukjinKwon thoughts?

@HyukjinKwon
Member

Ah, so does this always give Spark's internal data sources higher precedence instead of failing fast? I support the idea, but we might need to print a warning if multiple sources are detected for the same identifier (I don't think that is recommended...). Let me check today, as best I can, whether anything is missing.

@SparkQA

SparkQA commented May 4, 2017

Test build #76427 has finished for PR 17847 at commit 1af4675.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

It looks good to me except for a few comments. Probably, adding CSV only might be good enough, though. One additional comment: it might be nicer if we printed a warning when a data source is detected whose short name collides with an internal data source, if that can be done easily (a rough sketch follows).
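
A rough sketch of what such a warning could look like (a hypothetical helper, not part of this patch), piggybacking on the same ServiceLoader scan that data source resolution already performs:

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Hypothetical helper: after a short name has been resolved to a built-in
// class, log which external providers registered the same identifier and
// are therefore being shadowed.
def warnOnShadowedProviders(shortName: String, loader: ClassLoader): Unit = {
  val shadowed = ServiceLoader.load(classOf[DataSourceRegister], loader)
    .asScala
    .filter(_.shortName().equalsIgnoreCase(shortName))
    .filterNot(_.getClass.getName.startsWith("org.apache.spark"))
  if (shadowed.nonEmpty) {
    // logWarning would come from Spark's internal Logging trait.
    logWarning(s"Multiple sources found for $shortName; using the built-in " +
      "Spark implementation and ignoring: " +
      shadowed.map(_.getClass.getName).mkString(", "))
  }
}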

I tested the possible cases I could think of after manually building, as below:

./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0

positive cases:

scala> spark.range(1).write.format("csv").save("/tmp/abc")

scala> spark.range(1).write.format("com.databricks.spark.csv").save("/tmp/abc1")

scala> spark.range(1).write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").save("/tmp/abc2")

negative cases:

scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelation").save("/tmp/abc3")
java.lang.InstantiationException: com.databricks.spark.csv.CsvRelation
...
scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelatio").save("/tmp/abc3")
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv.CsvRelatio. Please find packages at http://spark.apache.org/third-party-projects.html
...

I also tested after manually changing the short name as below, to reproduce the case where an external data source registers the same name:

class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {

-  override def shortName(): String = "csv"
+  override def shortName(): String = "xml"

and

./bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
scala> spark.range(1).write.format("xml").save("/tmp/abc3")
java.lang.RuntimeException: Multiple sources found for xml (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.xml.DefaultSource15), please specify the fully qualified class name.
...

  def lookupDataSource(provider: String): Class[_] = {
-   val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+   val provider1 = builtinShortNamesMap.getOrElse(provider,
+     backwardCompatibilityMap.getOrElse(provider, provider))
Member
Should we maybe combine builtinShortNamesMap and backwardCompatibilityMap and use a single getOrElse? The current version seems a bit confusing to read.
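
For example (allSourceNamesMap is a name I am making up, not from the patch):

// Hypothetical merged map: with ++, entries from builtinShortNamesMap win
// on any key clash, preserving built-in-first precedence.
private val allSourceNamesMap: Map[String, String] =
  backwardCompatibilityMap ++ builtinShortNamesMap

def lookupDataSource(provider: String): Class[_] = {
  val provider1 = allSourceNamesMap.getOrElse(provider, provider)
  Class.forName(provider1)  // rest of the existing resolution logic elided
}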

@HyukjinKwon
Member
May 4, 2017

And I guess these lookups should be case-insensitive for short names.

 ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
scala> spark.range(1).write.format("Csv").save("/tmp/abc")
java.lang.RuntimeException: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
  at scala.sys.package$.error(package.scala:27)
  ...
 ./bin/spark-shell
scala> spark.range(1).write.format("Csv").save("/tmp/abc1")

Contributor

Yeah, short names should be case-insensitive.
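
A minimal sketch of case-insensitive resolution (illustrative only): keep the map keys lower-case by construction and normalize the incoming name once before the lookup.

import java.util.Locale

// Keys are lower-case by construction, so "Csv", "CSV" and "csv" all hit
// the same built-in entry.
val builtinShortNamesMap: Map[String, String] = Map(
  "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

def resolveBuiltin(provider: String): Option[String] =
  builtinShortNamesMap.get(provider.toLowerCase(Locale.ROOT))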

"com.databricks.spark.csv" -> csv
  )

+ private val builtinShortNamesMap: Map[String, String] = Map(
Member

It would probably be nicer to explain why this is needed, with a small comment on why the short names of internal data sources should be mapped to fully qualified class names.
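
For example, a comment along these lines (wording is mine, not from the patch):

// Spark's built-in data sources are mapped from their short names to fully
// qualified class names so that they always resolve to the classes shipped
// with Spark, even when an external package (e.g. spark-csv) registers the
// same short name on the classpath.
private val builtinShortNamesMap: Map[String, String] = Map(
  "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")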

@HyukjinKwon
Member

@sameeragarwal BTW, should we add hive, kafka, socket, text and console too?

@sameeragarwal
Member Author

I'm closing this in favor of #17916
