[SPARK-20590] Map default input data source formats to inlined classes #17847
Conversation
cc @cloud-fan @HyukjinKwon thoughts?
Ah, so does this always give Spark's internal data sources higher precedence instead of failing fast? I support the idea, but we might need to print a warning if multiple sources are detected for the same identifier (I don't think that is recommended...). Let me check today, as best I can, whether anything is missing.
Test build #76427 has finished for PR 17847 at commit
It looks good to me except for a few comments. Probably, adding CSV only might be good enough, though. One additional comment: I think it would be nicer if we printed a warning when external data sources that share a short name with an internal data source are detected, if that can be done easily.
I tested the possible cases I could think of after manually building, as below:
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0

positive cases:
scala> spark.range(1).write.format("csv").save("/tmp/abc")
scala> spark.range(1).write.format("com.databricks.spark.csv").save("/tmp/abc1")
scala> spark.range(1).write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").save("/tmp/abc2")negative cases:
scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelation").save("/tmp/abc3")
java.lang.InstantiationException: com.databricks.spark.csv.CsvRelation
...

scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelatio").save("/tmp/abc3")
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv.CsvRelatio. Please find packages at http://spark.apache.org/third-party-projects.html
...

I also tested after manually editing the source as below, to reproduce the case where an external data source registers the same short name as an internal one:
class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
- override def shortName(): String = "csv"
+ override def shortName(): String = "xml"

and
./bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1

scala> spark.range(1).write.format("xml").save("/tmp/abc3")
java.lang.RuntimeException: Multiple sources found for xml (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.xml.DefaultSource15), please specify the fully qualified class name.
...

  def lookupDataSource(provider: String): Class[_] = {
-   val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+   val provider1 = builtinShortNamesMap.getOrElse(provider,
+     backwardCompatibilityMap.getOrElse(provider, provider))
Should we maybe combine builtinShortNamesMap and backwardCompatibilityMap and use a single getOrElse? The nested lookup reads a bit confusingly.
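A minimal sketch of the suggested single-getOrElse lookup; the object name and the placeholder map contents are illustrative, not the PR's actual entries:

object SingleLookupSketch {
  private val backwardCompatibilityMap: Map[String, String] = Map(
    "com.databricks.spark.csv" ->
      "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

  private val builtinShortNamesMap: Map[String, String] = Map(
    "csv" ->
      "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

  // Merge once at definition time so the lookup path needs only a
  // single getOrElse; built-in short names win on key collisions.
  private val compatibilityMap: Map[String, String] =
    backwardCompatibilityMap ++ builtinShortNamesMap

  def resolveProvider(provider: String): String =
    compatibilityMap.getOrElse(provider, provider)
}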
And I guess these should be case-insensitive for short names.
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0

scala> spark.range(1).write.format("Csv").save("/tmp/abc")
java.lang.RuntimeException: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
at scala.sys.package$.error(package.scala:27)
...

./bin/spark-shell

scala> spark.range(1).write.format("Csv").save("/tmp/abc1")
yea, short names should be case insensitive
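A sketch of case-insensitive short-name lookup; normalizing keys to lowercase at lookup time is an assumption here, not necessarily where the PR normalizes:

import java.util.Locale

object CaseInsensitiveLookupSketch {
  // Keys are assumed to be stored lowercase.
  val builtinShortNamesMap: Map[String, String] = Map(
    "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

  def resolveBuiltin(provider: String): Option[String] =
    builtinShortNamesMap.get(provider.toLowerCase(Locale.ROOT))
}

// Both of these now hit the built-in class:
// CaseInsensitiveLookupSketch.resolveBuiltin("csv")  => Some(...CSVFileFormat)
// CaseInsensitiveLookupSketch.resolveBuiltin("Csv")  => Some(...CSVFileFormat)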
| "com.databricks.spark.csv" -> csv | ||
| ) | ||
|
|
||
| private val builtinShortNamesMap: Map[String, String] = Map( |
Probably, it would be nicer if we explained why this one is needed, with a small comment about why the short names of internal data sources should be mapped to fully qualified names.
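A sketch of the requested doc comment on the declaration; the entries beyond "csv" are assumptions based on Spark's built-in formats, not necessarily this PR's final diff:

// Maps the short names of Spark's built-in data sources to their fully
// qualified class names, so that a short name like "csv" always resolves
// to the internal implementation even when an external package
// (e.g. spark-csv) registers the same short name via DataSourceRegister.
private val builtinShortNamesMap: Map[String, String] = Map(
  "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
  "json" -> "org.apache.spark.sql.execution.datasources.json.JsonFileFormat",
  "parquet" ->
    "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"
)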
@sameeragarwal BTW, should we add
I'm closing this in favor of #17916
What changes were proposed in this pull request?
One of the common usability problems with reading data in Spark (particularly CSV) is that there can often be a conflict between different readers on the classpath.
As an example, if someone launches a Spark 2.x shell with the spark-csv package on the classpath, Spark currently fails in an extremely unfriendly way (see https://github.com/databricks/spark-csv/issues/367):
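The failure is the ambiguous-source error, the same one reproduced in the review thread above; the path below is illustrative:

scala> spark.range(1).write.format("csv").save("/tmp/out")
java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
...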
This patch proposes a simple way of fixing this error: always map the default input data source formats to the inlined classes that exist in Spark:
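With the mapping in place, both of the following resolve unambiguously even with spark-csv on the classpath, as verified in the review thread above:

scala> spark.range(1).write.format("csv").save("/tmp/abc")
scala> spark.range(1).write.format("com.databricks.spark.csv").save("/tmp/abc1")

The first now always picks the built-in CSVFileFormat; the second still reaches the external package through its fully qualified name.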
How was this patch tested?
Existing Tests