
Conversation

@lxsmnv commented Feb 20, 2017

What changes were proposed in this pull request?

The root cause of the problem is that when Spark infers a schema from a CSV file, it tries to resolve the file path pattern more than once, calling DataSource.resolveRelation each time.

So, if we have a file path pattern like:
<...>/test*
and the actual file is named test{00-1}.txt,
then the initial call to DataSource.resolveRelation resolves the pattern to /<...>/test{00-1}.txt. When Spark then tries to infer the schema of the CSV file, it calls DataSource.resolveRelation a second time. This second attempt fails because the file name /<...>/test{00-1}.txt is now treated as a glob pattern rather than as an actual file, and since no file matches that pattern, the whole DataSource.resolveRelation call fails.
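
For illustration, here is a minimal reproduction of the failure (the directory layout and session setup are hypothetical, not from this PR):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("glob-repro").getOrCreate()

// Suppose /tmp/data contains exactly one file, named test{00-1}.txt.
// The first resolution pass expands the glob below to that file name.
// Schema inference then resolves the relation a second time, now treating
// test{00-1}.txt itself as a glob pattern; nothing matches it, so the read
// fails with a "path does not exist" error.
val df = spark.read.csv("/tmp/data/test*")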

The idea behind the fix is quite straightforward: the part of DataSource.resolveRelation that creates the Hadoop relation from the resolved (actual) file names is moved into a separate function, createHadoopRelation. CSVFileFormat.createBaseDataset now calls this new function instead of DataSource.resolveRelation, which was what caused the unnecessary second path resolution.
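
For reference, a self-contained sketch of the underlying "resolve the glob only once" idea (this is not the PR's actual code, and the helper name resolveGlobOnce is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Expand a glob pattern exactly once and hand the concrete paths downstream,
// so that file names containing glob characters are never re-expanded.
def resolveGlobOnce(pattern: String, conf: Configuration): Seq[Path] = {
  val path = new Path(pattern)
  val fs = path.getFileSystem(conf)
  // globStatus returns null when nothing matches the pattern
  Option(fs.globStatus(path)).toSeq.flatten.map(_.getPath)
}

createHadoopRelation plays a similar role in the PR: it receives globPaths that are already expanded and builds the relation from them directly, without re-resolving.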

How was this patch tested?

Manual tests.

This contribution is my original work and I license the work to the project under the project’s open source license.
Please review http://spark.apache.org/contributing.html before opening a pull request.

@AmplabJenkins commented:

Can one of the admins verify this patch?

@maropu (Member) commented Feb 20, 2017

Could you add tests for this PR?

* @return Hadoop relation object
*/
def createHadoopRelation(format: FileFormat,
    globPaths: Array[Path]): BaseRelation = {
@viirya (Member) commented:

You call getOrInferFileFormatSchema twice. One call is before calling createHadoopRelation.

@lxsmnv (Author) commented:

@viirya I will fix this. Looks like a merge issue.
@maropu I will add tests.

@HyukjinKwon (Member) commented Feb 20, 2017

@lxsmnv, could you check whether this is a more general problem? I suspect it is not only a CSV-specific issue. IIRC, when I saw this JIRA I tested several cases with other data sources, and some of them did not work correctly.

@lxsmnv (Author) commented Feb 21, 2017

@viirya I've removed the duplicate call to getOrInferFileFormatSchema. Thanks for pointing that out.
@maropu I've added a test case.

@lxsmnv (Author) commented Feb 21, 2017

@HyukjinKwon the problem I found was in CSVFileFormat, so it is more of a CSV-specific issue. It could also affect some other data source types, but not all; it depends on the data source implementation. If the same problem exists for other data source types, there may be some common root cause, and the fix could require a significant amount of work and changes.

My fix is quite simple and doesn't introduce many changes, so for now I would suggest merging it. Aiming for a more generic solution now may end up with neither option being implemented.

If you tried to reproduce this issue with other data source types, could you create a new ticket with the details of the tests you ran? I will take a look and think about a more generic approach.

* @return Hadoop relation object
*/
def createHadoopRelation(format: FileFormat,
    globPaths: Array[Path]): BaseRelation = {
A reviewer (Member) commented:

Let's inline this.

@viirya (Member) commented Feb 21, 2017

Actually I have a simpler fix like this:

--- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
@@ -213,7 +213,12 @@ class SparkHadoopUtil extends Logging {
   }

   def globPathIfNecessary(pattern: Path): Seq[Path] = {
-    if (isGlobPath(pattern)) globPath(pattern) else Seq(pattern)
+    val fs = pattern.getFileSystem(conf)
+    if (fs.exists(pattern)) {
+      Seq(pattern)
+    } else {
+      if (isGlobPath(pattern)) globPath(pattern) else Seq(pattern)
+    }
   }
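
With this change the behaviour would roughly be as follows (an illustration, not part of the diff; the path is hypothetical):

// Assuming a file literally named /data/test{00-1}.txt exists:
globPathIfNecessary(new Path("/data/test{00-1}.txt"))
// before: the name is treated as a glob; no file matches -> resolution fails
// after:  fs.exists(pattern) is true -> Seq(/data/test{00-1}.txt) is returned
//         without any glob expansion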

@viirya (Member) commented Feb 23, 2017

@lxsmnv What do you think?

@lxsmnv (Author) commented Feb 23, 2017

@viirya Looks good to me :) and it's more generic. Do you want me to update my pull request, or do you have another pull request with that fix?

@lxsmnv (Author) commented Feb 23, 2017

However, resolving path patterns and checking file existence multiple times is a bit awkward. That is a more general problem, though; all this data source code needs refactoring.

@lxsmnv (Author) commented Feb 23, 2017

There is a possible problem with adding the file existence check to globPathIfNecessary that I have just realized. If the user provides a pattern that is exactly the same as an existing file name, e.g. a file literally named test* while the user supplies the pattern test* intending it as a pattern rather than an exact file name, then globPathIfNecessary with the proposed modification will resolve it to only that one file, test*. This changes the existing behaviour, although I am not sure whether it is a big issue in practice. See the illustration below.
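
For example (a hypothetical directory layout):

// /data contains two files: one literally named "test*" and one named test1.csv.
// The user passes the pattern "/data/test*" expecting glob expansion:
globPathIfNecessary(new Path("/data/test*"))
// current behaviour:  glob expansion -> Seq(/data/test*, /data/test1.csv)
// proposed behaviour: fs.exists(pattern) is true for the literal file,
//                     so only Seq(/data/test*) is returned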

@gatorsmile (Member) commented:
Since this could cause a behavior change, how about we close this PR for now?

@gatorsmile (Member) commented:
We are closing it due to inactivity. Please reopen it if you want to push it forward. Thanks!

@asfgit closed this in b32bd00 on Jun 27, 2017