
Conversation

@lxsmnv commented Feb 20, 2017

What changes were proposed in this pull request?

The root cause of the problem is that when Spark infers a schema from a CSV file, it tries to resolve the file path pattern more than once, calling DataSource.resolveRelation each time.

So, if we have a file path pattern like:
<...>/test*
and the actual file is named test{00-1}.txt,
then the initial call to DataSource.resolveRelation resolves the pattern to /<...>/test{00-1}.txt. When Spark then tries to infer the schema of the CSV file, it calls DataSource.resolveRelation a second time. This second attempt fails because the file name /<...>/test{00-1}.txt is now treated as a glob pattern rather than as an actual file, and since no file matches that pattern, the whole DataSource.resolveRelation call fails.
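
For illustration, here is a minimal reproduction of the failure (the directory layout and session setup are hypothetical, not from this PR):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("glob-repro").getOrCreate()

// Suppose /tmp/data contains exactly one file, named test{00-1}.txt.
// The first resolution pass expands the glob below to that file name.
// Schema inference then resolves the relation a second time, now treating
// test{00-1}.txt itself as a glob pattern; nothing matches it, so the read
// fails with a "path does not exist" error.
val df = spark.read.csv("/tmp/data/test*")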

The idea behind the fix is quite straightforward: the part of DataSource.resolveRelation that creates the Hadoop relation from the resolved (actual) file names is moved into a separate function, createHadoopRelation. CSVFileFormat.createBaseDataset now calls this new function instead of DataSource.resolveRelation, which was what caused the unnecessary second path resolution.
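
For reference, a self-contained sketch of the underlying "resolve the glob only once" idea (this is not the PR's actual code, and the helper name resolveGlobOnce is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Expand a glob pattern exactly once and hand the concrete paths downstream,
// so that file names containing glob characters are never re-expanded.
def resolveGlobOnce(pattern: String, conf: Configuration): Seq[Path] = {
  val path = new Path(pattern)
  val fs = path.getFileSystem(conf)
  // globStatus returns null when nothing matches the pattern
  Option(fs.globStatus(path)).toSeq.flatten.map(_.getPath)
}

createHadoopRelation plays a similar role in the PR: it receives globPaths that are already expanded and builds the relation from them directly, without re-resolving.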

How was this patch tested?

Manual tests.

This contribution is my original work and I license the work to the project under the project’s open source license.
Please review http://spark.apache.org/contributing.html before opening a pull request.

@AmplabJenkins commented:

Can one of the admins verify this patch?

@maropu (Member) commented Feb 20, 2017

Could you add tests for this PR?

* @return Hadoop relation object
*/
def createHadoopRelation(format: FileFormat,
    globPaths: Array[Path]): BaseRelation = {
@viirya (Member) commented:

You call getOrInferFileFormatSchema twice. One call is before calling createHadoopRelation.

@lxsmnv (Author) commented:

@viirya I will fix this. Looks like a merge issue.
@maropu I will add tests.

@HyukjinKwon (Member) commented Feb 20, 2017

@lxsmnv, could you check whether this is a more general problem? I suspect it is not only a CSV-specific issue. IIRC, when I saw this JIRA I tested several cases with other data sources, and some of them did not work correctly.

@lxsmnv (Author) commented Feb 21, 2017

@viirya I've removed the duplicate call to getOrInferFileFormatSchema. Thanks for pointing that out.
@maropu I've added a test case.

@lxsmnv (Author) commented Feb 21, 2017

@HyukjinKwon the problem I found was in CSVFileFormat, so it is more of a CSV-specific issue. It could also affect some other data source types, but not all; it depends on the data source implementation. If the same problem exists for other data source types, there may be some common root cause, and the fix could require a significant amount of work and changes.

My fix is quite simple and doesn't introduce many changes, so for now I would suggest merging it. Aiming for a more generic solution now may end up with neither option being implemented.

If you tried to reproduce this issue with other data source types, could you create a new ticket with the details of the tests you ran? I will take a look and think about a more generic approach.

* @return Hadoop relation object
*/
def createHadoopRelation(format: FileFormat,
    globPaths: Array[Path]): BaseRelation = {
A reviewer (Member) commented:

Let's inline this.

@viirya (Member) commented Feb 21, 2017

Actually I have a simpler fix like this:

--- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
@@ -213,7 +213,12 @@ class SparkHadoopUtil extends Logging {
   }

   def globPathIfNecessary(pattern: Path): Seq[Path] = {
-    if (isGlobPath(pattern)) globPath(pattern) else Seq(pattern)
+    val fs = pattern.getFileSystem(conf)
+    if (fs.exists(pattern)) {
+      Seq(pattern)
+    } else {
+      if (isGlobPath(pattern)) globPath(pattern) else Seq(pattern)
+    }
   }
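
With this change the behaviour would roughly be as follows (an illustration, not part of the diff; the path is hypothetical):

// Assuming a file literally named /data/test{00-1}.txt exists:
globPathIfNecessary(new Path("/data/test{00-1}.txt"))
// before: the name is treated as a glob; no file matches -> resolution fails
// after:  fs.exists(pattern) is true -> Seq(/data/test{00-1}.txt) is returned
//         without any glob expansion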

@viirya (Member) commented Feb 23, 2017

@lxsmnv What do you think?

@lxsmnv (Author) commented Feb 23, 2017

@viirya Looks good to me :) and it's more generic. Do you want me to update my pull request, or do you have another pull request with that fix?

@lxsmnv (Author) commented Feb 23, 2017

However, resolving path patterns and checking file existence multiple times is a bit awkward. That is a more general problem, though; all this data source code needs refactoring.

@lxsmnv (Author) commented Feb 23, 2017

There is a possible problem with adding the file existence check to globPathIfNecessary that I have just realized. If the user provides a pattern that is exactly the same as an existing file name, e.g. a file literally named test* while the user supplies the pattern test* intending it as a pattern rather than an exact file name, then globPathIfNecessary with the proposed modification will resolve it to only that one file, test*. This changes the existing behaviour, although I am not sure whether it is a big issue in practice. See the illustration below.
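
For example (a hypothetical directory layout):

// /data contains two files: one literally named "test*" and one named test1.csv.
// The user passes the pattern "/data/test*" expecting glob expansion:
globPathIfNecessary(new Path("/data/test*"))
// current behaviour:  glob expansion -> Seq(/data/test*, /data/test1.csv)
// proposed behaviour: fs.exists(pattern) is true for the literal file,
//                     so only Seq(/data/test*) is returned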

@gatorsmile (Member) commented:
Since this could cause a behavior change, how about we close this PR for now?

@gatorsmile (Member) commented:
We are closing it due to inactivity. Please reopen it if you want to push it forward. Thanks!

@asfgit closed this in b32bd00 on Jun 27, 2017