[SPARK-19340][SQL] CSV file will result in an exception if the filename contains special characters #16995
Conversation
Can one of the admins verify this patch?
Could you add tests for this PR?
```scala
 * @return Hadoop relation object
 */
def createHadoopRelation(format: FileFormat,
    globPaths: Array[Path]): BaseRelation = {
```
You call getOrInferFileFormatSchema twice; one call is before createHadoopRelation.
@lxsmnv, could you check whether this is a more general problem? I suspect it is not only a CSV-specific issue. IIRC, when I saw this JIRA I tested several cases with other data sources, and some of them did not work correctly.
@HyukjinKwon The problem I found was in CSVFileFormat, so it is CSV-specific. It may also affect some other data source types, but not all of them; it depends on the data source implementation. If the same problem exists for other data source types, there may be some common root cause, and fixing it generically could require a significant amount of work and changes. My fix is quite simple and does not introduce many changes, so for now I would suggest merging it; aiming for a more generic solution now may end up with neither option being implemented. If you can reproduce this issue with other data source types, please create a new ticket with the details of the tests you ran, and I will take a look and think about a more generic approach.
```scala
 * @return Hadoop relation object
 */
def createHadoopRelation(format: FileFormat,
    globPaths: Array[Path]): BaseRelation = {
```
Let's inline this.
Actually, I have a simpler fix, like this:
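A minimal sketch of what such a fix could look like, assuming it adds a file-existence check to globPathIfNecessary, as the discussion below suggests; the signature and names here are assumptions, not the actual proposed patch:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: check for a literal file before expanding the glob. The
// helper name mirrors SparkHadoopUtil.globPathIfNecessary, but the body
// here is illustrative.
def globPathIfNecessary(fs: FileSystem, pattern: Path): Seq[Path] = {
  if (fs.exists(pattern)) {
    // An exact match wins, even if the name contains glob metacharacters
    // such as '*', '{' or '}'.
    Seq(pattern)
  } else {
    // Fall back to normal glob expansion; globStatus returns null when
    // nothing matches.
    Option(fs.globStatus(pattern)).toSeq.flatten.map(_.getPath)
  }
}
```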
@lxsmnv What do you think?
@viirya Looks good to me :) and it is more generic. Do you want me to update my pull request, or do you have another pull request with that fix?
However, resolving path patterns and checking file existence multiple times is a bit awkward. That is a more general problem, though: all of this data source code needs refactoring.
There is a possible problem with adding the file-existence check to globPathIfNecessary that I have just realized. Suppose a file literally named test* exists, and the user provides test* as a pattern, intending a glob rather than an exact file name. With the proposed modification, globPathIfNecessary will resolve it to only that one file, test*. This changes the existing behaviour, though I am not sure it is a big issue in practice.
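To make the corner case concrete, here is a small illustration; the paths are hypothetical:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Suppose the directory /data contains two files: one literally named
// "test*" and another named "test1.csv".
val pattern = new Path("/data/test*")
val fs = pattern.getFileSystem(new Configuration())

// Plain glob expansion matches both files.
val globbed = Option(fs.globStatus(pattern)).toSeq.flatten.map(_.getPath)

// With the proposed existence check, the literal file "test*" shadows the
// pattern, so only one path is resolved and "test1.csv" is silently skipped.
val resolved = if (fs.exists(pattern)) Seq(pattern) else globbed
```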
Since this could cause a behavior change, how about we close this PR first?
We are closing it due to inactivity. Please reopen it if you want to push it forward. Thanks!
What changes were proposed in this pull request?
The root cause of the problem is that when Spark infers a schema from a CSV file, it tries to resolve the file path pattern more than once, calling DataSource.resolveRelation each time.
So, if we have a file path like:
<...>/test*
and the actual file is named test{00-1}.txt,
then the initial call to DataSource.resolveRelation resolves the pattern to /<...>/test{00-1}.txt. When Spark then tries to infer the schema for the CSV file, it calls DataSource.resolveRelation a second time. This second attempt fails, because the file name /<...>/test{00-1}.txt is now treated as a pattern rather than an actual file; since no file matches that pattern, the whole DataSource.resolveRelation call fails.
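A minimal reproduction of the failure, assuming a spark-shell session and an illustrative directory layout:

```scala
// The directory /tmp/data contains exactly one file: test{00-1}.txt
// (the name contains the glob metacharacters '{' and '}').

// First resolution: the glob /tmp/data/test* matches the file.
// Second resolution, during CSV schema inference: the already-resolved name
// /tmp/data/test{00-1}.txt is parsed as a glob, matches nothing, and the
// read fails with an exception such as "Path does not exist".
val df = spark.read.csv("/tmp/data/test*")
```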
The idea behind the fix is quite straightforward:
the part of DataSource.resolveRelation that creates the Hadoop relation from the resolved (actual) file names is moved to a separate function, createHadoopRelation. CSVFileFormat.createBaseDataset calls this new function instead of DataSource.resolveRelation, which was causing the unnecessary extra path resolution.
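A structural sketch of that refactor; constructor arguments are elided and bodies are stubbed with ???, so this shows the shape of the change rather than the actual Spark source:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.datasources.FileFormat
import org.apache.spark.sql.sources.BaseRelation

class DataSource(/* constructor arguments elided */) {

  // Extracted from resolveRelation: builds the relation from paths that
  // have already been resolved, so no second glob expansion takes place.
  def createHadoopRelation(format: FileFormat,
      globPaths: Array[Path]): BaseRelation = ???

  def resolveRelation(): BaseRelation = {
    val globPaths: Array[Path] = ??? // resolve file path patterns exactly once
    createHadoopRelation(format, globPaths)
  }

  private def format: FileFormat = ???
}

// CSVFileFormat.createBaseDataset then calls createHadoopRelation with the
// paths it already holds, instead of invoking resolveRelation again.
```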
How was this patch tested?
Manual tests.
This contribution is my original work and I license the work to the project under the project’s open source license.
Please review http://spark.apache.org/contributing.html before opening a pull request.