[SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input #5708
Conversation
… list of files as input.
That seems reasonable, but this should be setInputPaths I think for consistency. Also there are other instances of the same call in this class that can be updated too.
I don't think that was the intent of the API -- you specify the paths as an argument, and it could be surprising to also include something that happens to be in the existing config already.
Yes, but originally, to support multiple paths, one would add them directly to the config. Code that was relying on that breaks (my case :(), though it's easy to fix. In practice I'm not sure there is any real advantage in preventing people from adding paths from code outside Spark. WDYT?
The problem is that the rest of the API already used setInputPaths, so one or the other behavior really needed to change to fix that inconsistency. I think the logic was that nobody should have been relying on anything but the method argument to set the path. Personally, I think it's less confusing not to have two ways to specify a path. At this point, though, it would take a very good reason to change the behavior again, since it's no longer a question of fixing an inconsistency.
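(For readers skimming later: here is a minimal sketch of the semantic difference being debated, calling Hadoop's new-API FileInputFormat directly; the paths are made up.)

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance()

// Simulate a path placed in the config by code outside Spark.
FileInputFormat.addInputPath(job, new Path("/data/preexisting"))

// addInputPaths APPENDS: inputs are now preexisting, a.txt, and b.txt.
FileInputFormat.addInputPaths(job, "/data/a.txt,/data/b.txt")

// setInputPaths REPLACES: inputs are now only a.txt and b.txt.
FileInputFormat.setInputPaths(job, "/data/a.txt,/data/b.txt")

// Prints the two paths set last; the preexisting one is gone.
FileInputFormat.getInputPaths(job).foreach(println)
```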
The reason to use addInputPaths would be to preserve compatibility. I was lucky enough to have unit tests that detected this change, but others might encounter it in production.
But since this has already been released, I guess we can stick with setInputPaths.
... but then what would you do about the inconsistency problem? Some methods would use one, others the other. That's a bigger problem.
Continued the conversation on the JIRA ticket SPARK-8439.
…r newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency.
@srowen Thanks for the comment. I updated the pull request so that setInputPaths is used instead of addInputPaths. In addition to newAPIHadoopFile(), the instances of addInputPath inside wholeTextFiles() and binaryFiles() have also been updated to setInputPaths. That should bring consistent behavior across SparkContext.scala. The unit test for this issue has also been updated to cover every method involved. Please let me know if there is anything else that needs to be taken care of.
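(For readers of the diff, the change inside each of those SparkContext methods has roughly the following shape. This is a paraphrased sketch, not the exact Spark source; `conf` and `path` stand in for the methods' locals.)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{Job => NewHadoopJob}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat => NewFileInputFormat}

def configureInputPaths(conf: Configuration, path: String): Configuration = {
  val job = NewHadoopJob.getInstance(conf)
  // Previously: NewFileInputFormat.addInputPath(job, new Path(path)),
  // which treated the whole string as one literal path. setInputPaths
  // splits on commas and overwrites any previously configured inputs.
  NewFileInputFormat.setInputPaths(job, path)
  job.getConfiguration
}
```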
Simpler just to delete the parent dirs, right?
…s parent dir is already temporary.
@srowen Thanks. Removed the unneeded file deletion in the unit test. I didn't realize Utils.createTempDir() deletes recursively at shutdown.
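(Side note for later readers: Utils.createTempDir() registers a shutdown hook that deletes the directory tree recursively, so the test setup only needs to create files, roughly as in this sketch; the file names are illustrative.)

```scala
import java.io.File
import com.google.common.base.Charsets
import com.google.common.io.Files
import org.apache.spark.util.Utils

// The parent dir is deleted recursively at JVM shutdown; no explicit
// cleanup of these files is needed in the test body.
val dir: File = Utils.createTempDir()
Files.write("someline1 in file1\nsomeline2 in file1", new File(dir, "part-00000"), Charsets.UTF_8)
Files.write("someline1 in file2", new File(dir, "part-00001"), Charsets.UTF_8)
```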
ok to test
Test build #31093 has finished for PR 5708 at commit
@srowen Can this pull request be merged if there are no additional issues?
I'm proceeding with this because it makes behavior more consistent while still supporting existing behavior: you can still specify just one file in all cases. It also more correctly calls set rather than add for input paths. Tests pass, etc.
… list of files as input

See JIRA: https://issues.apache.org/jira/browse/SPARK-7155

SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:

```scala
sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```

will throw

```
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
```

However, the other API, hadoopFile(), is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it also processes a comma-separated list of files correctly. That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned.

This pull request fixes this issue and allows newAPIHadoopFile() to support a comma-separated list of files as input. A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as input and tests them against sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile().

Note: The contribution is my original work and I license the work to the project under the project's open source license.

Author: yongtang <[email protected]>

Closes #5708 from yongtang/SPARK-7155 and squashes the following commits:

654c80c [yongtang] [SPARK-7155] [CORE] Remove unneeded temp file deletion in unit test as parent dir is already temporary.
26faa6a [yongtang] [SPARK-7155] [CORE] Support comma-separated list of files as input for newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency.
73e1f16 [yongtang] [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input.

(cherry picked from commit 3fc6cfd)
Signed-off-by: Sean Owen <[email protected]>
Is there a way to support commas in the filenames? I tried to escape with {} (as in Hadoop's FileInputFormat), but the curly braces are not removed after escaping, so the path ends up invalid. Any ideas? I know that commas in a filename are weird anyway ...
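One possible workaround, sketched below and untested against this exact scenario: bypass the comma-splitting string API entirely by configuring the Hadoop Job yourself with the Path-varargs overload of setInputPaths, which treats each Path literally, then read through newAPIHadoopRDD. `sc` is assumed to be an existing SparkContext, and the paths are placeholders.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

val job = Job.getInstance(sc.hadoopConfiguration)
// The Path... overload never parses commas, so a comma inside a file
// name is not treated as a separator.
FileInputFormat.setInputPaths(job,
  new Path("/data/strange,name.txt"), new Path("/data/other.txt"))

val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```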
Hello @yongtang, I am troubleshooting a slow wholeTextFiles() read, and while digging through the function's code I came across this ticket number. I have a question regarding the comment made here: it talks about how one could speed up reading wholeTextFiles() from multiple paths, but I can't understand what is meant by "path/*" in that comment. Any pointers would be much appreciated. In my case we are processing our data by reading from 24 paths simultaneously, providing a comma-separated list in our GCS bucket URL, and it took about 40 minutes to read and prepare the DataFrame, which we feel is quite slow. When the number of paths is reduced to 10, it takes about a quarter of the time. My apologies for posting a question to you here; I couldn't find a way to ask questions on the Spark GitHub repo.
@normalscene I haven't been very active in the community recently, but I think you can create an issue in the Spark JIRA: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-33807?filter=allopenissues
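(Regarding the "path/*" question above: presumably the comment means passing a single glob pattern that Hadoop expands into all matching directories, instead of enumerating a long comma-separated list. A guess at the intent, with placeholder bucket and paths:)

```scala
// One glob that matches all the input directories at once:
val viaGlob = sc.wholeTextFiles("gs://my-bucket/data/*")

// versus enumerating each path in a comma-separated list:
val viaList = sc.wholeTextFiles("gs://my-bucket/data/p01,gs://my-bucket/data/p02")
```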
See JIRA: https://issues.apache.org/jira/browse/SPARK-7155
SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:
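```scala
sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```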
will throw
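```
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
```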
However, the other API, hadoopFile(), is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it also processes a comma-separated list of files correctly.
That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned.
This pull request fixes this issue and allows newAPIHadoopFile() to support a comma-separated list of files as input.
A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as input and tests them against sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile(), as sketched below.
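The test follows roughly this shape (a simplified sketch, with illustrative file contents, assuming a SparkContext `sc` is in scope; see SparkContextSuite.scala for the actual version):

```scala
import java.io.File
import com.google.common.base.Charsets
import com.google.common.io.Files
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{TextInputFormat => OldTextInputFormat}
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}
import org.apache.spark.util.Utils

// Two temp files under a dir that is removed recursively at shutdown.
val dir = Utils.createTempDir()
val file1 = new File(dir, "part-00000")
val file2 = new File(dir, "part-00001")
Files.write("line1\nline2", file1, Charsets.UTF_8)
Files.write("line3", file2, Charsets.UTF_8)

val paths = s"${file1.getAbsolutePath},${file2.getAbsolutePath}"

// All three APIs should agree on the comma-separated input: 3 lines total.
assert(sc.textFile(paths).count() == 3L)
assert(sc.hadoopFile(paths,
  classOf[OldTextInputFormat], classOf[LongWritable], classOf[Text]).count() == 3L)
assert(sc.newAPIHadoopFile(paths,
  classOf[NewTextInputFormat], classOf[LongWritable], classOf[Text]).count() == 3L)
```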
Note: The contribution is my original work and I license the work to the project under the project's open source license.