[SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input #5708
Conversation
… list of files as input.
That seems reasonable, but this should be setInputPaths I think for consistency. Also there are other instances of the same call in this class that can be updated too.
I don't think that was the intent of the API -- you specify the paths as an argument, and it could be surprising to also include something that happens to be in the existing config already.
Yes, but originally, to support multiple paths, one would add them directly to the config. Code that was relying on that breaks (my case :(), though it's easy to fix. In practice I'm not sure there is any real advantage in preventing people from adding paths from code outside Spark. WDYT?
The problem is that the rest of the API already used setInputPaths, so one or the other behavior really needed to change to fix that inconsistency. I think the logic was that nobody should have been relying on anything but the method argument to set the path. Personally, I think it's less confusing not to have two ways to specify a path. At this point, though, it would take a very good reason to change the behavior again, since it's no longer a question of fixing an inconsistency.
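(For readers skimming later: here is a minimal sketch of the semantic difference being debated, calling Hadoop's new-API FileInputFormat directly; the paths are made up.)

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance()

// Simulate a path placed in the config by code outside Spark.
FileInputFormat.addInputPath(job, new Path("/data/preexisting"))

// addInputPaths APPENDS: inputs are now preexisting, a.txt, and b.txt.
FileInputFormat.addInputPaths(job, "/data/a.txt,/data/b.txt")

// setInputPaths REPLACES: inputs are now only a.txt and b.txt.
FileInputFormat.setInputPaths(job, "/data/a.txt,/data/b.txt")

// Prints the two paths set last; the preexisting one is gone.
FileInputFormat.getInputPaths(job).foreach(println)
```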
The reason to use addInputPaths would be to preserve compatibility. I was lucky enough to have unit tests that detected this change, but others might encounter it in production.
But since this has already been released, I guess we can stick with setInputPaths.
... but then what would you do about the inconsistency problem? Some methods would use one, others the other. That's a bigger problem.
Continued the conversation on the JIRA ticket SPARK-8439.
…r newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency.
@srowen Thanks for the comment. I updated the pull request so that setInputPaths is used instead of addInputPaths. In addition to newAPIHadoopFile(), the instances of addInputPath inside wholeTextFiles() and binaryFiles() have also been updated to setInputPaths. That should bring consistent behavior across SparkContext.scala. The unit test for this issue has also been updated to cover every method involved. Please let me know if there is anything else that needs to be taken care of.
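(For readers of the diff, the change inside each of those SparkContext methods has roughly the following shape. This is a paraphrased sketch, not the exact Spark source; `conf` and `path` stand in for the methods' locals.)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{Job => NewHadoopJob}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat => NewFileInputFormat}

def configureInputPaths(conf: Configuration, path: String): Configuration = {
  val job = NewHadoopJob.getInstance(conf)
  // Previously: NewFileInputFormat.addInputPath(job, new Path(path)),
  // which treated the whole string as one literal path. setInputPaths
  // splits on commas and overwrites any previously configured inputs.
  NewFileInputFormat.setInputPaths(job, path)
  job.getConfiguration
}
```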
Simpler just to delete the parent dirs, right?
…s parent dir is already temporary.
@srowen Thanks. Removed the unneeded file deletion in the unit test. I didn't realize Utils.createTempDir() deletes recursively at shutdown.
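(Side note for later readers: Utils.createTempDir() registers a shutdown hook that deletes the directory tree recursively, so the test setup only needs to create files, roughly as in this sketch; the file names are illustrative.)

```scala
import java.io.File
import com.google.common.base.Charsets
import com.google.common.io.Files
import org.apache.spark.util.Utils

// The parent dir is deleted recursively at JVM shutdown; no explicit
// cleanup of these files is needed in the test body.
val dir: File = Utils.createTempDir()
Files.write("someline1 in file1\nsomeline2 in file1", new File(dir, "part-00000"), Charsets.UTF_8)
Files.write("someline1 in file2", new File(dir, "part-00001"), Charsets.UTF_8)
```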
ok to test
Test build #31093 has finished for PR 5708 at commit
@srowen Can this pull request be merged if there are no additional issues?
I'm proceeding with this because it makes behavior more consistent while still supporting existing behavior: you can still specify just one file in all cases. It also more correctly calls set rather than add for input paths. Tests pass, etc.
… list of files as input

See JIRA: https://issues.apache.org/jira/browse/SPARK-7155

SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:

```scala
sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```

will throw

```
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
```

However, the other API, hadoopFile(), is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it also processes a comma-separated list of files correctly. That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned.

This pull request fixes this issue and allows newAPIHadoopFile() to support a comma-separated list of files as input. A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as input and tests them against sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile().

Note: The contribution is my original work and I license the work to the project under the project's open source license.

Author: yongtang <[email protected]>

Closes #5708 from yongtang/SPARK-7155 and squashes the following commits:

654c80c [yongtang] [SPARK-7155] [CORE] Remove unneeded temp file deletion in unit test as parent dir is already temporary.
26faa6a [yongtang] [SPARK-7155] [CORE] Support comma-separated list of files as input for newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency.
73e1f16 [yongtang] [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input.

(cherry picked from commit 3fc6cfd)
Signed-off-by: Sean Owen <[email protected]>
Is there a way to support commas in the filenames? I tried to escape with {} (as in Hadoop's FileInputFormat), but the curly braces are not removed after escaping, so the path ends up invalid. Any ideas? I know that commas in a filename are weird anyway ...
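One possible workaround, sketched below and untested against this exact scenario: bypass the comma-splitting string API entirely by configuring the Hadoop Job yourself with the Path-varargs overload of setInputPaths, which treats each Path literally, then read through newAPIHadoopRDD. `sc` is assumed to be an existing SparkContext, and the paths are placeholders.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

val job = Job.getInstance(sc.hadoopConfiguration)
// The Path... overload never parses commas, so a comma inside a file
// name is not treated as a separator.
FileInputFormat.setInputPaths(job,
  new Path("/data/strange,name.txt"), new Path("/data/other.txt"))

val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```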
Hello @yongtang, I am troubleshooting a slow wholeTextFiles() read, and while digging through the function's code I came across this ticket number. I have a question regarding the comment made here: it talks about how one could speed up reading wholeTextFiles() from multiple paths, but I can't understand what is meant by "path/*" in that comment. Any pointers would be much appreciated. In my case we are processing our data by reading from 24 paths simultaneously, providing a comma-separated list in our GCS bucket URL, and it took about 40 minutes to read and prepare the DataFrame, which we feel is quite slow. When the number of paths is reduced to 10, it takes about a quarter of the time. My apologies for posting a question to you here; I couldn't find a way to ask questions on the Spark GitHub repo.
@normalscene I haven't been very active in the community recently, but I think you can create an issue in the Spark JIRA: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-33807?filter=allopenissues
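(Regarding the "path/*" question above: presumably the comment means passing a single glob pattern that Hadoop expands into all matching directories, instead of enumerating a long comma-separated list. A guess at the intent, with placeholder bucket and paths:)

```scala
// One glob that matches all the input directories at once:
val viaGlob = sc.wholeTextFiles("gs://my-bucket/data/*")

// versus enumerating each path in a comma-separated list:
val viaList = sc.wholeTextFiles("gs://my-bucket/data/p01,gs://my-bucket/data/p02")
```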
See JIRA: https://issues.apache.org/jira/browse/SPARK-7155
SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:
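```scala
sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```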
will throw
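```
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
```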
However, the other API, hadoopFile(), is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it also processes a comma-separated list of files correctly.
That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned.
This pull request fixes this issue and allows newAPIHadoopFile() to support a comma-separated list of files as input.
A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as input and tests them against sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile(), as sketched below.
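The test follows roughly this shape (a simplified sketch, with illustrative file contents, assuming a SparkContext `sc` is in scope; see SparkContextSuite.scala for the actual version):

```scala
import java.io.File
import com.google.common.base.Charsets
import com.google.common.io.Files
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{TextInputFormat => OldTextInputFormat}
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}
import org.apache.spark.util.Utils

// Two temp files under a dir that is removed recursively at shutdown.
val dir = Utils.createTempDir()
val file1 = new File(dir, "part-00000")
val file2 = new File(dir, "part-00001")
Files.write("line1\nline2", file1, Charsets.UTF_8)
Files.write("line3", file2, Charsets.UTF_8)

val paths = s"${file1.getAbsolutePath},${file2.getAbsolutePath}"

// All three APIs should agree on the comma-separated input: 3 lines total.
assert(sc.textFile(paths).count() == 3L)
assert(sc.hadoopFile(paths,
  classOf[OldTextInputFormat], classOf[LongWritable], classOf[Text]).count() == 3L)
assert(sc.newAPIHadoopFile(paths,
  classOf[NewTextInputFormat], classOf[LongWritable], classOf[Text]).count() == 3L)
```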
Note: The contribution is my original work and I license the work to the project under the project's open source license.