Conversation

@yongtang
Contributor

See JIRA: https://issues.apache.org/jira/browse/SPARK-7155

SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:

```scala
sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```

will throw

```
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
```

However, the other API, hadoopFile(), processes a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it also processes a comma-separated list of files correctly.

That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned.

This pull request fixes this issue and allows newAPIHadoopFile() to support a comma-separated list of files as input.

A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as input and tests sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile() against them.
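The core of the change is a one-line swap inside SparkContext. A minimal sketch of the idea (not the exact diff; the helper name here is made up):

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat => NewFileInputFormat}

// Hypothetical helper illustrating the change in newAPIHadoopFile():
def setPaths(job: Job, path: String): Unit = {
  // Before: the whole string was added as a single path, so commas broke it:
  //   NewFileInputFormat.addInputPath(job, new Path(path))
  // After: Hadoop itself splits the comma-separated string into paths:
  NewFileInputFormat.setInputPaths(job, path)
}
```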

Note: The contribution is my original work and that I license the work to the project under the project's open source license.

Member

That seems reasonable, but I think this should be setInputPaths for consistency. There are also other instances of the same call in this class that could be updated.

Contributor

@srowen @yongtang why is setInputPaths better than addInputPaths? This breaks compatibility with code that was adding paths directly using FileInputFormat.

Member

I don't think that was the intent of the API -- you specify the paths as an argument, and it could be surprising to also include something that happens to be in the existing config already.

Contributor

Yes, but originally one would support multiple paths by adding them directly to the config. Code that relied on that breaks (my case :( ), though it's easy to fix. In practice, I am not sure there is any real advantage in preventing people from adding paths from code outside Spark. WDYT?

Member

The problem is that the rest of the API already used setInputPaths, so one behavior or the other really needed to change in order to fix that. I think the logic was that nobody should have been relying on anything but the method argument to set the path. I personally think it's less confusing not to have two ways to specify a path. At this point, though, I think it would take a very good reason to change the behavior again, since it's no longer a question of fixing an inconsistency.
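To make the distinction concrete, here is a minimal sketch against Hadoop's new-API FileInputFormat (the paths are illustrative):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance()

// addInputPath() appends to whatever input paths are already in the
// job configuration, so a path configured earlier would survive:
FileInputFormat.addInputPath(job, new Path("/data/configured-elsewhere"))

// setInputPaths() replaces the configured paths outright, so only the
// method argument determines the input:
FileInputFormat.setInputPaths(job, "/data/a.txt,/data/b.txt")
```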

Contributor

The reason to use addInputPaths would be to preserve compatibility. I was lucky enough to have some unit tests that detected this change, but others might encounter it in production.

But as this has already been released, I guess we can stick with setInputPaths.

Member

... but then what would you do about the inconsistency problem? Some methods would use one, others the other. That's a bigger problem.

Contributor

Continued the conversation on the JIRA ticket SPARK-8439.

…r newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency.
@yongtang
Contributor Author

@srowen Thanks for the comment. I updated the pull request so that setInputPaths is used instead of addInputPaths. In addition to newAPIHadoopFile(), the instances of addInputPath inside wholeTextFiles() and binaryFiles() have also been updated to setInputPaths. That should make the behavior consistent across SparkContext.scala.

The unit test for this issue has also been updated to cover every method involved. Please let me know if there is anything else that needs to be taken care of.
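A rough sketch of the shape of such a test, assuming an active SparkContext sc (file names and contents are illustrative, not the exact suite code):

```scala
import java.io.File
import java.nio.charset.StandardCharsets.UTF_8

import com.google.common.io.Files

import org.apache.spark.util.Utils

// Create two small text files in a temp dir and read them back
// through a comma-separated path string.
val dir = Utils.createTempDir()
Files.write("one", new File(dir, "part-00000"), UTF_8)
Files.write("two", new File(dir, "part-00001"), UTF_8)
val paths = s"$dir/part-00000,$dir/part-00001"

assert(sc.textFile(paths).count() == 2L)
assert(sc.wholeTextFiles(paths).count() == 2L)
```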

Member

Simpler just to delete the parent dirs, right?

@yongtang
Contributor Author

@srowen Thanks. Removed the unneeded file deletion in the unit test. I didn't realize Utils.createTempDir() deletes recursively on shutdown.
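For context, a sketch of why the explicit cleanup was redundant (assuming Spark's internal org.apache.spark.util.Utils):

```scala
import org.apache.spark.util.Utils

// createTempDir() registers a shutdown hook that deletes the directory
// recursively when the JVM exits, so files created inside it need no
// explicit cleanup in tests.
val tempDir = Utils.createTempDir()
```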

@srowen
Member

srowen commented Apr 28, 2015

ok to test

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31093 has finished for PR 5708 at commit 654c80c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@yongtang
Contributor Author

@srowen Can this pull request be merged if there are no additional issues?

@srowen
Member

srowen commented Apr 29, 2015

I'm proceeding with this because it makes behavior more consistent and still supports existing behavior: you can still specify just one file in all cases. It also more correctly calls set rather than add for input paths. Tests pass, etc.

@asfgit closed this in 3fc6cfd on Apr 29, 2015
asfgit pushed a commit that referenced this pull request Apr 29, 2015
… list of files as input

See JIRA: https://issues.apache.org/jira/browse/SPARK-7155

SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:
```scala
sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```
will throw
```
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
```
However, the other API, hadoopFile(), processes a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it also processes a comma-separated list of files correctly.

That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned.

This pull request fixes this issue and allows newAPIHadoopFile() to support a comma-separated list of files as input.

A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as input and tests sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile() against them.

Note: The contribution is my original work and that I license the work to the project under the project's open source license.

Author: yongtang <[email protected]>

Closes #5708 from yongtang/SPARK-7155 and squashes the following commits:

654c80c [yongtang] [SPARK-7155] [CORE] Remove unneeded temp file deletion in unit test as parent dir is already temporary.
26faa6a [yongtang] [SPARK-7155] [CORE] Support comma-separated list of files as input for newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency.
73e1f16 [yongtang] [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input.

(cherry picked from commit 3fc6cfd)
Signed-off-by: Sean Owen <[email protected]>
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
@marcelboettcher

Is there a way to support commas in the filenames? I tried to escape with {} (as in Hadoop's FileInputFormat), but the curly braces are not removed after escaping, and the path is therefore not valid.

Any idea? I know that commas in a filename are weird anyway ...
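At the Hadoop layer, the ambiguity can be sidestepped entirely with the varargs Path overload of setInputPaths, which takes each path literally; Spark's string-based methods don't expose this, so the sketch below is Hadoop-level only:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance()
// Each Path argument is taken literally, commas and all; Hadoop
// escapes them internally when writing the job configuration.
FileInputFormat.setInputPaths(job, new Path("/root/file,with,commas.txt"))
```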

@normalscene

Hello @yongtang, I am troubleshooting a slow wholeTextFiles() read and, while digging through the function's code, I came across this ticket number.

I have a question regarding the comment made here. Basically, the comment talks about how one could speed up reading wholeTextFiles() from multiple paths. I can't understand what is meant by "path/*" in the comment. Any pointers would be much appreciated.

In my case, we are attempting to process our data by reading from 24 paths simultaneously, by providing a comma-separated list in our GCS bucket URL, and it took about 40 minutes to read and prepare the DataFrame, which we feel is quite slow. When the number of paths is reduced to 10, it takes about a quarter of the time.

My apologies for posting a question to you here; I couldn't find a way to ask questions on the Spark GitHub repo. If there is an official way to ask, please point me towards it and I will follow the due process and get help from the community.

@yongtang
Contributor Author

@normalscene I haven't been very active in the community recently, but I think you can create an issue at https://issues.apache.org/jira/projects/SPARK/issues/SPARK-33807?filter=allopenissues
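One thing that may be worth trying, if the 24 inputs share a parent directory: pass a glob plus a higher minPartitions instead of a long comma-separated list (the bucket layout and partition count below are assumptions, not a tested fix):

```scala
// Hypothetical layout: all 24 directories live under gs://my-bucket/data/
val files = sc.wholeTextFiles("gs://my-bucket/data/*", minPartitions = 96)
```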
