yhuai
Contributor

@yhuai yhuai commented Feb 5, 2016

If a `_SUCCESS` file appears in an inner partition directory, partition discovery treats it as a data file and then fails because the directory structure looks invalid. We should ignore those `_SUCCESS` files.
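To make the failure mode concrete, here is a toy sketch of the idea. It is not Spark's partition discovery code: the function names, the `ignore_success` flag, and the example paths are all made up for illustration. It only assumes that every data file is expected to sit under the same set of partition columns.

```python
from pathlib import PurePosixPath


def partition_columns(path):
    """Return the partition column names encoded in a file's parent directories."""
    parent = PurePosixPath(path).parent
    return [part.split("=", 1)[0] for part in parent.parts if "=" in part]


def discover_partitions(leaf_files, ignore_success=True):
    """Toy discovery: all remaining files must share one partition layout."""
    if ignore_success:
        # Drop _SUCCESS markers so they are not mistaken for data files.
        leaf_files = [f for f in leaf_files if PurePosixPath(f).name != "_SUCCESS"]
    layouts = {tuple(partition_columns(f)) for f in leaf_files}
    if len(layouts) > 1:
        raise ValueError(f"Conflicting directory structures: {sorted(layouts)}")
    return list(layouts.pop()) if layouts else []


# Hypothetical layout: a _SUCCESS marker sits one level above the data files.
files = [
    "table/a=1/_SUCCESS",
    "table/a=1/b=2/part-00000.parquet",
    "table/a=2/b=3/part-00000.parquet",
]
```

With the marker filtered out, every file agrees on the columns `a` and `b`. With `ignore_success=False`, the marker contributes a layout with only `a`, and the sketch raises an error, which mirrors the invalid-structure failure described above.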

In the future, it would be better to ignore all files/dirs starting with `_` or `.`. This PR does not make that change. I want to keep this change simple so that we can consider getting it into branch 1.6.

To ignore all files/dirs starting with `_` or `.`, the main change is to let ParquetRelation have another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g. metadata files used by Parquet) and data files, so filtering out every such file requires more changes.
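A rough sketch of what the broader rule could look like, under the assumption that Parquet's summary files are named `_metadata` and `_common_metadata`. The helper names here are hypothetical and this is not how ParquetRelation or FileStatusCache is implemented; it only shows why metadata files need their own path once everything hidden is filtered out of the data listing.

```python
# Assumed names of the Parquet summary files; they start with "_" and so
# would be dropped by a blanket hidden-file filter.
PARQUET_METADATA_NAMES = {"_metadata", "_common_metadata"}


def is_hidden(name):
    """True for names that a blanket filter would ignore."""
    return name.startswith("_") or name.startswith(".")


def split_leaf_files(names):
    """Return (data_files, metadata_files) from a flat list of file names."""
    data_files = [n for n in names if not is_hidden(n)]
    # Metadata files are collected separately instead of via the data listing.
    metadata_files = [n for n in names if n in PARQUET_METADATA_NAMES]
    return data_files, metadata_files


data, metadata = split_leaf_files(
    ["_SUCCESS", "_metadata", ".part-00000.crc", "part-00000.parquet"]
)
```

In this sketch, `data` contains only `part-00000.parquet` and `metadata` contains only `_metadata`; `_SUCCESS` and the `.crc` file end up in neither list.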

https://issues.apache.org/jira/browse/SPARK-13207

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50793 has finished for PR 11088 at commit 0437dbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Mar 11, 2016

LGTM, could you rebase this PR?

@yhuai
Contributor Author

yhuai commented Mar 11, 2016

sure

@yhuai
Contributor Author

yhuai commented Mar 11, 2016

Updated. The previous version of this PR did not handle the case where files are listed through Spark jobs. I fixed that part as well and added a test.

@davies
Contributor

davies commented Mar 11, 2016

LGTM

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52936 has finished for PR 11088 at commit 2460716.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor Author

yhuai commented Mar 11, 2016

test this please

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52940 has finished for PR 11088 at commit 2460716.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor Author

yhuai commented Mar 14, 2016

Thanks. I am merging this to master.

@asfgit asfgit closed this in 250832c Mar 14, 2016
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Mar 17, 2016
Author: Yin Huai <[email protected]>

Closes apache#11088 from yhuai/SPARK-13207.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016