@zhzhan
Contributor

@zhzhan zhzhan commented May 7, 2015

There may be a mismatch between the RDD schema and the relation schema. I think the conversion should use the RDD schema.

@AmplabJenkins

Merged build triggered.

@zhzhan zhzhan changed the title [SPARK-7449][SQL]: Schema fix [SPARK-7449][SQL]: RDD Schema mismatch fix May 7, 2015
@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 7, 2015

Test build #32145 has started for PR 5986 at commit c16babb.

@marmbrus
Contributor

marmbrus commented May 7, 2015

test case?

@zhzhan
Contributor Author

zhzhan commented May 7, 2015

@marmbrus I will find a test case for it.

I hit this issue when, inside the relation, I did not override needConversion with the following:
override def needConversion = false

createPhysicalRDD then tries to use rowToRowRdd to convert the rows to Catalyst types. For example, if we run
select b, c from table where a > 1, the output is b, c, a, but the table schema is a, b, c, so the types conflict.
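
To make the conflict concrete, here is a minimal sketch in plain Scala with no Spark dependency. `Field`, `FieldType`, `matches`, and `conflicts` are hypothetical stand-ins invented for illustration, not Spark classes, and the column types (a as an integer, b as a string, c as a double) are assumed. It pairs each value of a scanned row with the type declared at the same position in a schema and reports the positions that do not fit.

```scala
object SchemaMismatchSketch {
  // Hypothetical stand-ins for schema fields and types; not the Spark API.
  sealed trait FieldType
  case object IntType extends FieldType
  case object StrType extends FieldType
  case object DblType extends FieldType
  final case class Field(name: String, tpe: FieldType)

  // True when the runtime value fits the declared type.
  def matches(value: Any, tpe: FieldType): Boolean = (value, tpe) match {
    case (_: Int, IntType)    => true
    case (_: String, StrType) => true
    case (_: Double, DblType) => true
    case _                    => false
  }

  // Names of the schema fields whose declared type does not fit
  // the value found at the same position in the row.
  def conflicts(row: Seq[Any], schema: Seq[Field]): Seq[String] =
    row.zip(schema).collect { case (v, f) if !matches(v, f.tpe) => f.name }

  // Assumed table schema, in declaration order: a, b, c.
  val relationSchema: Seq[Field] =
    Seq(Field("a", IntType), Field("b", StrType), Field("c", DblType))

  // Column order produced by the scan in the example above: b, c, a.
  val scanSchema: Seq[Field] =
    Seq(Field("b", StrType), Field("c", DblType), Field("a", IntType))

  // One row as the scan would hand it over, in scan order.
  val scannedRow: Seq[Any] = Seq("x", 2.5, 7)
}
```

With these assumed types, `conflicts(scannedRow, relationSchema)` flags every position, while `conflicts(scannedRow, scanSchema)` flags none.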

@SparkQA

SparkQA commented May 7, 2015

Test build #32145 has finished for PR 5986 at commit c16babb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32145/
Test PASSed.

asfgit pushed a commit that referenced this pull request May 27, 2015
…ource input conversion

In `DataSourceStrategy.createPhysicalRDD`, we use the relation schema as the target schema for converting incoming rows into Catalyst rows.  However, we should be using the output schema instead, since our scan might return a subset of the relation's columns.
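
As an illustration only (this is not the code from the patch), the idea of deriving the conversion target from the scan's output rather than from the full relation schema can be sketched in plain Scala. `Field` and `outputSchema` are hypothetical names, and data types are represented as plain strings for brevity.

```scala
object OutputSchemaSketch {
  // Hypothetical stand-in for a schema field; not the Spark API.
  final case class Field(name: String, dataType: String)

  // Builds the conversion target from the columns the scan actually
  // returns, in the order it returns them. Throws NoSuchElementException
  // if an output column is not part of the relation.
  def outputSchema(relation: Seq[Field], outputColumns: Seq[String]): Seq[Field] = {
    val byName = relation.map(f => f.name -> f).toMap
    outputColumns.map(byName)
  }
}
```

For a relation with columns a, b, c and a scan that returns only b and c, `outputSchema` yields the two fields for b and c, so each value is paired with its own column type.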

This patch incorporates #6414 by liancheng, which fixes an issue in `SimpleTextRelation` that prevented this bug from being caught by our old tests:

> In `SimpleTextRelation`, we set `needConversion` to `true`, indicating that values produced by this testing relation should be of Scala types and need to be converted to Catalyst types when necessary. However, we also used `Cast` to convert strings to the expected data types. Because `Cast` always produces values of Catalyst types, no conversion was done at all. This PR makes `SimpleTextRelation` produce Scala values so that the data conversion code paths can be properly tested.

Closes #5986.

Author: Josh Rosen <[email protected]>
Author: Cheng Lian <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #6400 from JoshRosen/SPARK-7858 and squashes the following commits:

e71c866 [Josh Rosen] Re-fix bug so that the tests pass again
56b13e5 [Josh Rosen] Add regression test to hadoopFsRelationSuites
2169a0f [Josh Rosen] Remove use of SpecificMutableRow and BufferedIterator
6cd7366 [Josh Rosen] Fix SPARK-7858 by using output types for conversion.
5a00e66 [Josh Rosen] Add assertions in order to reproduce SPARK-7858
8ba195c [Cheng Lian] Merge 9968fba into 6166473
9968fba [Cheng Lian] Tests the data type conversion code paths

(cherry picked from commit 0c33c7b)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 0c33c7b May 27, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015