@liancheng (Contributor)

In `SimpleTextRelation`, we set `needsConversion` to `true`, indicating that values produced by this testing relation are of Scala types and need to be converted to Catalyst types when necessary. However, we also used `Cast` to convert strings to the expected data types, and `Cast` always produces values of Catalyst types, so no conversion was actually performed. This PR makes `SimpleTextRelation` produce Scala values so that the data conversion code paths can be properly tested.
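
(For readers landing here later: a minimal sketch of the contract described above, using hypothetical values and written against the 1.4-era API, where `UTF8String` lived in `org.apache.spark.sql.types`.)

```scala
import org.apache.spark.sql.types.UTF8String

// Hypothetical demo, not code from this PR.
object ConversionContractDemo extends App {
  // With needsConversion = true, buildScan should emit Scala values
  // (e.g. java.lang.String), which Spark converts to Catalyst values later.
  val scalaValue: Any = "record"

  // Cast already evaluates to a Catalyst value, so a relation that routes
  // its strings through Cast emits UTF8String directly and the conversion
  // code path is never exercised.
  val catalystValue: Any = UTF8String("record")

  println(scalaValue.getClass.getName)    // java.lang.String
  println(catalystValue.getClass.getName) // org.apache.spark.sql.types.UTF8String
}
```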

@liancheng (Contributor, PR author)

cc @JoshRosen @yhuai

@liancheng changed the title from "[SQL] [Minor] Tests the data type conversion code paths" to "[SQL] [Minor] Refactor SimpleTextRelation to test the data type conversion code paths" on May 26, 2015
@SparkQA commented on May 26, 2015

Test build #33522 has finished for PR 6414 at commit 9968fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented on May 26, 2015

Test build #859 has finished for PR 6414 at commit 9968fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor)

Nice, it looks like the test failure reproduces the bug fixed by #6400:

```
Lost task 0.0 in stage 405.0 (TID 974, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.types.UTF8String
	at org.apache.spark.sql.execution.SparkSqlSerializer2$$anonfun$createSerializationFunction$1.apply(SparkSqlSerializer2.scala:319)
	at org.apache.spark.sql.execution.SparkSqlSerializer2$$anonfun$createSerializationFunction$1.apply(SparkSqlSerializer2.scala:212)
	at org.apache.spark.sql.execution.Serializer2SerializationStream.writeKey(SparkSqlSerializer2.scala:65)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:206)
	at org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:104)
	at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:70)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 405.0 failed 1 times, most recent failure: Lost task 0.0 in stage 405.0 (TID 974, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.types.UTF8String
	[... same executor-side stack trace as above ...]
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
```
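
In other words, a raw `java.lang.String` produced by the relation reaches shuffle serialization code that expects Catalyst's `UTF8String`. A contrived snippet (hypothetical, not code from either patch) trips over the same cast:

```scala
import org.apache.spark.sql.types.UTF8String

object CastFailureDemo extends App {
  val key: Any = "unconverted"   // a Scala value that skipped conversion
  key.asInstanceOf[UTF8String]   // throws java.lang.ClassCastException
}
```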

How do you want to handle merging this and #6400? I think that #6400 needs to go first so that we don't commit broken code. Should I rebase my branch to include this commit and remove the regression test that I added? Or should we leave both tests in place?

@yhuai (Contributor) commented on May 26, 2015

@JoshRosen I think we can add this change to #6400 and also keep your regression test.

@JoshRosen (Contributor)

@liancheng's commit message is really nice, so I'll quote it in my PR so that we don't lose it.

asfgit pushed a commit that referenced this pull request May 27, 2015
…ource input conversion

In `DataSourceStrategy.createPhysicalRDD`, we use the relation schema as the target schema for converting incoming rows into Catalyst rows.  However, we should be using the output schema instead, since our scan might return a subset of the relation's columns.

This patch incorporates #6414 by liancheng, which fixes an issue in `SimpleTextRelation` that prevented this bug from being caught by our old tests:

> In `SimpleTextRelation`, we specified `needsConversion` to `true`, indicating that values produced by this testing relation should be of Scala types, and need to be converted to Catalyst types when necessary. However, we also used `Cast` to convert strings to expected data types. And `Cast` always produces values of Catalyst types, thus no conversion is done at all. This PR makes `SimpleTextRelation` produce Scala values so that data conversion code paths can be properly tested.

Closes #5986.

Author: Josh Rosen <[email protected]>
Author: Cheng Lian <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #6400 from JoshRosen/SPARK-7858 and squashes the following commits:

e71c866 [Josh Rosen] Re-fix bug so that the tests pass again
56b13e5 [Josh Rosen] Add regression test to hadoopFsRelationSuites
2169a0f [Josh Rosen] Remove use of SpecificMutableRow and BufferedIterator
6cd7366 [Josh Rosen] Fix SPARK-7858 by using output types for conversion.
5a00e66 [Josh Rosen] Add assertions in order to reproduce SPARK-7858
8ba195c [Cheng Lian] Merge 9968fba into 6166473
9968fba [Cheng Lian] Tests the data type conversion code paths

(cherry picked from commit 0c33c7b)
Signed-off-by: Yin Huai <[email protected]>
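
(For reference, a sketch of the shape of the fix, reconstructed from the commit description alone; names such as `needConversion`, `RDDConversions.rowToRowRdd`, and `PhysicalRDD` are Spark internals of that era, and the exact signatures may differ from the real patch.)

```scala
// Reconstructed sketch of DataSourceStrategy.createPhysicalRDD (Spark 1.4 era).
private[this] def createPhysicalRDD(
    relation: BaseRelation,
    output: Seq[Attribute],
    rdd: RDD[Row]): SparkPlan = {
  val converted = if (relation.needConversion) {
    // Before the fix:
    //   RDDConversions.rowToRowRdd(rdd, relation.schema)
    // converted against the full relation schema, which is wrong whenever
    // the scan returns only a subset of the relation's columns.
    // After the fix: convert against the types the scan actually outputs.
    RDDConversions.rowToRowRdd(rdd, output.map(_.dataType))
  } else {
    rdd
  }
  PhysicalRDD(output, converted)
}
```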
asfgit pushed a commit that referenced this pull request May 27, 2015 (…ource input conversion; same commit message as above)

@liancheng (Contributor, PR author)

Closing this as it's already merged into #6400.

@liancheng closed this on May 27, 2015
@liancheng deleted the hadoop-fs-relation-conversion-test branch on May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015 (…ource input conversion; same commit message as above)

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015 (…ource input conversion; same commit message as above)