@liancheng (Contributor)

In `SimpleTextRelation`, we set `needsConversion` to `true`, indicating that values produced by this testing relation are of Scala types and need to be converted to Catalyst types when necessary. However, we also used `Cast` to convert strings to the expected data types, and `Cast` always produces values of Catalyst types, so no conversion was actually performed. This PR makes `SimpleTextRelation` produce Scala values so that the data conversion code paths can be properly tested.
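
(For readers landing here later: a minimal sketch of the contract described above, using hypothetical values and written against the 1.4-era API, where `UTF8String` lived in `org.apache.spark.sql.types`.)

```scala
import org.apache.spark.sql.types.UTF8String

// Hypothetical demo, not code from this PR.
object ConversionContractDemo extends App {
  // With needsConversion = true, buildScan should emit Scala values
  // (e.g. java.lang.String), which Spark converts to Catalyst values later.
  val scalaValue: Any = "record"

  // Cast already evaluates to a Catalyst value, so a relation that routes
  // its strings through Cast emits UTF8String directly and the conversion
  // code path is never exercised.
  val catalystValue: Any = UTF8String("record")

  println(scalaValue.getClass.getName)    // java.lang.String
  println(catalystValue.getClass.getName) // org.apache.spark.sql.types.UTF8String
}
```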

@liancheng (Contributor, PR author)

cc @JoshRosen @yhuai

@liancheng changed the title from "[SQL] [Minor] Tests the data type conversion code paths" to "[SQL] [Minor] Refactor SimpleTextRelation to test the data type conversion code paths" on May 26, 2015
@SparkQA commented on May 26, 2015

Test build #33522 has finished for PR 6414 at commit 9968fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented on May 26, 2015

Test build #859 has finished for PR 6414 at commit 9968fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor)

Nice, it looks like the test failure reproduces the bug fixed by #6400:

```
Lost task 0.0 in stage 405.0 (TID 974, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.types.UTF8String
	at org.apache.spark.sql.execution.SparkSqlSerializer2$$anonfun$createSerializationFunction$1.apply(SparkSqlSerializer2.scala:319)
	at org.apache.spark.sql.execution.SparkSqlSerializer2$$anonfun$createSerializationFunction$1.apply(SparkSqlSerializer2.scala:212)
	at org.apache.spark.sql.execution.Serializer2SerializationStream.writeKey(SparkSqlSerializer2.scala:65)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:206)
	at org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:104)
	at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:70)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 405.0 failed 1 times, most recent failure: Lost task 0.0 in stage 405.0 (TID 974, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.types.UTF8String
	[... same executor-side stack trace as above ...]
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
```
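
In other words, a raw `java.lang.String` produced by the relation reaches shuffle serialization code that expects Catalyst's `UTF8String`. A contrived snippet (hypothetical, not code from either patch) trips over the same cast:

```scala
import org.apache.spark.sql.types.UTF8String

object CastFailureDemo extends App {
  val key: Any = "unconverted"   // a Scala value that skipped conversion
  key.asInstanceOf[UTF8String]   // throws java.lang.ClassCastException
}
```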

How do you want to handle merging this and #6400? I think that #6400 needs to go first so that we don't commit broken code. Should I rebase my branch to include this commit and remove the regression test that I added? Or should we leave both tests in place?

@yhuai (Contributor) commented on May 26, 2015

@JoshRosen I think we can add this change to #6400 and also keep your regression test.

@JoshRosen (Contributor)

@liancheng's commit message is really nice, so I'll quote it in my PR so that we don't lose it.

asfgit pushed a commit that referenced this pull request May 27, 2015
…ource input conversion

In `DataSourceStrategy.createPhysicalRDD`, we use the relation schema as the target schema for converting incoming rows into Catalyst rows.  However, we should be using the output schema instead, since our scan might return a subset of the relation's columns.

This patch incorporates #6414 by liancheng, which fixes an issue in `SimpleTextRelation` that prevented this bug from being caught by our old tests:

> In `SimpleTextRelation`, we specified `needsConversion` to `true`, indicating that values produced by this testing relation should be of Scala types, and need to be converted to Catalyst types when necessary. However, we also used `Cast` to convert strings to expected data types. And `Cast` always produces values of Catalyst types, thus no conversion is done at all. This PR makes `SimpleTextRelation` produce Scala values so that data conversion code paths can be properly tested.

Closes #5986.

Author: Josh Rosen <[email protected]>
Author: Cheng Lian <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #6400 from JoshRosen/SPARK-7858 and squashes the following commits:

e71c866 [Josh Rosen] Re-fix bug so that the tests pass again
56b13e5 [Josh Rosen] Add regression test to hadoopFsRelationSuites
2169a0f [Josh Rosen] Remove use of SpecificMutableRow and BufferedIterator
6cd7366 [Josh Rosen] Fix SPARK-7858 by using output types for conversion.
5a00e66 [Josh Rosen] Add assertions in order to reproduce SPARK-7858
8ba195c [Cheng Lian] Merge 9968fba into 6166473
9968fba [Cheng Lian] Tests the data type conversion code paths

(cherry picked from commit 0c33c7b)
Signed-off-by: Yin Huai <[email protected]>
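
(For reference, a sketch of the shape of the fix, reconstructed from the commit description alone; names such as `needConversion`, `RDDConversions.rowToRowRdd`, and `PhysicalRDD` are Spark internals of that era, and the exact signatures may differ from the real patch.)

```scala
// Reconstructed sketch of DataSourceStrategy.createPhysicalRDD (Spark 1.4 era).
private[this] def createPhysicalRDD(
    relation: BaseRelation,
    output: Seq[Attribute],
    rdd: RDD[Row]): SparkPlan = {
  val converted = if (relation.needConversion) {
    // Before the fix:
    //   RDDConversions.rowToRowRdd(rdd, relation.schema)
    // converted against the full relation schema, which is wrong whenever
    // the scan returns only a subset of the relation's columns.
    // After the fix: convert against the types the scan actually outputs.
    RDDConversions.rowToRowRdd(rdd, output.map(_.dataType))
  } else {
    rdd
  }
  PhysicalRDD(output, converted)
}
```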
asfgit pushed a commit that referenced this pull request May 27, 2015 (…ource input conversion; same commit message as above)

@liancheng (Contributor, PR author)

Closing this as it's already merged into #6400.

@liancheng closed this on May 27, 2015
@liancheng deleted the hadoop-fs-relation-conversion-test branch on May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015 (…ource input conversion; same commit message as above)

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015 (…ource input conversion; same commit message as above)