[SPARK-19710][SQL][TESTS] Fix ordering of rows in query results #17039

robbinspg · 2017-02-23T12:30:10Z

What changes were proposed in this pull request?

Changes to the SQLQueryTests so that the order of the results is deterministic.
Where possible, ORDER BY has been added to match the existing expected output.

How was this patch tested?

Test runs on x86, zLinux (big endian), ppc (big endian)

srowen

Seems reasonable but CC @kevinyu98

SparkQA · 2017-02-23T14:28:11Z

Test build #73349 has finished for PR 17039 at commit 950415f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-02-23T14:47:31Z

@robbinspg SQLQueryTestSuite checks whether the result is sorted, and sorts it if it isn't. The problem with this check is that it does not verify that the output is completely sorted, which causes test failures when rows compare equal under the query's ordering. The current PR fixes the issues in the existing tests, but it does not guarantee that this won't happen again. Perhaps we could modify SQLQueryTestSuite.getNormalizedResult so that it always does a complete sort (while respecting the predefined sort order)?
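To illustrate the failure mode described above, here is a minimal sketch in plain Scala collections rather than Spark. The `Row` case class, the sample rows and the helper names are hypothetical. It shows two row orders that both satisfy an ordering on `k` alone, and a complete sort that keeps `k` as the primary key and breaks ties on the remaining column.

```scala
object PartialSortSketch {
  // Hypothetical row type: k is the ORDER BY column, v is another output column.
  final case class Row(k: Int, v: String)

  // True if the rows are in non-decreasing order of k alone,
  // which is all that ORDER BY k guarantees.
  def sortedByKey(rows: Seq[Row]): Boolean =
    rows.zip(rows.drop(1)).forall { case (a, b) => a.k <= b.k }

  // Complete sort: k first, then v as a tie-breaker.
  def completeSort(rows: Seq[Row]): Seq[Row] =
    rows.sortBy(r => (r.k, r.v))

  // Two runs of the same query that differ only in the order of tied rows.
  val runA: Seq[Row] = Seq(Row(1, "a"), Row(1, "b"), Row(2, "c"))
  val runB: Seq[Row] = Seq(Row(1, "b"), Row(1, "a"), Row(2, "c"))
}
```

Both `runA` and `runB` pass `sortedByKey`, yet they differ as sequences, so a golden-file comparison can fail on one of them. After `completeSort` they are identical.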

gatorsmile · 2017-02-23T17:54:26Z

@hvanhovell Yes! This is caused by a partial sort plus duplicate values in the sorted columns. Since all these result sets are pretty small, maybe another idea is to sort the whole result when we find duplicate values in the sorted columns?

robbinspg · 2017-02-24T09:16:50Z

@hvanhovell @gatorsmile I agree that would be a better solution; however, being unfamiliar with this code, I don't know how to achieve that.

gatorsmile · 2017-02-26T06:00:37Z

@hvanhovell I gave it a try and realized it might be hard to automatically detect whether a sort is complete while still respecting the predefined sort order. The ORDER BY clauses can be complex expressions, and the output attributes of a query might be aliases of complex expressions too. So maybe we can keep the existing conservative approach (i.e., as long as the test query specifies an ORDER BY clause, we expect the query result to have a deterministic order).

robbinspg · 2017-02-27T11:44:00Z

@gatorsmile I'm glad it wasn't just me that found it complex ;-)

I've modified the patch to remove an unnecessary change as that query was not ordered and the test suite code handles that case.

robbinspg · 2017-02-27T12:55:55Z

Jenkins retest please

SparkQA · 2017-02-27T18:43:35Z

Test build #73508 has finished for PR 17039 at commit 4a4d7ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-02-27T18:50:50Z

The major issue is that we do not know the original intention of the user's query. The query might purposely check whether the result set is sorted. Thus, the existing test suite design is conservative and avoids adding any sort as long as the user specifies an ORDER BY clause. For example,

SELECT c1, c2, sum(c1) FROM tab1 GROUP BY c1, c2 ORDER BY c1, c2

In the above example, although the order by clause does not contain all the columns, the result set is always sorted. Thus, our test suite should not sort it.
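The claim above can be sketched with plain Scala collections, as a model of the query rather than Spark itself. The table contents and the `query` helper are hypothetical. Because grouping leaves exactly one row per distinct `(c1, c2)`, the sort keys are unique and no two rows can tie.

```scala
object GroupThenOrderSketch {
  // Hypothetical stand-in for tab1: each pair is (c1, c2).
  val tab1: Seq[(Int, Int)] = Seq((1, 2), (1, 1), (1, 2), (2, 1))

  // Models: SELECT c1, c2, sum(c1) FROM tab1 GROUP BY c1, c2 ORDER BY c1, c2
  def query(rows: Seq[(Int, Int)]): Seq[(Int, Int, Int)] =
    rows
      .groupBy(identity) // one group per distinct (c1, c2)
      .toSeq
      .map { case ((c1, c2), group) => (c1, c2, group.map(_._1).sum) }
      .sortBy { case (c1, c2, _) => (c1, c2) } // keys are unique, so no ties
}
```

Feeding the rows in a different order, for example `tab1.reverse`, gives the same result, which is why no extra sorting is needed for a query of this shape.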

hvanhovell · 2017-02-27T19:03:30Z

How about a more pragmatic approach? I think relational algebra only guarantees ordering when an ORDER BY is the top-level operation. Why not just check for that and, if we find one, add all output columns to the ORDER BY? In all other cases I would just use the current result sorting mechanism.

gatorsmile · 2017-02-28T00:25:11Z

@hvanhovell Is it possible that the SQL queries are used to verify the behavior of ORDER BY? Do you think we should explicitly leave a comment saying that SQLQueryTestSuite will not be used for this purpose?

robbinspg · 2017-02-28T10:46:18Z

I think that the current "order if not currently ordered" in the test suite is good for checking the set of results for unordered queries.

If a query is ordered at all, then its results should be deterministic, given that the input data and the query are part of the test; otherwise it is a bad test. So I think this PR is the way to go.

hvanhovell · 2017-02-28T12:34:50Z

I have just taken a look at this. Modifying the sort does not really help. I am fine with this approach.

LGTM

hvanhovell · 2017-02-28T13:38:50Z

A small follow-up (I got the code to run). We could use the following in SQLQueryTestSuite.getNormalizedResult:

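      val baseDf = session.sql(sql)
      // If the top-level operator is a global Sort, append every output column to its
      // ordering so that rows tied on the original sort keys come out deterministically.
      val (df, isSorted) = baseDf.logicalPlan match {
        case Sort(ordering, true, child) =>
          val sort = Sort(ordering ++ child.output.map(SortOrder(_, Ascending)), true, child)
          (Dataset.ofRows(session, sort), true)
        case _ =>
          (baseDf, false)
      }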

      val schema = df.schema
      // Get answer, but also get rid of the #1234 expression ids that show up in explain plans
      val answer = df.queryExecution.hiveResultString().map(_.replaceAll("#\\d+", "#x"))

      // If the output is not pre-sorted, sort it.
      if (isSorted) (schema, answer) else (schema, answer.sorted)

If I run this, it also changes results in the following files:

group-analytics.sql.out
in-set-operations.sql.out
order-by-nulls-ordering.sql.out

robbinspg · 2017-03-01T13:59:42Z

@hvanhovell So I backed out the changes in this PR, implemented your change to SQLQueryTestSuite.getNormalizedResult, regenerated the golden results files and the tests all pass on my x86 and big endian platforms.

Results files that were changed:
sql/core/src/test/resources/sql-tests/results/group-analytics.sql.out
sql/core/src/test/resources/sql-tests/results/order-by-nulls-ordering.sql.out
sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-joins.sql.out
sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-order-by.sql.out
sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-set-operations.sql.out
sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/not-in-joins.sql.out

So, should I abandon this PR and go with your solution? I can submit your change in a PR along with updated results files if you want.

edit: I'm not sure the output changes are correct. E.g., query 21 in group-analytics.sql.out looked like good output prior to your change (i.e., it matched the query ordering), but the regenerated file doesn't look like a valid result of the query.

hvanhovell · 2017-03-03T14:21:29Z

@robbinspg yeah let's use the approach that I suggested. That at least makes sure we won't have to fix this again.

robbinspg · 2017-03-03T15:08:27Z

ok so here is an example of output I'm not sure is correct:

in-order-by

-- !query 17
SELECT Count(DISTINCT( t1a )),
       t1b
FROM   t1
WHERE  t1h NOT IN (SELECT t2h
                   FROM   t2
                   where  t1a = t2a
                   order by t2d DESC nulls first
                   )
GROUP  BY t1a,
          t1b
ORDER  BY t1b DESC nulls last
-- !query 17 schema
struct<count(DISTINCT t1a):bigint,t1b:smallint>
-- !query 17 output
1 10
1 10
1 16
1 6
1 8
1 NULL

That is the "new" output with your change, but it doesn't match what you'd expect from that query (it isn't ordered by t1b DESC), which would be:

1 16
1 10
1 10
1 8
1 6
1 NULL
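One plausible reading of the "new" output, sketched in plain Scala rather than Spark: if the suite treats the result as not pre-sorted, it falls back to sorting the rendered rows as strings (the `answer.sorted` branch in the snippet earlier in the thread), and a lexicographic sort of the expected rows yields exactly the order shown above. The tab separator between columns is an assumption about how rows are rendered.

```scala
object StringSortSketch {
  // Rows in the order the query asks for: t1b DESC NULLS LAST.
  // Assumption: columns are rendered with a tab between them.
  val expected: Seq[String] =
    Seq("1\t16", "1\t10", "1\t10", "1\t8", "1\t6", "1\tNULL")

  // Lexicographic string sort, as in the suite's fallback path.
  // Digits compare character by character, so "16" sorts before "6",
  // and digits sort before the uppercase "N" of NULL.
  val fallback: Seq[String] = expected.sorted
}
```

Here `fallback` comes out as 10, 10, 16, 6, 8, NULL in the second column, which matches the regenerated golden output and not the ordering requested by the query.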

hvanhovell · 2017-03-03T15:30:52Z

Hmmmm - apparently the ORDER BY is not the top node in that plan... Lemme check

robbinspg · 2017-03-03T15:35:15Z

ok there were a couple of similar issues such as in-set-operations query 9, group-analytics.sql.out query 21 and 22

hvanhovell · 2017-03-03T15:52:41Z

The code didn't respect the ordering because the analyzer adds a project on top; this is for cases where you sort by a column that is not in the final projection. We could also pattern match on that, but I don't think we should be adding too much magic.
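The point about the extra project node can be shown with a toy model. The plan classes below are hypothetical and heavily simplified; they are not Catalyst's real classes or signatures. The sketch only shows why a match on a top-level sort misses a sort wrapped in a project, and what an extra pattern for that case might look like.

```scala
object PlanMatchSketch {
  // Hypothetical, simplified plan nodes for illustration only.
  sealed trait Plan
  final case class Scan(table: String) extends Plan
  final case class Sort(keys: Seq[String], child: Plan) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan

  // Mirrors the check discussed in the thread: only a Sort at the very top counts.
  def topLevelSortOnly(plan: Plan): Boolean = plan match {
    case Sort(_, _) => true
    case _          => false
  }

  // Variant that also looks through a single Project above the Sort.
  def sortUnderProject(plan: Plan): Boolean = plan match {
    case Sort(_, _)             => true
    case Project(_, Sort(_, _)) => true
    case _                      => false
  }

  // Shape produced when the sort column is not in the final projection.
  val wrapped: Plan = Project(Seq("a"), Sort(Seq("b"), Scan("t")))
}
```

With `wrapped`, `topLevelSortOnly` returns false, so the result falls through to the string-sorting path, while `sortUnderProject` returns true. Each additional plan shape needs its own pattern, which is the "magic" the comment above prefers to avoid.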

Let me merge this, and we can pick this up if it ever becomes an issue again.

hvanhovell · 2017-03-03T15:52:52Z

LGTM - merging to master! Thanks!

robbinspg added 12 commits December 21, 2016 10:06

o.a.s.unsafe.types.UTF8StringSuite.writeToOutputStreamIntArray test fails on big endian. Only change byte order on little endian

fbc46a6

Simplify setting of byte order

30e20be

Merge branch 'master' of https://github.com/apache/spark.git

145c76a

Merge branch 'master' of https://github.com/apache/spark.git

f0e77f2

remove redundant comment

1bc1adf

Merge branch 'master' of https://github.com/apache/spark.git

ea259fc

Merge branch 'master' of https://github.com/apache/spark.git

f4b76a7

Merge branch 'master' of https://github.com/apache/spark.git

b5571ea

Merge branch 'master' of https://github.com/apache/spark.git

1917773

Merge branch 'master' of https://github.com/apache/spark.git

a832b74

Merge branch 'master' of https://github.com/apache/spark.git

bafe31c

Update tests to produce reliably ordered results

950415f

robbinspg changed the title ~~[SPARK-19710] Fix ordering of rows in query results~~ [SPARK-19710] [SQL] Fix ordering of rows in query results Feb 23, 2017

robbinspg changed the title ~~[SPARK-19710] [SQL] Fix ordering of rows in query results~~ [SPARK-19710][SQL][TESTS] Fix ordering of rows in query results Feb 23, 2017

robbinspg mentioned this pull request Feb 23, 2017

[SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN subquery 3rd batch #16841

Closed

srowen reviewed Feb 23, 2017

remove unnecessary ordering

4a4d7ad

asfgit closed this in 37a1c0e Mar 3, 2017
