Skip to content

Conversation

@robbinspg
Copy link
Member

What changes were proposed in this pull request?

Changes to SQLQueryTests to make the order of the results constant.
Where possible ORDER BY has been added to match the existing expected output

How was this patch tested?

Test runs on x86, zLinux (big endian), ppc (big endian)

@robbinspg robbinspg changed the title [SPARK-19710] Fix ordering of rows in query results [SPARK-19710] [SQL] Fix ordering of rows in query results Feb 23, 2017
@robbinspg robbinspg changed the title [SPARK-19710] [SQL] Fix ordering of rows in query results [SPARK-19710][SQL][TESTS] Fix ordering of rows in query results Feb 23, 2017
Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable but CC @kevinyu98

@SparkQA
Copy link

SparkQA commented Feb 23, 2017

Test build #73349 has finished for PR 17039 at commit 950415f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

@robbinspg SQLQueryTestQuite checks if the result is sorted, if it isn't the results get sorted. The problem with this sort check is that it does not check if the output is completely sorted, and thus causing test failures when rows have the same ordering. The current PR fixes issues with the current tests, but it does not guarantee that this won't happen again. Perhaps it is an idea to make a modification to the SQLQueryTestQuite.getNormalizedResult, that always does a complete sort (while respecting the predefined sort order)?

@gatorsmile
Copy link
Member

@hvanhovell Yes! This is caused by partial sort + duplication values on the sorted columns. Since all these result sets are pretty small, maybe another idea is to sort the whole result when we found duplication values on the sorted columns?

@robbinspg
Copy link
Member Author

@hvanhovell @gatorsmile I agree that would be a better solution however I don't know how to achieve that being unfamiliar with this code.

@gatorsmile
Copy link
Member

@hvanhovell I made a try and just realized it might be hard to automatically detect whether it is a complete sort and respect the predefined sort order. The order by clauses could be complex expressions and the output attributes of queries might be the alias of complex expressions too. Thus, maybe we still can keep the existing conservative way (i.e., as long as the test query specifies the order-by clauses, we expect the query result have deterministic orders)

@robbinspg
Copy link
Member Author

@gatorsmile I'm glad it wasn't just me that found it complex ;-)

I've modified the patch to remove an unnecessary change as that query was not ordered and the test suite code handles that case.

@robbinspg
Copy link
Member Author

Jenkins retest please

@SparkQA
Copy link

SparkQA commented Feb 27, 2017

Test build #73508 has finished for PR 17039 at commit 4a4d7ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

The major issue is we do not know the original intention of users' query. The query might purposely check whether the result set is sorted or not. Thus, the existing test suite design is conservative to avoid adding any sort as long as users specify the ORDER BY clause. For example,

SELECT c1, c2, sum(c1) FROM tab1 GROUP BY c1, c2 ORDER BY c1, c2

In the above example, although the order by clause does not contain all the columns, the result set is always sorted. Thus, our test suite should not sort it.

@hvanhovell
Copy link
Contributor

hvanhovell commented Feb 27, 2017

How about the more pragmatic approach. I think relation algebra only guarantees ordering when an order by is the top level operation. Why not just check that, and if we find one, add all output columns to the order by? In all other cases I would just use the current result sorting mechanism.

@gatorsmile
Copy link
Member

@hvanhovell Is that possible the SQL queries are used to verify the behavior of ORDER BY? Do you think we should explicitly leave a comment to say SQLQueryTestSuite will not be used for this goal?

@robbinspg
Copy link
Member Author

I think that the current "order if not currently ordered" in the test suite is good for checking the set of results for unordered queries.

If ordered at all then the results should be deterministic given the input data and query are part of the test otherwise it is a bad test. So... I think this PR is the way to go.

@hvanhovell
Copy link
Contributor

I have just taken a look at this. Modifying the sort does not really help. I am fine with this approach.

LGTM

@hvanhovell
Copy link
Contributor

A small follow-up (I got the code to run). We could use the following in SQLQueryTestSuite.getNormalizedResult:

      val baseDf = session.sql(sql)
      val (df, isSorted) = baseDf.logicalPlan match {
        case Sort(ordering, true, child) =>
          val sort = Sort(ordering ++ child.output.map(SortOrder(_, Ascending)), true, child)
          (Dataset.ofRows(session, sort), true)
        case _ =>
          (baseDf, false)
      }

      val schema = df.schema
      // Get answer, but also get rid of the #1234 expression ids that show up in explain plans
      val answer = df.queryExecution.hiveResultString().map(_.replaceAll("#\\d+", "#x"))

      // If the output is not pre-sorted, sort it.
      if (isSorted) (schema, answer) else (schema, answer.sorted)

If I run this, it also changes results in the following files:

  • group-analytics.sql.out
  • in-set-operations.sql.out
  • order-by-nulls-ordering.sql.out

@robbinspg
Copy link
Member Author

robbinspg commented Mar 1, 2017

@hvanhovell So I backed out the changes in this PR, implemented your change to SQLQueryTestSuite.getNormalizedResult, regenerated the golden results files and the tests all pass on my x86 and big endian platforms.

results files that were changed:
sql/core/src/test/resources/sql-tests/results/group-analytics.sql.out
sql/core/src/test/resources/sql-tests/results/order-by-nulls-ordering.sql.out
sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-joins.sql.out
sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-order-by.sql.out
sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-set-operations.sql.out
sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/not-in-joins.sql.out

So, should I abandon this PR and go with your solution? I can submit your change in a PR along with updated results files if you want.

edit: I'm not sure the output changes are correct. eg group-analytics.sql.out query 21 looked like good output prior to your change (ie matched the query ordering) but the regenerated file doesn't look like it is a valid resutl from the query

@hvanhovell
Copy link
Contributor

@robbinspg yeah let's use the approach that I suggested. That at least makes sure we won't have to fix this again.

@robbinspg
Copy link
Member Author

ok so here is an example of output I'm not sure is correct:

in-order-by

-- !query 17
SELECT Count(DISTINCT( t1a )),
t1b
FROM t1
WHERE t1h NOT IN (SELECT t2h
FROM t2
where t1a = t2a
order by t2d DESC nulls first
)
GROUP BY t1a,
t1b
ORDER BY t1b DESC nulls last
-- !query 17 schema
struct<count(DISTINCT t1a):bigint,t1b:smallint>
-- !query 17 output
1 10
1 10
1 16
1 6
1 8
1 NULL

That is the "new" output with your change but it doesn't actually match what you'd expect from that query (it isn't t1b DESC) which would be

1 16
1 10
1 10
1 8
1 6
1 NULL

@hvanhovell
Copy link
Contributor

Hmmmm - apparently the ORDER BY is not the top node in that plan... Lemme check

@robbinspg
Copy link
Member Author

ok there were a couple of similar issues such as in-set-operations query 9, group-analytics.sql.out query 21 and 22

@hvanhovell
Copy link
Contributor

The code didn't respect the ordering because the analyzer adds a project on top; this is for cases where you sort by a column that is not in the final projection. We could also pattern match on that, but I don't think we should be adding too much magic.

Let me merge this, and we can pick this up if it ever becomes an issue again.

@hvanhovell
Copy link
Contributor

LGTM - merging to master! Thanks!

@asfgit asfgit closed this in 37a1c0e Mar 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants