-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-19710][SQL][TESTS] Fix ordering of rows in query results #17039
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fails on big endian. Only change byte order on little endian
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable but CC @kevinyu98
|
Test build #73349 has finished for PR 17039 at commit
|
|
@robbinspg |
|
@hvanhovell Yes! This is caused by partial sort + duplication values on the sorted columns. Since all these result sets are pretty small, maybe another idea is to sort the whole result when we found duplication values on the sorted columns? |
|
@hvanhovell @gatorsmile I agree that would be a better solution however I don't know how to achieve that being unfamiliar with this code. |
|
@hvanhovell I made a try and just realized it might be hard to automatically detect whether it is a complete sort and respect the predefined sort order. The order by clauses could be complex expressions and the output attributes of queries might be the alias of complex expressions too. Thus, maybe we still can keep the existing conservative way (i.e., as long as the test query specifies the order-by clauses, we expect the query result have deterministic orders) |
|
@gatorsmile I'm glad it wasn't just me that found it complex ;-) I've modified the patch to remove an unnecessary change as that query was not ordered and the test suite code handles that case. |
|
Jenkins retest please |
|
Test build #73508 has finished for PR 17039 at commit
|
|
The major issue is we do not know the original intention of users' query. The query might purposely check whether the result set is sorted or not. Thus, the existing test suite design is conservative to avoid adding any sort as long as users specify the ORDER BY clause. For example, SELECT c1, c2, sum(c1) FROM tab1 GROUP BY c1, c2 ORDER BY c1, c2In the above example, although the order by clause does not contain all the columns, the result set is always sorted. Thus, our test suite should not sort it. |
|
How about the more pragmatic approach. I think relation algebra only guarantees ordering when an order by is the top level operation. Why not just check that, and if we find one, add all output columns to the order by? In all other cases I would just use the current result sorting mechanism. |
|
@hvanhovell Is that possible the SQL queries are used to verify the behavior of ORDER BY? Do you think we should explicitly leave a comment to say SQLQueryTestSuite will not be used for this goal? |
|
I think that the current "order if not currently ordered" in the test suite is good for checking the set of results for unordered queries. If ordered at all then the results should be deterministic given the input data and query are part of the test otherwise it is a bad test. So... I think this PR is the way to go. |
|
I have just taken a look at this. Modifying the sort does not really help. I am fine with this approach. LGTM |
|
A small follow-up (I got the code to run). We could use the following in val baseDf = session.sql(sql)
val (df, isSorted) = baseDf.logicalPlan match {
case Sort(ordering, true, child) =>
val sort = Sort(ordering ++ child.output.map(SortOrder(_, Ascending)), true, child)
(Dataset.ofRows(session, sort), true)
case _ =>
(baseDf, false)
}
val schema = df.schema
// Get answer, but also get rid of the #1234 expression ids that show up in explain plans
val answer = df.queryExecution.hiveResultString().map(_.replaceAll("#\\d+", "#x"))
// If the output is not pre-sorted, sort it.
if (isSorted) (schema, answer) else (schema, answer.sorted)If I run this, it also changes results in the following files:
|
|
@hvanhovell So I backed out the changes in this PR, implemented your change to SQLQueryTestSuite.getNormalizedResult, regenerated the golden results files and the tests all pass on my x86 and big endian platforms. results files that were changed: So, should I abandon this PR and go with your solution? I can submit your change in a PR along with updated results files if you want. edit: I'm not sure the output changes are correct. eg group-analytics.sql.out query 21 looked like good output prior to your change (ie matched the query ordering) but the regenerated file doesn't look like it is a valid resutl from the query |
|
@robbinspg yeah let's use the approach that I suggested. That at least makes sure we won't have to fix this again. |
|
ok so here is an example of output I'm not sure is correct: in-order-by -- !query 17 That is the "new" output with your change but it doesn't actually match what you'd expect from that query (it isn't t1b DESC) which would be 1 16 |
|
Hmmmm - apparently the ORDER BY is not the top node in that plan... Lemme check |
|
ok there were a couple of similar issues such as in-set-operations query 9, group-analytics.sql.out query 21 and 22 |
|
The code didn't respect the ordering because the analyzer adds a project on top; this is for cases where you sort by a column that is not in the final projection. We could also pattern match on that, but I don't think we should be adding too much magic. Let me merge this, and we can pick this up if it ever becomes an issue again. |
|
LGTM - merging to master! Thanks! |
What changes were proposed in this pull request?
Changes to SQLQueryTests to make the order of the results constant.
Where possible ORDER BY has been added to match the existing expected output
How was this patch tested?
Test runs on x86, zLinux (big endian), ppc (big endian)