[Postgres-xl-developers] Fix ordering of results for queries with GROUP BY and ORDER BY clauses
Tomas Vondra
tomas.vondra at 2ndquadrant.com
Sun Jul 30 18:07:29 PDT 2017
Hello,
The first pair of patches fixes an issue with queries producing results
with incorrect ordering. For example given a query like this
SELECT count(*) FROM xc_groupby_def GROUP BY a ORDER BY 1;
the master branch produces a plan like this:
QUERY PLAN
------------------------------------------------------------
Remote Subquery Scan on all
Output: count(*), a
Sort Key: xc_groupby_def.a
-> Sort
Output: (count(*)), a
Sort Key: (count(*))
-> HashAggregate
Output: count(*), a
Group Key: xc_groupby_def.a
-> Seq Scan on public.xc_groupby_def
Output: a, b
But that plan is incorrect, as the "ORDER BY 1" means we should be
sorting by "count(*)", but the Remote Subquery Scan is clearly sorting
by xc_groupby_def.a.
Interestingly enough, the Sort node uses the right Sort key, which
however makes the things worse as the Remote Subquery Scan does merge
sort, assuming input from remote nodes is sorted accordingly. As that's
not true in this case, the output is likely not sorted at all.
Turns out, this happens due to a confusion between grouping keys and
sort keys, as the top-level Remote Subquery Scan is added in
standard_planner like this:
top_plan = create_plan(root, best_path);
if (root->distribution)
{
top_plan = (Plan *) make_remotesubplan(root, top_plan, NULL,
root->distribution,
root->query_pathkeys);
}
and root->query_pathkeys are set to root->group_pathkeys (which is done
by standard_qp_callback). This explains why the top-level sort key is
set to "xc_groupby_def.a" in the example plan, as that's precisely the
group by key.
This code worked fine in older PostgreSQL releases, but it seems the
upper-planner pathification changed the behavior a bit by removing a bit
of code from grouping_planner(), so that got broken.
I think the fix is as simple as using root->sort_pathkeys when adding
the Remote Subquery Scan node, and it seems to do the trick. There are
two or three more make_remotesubplan() calls using query_pathkeys, but
it seems those are still working fine, so I haven't changed those.
I have also considered using best_path->pathkeys instead, which would
work too, as the planner makes sure the best path produces properly
sorted results. But it's not quite right, as best_path->pathkeys is a
feature of a path, but we're interested in the sorting required by the
query. And we can produce best_path that is sorted by accident, not
because the user asked for it. Using best_path->pathkeys would make us
do an unnecessary sort, wasting CPU resources.
As an example, consider a simple GROUP BY query:
SELECT a, COUNT(*) FROM t GROUP BY t;
and let's assume it gets evaluated using a Group Aggregate, which
produces sorted output. But the user did not ask for sorting (there's no
GROUP BY clause), so we may do this:
Remote Subquery Scan on all
Output: a, count(*)
-> GroupAggregate
Output: a, count(*)
Group Key: a
-> Sort
Output: a
Sort Key: a
-> Seq Scan on t
Output: a
i.e. we ignore the ordering implied by GroupAggregate. Had we used
best_path->pathkeys, we'd do this
Remote Subquery Scan on all
Output: a, count(*)
Sort Key: a
-> GroupAggregate
Output: a, count(*)
Group Key: a
-> ...
I agree the additional overhead is likely quite small, but it's still
something we don't need to do.
This probably also means the fact that create_remotesubplan_path()
automatically inherits pathkeys from the subpath is not quite correct.
What we should do instead is generating two paths - one that preserves
the sorting and one that does not (eliminating the extra cost). That is
something we should fix eventually, I guess. But it would be a bit silly
to do the same mistake on another place knowingly.
The first patch fixes the standard_planner(), and a bunch of aggregate
queries that were actually affected by this. We knew some of those plans
were wrong (which is why we investigated the issue), but there were also
a few plans that we thought were correct and turns out we were wrong.
The second patch fixes a few additional queries where the ordering of
results got unpredictable, because those queries only had GROUP BY
clause but no ORDER BY clause.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Use-sort_pathkeys-instead-of-query_pathkeys-in-stand.patch
Type: text/x-patch
Size: 25761 bytes
Desc: not available
URL: <http://lists.postgres-xl.org/private.cgi/postgres-xl-developers-postgres-xl.org/attachments/20170731/84883e3f/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Stabilize-result-ordering-broken-by-800066edef31.patch
Type: text/x-patch
Size: 9997 bytes
Desc: not available
URL: <http://lists.postgres-xl.org/private.cgi/postgres-xl-developers-postgres-xl.org/attachments/20170731/84883e3f/attachment-0003.bin>
More information about the Postgres-xl-developers
mailing list