[Postgres-xl-developers] Fix ordering of results for queries with GROUP BY and ORDER BY clauses

Tomas Vondra tomas.vondra at 2ndquadrant.com
Sun Jul 30 18:07:29 PDT 2017


Hello,

The first pair of patches fixes an issue with queries producing results 
with incorrect ordering. For example given a query like this

     SELECT count(*) FROM xc_groupby_def GROUP BY a ORDER BY 1;

the master branch produces a plan like this:

                              QUERY PLAN
     ------------------------------------------------------------
      Remote Subquery Scan on all
        Output: count(*), a
        Sort Key: xc_groupby_def.a
        ->  Sort
              Output: (count(*)), a
              Sort Key: (count(*))
              ->  HashAggregate
                    Output: count(*), a
                    Group Key: xc_groupby_def.a
                    ->  Seq Scan on public.xc_groupby_def
                          Output: a, b

But that plan is incorrect, as the "ORDER BY 1" means we should be 
sorting by "count(*)", but the Remote Subquery Scan is clearly sorting 
by xc_groupby_def.a.

Interestingly enough, the Sort node uses the right Sort key, which 
however makes the things worse as the Remote Subquery Scan does merge 
sort, assuming input from remote nodes is sorted accordingly. As that's 
not true in this case, the output is likely not sorted at all.

Turns out, this happens due to a confusion between grouping keys and 
sort keys, as the top-level Remote Subquery Scan is added in 
standard_planner like this:

     top_plan = create_plan(root, best_path);

     if (root->distribution)
     {
         top_plan = (Plan *) make_remotesubplan(root, top_plan, NULL,
                                                root->distribution,
                                                root->query_pathkeys);
     }

and root->query_pathkeys are set to root->group_pathkeys (which is done 
by standard_qp_callback). This explains why the top-level sort key is 
set to "xc_groupby_def.a" in the example plan, as that's precisely the 
group by key.

This code worked fine in older PostgreSQL releases, but it seems the 
upper-planner pathification changed the behavior a bit by removing a bit 
of code from grouping_planner(), so that got broken.

I think the fix is as simple as using root->sort_pathkeys when adding 
the Remote Subquery Scan node, and it seems to do the trick. There are 
two or three more make_remotesubplan() calls using query_pathkeys, but 
it seems those are still working fine, so I haven't changed those.

I have also considered using best_path->pathkeys instead, which would 
work too, as the planner makes sure the best path produces properly 
sorted results. But it's not quite right, as best_path->pathkeys is a 
feature of a path, but we're interested in the sorting required by the 
query. And we can produce best_path that is sorted by accident, not 
because the user asked for it. Using best_path->pathkeys would make us 
do an unnecessary sort, wasting CPU resources.

As an example, consider a simple GROUP BY query:

     SELECT a, COUNT(*) FROM t GROUP BY t;

and let's assume it gets evaluated using a Group Aggregate, which 
produces sorted output. But the user did not ask for sorting (there's no 
GROUP BY clause), so we may do this:

      Remote Subquery Scan on all
        Output: a, count(*)
        ->  GroupAggregate
            Output: a, count(*)
            Group Key: a
            ->  Sort
                Output: a
                Sort Key: a
                ->  Seq Scan on t
                    Output: a

i.e. we ignore the ordering implied by GroupAggregate. Had we used 
best_path->pathkeys, we'd do this

      Remote Subquery Scan on all
        Output: a, count(*)
        Sort Key: a
        ->  GroupAggregate
            Output: a, count(*)
            Group Key: a
            -> ...

I agree the additional overhead is likely quite small, but it's still 
something we don't need to do.

This probably also means the fact that create_remotesubplan_path() 
automatically inherits pathkeys from the subpath is not quite correct. 
What we should do instead is generating two paths - one that preserves 
the sorting and one that does not (eliminating the extra cost). That is 
something we should fix eventually, I guess. But it would be a bit silly 
to do the same mistake on another place knowingly.

The first patch fixes the standard_planner(), and a bunch of aggregate 
queries that were actually affected by this. We knew some of those plans 
were wrong (which is why we investigated the issue), but there were also 
a few plans that we thought were correct and turns out we were wrong.

The second patch fixes a few additional queries where the ordering of 
results got unpredictable, because those queries only had GROUP BY 
clause but no ORDER BY clause.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Use-sort_pathkeys-instead-of-query_pathkeys-in-stand.patch
Type: text/x-patch
Size: 25761 bytes
Desc: not available
URL: <http://lists.postgres-xl.org/private.cgi/postgres-xl-developers-postgres-xl.org/attachments/20170731/84883e3f/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Stabilize-result-ordering-broken-by-800066edef31.patch
Type: text/x-patch
Size: 9997 bytes
Desc: not available
URL: <http://lists.postgres-xl.org/private.cgi/postgres-xl-developers-postgres-xl.org/attachments/20170731/84883e3f/attachment-0003.bin>


More information about the Postgres-xl-developers mailing list