[Postgres-xl-developers] cte bug

Tomas Vondra tomas.vondra at 2ndquadrant.com
Wed Jun 21 12:12:19 PDT 2017


Hello senhu,

Thank you for looking at this issue.

I fixed some related issues about a month ago, see commit 8fe2cc6647 (I 
think the analysis there mostly matches with your conclusions).

On 6/21/17 9:10 AM, senhu(胡森) wrote:
> Hello, all:
> 
>   When I run the with case of regression test, I get some errors about 
> ‘with recursive’, here is the sql:
> 
> with recursive q as (
> 
>        select * from department
> 
>      union all
> 
>        (with recursive x as (
> 
>             select * from department
> 
>           union all
> 
>             (select * from q union all select * from x)
> 
>          )
> 
>         select * from x)
> 
>      )
> 
> select * from q order by 1, 2, 3 limit 32;
> 
...
> Error message:
> 
>   unrecognized node type: 120
> 

FWIW this error message is caused by execRemote.c not knowing how to 
handle T_WorkTableScan. And XL 9.5 has the same issue, but it's masked 
by including that as expected output for the regression test :-(

> With the remote_subquery part(marked on red), there are two 
> worktablecans below the Recursive Union, one is on cte x, the other is 
> on cte q. when executing the query on remote, worktablescan on x is ok, 
> but worktablescan on q can not be executed, because there is no 
> worktable for that scan, and the query will fail.
> 

Yeah. The trouble is that there are two RecursiveUnion nodes, and both 
WorkTables are defined at the top one. But there's a RemoteQuery in 
between those two RecursiveUnion nodes, so the inner query can't access 
the WorkTable.

Which is why fixing the "unrecognized node type" error is not enough, 
because that just uncovers this problem.


> To fix this problem, I think the correct plan shoule be like 
> this(Recursive Union part):
> 
> ->  Recursive Union  (cost=0.00..9646.00 rows=241200 width=40)
> 
>                                    ->  Remote Subquery Scan on all 
> (datanode_1)  (cost=0.00..9646.00 rows=241200 width=40)
> 
>                                       ->  Seq Scan on department 
>   (cost=0.00..22.00 rows=1200 width=40)
> 
>                                    ->  Append  (cost=0.00..720.00 
> rows=24000 width=40)
> 
>                                       ->  WorkTable Scan on q q_1 
>   (cost=0.00..240.00 rows=12000 width=40)
> 
>                                       ->  WorkTable Scan on x 
>   (cost=0.00..240.00 rows=12000 width=40)
> 
> If Recursive Union has worktablscan, we have to make sure the Recursive 
> Union has all worktables which the worktablescans need. In this case, we 
> can add remote_subquery to Recursive Union, or just send the subquery to 
> remote.
> 

I haven't looked at the code very closely, but I can confirm that with 
this patch the query produces the correct output (almost - I'll get back 
to this in a minute).

But I'm wondering why we need the inner remote query at all?

Consider that we only allow recursive queries when all the tables are 
replicated (at least that's my understanding of the check at the end of 
subquery_planner). So why can't we simply take the regular PostgreSQL 
plan and put a single RemoteSubquery at the top?

I mean, why not to generate a plan like this?

                             QUERY PLAN
     -----------------------------------------------------------------
      Limit
        CTE q
          ->  Remote Subquery Scan on all (datanode1)
                ->  Recursive Union
                      ->  Seq Scan on department department_1
                      ->  CTE Scan on x x_1
                            CTE x
                              ->  Recursive Union
                                    ->  Seq Scan on department
                                    ->  Append
                                          ->  WorkTable Scan on q q_1
                                          ->  WorkTable Scan on x
        ->  Sort
              Sort Key: q.id, q.parent_department, q.name
              ->  CTE Scan on q
     (14 rows)

The extra recursive queries might be required for tables with different 
distributed strategies (and not replicated), but that's not supported at 
all at this point. (I'm not against allowing recursive queries on all 
tables, but that's not what this patch is doing).

It may also be required because the tables might be replicated on 
different groups of nodes, not necessarily on the same one?

Now, regarding the "correct output" - with the current query text, the 
'with' regression test runs for an extremely long time for me and 
eventually fails due to filling temporary space (which is used by the 
worktable tuplestores).

That seems to be caused by the extra "ORDER BY 1, 2, 3" clause, which 
was probably added to make the output deterministic (it's in XL 9.5 too, 
but as this query fails it does not affect the test duration).

With the patch it however matters, because it means we have to fetch all 
the rows to sort them, and there are infinitely many of them. But even 
without the ORDER BY the output seems to be deterministic, probably due 
to the tables being replicated. So I think it's safe to remove it, which 
solves the issue with the test duration.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


More information about the Postgres-xl-developers mailing list