[Postgres-xl-developers] Current state of the Postgres-XL 10 code (in terms of regression failures)

Tomas Vondra tomas.vondra at 2ndquadrant.com
Fri Oct 20 13:37:51 PDT 2017


Hi all,

Over the past few months we've been working on merging PostgreSQL 9.6
and 10 into Postgres-XL. Let me share a brief summary of the progress,
relying on some basic metrics from regression tests in the "master"
branch (which is where the 9.6 and 10 merge happens).

FWIW this overview may also be a good starting point for those of you
who are considering helping the XL community to get XL 10 out.


make check
==========

At this point we get about 11 failing tests, producing regression.diffs
of about 500 lines. Compared to the ~70 failing tests and ~30k lines we
started with, this shows fairly significant progress.

Moreover, while those issues certainly need to be fixed, none of them
seems fatal (i.e. no crashes or risk of data corruption).

Let's quickly walk through the failing tests and discuss the possible
root causes of the failures:

1) inherit

- We have a few failures due to a PRIMARY KEY constraint conflicting with
the table distribution strategy. This is not really a bug, and it should
be enough to replicate the table (or something like that) - see the
example after this list.

- Error, most likely due to sending an anonymous composite type to
another node as part of the remote plan.

  ERROR:  input of anonymous composite types is not implemented

- Plan change to Merge Append with nested Remote Subquery paths. Not
sure what the root cause is here, and it might be a legitimate change.
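
To illustrate the first item, replicating the table would look something
like this (table and column names are made up for the example):

  -- with a replicated table every datanode has all rows, so the
  -- PRIMARY KEY can be enforced locally
  CREATE TABLE inherit_parent (
      id   int PRIMARY KEY,
      val  text
  ) DISTRIBUTE BY REPLICATION;

On a hash-distributed table the unique index backing the PRIMARY KEY
generally has to include the distribution column, which is what the test
trips over.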

2) brin

- brin_summarize_range() is not distributed to the datanodes, so the
reported numbers are from the coordinator. One solution would be to
distribute it internally (so that it executes on the datanodes), or we
may keep it the way it is and run it using EXECUTE DIRECT.
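
If we go the EXECUTE DIRECT route, the idea is to run the function
directly on each datanode, something like this (node and index names are
just examples):

  EXECUTE DIRECT ON (datanode_1)
    'SELECT brin_summarize_range(''brin_test_idx'', 1)';

so the summarization (and the reported numbers) come from the node that
actually holds the data.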

3) privileges

- I'm not sure what's happening here, but it might be an issue with SET
SESSION AUTHORIZATION on pooled connections, or something like that.

4) collate

- The root cause of the failure is that XL fails to detect an issue with
collations in this CTAS command, and simply creates the table:

  CREATE TABLE test_u AS SELECT a, b FROM collate_test1
  UNION ALL SELECT a, b FROM collate_test2;

On upstream this fails because it can't determine the collation for "b",
so we get an extra table on XL and then a different number of objects in
DROP SCHEMA ... CASCADE.

We should detect the collation issue and throw an error instead of
creating the table. The behavior difference is due to XL transforming
the CTAS into two commands, which then use a different code path.
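
For context, the rewrite turns the CTAS into roughly this pair of
statements (a simplified sketch - the real transformation derives the
column definitions from the query):

  CREATE TABLE test_u (a int, b text);
  INSERT INTO test_u
    SELECT a, b FROM collate_test1
    UNION ALL SELECT a, b FROM collate_test2;

and the collation conflict that upstream reports for the single CTAS
apparently isn't raised on this path.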

5) tsrf

- Firstly, we're getting an error about "missing ORDER/GROUP BY
references". senhu already submitted a patch fixing this, but we're
looking into whether there is a better way to fix it.

- Secondly, we're getting this error

  ERROR:  set-valued function called in context that cannot accept a set

for queries like "INSERT INTO t VALUES (generate_series(1,10))". Not
sure what the cause is, we have to investigate.
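
For reference, the equivalent SELECT form is the usual way to express
this without a set-returning function in the VALUES list (this is just
the standard rewrite, not a proposed fix):

  INSERT INTO t SELECT generate_series(1, 10);

Upstream accepts the VALUES form and expands it into multiple rows
(that's what the test expects), so this is a case where XL rejects
something PostgreSQL accepts.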

6) select_parallel

The failures are caused by an upstream bug with handling roles in
parallel workers, see:

https://www.postgresql.org/message-id/CABOikdOomRcZsLsLK%2BZ%2BqENM1zxyaWnAvFh3MJZzZnnKiF%2BREg%40mail.gmail.com

7) combocid

Apparently there's a subtle difference in how DECLARE + DELETE + FETCH
behaves. On upstream the FETCH will see all rows (including those
deleted by the second command), while on XL we'll see nothing.
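
The pattern in question looks roughly like this (table name is just
illustrative, not the exact test script):

  BEGIN;
  DECLARE c CURSOR FOR SELECT * FROM combocidtest;
  DELETE FROM combocidtest;
  FETCH ALL FROM c;
  COMMIT;

Upstream the cursor uses the snapshot taken at DECLARE time, so the
FETCH still returns the rows deleted by the later command; on XL the
FETCH comes back empty.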

8) limit

The results are unstable because the datanodes call nextval() in a
different order on each invocation. So this failure is expected, and we
need to modify the test somehow to fix it.

9) sequence

We're getting unexpected (early) failures when the sequence reaches
MINVALUE or MAXVALUE.
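
A minimal way to see the boundary behavior (just an illustration, not
the actual test):

  CREATE SEQUENCE seqtest MAXVALUE 3;
  SELECT nextval('seqtest');   -- 1
  SELECT nextval('seqtest');   -- 2
  SELECT nextval('seqtest');   -- 3
  SELECT nextval('seqtest');   -- upstream errors here ("reached maximum
                               -- value of sequence"); on XL the failure
                               -- shows up earlier than expected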

10) with

Recursive CTEs on replicated tables, failing because the remote
execution code (execRemote) expects a WorkTable node. I believe the best
fix is to simply push the whole CTE down once (i.e. there should be no
Remote Subquery).
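
The failing queries are ordinary recursive CTEs over a replicated table,
something along these lines (simplified, table name is made up):

  WITH RECURSIVE t(n) AS (
      SELECT id FROM replicated_tbl WHERE id = 1
    UNION ALL
      SELECT n + 1 FROM t WHERE n < 100
  )
  SELECT sum(n) FROM t;

Since the table is replicated, the whole query could run on a single
datanode; splitting it with a Remote Subquery presumably separates the
recursive part from the WorkTable scan it depends on, hence the error.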

11) stats

Minor differences in reported statistics, not sure what the cause is
(likely a coordinator/datanode difference).


make check (contrib)
====================

I've also made some basic analysis of failures in regression tests for
contrib modules.

A bunch of tests fail because the contrib modules rely on features that
are unsupported on XL - that's the case for dblink, file_fdw and
postgres_fdw (all FDW-related), and test_decoding (which requires
logical replication slots).

Then there's a handful of failures, with various causes:

1) contrib/btree_gin

Strange issue in the "date" test, where we get "2004-10-26" instead of
"10-26-2004" for some reason (possibly due to some confusion between the
coordinator and datanodes about the date format).
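
For context, the expected string corresponds to the non-ISO DateStyle
the regression tests run with ("Postgres, MDY"), while what we get is
plain ISO:

  SET datestyle = 'Postgres, MDY';
  SELECT date '2004-10-26';   -- 10-26-2004
  SET datestyle = 'ISO, MDY';
  SELECT date '2004-10-26';   -- 2004-10-26

which is why a coordinator/datanode mismatch in that setting is a
plausible explanation.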

2) contrib/btree_gist

The "oid" fails with some strange result differences, apparently with
"money" data type. Not sure what the root cause.

The "timetz" test fails due to not being able to distribute an update of
distribution key (simple fix: make table replicated or distribute by
another column).
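
For illustration, the failing pattern boils down to updating the column
the table is distributed on (the names and the explicit DISTRIBUTE BY
clause are just for the example):

  CREATE TABLE timetz_tbl (a timetz, b int) DISTRIBUTE BY HASH (a);
  UPDATE timetz_tbl SET a = a + interval '1 hour';  -- rejected on XL

Distributing by "b" instead, or using DISTRIBUTE BY REPLICATION, avoids
the restriction.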

3) contrib/citext

REFRESH MATERIALIZED VIEW fails because of a different pg_temp_N name.

4) contrib/cube

We're getting different output than the PostgreSQL test expects, but
this seems to be an upstream bug (and XL actually seems to produce the
correct output). See:

https://www.postgresql.org/message-id/a9657f6a-b497-36ff-e569-482a2c7e3292%402ndquadrant.com

5) contrib/pageinspect

We get different results from the pageinspect queries, because the
functions get executed on the coordinator (where we have no data), and
not on the datanodes. We should probably make the tables replicated and
execute the functions using EXECUTE DIRECT to get the same behavior as
on PostgreSQL.

6) contrib/pg_stat_statements

ERROR:  set-valued function called in context that cannot accept a set

This is the same failure as in the "tsrf" regression test, and it
affects some subsequent queries (as it affects "rows" in
pg_stat_statements).

7) contrib/pg_trgm

Plan difference (Bitmap Index Scan -> Index Scan). This seems to be some
sort of estimation/costing issue, not sure.

8) contrib/pg_visibility

Similar to contrib/pageinspect, i.e. the functions get executed on the
coordinator. Making the tables replicated and doing EXECUTE DIRECT seems
like the simplest fix.

9) contrib/tsm_system_rows and contrib/tsm_system_time

We seem to pass the sampling parameters to the datanodes directly, thus
getting a combination of sampled rows (and more than we expect). Not
sure what to do about this - perhaps we need to do the sampling in two
phases: first on the datanodes, and then a second pass on the
coordinator (taking the per-datanode reltuples into account).
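
To make the issue concrete: with tsm_system_rows the parameter is an
absolute row count, so passing it through unchanged multiplies the
sample by the number of datanodes (the table name and counts below are
just illustrative, assuming two datanodes):

  CREATE EXTENSION tsm_system_rows;
  SELECT count(*) FROM sample_tbl TABLESAMPLE SYSTEM_ROWS(100);
  -- upstream: 100 rows sampled
  -- XL: each datanode samples 100 rows, so roughly 200 come back

The two-phase idea above would sample per datanode first and then thin
the combined result on the coordinator.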



kind regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

