[Postgres-xl-developers] PG-XL is in deadlock w/ one backend PID holding two Transactions

Kelvin kelvin8077 at gmail.com
Wed Jun 13 19:09:03 PDT 2018


Hi,

After setting up an instance of PG-XL 9.5r1.6 which is from official
website, I run sysbench's read/write workload with 16 threads, and found
that the 16 threads hang with waiting for locks. This issue can be
reproduced with both Postgres-XL 9.5r1.6 and PG-XL 10. Could anyone please
let me know if this is a known issue and is there any workaround?

I looked into the lock table and found that one PID 11707 holds two
transaction ids.

[pgxl at sandbox ~]$ /usr/local/pgxl/bin/psql -U ads -d adsdb -p 15432
adsdb=> select * from pg_locks where granted ='f' order by pid;
   locktype    | database | relation | page | tuple | virtualxid |
transactionid | classid | objid | objsubid | virtualtransaction |  pid  |
      mode         | granted | fastpath
---------------+----------+----------+------+-------+-------
-----+---------------+---------+-------+----------+---------
-----------+-------+---------------------+---------+----------
transactionid |          |          |      |       |            |
176247 |         |       |          | 6/3269             | 11707 |
ShareLock           | f       | f
tuple         |    16388 |    16391 | 1546 |    26 |            |
     |         |       |          | 8/3288             | 11715 |
ExclusiveLock       | f       | f
tuple         |    16388 |    16391 | 1546 |    26 |            |
     |         |       |          | 9/3174             | 11719 |
AccessExclusiveLock | f       | f
tuple         |    16388 |    16391 | 1546 |    26 |            |
     |         |       |          | 12/990             | 11727 |
ExclusiveLock       | f       | f
tuple         |    16388 |    16391 | 1546 |    26 |            |
     |         |       |          | 19/229             | 11755 |
ExclusiveLock       | f       | f
(5 rows)

adsdb=> select * from pg_locks where pid = 11707;
   locktype    | database | relation | page | tuple | virtualxid |
transactionid | classid | objid | objsubid | virtualtransaction |  pid  |
      mode         | granted | fastpath
---------------+----------+----------+------+-------+-------
-----+---------------+---------+-------+----------+---------
-----------+-------+---------------------+---------+----------
relation      |    16388 |    16400 |      |       |            |
     |         |       |          | 6/3269             | 11707 |
AccessShareLock     | t       | t
relation      |    16388 |    16398 |      |       |            |
     |         |       |          | 6/3269             | 11707 |
AccessShareLock     | t       | t
relation      |    16388 |    16391 |      |       |            |
     |         |       |          | 6/3269             | 11707 |
AccessShareLock     | t       | t
relation      |    16388 |    16391 |      |       |            |
     |         |       |          | 6/3269             | 11707 |
RowExclusiveLock    | t       | t
virtualxid    |          |          |      |       | 6/3269     |
     |         |       |          | 6/3269             | 11707 |
ExclusiveLock       | t       | t
transactionid |          |          |      |       |            |
176245 |         |       |          | 6/3269             | 11707 |
ExclusiveLock       | t       | f
tuple         |    16388 |    16391 | 1546 |    26 |            |
     |         |       |          | 6/3269             | 11707 |
AccessExclusiveLock | t       | f
transactionid |          |          |      |       |            |
176247 |         |       |          | 6/3269             | 11707 |
ShareLock           | f       | f
(8 rows)

adsdb=> select * from pg_stat_activity where pid =11707;
datid | datname |  pid  | usesysid | usename | application_name |
client_addr | client_hostname | client_port |         backend_start
 |          xact_start          |          query_start
    |         state_change          | waiting | state  | backend_xid |
backend_xmin |                query
-------+---------+-------+----------+---------+-------------
-----+-------------+-----------------+-------------+--------
-----------------------+------------------------------+-----
----------------------
----+-------------------------------+---------+--------+----
---------+--------------+-------------------------------------
16388 | adsdb   | 11707 |    16387 | ads     | pgxc             |
127.0.0.1   |                 |       53912 | 2018-06-08 23:57:32.830016+00
| 2018-06-08 23:57:38.50713+00 | 2018-06-08 23:57:38.602988
+00 | 2018-06-08 23:57:38.606531+00 | t       | active |      176245 |
 176242 | DELETE FROM sbtest1 WHERE (id = $1)

[pgxl at sandbox ~]$ ps -ef | grep pgxl | grep 127.0.0.1 | grep waiting | grep
8402
pgxl     11707  8402  0 Jun08 ?        00:00:00 postgres: ads adsdb
127.0.0.1(53912) DELETE waiting
pgxl     11715  8402  0 Jun08 ?        00:00:00 postgres: ads adsdb
127.0.0.1(53928) UPDATE waiting
pgxl     11719  8402  0 Jun08 ?        00:00:00 postgres: ads adsdb
127.0.0.1(53936) DELETE waiting
pgxl     11727  8402  0 Jun08 ?        00:00:00 postgres: ads adsdb
127.0.0.1(53952) UPDATE waiting
pgxl     11755  8402  0 Jun08 ?        00:00:00 postgres: ads adsdb
127.0.0.1(54008) UPDATE waiting

adsdb=> select count(1) from pg_stat_activity;
count
-------
    17
(1 row)

adsdb=> show max_connections;
max_connections
-----------------
100
(1 row)
dsdb=> select version();
                                                            version
------------------------------------------------------------
--------------------------------------------------------------------
PostgreSQL 9.5.8 (Postgres-XL 9.5r1.6) on x86_64-pc-linux-gnu, compiled by
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18), 64-bit
(1 row)


[root at sandbox postgres-xl-9.5r1.6]# gdb -q --batch --ex "bt" -p 11707
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f9295685348 in poll () from /lib64/libc.so.6
#0  0x00007f9295685348 in poll () from /lib64/libc.so.6
#1  0x00000000007c5baa in WaitLatchOrSocket (latch=0x7f928b3887dc,
wakeEvents=1, sock=-1, timeout=0) at pg_latch.c:333
#2  0x00000000007c59b6 in WaitLatch (latch=0x7f928b3887dc, wakeEvents=1,
timeout=0) at pg_latch.c:197
#3  0x0000000000848b0c in ProcSleep (locallock=0x1236f60,
lockMethodTable=0xbb9f20 <default_lockmethod>) at proc.c:1143
#4  0x0000000000842a82 in WaitOnLock (locallock=0x1236f60, owner=0x1334778)
at lock.c:1742
#5  0x000000000084176b in LockAcquireExtendedXC (locktag=0x7ffdb79a2ff0,
lockmode=5, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1
'\001', only_increment=0 '\000') at lock.c:1031
#6  0x0000000000840d47 in LockAcquireExtended (locktag=0x7ffdb79a2ff0,
lockmode=5, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1
'\001') at lock.c:718
#7  0x0000000000840cb0 in LockAcquire (locktag=0x7ffdb79a2ff0, lockmode=5,
sessionLock=0 '\000', dontWait=0 '\000') at lock.c:677
#8  0x000000000083fbf9 in XactLockTableWait (xid=176247,
rel=0x7f92963d5260, ctid=0x7ffdb79a3094, oper=XLTW_Delete) at lmgr.c:566
#9  0x00000000004c93b1 in heap_delete (relation=0x7f92963d5260,
tid=0x7ffdb79a3240, cid=2, crosscheck=0x0, wait=1 '\001',
hufd=0x7ffdb79a3180) at heapam.c:2929
#10 0x00000000006b5c78 in ExecDelete (tupleid=0x7ffdb79a3240, oldtuple=0x0,
planSlot=0x134a158, epqstate=0x1349358, estate=0x1348ff0, canSetTag=1
'\001') at nodeModifyTable.c:627
,,,
#25 0x0000000000709e3e in main (argc=8, argv=0x120f370) at main.c:228

[root at sandbox postgres-xl-9.5r1.6]# gdb -q --batch --ex "bt" -p 11715
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f9295685348 in poll () from /lib64/libc.so.6
#0  0x00007f9295685348 in poll () from /lib64/libc.so.6
#1  0x00000000007c5baa in WaitLatchOrSocket (latch=0x7f928b3890ac,
wakeEvents=1, sock=-1, timeout=0) at pg_latch.c:333
#2  0x00000000007c59b6 in WaitLatch (latch=0x7f928b3890ac, wakeEvents=1,
timeout=0) at pg_latch.c:197
#3  0x0000000000848b0c in ProcSleep (locallock=0x1236de0,
lockMethodTable=0xbb9f20 <default_lockmethod>) at proc.c:1143
#4  0x0000000000842a82 in WaitOnLock (locallock=0x1236de0, owner=0x12c6558)
at lock.c:1742
#5  0x000000000084176b in LockAcquireExtendedXC (locktag=0x7ffdb79a2f60,
lockmode=7, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1
'\001', only_increment=0 '\000') at lock.c:1031
#6  0x0000000000840d47 in LockAcquireExtended (locktag=0x7ffdb79a2f60,
lockmode=7, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1
'\001') at lock.c:718
#7  0x0000000000840cb0 in LockAcquire (locktag=0x7ffdb79a2f60, lockmode=7,
sessionLock=0 '\000', dontWait=0 '\000') at lock.c:677
#8  0x000000000083f7f5 in LockTuple (relation=0x7f92963d5260,
tid=0x7ffdb79a3064, lockmode=7) at lmgr.c:440
#9  0x00000000004cd984 in heap_acquire_tuplock (relation=0x7f92963d5260,
tid=0x7ffdb79a3064, mode=LockTupleNoKeyExclusive,
wait_policy=LockWaitBlock, have_tuple_lock=0x7ffdb79a3053 "") at
heapam.c:4978
#10 0x00000000004ca51b in heap_update (relation=0x7f92963d5260,
otid=0x7ffdb79a3240, newtup=0x1348908, cid=0, crosscheck=0x0, wait=1
'\001', hufd=0x7ffdb79a3190, lockmode=0x7ffdb79a318c) at heapam.c:3547
#11 0x00000000006b6307 in ExecUpdate (tupleid=0x7ffdb79a3240, oldtuple=0x0,
slot=0x1348168, planSlot=0x1347240, epqstate=0x1339528, estate=0x13391c0,
canSetTag=1 '\001') at nodeModifyTable.c:923
,,,
#26 0x0000000000709e3e in main (argc=8, argv=0x120f370) at main.c:228


The reproduction steps is pretty simple.
step 1: download the Postgres-XL 9.5r1.6 release and compile it with the
command "./configure --prefix=/usr/local/pgxl --without-readline
--enable-debug --enable-cassert CFLAGS='-O0 -g'", "make&make install"

step 2: Initialize the database, and deploy 1 coordinator and 2 data nodes
in one box.
su - pgxl
mkdir /usr/local/pgxl/data_coord1
mkdir /usr/local/pgxl/data_datanode_1
mkdir /usr/local/pgxl/data_datanode_2
mkdir /usr/local/pgxl/data_gtm
chown pgxl /usr/local/pgxl/data_coord1
chown pgxl /usr/local/pgxl/data_datanode_1
chown pgxl /usr/local/pgxl/data_datanode_2
chown pgxl /usr/local/pgxl/data_gtm
/usr/local/pgxl/bin/initdb -D /usr/local/pgxl/data_coord1 \
  --nodename coord1
/usr/local/pgxl/bin/initdb -D /usr/local/pgxl/data_datanode_1 \
  --nodename datanode_1
/usr/local/pgxl/bin/initdb -D /usr/local/pgxl/data_datanode_2 \
  --nodename datanode_2
/usr/local/pgxl/bin/initgtm -D /usr/local/pgxl/data_gtm -Z gtm
/usr/local/pgxl/bin/gtm -D /usr/local/pgxl/data_gtm >logfile 2>&1 &
/usr/local/pgxl/bin/postgres --datanode -p 15432 -c pooler_port=40101 \
  -D /usr/local/pgxl/data_datanode_1 >logfile 2>&1 &
/usr/local/pgxl/bin/postgres --datanode -p 15433 -c pooler_port=40102 \
  -D /usr/local/pgxl/data_datanode_2 >logfile 2>&1 &
/usr/local/pgxl/bin/postgres --coordinator -c pooler_port=40100 \
  -D /usr/local/pgxl/data_coord1 >logfile 2>&1 &
/usr/local/pgxl/bin/psql -c "ALTER NODE coord1 \
  WITH (TYPE = 'coordinator', PORT = 5432)" postgres
/usr/local/pgxl/bin/psql -c "CREATE NODE datanode_1 \
  WITH (TYPE = 'datanode', PORT = 15432)" postgres
/usr/local/pgxl/bin/psql -c "CREATE NODE datanode_2 \
  WITH (TYPE = 'datanode', PORT = 15433)" postgres

/usr/local/pgxl/bin/psql -c "EXECUTE DIRECT ON (datanode_1) \
  'ALTER NODE datanode_1 WITH (TYPE = ''datanode'', PORT = 15432)'" postgres
/usr/local/pgxl/bin/psql -c "EXECUTE DIRECT ON (datanode_1) \
  'CREATE NODE datanode_2 WITH (TYPE = ''datanode'', PORT = 15433)'"
postgres
/usr/local/pgxl/bin/psql -c "EXECUTE DIRECT ON (datanode_1) \
  'CREATE NODE coord1 WITH (TYPE = ''coordinator'', PORT = 5432)'" postgres

/usr/local/pgxl/bin/psql -c "EXECUTE DIRECT ON (datanode_2) \
  'ALTER NODE datanode_2 WITH (TYPE = ''datanode'', PORT = 15433)'" postgres
/usr/local/pgxl/bin/psql -c "EXECUTE DIRECT ON (datanode_2) \
  'CREATE NODE datanode_1 WITH (TYPE = ''datanode'', PORT = 15432)'"
postgres
/usr/local/pgxl/bin/psql -c "EXECUTE DIRECT ON (datanode_2) \
  'CREATE NODE coord1 WITH (TYPE = ''coordinator'', PORT = 5432)'" postgres

/usr/local/pgxl/bin/psql -c "SELECT pgxc_pool_reload()" postgres
/usr/local/pgxl/bin/psql -c "EXECUTE DIRECT ON (datanode_1) \
  'SELECT pgxc_pool_reload()'" postgres
/usr/local/pgxl/bin/psql -c "EXECUTE DIRECT ON (datanode_2) \
  'SELECT pgxc_pool_reload()'" postgres


step 3: issue the following sysbench commands.
sysbench oltp_read_write --db-driver=pgsql --pgsql-host=127.0.0.1
--pgsql-port=5432 --pgsql-user=ads --pgsql-password=ads --pgsql-db=adsdb
--tables=10 --table-size=100000 prepare
sysbench oltp_read_write --db-driver=pgsql --pgsql-host=127.0.0.1
--pgsql-port=5432 --pgsql-user=ads --pgsql-password=ads --pgsql-db=adsdb
--threads=16  --time=9  run


Thanks,
-Yuanfei
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.postgres-xl.org/pipermail/postgres-xl-developers-postgres-xl.org/attachments/20180614/88ae2bd2/attachment-0001.html>


More information about the Postgres-xl-developers mailing list