[Postgres-xl-general] Corrupted shared memory in 9.5

Shaun Thomas sthomas at peak6.com
Wed Jun 1 12:34:22 PDT 2016


We're seeing the same problem, though the cause is clearly different given the tables involved. The logs from our node that died are very similar:

32155|postgres|execution|10.2.2.31(42239)|2016-06-01 12:05:32 CDT|WARNING:  SQueue p_1_7d22_e, timeout while waiting for Consumers finishing
32155|postgres|execution|10.2.2.31(42239)|2016-06-01 12:05:32 CDT|ERROR:  canceling statement due to user request
29564||||2016-06-01 12:05:46 CDT|LOG:  server process (PID 32155) was terminated by signal 11: Segmentation fault
29564||||2016-06-01 12:05:46 CDT|DETAIL:  Failed process was running: Remote Subplan

But it doesn't always crash. Most of the time, we instead see errors like these at random:

29677|postgres|execution|10.2.2.31(49604)|2016-06-01 12:02:36 CDT|LOG:  Remote node "data0003", running with pid 29684 returned an error: Failed to read from SQueue p_1_73d7_7, consumer (node 2, pid 29684, status 2) - CONSUMER_ERROR set
29677|postgres|execution|10.2.2.31(49604)|2016-06-01 12:02:36 CDT|STATEMENT:  Remote Subplan
29677|postgres|execution|10.2.2.31(49604)|2016-06-01 12:02:36 CDT|ERROR:  Failed to read from SQueue p_1_73d7_7, consumer (node 2, pid 29684, status 2) - CONSUMER_ERROR set
29677|postgres|execution|10.2.2.31(49604)|2016-06-01 12:02:36 CDT|STATEMENT:  Remote Subplan
29716|postgres|execution|10.2.2.31(49644)|2016-06-01 12:02:36 CDT|LOG:  Remote node "data0000", running with pid 29723 returned an error: could not send data to server
29716|postgres|execution|10.2.2.31(49644)|2016-06-01 12:02:36 CDT|STATEMENT:  ROLLBACK TRANSACTION
29716|postgres|execution|10.2.2.31(49644)|2016-06-01 12:02:36 CDT|LOG:  Remote node "data0000", running with pid 29723 returned an error: failed to send data to datanode
29716|postgres|execution|10.2.2.31(49644)|2016-06-01 12:02:36 CDT|STATEMENT:  ROLLBACK TRANSACTION
29716|postgres|execution|10.2.2.31(49644)|2016-06-01 12:02:36 CDT|ERROR:  Failed to synchronize data node
29716|postgres|execution|10.2.2.31(49644)|2016-06-01 12:02:36 CDT|STATEMENT:  ROLLBACK TRANSACTION
29716|postgres|execution|10.2.2.31(49644)|2016-06-01 12:02:36 CDT|WARNING:  AbortTransaction while in ABORT state

Or this:

28674|postgres|execution|10.2.2.31(46092)|2016-06-01 11:26:38 CDT|LOG:  Remote node "data0000", running with pid 28039 returned an error: could not open relation with OID 1052136
28674|postgres|execution|10.2.2.31(46092)|2016-06-01 11:26:38 CDT|STATEMENT:  Remote Subplan
28674|postgres|execution|10.2.2.31(46092)|2016-06-01 11:26:38 CDT|ERROR:  could not open relation with OID 1052136
28674|postgres|execution|10.2.2.31(46092)|2016-06-01 11:26:38 CDT|STATEMENT:  Remote Subplan
28674|postgres|execution|10.2.2.31(46092)|2016-06-01 11:26:38 CDT|LOG:  Remote node "data0003", running with pid 28063 returned an error: Failed to synchronize data node
28674|postgres|execution|10.2.2.31(46092)|2016-06-01 11:26:38 CDT|LOG:  Remote node "data0001", running with pid 28030 returned an error: Failed to synchronize data node

Yet we can run the function that causes this ten times with no problem, and the eleventh run will fail with one of the above errors. I'm trying to isolate the cause and simplify the function logic into a test case, and I'm using pgbench to try to force the failure, but it's slow going. There's clearly a pretty serious problem here, though.
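
Since the failures are intermittent, it may help to tally which messages show up and how often while a stress run is going. Below is a rough Python sketch, not something we run in production. It assumes the pipe-delimited prefix seen in the excerpts above (pid|user|database|client|timestamp|LEVEL:  message); the helper names `normalize` and `tally` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, inferred from the log excerpts in this message:
# pid|user|database|client(port)|timestamp|LEVEL:  message
LINE_RE = re.compile(
    r"^(?P<pid>\d+)\|(?P<user>[^|]*)\|(?P<db>[^|]*)\|(?P<client>[^|]*)\|"
    r"(?P<ts>[^|]*)\|(?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)

def normalize(msg):
    """Mask queue names and numbers so similar messages group together."""
    msg = re.sub(r"SQueue \w+", "SQueue <name>", msg)
    msg = re.sub(r"\d+", "<n>", msg)
    return msg

def tally(lines, levels=("ERROR", "WARNING")):
    """Count normalized messages at the given levels; skip unparseable lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") in levels:
            counts[(m.group("level"), normalize(m.group("msg")))] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage: python tally_errors.py < postgresql.log
    for (level, msg), n in tally(sys.stdin).most_common():
        print(f"{n:6d}  {level}: {msg}")
```

Masking the queue names, OIDs, and pids means repeated occurrences of the same failure collapse into one row, which makes it easier to see whether one error dominates across runs.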

--
Shaun Thomas
PEAK6 Investments, LP | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-444-8105
sthomas at peak6.com


