[Postgres-xl-developers] SharedQueueUnBind race condition(Internet mail)
jasonysli at tencent.com
Thu Jan 1 04:48:18 PST 2015
This is how the problem arose in our cluster:
We found that it was caused by a race on SQueuesLock between SharedQueueUnBind and SharedQueueBind. In rare situations, for example when cluster memory is low and processes run much more slowly, the producer process times out and waits on SQueuesLock while some consumers are just entering SharedQueueBind and successfully attach to the shared queue. After SharedQueueBind releases SQueuesLock, SharedQueueUnBind removes the shared queue from SharedQueues and sets sq_sync to NULL. When the consumer then calls SharedQueueRead, it dumps core with a SEGV.
The fix is that once SharedQueueUnBind has acquired SQueuesLock, it rechecks whether any consumers are still running on the shared queue; if so, SharedQueueUnBind waits until no consumers remain or the wait times out. The patch also changes the elog level for the SharedQueues search failure in SharedQueueBind to ERROR, to avoid an unnecessary cluster reinitialization.
You can use gdb to reproduce the issue, but it takes a little time.
From: Pavan Deolasee<mailto:pavan.deolasee at gmail.com>
Date: 2015-01-01 18:23
To: Jov<mailto:amutu at amutu.com>
CC: postgres-xl-developers at lists.sourceforge.net<mailto:postgres-xl-developers at lists.sourceforge.net>
Subject: [Postgres-xl-developers] SharedQueueUnBind race condition(Internet mail)
I am looking at the patch to fix the race condition in SharedQ unbind. I wonder if you have a more detailed explanation of the issue or, even better, a scenario to reproduce it. Please let me know.