[Postgres-xl-developers] more fun with GTM snapshots and memory management

Tomas Vondra tomas.vondra at 2ndquadrant.com
Sun Nov 12 04:46:11 PST 2017


Hi,

I've been running regression tests in a loop, to see if there are any
rare failures (e.g. race conditions). And indeed, I've repeatedly run
into a situation where a DDL command would just get stuck.

This would happen in maybe 1% of the runs (so ~10 cases out of 1000
runs), which is quite rare - especially considering how intensive the
tests are and how many commands get executed.

Anyway, the stack trace always looked *exactly* the same:

    #0  0x00007fbcedb10a20 in __poll_nocancel () from /lib64/libc.so.6
    #1  0x0000000000ac6745 in gtmpqSocketPoll (sock=5, forRead=1,
        forWrite=0, end_time=-1) at fe-misc.c:996
    #2  0x0000000000ac664c in gtmpqSocketCheck (conn=0x167cd10,
        forRead=1, forWrite=0, end_time=-1) at fe-misc.c:943
    #3  0x0000000000ac6549 in gtmpqWaitTimed (forRead=1, forWrite=0,
        conn=0x167cd10, finish_time=-1) at fe-misc.c:884
    #4  0x0000000000ac651b in gtmpqWait (forRead=1, forWrite=0,
        conn=0x167cd10) at fe-misc.c:867
    #5  0x0000000000ac417b in GTMPQgetResult (conn=0x167cd10) at
        fe-protocol.c:280
    #6  0x0000000000ac0a5f in get_snapshot (conn=0x167cd10,
        gxid=5337732, canbe_grouped=1 '\001') at gtm_client.c:1104
    #7  0x0000000000562622 in GetSnapshotGTM (gxid=5337732,
        canbe_grouped=1 '\001') at gtm.c:446
    ...

Initially this might seem like a networking issue - the process expects
more data to arrive, but it never does, and so it waits indefinitely.
Interestingly, it seems that some of the data does arrive (so the GTM
timeout check does not fire), but the data is incomplete and so
GTMPQgetResult waits for the remaining part.

An interesting observation is that the stack trace is always exactly the
same (except for the parameter values, of course). It's get_snapshot
every time, which suggests this is not a networking issue - that would
affect other methods too, not just this one.

And indeed, this turns out to be an issue in GTM_GetTransactionSnapshot
(and ProcessGetSnapshotCommandMulti). The problem is that when computing
a snapshot, GTM_GetTransactionSnapshot feels free to re-use the snapshot
already stored for one of the handles.

That probably assumes multiple calls will never include the same GXID
(which would guarantee that two threads will not scribble on the same
memory), but that assumption seems to be incorrect. For example when
doing DDL, each node seems to acquire the snapshot independently (for
the same GXID).
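
To make that concrete, here is a deliberately simplified sketch of the
sharing pattern (struct, field and helper names are paraphrased, not the
actual Postgres-XL identifiers):

    #include <stdint.h>

    typedef uint32_t GlobalTransactionId;

    typedef struct
    {
        int                  sn_xcnt;   /* number of running GXIDs */
        GlobalTransactionId *sn_xip;    /* array of running GXIDs  */
    } SnapshotSketch;

    typedef struct
    {
        GlobalTransactionId gxid;
        SnapshotSketch      snapshot;   /* ONE snapshot per transaction slot */
    } TxnSlotSketch;

    /* hypothetical stand-ins for the real GTM routines */
    extern int  count_running_transactions(void);
    extern void fill_running_gxids(GlobalTransactionId *xip);

    /*
     * Called independently by every backend that requests a snapshot for
     * a given GXID - e.g. each node participating in a DDL.  All of the
     * calls write into the same slot->snapshot, so one caller can modify
     * it while another one is still serializing it onto the network.
     */
    SnapshotSketch *
    get_snapshot_sketch(TxnSlotSketch *slot)
    {
        SnapshotSketch *snap = &slot->snapshot;     /* shared, not private */

        snap->sn_xcnt = count_running_transactions();
        fill_running_gxids(snap->sn_xip);           /* races with other callers */

        return snap;
    }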

Which matches another observation - I've only ever seen these lockups
for sessions doing some sort of DDL (create/drop commands, grants, ...)
and never for regular DML.

So, why does it actually get stuck? Assume you have two different GTM
connections requesting a snapshot for the same GXID, and that both end
up reusing the same snapshot for the GXID (in GTMTransactions).

As the two calls are independent (happen at slightly different times),
the list of running transactions may change a bit between them. For
example some transactions may complete in between, so that the second
snapshot will be a bit smaller.

So, it may happen that

1) first GTM_GetTransactionSnapshot builds a snapshot with sn_xcnt=2

2) some other transaction completes

3) ProcessGetSnapshotCommandMulti starts building a network response

4) second GTM_GetTransactionSnapshot builds a snapshot with sn_xcnt=1
   and modifies the same snapshot as used by (3)

Now, the change from sn_xcnt=2 to sn_xcnt=1 may happen between these two
steps in ProcessGetSnapshotCommandMulti:

    pq_sendint(&buf, snapshot->sn_xcnt, sizeof (int));
    pq_sendbytes(&buf, (char *)snapshot->sn_xip,
                 sizeof(GlobalTransactionId) * snapshot->sn_xcnt);

When this happens, the network message ends up saying "I'll send you
GXIDs for 2 running transactions" but then only includes a single GXID.

Which means GTMPQgetResult() will wait for the remaining data
indefinitely, because the GTM never actually sent it.
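
For illustration, this is roughly what the receiving side has to do with
such a message and why it blocks (a hypothetical sketch with a made-up
helper, ignoring byte order and the real protocol framing - not the
actual gtm_client.c code):

    #include <stdint.h>
    #include <unistd.h>

    typedef uint32_t GlobalTransactionId;

    /* Read exactly len bytes, blocking until they all arrive. */
    static int
    read_full(int sock, void *buf, size_t len)
    {
        size_t  done = 0;

        while (done < len)
        {
            ssize_t n = read(sock, (char *) buf + done, len - done);

            if (n <= 0)
                return -1;              /* error or EOF */
            done += n;
        }
        return 0;
    }

    /* Returns the number of GXIDs received, or -1 on error. */
    int
    receive_snapshot_sketch(int sock, GlobalTransactionId *xip, int max_xcnt)
    {
        int32_t xcnt;

        if (read_full(sock, &xcnt, sizeof(xcnt)) < 0)
            return -1;

        if (xcnt < 0 || xcnt > max_xcnt)
            return -1;

        /*
         * With a torn message (xcnt says 2, but only one GXID was written)
         * this read never completes - the same symptom as the stuck
         * GTMPQgetResult() in the stack trace above.
         */
        if (read_full(sock, xip, sizeof(GlobalTransactionId) * xcnt) < 0)
            return -1;

        return xcnt;
    }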

The attached patch simply abandons the idea of reusing existing
snapshots for transactions (i.e. stored in GTMTransactions), and always
uses the thread-specific snapshot (stored in GTM_ThreadInfo). That is
quite cheap, as it does not require any memory allocation (the snapshot
is allocated statically) and solves the problem of multiple threads
scribbling over the same chunk of memory while building the snapshot.
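
In sketch form, the idea is something like this (names are again
paraphrased; the real patch works with GTM_ThreadInfo and the existing
GTM routines):

    #include <stdint.h>

    #define SKETCH_MAX_XCNT 1024

    typedef uint32_t GlobalTransactionId;

    typedef struct
    {
        int                 sn_xcnt;
        GlobalTransactionId sn_xip[SKETCH_MAX_XCNT];    /* no palloc needed */
    } ThreadSnapshotSketch;

    typedef struct
    {
        /* ... other per-thread state ... */
        ThreadSnapshotSketch thr_snapshot;  /* private to this thread */
    } ThreadInfoSketch;

    /* hypothetical stand-ins for the real GTM routines */
    extern ThreadInfoSketch *GetMyThreadInfoSketch(void);
    extern int  collect_running_gxids(GlobalTransactionId *xip, int max);

    /*
     * Every thread builds the snapshot in its own statically allocated
     * buffer, so no two threads ever write into the same snapshot and
     * nothing changes underneath while the response is being sent.
     */
    ThreadSnapshotSketch *
    get_snapshot_threadlocal_sketch(void)
    {
        ThreadInfoSketch     *my = GetMyThreadInfoSketch();
        ThreadSnapshotSketch *snap = &my->thr_snapshot;

        snap->sn_xcnt = collect_running_gxids(snap->sn_xip, SKETCH_MAX_XCNT);
        return snap;
    }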

But the more I look at GTM_GetTransactionSnapshot, the more I'm
wondering if the code is broken ... Or perhaps I'm missing something, in
which case I'd be grateful if someone could explain why.

For example, the code only acquires gt_TransArrayLock in read mode, but
then proceeds to copy the built snapshots into the array. If there are
multiple threads building a snapshot for the same GXID, this can easily
lead to memory leaks (as both threads may see sn_xip as NULL and will
allocate memory in the global context). That's annoying, but the amount
of memory we can leak is somewhat limited (it'll only happen once for
each global transaction slot).
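
The check-then-allocate race looks roughly like this (a sketch with
made-up helper names; the real code uses GTM's own rwlock and memory
context routines):

    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t GlobalTransactionId;

    typedef struct
    {
        int                  sn_xcnt;
        GlobalTransactionId *sn_xip;
    } SharedSnapshotSketch;

    /* hypothetical stand-ins for gt_TransArrayLock handling */
    extern void acquire_trans_array_lock_shared(void);
    extern void release_trans_array_lock(void);

    void
    store_snapshot_sketch(SharedSnapshotSketch *shared, int max_xcnt)
    {
        acquire_trans_array_lock_shared();      /* read mode only! */

        if (shared->sn_xip == NULL)
        {
            /*
             * Two racing threads can both see NULL and both allocate;
             * whichever pointer gets written last wins, and the other
             * allocation leaks in the long-lived context.
             */
            shared->sn_xip = malloc(sizeof(GlobalTransactionId) * max_xcnt);
        }

        /* ... sn_xcnt and the GXID array are copied here, also unprotected ... */

        release_trans_array_lock();
    }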

What seems more dangerous is the possibility that two threads may
interleave in unpredictable ways when copying the snapshot data into
GTMTransactions, for example like this:

1) thread 1 builds snapshot with sn_xcnt=3 and GXIDs [1,2,3]
2) thread 2 builds snapshot with sn_xcnt=2 and GXIDs [1,3]
3) thread 2 stores sn_xcnt and gets suspended
4) thread 1 stores both sn_xcnt=3 and GXIDs
5) thread 2 gets woken up and stores its GXIDs

So we end up with sn_xcnt=3 and GXIDs [1,3,3]. I suspect this may get
much messier considering (a) we also copy xmin/xmax for the snapshot,
(b) compilers may reorder the calls in various ways and (c) the memcpy
is likely not atomic (i.e. threads may see the memory in some
intermediate state). So perhaps this needs stricter locking? (A comment
in the code claims "shared lock" is enough, but that seems more like a
copy-paste from GetSnapshotData.)
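
For the sake of argument, a stricter version could publish the whole
snapshot as one unit under a write lock, e.g. (sketched with POSIX
rwlocks rather than GTM's own lock wrappers):

    #include <pthread.h>
    #include <stdint.h>
    #include <string.h>

    typedef uint32_t GlobalTransactionId;

    typedef struct
    {
        int                  sn_xcnt;
        GlobalTransactionId  xmin;
        GlobalTransactionId  xmax;
        GlobalTransactionId *sn_xip;
    } SharedSnapshotSketch;

    static pthread_rwlock_t trans_array_lock = PTHREAD_RWLOCK_INITIALIZER;

    /*
     * Publish xmin/xmax, sn_xcnt and the GXID array together, under an
     * exclusive lock, so concurrent writers cannot interleave and
     * readers never see a half-updated snapshot.
     */
    void
    publish_snapshot_sketch(SharedSnapshotSketch *shared,
                            const SharedSnapshotSketch *local)
    {
        pthread_rwlock_wrlock(&trans_array_lock);   /* exclusive, not shared */

        shared->xmin = local->xmin;
        shared->xmax = local->xmax;
        shared->sn_xcnt = local->sn_xcnt;
        memcpy(shared->sn_xip, local->sn_xip,
               sizeof(GlobalTransactionId) * local->sn_xcnt);

        pthread_rwlock_unlock(&trans_array_lock);
    }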

I also suspect it's not handling snapshots for SERIALIZABLE transactions
correctly - it's true we only store the initial snapshot, but the code
still returns the freshly built one. Shouldn't it return the existing
snapshot instead?

That being said, with the patch applied I can no longer reproduce the
get_snapshot lockups. But there are some remaining crashes that might be
related. For example every now and then I see this assert failure:

    #0  0x00007fdaf2fe9eb5 in raise () from /lib64/libc.so.6
    #1  0x00007fdaf2febbda in abort () from /lib64/libc.so.6
    #2  0x0000000000a8612e in ExceptionalCondition (conditionName=0xcc3138
        "!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0)))",
        errorType=0xcc3128 "FailedAssertion", fileName=0xcc30f9 "procarray.c",
        lineNumber=562) at assert.c:54
    #3  0x00000000008da62a in ProcArrayGroupClearXid (proc=0x7fdaf2c396c0,
        latestXid=17945) at procarray.c:562
    #4  0x00000000008da432 in ProcArrayEndTransaction (proc=0x7fdaf2c396c0,
        latestXid=17945) at procarray.c:488
    #5  0x0000000000549c0e in CommitTransaction () at xact.c:2532
    #6  0x000000000054ad5c in CommitTransactionCommand () at xact.c:3550

I wonder if this might be caused by different nodes seeing different
snapshots for the same GXID ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmt-snap-fix.patch
Type: text/x-patch
Size: 6670 bytes
Desc: not available
URL: <http://lists.postgres-xl.org/pipermail/postgres-xl-developers-postgres-xl.org/attachments/20171112/34344057/attachment.bin>

