[Postgres-xl-developers] trouble with the GTM / pthread failures and elog infrastructure

Tomas Vondra tomas.vondra at 2ndquadrant.com
Fri Aug 18 08:06:38 PDT 2017



On 08/18/2017 01:48 PM, Pavan Deolasee wrote:
> 
> 
> On Fri, Aug 11, 2017 at 9:06 PM, Tomas Vondra
> <tomas.vondra at 2ndquadrant.com> wrote:
> 
>     Hi,
> 
>     I've been doing some TPC-H tests this week, to identify issues in
>     the XL 10 code base, and while doing that I made the mistake of
>     using a schema that was not tuned for Postgres-XL (distribution not
>     chosen explicitly for tables, etc.). This led to some unexpected and
>     rather interesting failures in the GTM, which I'd like to share and
>     discuss here.
> 
>     The primary issue I ran into is that due to not tuning the schema,
>     some of the queries had to perform a lot of redistributions when
>     joining tables. That is, the queries involved many steps like this:
> 
>                                  QUERY PLAN
>          ----------------------------------------------------------
>           Remote Subquery Scan on all
>             ->  Hash Join
>                   Hash Cond: (n1.n_regionkey = region.r_regionkey)
>                   ->  Remote Subquery Scan on all
>                         Distribute results by H: n_regionkey
>                         ->  Seq Scan on nation n1
>                   ->  Hash
>                         ->  Seq Scan on region
>          (8 rows)
> 
>     And when I say "many" I mean e.g. 10 joins with redistributions, or
>     so. Those redistributions require opening connections to other
>     datanodes, and each of those connections (or rather the backend
>     handling it) will connect to the GTM.
> 
>     As the GTM uses a thread-per-connection model, I quickly ended up
>     with about 500 threads in the GTM, and then pthread_create() started
>     to fail due to hitting some system limits (GTM + 4 coordinators + 4
>     datanodes on a single machine).
> 
>     Now, hitting limits is fine - it's expected; we check the
>     pthread_create() return value and respond to it in some way.
>     Rejecting the connections is fine in such cases. But something seems
>     wrong here, because what actually happens is that the GTM quickly
>     allocates a large amount of memory (several gigabytes) and then
>     crashes / gets killed. That's bad.
> 
> 
> Yeah, that sounds bad. I wonder if allocating memory contexts per thread
> causes excessive memory usage. I don't think we designed GTM to have
> more than a handful of threads (one per physical node), but we probably
> did not enforce limits in a reasonable way.
> 
>     Adding a gtm_proxy (which is generally recommended anyway) fixes
>     this issue, as the proxy uses only a limited number of threads to
>     handle the connections, and opens only one GTM connection per
>     thread.
> 
> 
> 
> TBH I think we should just make the GTM proxy a required component of
> the architecture, since there is hardly any benefit to running without
> it and the problems could be much worse. There were some suggestions
> in the past to turn it into a background worker process at each
> node, so that we can reduce the communication overhead between backend
> processes and the GTM proxy and also simplify things a bit by
> eliminating one component.
> 

I agree that making gtm_proxy a required part may be a good idea,
although I'm wondering if turning it into a background worker might be
too restrictive. It would certainly make an XL cluster easier to operate
(no separate GTM proxies to deal with, etc.), but OTOH it would become
impossible to run gtm_proxy independently, which I suspect might be an
issue for some deployments.
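
Coming back to the pthread_create() failures for a moment: below is a
rough sketch of the defensive handling I would expect on the accept
path, so that a failed thread creation ends up as a cleanly rejected
connection rather than as leaked per-connection state. This is not the
actual GTM code - conn_state, alloc_conn_state(), free_conn_state() and
connection_main() are just illustrative stand-ins:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* conn_state, alloc_conn_state(), free_conn_state() and
     * connection_main() are hypothetical stand-ins for the GTM's
     * per-connection state and thread main function. */
    typedef struct conn_state conn_state;
    extern conn_state *alloc_conn_state(int sockfd);
    extern void free_conn_state(conn_state *conn);
    extern void *connection_main(void *arg);

    void
    handle_new_connection(int sockfd)
    {
        conn_state *conn = alloc_conn_state(sockfd);
        pthread_t   thread;
        int         rc;

        rc = pthread_create(&thread, NULL, connection_main, conn);
        if (rc != 0)
        {
            /* e.g. EAGAIN when hitting the nproc limit - reject the
             * connection and release everything allocated for it,
             * instead of leaving half-initialized state behind */
            fprintf(stderr, "could not create thread: %s\n",
                    strerror(rc));
            free_conn_state(conn);
            close(sockfd);
            return;
        }
        pthread_detach(thread);
    }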

> 
>     GTM is also typically placed on a different system, not collocated
>     with the coordinators/datanodes, so hitting the nproc limit should
>     not happen this easily.
> 
>     You might also argue that perhaps GTM_MAX_THREADS should be set
>     lower (it's 1024 by default). Maybe, but I don't think it would be a
>     reliable solution - pthread_create() might still fail for a number
>     of reasons.
> 
>     But it suggests the current GTM code is not really handling those
>     failures sufficiently.
> 
> 
> That sounds bad too.
> 
>     While investigating those issues I have, however, run into other
>     problems, related to how GTM reuses some of the Postgres
>     infrastructure (elog and memory contexts) in a multi-threaded
>     environment.
> 
>     For example, elog() relies on the memory contexts being set up
>     correctly, but when it gets called from GTM_ThreadRemove() you get
>     a failure like this one (on an assert-enabled build):
> 
>     TRAP: BadArgument("!(((context) != ((void *)0)))", File: "mcxt.c",
>     Line: 691)
> 
>     So clearly we end up with a NULL MemoryContext. I haven't
>     investigated this any further, but it means the existing elog(LOG)
>     call in GTM_ThreadRemove() may do more harm than good.
> 
>     I'm wondering if importing a lot of infrastructure from a
>     single-threaded codebase into a multi-threaded GTM is a good idea
>     overall. What exactly are the benefits and how much does it cost us?
> 
> 
> TBH I don't think a lot of thought went into that when we wrote that
> piece initially. We were using a lot of code from Postgres, and quickly
> adding these infrastructure pieces helped us avoid rewriting other
> parts of the code. It has served us well so far, but I am sure we can
> do it in a better way if we decide to reimplement parts or all of
> that.
> 

Sure. To be clear, I understand that reusing this code made it possible
to get things working quickly. But perhaps we can now look at the code,
remove the bits that are not needed and tweak the implementation of
what remains. My feeling is that after cutting the unnecessary parts
the amount of code will be fairly small.

> 
>     I suspect it might be better to simply get rid of this code,
>     replace it with a fully thread-aware implementation, and be done
>     with it. We probably need only a few bits for the GTM:
> 
>     * logging (elog)
>     * memory management (per-thread/request context should be fine)
>     * error handling (although GTM doesn't seem to be using PG_TRY/PG_CATCH)
> 
>     So that seems quite doable. But before making any radical changes,
>     I'd like to hear your thoughts on this. I assume there are reasons
>     why it was originally implemented this way.
> 
> 
> My only concern would be that we might end up taking on a rather
> large-scale refactoring without really understanding what we are
> trying to achieve. You talked about several problems above, but I am
> not sure if memory context management or the elog infrastructure is
> the answer to most of them. Would it be more appropriate to fix the
> problems rather than reimplement everything? I think the answer to
> that will depend on how many and what kind of problems the current
> infrastructure has, and how much effort a rewrite would take,
> including further testing and stabilisation.
> 

I understand we need to be careful. If there's a simple way to fix the
issues, sure - let's do that. But I'm afraid the combination of
threading and a large amount of code that was never designed with
threading in mind is not particularly easy to fix.

I certainly don't think we should do one huge refactoring and change 
everything at once. That's a recipe for disaster.

So what I think we should do first is eliminate the unnecessary code
(like the try/catch support in elog), and only then refactor the
remaining bits.
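
To make the memory management part a bit more concrete, here is a very
rough sketch of the per-thread context idea, using pthread TLS so that
code like elog() never sees a NULL context while a thread is exiting.
The names (ThreadMemoryContext, GetThreadMemoryContext) and the trivial
allocation scheme are made up for illustration - this is not what the
GTM does today:

    #include <pthread.h>
    #include <stdlib.h>

    /* Hypothetical per-thread "memory context": every thread gets its
     * own context, looked up via TLS, so that logging code never has
     * to rely on a global CurrentMemoryContext that may be NULL while
     * a thread is being torn down. */
    typedef struct ThreadMemoryContext
    {
        void   *blocks;   /* bookkeeping for this thread's allocations */
    } ThreadMemoryContext;

    static pthread_key_t  context_key;
    static pthread_once_t context_key_once = PTHREAD_ONCE_INIT;

    static void
    create_context_key(void)
    {
        /* the destructor runs automatically on thread exit */
        pthread_key_create(&context_key, free);
    }

    ThreadMemoryContext *
    GetThreadMemoryContext(void)
    {
        ThreadMemoryContext *ctx;

        pthread_once(&context_key_once, create_context_key);
        ctx = pthread_getspecific(context_key);
        if (ctx == NULL)
        {
            ctx = calloc(1, sizeof(ThreadMemoryContext));
            pthread_setspecific(context_key, ctx);
        }
        return ctx;
    }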


>     Another idea is to abandon the thread-per-connection model, and
>     instead adopt the gtm_proxy model, where a limited number of threads
>     handles many connections. It works fine for the proxy, so why
>     wouldn't it work for the GTM too?
> 
> 
> Yeah, it should work. As I said upthread, we did not imagine the GTM
> with hundreds of threads, and we also thought that since one thread
> will be serving one proxy, it will have enough to do. So multiplexing
> multiple connections did not seem very interesting. But I agree that
> might not be true.
> 
> Are there other models to consider if we indeed attempt to rewrite
> some parts of this code?
> 

I don't think so. We might use something like libev to take an
event-based approach, or maybe use Go with its coroutines/channels to
handle the network events. But those seem like fairly radical changes,
meaning we'd have to rewrite pretty much all of the code. Switching from
the current thread-per-connection model to multiplexing (as in the GTM
proxy) seems comparatively simple. And we know it works nicely, thanks
to the experience with the GTM proxy.
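
Just to illustrate the model (this is not the actual gtm_proxy code),
one multiplexing worker could look roughly like this; handle_request()
is a hypothetical stand-in for reading and processing a single GTM
message:

    #include <poll.h>
    #include <stdbool.h>

    #define MAX_CONNS_PER_WORKER 1024

    /* One worker thread serving many GTM connections, roughly the model
     * gtm_proxy already uses: a small fixed pool of threads, each
     * running a poll() loop over the connections assigned to it.
     * handle_request() returns false on EOF or error. */
    typedef struct WorkerState
    {
        struct pollfd fds[MAX_CONNS_PER_WORKER];
        int           nfds;
    } WorkerState;

    extern bool handle_request(int fd);

    void *
    worker_main(void *arg)
    {
        WorkerState *w = (WorkerState *) arg;

        for (;;)
        {
            if (poll(w->fds, w->nfds, -1) < 0)
                continue;       /* EINTR etc. - just retry */

            for (int i = 0; i < w->nfds; i++)
            {
                if (w->fds[i].revents & (POLLIN | POLLERR | POLLHUP))
                {
                    if (!handle_request(w->fds[i].fd))
                        w->fds[i].fd = -1;  /* poll() skips fds < 0 */
                }
            }
        }
        return NULL;
    }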


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

