[Postgres-xl-developers] SQueue X, timeout while waiting for Consumers finishing

Tomas Vondra tomas.vondra at 2ndquadrant.com
Fri Aug 11 15:49:51 PDT 2017


Hi,

Over the past few days I've been looking into what's causing error 
messages like this

   WARNING: SQueue p_1_2_3, timeout while waiting for Consumers finishing

The warning is typically accompanied by a 10-second stall caused by the 
timeout, which is quite annoying because it's somewhat unpredictable.

It's related to a particularly complex part of the codebase, and I 
haven't been able to solve it yet. But I'd like to present what I've 
found so far - the root cause and how to reproduce it.

The root cause for this is fairly trivial - assume you have two tables:

     create table t1 as
     select i AS id1, 1 AS id2 from generate_series(1,10) s(i);

     create table t2 as select 1 AS id2;

     vacuum analyze;
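
Neither CREATE TABLE specifies a DISTRIBUTE BY clause, so Postgres-XL 
picks the distribution column automatically. If you want to double-check 
what it picked, a query along these lines should do (this assumes the 
pgxc_class catalog layout inherited from Postgres-XC, where locator 
type 'H' means hash distribution):

     -- distribution strategy and distribution column for t1/t2
     select c.relname, x.pclocatortype, a.attname
       from pgxc_class x
       join pg_class c on c.oid = x.pcrelid
       join pg_attribute a on a.attrelid = c.oid
                          and a.attnum = x.pcattnum
      where c.relname in ('t1', 't2');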

That is, t1 is distributed by (id1) while t2 is distributed by (id2). 
Now, let's join the tables like this:

     explain (costs off) select * from t1 join t2 using (id2);

As only one of the tables is distributed on the join column, the other 
one needs to be redistributed - that's what the outer side of the hash 
join in the query plan does (note the 'Distribute results by' line).

                          QUERY PLAN
     ---------------------------------------------------
      Remote Subquery Scan on all (dn1,dn2)
        ->  Hash Join
              Hash Cond: (t1.id2 = t2.id2)
              ->  Remote Subquery Scan on all (dn1,dn2)
                    Distribute results by H: id2
                    ->  Seq Scan on t1
              ->  Hash
                    ->  Seq Scan on t2
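
For contrast, when both tables are joined on their distribution columns, 
matching rows are guaranteed to be co-located (both sides hash the join 
key the same way), so the plan should not need a 'Distribute results by' 
step at all - for example:

     explain (costs off) select * from t1 join t2 on t1.id1 = t2.id2;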

This redistribution is done using something called "shared queue" (aka 
SQueue, mentioned in the warning message), and it's based on a 
producer-consumer pattern.

Each datanode acts as a producer, feeding data into a local SQueue, and 
expecting the other datanodes to connect and fetch their part of the 
data (and then join them with data from the other table).

The problem is that "t2" only has a single row, which means it's empty 
on one of the datanodes (say, dn1). What seems to be happening is that 
dn1 starts executing the Hash Join, and the first thing it does is 
build the hash table.

dn1 then realizes it can't possibly produce any rows with an empty hash 
table (for an inner join), and skips the Remote Subquery node entirely. 
That means it never even connects to the SQueue on dn2: it does not 
fetch any data from it, nor does it notify dn2 not to wait for it.

So dn2 ends up with data for dn1 still in the queue, and the way this is 
handled right now is "wait for 10 seconds and then give up". Which is 
exactly what the warning message is about.
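
To see this in practice, run the join itself - the EXPLAIN above only 
plans the query, it does not execute it. With \timing enabled in psql, 
the query should take a bit over 10 seconds and emit the warning:

     \timing on
     select * from t1 join t2 using (id2);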

Now, do this:

     insert into t2 values (3);

This should insert another row into t2, located on the other datanode 
(if both rows happen to hash to the same node, retry with a different 
value). You can verify it by checking the xc_node_id system column in 
t2 - there should be two different values:

     select xc_node_id, * from t2;
      xc_node_id  | id2
     -------------+-----
       -675012441 |   1
      -1047623914 |   3
     (2 rows)
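
If you want to map those hashes back to actual node names, the pgxc_node 
catalog should carry the same value in its node_id column (node_type 'D' 
marks datanodes):

     select node_name, node_id from pgxc_node where node_type = 'D';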

If you execute the query again, it should complete almost immediately.

This is a pretty simple example of what is happening. Of course, actual 
clusters probably have more datanodes and larger amounts of data, but 
the root cause is simple: the query matches no data on one of the 
datanodes, so that datanode decides not to connect to the remote shared 
queues. The other datanodes then wait for 10 seconds before giving up.

This explains why the issue is more common in OLTP - such systems 
typically work with smaller amounts of data, so it's more likely that a 
query matches no data on one of the datanodes. OLAP systems usually deal 
with much more data, making this less probable.

The question is how to fix this. I think what we should do is simply 
connect to the SQueue and notify the other nodes that they can throw 
away some of the data. I'm not sure how to do that, yet :-(

What I'm pretty sure about is that we should get rid of the timeouts 
altogether, as they are a rather unreliable way to deal with this. For 
example, what if building the hash table takes 15 seconds on one of the 
datanodes, for whatever reason (much more data, the node being 
temporarily overloaded, ...)? I'd bet the other datanodes would wait for 
the 10 seconds and then give up. I don't think this could produce 
incorrect results, though.

This also means lowering the timeout value is not a viable approach, as 
it makes such failures more likely.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

