[Postgres-xl-developers] SQueue X, timeout while waiting for Consumers finishing
Tomas Vondra
tomas.vondra at 2ndquadrant.com
Fri Aug 11 15:49:51 PDT 2017
Hi,
Over the past few days I've been looking into what's causing warning
messages like this:
WARNING: SQueue p_1_2_3, timeout while waiting for Consumers finishing
The warning is typically accompanied by a 10-second stall caused by a
timeout. It's quite annoying, because it's somewhat unpredictable.
It's related to a particularly complex part of the codebase, and I
haven't been able to solve it yet. But I'd like to present what I've
found so far - the root cause and how to reproduce it.
The root cause for this is fairly trivial - assume you have two tables:
create table t1 as
select i AS id1, 1 AS id2 from generate_series(1,10) s(i);
create table t2 as select 1 AS id2;
vacuum analyze;
That is, t1 is distributed by (id1) while t2 is distributed by (id2).
Now, let's join the tables like this:
explain (costs off) select * from t1 join t2 using (id2);
As only one of the tables is distributed on the join column, the other
one needs to be redistributed - that's what outer side of the hash join
in the query plan does (note the 'Distribute results by' line).
QUERY PLAN
---------------------------------------------------
Remote Subquery Scan on all (dn1,dn2)
-> Hash Join
Hash Cond: (t1.id2 = t2.id2)
-> Remote Subquery Scan on all (dn1,dn2)
Distribute results by H: id2
-> Seq Scan on t1
-> Hash
-> Seq Scan on t2
This redistribution is done using something called "shared queue" (aka
SQueue, mentioned in the warning message), and it's based on a
producer-consumer pattern.
Each datanode acts as a producer, feeding data into a local SQueue, and
expecting the other datanodes to connect and fetch their part of the
data (and then join them with data from the other table).
The problem is that "t2" only has a single row, which means it's empty
on one of the datanodes (say, dn1). What seems to be happening is that
dn1 starts executing the Hash Join, and the first thing it does is
build the hash table.
dn1 then realizes it can't possibly produce any rows with an empty hash
table (for an inner join), and entirely skips the Remote Subquery bit.
Which means it never even connects to the SQueue on dn2, does not fetch
any data from it, nor does it notify dn2 not to wait for it.
So dn2 ends up with data for dn1 still in the queue, and the way this is
handled right now is "wait for 10 seconds and then give up". Which is
exactly what the warning message is about.
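The interaction above can be sketched as a toy model (hypothetical
names, not actual Postgres-XL code; the names SQueue, dn1, dn2 and the
timeout value are assumptions made for illustration): the producer
waits for every expected consumer to connect, but the datanode that
short-circuited on an empty hash table never shows up, so the producer
hits its timeout.

```python
# Simplified model of the SQueue stall (not Postgres-XL code).
# A producer waits for all expected consumers to connect; one of them
# skips the plan subtree entirely and never connects.
import threading
import time

SQUEUE_TIMEOUT = 0.5  # the real code waits 10 seconds


class SQueue:
    def __init__(self, expected_consumers):
        self.expected = set(expected_consumers)
        self.connected = set()
        self.cond = threading.Condition()

    def connect(self, consumer):
        # A consumer attaches to the queue to fetch its share of the data.
        with self.cond:
            self.connected.add(consumer)
            self.cond.notify_all()

    def wait_for_consumers(self):
        # Producer side: wait until every expected consumer has connected,
        # or give up after the timeout (the source of the WARNING).
        deadline = time.monotonic() + SQUEUE_TIMEOUT
        with self.cond:
            while self.connected != self.expected:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # "timeout while waiting for Consumers"
                self.cond.wait(remaining)
            return True


# dn2 produced data for both datanodes ...
squeue = SQueue(expected_consumers={"dn1", "dn2"})
# ... but dn1's hash table was empty, so only dn2 ever connects.
squeue.connect("dn2")
if not squeue.wait_for_consumers():
    print("WARNING: SQueue timeout while waiting for Consumers finishing")
```

Running this prints the warning after the (shortened) timeout, which is
exactly the behavior described above.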
Now, do this:
insert into t2 values (3);
This should insert another row into t2, located on the other datanode.
You can verify it by checking xc_node_id system column in t2 - there
should be two different values:
select xc_node_id, * from t2;
xc_node_id | id2
-------------+-----
-675012441 | 1
-1047623914 | 3
(2 rows)
If you execute the query again, it should complete almost immediately.
This is a pretty simple example of what is happening. Of course, actual
clusters probably have more datanodes and larger amounts of data, but
the root cause is simple: The query does not match any data on one of
the datanodes, and decides not to connect to a remote shared queue. The
other datanodes then wait for 10 seconds before giving up.
This explains why the issue is more common in OLTP workloads - those
systems typically work with smaller amounts of data, so it's more likely
that a query matches no rows on one of the datanodes. OLAP systems
usually deal with much more data, making this less probable.
The question is how to fix this. I think what we should do is simply
connect to the SQueue anyway and notify the other nodes that they can
throw away our share of the data. I'm not sure how to do that, yet :-(
What I'm pretty sure about is that we should get rid of the timeouts
altogether, as they're a rather unreliable way to deal with this. For
example, what if building the hash table takes 15 seconds on one of the
datanodes, for whatever reason (much more data, the node being
temporarily overloaded, ...)? I'd bet the other datanodes will wait for
the 10 seconds and then give up. I don't think this could produce
incorrect results, though.
This also means lowering the timeout value is not a viable approach, as
it makes such failures more likely.
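To make the "notify instead of timeout" idea concrete, here is a
self-contained sketch under the same toy model as before (hypothetical
names like SharedQueue, unbind and finished are mine, not Postgres-XL
API): a consumer that knows it will read nothing sends an explicit
"unbind" so the producer can discard its share immediately, and the
producer then needs no timeout at all.

```python
# Sketch of the proposed fix (not Postgres-XL code): every consumer
# either fetches its data or explicitly unbinds, so the producer can
# wait without any timeout.
import threading


class SharedQueue:
    def __init__(self, consumers):
        self.pending = set(consumers)  # consumers we still owe data to
        self.cond = threading.Condition()

    def _done(self, consumer):
        with self.cond:
            self.pending.discard(consumer)
            self.cond.notify_all()

    def finished(self, consumer):
        # The consumer connected and fetched all of its data.
        self._done(consumer)

    def unbind(self, consumer):
        # The consumer will fetch nothing (e.g. empty hash table on an
        # inner join); its share of the queued data can be discarded.
        self._done(consumer)

    def wait_for_consumers(self):
        # No timeout needed: every consumer either finishes or unbinds.
        with self.cond:
            while self.pending:
                self.cond.wait()


q = SharedQueue({"dn1", "dn2"})
q.unbind("dn1")        # dn1's hash table is empty; it skips the fetch
q.finished("dn2")      # dn2 consumes its data normally
q.wait_for_consumers() # returns immediately, no 10-second stall
```

The point of the sketch is only that the producer's exit condition
becomes deterministic once every consumer is guaranteed to report in,
one way or the other.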
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services