[Postgres-xl-developers] 答复: join restriction(Internet mail)

senhu(胡森) senhu at tencent.com
Wed Jun 21 18:25:38 PDT 2017


Hi:
   I will make more researches on this in few days, and use the tpch to test the performance, then
I will give some cases to discuss.

Regards
Senhu(senhu at tencent.com)

-----邮件原件-----
发件人: Tomas Vondra [mailto:tomas.vondra at 2ndquadrant.com] 
发送时间: 2017年6月22日 4:12
收件人: senhu(胡森); postgres-xl-developers at lists.postgres-xl.org
主题: Re: [Postgres-xl-developers] join restriction(Internet mail)

Hello senhu,

On 6/21/17 9:36 AM, senhu(胡森) wrote:
> Hello, all:
> 
> When I run some join queries on PG-XL 9.6, I find that if the data 
> types of two join args are different, the
> 
> data redistribution is forbidded, and I find this restriction on 
> source code. In function ‘set_joinpath_distribution’,
> 

Can you point us to the query triggering this?

 >
> there is piece of code:
> 
>    /*
>     * Check if both parts are of the same data type and choose
>     * distribution type to redistribute.
>     * XXX We may want more sophisticated algorithm to choose
>     * the best condition to redistribute parts along.
>     * For now use simple but reliable approach.
>     */
> 
>     if (leftType != rightType)
>         continue;
> 
> If we can remove this restriction to make join more efficient with 
> different join data types, such as int join bigint.
> 

I guess the reason is that even if the two values end up being equal (after casing one of the values to the other type), that may not apply to the hash value. And if the hash values are different, the values may end up on different data nodes, breaking the whole distribution logic.

It may work for hashint4() and hashint8(), but it probably does not work for hashint4() and hashtext(). So the question is if we can identify which combinations of hash functions are safe, and which are not.

That should be fairly stable, because hash partitioning relies on that too, I believe.

> On the other hand, we can add some other conditions to make the 
> decision. For example, we can use the row counts of both
> 
> Left and right rels of join.
> 
> If both rels have small row counts, data redistributed is not needed, 
> gather the tuples of two rels, and join on one node.
 >
> Otherwise, we try to redistribute the smaller rel as possible.
> 

I think this should be doable by simply constructing two paths with different costs and distributions. But I'm not sure it's worth it, as it will move all the work on a single node.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



More information about the Postgres-xl-developers mailing list