[Postgres-xl-general] Segfault on query

Dennis dennisr at visi.com
Thu Oct 9 10:53:04 PDT 2014


1. Yes, this is the coordinator that is throwing the seg fault (I was surprised too, I thought it would be on the datanode myself)
2.  As simpler query that was throwing the error until up’d the memory from defaults on the coordinator is:
select
t1.store_tag,
t2.product_key,
count(*)
from sn_dm.t_quad_f t1,sn_dm.nb_products t2, sn_dm.nb_markets t3, sn_dm.sn_nb_periods t4
where (t2.product_key = t1.upc)
and (t3.mkt_tag = t1.store_tag)
and (t4.period = t1.period)
and t4.orderno <= 26
and t1.store_tag = 55
group by t1.store_tag, t2.product_key;

3.  I will have to recompile with debug and send you the stack trace in another email.

Thanks

On Oct 9, 2014, at 10:53 AM, Mason Sharp <msharp at translattice.com> wrote:

> 
> 
> On Thu, Oct 9, 2014 at 12:17 PM, Dennis <dennisr at visi.com> wrote:
> Hi I am getting segfaults on coordinator nodes with queries that need to process large group by clauses and aggregates.  Is there anything more that I can do with the configuration param’s to alleviate the seg faults or is this a bug that will just need to be fixed?
> 
> I have both the a coordinator and datanode runnning on the same box.  Overall architecture is two coordinators, three datanodes, gtm node with two of the datanodes on the same boxes as the coordinators.
> 
> I am running with the latest GIT hub pull and Pavan's patches from earlier this month. 
> 
> Explain Plan for a query that causes the excpetion:
>                                                                            QUERY PLAN                                                                
>              
> ------------------------------------------------------------------------------------------------------------------------------------------------------
> -------------
>  Sort  (cost=16800461.97..16819557.02 rows=7638020 width=874)
>    Sort Key: b.tdp52
>    ->  Subquery Scan on b  (cost=11726523.01..15927254.22 rows=7638020 width=874)
>          ->  GroupAggregate  (cost=11726523.01..15163452.22 rows=7638020 width=147)
>                Filter: (sum((sum(CASE WHEN (t4.orderno <= 13) THEN t1.b_units ELSE 0::numeric END))) > 2::numeric)
>                ->  Remote Subquery Scan on all (datanode_pgres01,datanode_pgres02,datanode_pgres04)  (cost=11726523.01..11768430.59 rows=16763032 widt
> h=147)
>                      ->  GroupAggregate  (cost=11726523.01..11768430.59 rows=7638020 width=147)
>                            ->  Sort  (cost=11726523.01..11768430.59 rows=16763032 width=147)
>                                  Sort Key: t1.store_tag, t2.product_key, t3.description, t2.description, t2.category
>                                  ->  Nested Loop  (cost=121437.05..9715061.44 rows=16763032 width=147)
>                                        ->  Hash Join  (cost=121432.78..9505511.96 rows=8381516 width=111)
>                                              Hash Cond: (t1.period = "*SELECT* 1".period)
>                                              ->  Hash Join  (cost=121403.48..9390236.82 rows=8381516 width=107)
>                                                    Hash Cond: ((t1.upc)::text = (t2.product_key)::text)
>                                                    ->  Index Scan using qd_f_s_idx on t_quad_f t1  (cost=0.00..9069772.33 rows=8381516 width=56)
>                                                          Index Cond: (store_tag = 55)
>                                                    ->  Hash  (cost=114365.99..114365.99 rows=562999 width=66)
>                                                          ->  Seq Scan on nb_products t2  (cost=0.00..114365.99 rows=562999 width=66)
>                                              ->  Hash  (cost=28.30..28.30 rows=80 width=12)
>                                                    ->  Append  (cost=7.34..28.30 rows=80 width=12)
>                                                          ->  Subquery Scan on "*SELECT* 1"  (cost=7.34..8.78 rows=16 width=12)
>                                                                Filter: ("*SELECT* 1".orderno <= 26)
>                                                                ->  WindowAgg  (cost=7.34..8.18 rows=48 width=45)
>                                                                      ->  Sort  (cost=7.34..7.46 rows=48 width=45)
>                                                                            Sort Key: sn_dm.nb_periods.period
>                                                                            ->  Remote Subquery Scan on all (datanode_pgres02)  (cost=0.00..6.00 rows=4
> 8 width=45)
>                                                                                  ->  Seq Scan on nb_periods  (cost=0.00..6.00 rows=48 width=45)
>                                                                                       Filter: ((lvl)::text = 'P'::text)
>                                                          ->  Subquery Scan on "*SELECT* 2"  (cost=13.28..19.52 rows=64 width=12)
>                                                                Filter: ("*SELECT* 2".orderno <= 26)
>                                                                ->  WindowAgg  (cost=13.28..17.12 rows=192 width=45)
>                                                                      ->  Sort  (cost=13.28..13.76 rows=192 width=45)
>                                                                            Sort Key: sn_dm.nb_periods.period
>                                                                            ->  Remote Subquery Scan on all (datanode_pgres02)  (cost=0.00..6.00 rows=1
> 92 width=45)
>                                                                                  ->  Seq Scan on nb_periods  (cost=0.00..6.00 rows=192 width=45)
>                                                                                       Filter: ((lvl)::text = 'W'::text)
>                                        ->  Materialize  (cost=4.27..11.58 rows=2 width=40)
>                                              ->  Bitmap Heap Scan on nb_markets t3  (cost=4.27..11.57 rows=2 width=40)
>                                                    Recheck Cond: (mkt_tag = 55)
>                                                    ->  Bitmap Index Scan on mkt_tf_idx  (cost=0.00..4.27 rows=2 width=0)
>                                                          Index Cond: (mkt_tag = 55)
> (41 rows)
> 
> 
> 
> ========
> Logs from coordinator where the query was run.
> 
> DETAIL:  Failed process was running: select comp_tag,upc,market,item,category,weeks12,weeks24,weeks52,b_dollars12,b_dollars24,b_dollars52,tdp12,tdp24,tdp52,b_units12,b_units24,
>         b_units52,b_eq_units12,b_eq_units24,b_eq_units52,p_b_dollars12,p_b_dollars24,p_b_dollars52,p_tdp12,p_tdp24,p_tdp52,p_b_units12,p_b_units24,p_b_units52,
>         (case when b_units12 > 0 then b_dollars12/b_units12 else 0 end) cur_arp_12,
>         (case when b_units24 > 0 then b_dollars24/b_units24 else 0 end) cur_arp_24,
>         (case when b_units52 > 0 then b_dollars52/b_units52 else 0 end) cur_arp_52,
>         (case when p_b_units12 > 0 then p_b_dollars12/p_b_units12 else 0 end) pre_arp_12,
>         (case when p_b_units24 > 0 then p_b_dollars24/p_b_units24 else 0 end) pre_arp_24,
>         (case when p_b_units52 > 0 then p_b_dollars52/p_b_units52 else 0 end) pre_arp_52,
>         (case when b_units52 >0 and p_b_units52>0  then ((b_dollars52/b_units52)/(p_b_dollars52/p_b_units52))-1 else 0 end) arp_chg_52,
>         (case when b_units24 >0 and p_b_units24>0  then ((b_dollars24/b_units24)/(p_b_dollars24/p_b_units24))-1 else 0 end) arp_chg_24,
>         (
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> FATAL:  the database system is in recovery mode
> LOG:  all server processes terminated; reinitializing
> LOG:  database system was interrupted; last known up at 2014-10-09 09:46:02 CDT
> LOG:  database system was not properly shut down; automatic recovery in progress
> LOG:  redo starts at 0/3FBDED0
> LOG:  record with zero length at 0/3FBE070
> LOG:  redo done at 0/3FBE040
> LOG:  last completed transaction was at log time 2014-10-09 09:49:07.195225-05
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> 
> 
> I have the following non-default param's set in my coordinators postgresql.conf:
> 
> maintenance_work_mem = 12GB 
> constraint_exclusion = on
> checkpoint_completion_target = 0.9
> effective_cache_size = 176GB 
> #work_mem = 4GB 
> work_mem = 8GB 
> wal_buffers = 32MB 
> checkpoint_segments = 256
> shared_buffers = 16GB
> 
> maintence_work_mem is set to large value for some data loads that we are doing.
> 
> 
> Just to confirm, these are values on the coordinators and not datanodes? Normally the heavy lifting is done in the datanodes, but it does look like the coordinator is busy, too.  It looks like you uncovered a problem.
>  
> Hardware:
> Memory: 252GB
> 32 core.
> 
> 
> 
> 
> If it is a segmentation fault, then yeah, we have a problem that we definitely need to fix. If you built with debug symbols (configure with --enable-debug) and have a stack trace from the core file (gdb backtrace), that may help us to pinpoint the issue.
> 
> Also, if you can simplify your query while still triggering the issue, it may be easier for us to reproduce here.
> 
> 
> 
> -- 
> Mason Sharp
> 
> TransLattice - http://www.translattice.com
> Clustered and Distributed Database Solutions

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.postgres-xl.org/pipermail/postgres-xl-general-postgres-xl.org/attachments/20141009/537821d5/attachment.htm>


More information about the postgres-xl-general mailing list