[Postgres-xl-general] Segfault on query

Mason Sharp msharp at translattice.com
Thu Oct 9 09:53:26 PDT 2014


On Thu, Oct 9, 2014 at 12:17 PM, Dennis <dennisr at visi.com> wrote:

> Hi, I am getting segfaults on coordinator nodes with queries that need to
> process large GROUP BY clauses and aggregates.  Is there anything more I
> can do with the configuration parameters to alleviate the segfaults, or is
> this a bug that will just need to be fixed?
>
> I have both a coordinator and a datanode running on the same box.  The
> overall architecture is two coordinators, three datanodes, and a GTM node,
> with two of the datanodes on the same boxes as the coordinators.
>
> I am running with the latest GitHub pull and Pavan's patches from earlier
> this month.
>
> Explain plan for a query that causes the exception:
>
>  QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=16800461.97..16819557.02 rows=7638020 width=874)
>    Sort Key: b.tdp52
>    ->  Subquery Scan on b  (cost=11726523.01..15927254.22 rows=7638020 width=874)
>          ->  GroupAggregate  (cost=11726523.01..15163452.22 rows=7638020 width=147)
>                Filter: (sum((sum(CASE WHEN (t4.orderno <= 13) THEN t1.b_units ELSE 0::numeric END))) > 2::numeric)
>                ->  Remote Subquery Scan on all (datanode_pgres01,datanode_pgres02,datanode_pgres04)  (cost=11726523.01..11768430.59 rows=16763032 width=147)
>                      ->  GroupAggregate  (cost=11726523.01..11768430.59 rows=7638020 width=147)
>                            ->  Sort  (cost=11726523.01..11768430.59 rows=16763032 width=147)
>                                  Sort Key: t1.store_tag, t2.product_key, t3.description, t2.description, t2.category
>                                  ->  Nested Loop  (cost=121437.05..9715061.44 rows=16763032 width=147)
>                                        ->  Hash Join  (cost=121432.78..9505511.96 rows=8381516 width=111)
>                                              Hash Cond: (t1.period = "*SELECT* 1".period)
>                                              ->  Hash Join  (cost=121403.48..9390236.82 rows=8381516 width=107)
>                                                    Hash Cond: ((t1.upc)::text = (t2.product_key)::text)
>                                                    ->  Index Scan using qd_f_s_idx on t_quad_f t1  (cost=0.00..9069772.33 rows=8381516 width=56)
>                                                          Index Cond: (store_tag = 55)
>                                                    ->  Hash  (cost=114365.99..114365.99 rows=562999 width=66)
>                                                          ->  Seq Scan on nb_products t2  (cost=0.00..114365.99 rows=562999 width=66)
>                                              ->  Hash  (cost=28.30..28.30 rows=80 width=12)
>                                                    ->  Append  (cost=7.34..28.30 rows=80 width=12)
>                                                          ->  Subquery Scan on "*SELECT* 1"  (cost=7.34..8.78 rows=16 width=12)
>                                                                Filter: ("*SELECT* 1".orderno <= 26)
>                                                                ->  WindowAgg  (cost=7.34..8.18 rows=48 width=45)
>                                                                      ->  Sort  (cost=7.34..7.46 rows=48 width=45)
>                                                                            Sort Key: sn_dm.nb_periods.period
>                                                                            ->  Remote Subquery Scan on all (datanode_pgres02)  (cost=0.00..6.00 rows=48 width=45)
>                                                                                  ->  Seq Scan on nb_periods  (cost=0.00..6.00 rows=48 width=45)
>                                                                                        Filter: ((lvl)::text = 'P'::text)
>                                                          ->  Subquery Scan on "*SELECT* 2"  (cost=13.28..19.52 rows=64 width=12)
>                                                                Filter: ("*SELECT* 2".orderno <= 26)
>                                                                ->  WindowAgg  (cost=13.28..17.12 rows=192 width=45)
>                                                                      ->  Sort  (cost=13.28..13.76 rows=192 width=45)
>                                                                            Sort Key: sn_dm.nb_periods.period
>                                                                            ->  Remote Subquery Scan on all (datanode_pgres02)  (cost=0.00..6.00 rows=192 width=45)
>                                                                                  ->  Seq Scan on nb_periods  (cost=0.00..6.00 rows=192 width=45)
>                                                                                        Filter: ((lvl)::text = 'W'::text)
>                                        ->  Materialize  (cost=4.27..11.58 rows=2 width=40)
>                                              ->  Bitmap Heap Scan on nb_markets t3  (cost=4.27..11.57 rows=2 width=40)
>                                                    Recheck Cond: (mkt_tag = 55)
>                                                    ->  Bitmap Index Scan on mkt_tf_idx  (cost=0.00..4.27 rows=2 width=0)
>                                                          Index Cond: (mkt_tag = 55)
> (41 rows)
>
>
>
> ========
> Logs from the coordinator where the query was run:
>
> DETAIL:  Failed process was running: select
> comp_tag,upc,market,item,category,weeks12,weeks24,weeks52,b_dollars12,b_dollars24,b_dollars52,tdp12,tdp24,tdp52,b_units12,b_units24,
>
>   b_units52,b_eq_units12,b_eq_units24,b_eq_units52,p_b_dollars12,p_b_dollars24,p_b_dollars52,p_tdp12,p_tdp24,p_tdp52,p_b_units12,p_b_units24,p_b_units52,
>         (case when b_units12 > 0 then b_dollars12/b_units12 else 0 end) cur_arp_12,
>         (case when b_units24 > 0 then b_dollars24/b_units24 else 0 end) cur_arp_24,
>         (case when b_units52 > 0 then b_dollars52/b_units52 else 0 end) cur_arp_52,
>         (case when p_b_units12 > 0 then p_b_dollars12/p_b_units12 else 0 end) pre_arp_12,
>         (case when p_b_units24 > 0 then p_b_dollars24/p_b_units24 else 0 end) pre_arp_24,
>         (case when p_b_units52 > 0 then p_b_dollars52/p_b_units52 else 0 end) pre_arp_52,
>         (case when b_units52 >0 and p_b_units52>0 then ((b_dollars52/b_units52)/(p_b_dollars52/p_b_units52))-1 else 0 end) arp_chg_52,
>         (case when b_units24 >0 and p_b_units24>0 then ((b_dollars24/b_units24)/(p_b_dollars24/p_b_units24))-1 else 0 end) arp_chg_24,
>         (
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> FATAL:  the database system is in recovery mode
> LOG:  all server processes terminated; reinitializing
> LOG:  database system was interrupted; last known up at 2014-10-09
> 09:46:02 CDT
> LOG:  database system was not properly shut down; automatic recovery in
> progress
> LOG:  redo starts at 0/3FBDED0
> LOG:  record with zero length at 0/3FBE070
> LOG:  redo done at 0/3FBE040
> LOG:  last completed transaction was at log time 2014-10-09
> 09:49:07.195225-05
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
>
>
> I have the following non-default parameters set in my coordinators'
> postgresql.conf:
>
> maintenance_work_mem = 12GB
> constraint_exclusion = on
> checkpoint_completion_target = 0.9
> effective_cache_size = 176GB
> #work_mem = 4GB
> work_mem = 8GB
> wal_buffers = 32MB
> checkpoint_segments = 256
> shared_buffers = 16GB
>
> maintenance_work_mem is set to a large value for some data loads that we
> are doing.
>
>
Just to confirm, these are the values on the coordinators and not the
datanodes? Normally the heavy lifting is done on the datanodes, but it does
look like the coordinator is busy, too.  It looks like you have uncovered a
problem.
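If it helps, one way to double-check what each node is actually running with
is to connect to a coordinator and to a datanode directly and run something
like the following (pg_settings is standard; filtering on source just hides
the built-in defaults):

    select name, setting, source
      from pg_settings
     where source not in ('default', 'override')
     order by name;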


> Hardware:
> Memory: 252GB
> 32 core.
>
>
>

If it is a segmentation fault, then yeah, we have a problem that we
definitely need to fix. If you built with debug symbols (configure with
--enable-debug) and have a stack trace from the core file (gdb backtrace),
that may help us to pinpoint the issue.
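Roughly, the steps would be something like this (the paths are only
placeholders for your source tree and install directory; where the core file
ends up depends on your OS settings, e.g. kernel.core_pattern, and you may
need "ulimit -c unlimited" in the environment that starts the server):

    cd /path/to/postgres-xl-source
    ./configure --enable-debug --prefix=/path/to/install   # plus your usual options
    make && make install
    # after reproducing the crash, point gdb at the server binary and the core file
    gdb /path/to/install/bin/postgres /path/to/core
    (gdb) bt full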

Also, if you can simplify your query while still triggering the issue, it
may be easier for us to reproduce here.
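For example, going only by the table and column names in the plan above (the
full query is cut off in the log, so this is just a guess at its shape), a
cut-down version that keeps the grouped aggregate and the HAVING filter might
look something like:

    select t1.store_tag, t2.product_key,
           sum(case when t4.orderno <= 13 then t1.b_units else 0 end) as b_units12
      from t_quad_f t1
      join nb_products t2 on t2.product_key::text = t1.upc::text
      join (select period, row_number() over (order by period) as orderno
              from nb_periods
             where lvl = 'P') t4 on t4.period = t1.period
     where t1.store_tag = 55
     group by t1.store_tag, t2.product_key
    having sum(case when t4.orderno <= 13 then t1.b_units else 0 end) > 2;

If something like that still crashes the coordinator, keep removing joins and
columns until it stops; the smallest query that still segfaults is the most
useful one for us.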



-- 
Mason Sharp

TransLattice - http://www.translattice.com
Clustered and Distributed Database Solutions