[Postgres-xl-general] Best XL Configuration For Large Volume INSERT Statements

Mason Sharp msharp at translattice.com
Tue Sep 30 09:27:29 PDT 2014


Hi Richard,

On Tue, Sep 30, 2014 at 11:03 AM, Silva, Richard <rsilva at javelin.mg> wrote:

>  Postgres-XL Community,
>
>  Our XL environment currently has 10 data-nodes, 1 coordinator, 1 gtm
> proxy, 1 gtm, and 1 gtm standby, all running on 8-core machines with 16GB of RAM
> and 1TB SSD drives. Up to this point, everything has been considerably
> faster than our single server production environment with the exception of
> INSERT statements.
>
>
Great to hear that it is faster in most cases.


>  *Scenario:*
> We conduct a simple join of two tables and write the results out to a new
> table.
>
>  CREATE TABLE myschema.mytable AS
> SELECT m.id
>       ,m.project_id
>       ,m.account_id
>       ,m.company_name
>       ,p.start_date
>       ,p.end_date
> FROM (SELECT smb_project_id
>             ,start_date
>             ,end_date
>       FROM myschema.projects) AS p
>   INNER JOIN (SELECT id
>                    ,project_id
>                    ,account_id
>                    ,company_name
>              FROM myschema.customers) AS m
>     ON m.project_id = p.smb_project_id;
>
>  The resulting SELECT statement returns close to 100 million records
> which then get inserted into a table for further aggregation (among other
> things). The current run time averages about 45 minutes for both physical
> and temporary tables. I understand there are lots of different ways to
> accomplish inserts of this nature such as COPY; however, that's not really
> what I’m looking for.
>
>  *Question:*
> What are the recommended configuration settings for XL when it comes to
> large volume inserts?
>
>  Here are the postgresql.conf settings I’ve changed on the datanodes,
> which have had little impact.
>
>  wal_level = minimal
> wal_sync_method = fdatasync
> synchronous_commit = off
> default_statistics_target = 50
> maintenance_work_mem = 208MB
> constraint_exclusion = on
> checkpoint_completion_target = 0.9
> effective_cache_size = 2560MB
> work_mem = 20MB
> wal_buffers = -1
> checkpoint_segments = 16
> shared_buffers = 832MB
> max_prepared_transactions = 100
>
>
>  Any suggestions or possible links to suggestions?
>
>
As Koichi Suzuki mentioned, joining the two tables on their respective
distribution (partition) keys would help the most, since the join can then be
pushed down to the datanodes. While not ideal, a bit of denormalization could
also help, say, putting the customer id in the projects table and including it
in the WHERE clause.
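
For example, if both tables could be hash-distributed on the join key, each
datanode can execute its share of the join locally instead of everything being
pulled through the coordinator. A rough sketch only -- the column types below
are my guesses, not your actual schema, and this assumes project_id /
smb_project_id distributes your data reasonably evenly:

CREATE TABLE myschema.projects (
    smb_project_id  bigint,
    start_date      date,
    end_date        date
) DISTRIBUTE BY HASH (smb_project_id);

CREATE TABLE myschema.customers (
    id              bigint,
    project_id      bigint,
    account_id      bigint,
    company_name    text
) DISTRIBUTE BY HASH (project_id);

Running EXPLAIN on the join afterwards should let you check whether it is
being shipped to the datanodes in a single step.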

You could try increasing work_mem a bit in the current session using SET. I
suspect you won't get much of a bump in performance, however.
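
Something along these lines, in the same session that runs the CREATE TABLE
... AS (the 256MB figure is only an illustration; size it to the RAM and
concurrency on your datanodes):

-- raise work_mem for this session only; the value is illustrative
SET work_mem = '256MB';
-- then run the CREATE TABLE myschema.mytable AS SELECT ... from this session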

Another parameter to try is shared_queue_size. Increasing it takes up more
shared memory, so do so cautiously. Here, too, I don't expect much of a gain.
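
If you do want to experiment with it, it goes in postgresql.conf on the
datanodes alongside the settings you listed. The value below is purely
illustrative, and if I remember correctly the change only takes effect after a
restart:

# shared queues pass rows between nodes during distributed execution;
# the value here is illustrative, not a recommendation
shared_queue_size = 256kB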

Are you CPU bound while this runs?  I was recently looking at some code and
saw something I think I can improve in the row-shipping execution path to make
it more efficient.  I will post a patch when I can.

Regards,

Mason