Tens of gigabytes of data. An amount that doesn’t quite fit into an Excel spreadsheet but doesn’t warrant the use of big data tools like Hadoop either. PostgreSQL can easily handle this volume. This post is about speeding up the initial import of a large number of rows.
Important settings in postgresql.conf
PostgreSQL’s configuration is geared towards reliability by default rather than speed. But in our scenario, all the data lives in CSV files, so it doesn’t matter if something happens to the PostgreSQL database: in the case of data corruption, we can simply re-run the import. With the settings below, PostgreSQL no longer guarantees that data has been flushed to disk by the time a transaction commits. Relaxing these synchronization guarantees allows much higher write throughput, which is a good trade-off for a one-time import. Make sure you use these only on a server that isn’t running any other important transactions.
These are the relevant settings in postgresql.conf:
wal_level = minimal        # write only the WAL needed for crash recovery
fsync = off                # don't force writes to be flushed to disk
synchronous_commit = off   # commits return before the WAL is flushed
wal_writer_delay = 2000ms  # flush the WAL less often (default is 200ms)
commit_delay = 100000      # delay the WAL flush to group commits, in microseconds
Remember to return them to their default values after the import is finished.
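Note that fsync and wal_level only take effect after a server restart. If editing postgresql.conf isn’t an option, synchronous_commit alone can also be relaxed for just the importing session; a minimal sketch with psycopg2 (the connection string mirrors the one used in the import script below):

import psycopg2

dbc = psycopg2.connect("host=db port=5432 dbname=mydb user=postgres password=mypwd")
with dbc.cursor() as cur:
    # Only synchronous_commit can be changed per session; fsync and
    # wal_level are server-wide settings and require a restart.
    cur.execute("SET synchronous_commit TO OFF")
    # ... run the import in this session ...
dbc.commit()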
Indices slow down imports
Proper indices on a table can speed up queries tremendously. Their dark side is that every index has to be updated on each insert. For bulk inserts it’s more efficient to drop the indices before the import and recreate them afterwards, as in the sketch below.
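A minimal sketch of that workflow with psycopg2 (the index name my_table_name_idx and its definition are made up; adapt them to your schema):

import psycopg2

dbc = psycopg2.connect("host=db port=5432 dbname=mydb user=postgres password=mypwd")

with dbc.cursor() as cur:
    # Drop the index so the bulk insert doesn't have to maintain it.
    cur.execute("DROP INDEX IF EXISTS my_table_name_idx")
dbc.commit()

# ... run the bulk import here ...

with dbc.cursor() as cur:
    # Recreate the index in a single pass over the finished table.
    cur.execute("CREATE INDEX my_table_name_idx ON my_table (name)")
dbc.commit()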
COPY instead of INSERT
I was importing data from CSV files with some light preprocessing in Python. pgcopy is a library that encodes the data in PostgreSQL’s binary COPY format, the most efficient way to transfer it to the server. The whole import script looks something like this:
#!/usr/bin/env python3
import csv

from pgcopy import CopyManager
import psycopg2


def tmpfile():
    # pgcopy serializes the rows into a temporary file before streaming
    # them to the server; a large buffer cuts down on syscalls.
    return open('./tmp.tmp', 'wb+', buffering=4096*1024)


def prepare_data(row):
    """Preprocess each row"""
    return row


def csv_enumerator(csvfile):
    csvreader = csv.reader(
        open(csvfile, 'r', buffering=4096*1024), delimiter=';')
    for index, row in enumerate(csvreader):
        if index > 0:  # skip the header row
            data = prepare_data(row)
            yield data


dbc = psycopg2.connect("host=db port=5432 dbname=mydb user=postgres password=mypwd")
columns = ["name", "surname"]
mgr = CopyManager(dbc, 'my_table', columns)
mgr.copy(csv_enumerator("input.csv"), tmpfile)
dbc.commit()
Thanks to Aryeh Leib Taurog for coming up with this great library.
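If your rows need no preprocessing at all, a simpler alternative is to stream the CSV file directly with psycopg2’s copy_expert, which wraps COPY ... FROM STDIN on the client side. A minimal sketch under the same assumptions as above (table my_table, semicolon-delimited CSV with a header row):

import psycopg2

dbc = psycopg2.connect("host=db port=5432 dbname=mydb user=postgres password=mypwd")

with dbc.cursor() as cur, open("input.csv", "r") as f:
    # COPY ... FROM STDIN sends the whole file in one stream,
    # avoiding a round trip per row.
    cur.copy_expert(
        "COPY my_table (name, surname) FROM STDIN "
        "WITH (FORMAT csv, DELIMITER ';', HEADER true)",
        f)
dbc.commit()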