postgres_upsert

Allows your Rails app to load data very quickly, avoiding the overhead of individual ActiveRecord calls.

Using the pg gem and Postgres's powerful COPY command, you can create thousands of records in your database in a single query.

Install

Put it in your Gemfile

gem 'postgres_upsert'

Run the bundle command

bundle

Usage

The gem adds the following class method to ActiveRecord::Base:

  • pg_upsert io_object_or_file_path, [options]

io_object_or_file_path - a file path (String) or an IO object (StringIO, File, etc.)

options:

  • :delimiter - the string used to delimit fields. Default is ",".
  • :format - the format of the source (valid formats are :csv or :binary). Default is :csv.
  • :header - specifies whether the file/IO source contains a header row. Either :header must be true, or a :columns list must be passed. Default is true.
  • :columns - an array of column names to use when the source has no header row (see the binary example below).
  • :map - a hash mapping header names in the source to column names in the target table (see the example below).
  • :key_column - the primary key or unique key column on your ActiveRecord table, used to distinguish new records from existing records. Default is the primary_key of your ActiveRecord model class.
  • :update_only - when true, postgres_upsert will ONLY update existing records, and not insert new ones. Default is false.
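For example, to load a tab-delimited file that has no header row (the file path and column list here are illustrative):

User.pg_upsert "/tmp/users.tsv", :delimiter => "\t", :header => false, :columns => ["id", "name"]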

pg_upsert allows you to copy data from an arbitrary IO object, or from a file on the database server (when you pass the path as a string). Let's first copy from a file on the database server, assuming again that we have a users table and that we are in the Rails console:

User.pg_upsert "/tmp/users.csv"

This command will use the headers in the CSV file as the fields of the target table, so make sure the files you want to import always have a header row. If the column names in the CSV header do not match the field names of the target table, you can pass a :map in the options parameter.

User.pg_upsert "/tmp/users.csv", :map => {'name' => 'first_name'}

In the above example the column named name in the CSV file will be mapped to the field called first_name in the users table.
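Because pg_upsert accepts any IO object, you can also load data you already hold in memory. A minimal sketch, assuming the users table has a name column:

require 'stringio'

csv = StringIO.new("name\nJohn\nJane\n")
User.pg_upsert csv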

To copy a binary-formatted data file or IO object, specify the format as :binary:

User.pg_upsert "/tmp/users.dat", :format => :binary, :columns => ["id, "name"]

This will generate the following SQL command:

COPY users (id, name) FROM '/tmp/users.dat' WITH BINARY

NOTE: binary files do not include a header row, so passing a :columns array is required for binary files.

pg_upsert supports 'upsert' or 'merge' operations. In other words, the data source can contain both new and existing objects, and pg_upsert will handle either case. Since the Postgres native COPY command does not handle updating existing records, pg_upsert accomplishes update and insert using an intermediary temp table:

This merge/upsert happens in 5 steps (assume your data table is called "users"; a simplified sketch follows the list):

  • create a temp table named users_temp_### where "###" is a random number. In postgres temp tables are only visible to the current database session, so naming conflicts should not be a problem.
  • COPY the data to users_temp_###
  • issue a query to insert all new records from users_temp_### into users (a record is considered new if its key value is not already present in the users table)
  • issue a query to update all records in users with the data in users_temp_### (matching on primary key)
  • drop the temp table.
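The SQL involved is roughly equivalent to the following hand-written sketch (the table name users_temp_123, the id key, and the name column are illustrative; the gem's actual queries differ in detail):

conn = ActiveRecord::Base.connection

# 1. create a session-local temp table with the same shape as users
conn.execute "CREATE TEMP TABLE users_temp_123 (LIKE users INCLUDING DEFAULTS)"

# 2. bulk-load the source data into the temp table
conn.execute "COPY users_temp_123 (id, name) FROM '/tmp/users.csv' WITH CSV HEADER"

# 3. insert records whose key is not yet present in users
conn.execute <<-SQL
  INSERT INTO users (id, name)
  SELECT t.id, t.name FROM users_temp_123 t
  WHERE NOT EXISTS (SELECT 1 FROM users u WHERE u.id = t.id)
SQL

# 4. update existing records from the temp table, matching on the key
conn.execute <<-SQL
  UPDATE users u SET name = t.name
  FROM users_temp_123 t WHERE u.id = t.id
SQL

# 5. drop the temp table
conn.execute "DROP TABLE users_temp_123"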

Overriding the key_column

By default pg_upsert uses the primary key on your ActiveRecord table to determine whether each record should be inserted or updated. You can override the column using the :key_column option:

User.pg_upsert "/tmp/users.dat", :format => :binary, :key_column => ["external_twitter_id"]

Obviously, the column you pass must be a unique key in your database (this is not enforced at the moment, but will be).
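For example, you could enforce uniqueness yourself with a Rails migration (using the illustrative column from the example above):

add_index :users, :external_twitter_id, :unique => true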

Passing :update_only => true will ensure that no new records are created; existing records will still be updated.
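For example:

User.pg_upsert "/tmp/users.csv", :update_only => true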

Note on Patches/Pull Requests

  • Fork the project
  • Add your feature/fix to your fork (RSpec tests, please)
  • Submit a PR
  • If you find an issue but can't fix it in a PR, please log an issue. I'll do my best.