Oplogjam

Current version: 0.1.0
Supported Ruby versions: 2.0, 2.1, 2.2
Supported MongoDB versions: 2.4, 2.6, 3.0, 3.2, 3.4
Supported PostgreSQL versions: 9.5, 9.6

An experiment in writing a "safe" MongoDB oplog tailer that converts documents to PostgreSQL JSONB in Ruby.

Based on experiences running Stripe's now deprecated MoSQL project in production, this project provides a core library which stores all MongoDB documents in the same standard table schema in PostgreSQL but leaves all configuration and orchestration to the user. This means that this library can be used to power an end-to-end MoSQL replacement but does not provide all functionality itself.

At its heart, the library connects to a MongoDB replica set oplog and provides an abstraction to users so they can iterate over the operations in the oplog and transform those into equivalent PostgreSQL SQL statements.

DB = Sequel.connect('postgres:///acme')
mongo = Mongo::Client.new('mongodb://localhost')

Oplogjam::Oplog.new(mongo).operations.each do |operation|
  operation.apply('acme.widgets' => DB[:widgets], 'acme.anvils' => DB[:anvils])
end

Requirements

A MongoDB replica set;
PostgreSQL 9.5 or newer (for INSERT ON CONFLICT and jsonb_set support);
A PostgreSQL database with the uuid-ossp extension.

Why does `apply` take a mapping?

This library expects to replay operations on MongoDB collections on equivalent PostgreSQL tables. As the MongoDB oplog contains all operations on a replica set in a single collection, you must provide a mapping between MongoDB namespaces (e.g. a database and collection name such as foo.bar for a collection bar in the database foo) and PostgreSQL tables (represented by Sequel datasets). Any operations for namespaces not included in the mapping will be ignored.

For example, if we only want to replay operations on foo.bar to a table foo_bar in PostgreSQL, we might have a mapping like so:

DB = Sequel.connect('postgres:///oplogjam_test')
mapping = { 'foo.bar' => DB[:foo_bar] }

Then we can pass this mapping when we call apply on an operation, e.g.

oplog.operations.each do |operation|
  operation.apply(mapping)
end

In order for this to work, the PostgreSQL table foo_bar must have the following schema:

                             Table "public.foo_bar"
   Column   |            Type             |              Modifiers
------------+-----------------------------+-------------------------------------
 uuid       | uuid                        | not null default uuid_generate_v1()
 id         | jsonb                       | not null
 document   | jsonb                       | not null
 created_at | timestamp without time zone | not null
 updated_at | timestamp without time zone | not null
 deleted_at | timestamp without time zone |
Indexes:
    "foo_bar_pkey" PRIMARY KEY, btree (uuid)
    "foo_bar_id_deleted_at_key" UNIQUE CONSTRAINT, btree (id, deleted_at)
    "foo_bar_id_index" UNIQUE, btree (id) WHERE deleted_at IS NULL

We can create this ourselves or use Oplogjam::Schema to do it for us:

schema = Oplogjam::Schema.new(DB)
schema.create_table(:foo_bar)
schema.import(collection, :foo_bar) # Optionally import data from a MongoDB collection
schema.add_indexes(:foo_bar)

Why does this project exist?

Since maintenance of MoSQL by Stripe was ended, there have been several major changes that affect anyone designing a system to replay a MongoDB oplog in PostgreSQL:

The MongoDB driver ecosystem was overhauled and the Ruby driver API changed significantly;
PostgreSQL 9.5 introduced new JSONB operations such as jsonb_set for updating fields in JSONB objects;
PostgreSQL 9.5 also introduced INSERT ON CONFLICT for effectively "upserting" duplicate records on INSERT.

Running MoSQL in production also revealed that we didn't need its rich support for transforming MongoDB documents into typical relational schema with typed columns but instead relied entirely on its JSONB support: effectively mirroring the MongoDB document by storing it in a single JSONB column.

With that specific use case in mind, I wanted to explore whether a library to safely transform arbitrary MongoDB operations into SQL could be done in Ruby and remain somewhat idiomatic.

Why doesn't this project come with some sort of executable?

While the library is more opinionated about the data schema of PostgreSQL, it doesn't attempt to make any decisions about how you decide which collections you want to replicate and where they should be replicated to. Similarly, connecting to databases, logging, etc. are all left up to the user as something that can might differ wildly.

While in future I may add an executable, for now this is up to the user to manage.

API Documentation

Oplogjam::Oplog
- Oplogjam::Oplog.new(client)
- Oplogjam::Oplog#operations([query])
Oplogjam::Schema
Oplogjam::Operation
- Oplogjam::Operation.from(bson)
Oplogjam::Noop
Oplogjam::Insert
Oplogjam::Update
Oplogjam::Delete
Oplogjam::ApplyOps
Oplogjam::Command

`Oplogjam::Oplog`

An object representing a MongoDB oplog.

`Oplogjam::Oplog.new(client)`

mongo = Mongo::Client.new('mongodb://localhost')
Oplogjam::Oplog.new(mongo)

Return a new Oplogjam::Oplog for the given Mongo::Client client connected to a replica set.

`Oplogjam::Oplogjam#operations([query])`

oplog.operations.each do |operation|
  # Do something with operation
end

oplog.operations('ts' => { '$gt' => BSON::Timestamp.new(123456, 1) })

Return an infinite Enumerator yielding Operations from the Oplog with an optional MongoDB query which will affect the results from the underlying oplog.

`Oplogjam::Schema`

A class to manage the PostgreSQL schema used by Oplogjam (e.g. creating tables, importing data, etc.).

`Oplogjam::Schema.new(db)`

DB = Sequel.connect('postgres:///oplogjam_test')
schema = Oplogjam::Schema.new(DB)

Return a new Oplogjam::Schema for the given Sequel database connection.

`Oplogjam::Schema#create_table(name)`

schema.create_table(:foo_bar)

Attempt to create a table for Oplogjam's use in PostgreSQL with the given name if it doesn't already exist. Note that the name may be a single String, Symbol or a Sequel qualified identifier if you're using PostgreSQL schema.

A table will be created with the following schema:

uuid: a UUID v1 primary key (v1 so that they are sequential);
id: a jsonb representation of the primary key of the MongoDB document;
document: a jsonb representation of the entire MongoDB document;
created_at: the timestamp when this row was created by Oplogjam (not by MongoDB);
updated_at: the timestamp when this row was last updated by Oplogjam (not by MongoDB);
deleted_at: the timestamp when this row was deleted by Oplogjam (not by MongoDB).

If the table already exists, the method will do nothing.

`Oplogjam::Schema#add_indexes(name)`

schema.add_indexes(name)

Add the following indexes and constraints to the table with the given name if they don't already exist:

A unique index on id and deleted_at so no two records can have the same MongoDB ID and deletion time;
A partial unique index on id where deleted_at is NULL so no two records can have the same ID and not be deleted.

Note that the name may be a single String, Symbol or a Sequel qualified identifier if you're using PostgreSQL schema.

If the indexes already exist on the given table, the method will do nothing.

`Oplogjam::Schema#import(collection, name)`

schema.import(mongo[:bar], :foo_bar)

Batch import all existing documents from a given Mongo::Collection collection into the PostgreSQL table with the given name. Note that the name may be a single String, Symbol or Sequel qualified identifier if you're using PostgreSQL schema.

For performance, it's better to import existing data before adding indexes to the table (hence the separate create_table and add_indexes methods).

`Oplogjam::Operation`

A class representing a single MongoDB oplog operation.

`Oplogjam::Operation.from(bson)`

Oplogjam::Operation.from(document)

Convert a BSON document representing a MongoDB oplog operation into a corresponding Ruby object:

Oplogjam::Noop
Oplogjam::Insert
Oplogjam::Update
Oplogjam::Delete
Oplogjam::ApplyOps
Oplogjam::Command

Raises a Oplogjam::InvalidOperation if the type of operation is not recognised.

`Oplogjam::Noop`

A class representing a MongoDB no-op.

`Oplogjam::Noop.from(bson)`

Oplogjam::Noop.from(document)

Convert a BSON document representing a MongoDB oplog no-op into an Oplogjam::Noop instance.

Raises a Oplogjam::InvalidNoop error if the given document is not a valid no-op.

`Oplogjam::Noop#message`

noop.message
#=> "initiating set"

Return the internal message of the no-op.

`Oplogjam::Noop#id`

noop.id
#=> -2135725856567446411

Return the internal, unique identifier for the no-op.

`Oplogjam::Noop#timestamp`

noop.timestamp
#=> 2017-09-09 16:11:18 +0100

Return the time of the no-op as a Time.

`Oplogjam::Noop#ts`

noop.ts
#=> #<BSON::Timestamp:0x007fcadfa44500 @increment=1, @seconds=1479419535>

Return the raw, underlying BSON Timestamp of the no-op.

`Oplogjam::Noop#==(other)`

noop == other_noop
#=> false

Compares the identifiers of two no-ops and returns true if they are equal.

`Oplogjam::Noop#apply(mapping)`

noop.apply('foo.bar' => DB[:bar])

Apply this no-op to a mapping of MongoDB namespaces (e.g. foo.bar) to Sequel datasets representing PostgreSQL tables. As no-ops do nothing, this performs no operation.

`Oplogjam::Insert`

A class representing a MongoDB insert.

`Oplogjam::Insert.from(bson)`

Oplogjam::Insert.from(document)

Convert a BSON document representing a MongoDB oplog insert into an Oplogjam::Insert instance.

Raises a Oplogjam::InvalidInsert error if the given document is not a valid insert.

`Oplogjam::Insert#id`

insert.id
#=> -2135725856567446411

Return the internal, unique identifier for the insert.

`Oplogjam::Insert#namespace`

insert.namespace
#=> "foo.bar"

Return the namespace the insert affects. This will be a String of the form database.collection, e.g. foo.bar.

`Oplogjam::Insert#document`

insert.document
#=> {"_id"=>1}

Return the BSON::Document being inserted.

`Oplogjam::Insert#timestamp`

insert.timestamp
#=> 2017-09-09 16:11:18 +0100

Return the time of the insert as a Time.

`Oplogjam::Insert#ts`

insert.ts
#=> #<BSON::Timestamp:0x007fcadfa44500 @increment=1, @seconds=1479419535>

Return the raw, underlying BSON Timestamp of the insert.

`Oplogjam::Insert#==(other)`

insert == other_insert
#=> false

Compares the identifiers of two inserts and returns true if they are equal.

`Oplogjam::Insert#apply(mapping)`

insert.apply('foo.bar' => DB[:bar])

Apply this insert to a mapping of MongoDB namespaces (e.g. foo.bar) to Sequel datasets representing PostgreSQL tables. If the namespace of the insert maps to a dataset in the mapping, insert this document into the dataset with the following values:

A unique UUID v1 identifier;
The value of the document's _id stored as a JSONB value;
The entire document stored as a JSONB value;
The current time as created_at;
The current time as updated_at.

`Oplogjam::Update`

A class representing a MongoDB update.

`Oplogjam::Update.from(bson)`

Oplogjam::Update.from(document)

Convert a BSON document representing a MongoDB oplog update into an Oplogjam::Update instance.

Raises a Oplogjam::InvalidUpdate error if the given document is not a valid update.

`Oplogjam::Update#id`

update.id
#=> -2135725856567446411

Return the internal, unique identifier for the update.

`Oplogjam::Update#namespace`

update.namespace
#=> "foo.bar"

Return the namespace the update affects. This will be a String of the form database.collection, e.g. foo.bar.

`Oplogjam::Update#update`

update.update
#=> {"$set"=>{"name"=>"Alice"}}

Return the update to be applied as a BSON::Document.

`Oplogjam::Update#query`

update.query
#=> {"_id"=>1}

Return the query identifying which documents should be updated as a BSON::Document.

`Oplogjam::Update#timestamp`

update.timestamp
#=> 2017-09-09 16:11:18 +0100

Return the time of the update as a Time.

`Oplogjam::Update#ts`

update.ts
#=> #<BSON::Timestamp:0x007fcadfa44500 @increment=1, @seconds=1479419535>

Return the raw, underlying BSON Timestamp of the update.

`Oplogjam::Update#==(other)`

update == other_update
#=> false

Compares the identifiers of two updates and returns true if they are equal.

`Oplogjam::Update#apply(mapping)`

update.apply('foo.bar' => DB[:bar])

Apply this update to a mapping of MongoDB namespaces (e.g. foo.bar) to Sequel datasets representing PostgreSQL tables. If the namespace of the update maps to a dataset in the mapping, perform the update by finding the relevant row based on the query and transforming the MongoDB update into an equivalent PostgreSQL update.

This will also update the updated_at column of the target document to the current time.

`Oplogjam::Delete`

A class representing a MongoDB deletion.

`Oplogjam::Delete.from(bson)`

Oplogjam::Delete.from(document)

Convert a BSON document representing a MongoDB oplog delete into an Oplogjam::Delete instance.

Raises a Oplogjam::InvalidDelete error if the given document is not a valid delete.

`Oplogjam::Delete#id`

delete.id
#=> -2135725856567446411

Return the internal, unique identifier for the delete.

`Oplogjam::Delete#namespace`

delete.namespace
#=> "foo.bar"

Return the namespace the delete affects. This will be a String of the form database.collection, e.g. foo.bar.

`Oplogjam::Delete#query`

delete.query
#=> {"_id"=>1}

Return the query identifying which documents should be deleted as a BSON::Document.

`Oplogjam::Delete#timestamp`

delete.timestamp
#=> 2017-09-09 16:11:18 +0100

Return the time of the delete as a Time.

`Oplogjam::Delete#ts`

delete.ts
#=> #<BSON::Timestamp:0x007fcadfa44500 @increment=1, @seconds=1479419535>

Return the raw, underlying BSON Timestamp of the delete.

`Oplogjam::Delete#==(other)`

delete == other_delete
#=> false

Compares the identifiers of two deletes and returns true if they are equal.

`Oplogjam::Delete#apply(mapping)`

delete.apply('foo.bar' => DB[:bar])

Apply this delete to a mapping of MongoDB namespaces (e.g. foo.bar) to Sequel datasets representing PostgreSQL tables. If the namespace of the delete maps to a dataset in the mapping, perform a soft deletion by finding the relevant row based on the query and setting deleted_at to the current time.

This will also update the updated_at column of the target document to the current time.

`Oplogjam::ApplyOps`

A class representing a series of MongoDB operations in a single operation.

`Oplogjam::ApplyOps.from(bson)`

Oplogjam::ApplyOps.from(document)

Convert a BSON document representing a MongoDB oplog apply ops into an Oplogjam::ApplyOps instance.

Raises a Oplogjam::InvalidApplyOps error if the given document is not a valid apply ops.

`Oplogjam::ApplyOps#id`

apply_ops.id
#=> -2135725856567446411

Return the internal, unique identifier for the apply ops.

`Oplogjam::ApplyOps#namespace`

apply_ops.namespace
#=> "foo.bar"

Return the namespace the apply ops affects. This will be a String of the form database.collection, e.g. foo.bar.

`Oplogjam::ApplyOps#operations`

apply_ops.operations
#=> [#<Oplogjam::Insert ...>]

Return the operations within the apply ops as Oplogjam operations of the appropriate type.

`Oplogjam::ApplyOps#timestamp`

apply_ops.timestamp
#=> 2017-09-09 16:11:18 +0100

Return the time of the apply ops as a Time.

`Oplogjam::ApplyOps#ts`

apply_ops.ts
#=> #<BSON::Timestamp:0x007fcadfa44500 @increment=1, @seconds=1479419535>

Return the raw, underlying BSON Timestamp of the apply ops.

`Oplogjam::ApplyOps#==(other)`

apply_ops == other_apply_ops
#=> false

Compares the identifiers of two apply ops and returns true if they are equal.

`Oplogjam::ApplyOps#apply(mapping)`

apply_ops.apply('foo.bar' => DB[:bar])

Apply all of the operations inside this apply ops to a mapping of MongoDB namespaces (e.g. foo.bar) to Sequel datasets representing PostgreSQL tables. If the namespace of the operations map to a dataset in the mapping, apply them as described in each operation type's apply method.

`Oplogjam::Command`

A class representing a MongoDB command.

`Oplogjam::Command.from(bson)`

Oplogjam::Command.from(document)

Convert a BSON document representing a MongoDB oplog command into an Oplogjam::Command instance.

Raises a Oplogjam::InvalidCommand error if the given document is not a valid command.

`Oplogjam::Command#id`

command.id
#=> -2135725856567446411

Return the internal, unique identifier for the command.

`Oplogjam::Command#namespace`

command.namespace
#=> "foo.bar"

Return the namespace the command affects. This will be a String of the form database.collection, e.g. foo.bar.

`Oplogjam::Command#command`

command.command
#=> {"create"=>"bar"}

Return the contents of the command as a BSON::Document.

`Oplogjam::Command#timestamp`

command.timestamp
#=> 2017-09-09 16:11:18 +0100

Return the time of the command as a Time.

`Oplogjam::Command#ts`

command.ts
#=> #<BSON::Timestamp:0x007fcadfa44500 @increment=1, @seconds=1479419535>

Return the raw, underlying BSON Timestamp of the command.

`Oplogjam::Command#==(other)`

command == other_command
#=> false

Compares the identifiers of two commands and returns true if they are equal.

`Oplogjam::Command#apply(mapping)`

command.apply('foo.bar' => DB[:bar])

Apply this command to a mapping of MongoDB namespaces (e.g. foo.bar) to Sequel datasets representing PostgreSQL tables. As commands have no equivalent in PostgreSQL, this performs no operation.

Acknowledgements

License

Distributed under the MIT License.

Oplogjam

Requirements

Why does apply take a mapping?

Why does this project exist?

Why doesn't this project come with some sort of executable?

API Documentation

Oplogjam::Oplog

Oplogjam::Oplog.new(client)

Oplogjam::Oplogjam#operations([query])

Oplogjam::Schema

Oplogjam::Schema.new(db)

Oplogjam::Schema#create_table(name)

Oplogjam::Schema#add_indexes(name)

Oplogjam::Schema#import(collection, name)

Oplogjam::Operation

Oplogjam::Operation.from(bson)

Oplogjam::Noop

Oplogjam::Noop.from(bson)

Oplogjam::Noop#message

Oplogjam::Noop#id

Oplogjam::Noop#timestamp

Oplogjam::Noop#ts

Oplogjam::Noop#==(other)

Oplogjam::Noop#apply(mapping)

Oplogjam::Insert

Oplogjam::Insert.from(bson)

Oplogjam::Insert#id

Oplogjam::Insert#namespace

Oplogjam::Insert#document

Oplogjam::Insert#timestamp

Oplogjam::Insert#ts

Oplogjam::Insert#==(other)

Oplogjam::Insert#apply(mapping)

Oplogjam::Update

Oplogjam::Update.from(bson)

Oplogjam::Update#id

Oplogjam::Update#namespace

Oplogjam::Update#update

Oplogjam::Update#query

Oplogjam::Update#timestamp

Oplogjam::Update#ts

Oplogjam::Update#==(other)

Oplogjam::Update#apply(mapping)

Oplogjam::Delete

Oplogjam::Delete.from(bson)

Oplogjam::Delete#id

Oplogjam::Delete#namespace

Oplogjam::Delete#query

Oplogjam::Delete#timestamp

Oplogjam::Delete#ts

Oplogjam::Delete#==(other)

Oplogjam::Delete#apply(mapping)

Oplogjam::ApplyOps

Oplogjam::ApplyOps.from(bson)

Oplogjam::ApplyOps#id

Oplogjam::ApplyOps#namespace

Oplogjam::ApplyOps#operations

Oplogjam::ApplyOps#timestamp

Oplogjam::ApplyOps#ts

Oplogjam::ApplyOps#==(other)

Oplogjam::ApplyOps#apply(mapping)

Oplogjam::Command

Oplogjam::Command.from(bson)

Oplogjam::Command#id

Oplogjam::Command#namespace

Oplogjam::Command#command

Oplogjam::Command#timestamp

Oplogjam::Command#ts

Oplogjam::Command#==(other)

Oplogjam::Command#apply(mapping)