ar-ondemand

The ar-ondemand gem adds functionality to ActiveRecord to help deal with AR's bloat.


Getting Started

require 'ar-ondemand'

Please note that this library was written for our needs; even though it has seen significant usage in production environments, it hasn't encountered the myriad ways others use and abuse ActiveRecord, so please experiment locally first.

It has been used with ActiveRecord 3.2, MRI 1.9.3, MRI 2.3.3, JRuby 1.7, JRuby 9, all with the MySQL adapter.

Functionality

on_demand

This was the original impetus for the gem. We had to compare ~500k records between a source dataset and our database, with no idea which records were new, changed, or deleted. We'd preload everything from the database, but due to ActiveRecord's massive per-object bloat, we'd constantly run into OOM exceptions. To get around that, we introduced a lightweight ActiveRecord-like object with the absolute bare minimum needed to compare the source data against the database.

With this new type of object, we could iterate over the 500k records extremely quickly and determine what had changed. The on-demand aspect comes in when .save is called: if changes were detected, the gem quietly instantiates an actual ActiveRecord model so that all the real functionality you'd expect, such as validations and callbacks, still runs.
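
Conceptually, the lightweight object works something like this minimal sketch (not the gem's actual implementation, and covering only the update path): a plain hash wrapper that tracks changes and only instantiates a real ActiveRecord model, with validations and callbacks, when save detects a change.

# A minimal sketch of the idea, not ar-ondemand's actual implementation
class LightweightRow
  def initialize(attrs)
    @attrs = attrs        # plain hash straight from the database driver
    @changes = {}
  end

  # record a change only when the value actually differs
  def []=(column, value)
    @changes[column] = value unless @attrs[column] == value
  end

  def save
    return nil if @changes.empty?          # nothing changed: stay lightweight
    record = Widget.find(@attrs['id'])     # full AR instance: validations and callbacks run
    record.update_attributes(@changes)
    record
  end
end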

Usage

on_demand takes the attribute to index the preloaded records by, plus a hash of conditions that scopes the query:

assets = Widget.on_demand :identifier, { customer_id: 1, account_id: 42 }

source.each do |dso|
  w = assets[dso[:identifier]]
  w.name = dso[:name]
  w.foo = dso[:foo]
  w.bar = dso[:bar]
  ar_obj = w.save
  # If new or changed, ar_obj will be an ActiveRecord instance, and ar_obj.id will now be set
end

for_reading

How often do you just need to load a bunch of objects and read some properties? Pretty often, right? Now, have you ever looked at how much memory ActiveRecord itself consumes, or how much extra time instantiating all those objects takes compared to the database extraction itself? It's bonkers! The for_reading method makes it easy to get ActiveRecord-like read access at a hundredth of the cost.

Usage

Let's say you have some Widgets. Instead of:

Widget.where(customer_id: 1).each { |r| ... }

use for_reading to get a significant speed boost and far less memory usage:

Widget.where(customer_id: 1).for_reading.each { |r| ... }

The big limitation is that you can't use .includes, so your query must target just the one class. If you need associated records, one workaround is sketched below.
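
The workaround (a sketch, assuming a hypothetical Part model with a widget_id foreign key; neither is part of the gem) is to run a second for_reading query and join the rows yourself in Ruby:

# index the association rows by foreign key first
parts_by_widget = Hash.new { |h, k| h[k] = [] }
Part.where(customer_id: 1).for_reading.each { |p| parts_by_widget[p.widget_id] << p }

Widget.where(customer_id: 1).for_reading.each do |w|
  parts_by_widget[w.id].each do |p|
    # use w and its associated p here
  end
end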

Batch Results

Pass batch_size and each yielded item is a whole batch of records rather than a single record:

Widget.where(customer_id: 1).for_reading(batch_size: 50_000).each { |b| b.each { |r| } }

for_enumeration_reading

This version of for_reading allows even faster access to the records when you just need to pull out some properties while looping over the dataset. The data is only accessible inside the iteration block, so extract the values you need there rather than holding on to the records.

res = Widget.where(customer_id: 1).for_enumeration_reading.inject([]) do |acc, r|
  acc << [r.id, r.name]
end

for_streaming

Another use case was needing to iterate through 30,000,000 records. The code that consumed this data was not set up to work with batching, so the concept of streaming results was introduced. It simply uses an Enumerator to hide the fact that batching is happening behind the scenes. The downside is that the Enumerator makes iteration ~10% slower; if speed matters, change your code to use batching.
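
Conceptually, it looks something like this minimal sketch (not the gem's actual implementation), wrapping standard ActiveRecord batching in an Enumerator:

# A sketch of the streaming idea, assuming plain find_in_batches underneath
def stream(scope, batch_size = 50_000)
  Enumerator.new do |yielder|
    # fetch in batches behind the scenes, but expose a flat stream of records
    scope.find_in_batches(batch_size: batch_size) do |batch|
      batch.each { |record| yielder << record }
    end
  end
end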

Usage

def run
  get_objects.each do |r|
    do_something r
  end
end

def get_objects
  Widget.where(customer_id: 1).for_streaming
end

Additional usage:

Widget.where(customer_id: 1).for_streaming(batch_size: 100_000).each { |r| }
Widget.where(customer_id: 1).for_streaming(for_reading: true).each { |r| }
Widget.where(customer_id: 1).for_streaming(for_reading: true, batch_size: 1_000_000).each { |r| }

for_enumeration_streaming

Just as for_reading has an enumeration version, so does for_streaming. This helper is aimed at queries over millions of records that need to stream the results and only pull out values within the supplied block.

Usage

res = []
Widget.where(customer_id: 1).for_enumeration_streaming(batch_size: 200_000).each do |r|
  res << [r.id, r.name]
end

raw_results

This is just a nice little helper to get the raw database results from a model, the same rows you'd get by calling ActiveRecord::Base.connection.select_all, which for some reason you can't call on a model.

Usage

Instead of:

ActiveRecord::Base.connection.select_all "select * from widgets"
ActiveRecord::Base.connection.select_all "select * from widgets where customer_id = 1"
ActiveRecord::Base.connection.select_all "select * from widgets where customer_id = 1 limit 100000"

use:

Widget.raw_results
Widget.where(customer_id: 1).raw_results
Widget.where(customer_id: 1).limit(100_000).raw_results
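
Assuming raw_results returns what select_all does, each row comes back as a plain hash keyed by column name rather than a model instance (a hypothetical name column shown here):

Widget.where(customer_id: 1).raw_results.each do |row|
  puts row['name']   # plain hash access; no ActiveRecord object is built
end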

delete_all_by_pk

Deleting many records, or even a few records in a massive table, can be an expensive operation; it can lock up your table for the duration of the delete and can trigger a complete table scan. A common pattern to deal with this is querying the table first for the primary keys that meet the criteria, then issuing a delete that specifies those primary keys as the where condition. This function does all of that for you.
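
The manual pattern being automated looks roughly like this plain-ActiveRecord sketch (assuming id is the primary key):

# select the primary keys first, then delete strictly by primary key
ids = Widget.where(customer_id: 1).pluck(:id)
Widget.where(id: ids).delete_all unless ids.empty?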

Usage

Widget.delete_all_by_pk
Widget.where(Widget[:customer_id].eq(1).and(Widget[:usage].gt(42))).delete_all_by_pk

If you know you could be deleting millions, then we recommend batching the deletes:

Widget.where(Widget[:customer_id].eq(1).and(Widget[:usage].gt(42))).delete_all_by_pk(batch_size: 250_000)