Faster, Smaller, Gzip-able

This adds marshal_serialize and json_serialize to ActiveRecord::Base (class methods), which act similar to rails' own serialize except they don't use yaml, which is slooooow and biiiiig (byte-wise). Also supports adding gzip to the serialization process for really large stuff (can save a ton of database memory with minimal added insert time). Note that gzip columns must be of type “binary”, “blob”, etc (not text).

Note that there's a limitation with the plugin: you can't call destructive actions on attributes using better_serialization (ex. << for an array), and update_attribute etc won't work. See TODO for an explanation.

Example

see tests, but here's the gist:

class OrderLog < ActiveRecord::Base
  marshal_serialize :line_items_cache, :gzip => true
  marshal_serialize :product_cache
  json_serialize :customer_cache
end

order_log = OrderLog.new
product = Product.new(:name => "Woot")
order_log.product_cache = product

order_log.product_cache # => normal
order_log[:product_cache] # => raw Marshal.dump of product

Notes for json_serialize

JSON serialization is faster and smaller, but needs more configuration. This is because Marshal and the default YAML serialization dump the whole object, instance variables and all. JSON just stores an instances' attribute values. Thus, you need to tell it what class to instantiate, or tell it not to instantiate anything at all for storing a hash or array of data.

Also, when serializing ActiveRecord objects, if ActiveRecord::Base.include_root_in_json == true (the new rails defalut), it'll all 'just work'. Otherwise, you will need to use the :class_name option if it's not inferable by the association name, and will probably want modify that class's to_json to include the id.

The Problem With YAML Serialization: speed and size

NOTE: this code does not represent the BetterSerialization api, but rather the underlying ideas

The following examples came from the real-world objects that inspired the plugin.

lis is an 18-member array of line items that have many associations loaded into memory, the point being that they're relatively big objects.

marshal_data, yaml_data, json_data = nil

Benchmark.bm(7) do |x|
  x.report("Marshal") { marshal_data = Marshal.dump(lis) }
  x.report("to_yaml") { yaml_data    = lis.to_yaml       }
  x.report('to_json') { json_data    = lis.to_json       }
end

             user     system      total        real
Marshal  0.590000   0.000000   0.590000 (  0.605567)
to_yaml 11.730000   0.010000  11.740000 ( 12.179310)
to_json  0.010000   0.000000   0.010000 (  0.006129)

The reason JSON is so fast is that rails' ActiveRecord::Base#to_json only jsonifies the attributes hash, not the actual ruby object. If you want the full object though, Marshal is waaaaay faster than to_yaml, and it is representing the full ruby object.

Marshal is also a lot smaller than yaml:

>> marshal_data.size
=> 1119279
>> yaml_data.size
=> 1958610
>> json_data.size
=> 7886

Whammy. What about deserialization?

Benchmark.bm(7) do |x|
  x.report("Marshal") { Marshal.load(marshal_data) }
  x.report("to_yaml") { YAML::load(yaml_data) }
  x.report('to_json') { ActiveSupport::JSON.decode(json_data).collect {|hash| BuylistLineItem.new(hash)} }
end

             user     system      total        real
Marshal  0.040000   0.000000   0.040000 (  0.046616)
to_yaml  0.380000   0.000000   0.380000 (  0.380686)
to_json  0.020000   0.000000   0.020000 (  0.019096)

With json you can't just unmarshal it 'cuz it's just representing an attribute array, but even compensating with object creation it's still the fastest.

Finally, if you need to save some space, you can use gzip with the serialization:

>> Benchmark.measure {gz_marshal_data = Zlib::Deflate.deflate(marshal_data)}
=> #<Benchmark::Tms:0x10a7ee54 @utime=0.0300000000000011, @cstime=0.0, @cutime=0.0, @total=0.0300000000000011, @label="", @stime=0.0, @real=0.0292189121246338>
>> marshal_data.size - gz_marshal_data.size
=> 1036979
>> Benchmark.measure {Zlib::Inflate.inflate(marshal_data)}
=> #<Benchmark::Tms:0x10a8db98 @utime=0.0, @cstime=0.0, @cutime=0.0, @total=0.0, @label="", @stime=0.0, @real=0.00509905815124512>

So on this machine it would add 30ms on insert and 5ms on load, but save ~1mb of database space. Seems worth it if you're saving this much stuff.

EDIT: Also note that from personal experience, gzipping json will still use 1/10th the overall space with ~.5ms performance hit - WIN

Summary

If you want speed/small data and don't need to flash-freeze the whole object, use json. Otherwise, use Marshal. If you want to use minimal database space, add gzip to those (adds a relatively small performance hit). Use the default yaml if you really, really need to save the whole object and also have a more human-readable datastore… eh, actually, don't use yaml.

TODO

  • deserialization should be cached, and it would be nice if the deserialized value was stored in the attributes hash.

  • Allow destructive actions

    • The problem is that serializing and de-serializing creates new objects, so if you do object.array_attribute << “blah”, you will have deserialized a new object, destructively added a value to it, and thrown the new object away. ActiveRecord avoids this issue by calling to_yaml as a catch-all on the result of read_attribute in activerecord/lib/active_record/connection_adapters/abstract/quoting.rb#quote (as of 2.3.4). What we need is something like being able to specify what AR's catch-all case is (although that wouldn't work for some option/value combos, ex. gzip option on a string or date). Gratefully accepting ideas :)

    • Also, just thought of another way. have the raw data be in the raw_#attribute column, and object.attribute be a method that loads raw_#attribute. That + save hooks = win.

Copyright © 2009 Woody Peterson, released under the MIT license