ScrubRb
Pure-Ruby polyfill of MRI 2.1 String#scrub, for Ruby 1.9 and 2.0 on any interpreter
Installation
Add this line to your application's Gemfile:
gem 'scrub_rb'
And then execute:
$ bundle
Or install it yourself as:
$ gem install scrub_rb
What it is
Ruby 2.1 introduces String#scrub, a method to replace bytes in a string that are invalid for its specified encoding. See docs in MRI ruby source.
If you need String#scrub in MRI ruby 2.0, you can use the string-scrub gem, which provides a backport of the C code from MRI ruby 2.1 into MRI 2.0.
What if you need this functionality in ruby 1.9, in jruby in 1.9 or 2.0 modes, or in any other ruby platform that does not (or does not yet) support String#scrub? What if you need to write code that will work on any of these platforms?
This gem provides a pure-Ruby implementation of String#scrub and #scrub!, monkey-patched into String, that should work on any Ruby platform. It only monkey-patches String if String does not already have a #scrub method, so it's safe to include this gem in multi-platform code: when the code runs on Ruby 2.1, String#scrub will still be the original built-in implementation.
# Encoding: utf-8
"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
"abc\u3042\x81".scrub("*") #=> "abc\u3042*"
"abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
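The conditional patching described above could be sketched roughly as follows. This is an illustration only, not the gem's actual source: the `ScrubSketch` module is a hypothetical name, it assumes UTF-8 input with a replacement string compatible with that encoding, and it leaves out #scrub!.

```ruby
# Hypothetical sketch, NOT the scrub_rb source.
module ScrubSketch
  # Replace each run of contiguous invalid bytes with a single replacement.
  # If a block is given, it receives the invalid run and supplies the replacement.
  def self.scrub(str, replacement = "\uFFFD")
    return str.dup if str.valid_encoding?

    # each_char yields invalid bytes as separate one-byte "characters",
    # so chunking by validity groups contiguous invalid bytes together.
    str.each_char.chunk { |char| char.valid_encoding? }.map do |valid, chars|
      run = chars.join
      if valid
        run
      elsif block_given?
        yield run
      else
        replacement
      end
    end.join
  end
end

# Only patch String when the interpreter has no #scrub of its own.
unless String.method_defined?(:scrub)
  class String
    def scrub(replacement = "\uFFFD", &block)
      ScrubSketch.scrub(self, replacement, &block)
    end
  end
end
```

On an interpreter that already provides String#scrub, the `unless` guard skips the patch entirely, so the built-in method is left untouched.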
Performance
This pure-Ruby implementation is about an order of magnitude slower than the built-in String#scrub on Ruby 2.1, or than the string-scrub C gem on MRI 2.0. For most applications, string scrubbing will probably be a small portion of total execution time; it is still fairly fast and hopefully won't be a problem.
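To check whether scrubbing matters for your own workload, a rough timing along these lines may help. It is a sketch only: `time_scrub` and the sample string are hypothetical, and it assumes String#scrub is available (built in, or provided by this gem or string-scrub).

```ruby
require "benchmark"

# Hypothetical helper: wall-clock seconds spent scrubbing `sample`
# `iterations` times with whatever String#scrub is currently loaded.
def time_scrub(sample, iterations = 100)
  Benchmark.realtime { iterations.times { sample.scrub } }
end

# Made-up sample: mostly valid UTF-8 text with an occasional invalid byte.
sample = ("valid text \u3042 " * 50 + "\xFF") * 200
puts format("%d scrubs of %d bytes: %.4f s", 100, sample.bytesize, time_scrub(sample))
```

Running the same script on each platform you target gives a like-for-like comparison on your own data.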
Discrepancy with MRI 2.1 String#scrub
If there is a sequence of multiple contiguous invalid bytes in a string, should the entire block be replaced with only one replacement, or should each invalid byte be replaced with a replacement?
I have not been able to understand the logic MRI 2.1 uses to divide contiguous invalid bytes into certain sub-sequences for replacement, as represented in the test suite. The test suite may be suggesting that the examples are from unicode documentation, but I wasn't able to find such documentation to see if it shed any light on the matter.
scrub_rb always combines contiguous invalid bytes into a single replacement. As a result, it fails several tests from the original String#scrub test suite, which expect other divisions of contiguous invalid bytes. I've altered our local tests to match our current behavior.
Beware of this potential difference, especially when using the block form of #scrub: you may get a different number of block calls, with a sequence of invalid bytes divided into different substrings, with scrub_rb as compared to the official MRI 2.1 String#scrub or string-scrub.
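To see how a given implementation divides invalid bytes, you can record what the block receives. The helper below is a hypothetical sketch, assuming String#scrub is available from any of the sources above. The number of pieces may differ between implementations, but their concatenation should be the same.

```ruby
# Hypothetical helper: returns the hex of each invalid byte run
# that String#scrub hands to its block, in order.
def scrub_block_pieces(str)
  pieces = []
  str.scrub do |bytes|
    pieces << bytes.unpack("H*")[0]
    "?" # replacement text; irrelevant to what we are measuring
  end
  pieces
end

# Three contiguous invalid bytes: how many pieces come back
# depends on the implementation in use.
p scrub_block_pieces("a\xE3\x80\xFFb")
```

Block code that only depends on the joined pieces, rather than on how many times it is called, should behave the same on every platform.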
For most uses, this discrepancy is probably not of consequence.
If anyone can explain what's going on here, I'm very curious! I can't read C well enough to figure it out from the source.
JRuby may raise
Due to an apparent JRuby bug, some invalid strings cause an internal exception from JRuby when scrubbed with scrub_rb. This bug should be fixed in JRuby 1.7.11.
In JRuby versions prior to that, the entire original MRI test suite passes against scrub_rb, but one test original to us, involving input tagged 'ascii' encoding, fails, raising an ArrayIndexOutOfBoundsException from inside JRuby. I have filed an issue with JRuby.
I believe this problem is likely to be rare: so far, the only reproduction case involves an input string tagged 'ascii' encoding, which probably isn't a common use case. But it's unfortunate that scrub_rb isn't reliable on JRuby. I haven't been able to figure out any workaround in Ruby for the JRuby bug. You could theoretically provide an alternate Java implementation usable in JRuby, but I'm not sure what Java tools are available or how hard it would be to match the scrub API.
Contributions
Pull requests or suggestions welcome, especially on performance, on the JRuby issue, and on discrepancies with the official String#scrub.