hbase-jruby

hbase-jruby is a simple JRuby binding for HBase.

hbase-jruby provides the followings:

  • Easy, Ruby-esque interface for the fundamental HBase operations
  • ActiveRecord-like method chaining for data retrieval
  • Automatic Hadoop/HBase dependency resolution

Installation

gem install hbase-jruby

Using hbase-jruby in HBase shell

You can use this gem in HBase shell without external JRuby installation.

First, clone this repository,

git clone https://github.com/junegunn/hbase-jruby.git

then start up the shell (hbase shell) and type in the following lines:

$LOAD_PATH << 'hbase-jruby/lib'
require 'hbase-jruby'

Now, you're all set.

# Start using it!
hbase = HBase.new

hbase.list
hbase[:my_table].create! :f

A quick example

require 'hbase-jruby'

# Load required JAR files from CDH distribution using Maven
HBase.resolve_dependency! 'cdh4.3.0'

# Connect to HBase on localhost
hbase = HBase.new

# Define table schema for easier data access
hbase.schema = {
  # Schema for `book` table
  book: {
    # Columns in cf1 family
    cf1: {
      title:    :string,     # String (UTF-8)
      author:   :string,
      category: :string,
      year:     :short,      # Short integer (2-byte)
      pages:    :int,        # Integer (4-byte)
      price:    :bigdecimal, # BigDecimal
      weight:   :float,      # Double-precision floating-point number
      in_print: :boolean,    # Boolean (true | false)
      image:    :raw         # Java byte array; no automatic type conversion
    },
    # Columns in cf2 family
    cf2: {
      summary:  :string,
      reviews:  :fixnum,     # Long integer (8-byte)
      stars:    :fixnum,
      /^comment\d+/ => :string
    }
  }
}

# Create book table with two column families
table = hbase[:book]
unless table.exists?
  table.create! cf1: { min_versions: 2 },
                cf2: { bloomfilter: :rowcol, versions: 5 }
end

# PUT
table.put 1,
  title:     'The Golden Bough: A Study of Magic and Religion',
  author:    'Sir James G. Frazer',
  category:  'Occult',
  year:      1890,
  pages:     1006,
  price:     BigDecimal('21.50'),
  weight:    3.0,
  in_print:  true,
  image:     File.open('thumbnail.png', 'rb') { |f| f.read }.to_java_bytes,
  summary:   'A wide-ranging, comparative study of mythology and religion',
  reviews:   52,
  stars:     226,
  comment1:  'A must-have',
  comment2:  'Rewarding purchase'

# GET (using schema)
book     = table.get(1)
title    = book[:title]
comment2 = book[:comment2]
as_hash  = book.to_h

# GET (not using schema)
title    = book.string('cf1:title')       # cf:cq notation
year     = book.short('cf1:year')
reviews  = book.fixnum('cf2:reviews')
stars    = book.fixnum(['cf2', 'stars'])  # Array notation of [cf, cq]

# SCAN
table.range(0..100)
     .project(:cf1, :reviews, :summary)
     .filter(year:     1880...1900,
             in_print: true,
             category: ['Comics', 'Fiction', /cult/i],
             price:    { lt: BigDecimal('30.00') },
             summary:  /myth/i)
     .each do |book|

  # Update columns
  table.put book.rowkey, price: book[:price] + BigDecimal('1')

  # Atomic increment
  table.increment book.rowkey, reviews: 1, stars: 5

  # Delete two columns
  table.delete book.rowkey, :comment1, :comment2
end

# Delete row
table.delete 1

Setting up

Resolving Hadoop/HBase dependency

To be able to access HBase from JRuby, Hadoop/HBase dependency must be satisfied. This can be done by either setting up CLASSPATH variable beforehand or by requireing relevant JAR files after launching JRuby.

HBase.resolve_dependency!

Well, there's an easier way. Call HBase.resolve_dependency! helper method passing one of the arguments listed below.

Argument Dependency Default version Required executable
cdh4.3[.*] Cloudera CDH4.3 cdh4.3.0 mvn
cdh4.2[.*] Cloudera CDH4.2 cdh4.2.1 mvn
cdh4.1[.*] Cloudera CDH4.1 cdh4.1.4 mvn
cdh3[u*] Cloudera CDH3 cdh3u6 mvn
0.95[.*] Apache HBase 0.95 0.95.0 mvn
0.94[.*] Apache HBase 0.94 0.94.9 mvn
0.92[.*] Apache HBase 0.92 0.92.2 mvn
POM PATH Custom Maven POM file - mvn
:local Local HBase installation - hbase

(Default version is used when an argument prefix is given without specific patch version. e.g. cdh4.2 defaults to cdh4.2.0)

Examples

# Load JAR files from CDH4 using Maven
HBase.resolve_dependency! 'cdh4.3.0'
HBase.resolve_dependency! 'cdh4.2'

# Load JAR files of HBase 0.94.x using Maven
HBase.resolve_dependency! '0.94.7'
HBase.resolve_dependency! '0.94.2', verbose: true

# Dependency resolution with custom POM file
HBase.resolve_dependency! '/path/to/my/pom.xml'
HBase.resolve_dependency! '/path/to/my/pom.xml', profile: 'trunk'

# Load JAR files from local HBase installation
# (equivalent to: export CLASSPATH=$CLASSPATH:`hbase classpath`)
HBase.resolve_dependency! :local

(If you're behind an http proxy, set up your ~/.m2/settings.xml file as described in this page)

Log4j logs from HBase

You may want to suppress (or customize) log messages from HBase.

# With an external log4j.properties or log4j.xml file
HBase.log4j = '/your/log4j.properties'
HBase.log4j = '/your/log4j.xml'

# With a Hash
HBase.log4j = { 'log4j.threshold' => 'ERROR' }

Connecting to HBase

# HBase on localhost
hbase = HBase.new

# HBase on remote host
hbase = HBase.new 'hbase.zookeeper.quorum' => 'remote-server.mydomain.net'

# Extra configuration
hbase = HBase.new 'hbase.zookeeper.quorum'       => 'remote-server.mydomain.net',
                  'hbase.client.retries.number'  => 3,
                  'hbase.client.scanner.caching' => 1000,
                  'hbase.rpc.timeout'            => 120000

# Close HBase connection
hbase.close

Accessing data with HBase::Table instance

HBase#[] method (or HBase#table) returns an HBase::Table instance which represents the table of the given name.

table = hbase.table(:test_table)

# Or simply,
table = hbase[:test_table]

Creating a table

# Drop table if exists
table.drop! if table.exists?

# Create table with two column families
table.create! cf1: {},
              cf2: { compression: :snappy, bloomfilter: :row }

List of operations

Operation Description
PUT Puts data into the table
GET Retrieves data from the table by one or more rowkeys
SCAN Scans the table for a given range of rowkeys
DELETE Deletes data in the table
INCREMENT Atomically increments one or more columns
APPEND Appends values to one or more columns within a single row
Checked PUT/DELETE Atomically checks if the pre-exising data matches the expected value and puts or deletes data
MUTATE Performs multiple mutations (PUTS and DELETES) atomically on a single row
Batch execution Performs multiple actions (PUT, GET, DELETE, INCREMENT, APPEND, and MUTATE) at once

Defining table schema for easier data access

HBase stores everything as plain Java byte arrays. So it's completely up to users to encode and decode column values of various types into and from byte arrays, and that is a quite tedious and error-prone task.

To remedy this situation, hbase-jruby implements the concept of table schema.

Using table schema greatly simplifies the way you access data:

  • With schema, byte array conversion becomes automatic
  • It allows you to omit column family names (e.g. :title instead of "cf1:title")

We'll use the following schema throughout the examples.

hbase.schema = {
  # Schema for `book` table
  book: {
    # Columns in cf1 family
    cf1: {
      title:    :string,     # String (UTF-8)
      author:   :string,
      category: :string,
      year:     :short,      # Short integer (2-byte)
      pages:    :int,        # Integer (4-byte)
      price:    :bigdecimal, # BigDecimal
      weight:   :float,      # Double-precision floating-point number
      in_print: :boolean,    # Boolean (true | false)
      image:    :raw         # Java byte array; no automatic type conversion
    },
    # Columns in cf2 family
    cf2: {
      summary:  :string,
      reviews:  :fixnum,     # Long integer (8-byte)
      stars:    :fixnum,
      /^comment\d+/ => :string
    }
  }
}

Columns that are not defined in the schema can be referenced using FAMILY:QUALIFIER notation or 2-element Array of column family name (as Symbol) and qualifier, however since there's no type information, they are returned as Java byte arrays, which have to be decoded manually.

PUT

# Putting a single row
# - Row keys can be of any type, in this case, we use String type
table.put 'rowkey1', title: "Hello World", year: 2013

# Putting multiple rows
table.put 'rowkey1' => { title: 'foo',    year: 2013 },
          'rowkey2' => { title: "bar",    year: 2014 },
          'rowkey3' => { title: 'foobar', year: 2015 }

# Putting values with timestamps
table.put 'rowkey1',
  title: {
    1353143856665 => "Hello world",
    1352978648642 => "Goodbye world"
  },
  year: 2013

GET

book = table.get('rowkey1')

# Rowkey
rowkey = row.rowkey         # Rowkey as raw Java byte array
rowkey = row.rowkey :string # Rowkey as String

# Access columns in schema
title  = book[:title]
author = book[:author]
year   = book[:year]

# Convert to simple Hash
hash = book.to_h

# Convert to Hash containing all versions of values indexed by their timestamps
all_hash = book.to_H

# Columns not defined in the schema are returned as Java byte arrays
# They need to be decoded manually
extra = HBase::Util.from_bytes(:bigdecimal, book['cf2:extra'])
# or, simply
extra = book.bigdecimal 'cf2:extra'

Batch-GET

# Pass an array of row keys as the parameter
books = table.get(['rowkey1', 'rowkey2', 'rowkey3'])

to_h

to_h and to_H return the Hash representation of the row. (The latter returns all values with their timestamp)

If a column is defined in the schema, it is referenced using its quailifier in Symbol type. If a column is not defined, it is represented as a 2-element Array of column family in Symbol and column qualifier as ByteArray. Even so, to make it easier to reference those columns, an extended version of Hash is returned with which you can also reference them with FAMILY:QUALIFIER notation or [cf, cq] array notation.

table.put 1000,
  title:      'Hello world', # Known column
  comment100: 'foo',         # Known column
  'cf2:extra' => 'bar',      # Unknown column
  [:cf2, 10]  => 'foobar'    # Unknown column, non-string qualifier

book = table.get 10000
hash = book.to_h
  # {
  #   :title => "Hello world",
  #   [:cf2, HBase::ByteArray<0, 0, 0, 0, 0, 0, 0, 10>] =>
  #       byte[102, 111, 111, 98, 97, 114]@6f28bb44,
  #   :comment100 => "foo",
  #   [:cf2, HBase::ByteArray<101, 120, 116, 114, 97>] =>
  #       byte[98, 97, 114]@77190cfc}
  # }

hash['cf2:extra']
  # byte[98, 97, 114]@77190cfc

hash[%w[cf2 extra]]
  # byte[98, 97, 114]@77190cfc

hash[[:cf2, HBase::ByteArray['extra']]]
  # byte[98, 97, 114]@77190cfc

hash['cf2:extra'].to_s
  # 'bar'

# Columns with non-string qualifiers must be referenced using 2-element Array notation
hash['cf2:10']
  # nil
hash[[:cf2, 10]]
  # byte[102, 111, 111, 98, 97, 114]@6f28bb44

hash_with_versions = book.to_H
  # {
  #   :title => {1369019227766 => "Hello world"},
  #   [:cf2, HBase::ByteArray<0, 0, 0, 0, 0, 0, 0, 10>] =>
  #       {1369019227766 => byte[102, 111, 111, 98, 97, 114]@6f28bb44},
  #   :comment100 => {1369019227766 => "foo"},
  #   [:cf2, HBase::ByteArray<101, 120, 116, 114, 97>]  =>
  #       {1369019227766 => byte[98, 97, 114]@77190cfc}}
  # }

Intra-row scan

Intra-row scan can be done using each method which yields HBase::Cell instances.

# Intra-row scan (all versions)
row.each do |cell|
  family    = cell.family
  qualifier = cell.qualifier :string  # Column qualifier as String
  timestamp = cell.timestamp
  value     = cell.value
end

# Array of HBase::Cells
cells = row.to_a

DELETE

# Delete a row
table.delete('rowkey1')

# Delete all columns in the specified column family
table.delete('rowkey1', :cf1)

# Delete a column
table.delete('rowkey1', :author)

# Delete multiple columns
table.delete('rowkey1', :author, :title, :image)

# Delete a column with empty qualifier.
# (!= deleing the entire columns in the family. See the trailing colon.)
table.delete('rowkey1', 'cf1:')

# Delete a version of a column
table.delete('rowkey1', :author, 1352978648642)

# Delete multiple versions of a column
table.delete('rowkey1', :author, 1352978648642, 1352978649642)

# Delete multiple versions of multiple columns
# - Two versions of :author
# - One version of :title
# - All versions of :image
table.delete('rowkey1', :author, 1352978648642, 1352978649642, :title, 1352978649642, :image)

# Batch delete; combination of aforementioned arguments each given as an Array
table.delete(['rowkey1'], ['rowkey2'], ['rowkey3', :author, 1352978648642, 135297864964])

However, the last syntax seems a bit unwieldy when you just wish to delete a few rows. In that case, use simpler delete_row method.

table.delete_row 'rowkey1'

table.delete_row 'rowkey1', 'rowkey2', 'rowkey3'

INCREMENT: Atomic increment of column values

# Atomically increase cf2:reviews by one
inc = table.increment('rowkey1', reviews: 1)
puts inc[:reviews]

# Atomically increase two columns by one and five respectively
inc = table.increment('rowkey1', reviews: 1, stars: 5)
puts inc[:stars]

APPEND

ret = table.append 'rowkey1', title: ' (limited edition)', summary: ' ...'
puts ret[:title]   # Updated title

Checked PUT and DELETE

table.check(:rowkey, in_print: false)
     .put(in_print: true, price: BigDecimal('10.0'))

table.check(:rowkey, in_print: false)
     .delete(:price, :image)
       # Takes the same parameters as those of HBase::Table#delete
       # except for the first rowkey
       #   https://github.com/junegunn/hbase-jruby#delete

MUTATE: Atomic mutations on a single row (PUTs and DELETEs)

# Currently Put and Delete are supported
# - Refer to mutateRow method of org.apache.hadoop.hbase.client.HTable
table.mutate(rowkey) do |m|
  m.put comment3: 'Nice', comment4: 'Great'
  m.delete :comment1, :comment2
end

Batch execution

Disclaimer: The ordering of execution of the actions is not defined. Refer to the documentation of batch method of HTable class.

ret = table.batch do |b|
  b.put rowkey1, 'cf1:a' => 100, 'cf1:b' => 'hello'
  b.get rowkey2
  b.append rowkey3, 'cf1:b' => 'world'
  b.delete rowkey3, 'cf2', 'cf3:z'
  b.increment rowkey3, 'cf1:a' => 200, 'cf1:c' => 300
  b.mutate(rowkey4) do |m|
    m.put 'cf3:z' => 3.14
    m.delete 'cf3:y', 'cf4'
  end
end

batch method returns an Array of Hashes which contains the results of the actions in the order they are specified in the block. Each Hash has :type entry (:get, :put, :append, etc.) and :result entry. If the type of an action is :put, :delete, or :mutate, the :result will be given as a boolean. If it's an :increment or :append, a plain Hash will be returned as the :result, just like in increment and append methods. For :get action, HBase::Row instance will be returned or nil if not found.

If one or more actions has failed, HBase::BatchException will be raised. Although you don't get to receive the return value from batch method, you can still access the partial results using results method of HBase::BatchException.

results =
  begin
    table.batch do |b|
      # ...
    end
  rescue HBase::BatchException => e
    e.results
  end

SCAN

HBase::Table itself is an enumerable object.

# Full scan
table.each do |row|
  p row.to_h
end

# Returns Enumerator when block is not given
table.each.with_index.each_slice(10).to_a

Scoped access

You can control how you retrieve data by chaining the following methods of HBase::Table (or HBase::Scoped).

Method Description
range Specifies the rowkey range of scan
project To retrieve only a subset of columns
filter Filtering conditions of scan
while Allows early termination of scan (server-side)
at Only retrieve data with the specified timestamp
time_range Only retrieve data within the specified time range
limit Limits the number of rows
versions Limits the number of versions of each column
caching Sets the number of rows for caching during scan
batch Limits the maximum number of values returned for each iteration
with_java_scan (ADVANCED) Access Java Scan object in the given block
with_java_get (ADVANCED) Access Java Get object in the given block

Each invocation to these methods returns an HBase::Scoped instance with which you can retrieve data with the following methods.

Method Description
get Fetches rows by the given rowkeys
each Scans the scope of the table (HBase::Scoped instance is Enumerable)
count Efficiently counts the number of rows in the scope
aggregate Performs aggregation using Coprocessor (To be described shortly)

Example of scoped access

import org.apache.hadoop.hbase.filter.RandomRowFilter

table.range('A'..'Z').                      # Row key range,
      project(:author).                     # Select cf1:author column
      project('cf2').                       # Select cf2 family as well
      filter(category: 'Comics').           # Filter by cf1:category value
      filter(year: [1990, 2000, 2010]).     # Set-inclusion condition on cf1:year
      filter(weight: 2.0..4.0).             # Range filter on cf1:weight
      filter(RandomRowFilter.new(0.5)).     # Any Java HBase filter
      while(reviews: { gt: 20 }).           # Early termination of scan
      time_range(Time.now - 600, Time.now). # Scan data of the last 10 minutes
      limit(10).                            # Limits the size of the result set
      versions(2).                          # Only fetches 2 versions for each value
      batch(100).                           # Batch size for scan set to 100
      caching(1000).                        # Caching 1000 rows
      with_java_scan { |scan|               # Directly access Java Scan object
        scan.setCacheBlocks false
      }.
      to_a                                  # To Array of HBase::Rows

range

HBase::Scoped#range method is used to filter rows based on their row keys.

# 100 ~ 900 (inclusive end)
table.range(100..900)

# 100 ~ 900 (exclusive end)
table.range(100...900)

# 100 ~ 900 (exclusive end)
table.range(100, 900)

# 100 ~
table.range(100)

#     ~ 900 (exclusive end)
table.range(nil, 900)

Optionally, prefix filter can be applied as follows.

# Prefix filter
# Row keys with "APPLE" prefix
#   Start key is automatically set to "APPLE",
#   stop key "APPLF" to avoid unnecessary disk access
table.range(prefix: 'APPLE')

# Row keys with "ACE", "BLUE" or "APPLE" prefix
#   Start key is automatically set to "ACE",
#   stop key "BLUF"
table.range(prefix: ['ACE', 'BLUE', 'APPLE'])

# Prefix filter with start key and stop key.
table.range('ACE', 'BLUEMARINE', prefix: ['ACE', 'BLUE', 'APPLE'])

Subsequent calls to #range override the range previously defined.

# Previous ranges are discarded
scope.range(1, 100).
      range(50..100).
      range(prefix: 'A').
      range(1, 1000)
  # Same as `scope.range(1, 1000)`

filter

You can configure server-side filtering of rows and columns with HBase::Scoped#filter calls. Multiple calls have conjunctive effects.

# Range scanning the table with filters
table.range(nil, 1000).
      filter(
        # Equality match
        year: 2013,

        # Range of numbers or characters: Checks if the value falls within the range
        weight: 2.0..4.0,
        author: 'A'..'C'

        # Will match rows *without* price column
        price: nil,

        # Regular expression: Checks if the value matches the regular expression
        summary: /classic$/i,

        # Hash: Tests the value with 6 types of operators (:gt, :lt, :gte, :lte, :eq, :ne)
        reviews: { gt: 100, lte: 200 },

        # Array of the aforementioned types: OR condition (disjunctive)
        category: ['Fiction', 'Comic', /science/i, { ne: 'Political Science' }]).

      # Multiple calls for conjunctive filtering
      filter(summary: /instant/i).

      # Any number of Java filters can be applied
      filter(org.apache.hadoop.hbase.filter.RandomRowFilter.new(0.5)).
  each do |record|
  # ...
end

while

HBase::Scoped#while method takes the same parameters as filter method, the difference is that each filtering condition passed to while method is wrapped by WhileMatchFilter, which aborts scan immediately when the condition is not met at a certain row. See the following example.

(0...30).each do |idx|
  table.put idx, year: 2000 + idx % 10
end

table.filter(year: { lte: 2001 }).map { |r| r.rowkey :fixnum }
  # [0, 1, 10, 11, 20, 21]
table.while(year: { lte: 2001 }).map { |r| r.rowkey :fixnum }
  # [0, 1]
  #   Scan terminates immediately when condition not met.

project

HBase::Scoped#project allows you to fetch only a subset of columns from each row. Multiple calls have additive effects.

# Fetches cf1:title, cf1:author, and all columns in column family cf2 and cf3
scoped.project(:title, :author, :cf2).
       project(:cf3)

HBase filters can not only filter rows but also columns. Since column filtering can be thought of as a kind of projection, it makes sense to internally apply column filters in HBase::Scoped#project, instead of in HBase::Scoped#filter, although it's still perfectly valid to pass column filter to filter method.

# Column prefix filter:
#   Fetch columns whose qualifiers start with the specified prefixes
scoped.project(prefix: 'alice').
       project(prefix: %w[alice bob])

# Column range filter:
#   Fetch columns whose qualifiers within the ranges
scoped.project(range: 'a'...'c').
       project(range: ['i'...'k', 'x'...'z'])

# Column pagination filter:
#   Fetch columns within the specified intra-scan offset and limit
scoped.project(offset: 1000, limit: 10)

When using column filters on fat rows with many columns, it's advised that you limit the batch size with HBase::Scoped#batch call to avoid fetching all columns at once. However setting batch size allows multiple rows with the same row key are returned during scan.

# Let's say that we have rows with more than 10 columns whose qualifiers start with `str`
puts scoped.range(1..100).
            project(prefix: 'str').
            batch(10).
            map { |row| [row.rowkey(:fixnum), row.count].map(&:to_s).join ': ' }

  # 1: 10
  # 1: 10
  # 1: 5
  # 2: 10
  # 2: 2
  # 3: 10
  # ...

Scoped SCAN / GET

scoped = table.versions(1)                 # Limits the number of versions
              .filter(year: 1990...2000)
              .range('rowkey0'..'rowkey2') # Range of rowkeys.
              .project('cf1', 'cf2:x')     # Projection

# Scoped GET
#   Nonexistent or filtered rows are returned as nils
scoped.get(['rowkey1', 'rowkey2', 'rowkey4'])

# Scoped SCAN
scoped.each do |row|
  row.each do |cell|
    # Intra-row scan
  end
end

# Scoped COUNT
#   When counting the number of rows, use `HTable::Scoped#count`
#   instead of just iterating through the scope, as it internally
#   minimizes the amount of data transfer using KeyOnlyFilter
#   (and FirstKeyOnlyFilter when no filter is set)
scoped.count

# This should be even faster as it dramatically reduces the number of RPC calls
scoped.caching(1000).count

# count method takes an options Hash:
# - :caching (default: nil)
# - :cache_blocks (default: true)
scoped.count(caching: 5000, cache_blocks: false)

Basic aggregation using coprocessor

You can perform some basic aggregation using the built-in coprocessor called org.apache.hadoop.hbase.coprocessor.AggregateImplementation.

To enable this feature, call enable_aggregation! method, which adds the coprocessor to the table.

table.enable_aggregation!
# Just a shorthand notation for
#   table.add_coprocessor! 'org.apache.hadoop.hbase.coprocessor.AggregateImplementation'

Then you can get the sum, average, minimum, maximum, row count, and standard deviation of the projected columns.

# cf1:a must hold 8-byte integer values
table.project(:reviews).aggregate(:sum)
table.project(:reviews).aggregate(:avg)
table.project(:reviews).aggregate(:min)
table.project(:reviews).aggregate(:max)
table.project(:reviews).aggregate(:std)
table.project(:reviews).aggregate(:row_count)

# Aggregation of multiple columns
table.project(:reviews, :stars).aggregate(:sum)

By default, aggregate method assumes that the projected values are 8-byte integers. For other data types, you can pass your own ColumnInterpreter.

table.project(:price).aggregate(:sum, MyColumnInterpreter.new)

Table inspection

# Table properties
table.properties
  # {:max_filesize       => 2147483648,
  #  :readonly           => false,
  #  :memstore_flushsize => 134217728,
  #  :deferred_log_flush => false}

# Properties of the column families
table.families
  # {"cf"=>
  #   {:blockcache            => true,
  #    :blocksize             => 65536,
  #    :bloomfilter           => "NONE",
  #    :cache_blooms_on_write => false,
  #    :cache_data_on_write   => false,
  #    :cache_index_on_write  => false,
  #    :compression           => "NONE",
  #    :compression_compact   => "NONE",
  #    :data_block_encoding   => "NONE",
  #    :evict_blocks_on_close => false,
  #    :in_memory             => false,
  #    :keep_deleted_cells    => false,
  #    :min_versions          => 0,
  #    :replication_scope     => 0,
  #    :ttl                   => 2147483647,
  #    :versions              => 3}}

There are also raw_ variants of properties and families. They return properties in their internal String format (mainly used in HBase shell). (See HTableDescriptor.values and HColumnDescriptor.values)

table.raw_properties
  # {"IS_ROOT"      => "false",
  #  "IS_META"      => "false",
  #  "MAX_FILESIZE" => "2147483648"}

table.raw_families
  # {"cf" =>
  #   {"DATA_BLOCK_ENCODING" => "NONE",
  #    "BLOOMFILTER"         => "NONE",
  #    "REPLICATION_SCOPE"   => "0",
  #    "VERSIONS"            => "3",
  #    "COMPRESSION"         => "NONE",
  #    "MIN_VERSIONS"        => "0",
  #    "TTL"                 => "2147483647",
  #    "KEEP_DELETED_CELLS"  => "false",
  #    "BLOCKSIZE"           => "65536",
  #    "IN_MEMORY"           => "false",
  #    "ENCODE_ON_DISK"      => "true",
  #    "BLOCKCACHE"          => "true"}}

These String key-value pairs are not really a part of the public API of HBase, and thus might change over time. However, they are most useful when you need to create a table with the same properties as the existing one.

hbase[:dupe_table].create!(table.raw_families, table.raw_properties)

With regions method, you can even presplit the new table just like the old one.

hbase[:dupe_table].create!(
  table.raw_families,
  table.raw_properties.merge(splits: table.regions.map { |r| r[:start_key] }.compact))

Table administration

HBase#Table provides a number of bang_methods! for table administration tasks. They run synchronously, except when mentioned otherwise (e.g. HTable#split!). Some of them take an optional block to allow progress monitoring and come with non-bang, asynchronous counterparts.

Creation and alteration

# Create a table with configurable table-level properties
table.create!(
    # 1st Hash: Column family specification
    {
      cf1: { compression: snappy },
      cf2: { bloomfilter: row }
    },

    # 2nd Hash: Table properties
    max_filesize:       256 * 1024 ** 2,
    deferred_log_flush: false,
    splits:             [1000, 2000, 3000]
)

# Alter table properties (synchronous with optional block)
table.alter!(
  max_filesize:       512 * 1024 ** 2,
  memstore_flushsize: 64 * 1024 ** 2,
  readonly:           false,
  deferred_log_flush: true
) { |progress, total|
  # Progress report with an optional block
  puts [progress, total].join('/')
}

# Alter table properties (asynchronous)
table.alter(
  max_filesize:       512 * 1024 ** 2,
  memstore_flushsize: 64 * 1024 ** 2,
  readonly:           false,
  deferred_log_flush: true
)

List of column family properties

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html

Some of the properties are only available on recent versions of HBase.

Property Type Description
:blockcache Boolean If MapFile blocks should be cached
:blocksize Fixnum Blocksize to use when writing out storefiles/hfiles on this column family
:bloomfilter Symbol/String Bloom filter type: :none, :row, :rowcol, or uppercase Strings
:cache_blooms_on_write Boolean If we should cache bloomfilter blocks on write
:cache_data_on_write Boolean If we should cache data blocks on write
:cache_index_on_write Boolean If we should cache index blocks on write
:compression Symbol/String Compression type: :none, :gz, :lzo, :lz4, :snappy, or uppercase Strings
:compression_compact Symbol/String Compression type: :none, :gz, :lzo, :lz4, :snappy, or uppercase Strings
:data_block_encoding Symbol/String Data block encoding algorithm used in block cache: :none, :diff, :fast_diff, :prefix, or uppercase Strings
:encode_on_disk Boolean If we want to encode data block in cache and on disk
:evict_blocks_on_close Boolean If we should evict cached blocks from the blockcache on close
:in_memory Boolean If we are to keep all values in the HRegionServer cache
:keep_deleted_cells Boolean If deleted rows should not be collected immediately
:min_versions Fixnum The minimum number of versions to keep (used when timeToLive is set)
:replication_scope Fixnum Replication scope
:ttl Fixnum Time-to-live of cell contents, in seconds
:versions Fixnum The maximum number of versions. (By default, all available versions are retrieved.)

List of table properties

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html

Property Type Description
:max_filesize Fixnum The maximum size upto which a region can grow to after which a region split is triggered
:readonly Boolean If the table is read-only
:memstore_flushsize Fixnum The maximum size of the memstore after which the contents of the memstore are flushed to the filesystem
:deferred_log_flush Boolean Defer the log edits syncing to the file system
:splits Array Region split points

Managing column families

# Add column family
table.add_family! :cf3, compression: :snappy, bloomfilter: :row

# Alter column family
table.alter_family! :cf2, bloomfilter: :rowcol

# Remove column family
table.delete_family! :cf1

Coprocessors

# Add Coprocessor
unless table.has_coprocessor?(cp_class_name1)
  table.add_coprocessor! cp_class_name1
end
table.add_coprocessor! cp_class_name2, path: path, priority: priority, params: params

# Remove coprocessor
table.remove_coprocessor! cp_class_name1

Region splits (asynchronous)

table.split!(1000)
table.split!(2000, 3000)

Snapshots

# Returns a list of all snapshot information
hbase.snapshots

# Table snapshots
table.snapshots
# Equivalent to
#   hbase.snapshots.select { |info| info[:table] == table.name }

# Creating a snapshot
table.snapshot! 'my_table_snapshot'

Advanced table administration

You can perform other types of administrative tasks with the native Java HBaseAdmin object, which can be obtained by HBase#admin method. Optionally, a block can be given so that the object is automatically closed at the end of the given block.

admin = hbase.admin
# ...
admin.close

# Access native HBaseAdmin object within the block
hbase.admin do |admin|
  admin.snapshot       'my_snapshot', 'my_table'
  admin.cloneSnapshot  'my_snapshot', 'my_clone_table'
  admin.deleteSnapshot 'my_snapshot'
  # ...
end

Advanced topics

Lexicographic scan order

HBase stores rows in the lexicographic order of the rowkeys in their byte array representations. Therefore, the type of the row key affects the scan order.

(1..15).times do |i|
  table.put i, data
  table.put i.to_s, data
end

table.range(1..3).map { |r| r.rowkey :fixnum }
  # [1, 2, 3]
table.range('1'..'3').map { |r| r.rowkey :string }
  # %w[1 10 11 12 13 14 15 2 3]

Non-string column qualifier

If a column qualifier is not a String, a 2-element Array should be used.

table.put 'rowkey',
  [:cf1, 100  ] => "Byte representation of an 8-byte integer",
  [:cf1, bytes] => "Qualifier is an arbitrary byte array"

table.get('rowkey')[:cf1, 100]
# ...

Shorter integers

A Ruby Fixnum is an 8-byte integer, which is equivalent long type in Java. When you want to use shorter integer types such as int, short, or byte, you can then use the special Hash representation of integers.

# 4-byte int value as the rowkey
table.put({ int: 12345 }, 'cf1:a' => { byte: 100 },   # 1-byte integer
                          'cf1:b' => { short: 200 },  # 2-byte integer
                          'cf1:c' => { int: 300 },    # 4-byte integer
                          'cf1:d' => 400)             # Ordinary 8-byte integer

row = table.get(int: 12345)

The use of these Hash-notations can be minimized if we define table schema as follows.

hbase.schema[table.name] = {
  cf1: {
    a: :byte,
    b: :short,
    c: :int,
    d: :fixnum
  }
}

table.put({ int: 12345 }, a: 100, b: 200, c: 300, d: 400)
row = table.get(int: 12345)

Working with byte arrays

In HBase, virtually everything is stored as a byte array. Although hbase-jruby tries hard to hide the fact, at some point you may need to get your hands dirty with native Java byte arrays. For example, it's a common practice to use a composite row key, which is a concatenation of several components of different types.

HBase::ByteArray is a boxed class for native Java byte arrays, which makes byte array manipulation much easier.

A ByteArray can be created as a concatenation of any number of objects.

ba = HBase::ByteArray[100, 3.14, {int: 300}, "Hello World"]

Then you can slice it and decode each part,

# Slicing
first  = ba[0, 8]
second = ba[8...16]

first.decode(:fixnum)  # 100
second.decode(:float)  # 3.14

append, prepend more elements to it,

ba.unshift 200, true
ba << { short: 300 }

concatenate another ByteArray,

ba += HBase::ByteArray[1024]

or shift decoded objects from it.

ba.shift(:fixnum)
ba.shift(:boolean)
ba.shift(:fixnum)
ba.shift(:float)
ba.shift(:int)
ba.shift(:string, 11)  # Byte length must be given as Strings are not fixed in size

ByteArray#java method returns the underlying native Java byte array.

ba.java  # Returns the native Java byte array (byte[])

Test

#!/bin/bash

# Test HBase 0.94 on localhost
export HBASE_JRUBY_TEST_ZK='127.0.0.1'
export HBASE_JRUBY_TEST_DIST='0.94'

# Test both for 1.8 and 1.9
for v in --1.8 --1.9; do
  export JRUBY_OPTS=$v
  rake test
done

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request