VectorEmbed

Vector embedding of strings, booleans, numerics, and arrays into LIBSVM / LIBLINEAR format.

Inspired by Sally, except VectorEmbed is meant to handle categorical and continuous data at the same time.

csv2libsvm

$ cat houses.csv 
label,household_income,zip_code_id,year
+1,35893,53703,1904
-1,43708,53711,1977
+1,103214,53719,NULL
-1,49250,53704,1950

$ csv2libsvm houses.csv 
1 1243483:35893 6439848:53703 8227451:1904
-1 1243483:43708 6439848:53711 8227451:1977
1 1243483:103214 6439848:53719 8227451:0
-1 1243483:49250 6439848:53704 8227451:1950

Usage

Create a VectorEmbed instance, which auto-detects and then remembers what kind of data goes into each feature:

>> require 'vector_embed'
=> true
>> v = VectorEmbed.new
=> #<VectorEmbed:0x007fd605815208 [...]>

Output a line with a label and arbitrary features:

>> label = 1
=> 1
>> features = { color: 'red', year: 1995, weight: 5.4e9 }
=> {:color=>"red", :year=>1995, :weight=>5400000000.0}
>> v.line(label, features)
=> "1 1997960:1 5556418:5400000000.0 8227451:1995"

Output another line:

>> label = 0
=> 0
>> features = { color: 'blue', year: 1821, weight: 3.3 }
=> {:color=>"blue", :year=>1821, :weight=>3.3}
>> v.line(label, features)
=> "0 1089740:1 5556418:3.3 8227451:1821"

Note that color: 'red' and color: 'blue' are being translated into categories:

1997960:1 # murmur3("color\x00red"):1
1089740:1 # murmur3("color\x00blue"):1

A similar thing happens with true/false:

>> v.line(1, yes: true, no: false)
=> "1 1559987:1 3324244:1"

i.e.

1559987:1 # murmur3("yes\x00true"):1
3324244:1 # murmur3("no\x00false"):1

N-grams

Currently uses same parameter names as Sally.

Word ngrams

>> v = VectorEmbed.new ngram_len: 1, ngram_delim: /\s+/, stop_words: %w{the and or}
=> #<VectorEmbed:0x007fd6033b77f8 [...]>
>> v.line(1, notes: 'the quick brown fox')
=> "1 1512788:1 3426202:1 5079692:1"

You get the idea: ("the" has been filtered out by stop words)

1512788:1 # murmur3("notes\x00quick"):1
3426202:1 # murmur3("notes\x00brown"):1
5079692:1 # murmur3("notes\x00fox"):1

Byte n-grams

>> v = VectorEmbed.new ngram_len: 2, ngram_delim: ''
=> #<VectorEmbed:0x007fd60337ea20 [...]>
>> v.line(1, notes: 'foobar')
=> "1 2148745:1 2878919:1 3600333:1 3621715:1 5885921:1"

So therefore:

2148745:1 # murmur3("notes\x00fo"):1
2878919:1 # murmur3("notes\x00oo"):1
3600333:1 # murmur3("notes\x00ob"):1
3621715:1 # murmur3("notes\x00ba"):1
5885921:1 # murmur3("notes\x00ar"):1

Debugging

VectorEmbed tries to do the right thing, but if it's not, try turning on debugging:

>> v = VectorEmbed.new
=> #<VectorEmbed:0x007fd6034020a0 [...]>
>> v.logger.level = Logger::DEBUG
=> 0
>> v.line(1, '3' => 7, foo: 'bar', truthy: false, nullity: nil)
D, [2013-02-20T16:55:00.139299 #21595] DEBUG -- : Interpreting "3" as Number given first value 7
D, [2013-02-20T16:55:00.139561 #21595] DEBUG -- : Interpreting :foo as Phrase given first value "bar"
D, [2013-02-20T16:55:00.139671 #21595] DEBUG -- : Interpreting :truthy as Boolean given first value false
D, [2013-02-20T16:55:00.139755 #21595] DEBUG -- : Interpreting :nullity as Boolean given first value nil
D, [2013-02-20T16:55:00.139872 #21595] DEBUG -- : Interpreting "label" as Number given first value 1
=> "1 2647413:7 4091306:1 7123386:1 9259635:1"

One thing it doesn't like: (assuming you have already performed the lines above)

>> v.line(1, '3' => 'bar')
ArgumentError: Can't embed "bar" in number feature "3".

It's saying that, given you first passed it 7, it thought "3" was a feature that held numbers.

Gotchas

  • Following Sally, it only uses the first 22 bits of the murmur hash for feature indices... more and LIBSVM seems to choke.
  • Stop words are currently filtered out of feature indices... probably shouldn't be.

Copyright 2013 Seamus Abshere