Class: Ngram::Dictionary
- Inherits:
- Object (inheritance chain: Object, Ngram::Dictionary)
- Defined in:
- lib/ngrams/ngrams.rb
Overview
The Dictionary holds an indexed collection of bigrams (2-letter combinations) and trigrams (3-letter combinations) extracted from a dictionary of words.
Example usage:
dict = Dictionary.load
word = dict.ngram( :first, 3 )
5.times { word << dict.next_char( word ) }
puts word
Of course, a simpler way to achieve the same result is to call dict.word(8).
Constant Summary collapse
- DEFAULT_STORE =
File.join( File.dirname( __FILE__ ), '..', '..', 'data', 'ngrams.yml' )
Instance Attribute Summary collapse
-
#ngrams ⇒ Object
Returns the value of attribute ngrams.
-
#ridx ⇒ Object
Returns the value of attribute ridx.
-
#walk ⇒ Object
Returns the value of attribute walk.
Class Method Summary collapse
-
.load(file = DEFAULT_STORE) ⇒ Object
Return a Dictionary instance initialized using the YAML data in the specified file.
Instance Method Summary collapse
-
#add_from_file(file) ⇒ Object
Add ngrams to the current dictionary corresponding to the words found in the specified file.
-
#add_from_word(word) ⇒ Object
Add ngrams to the current dictionary using the given word as a source.
-
#build_indices ⇒ Object
Builds the reverse index and walk trees that are used by the random selection and walk code.
-
#initialize ⇒ Dictionary
constructor
Initialize a new, empty, Dictionary.
-
#next_char(a, b = nil) ⇒ Object
Returns a randomly selected character to follow the input.
-
#ngram(type, length) ⇒ Object
Returns a randomly selected 2- or 3-character ngram string.
-
#save(file) ⇒ Object
Store the Ngram dictionary and indices to a file using YAML.
-
#word(length) ⇒ Object
Returns a word created by selecting a starting ngram and then doing a random walk to add the remaining characters to the specified length.
Constructor Details
#initialize ⇒ Dictionary
Initialize a new, empty, Dictionary.
Use #add_from_file or #add_from_word to load new ngrams into the dictionary. Once all words have been loaded, call #build_indices to ready the dictionary for use and #save to write it to disk.
# File 'lib/ngrams/ngrams.rb', line 56
def initialize
  @ngrams = {
    :first => { 2 => Hash.new( 0 ), 3 => Hash.new( 0 ) },
    :any => { 2 => Hash.new( 0 ), 3 => Hash.new( 0 ) }
  }
  init_reverse_index
  init_walk_tree
end
Instance Attribute Details
#ngrams ⇒ Object
Returns the value of attribute ngrams.
# File 'lib/ngrams/ngrams.rb', line 42
def ngrams
  @ngrams
end
#ridx ⇒ Object
Returns the value of attribute ridx.
# File 'lib/ngrams/ngrams.rb', line 42
def ridx
  @ridx
end
#walk ⇒ Object
Returns the value of attribute walk.
# File 'lib/ngrams/ngrams.rb', line 42
def walk
  @walk
end
Class Method Details
.load(file = DEFAULT_STORE) ⇒ Object
Return a Dictionary instance initialized using the YAML data in the specified file.
# File 'lib/ngrams/ngrams.rb', line 47
def self.load( file = DEFAULT_STORE )
  File.open( file ) { |file| YAML::load( file ) }
end
Instance Method Details
#add_from_file(file) ⇒ Object
Add ngrams to the current dictionary corresponding to the words found in the specified file. The file should contain one word per line and (ideally) only use alpha characters.
# File 'lib/ngrams/ngrams.rb', line 126
def add_from_file( file )
  File.open( file, "r" ) do |file|
    file.each { |line| add_from_word( line.chomp.downcase ) }
  end
end
#add_from_word(word) ⇒ Object
Add ngrams to the current dictionary using the given word as a source.
# File 'lib/ngrams/ngrams.rb', line 133
def add_from_word( word )
  2.upto( 3 ) do |n|
    ngrams = word.ngrams( n )
    unless ngrams.size == 0
      inc( :first, n, ngrams.first )
      ngrams.each { |ngram| inc( :any, n, ngram ) }
    end
  end
end
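The listing relies on a String#ngrams extension and an inc helper that are not shown on this page. As a rough, standalone sketch of the same counting idea, the following uses two hypothetical functions, ngrams_of and count_ngrams, which are illustrative names and not part of the library. It assumes an ngram is simply every run of n consecutive characters in the word.

```ruby
# Hypothetical helper (not the library's String#ngrams): every run of
# n consecutive characters in the word, in order of appearance.
def ngrams_of(word, n)
  word.chars.each_cons(n).map(&:join)
end

# Hypothetical helper: tally bigrams and trigrams for a list of words,
# mirroring the :first / :any layout shown in #initialize.
def count_ngrams(words)
  counts = {
    first: { 2 => Hash.new(0), 3 => Hash.new(0) },
    any:   { 2 => Hash.new(0), 3 => Hash.new(0) }
  }
  words.each do |word|
    2.upto(3) do |n|
      grams = ngrams_of(word.downcase, n)
      next if grams.empty?                        # word shorter than n
      counts[:first][n][grams.first] += 1         # ngram that starts the word
      grams.each { |g| counts[:any][n][g] += 1 }  # every ngram in the word
    end
  end
  counts
end
```

Under these assumptions, a word shorter than n contributes nothing for that n, which matches the size check in the listing above.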
#build_indices ⇒ Object
Builds the reverse index and walk trees that are used by the random selection and walk code. If using a new dictionary (rather than a dictionary obtained via #load), call this before using #word, #ngram, or #next_char.
# File 'lib/ngrams/ngrams.rb', line 148
def build_indices
  build_reverse_index
  build_walk_tree
end
#next_char(a, b = nil) ⇒ Object
Returns a randomly selected character to follow the input. Repeated calls to this method implement a random-walk through the ngrams in the dictionary given a specified starting point.
Either supply a string parameter containing a word for completion or two single characters. The following calls are equivalent:
next_char( 'a', 'b' )
next_char( 'ab' )
In both cases the call will return a randomly selected character to follow the specified characters. The Dictionary tracks the frequency of each ngram and the random selection is weighted such that the probability of any following character being selected is proportional to the frequency with which it follows the specified characters in the source dictionary.
# File 'lib/ngrams/ngrams.rb', line 100
def next_char( a, b = nil )
  if b.nil?
    a, b = a[-2,1], a[-1,1]
  end
  r = Integer( @walk[a][b].first * rand )
  @walk[a][b].last.detect { |sum,c| sum >= r }.last.dup
end
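The listing reads each @walk[a][b] entry as a pair: a total count followed by a list of [running sum, character] entries. The code that builds this tree is not shown here, so the following is only a sketch of one structure that would fit that access pattern. The names build_walk_tree and the standalone next_char, the rng parameter, and the strict sum > r comparison are all assumptions made for illustration.

```ruby
# Hypothetical builder: turn trigram counts such as { "the" => 5 } into a
# nested table walk[a][b] = [total, [[running_sum, c], ...]].
def build_walk_tree(trigram_counts)
  walk = Hash.new { |hash, a| hash[a] = {} }
  trigram_counts.each do |trigram, count|
    a, b, c = trigram.chars
    entry = (walk[a][b] ||= [0, []])
    entry[0] += count          # total count for the prefix "ab"
    entry[1] << [entry[0], c]  # running sum paired with the follower
  end
  walk
end

# Standalone sketch of the lookup: pick a follower for the last two
# characters of text, weighted by how often it followed them.
def next_char(walk, text, rng = Random.new)
  a, b = text[-2, 1], text[-1, 1]
  total, choices = walk.fetch(a).fetch(b)  # raises KeyError on unknown prefix
  r = rng.rand(total)                      # integer in 0...total
  choices.detect { |sum, _| sum > r }.last.dup
end
```

With counts of { "the" => 5, "tha" => 2 }, the prefix "th" maps to [7, [[5, "e"], [7, "a"]]], so "e" would be chosen for five of the seven possible values of r.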
#ngram(type, length) ⇒ Object
Returns a randomly selected 2- or 3-character ngram string.
Specifying type :first will select only ngrams that appear at the beginning of words from the source dictionary. Type :any will select ngrams that appear anywhere in a word.
length can be either 2 (bigram) or 3 (trigram).
The Dictionary tracks the frequency of each ngram and the random selection is weighted such that the probability of any ngram being selected is proportional to its frequency in the source dictionary.
# File 'lib/ngrams/ngrams.rb', line 82
def ngram( type, length )
  r = Integer( @sigma[type][length] * rand )
  @ridx[type][length].detect { |sum,_| sum >= r }.last.dup
end
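The frequency-weighted selection described above can be illustrated with a flat cumulative-sum index. This is a sketch only: it assumes the reverse index is a list of [running sum, ngram] pairs, which is consistent with how the listing reads @ridx but is not confirmed by this page. The names build_cumulative_index and pick_weighted are hypothetical, and the injectable rng exists only to make the example repeatable.

```ruby
# Hypothetical helper: convert { ngram => count } into
# [[running_sum, ngram], ...] in insertion order.
def build_cumulative_index(counts)
  running = 0
  counts.map { |ngram, count| [running += count, ngram] }
end

# Hypothetical helper: draw one ngram so that each is chosen with
# probability count / total.
def pick_weighted(index, rng = Random.new)
  total = index.last.first
  r = rng.rand(total)  # integer in 0...total
  index.detect { |sum, _| sum > r }.last.dup
end

index = build_cumulative_index({ "th" => 3, "he" => 1 })
rng = Random.new(42)
tally = Hash.new(0)
4000.times { tally[pick_weighted(index, rng)] += 1 }
# With these weights, "th" should account for roughly three quarters of draws.
```

The linear detect scan keeps the sketch short. For a large index, a binary search over the running sums would be a reasonable alternative.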
#save(file) ⇒ Object
Store the Ngram dictionary and indices to a file using YAML.
# File 'lib/ngrams/ngrams.rb', line 117
def save( file )
  File.open( file, "w" ) do |file|
    YAML::dump( self, file )
  end
end
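Because #save dumps the whole object, the resulting YAML carries a Ruby object tag. On recent Ruby versions (Psych 4 and later), YAML.load restricts which classes it will instantiate by default, so reading such a file back may need an explicit opt-in. The sketch below shows the round trip with a hypothetical TinyStore class that stands in for the dictionary. It is not the library's code, and it should only be used on files you trust.

```ruby
require "yaml"

# Hypothetical stand-in for an object saved with YAML.dump(self, io).
class TinyStore
  attr_reader :counts

  def initialize(counts = {})
    @counts = counts
  end

  # Serialize the whole object, as the #save listing does.
  def save(path)
    File.open(path, "w") { |io| YAML.dump(self, io) }
  end

  # Read the object back. unsafe_load (where available) permits arbitrary
  # classes; older Psych versions allow this through plain YAML.load.
  def self.load(path)
    text = File.read(path)
    YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(text) : YAML.load(text)
  end
end
```

A stricter option on Psych 4 is to pass an explicit permitted_classes list to YAML.load instead of using unsafe_load.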
#word(length) ⇒ Object
Returns a word created by selecting a starting trigram and then doing a random walk to add the remaining characters up to the specified length. Because the starting ngram is always three characters long, a length below 3 still returns a 3-character string.
# File 'lib/ngrams/ngrams.rb', line 110
def word( length )
  s = ngram( :first, 3 )
  ( length - 3 ).times { s << next_char( s ) }
  s
end