Class: Ngram::Dictionary

Inherits:
Object
Defined in:
lib/ngrams/ngrams.rb

Overview

The Dictionary holds an indexed collection of bigrams (2-letter combinations) and trigrams (3-letter combinations) extracted from a dictionary of words.

Example usage:

dict = Dictionary.load
word = dict.ngram( :first, 3 )
5.times { word << dict.next_char( word ) }
puts word

Of course, a simpler way to achieve the same result is to call dict.word(8).

Constant Summary

DEFAULT_STORE =
File.join( File.dirname( __FILE__ ), '..', '..', 'data', 'ngrams.yml' )

Constructor Details

#initialize ⇒ Dictionary

Initialize a new, empty, Dictionary.

Use #add_from_file or #add_from_word to load new ngrams into the dictionary. Once all words have been loaded, call #build_indices to ready the dictionary for use and #save to write it to disk.



# File 'lib/ngrams/ngrams.rb', line 56

def initialize
  @ngrams = {
    :first => {
      2 => Hash.new( 0 ),
      3 => Hash.new( 0 )
    },
    :any => {
      2 => Hash.new( 0 ),
      3 => Hash.new( 0 )
    }
  }
  
  init_reverse_index
  init_walk_tree
end

Instance Attribute Details

#ngrams ⇒ Object

Returns the value of attribute ngrams.



# File 'lib/ngrams/ngrams.rb', line 42

def ngrams
  @ngrams
end

#ridx ⇒ Object

Returns the value of attribute ridx.



# File 'lib/ngrams/ngrams.rb', line 42

def ridx
  @ridx
end

#walk ⇒ Object

Returns the value of attribute walk.



# File 'lib/ngrams/ngrams.rb', line 42

def walk
  @walk
end

Class Method Details

.load(file = DEFAULT_STORE) ⇒ Object

Returns a Dictionary instance initialized from the YAML data in the specified file.



# File 'lib/ngrams/ngrams.rb', line 47

def self.load( file = DEFAULT_STORE )
  File.open( file ) { |file| YAML::load( file ) }
end

Instance Method Details

#add_from_file(file) ⇒ Object

Adds ngrams to the current dictionary for the words found in the specified file. The file should contain one word per line and, ideally, only alphabetic characters. Each line has its line ending removed and is downcased before being passed to #add_from_word.



# File 'lib/ngrams/ngrams.rb', line 126

def add_from_file( file )
  File.open( file, "r" ) do |file|
    file.each { |line| add_from_word( line.chomp.downcase ) }
  end
end

#add_from_word(word) ⇒ Object

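Adds ngrams to the current dictionary using the given word as a source. For each length (2 and 3), the word's leading ngram is counted under :first and every ngram in the word is counted under :any.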
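The counting step can be sketched in isolation. The listing below relies on a String#ngrams extension that is not shown on this page, so the sketch substitutes a hypothetical helper, ngrams_of, and assumes it returns every contiguous substring of the requested length. The counts hash mirrors the layout created in #initialize.

```ruby
# Hypothetical stand-in for the String#ngrams extension (an assumption):
# every contiguous substring of length n, left to right.
def ngrams_of(word, n)
  return [] if word.length < n
  (0..word.length - n).map { |i| word[i, n] }
end

# Same nested layout as the @ngrams hash built in #initialize.
counts = {
  first: { 2 => Hash.new(0), 3 => Hash.new(0) },
  any:   { 2 => Hash.new(0), 3 => Hash.new(0) }
}

%w[banana band].each do |word|
  2.upto(3) do |n|
    grams = ngrams_of(word, n)
    next if grams.empty?                          # word shorter than n
    counts[:first][n][grams.first] += 1           # word-initial ngram
    grams.each { |g| counts[:any][n][g] += 1 }    # every ngram in the word
  end
end

p counts[:first][2]
p counts[:any][3]
```

Words shorter than the ngram length contribute nothing for that length, which matches the size check in the listing.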



# File 'lib/ngrams/ngrams.rb', line 133

def add_from_word( word )
  2.upto( 3 ) do |n|
    ngrams = word.ngrams( n )
    
    unless ngrams.size == 0
      inc( :first, n, ngrams.first )
      ngrams.each { |ngram| inc( :any, n, ngram ) }
    end
  end
end

#build_indices ⇒ Object

Builds the reverse index and walk trees used by the random selection and walk code. If using a new dictionary (rather than one obtained via .load), call this before using #word, #ngram, or #next_char.
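The private builders are not shown on this page, so the following is only a sketch of what a walk tree could look like, inferred from how #next_char reads it: walk[a][b] holds a total and a list of [cumulative_sum, next_character] pairs. The name walk_tree_from is hypothetical and is not part of the library.

```ruby
# Hypothetical builder (not the library's own): groups trigram counts by
# their first two characters and stores cumulative sums per follower.
def walk_tree_from(trigram_counts)
  tree = Hash.new { |h, a| h[a] = Hash.new { |g, b| g[b] = [0, []] } }
  trigram_counts.each do |gram, count|
    a, b, c = gram.chars
    node = tree[a][b]
    node[0] += count          # running total for the prefix a + b
    node[1] << [node[0], c]   # cumulative sum paired with the follower
  end
  tree
end

tree = walk_tree_from({ "the" => 3, "tha" => 1, "hen" => 2 })
p tree["t"]["h"]
```

Storing cumulative sums rather than raw counts means a later lookup only needs one random integer and one linear scan to make a weighted choice.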



# File 'lib/ngrams/ngrams.rb', line 148

def build_indices
  build_reverse_index
  build_walk_tree
end

#next_char(a, b = nil) ⇒ Object

Returns a randomly selected character to follow the input. Repeated calls to this method implement a random-walk through the ngrams in the dictionary given a specified starting point.

Either supply a string parameter containing a word for completion or two single characters. The following calls are equivalent:

next_char( 'a', 'b' )
next_char( 'ab' )

In both cases the call will return a randomly selected character to follow the specified characters. The Dictionary tracks the frequency of each ngram and the random selection is weighted such that the probability of any following character being selected is proportional to the frequency with which it follows the specified characters in the source dictionary.
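The two calling conventions can be illustrated with a small standalone sketch. It assumes a hand-written walk table in the shape the listing reads (a total followed by [cumulative_sum, character] pairs), and the name next_char_sketch and the injectable rng parameter are additions for illustration only. The sketch uses a strict sum > r comparison with an integer draw in 0...total, so each follower is matched by exactly as many draws as its count; the library's own index layout is not shown here and may differ.

```ruby
# Illustrative only: mirrors the argument handling of #next_char.
def next_char_sketch(walk, a, b = nil, rng = Random.new)
  a, b = a[-2, 1], a[-1, 1] if b.nil?   # use the last two characters of a word
  total, pairs = walk[a][b]
  r = rng.rand(total)                    # integer in 0...total
  pairs.detect { |sum, _| sum > r }.last.dup
end

# After "th": "e" was seen 3 times and "a" once (made-up numbers).
walk = { "t" => { "h" => [4, [[3, "e"], [4, "a"]]] } }

word = "th".dup
word << next_char_sketch(walk, word)          # string form
puts word
puts next_char_sketch(walk, "t", "h")         # two-character form
```

With these counts, "e" should follow "th" about three times out of four.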



# File 'lib/ngrams/ngrams.rb', line 100

def next_char( a, b = nil )
  if b.nil?
    a, b = a[-2,1], a[-1,1]
  end      
  r = Integer( @walk[a][b].first * rand )
  @walk[a][b].last.detect { |sum,c| sum >= r }.last.dup
end

#ngram(type, length) ⇒ Object

Returns a randomly selected 2- or 3-character ngram string.

Specifying type :first will select only ngrams that appear at the beginning of words from the source dictionary. Type :any will select ngrams that appear anywhere in a word.

length can be either 2 (bigram) or 3 (trigram).

The Dictionary tracks the frequency of each ngram and the random selection is weighted such that the probability of any ngram being selected is proportional to its frequency in the source dictionary.
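Frequency-weighted selection of this kind is commonly done with a cumulative-sum index. The sketch below is one way to do it, not necessarily how the library builds @ridx and @sigma; the names cumulative_index and weighted_pick are hypothetical, and the counts are made up.

```ruby
# Turn {ngram => count} into [total, [[running_sum, ngram], ...]].
def cumulative_index(counts)
  running = 0
  pairs = counts.map do |gram, count|
    running += count
    [running, gram]
  end
  [running, pairs]
end

# Draw one ngram; rng is injectable so a run can be reproduced.
def weighted_pick(index, rng = Random.new)
  total, pairs = index
  r = rng.rand(total)                         # integer in 0...total
  pairs.detect { |sum, _| sum > r }.last.dup  # first range that contains r
end

index = cumulative_index({ "th" => 3, "he" => 1, "an" => 2 })
puts weighted_pick(index)
```

With a total of 6, the draws 0 to 2 select "th", 3 selects "he", and 4 to 5 select "an", so each ngram's share of the draws equals its share of the counts.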



# File 'lib/ngrams/ngrams.rb', line 82

def ngram( type, length )
  r = Integer( @sigma[type][length] * rand )
  @ridx[type][length].detect { |sum,_| sum >= r }.last.dup
end

#save(file) ⇒ Object

Stores the Ngram dictionary and its indices in a file using YAML.
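A YAML round trip of the count data can be sketched without the library. The example below dumps a plain hash (made-up counts, same nesting as the ngram tables) and reads it back with YAML.safe_load, permitting Symbol so that keys such as :first survive. The helper name round_trip is hypothetical. Note that on newer Ruby versions (Psych 4 and later) YAML.load restricts which classes it will instantiate, so restoring a whole Dictionary object may need YAML.unsafe_load or an explicit permitted_classes: list; treat that as something to verify against the Ruby version in use.

```ruby
require "yaml"
require "tempfile"

# Write data to a temporary YAML file and read it back.
def round_trip(data)
  Tempfile.create(["ngrams", ".yml"]) do |f|
    YAML.dump(data, f)   # same call shape as in #save
    f.flush
    YAML.safe_load(File.read(f.path), permitted_classes: [Symbol])
  end
end

data = {
  first: { 2 => { "th" => 3 } },
  any:   { 2 => { "he" => 4, "an" => 5 } }
}

restored = round_trip(data)
p restored == data
```

Symbol and Integer keys come back with their original types, so lookups such as restored[:any][2] keep working after a reload.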



# File 'lib/ngrams/ngrams.rb', line 117

def save( file )
  File.open( file, "w" ) do |file|
    YAML::dump( self, file )
  end
end

#word(length) ⇒ Object

Returns a word created by selecting a word-initial trigram and then doing a random walk to add the remaining characters up to the specified length. Because the walk starts from a trigram, the result is never shorter than 3 characters.



# File 'lib/ngrams/ngrams.rb', line 110

def word( length )
  s = ngram( :first, 3 )
  ( length - 3 ).times { s << next_char( s ) }
  s
end