Class: Loremarkov

Inherits:
Object
Defined in:
lib/loremarkov.rb

Overview

Loremarkov uses Markov chains to generate plausible-sounding text, given an input corpus. It comes with a few built-in sample texts.

It is based on Chapter 3 of Kernighan & Pike’s *The Practice of Programming*.

Install Loremarkov with Rubygems:

gem install loremarkov

Once installed, the `destroy` command can be used to generate plausible-sounding text. The input text may be provided by filename, via STDIN, or by naming one of the built-in sample texts.

destroy lorem_ipsum
destroy ~/my_first_corpus.txt
man ls | destroy

Constant Summary

TOKENS =

TOKENS - These tokens are the delimiters that split the text into words. In contrast to Ruby’s String#split, the tokens themselves are included in the resulting array.

["\n", "\t", ' ', "'", '"']
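
The contrast with String#split can be seen in plain Ruby, independent of Loremarkov. The following is a minimal sketch; the sample string and variable names are chosen for illustration and are not part of the library. String#split drops the delimiters it matches, unless the pattern contains a capture group, in which case the captured delimiters are returned as well. That second form resembles the array that TOKENS-based lexing produces.

```ruby
# Illustration only: the sample string and variable names are not from Loremarkov.
sample = "a b\tc"

# A plain pattern discards the delimiters it matches.
without_delimiters = sample.split(/[ \t]/)

# A capture group keeps each matched delimiter in the result.
with_delimiters = sample.split(/([ \t])/)

p without_delimiters # => ["a", "b", "c"]
p with_delimiters    # => ["a", " ", "b", "\t", "c"]
```

Because the delimiters are kept, joining the second array gives back the original string.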

Constructor Details

#initialize(num_prefix_words) ⇒ Loremarkov

A larger num_prefix_words makes the generated text follow the original text more closely.



# File 'lib/loremarkov.rb', line 105

def initialize(num_prefix_words)
  @num_prefix_words = num_prefix_words
  @markov = {}
end

Instance Attribute Details

#markov ⇒ Object (readonly)

Exposes the underlying Markov structure. Useful for testing, at the very least.



# File 'lib/loremarkov.rb', line 102

def markov
  @markov
end

Class Method Details

.analyze(text, num_prefix_words) ⇒ Object

analyze - Generate a Markov data structure

  • Keys and values are both arrays of strings.

  • Keys are prefixes: ordered word sequences of constant length (num_prefix_words).

  • Values accumulate the word that follows the prefix, once for every time the prefix occurs.

  • e.g. If a prefix occurs twice, the value is an array of two words, possibly the same word twice.

  • The last prefix in the text gets a nil value, signifying EOF.



# File 'lib/loremarkov.rb', line 73

def self.analyze(text, num_prefix_words)
  markov = {}
  words = lex(text)

  # Go through the possible valid prefixes.  Adding 1 gives you the final
  # key: *num_prefix_words* words with a nil value  -- signifying EOF
  (words.length - num_prefix_words + 1).times { |i|
    prefix_words = []
    num_prefix_words.times { |j| prefix_words << words[i + j] }
    # Set to empty array on a new prefix.
    # Add the target word, which will be nil on the last iteration
    markov[prefix_words] ||= []
    markov[prefix_words] << words[i + num_prefix_words]
  }
  markov
end
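
As a worked example of the structure described above, here is a minimal sketch that builds the same kind of hash from an array that has already been lexed. The helper name `build_chain` and the sample words are hypothetical and not part of Loremarkov; the sketch only mirrors the indexing shown in the listing.

```ruby
# Hypothetical helper, not part of Loremarkov: maps each run of n words
# to the list of words that follow it.
def build_chain(words, n)
  chain = Hash.new { |hash, key| hash[key] = [] }
  # The "+ 1" adds the final prefix, whose successor is nil (EOF).
  (words.length - n + 1).times do |i|
    chain[words[i, n]] << words[i + n]
  end
  chain
end

# Lexed form of "a b a c": spaces are tokens and appear in the array.
words = ["a", " ", "b", " ", "a", " ", "c"]
chain = build_chain(words, 2)

# The prefix ["a", " "] occurs twice, so it collects two successors.
p chain[["a", " "]] # => ["b", "c"]
# The last prefix maps to nil, which marks the end of the text.
p chain[[" ", "c"]] # => [nil]
```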

.lex(str, tokens = TOKENS) ⇒ Object

lex - Decompose text into an array of tokens and words. Words are merely the strings of characters between the nearest two TOKENS, e.g.

lex %q{"Hello", he said.}

becomes

- %q{"}      # TOKEN
- %q{Hello}  # word
- %q{"}      # TOKEN
- %q{,}      # word
- %q{ }      # TOKEN
- %q{he}     # word
- %q{ }      # TOKEN
- %q{said.}  # word

This operation can be losslessly reversed by calling #join on the resulting array, i.e. `lex(str).join == str`



# File 'lib/loremarkov.rb', line 45

def self.lex(str, tokens = TOKENS)
  final_ary = []
  word = ''
  # This code makes no attempt to deal with non-ASCII string encodings.
  # i.e.  byte-per-char
  str.each_byte { |b|
    # This byte is either a token, thereby ending the current word
    # or it is part of the current word
    if tokens.include?(b.chr)
      final_ary << word if !word.empty?
      final_ary << b.chr
      word = ''
    else
      word << b.chr
    end
  }
  final_ary << word if !word.empty?
  final_ary
end
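
The listing above works byte by byte and, as its comment says, does not attempt to handle non-ASCII encodings. The following is a sketch of a character-based variant, assuming you would want multi-byte characters kept intact. The names `lex_sketch` and `SKETCH_TOKENS` are hypothetical and not part of Loremarkov.

```ruby
# Hypothetical variant, not part of Loremarkov: same splitting rule as lex,
# but iterating over characters instead of bytes.
SKETCH_TOKENS = ["\n", "\t", ' ', "'", '"'].freeze

def lex_sketch(str, tokens = SKETCH_TOKENS)
  pieces = []
  word = +''
  str.each_char do |ch|
    if tokens.include?(ch)
      # A token ends the current word and is itself kept in the output.
      pieces << word unless word.empty?
      pieces << ch
      word = +''
    else
      word << ch
    end
  end
  pieces << word unless word.empty?
  pieces
end

input = %q{"Héllo", he said.}
pieces = lex_sketch(input)

p pieces               # tokens and words, in order
p pieces.join == input # the split is lossless
```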

.sample_text(name) ⇒ Object

Given `lorem_ipsum`, return the string from reading `text/lorem_ipsum`



# File 'lib/loremarkov.rb', line 97

def self.sample_text(name)
  File.read File.join(__dir__, '..', 'text', name)
end

.start_prefix(text, num_prefix_words) ⇒ Object

Given the entire text, use an extremely conservative heuristic to grab only the first chunk of it to pass to lex, then return the first num_prefix_words items of the result.



# File 'lib/loremarkov.rb', line 92

def self.start_prefix(text, num_prefix_words)
  lex(text[0, 999 * num_prefix_words])[0, num_prefix_words]
end

Instance Method Details

#analyze(text) ⇒ Object

Generate the Markov structure from text. The text should have a definite end, not just a convenient buffer split. This method may be called several times, but note that each call adds its own EOF to the Markov structure, and any one of them will trigger a conclusion by #generate_all.



# File 'lib/loremarkov.rb', line 115

def analyze(text)
  @markov.merge!(self.class.analyze(text, @num_prefix_words))
end
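
One detail worth knowing when calling this method several times: the listing above uses Hash#merge! without a block, and in that form a key present in both hashes takes the value from the argument. The sketch below shows that behavior on two small hand-written hashes, and how passing a block to merge could combine the successor lists instead. The sample hashes and variable names are hypothetical, and the block form is an alternative shown for illustration, not what the library does.

```ruby
# Illustration only: hand-written structures standing in for two analyze results.
first  = { ["a", " "] => ["b"] }
second = { ["a", " "] => ["c"], [" ", "c"] => [nil] }

# Without a block, the value from the second hash replaces the first.
overwritten = first.merge(second)

# With a block, both successor lists for a shared prefix can be kept.
combined = first.merge(second) { |_prefix, old, new| old + new }

p overwritten[["a", " "]] # => ["c"]
p combined[["a", " "]]    # => ["b", "c"]
```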

#destroy(text) ⇒ Object

Analyze the text, then generate new text starting from its first prefix. Do it, you know you want to.



# File 'lib/loremarkov.rb', line 134

def destroy(text)
  analyze(text)
  generate_all(self.class.start_prefix(text, @num_prefix_words))
end

#generate_all(start_prefix_words) ⇒ Object

Given the start prefix, generate words until EOF and return them joined into a single string.



# File 'lib/loremarkov.rb', line 125

def generate_all(start_prefix_words)
  words = start_prefix_words
  while tmp = generate_one(words[-1 * @num_prefix_words, @num_prefix_words])
    words << tmp
  end
  words.join
end

#generate_one(prefix_words) ⇒ Object

Generate the next word for a given prefix by picking one of its recorded successors at random.



# File 'lib/loremarkov.rb', line 120

def generate_one(prefix_words)
  @markov.fetch(prefix_words).sample
end
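
To make the generation loop concrete, here is a minimal sketch that walks a hand-written structure in the same way as the two listings above. The helper name `walk` and the sample structure are hypothetical and not part of Loremarkov. Every prefix in the sample has exactly one successor, so the result is deterministic even though Array#sample is used.

```ruby
# Hypothetical helper, not part of Loremarkov: repeatedly looks up the last
# n words and appends a randomly chosen successor until it draws nil (EOF).
def walk(chain, start_prefix, n)
  words = start_prefix.dup # copy, so the caller's array is left untouched
  while (next_word = chain.fetch(words.last(n)).sample)
    words << next_word
  end
  words.join
end

# Hand-written structure for the text "a b" with two-word prefixes.
chain = {
  ["a", " "] => ["b"],
  [" ", "b"] => [nil]
}

result = walk(chain, ["a", " "], 2)
p result # => "a b"
```

Note that Hash#fetch raises KeyError for a prefix that was never recorded, so the start prefix should come from analyzed text.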