Class: Loremarkov
- Inherits: Object
- Defined in: lib/loremarkov.rb
Overview
Loremarkov uses Markov chains to generate plausible-sounding text, given an input corpus. It comes with a few built-in sample texts.
It is based on Chapter 3 of Kernighan & Pike's *The Practice of Programming*.
Install Loremarkov with RubyGems:
gem install loremarkov
Once installed, the `destroy` command can be used to generate plausible-sounding text. The input text may be provided by filename, via STDIN, or by naming one of the built-in sample texts.
destroy lorem_ipsum
destroy ~/my_first_corpus.txt
man ls | destroy
Constant Summary
- TOKENS =
  ["\n", "\t", ' ', "'", '"']
  These tokens split the text into words. In contrast to Ruby's String#split, the tokens are included in the resulting array.
Instance Attribute Summary
- #markov ⇒ Object (readonly)
  Useful for testing, at the very least.
Class Method Summary
- .analyze(text, num_prefix_words) ⇒ Object
  Generate a Markov data structure from text.
- .lex(str, tokens = TOKENS) ⇒ Object
  Decompose text into an array of tokens and words.
- .sample_text(name) ⇒ Object
  Given `lorem_ipsum`, return the string from reading `text/lorem_ipsum`.
- .start_prefix(text, num_prefix_words) ⇒ Object
  Given the entire text, use an extremely conservative heuristic to grab only the first chunk to pass to lex.
Instance Method Summary
- #analyze(text) ⇒ Object
  Generate Markov structure from text.
- #destroy(text) ⇒ Object
  Do it, you know you want to.
- #generate_all(start_prefix_words) ⇒ Object
  Given the start prefix, generate words until EOF.
- #generate_one(prefix_words) ⇒ Object
  Generate the next word for a given prefix.
- #initialize(num_prefix_words) ⇒ Loremarkov (constructor)
  More prefix words mean tighter alignment to the original text.
Constructor Details
#initialize(num_prefix_words) ⇒ Loremarkov
More prefix words mean tighter alignment to the original text.

# File 'lib/loremarkov.rb', line 105

def initialize(num_prefix_words)
  @num_prefix_words = num_prefix_words
  @markov = {}
end
Instance Attribute Details
#markov ⇒ Object (readonly)
Useful for testing, at the very least.

# File 'lib/loremarkov.rb', line 102

def markov
  @markov
end
Class Method Details
.analyze(text, num_prefix_words) ⇒ Object
analyze - Generate a Markov data structure:
- Keys and values are arrays of strings
- Keys are prefixes: ordered word sequences of constant length
- Values are an accumulation of the next word after the prefix, however many times it occurs
- e.g. If a prefix occurs twice, then the value will be an array of two words, possibly the same word twice

# File 'lib/loremarkov.rb', line 73

def self.analyze(text, num_prefix_words)
  markov = {}
  words = lex(text)
  # Go through the possible valid prefixes. Adding 1 gives you the final
  # key: *num_prefix_words* words with a nil value -- signifying EOF
  (words.length - num_prefix_words + 1).times { |i|
    prefix_words = []
    num_prefix_words.times { |j| prefix_words << words[i + j] }
    # Set to empty array on a new prefix.
    # Add the target word, which will be nil on the last iteration
    markov[prefix_words] ||= []
    markov[prefix_words] << words[i + num_prefix_words]
  }
  markov
end
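As a concrete illustration (not part of the library itself), the accumulation behaviour can be seen by running lex and analyze, reproduced from the listings on this page as standalone methods, on a tiny corpus in which one prefix occurs twice:

```ruby
# Standalone sketch of Loremarkov.lex and Loremarkov.analyze, copied from
# the listings on this page, to show the shape of the Markov structure.
TOKENS = ["\n", "\t", ' ', "'", '"']

def lex(str, tokens = TOKENS)
  final_ary = []
  word = ''
  str.each_byte { |b|
    if tokens.include?(b.chr)
      final_ary << word unless word.empty?
      final_ary << b.chr
      word = ''
    else
      word << b.chr
    end
  }
  final_ary << word unless word.empty?
  final_ary
end

def analyze(text, num_prefix_words)
  markov = {}
  words = lex(text)
  (words.length - num_prefix_words + 1).times { |i|
    prefix_words = (0...num_prefix_words).map { |j| words[i + j] }
    markov[prefix_words] ||= []
    markov[prefix_words] << words[i + num_prefix_words]
  }
  markov
end

markov = analyze('the cat sat the cat ran', 2)
# The prefix ["cat", " "] occurs twice, so both successors accumulate:
markov[['cat', ' ']]   # => ["sat", "ran"]
# The final prefix maps to [nil], signifying EOF:
markov[[' ', 'ran']]   # => [nil]
```

Note that because TOKENS are themselves entries in the word stream, most prefixes end in a space token; it is prefixes like `["cat", " "]` that branch into multiple successor words.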
.lex(str, tokens = TOKENS) ⇒ Object
lex - Decompose text into an array of tokens and words. Words are merely the string of characters between the nearest two TOKENS, e.g.
lex %q{"Hello", he said.}
becomes
- %q{"} # TOKEN
- %q{Hello} # word
- %q{"} # TOKEN
- %q{,} # word
- %q{ } # TOKEN
- %q{he} # word
- %q{ } # TOKEN
- %q{said.} # word
This operation can be losslessly reversed by calling #join on the resulting array, i.e. `lex(str).join == str`.
# File 'lib/loremarkov.rb', line 45

def self.lex(str, tokens = TOKENS)
  final_ary = []
  word = ''
  # This code makes no attempt to deal with non-ASCII string encodings.
  # i.e. byte-per-char
  str.each_byte { |b|
    # This byte is either a token, thereby ending the current word
    # or it is part of the current word
    if tokens.include?(b.chr)
      final_ary << word if !word.empty?
      final_ary << b.chr
      word = ''
    else
      word << b.chr
    end
  }
  final_ary << word if !word.empty?
  final_ary
end
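The round-trip property can be checked directly. This is a standalone copy of the lex listing above (not an extra library method), run against the example string from the description:

```ruby
# Standalone copy of the lex listing above, to demonstrate that lexing
# is lossless: tokens are kept, so joining the array restores the input.
TOKENS = ["\n", "\t", ' ', "'", '"']

def lex(str, tokens = TOKENS)
  final_ary = []
  word = ''
  str.each_byte { |b|
    if tokens.include?(b.chr)
      final_ary << word unless word.empty?
      final_ary << b.chr
      word = ''
    else
      word << b.chr
    end
  }
  final_ary << word unless word.empty?
  final_ary
end

str = %q{"Hello", he said.}
lex(str)              # => ["\"", "Hello", "\"", ",", " ", "he", " ", "said."]
lex(str).join == str  # => true
```

The comma stays attached to no token because `,` is not in TOKENS; it becomes a one-character "word" between the closing quote token and the space token, exactly as the bullet list above shows.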
.sample_text(name) ⇒ Object
Given `lorem_ipsum`, return the string from reading `text/lorem_ipsum`.

# File 'lib/loremarkov.rb', line 97

def self.sample_text(name)
  File.read File.join(__dir__, '..', 'text', name)
end
.start_prefix(text, num_prefix_words) ⇒ Object
Given the entire text, use an extremely conservative heuristic to grab only the first chunk to pass to lex.

# File 'lib/loremarkov.rb', line 92

def self.start_prefix(text, num_prefix_words)
  lex(text[0, 999 * num_prefix_words])[0, num_prefix_words]
end
Instance Method Details
#analyze(text) ⇒ Object
Generate Markov structure from text. Text should have a definite end, not just a convenient buffer split. This method may be called several times, but note that several EOFs will be present in the markov structure, any one of which will trigger a conclusion by #generate_all.
# File 'lib/loremarkov.rb', line 115

def analyze(text)
  @markov.merge!(self.class.analyze(text, @num_prefix_words))
end
#destroy(text) ⇒ Object
Do it, you know you want to.

# File 'lib/loremarkov.rb', line 134

def destroy(text)
  analyze(text)
  generate_all(self.class.start_prefix(text, @num_prefix_words))
end
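To see the whole pipeline end to end, here is a self-contained sketch assembling the listings from this page into one class (any divergence from the published gem is unintentional). On a corpus where every prefix has exactly one successor, generation is deterministic and destroy reproduces its input verbatim:

```ruby
# Self-contained sketch of the class, assembled from the listings on this page.
class Loremarkov
  TOKENS = ["\n", "\t", ' ', "'", '"']
  attr_reader :markov

  def self.lex(str, tokens = TOKENS)
    final_ary = []
    word = ''
    str.each_byte { |b|
      if tokens.include?(b.chr)
        final_ary << word unless word.empty?
        final_ary << b.chr
        word = ''
      else
        word << b.chr
      end
    }
    final_ary << word unless word.empty?
    final_ary
  end

  def self.analyze(text, num_prefix_words)
    markov = {}
    words = lex(text)
    (words.length - num_prefix_words + 1).times { |i|
      prefix_words = (0...num_prefix_words).map { |j| words[i + j] }
      markov[prefix_words] ||= []
      markov[prefix_words] << words[i + num_prefix_words]
    }
    markov
  end

  def self.start_prefix(text, num_prefix_words)
    lex(text[0, 999 * num_prefix_words])[0, num_prefix_words]
  end

  def initialize(num_prefix_words)
    @num_prefix_words = num_prefix_words
    @markov = {}
  end

  def analyze(text)
    @markov.merge!(self.class.analyze(text, @num_prefix_words))
  end

  def destroy(text)
    analyze(text)
    generate_all(self.class.start_prefix(text, @num_prefix_words))
  end

  def generate_all(start_prefix_words)
    words = start_prefix_words
    while (tmp = generate_one(words[-1 * @num_prefix_words, @num_prefix_words]))
      words << tmp
    end
    words.join
  end

  def generate_one(prefix_words)
    @markov.fetch(prefix_words).sample
  end
end

# Every 2-word prefix in 'a b c' has exactly one successor, so generation
# is deterministic and the "destroyed" text equals the original:
Loremarkov.new(2).destroy('a b c')  # => "a b c"
```

With a larger corpus, prefixes that occur more than once give `generate_one` several successors to `sample` from, which is where the scrambling comes from.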
#generate_all(start_prefix_words) ⇒ Object
Given the start prefix, generate words until EOF
# File 'lib/loremarkov.rb', line 125

def generate_all(start_prefix_words)
  words = start_prefix_words
  while tmp = generate_one(words[-1 * @num_prefix_words, @num_prefix_words])
    words << tmp
  end
  words.join
end
#generate_one(prefix_words) ⇒ Object
Generate the next word for a given prefix
# File 'lib/loremarkov.rb', line 120

def generate_one(prefix_words)
  @markov.fetch(prefix_words).sample
end