markov_twitter

setup: installation

Either:

gem install markov_twitter

or add it to a Gemfile:

gem "markov_twitter"

After doing this, require it as usual:

require "markov_twitter"

setup: twitter integration

The source code of the gem (available on GitHub) includes a .env.example file that lists two environment variables. Both need to be set to the values provided by Twitter. To get these credentials, create an application on the Twitter developer console. Then create a file identical to .env.example, but named .env, in the root of your project, and add the credentials there. Finally, add the dotenv gem and call Dotenv.load before the credentials are read.
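
For example, a minimal sketch of the dotenv setup (this assumes dotenv is in the Gemfile and .env sits in the project root):

require "dotenv"
Dotenv.load # populates ENV from the .env file

require "markov_twitter"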

The two environment variables that are needed are TWITTER_API_KEY and TWITTER_SECRET_KEY. They can alternatively be set on a per-invocation basis using the env command in bash, e.g.:

env TWITTER_API_KEY=foo TWITTER_SECRET_KEY=bar ruby script.rb

Note that neither a callback URL nor any of the other OAuth settings on the Twitter dev console are necessary; the gem uses application-only authentication.

usage: TweetReader

First, initialize a MarkovTwitter::Authenticator:

authenticator = MarkovTwitter::Authenticator.new(
  api_key: ENV.fetch("TWITTER_API_KEY"),
  secret_key: ENV.fetch("TWITTER_SECRET_KEY")
)

Then initialize MarkovTwitter::TweetReader:

tweet_reader = MarkovTwitter::TweetReader.new(
  client: authenticator.client
)

Lastly, fetch some tweets for an arbitrary username. Note that the get_tweets method returns only the 20 most recent tweets; the gem doesn't provide a way to fetch more than that.

tweets = tweet_reader.get_tweets(username: "@accidental575")
puts tweets.map(&:text).first # the newest
# => "Jets fan who stands for /\nnational anthem sits on /\nAmerican flag /\n#accidentalhaiku by @Deadspin \nhttps://t.co/INsLlMB31G"

usage: MarkovBuilder

MarkovTwitter::MarkovBuilder is initialized with the list of tweet strings:

chain = MarkovTwitter::MarkovBuilder.new(
  phrases: tweets.map(&:text)
)

It internally stores the words in a #nodes hash whose keys are strings and whose values are Node instances. A Node is created for each whitespace-separated token; punctuation is treated like any other non-whitespace character.
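
For example, a quick way to inspect a node (a sketch; "the" is just an illustrative key, and the exact shape of #linkages is an assumption):

node = chain.nodes["the"] # nil if "the" never appeared in the input
node.value                # => "the"
node.linkages.keys        # => [:next, :prev] (assumed shape)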

The linkages between words are created automatically (see Node#linkages), and the chain can be evaluated right away to produce a randomly generated sentence. There are three built-in methods to evaluate the chain, and more can be constructed using the lower-level methods. The built-in methods differ in two ways:

  1. Do they build the result by walking along the :next or :prev nodes (forward or backward)?

  2. How do they pick the first node, and how do they choose a node when there are no more linkages along the given direction (:prev or :next)?

Here are those three methods:

  1. evaluate

    • traverses rightward along :next
    • when starting or stuck, picks any random word
    5.times.map { chain.evaluate length: 10 }
    # => [
    # "by @FlayrahNews https://t.co/LbxzPQ5Zqv back. / together with dung! / American",
    # "thought/ #accidentalhaiku by @news_24_365 https://t.co/kkfz5S3Kut pumpkin / Wes Anderson's Isle",
    # "has been in a lot about / #accidentalhaiku by @UrbanLion_",c
    # "them, my boyfriend used my friends. Or as / #accidentalhaiku",
    # "25 years... / feeling it today. / to write /"
    # ]
    
  2. evaluate_favoring_end

    • traverses leftward along :prev
    • when starting or stuck, picks a word that was at the end of one of the original phrases.
    • reverses the result before returning
    5.times.map { chain.evaluate_favoring_end length: 10 }
    # => [
    # "revolution / to improve care, / #accidentalhaiku by @Deadspin https://t.co/INsLlMB31G",
    # "to save the songs you thought/ #accidentalhaiku by @Mary_Mulan https://t.co/ixw2EQamHq",
    # "adventure / together with dung! / #accidentalhaiku by @Deadspin https://t.co/INsLlMB31G",
    # "harder / for / creativity? / #accidentalhaiku by @AlbertBrooks https://t.co/DzXbGeYh0Z",
    # "/ Asking for 25 years... / #accidentalhaiku by @StratfordON https://t.co/k81u693AbV"
    # ]
    
  3. evaluate_favoring_start

    • traverses rightward along :next
    • when starting or stuck, picks a word that was at the start of one of the original phrases.
    5.times.map { chain.evaluate_favoring_start length: 10 }
    # => [
    # "RT if you listened to / to get lost /",
    # "Jets fan who stands for / #accidentalhaiku by @theloniousdev https://t.co/6Rb5F8XySy   # ",
    # "The first trailer for / and never come back.    # /",
    # "Zooey Deschanel / and never come back. / house in   # ",
    # "Oh my friends. Or as / #accidentalhaiku by @timkaine https://t.co/4pgknpmom5   # "    
    # ]
    

Note that it is possible to manually change the lists of start and end nodes via MarkovBuilder#start_nodes and MarkovBuilder#end_nodes.
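
For example, a sketch of trimming the start nodes (this assumes #start_nodes is a mutable array of Node instances, which is an assumption about the internals):

# keep only phrase-starts that begin with a capital letter:
chain.start_nodes.select! { |node| node.value.match?(/\A[A-Z]/) }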

advanced usage: custom evaluator

The three previously mentioned methods all use _evaluate under the hood. This method supports any combination of the following keyword args (all are required except start_node and probability_bounds):

  • length
    number of nodes in the result
  • direction
    :next or :prev
  • start_node
    the node to use at the beginning (demonstrated below)
  • probability_bounds
    Array where 0 <= Int1 <= Int2 <= 100
    This is essentially used to "stack the deck", so to speak. Internally, smaller probabilities are checked first, so if A has a 50% likelihood and B/C/D/E/F each have a 10% likelihood, then one of B/C/D/E/F can be guaranteed by passing [0, 50] as the probability_bounds. This stacking is applied any time the program chooses a :next or :prev option.
  • node_finder
    A lambda which is run when the evaluator is starting or stuck. It gets passed random nodes one by one; the first node for which it returns a truthy value is used.

Note that _evaluate returns Node instances, so the values must be manually fetched and joined. Here's an example of providing a custom node_finder lambda so that all phrases in the result start with "the":

5.times.map do
  nodes = chain._evaluate(
    direction: :next,
    length: 10,
    node_finder: -> (node) {
      node.value.downcase == "the"
    }
  )
  nodes.map(&:value).join " "
end
# => [
# "the rain / #accidentalhaiku by @theloniousdev https://t.co/6Rb5F8XySy The first trailer",
# "The first trailer for / #accidentalhaiku by @shiku___ https://t.co/ZutjdsopAo the",
# "the songs you thought/ #accidentalhaiku by @Mary_Mulan https://t.co/ixw2EQamHq The first",
# "The first trailer for / #accidentalhaiku by @UrbanLion_ https://t.co/bvM6eeXGj5 The",
# "the rain / and start / I THOUGHT MY BOYFRIEND"
# ]
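
Similarly, start_node can pin the very first node of the result (a sketch; "Jets" is just an illustrative key, and node_finder is still supplied as the fallback for when the walk gets stuck):

nodes = chain._evaluate(
  direction: :next,
  length: 10,
  start_node: chain.nodes["Jets"],
  node_finder: -> (node) { true } # accept any random node when stuck
)
nodes.map(&:value).join " "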

advanced usage: linkage manipulation

There are manipulations available at the Node level (accessible through the MarkovBuilder#nodes hash). Keep in mind that there is only a single Node for each unique string. Many other nodes' linkages can reference it, but since there is still only a single object, each unique string has exactly one set of :next and :prev linkages emanating from it.

Although the core linkage data is accessible via Node#linkages and Node#total_num_inputs, it should not be manipulated directly through those references. Rather, use one of the following methods, which are automatically balancing: they keep the :next and :prev probabilities mirrored and ensure that each set of probabilities sums to 1. That is to say, if node1 is added as a :next linkage of node2, then node2 has its :next probabilities rebalanced and node1 has its :prev probabilities rebalanced.

  1. #add_next_linkage(child_node)
    adds a linkage in the :next direction or increases its likelihood
  2. #add_prev_linkage(parent_node)
    adds a linkage in the :prev direction or increases its likelihood
  3. #remove_next_linkage(child_node)
    removes a linkage in the :next direction or decreases its likelihood
  4. #remove_prev_linkage(parent_node)
    removes a linkage in the :prev direction or decreases its likelihood
  5. #add_linkage!(direction, other_node, probability)
    Force-sets the probability of a linkage. Adjusts the other probabilities so they still sum to 1.
  6. #remove_linkage!(direction, other_node)
    Completely removes a linkage as an option. Adjusts other probabilities so they still sum to 1.

All of these methods can be safely run many times. Note that remove_next_linkage and remove_prev_linkage do not completely remove the node from the list of options; they just decrement its probability by an amount determined by Node#total_num_inputs.
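
For example (a sketch; "the" and "rain" are illustrative keys into #nodes):

the_node = chain.nodes["the"]
rain_node = chain.nodes["rain"]

# make "rain" more likely to follow "the" (and "the" to precede "rain"):
the_node.add_next_linkage(rain_node)

# force-set the probability that "rain" follows "the" (0.75 is arbitrary):
the_node.add_linkage!(:next, rain_node, 0.75)

# remove "rain" as an option after "the" entirely:
the_node.remove_linkage!(:next, rain_node)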

development: code organization

The gem boilerplate was scaffolded using a gem I made, gemmyrb.

Test scripts are in the spec/ folder, although some helper methods are written into the application code at MarkovTwitter::TestHelperMethods.

The application code is in lib/.

Documentation is built with yard into doc/ and is viewable on rubydoc. Coverage is 100% at the time of writing. If the build reports that something is undocumented, run yard --list-undoc to find out what it is.

development: tests

To run the tests, install markov_twitter with the development dependencies:

gem install markov_twitter --development

Then run rspec in the root of the repo.

There are 40 test cases at the time of writing.

By default, WebMock prevents any real HTTP calls in the Twitter-related tests, but it can be disabled (so that real Twitter data is used) by running the test suite with an environment variable:

env DISABLE_WEBMOCK=true rspec

development: todos

Things which would be interesting to add:

  • dictionary-based search and replace
  • part-of-speech-based search and replace

performance

Here are benchmarks for indexing and evaluation over Moby Dick (~115k whitespace-delimited words/punctuation sequences). Since the program uses whitespace to separate words and treats punctuation like any other word, the starts and ends of phrases need to be defined manually. Here, phrases are split on empty lines (i.e., by paragraph).

(columns are user, system, total, and real time, in seconds)

loading the text into memory
 0.030000   0.000000   0.030000 (  0.101186)

adding the text to a markov chain
20.340000   0.070000  20.410000 ( 20.608080)

evaluating 10k words with random evaluator
 3.410000   0.000000   3.410000 (  3.456607)

evaluating 10k words with favor_next evaluator
 1.440000   0.000000   1.440000 (  1.471715)

evaluating 10k words with favor_prev evaluator
 3.540000   0.000000   3.540000 (  3.563398)
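
A minimal sketch of how these numbers might be reproduced with Ruby's Benchmark module (the mapping of the favor_next/favor_prev labels to evaluate_favoring_start/evaluate_favoring_end is assumed from their traversal directions):

require "benchmark"
require "markov_twitter"

text = nil
chain = nil

Benchmark.bm do |bm|
  bm.report("loading the text into memory") do
    text = File.read("spec/mobydick.txt")
  end
  bm.report("adding the text to a markov chain") do
    chain = MarkovTwitter::MarkovBuilder.new(phrases: text.split(/^\s?$/))
  end
  bm.report("evaluating 10k words with random evaluator") do
    chain.evaluate(length: 10_000)
  end
  bm.report("evaluating 10k words with favor_next evaluator") do
    chain.evaluate_favoring_start(length: 10_000)
  end
  bm.report("evaluating 10k words with favor_prev evaluator") do
    chain.evaluate_favoring_end(length: 10_000)
  end
end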

Here are some example results:

chain = MarkovTwitter::MarkovBuilder.new(
  phrases: File.read("spec/mobydick.txt").split(/^\s?$/)
)

chain.evaluate length: 50

"block, I began howling than in the sea as some fifteen thousand men ? ' And of something like a lightning-like hurtling whisper Starbuck now live in her only in an outline pictures life. I feel funny. Fa, la ! ' Pull, then, by what ye harpooneers ! the plainest"

chain.evaluate_favoring_start length: 50

" But we were locked within. For what does not to be in laying open his shipmates ; most part, that so prolonged, and fasces of miles off Patagonia, ipurcbasefc for all ; softly, and weaving and frankly admit that people to make tearless Lima has at all customary dinner"

chain.evaluate_favoring_end length: 50

"the best mode in two fellow-beings should rub each in chorus.) In short, and fetch that man may very readily passes through the sailors we found that but signifying nothing. That is a repugnance to me on his father's heathens. Arrived at present as much like the sea."

# prioritizing improbable linkages and starting each phrase with "I"
chain._evaluate(
  length: 50,
  direction: :next,
  probability_bounds: [0,5],
  node_finder: -> (node) { node.value == "I" }
).map(&:value).join(" ")

"I allow himself and bent upon us. This whale's heart.' I recall all glittering teeth -gnashing there. Further on, Ishmael, be cherishing unwarrantable prejudices against your thousand Patagonian sights and spears. Some say was darkened like Czar in Queequeg's arm did that typhoon on water when these weapons offensively, and"