EmailGraph

Build and analyze graph data structures from email history.

Focus is on identities and interactions between them, for example:

  • Who do I email with the most?
  • Who are my strong contacts, as identified by some two-way interaction threshold function?
  • What is the distribution of my email interactions with a given person over time?

Subject lines and message bodies are currently ignored for simplification.

Installation

Add this line to your application's Gemfile:

gem 'email_graph'

And then execute:

$ bundle

Or install it yourself as:

$ gem install email_graph

Graph types

There are two types of graphs to be built from email data.

1. Interaction graph

Class: EmailGraph::InteractionGraph

This is a directed graph where each vertex is an email address and each edge is an instance of EmailGraph::InteractionRelationship - a directed interaction history between two emails.

The graph has these properties:

  • Implements common graph methods (e.g., #vertices, #edges, etc)
  • Efficient fetching of vertices' in-edges (not always a default of graph structures)
  • Loops allowed

Given a message, an interaction is created from the sender to every address in the to, cc, and bcc fields - there is no distinction among the latter.

EmailGraph::InteractionRelationship objects have an interactions attribute, which holds an array of the Time objects of the interactions.

Example:

# Assuming you have an array 'messages' of message-like objects (see below for
# how to use the Gmail fetcher)
g = EmailGraph::InteractionGraph.new
messages.each{ |m| g.add_message(m) }

# ...or...

g = EmailGraph::InteractionGraph.new(messages: messages)

# For example, see a sorted list of your email contacts by emails sent
g.edges_from("[email protected]")
  .sort_by{ |e| -e.interactions.size }
  .map{ |e| [e.to, e.interactions.size] }

2. Mutual relationship graph

Class: EmailGraph::UndirectedGraph (just uses the abstract class)

This is an undirected graph where each vertex is also an email, however, this time, the edges are instances of EmailGraph::MutualRelationship - an undirected edge that similarly includes an interaction history (though an undirected one).

This graph is created from an EmailGraph::InteractionGraph by creating undirected edges from pairs of directed edge inverses. Optionally, a filter can be applied during this process to determine whether an undirected edge is created for a given pair of directed edges.

g = EmailGraph::InteractionGraph.new(messages: messages)

# This creates the graph using the default filter, which is that an edge has to
# have an inverse in order to create a new undirected edge.
mg = g.to_mutual_graph

# Alternatively, you can specify a custom filter. For example, this replicates
# the one used by A. Chapanond et al. in their analysis of emails* from the Enron
# case data set
filter = Proc.new do |e, e_inverse|
  if e && e_inverse
    counts = [e.interactions.size, e_inverse.interactions.size]
    counts.all?{ |c| c >= 6 } && counts.inject(:+) >= 30
  else
    false
  end
end
mg = g.to_mutual_graph(&filter)

*Chapanond, Anurat, Mukkai S. Krishnamoorthy, and Bülent Yener. "Graph theoretic and spectral analysis of Enron email data." Computational & Mathematical Organization Theory 11.3 (2005): 265-281.

Email normalization

You'll likely want to normalize email addresses before adding them to a graph. Otherwise, you'll end up with separate vertices for different capitalizations of the same address - not to mention differences with '.' placement and other issues.

EmailGraph::InteractionGraph will do this by default using SoundCloud's Normailize gem.

You can also pass your own email processing block on instantiation for the entire graph, or when calling #add_message.

Fetching emails

A fetcher for Gmail is included for convenience.

g = EmailGraph::InteractionGraph.new

email = "XXX"
# You'll need an OAuth2 access token with Gmail permissions. One way to get one
# is to use the Google Oauth Playground (https://developers.google.com/oauthplayground/)
# and under "Gmail API v1" authorize "https://mail.google.com/".
access_token = "XXX"

f = EmailGraph::GmailFetcher::Fetcher.new( email: email,
                                           access_token: access_token )

# This should cover all emails from that account. If no mailbox param is
# provided, defaults to Inbox.
mailboxes = ['[Gmail]/All Mail', '[Gmail]/Trash']
f.each_message(mailboxes: mailboxes){ |m| g.add_message(m) }

Contributing

  1. Fork it ( https://github.com/[my-github-username]/email_graph/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request