DoverToCalais

DoverToCalais allows the user to send a wide range of data sources (files & URLs) to OpenCalais and receive asynchronous responses when OpenCalais has finished processing the inputs. In addition, DoverToCalais enables response filtering in order to find relevant tags and/or tag values.

What is OpenCalais?

In short, quoting the OpenCalais creators:
> “The OpenCalais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing (NLP), machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well.”

In general, the OpenCalais Simple XML Format (the one used by DoverToCalais) returns three kinds of tags: Entities, Events and Topics. Entities are static ‘things’, such as Persons, Places, etc., that are involved in the textual context in some capacity. OpenCalais assigns a relevance score to each entity to indicate its relevance within the context of the data source’s general topic. Events are facts or actions that pertain to one or more Entities. Topics are a characterisation or generic description of the data source’s context.

We can use these tags and the information within them to extract relevant information from the data or to draw useful conclusions about it. For example, if the data source tags include an <Event> with the value of ‘CompanyExpansion’, I can then look for the <City> or <Company> tags to find out which company is expanding and if it’s near my location (hint: they may be looking for more staff :)) Or, I could pick out all <Company>s involved in a <JointVenture>, or all <Person>s implicated in an <Arrest> in my <City>, etc.
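The reasoning above can be sketched in plain Ruby. The `Tag` struct, the `values_if_event` helper and the sample values are hypothetical stand-ins for parsed OpenCalais tags; they are not DoverToCalais types or real service output.

```ruby
# Hypothetical, hand-written tags standing in for a parsed OpenCalais response.
Tag = Struct.new(:name, :value, :relevance)

SAMPLE_TAGS = [
  Tag.new('Event',   'CompanyExpansion', nil),
  Tag.new('Company', 'Acme Widgets',     0.71),
  Tag.new('City',    'Dover',            0.42),
  Tag.new('Person',  'Jane Doe',         0.18)
].freeze

# Returns the values of all tags named `wanted`, but only when the tag list
# also contains an Event with the value `event`; otherwise returns [].
def values_if_event(tags, wanted, event)
  has_event = tags.any? { |t| t.name == 'Event' && t.value == event }
  return [] unless has_event

  tags.select { |t| t.name == wanted }.map(&:value)
end

puts values_if_event(SAMPLE_TAGS, 'Company', 'CompanyExpansion').inspect
puts values_if_event(SAMPLE_TAGS, 'City', 'CompanyExpansion').inspect
puts values_if_event(SAMPLE_TAGS, 'Company', 'JointVenture').inspect
```

The last call prints an empty array because the sample data contains no JointVenture event, so the condition is not met.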

Why use OpenCalais?

There are many reasons, mainly to:

  • incorporate tags into other applications, such as search, news aggregation, blogs, catalogs, etc.
  • enrich search by looking for deeper, contextual meaning instead of merely phrases or keywords.
  • help to discern relationships between semantic entities.
  • facilitate data processing and analysis by allowing easy identification of relevant or important data sources and the discarding of irrelevant ones.

DoverToCalais Features

  1. Multiple data source support: Thanks to the power of Yomu, DoverToCalais can process a vast range of files (and, of course, web pages), extract text from them and send them to OpenCalais for analysis and tag generation.

  2. Asynchronous responses (callbacks): Users can set callbacks to receive the processed meta-data, once the OpenCalais Web Service response has been received. Furthermore, a user can set multiple callbacks for the same request (data source), thus enabling cleaner, more modular code.

  3. Result filtering: DoverToCalais uses the OpenCalais Simple XML Format as the preferred response format. The user can work directly with the XML-formatted response or, if feeling a bit lazy, can take advantage of the DoverToCalais filtering functionality and receive specific entities, optionally based on specified conditions.
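To illustrate why several callbacks on one request (feature 2 above) keep code modular, here is a minimal sketch of the pattern in plain Ruby. `FakeRequest`, `on_response` and `complete` are hypothetical names invented for this illustration; they are not part of DoverToCalais, which relies on EventMachine for the real thing.

```ruby
# Hypothetical stand-in for a request object; NOT the DoverToCalais::Dover class.
class FakeRequest
  def initialize
    @callbacks = []
  end

  # Registers a block to run when the response arrives. Can be called many times.
  def on_response(&block)
    @callbacks << block
    self
  end

  # Simulates the response arriving: every registered callback receives it,
  # in registration order.
  def complete(response)
    @callbacks.each { |callback| callback.call(response) }
  end
end

CALLBACK_LOG = []

request = FakeRequest.new
# Two independent concerns, each in its own callback.
request.on_response { |r| CALLBACK_LOG << "logger saw #{r.length} characters" }
request.on_response { |r| CALLBACK_LOG << "parser saw #{r[0, 18]}" }

request.complete('<OpenCalaisSimple>...</OpenCalaisSimple>')
puts CALLBACK_LOG
```

Each callback handles one job (logging, parsing) and neither needs to know about the other.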

For more details of the features and code samples, see Usage.

Pre-requisites

To use the OpenCalais Web Service and, by extension, DoverToCalais, you need an OpenCalais API key, which is easily obtainable from the OpenCalais web site.

Also, DoverToCalais requires the presence of a working JRE.

Installation

Add this line to your application’s Gemfile:

gem 'dover_to_calais'

And then execute:

$ bundle

Or install it yourself as:

$ gem install dover_to_calais

Dependencies

DoverToCalais has been developed in Ruby 1.9.3 and relies on the following gems to work (installing with the gem command will automatically install all dependencies):

  • ‘nokogiri’, 1.6.0
  • ‘eventmachine’, 1.0.3
  • ‘em-http-request’, 1.1.0
  • ‘yomu’, 0.1.9

As Yomu depends on a working JRE in order to function, so does DoverToCalais.
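Since a missing JRE is an easy thing to overlook, a quick check before running anything can help. The following is a minimal sketch, assuming that a usable JRE exposes a `java` executable on the PATH; the `java_available?` helper is hypothetical and not part of DoverToCalais or Yomu.

```ruby
# Returns true when a `java` executable can be started from the current PATH.
# Kernel#system returns nil if the command cannot be run at all and false on
# a non-zero exit status, so anything other than true is treated as "missing".
def java_available?
  system('java', '-version', out: File::NULL, err: File::NULL) == true
end

if java_available?
  puts 'Java runtime found'
else
  puts 'No Java runtime found on the PATH'
end
```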

Usage

Using DoverToCalais is extremely simple.

The Basics

As DoverToCalais uses the awesome-ness of EventMachine, code must be placed within an EM run block:

```ruby
EM.run do

  # use Control + C to stop the EM
  Signal.trap('INT')  { EventMachine.stop }
  Signal.trap('TERM') { EventMachine.stop }

  # we need an API key to use OpenCalais
  DoverToCalais::API_KEY = 'my-opencalais-api-key'
  # create a new dover
  dover = DoverToCalais::Dover.new('http://www.bbc.co.uk/news/world-africa-24412315')
  # parse the text and send it to OpenCalais
  dover.analyse_this
  puts 'do some stuff....'
  # set a callback for when we receive a response
  dover.to_calais { |response| puts response.error ? response.error : response }

  puts 'do some more stuff....'

end
```

This will produce the following result:

do some stuff....
do some more stuff....

<OpenCalaisSimple>
……….
(the rest of the XML response from OpenCalais)

As can be observed, the callback (#to_calais) is triggered after the rest of the code has been executed and only when the OpenCalais request has been completed.

Of course, we can analyse more than one source at a time:

```ruby
EM.run do

  # use Control + C to stop the EM
  Signal.trap('INT')  { EventMachine.stop }
  Signal.trap('TERM') { EventMachine.stop }

  DoverToCalais::API_KEY = 'my-opencalais-api-key'

  # one Dover per data source: a web page, a local file and a network file
  d1 = DoverToCalais::Dover.new('http://www.bbc.co.uk/news/world-africa-24412315')
  d2 = DoverToCalais::Dover.new('/home/fred/Documents/RailsRecipes.pdf')
  d3 = DoverToCalais::Dover.new('//network-drive/annual_forecast.doc')

  d1.analyse_this; d2.analyse_this; d3.analyse_this

  puts 'do some stuff....'

  d1.to_calais { |response| puts response.error ? response.error : response }
  d2.to_calais { |response| puts response.error ? response.error : response }
  d3.to_calais { |response| puts response.error ? response.error : response }

  puts 'do some more stuff....'

end
```

This will output the two puts statements followed by the three callbacks (d1, d2, d3) in the order in which they are triggered, i.e. the first callback to receive a response from OpenCalais will fire first.
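The "first to finish fires first" behaviour can be reproduced without any network access. The sketch below uses plain Ruby threads rather than EventMachine, and the latencies are made-up values standing in for three requests of different duration; `completion_order` is a hypothetical helper.

```ruby
require 'thread'

# Hypothetical latencies (in seconds) standing in for three OpenCalais requests.
LATENCIES = { 'd1' => 0.3, 'd2' => 0.1, 'd3' => 0.2 }.freeze

# Starts one thread per simulated request and returns the request names in
# the order in which they finished.
def completion_order(latencies)
  finished = Queue.new
  workers = latencies.map do |name, delay|
    Thread.new do
      sleep delay       # pretend to wait for the web service
      finished << name  # the "callback": report completion
    end
  end
  workers.each(&:join)
  Array.new(latencies.size) { finished.pop }
end

puts completion_order(LATENCIES).inspect
```

Although d1 is started first, it has the longest simulated latency and therefore reports last.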

Filtering the response

Why parse the response XML ourselves when DoverToCalais can do it for us? We’ll just use the #filter method on the response object, passing a filtering hash:

    my_filter = {:entity => 'Entity1', :value => 'Value1', :given => {:entity => 'Entity2', :value => 'Value2'}}
    response.filter(my_filter)

The above tells DoverToCalais to look in the response for an entity called ‘Entity1’ with a value of ‘Value1’, only if the response contains an entity called ‘Entity2’ which has a value of ‘Value2’.
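To make these semantics concrete, here is a minimal sketch of how such a filtering hash could be evaluated against a list of items. It is an interpretation of the description above, not the gem's actual implementation; `Item` and `sketch_filter` are hypothetical names.

```ruby
# Hypothetical item type; the real gem returns DoverToCalais::ResponseItem.
Item = Struct.new(:name, :value)

# Returns the items matching filter[:entity] (and filter[:value], if present),
# but only when the :given condition (if any) is satisfied by at least one item.
def sketch_filter(items, filter)
  matches = lambda do |item, cond|
    item.name == cond[:entity] && (cond[:value].nil? || item.value == cond[:value])
  end

  given = filter[:given]
  return [] if given && items.none? { |item| matches.call(item, given) }

  items.select { |item| matches.call(item, filter) }
end

ITEMS = [
  Item.new('Entity1', 'Value1'),
  Item.new('Entity1', 'Other'),
  Item.new('Entity2', 'Value2')
].freeze

full_filter = { :entity => 'Entity1', :value => 'Value1',
                :given => { :entity => 'Entity2', :value => 'Value2' } }

puts sketch_filter(ITEMS, full_filter).map(&:value).inspect
puts sketch_filter(ITEMS, { :entity => 'Entity1' }).map(&:value).inspect
puts sketch_filter(ITEMS, { :entity => 'Entity1',
                            :given => { :entity => 'Entity3' } }).inspect
```

The third call returns an empty array because no item is called ‘Entity3’, so the :given condition fails before any matching is attempted.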

The conditional clause (:given) is optional; the filtering hash can be used in pretty much any permutation. For instance:

```ruby
EM.run do

  DoverToCalais::API_KEY = 'my-opencalais-api-key'

  dover = DoverToCalais::Dover.new('http://www.bbc.co.uk/news/world-africa-24412315')
  dover.analyse_this

  dover.to_calais do |response|
    if response.error
      puts response.error
    else
      puts response.filter({:entity => 'Company'})
    end
  end

end
```

This will pick out all entities tagged ‘Company’ from the data source. The output will be an Array of ResponseItem objects.

#<struct DoverToCalais::ResponseItem name="Company", value="BBC News", relevance=0.654, count=13, normalized=nil, importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Company", value="TV Radio", relevance=0.565, count=2, normalized="HERALD & WEEKLY-TV,RADIO OPS", importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Company", value="Reuters", relevance=0.255, count=2, normalized="THOMSON REUTERS GROUP LIMITED", importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Company", value="Twitter", relevance=0.395, count=1, normalized="TWITTER, INC.", importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Company", value="Huffington Post UK", relevance=0.136, count=1, normalized=nil, importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Company", value="Ireland Kenya", relevance=0.144, count=1, normalized=nil, importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Company", value="Yahoo! UK", relevance=0.144, count=1, normalized="YAHOO! UK LIMITED", importance=nil, originalValue=nil>

If this output looks a bit cluttered, we can easily tidy it up:

```ruby
EM.run do

  DoverToCalais::API_KEY = 'my-opencalais-api-key'

  dover = DoverToCalais::Dover.new('http://www.bbc.co.uk/news/world-africa-24412315')
  dover.analyse_this

  dover.to_calais do |response|
    if response.error
      puts response.error
    else
      items = response.filter(:entity => 'Company')
      # print one short line per company instead of the whole struct
      items.each do |item|
        puts "#{item.name}: #{item.value}, relevance = #{item.relevance}"
      end
    end
  end

end
```

Which will give us:

Company: BBC News, relevance = 0.656
Company: TV Radio, relevance = 0.566
Company: Reuters, relevance = 0.26
Company: Guardian.co.uk, relevance = 0.143
Company: Twitter, relevance = 0.399
Company: Huffington Post UK, relevance = 0.132
Company: Ireland Kenya, relevance = 0.139
Company: Yahoo! UK, relevance = 0.139

Let’s see if the data source refers to any business partnerships:

```ruby
EM.run do

  DoverToCalais::API_KEY = 'my-opencalais-api-key'

  dover = DoverToCalais::Dover.new('http://www.bbc.co.uk/news/technology-24380202')
  dover.analyse_this

  dover.to_calais do |response|
    if response.error
      puts response.error
    else
      # match on both the tag name and its value
      items = response.filter(:entity => 'Event', :value => 'Business Partnership')
      puts "There are #{items.length} events like that in the source"
    end
  end

end
```

which will produce:

There are 1 events like that in the source

Now let’s find all companies involved in any business partnerships:

```ruby
EM.run do

  DoverToCalais::API_KEY = 'my-opencalais-api-key'

  dover = DoverToCalais::Dover.new('http://www.bbc.co.uk/news/technology-24380202')
  dover.analyse_this

  dover.to_calais do |response|
    if response.error
      puts response.error
    else
      # companies are returned only if the source contains such an event
      items = response.filter(:entity => 'Company',
                              :given => {:entity => 'Event', :value => 'Business Partnership'})
      items.each do |item|
        puts "#{item.name}: #{item.value} a.k.a #{item.normalized}, relevance = #{item.relevance}"
      end
    end
  end

end
```

which gives us:

Company: BBC News a.k.a , relevance = 0.678
Company: Google a.k.a GOOGLE INC., relevance = 0.508
Company: Flutter a.k.a FLUTTER COM INC, relevance = 0.531
Company: TV Radio a.k.a HERALD & WEEKLY-TV,RADIO OPS, relevance = 0.558
Company: Microsoft a.k.a MICROSOFT CORPORATION, relevance = 0.303
Company: Adobe a.k.a ADOBE SYSTEMS INCORPORATED, relevance = 0.193
Company: Netflix a.k.a NETFLIX, INC., relevance = 0.301
Company: Y Combinator a.k.a Y Combinator, relevance = 0.258
Company: Nintendo a.k.a Nintendo Co., Ltd., relevance = 0.286
Company: Samsung a.k.a Samsung C&T Corporation, relevance = 0.285
Company: Glyndwr University a.k.a , relevance = 0.269

At this point, someone may ask: “But what if we want to get more than one entity for a given condition? The filter hash doesn’t allow that!”

No, it doesn’t. However, given that filtering is done on the whole response after it’s been received, we can apply many filters to the same response:

```ruby
EM.run do

  DoverToCalais::API_KEY = 'my-opencalais-api-key'

  dover = DoverToCalais::Dover.new('http://www.bbc.co.uk/news/technology-24380202')
  dover.analyse_this

  dover.to_calais do |response|
    if response.error
      puts response.error
    else
      result1 = response.filter(:entity => 'Company', :value => 'Google',
                                :given => {:entity => 'Technology', :value => 'gesture recognition'})
      result2 = response.filter(:entity => 'Product',
                                :given => {:entity => 'Technology', :value => 'gesture recognition'})
      # merge both result arrays
      puts result1 | result2
    end
  end

end
```

Which will give us all the gesture-recognition products that Google is associated with according to our data source:

#<struct DoverToCalais::ResponseItem name="Company", value="Google", relevance=0.506, count=7, normalized="GOOGLE INC.", importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Product", value="Xbox Kinect", relevance=0.286, count=1, normalized=nil, importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Product", value="Galaxy S4 smartphone", relevance=0.282, count=1, normalized=nil, importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Product", value="Wii", relevance=0.286, count=1, normalized=nil, importance=nil, originalValue=nil>
#<struct DoverToCalais::ResponseItem name="Product", value="Galaxy S4", relevance=0.282, count=1, normalized=nil, importance=nil, originalValue=nil>
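The `|` in `result1 | result2` is Ruby's Array union operator: it concatenates both arrays and drops duplicates. Because Struct instances with equal members compare as equal, an item that appears in both filter results is kept only once. A small sketch, using a hypothetical `Hit` struct that mirrors a few ResponseItem fields and a hypothetical `union_of` helper:

```ruby
# Hypothetical item type mirroring a few ResponseItem fields.
Hit = Struct.new(:name, :value, :relevance)

# Returns the union of any number of result arrays, without duplicates.
def union_of(*results)
  results.reduce([], :|)
end

result1 = [Hit.new('Company', 'Google', 0.506)]
result2 = [Hit.new('Product', 'Xbox Kinect', 0.286),
           Hit.new('Company', 'Google', 0.506)] # same members as in result1

combined = union_of(result1, result2)
puts combined.map(&:value).inspect
```

Using `+` instead of `|` would keep the duplicate Google entry.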

PS: If you’re not sure about the names or values of the tags you want to filter, you can get a listing with the following Constants:

    CalaisOntology::CALAIS_ENTITIES
    CalaisOntology::CALAIS_EVENTS
    CalaisOntology::CALAIS_TOPICS

Code samples

More examples of using DoverToCalais can be found as GitHub Gists:

Using DoverToCalais to semantically tag all files in a directory
Use DoverToCalais to find all Persons or Organizations with a relevance score greater than 0.1, if the data source contains an environmental event

Using a Proxy

If you’re behind a corporate firewall and the only way to reach outside is through a proxy then you need to set the DoverToCalais::PROXY constant:

    DoverToCalais::PROXY = :proxy => {
      :host => 'www.myproxy.com',
      :port => 8080,
      :authorization => ['username', 'password'] # optional
    }

If you’re connecting through a SOCKS5 Proxy just set the :type key to :socks5.

    DoverToCalais::PROXY = :proxy => {
      :host => 'www.myproxy.com',
      :port => 8080,
      :type => :socks5
    }

Documentation

Comprehensive documentation can be found at http://rubydoc.info/gems/dover_to_calais.

Testing

A list of Cucumber features and scenarios can be found in the features directory. The list is far from exhaustive, so feel free to add your own scenarios and steps.

To run the tests, there is already a rake task set up. Just type:

rake features API_KEY='my_api_key'

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

Changelog

  • 07-Oct-2013 Version: 0.1.0
    Initial release
  • 10-Feb-2014 Version: 0.1.1
    Improved Response error message
  • 10-Feb-2014 Version: 0.2.0
    Added #analyse_this to public interface