Class: Primal::InputTermExtraction

Inherits:
Object
Includes:
HTTParty
Defined in:
lib/primal/AlchemyAPIWrapper.rb

Overview

The InputTermExtraction class wraps AlchemyAPI (www.alchemyapi.com/) to extract the category, named entities, and keywords of Web pages, news articles, blog posts, or plain text.

The main function is getPrimalRequest, which accepts a string that represents a Web page URL or some text and builds a Primal topic URI that directs the user to the Primal Web App.

Constant Summary

@@debugMe =

Set this to true/false in order to turn on/off debugging of this class

true
@@alchemyRoot =
"http://access.alchemyapi.com/calls"
@@alchemyURL =
"#{@@alchemyRoot}/url"
@@alchemyText =
"#{@@alchemyRoot}/text"
@@entitiesLimit =

Change these limits to control how many entities and keywords go into the Primal request

1
@@keywordsLimit =
3
@@categoryIgnores =

We ignore keywords that intersect with entities of the following types:

{
  'person'            => 1,
  'organization'      => 1,
  'city'              => 1,
  'company'           => 1,
  'continent'         => 1,
  'country'           => 1,
  'region'            => 1,
  'stateorcountry'    => 1,
  'geographicfeature' => 1
}
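
As a rough illustration of how this table is used, the sketch below drops keywords that overlap with entities of an ignored type. The names IGNORED_ENTITY_TYPES and filter_keywords, the shortened type list, and the sample data are all hypothetical; only the overlap rule (substring match in either direction, on lowercased text) is taken from buildPrimalRequest.

```ruby
require 'set'

# Hypothetical stand-in for @@categoryIgnores (a subset, for brevity).
IGNORED_ENTITY_TYPES = Set['person', 'organization', 'city', 'company', 'country']

# Drops keywords that contain, or are contained in, the text of an entity
# whose type is in the ignore set. Comparison is done on lowercased text.
def filter_keywords(keywords, entities)
  ignored_texts = entities
    .select { |entity| IGNORED_ENTITY_TYPES.include?(entity['type'].downcase) }
    .map { |entity| entity['text'].downcase }

  keywords.map(&:downcase).reject do |keyword|
    ignored_texts.any? { |text| text.include?(keyword) || keyword.include?(text) }
  end
end

# Made-up sample data shaped like the Alchemy entity/keyword lists.
entities = [
  { 'type' => 'Person',     'text' => 'Ada Lovelace' },
  { 'type' => 'Technology', 'text' => 'Analytical Engine' }
]
keywords = ['Lovelace', 'analytical engine', 'punched cards']

filtered = filter_keywords(keywords, entities)
```

Here 'Lovelace' is dropped because it overlaps with a 'person' entity, while 'analytical engine' survives because its entity type is not in the ignore set.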

Constructor Details

#initialize(alchemyApiKey) ⇒ InputTermExtraction

Constructor for the InputTermExtraction class

Pass in the API key for Alchemy services. You can register for a free API key to test your application here:

http://www.alchemyapi.com/api/register.html

Please read the Alchemy API license before using it.



# File 'lib/primal/AlchemyAPIWrapper.rb', line 58

def initialize(alchemyApiKey)
  @alchemyApiKey = alchemyApiKey
end

Instance Method Details

#buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON) ⇒ Object

Builds a URI-encoded Primal topic path from the parsed Alchemy category, entities, and keywords responses. Returns nil if any of the three extractions failed.



# File 'lib/primal/AlchemyAPIWrapper.rb', line 163

def buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON)
  # Check if any of the extractions failed
  if !categoryJSON or !entitiesJSON or !keywordsJSON
     $stderr.puts "Cannot build Primal request. Alchemy failed to extract information."
     return nil
  end
  
  if @@debugMe
    $stderr.puts "Building Primal request..."
  end
 
  ### Get information required for building a Primal request
  # Get the category from the extracted data
  category = rewriteCategory(categoryJSON['category'])

  if @@debugMe
    $stderr.puts "Category = #{category}"
  end
      
  ### Select top entities from all extracted entities
  entitiesList = entitiesJSON['entities'].collect { |entity|
    entity['text'].downcase
  }[0, @@entitiesLimit]
  
  if @@debugMe
    prettified = entitiesJSON['entities'].collect { |entity|
      entity['text']
    }.join(', ')
    $stderr.puts "Entities = #{prettified}"
  end
  
  ### Select top keywords from all extracted keywords
  # Remove keywords that intersect with entities of the types in @@categoryIgnores
  allEntities = entitiesJSON['entities'].select { |entity|
    @@categoryIgnores.has_key? entity['type'].downcase
  }.collect { |entity|
    entity['text'].downcase
  } 
  
  if @@debugMe
    prettified = keywordsJSON['keywords'].collect { |keyword|
      keyword['text']
    }.join(', ')
    $stderr.puts "Keywords = #{prettified}"
  end
  
  keywordsList = keywordsJSON['keywords'].select { |keyword|
    normalizedKw = keyword['text'].downcase
    intersectsWithEntity = !(allEntities.select { |entity|
                                 entity.include? normalizedKw or normalizedKw.include? entity
                               }.empty?)
    # Ignore keywords > 4 words or those that intersect with entities
    normalizedKw.split.size < 5 && !intersectsWithEntity
  }.collect { |keyword|
    keyword['text'].downcase
  }
    
  # Remove repeated keywords
  keywordsList = getNonRepeatedKeywords(keywordsList)
  
  ### Build Primal topic URI
  primalRequest = ""
  unless category.nil?     then primalRequest = "/" + category end
  if entitiesList.size > 0 then primalRequest = primalRequest + "/" + entitiesList.join("/") end
  if keywordsList.size > 0 then primalRequest = primalRequest + "/" + keywordsList.join(";") end
  if @@debugMe
    $stderr.puts "Primal request = #{primalRequest}"
  end
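  # URI::encode was removed in Ruby 3.0; the default parser's escape is the closest equivalent
  URI::DEFAULT_PARSER.escape(primalRequest)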
end
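
The final assembly step can be sketched in isolation as follows. build_topic_path is a hypothetical helper, not part of the class, and the inputs are made up. It joins the parts the same way the method above does (entities separated by "/", keywords by ";") but leaves the result unencoded, whereas the original passes it through URI encoding before returning.

```ruby
# Hypothetical helper: assembles "/category/entity/kw1;kw2" from its parts.
# Each part is optional; missing parts are simply skipped.
def build_topic_path(category, entities, keywords)
  path = ''
  path += "/#{category}" unless category.nil?
  path += "/#{entities.join('/')}" unless entities.empty?
  path += "/#{keywords.join(';')}" unless keywords.empty?
  path
end

path = build_topic_path('technology', ['alan turing'], ['computability', 'halting problem'])
puts path
```

With these inputs the unencoded path is "/technology/alan turing/computability;halting problem". When the category is nil and there are no entities, the path starts directly with the keywords.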

#getAlchemy(serviceURL, parameters) ⇒ Object

Performs a GET request to the Alchemy service URL and returns the response parsed as a JSON object

Returns nil on error



# File 'lib/primal/AlchemyAPIWrapper.rb', line 313

def getAlchemy(serviceURL, parameters)
  response = self.class.get(serviceURL, parameters)
  returnAlchemyResponseJSON(response)
end

#getNonRepeatedKeywords(keywordsList) ⇒ Object

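Returns the top @@keywordsLimit keywords, skipping top-ranked keywords that are contained within other top-ranked keywords. A list with at most @@keywordsLimit keywords is returned unchanged.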



# File 'lib/primal/AlchemyAPIWrapper.rb', line 238

def getNonRepeatedKeywords(keywordsList)
  # If there is less than @@keywordsLimit, or if the first @@keywordsLimit keywords
  # are unique, return the first @@keywordsLimit keywords
  repeatedKeywords = getRepeatedKeywords(keywordsList[0, @@keywordsLimit])
  if keywordsList.length <= @@keywordsLimit or repeatedKeywords.empty?
    keywordsList[0, @@keywordsLimit] 
  else
    # Remove repeated elements from the first @@keywordsLimit keywords, and recursively
    # call this function
    getNonRepeatedKeywords(keywordsList - repeatedKeywords)
  end
end
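
To make the recursion easier to follow, here is a minimal standalone sketch with a worked input. The names KEYWORDS_LIMIT, contained_keywords, and top_distinct_keywords are hypothetical stand-ins for @@keywordsLimit, getRepeatedKeywords, and getNonRepeatedKeywords, and the keyword list is made up.

```ruby
KEYWORDS_LIMIT = 3 # stand-in for @@keywordsLimit

# Keywords that occur inside another, different keyword of the list.
def contained_keywords(keywords)
  keywords.select do |keyword|
    keywords.any? { |other| other != keyword && other.include?(keyword) }
  end
end

# Returns up to `limit` keywords. While the top `limit` candidates contain
# a keyword that is a substring of another candidate, that keyword is
# removed and the selection is retried on the shorter list.
def top_distinct_keywords(keywords, limit = KEYWORDS_LIMIT)
  head = keywords.first(limit)
  contained = contained_keywords(head)
  return head if keywords.length <= limit || contained.empty?

  # Each retry removes at least one keyword, so the recursion terminates.
  top_distinct_keywords(keywords - contained, limit)
end

ranked = ['machine learning', 'learning', 'neural networks', 'data']
top = top_distinct_keywords(ranked)
```

In this run 'learning' is removed because it is contained in 'machine learning', which lets 'data' move up into the top three. A list that is already within the limit, such as ['machine learning', 'learning'], comes back as is.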

#getPrimalRequest(urlOrText) ⇒ Object

Receives a string that represents a Web page URL or some text, and returns a Primal topic URI.

Returns nil on error



# File 'lib/primal/AlchemyAPIWrapper.rb', line 68

def getPrimalRequest(urlOrText)
  if isURI(urlOrText)
    getPrimalRequestURL(urlOrText)
  else
    getPrimalRequestTEXT(urlOrText)
  end
end

#getPrimalRequestTEXT(textToProcess) ⇒ Object

Sends the given text to Alchemy for category, named-entity, and keyword extraction, then translates the results into a Primal topic URI



# File 'lib/primal/AlchemyAPIWrapper.rb', line 128

def getPrimalRequestTEXT(textToProcess)
  if @@debugMe
    $stderr.puts "Extracting information from text..."
  end

  # get category
  categoryJSON = postAlchemy("#{@@alchemyText}/TextGetCategory",
                             :query => {
                               :outputMode => 'json',
                               :apikey => @alchemyApiKey,
                               :text => textToProcess
                            })

  # get entities
  entitiesJSON = postAlchemy("#{@@alchemyText}/TextGetRankedNamedEntities",
                             :query => {
                               :outputMode => 'json',
                               :apikey => @alchemyApiKey,
                               :text => textToProcess
                            })

  # get keywords
  keywordsJSON = postAlchemy("#{@@alchemyText}/TextGetRankedKeywords",
                             :query => {
                               :outputMode => 'json',
                               :apikey => @alchemyApiKey,
                               :text => textToProcess
                            })

  buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON)
end

#getPrimalRequestURL(urlToProcess) ⇒ Object

Sends the given URL to Alchemy for category, named-entity, and keyword extraction, then translates the results into a Primal topic URI



# File 'lib/primal/AlchemyAPIWrapper.rb', line 92

def getPrimalRequestURL(urlToProcess)
  if @@debugMe
    $stderr.puts "Extracting information from URL..."
  end

  # get category of the Web page
  categoryJSON = getAlchemy("#{@@alchemyURL}/URLGetCategory",
                            :query => {
                               :outputMode => 'json',
                               :apikey => @alchemyApiKey,
                               :url => urlToProcess
                           })

  # get entities in the Web page
  entitiesJSON = getAlchemy("#{@@alchemyURL}/URLGetRankedNamedEntities",
                            :query => {
                              :outputMode => 'json',
                              :apikey => @alchemyApiKey,
                              :url => urlToProcess
                           })

  # get keywords from the Web page
  keywordsJSON = getAlchemy("#{@@alchemyURL}/URLGetRankedKeywords",
                            :query => {
                              :outputMode => 'json',
                              :apikey => @alchemyApiKey,
                              :url => urlToProcess
                           })

  buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON)
end

#getRepeatedKeywords(keywordsList) ⇒ Object

Returns the keywords that occur as a substring of another, different keyword in the list



# File 'lib/primal/AlchemyAPIWrapper.rb', line 254

def getRepeatedKeywords(keywordsList)
  keywordsList.select { |keyword|
         not(keywordsList.select { |other|
           other != keyword and other.include? keyword
         }.empty?)
  }
end

#isURI(string) ⇒ Object

Indicates whether a given string parses as a URL with an http or https scheme



# File 'lib/primal/AlchemyAPIWrapper.rb', line 79

def isURI(string)
  uri = URI.parse(string)
  %w( http https ).include?(uri.scheme)
rescue URI::BadURIError
  false
rescue URI::InvalidURIError
  false
end
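
The behaviour of this check on a few inputs can be tried with the standalone sketch below. http_url? is a hypothetical copy of the method outside the class, and the sample strings are made up. Note that anything other than http or https, including other valid URI schemes, is treated as plain text.

```ruby
require 'uri'

# Hypothetical standalone version of the check: true only for strings that
# parse as a URI and carry an http or https scheme.
def http_url?(string)
  uri = URI.parse(string)
  %w[http https].include?(uri.scheme)
rescue URI::BadURIError, URI::InvalidURIError
  # Free text (for example, anything containing spaces) fails to parse.
  false
end

samples = [
  'http://www.example.com/article',
  'https://example.com',
  'ftp://example.com/file',
  'Some plain text about Ruby',
  'plain'
]
results = samples.map { |sample| http_url?(sample) }
```

Only the first two samples are routed to the URL branch of getPrimalRequest under this rule; the rest would be processed as text.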

#postAlchemy(serviceURL, parameters) ⇒ Object

Performs a POST request to the Alchemy service URL and returns the response parsed as a JSON object

Returns nil on error



# File 'lib/primal/AlchemyAPIWrapper.rb', line 302

def postAlchemy(serviceURL, parameters)
  response = self.class.post(serviceURL, parameters)
  returnAlchemyResponseJSON(response)
end

#returnAlchemyResponseJSON(response) ⇒ Object

Returns the response body parsed as a JSON object, or nil when the Alchemy status is not "OK"



# File 'lib/primal/AlchemyAPIWrapper.rb', line 322

def returnAlchemyResponseJSON(response)
  code = response.code
  body = response.body
  bodyJSON = JSON.parse(body)

  # A statusInfo field contains the details of the error 
  if bodyJSON['status'] != "OK"
    $stderr.puts bodyJSON['statusInfo']
    nil
  else
    bodyJSON
  end
end
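
The status check can be exercised without a network call using the sketch below. parse_alchemy_body is a hypothetical helper that takes a raw body string instead of an HTTParty response, and the sample bodies are made up. It also adds one guard the original does not have: a body that is not valid JSON is reported and mapped to nil instead of raising.

```ruby
require 'json'

# Hypothetical helper: returns the parsed body when status is "OK", else nil.
def parse_alchemy_body(body)
  parsed = JSON.parse(body)
  return parsed if parsed['status'] == 'OK'

  # The statusInfo field carries the error details.
  $stderr.puts parsed['statusInfo']
  nil
rescue JSON::ParserError => e
  # Extra guard, not in the original: treat non-JSON bodies as errors.
  $stderr.puts "Unparseable response: #{e.message}"
  nil
end

ok    = parse_alchemy_body('{"status":"OK","category":"science_technology"}')
error = parse_alchemy_body('{"status":"ERROR","statusInfo":"example error"}')
bad   = parse_alchemy_body('<html>Service unavailable</html>')
```

Only the first call yields a hash; the other two return nil, which is what buildPrimalRequest checks for before building the request.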

#rewriteCategory(category) ⇒ Object

Modifies the extracted category string to become a clear topic in the Primal request.

AlchemyAPI categorizes text into a limited set of category types.

See http://www.alchemyapi.com/api/categ/categs.html for a complete list.

Some of the category type names consist of two strings joined by an underscore character. This function selects one of the two strings (or an entirely new string) to be the topic in the Primal request.



# File 'lib/primal/AlchemyAPIWrapper.rb', line 274

def rewriteCategory(category)
  case category
  when "unknown"               # AlchemyAPI failed to classify the text
    category = nil
  when "arts_entertainment"    
    category = "arts"          # rewrite to 'arts'
  when "computer_internet"
    category = "technology"    # rewrite to 'technology', a clearer topic for this category
  when "culture_politics"
    category = "politics"      # rewrite to 'politics'
  when "law_crime"
    category = "law"           # rewrite to 'law'
  when "science_technology"
    category = "science"       # rewrite to 'science'
  else
    # The previous conditions should cover all the categories extracted by
    # Alchemy.  In case of a new category that contains an underscore, replace
    # it and keep the two words as the topic. 
    category = category.sub('_', ' ')
  end
end
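
The same mapping can also be expressed as a lookup table, which may be easier to extend. This is only a sketch of an alternative, not the class's implementation: CATEGORY_REWRITES and rewrite_category are hypothetical names, and the nil guard is an addition that the case statement above does not have.

```ruby
# Hypothetical lookup-table variant of the case statement above.
CATEGORY_REWRITES = {
  'arts_entertainment' => 'arts',
  'computer_internet'  => 'technology',
  'culture_politics'   => 'politics',
  'law_crime'          => 'law',
  'science_technology' => 'science'
}.freeze

# Returns the topic for a category, or nil when there is none to use.
def rewrite_category(category)
  # Added guard: a missing category is treated like "unknown".
  return nil if category.nil? || category == 'unknown'

  # Unlisted categories keep their name, with the first underscore
  # replaced by a space.
  CATEGORY_REWRITES.fetch(category) { category.sub('_', ' ') }
end

topics = %w[computer_internet unknown sports new_category].map do |name|
  rewrite_category(name)
end
```

For the four sample names this yields 'technology', nil, 'sports', and 'new category' respectively.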