Class: Primal::InputTermExtraction
- Inherits: Object
  - Object
  - Primal::InputTermExtraction
- Includes: HTTParty
- Defined in: lib/primal/AlchemyAPIWrapper.rb
Overview
The InputTermExtraction class wraps access to AlchemyAPI (www.alchemyapi.com/) to extract important information from Web pages, news articles, blog posts, and plain text.
The main function is getPrimalRequest, which accepts a string that represents either a Web page URL or some text, and builds a Primal topic URI that will direct the user to the Primal Web App.
Constant Summary
- @@debugMe = true
  Set this to true/false in order to turn on/off debugging of this class
- @@alchemyRoot = "http://access.alchemyapi.com/calls"
- @@alchemyURL = "#{@@alchemyRoot}/url"
- @@alchemyText = "#{@@alchemyRoot}/text"
- @@entitiesLimit = 1
  Change these variables to modify how the Primal request is built
- @@keywordsLimit = 3
- @@categoryIgnores = { 'person' => 1, 'organization' => 1, 'city' => 1, 'company' => 1, 'continent' => 1, 'country' => 1, 'region' => 1, 'stateorcountry' => 1, 'geographicfeature' => 1 }
  We ignore keywords that intersect with entities of the following types
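To illustrate how @@categoryIgnores is used, here is a minimal standalone sketch of the overlap rule applied in buildPrimalRequest. It is not the class's own code: IGNORED_TYPES is a shortened, hypothetical stand-in for @@categoryIgnores, keywords_outside_ignored_entities is a made-up helper name, and the sample entities and keywords are invented. Only the overlap rule is shown; the word-count limit from buildPrimalRequest is left out.

```ruby
# Hypothetical, shortened stand-in for @@categoryIgnores.
IGNORED_TYPES = { 'person' => 1, 'city' => 1, 'company' => 1 }.freeze

# Returns the lower-cased keywords that do not overlap (in either direction)
# with the text of any entity whose type is in IGNORED_TYPES.
def keywords_outside_ignored_entities(entities, keywords)
  ignored = entities
    .select { |entity| IGNORED_TYPES.key?(entity['type'].downcase) }
    .map { |entity| entity['text'].downcase }

  normalized = keywords.map { |keyword| keyword['text'].downcase }
  normalized.reject do |kw|
    ignored.any? { |entity| entity.include?(kw) || kw.include?(entity) }
  end
end

# Invented sample data, shaped like the 'entities' and 'keywords' arrays.
entities = [
  { 'type' => 'Person',     'text' => 'Ada Lovelace' },
  { 'type' => 'Technology', 'text' => 'Analytical Engine' }
]
keywords = [
  { 'text' => 'Lovelace' },
  { 'text' => 'punched cards' },
  { 'text' => 'Analytical Engine design' }
]

puts keywords_outside_ignored_entities(entities, keywords).inspect
```

In this sample, "Lovelace" is dropped because it is contained in a person entity, while "Analytical Engine design" survives because the matching entity's type is not in the ignore list.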
Instance Method Summary
- #buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON) ⇒ Object
  Uses the deconstructed Alchemy information to create a valid Primal URL.
- #getAlchemy(serviceURL, parameters) ⇒ Object
  Perform a GET request to Alchemy service URL and return the response as a JSON object.
- #getNonRepeatedKeywords(keywordsList) ⇒ Object
  Returns the top @@keywordsLimit keywords, ignoring those contained within other keywords.
- #getPrimalRequest(urlOrText) ⇒ Object
  Receives a string that represents a Web page URL or some text, and returns a Primal topic URI.
- #getPrimalRequestTEXT(textToProcess) ⇒ Object
  Processes the given Text at Alchemy and then translates the results to a valid Primal URL.
- #getPrimalRequestURL(urlToProcess) ⇒ Object
  Processes the given URL at Alchemy and then translates the results to a valid Primal URL.
- #getRepeatedKeywords(keywordsList) ⇒ Object
  Returns any repeated words in the list.
- #initialize(alchemyApiKey) ⇒ InputTermExtraction (constructor)
  Constructor for the InputTermExtraction class.
- #isURI(string) ⇒ Object
  Indicates whether or not a given string represents a URL.
- #postAlchemy(serviceURL, parameters) ⇒ Object
  Perform a POST request to Alchemy service URL and return the response as a JSON object.
- #returnAlchemyResponseJSON(response) ⇒ Object
  Return the body of the response in a JSON object or nil on error.
- #rewriteCategory(category) ⇒ Object
  Modifies the extracted category string to become a clear topic in the Primal request.
Constructor Details
#initialize(alchemyApiKey) ⇒ InputTermExtraction
Constructor for the InputTermExtraction class
Pass in the API key for Alchemy services. To test your application, you can register for a free API key here:
http://www.alchemyapi.com/api/register.html
Please read the Alchemy API license.
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 58
    def initialize(alchemyApiKey)
      @alchemyApiKey = alchemyApiKey
    end
Instance Method Details
#buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON) ⇒ Object
Uses the deconstructed Alchemy information to create a valid Primal URL
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 163
    def buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON)
      # Check if any of the extractions failed
      if !categoryJSON or !entitiesJSON or !keywordsJSON
        $stderr.puts "Cannot build Primal request. Alchemy failed to extract information."
        return nil
      end

      if @@debugMe
        $stderr.puts "Building Primal request..."
      end

      ### Get information required for building a Primal request

      # Get the category from the extracted data
      category = rewriteCategory(categoryJSON['category'])
      if @@debugMe
        $stderr.puts "Category = #{category}"
      end

      ### Select top entities from all extracted entities
      entitiesList = entitiesJSON['entities'].collect { |entity|
        entity['text'].downcase
      }[0, @@entitiesLimit]
      if @@debugMe
        prettified = entitiesJSON['entities'].collect { |entity| entity['text'] }.join(', ')
        $stderr.puts "Entities = #{prettified}"
      end

      ### Select top keywords from all extracted keywords

      # Remove keywords that intersect with entities of the types in @@categoryIgnores
      allEntities = entitiesJSON['entities'].select { |entity|
        @@categoryIgnores.has_key? entity['type'].downcase
      }.collect { |entity|
        entity['text'].downcase
      }
      if @@debugMe
        prettified = keywordsJSON['keywords'].collect { |keyword| keyword['text'] }.join(', ')
        $stderr.puts "Keywords = #{prettified}"
      end
      keywordsList = keywordsJSON['keywords'].select { |keyword|
        normalizedKw = keyword['text'].downcase
        intersectsWithEntity = !(allEntities.select { |entity|
          entity.include? normalizedKw or normalizedKw.include? entity
        }.empty?)
        # Ignore keywords > 4 words or those that intersect with entities
        normalizedKw.split.size < 5 && !intersectsWithEntity
      }.collect { |keyword|
        keyword['text'].downcase
      }

      # Remove repeated keywords
      keywordsList = getNonRepeatedKeywords(keywordsList)

      ### Build Primal topic URI
      primalRequest = ""
      unless category.nil? then
        primalRequest = "/" + category
      end
      if entitiesList.size > 0 then
        primalRequest = primalRequest + "/" + entitiesList.join("/")
      end
      if keywordsList.size > 0 then
        primalRequest = primalRequest + "/" + keywordsList.join(";")
      end
      if @@debugMe
        $stderr.puts "Primal request = #{primalRequest}"
      end
      URI::encode(primalRequest)
    end
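The final assembly step can be sketched on its own. This is an illustrative sketch under stated assumptions, not the method above: build_topic_path is a hypothetical helper, the inputs are invented, and each term is percent-encoded with ERB::Util.url_encode before joining, which is a substitute for the URI::encode call in the listing (URI.encode was removed in Ruby 3.0).

```ruby
require 'erb'

# Assembles "/category/entity1/entity2/keyword1;keyword2", skipping any
# part that is missing. Terms are encoded one by one so that the "/" and
# ";" separators themselves stay unescaped.
def build_topic_path(category, entities, keywords)
  encode = ->(term) { ERB::Util.url_encode(term) }
  path = ''
  path += '/' + encode.call(category) unless category.nil?
  path += '/' + entities.map(&encode).join('/') unless entities.empty?
  path += '/' + keywords.map(&encode).join(';') unless keywords.empty?
  path
end

puts build_topic_path('science', ['ada lovelace'], ['punched cards', 'difference engine'])
```

Entities are joined with "/" and keywords with ";", matching the separators used in the listing. Encoding per term is a design choice of this sketch; it also escapes any "/" or ";" that happens to appear inside a term, which URI::encode would not have done.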
#getAlchemy(serviceURL, parameters) ⇒ Object
Perform a GET request to Alchemy service URL and return the response as a JSON object
Returns nil on error
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 313
    def getAlchemy(serviceURL, parameters)
      response = self.class.get(serviceURL, parameters)
      returnAlchemyResponseJSON(response)
    end
#getNonRepeatedKeywords(keywordsList) ⇒ Object
Returns the top @@keywordsLimit keywords, ignoring those contained within other keywords
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 238
    def getNonRepeatedKeywords(keywordsList)
      # If there is less than @@keywordsLimit, or if the first @@keywordsLimit keywords
      # are unique, return the first @@keywordsLimit keywords
      repeatedKeywords = getRepeatedKeywords(keywordsList[0, @@keywordsLimit])
      if keywordsList.length <= @@keywordsLimit or repeatedKeywords.empty?
        keywordsList[0, @@keywordsLimit]
      else
        # Remove repeated elements from the first @@keywordsLimit keywords, and recursively
        # call this function
        getNonRepeatedKeywords(keywordsList - repeatedKeywords)
      end
    end
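The recursion can be traced with a small standalone sketch. The names below (KEYWORDS_LIMIT, repeated_keywords, non_repeated_keywords) are hypothetical stand-ins for @@keywordsLimit and the two instance methods, and the keyword list is invented.

```ruby
KEYWORDS_LIMIT = 3 # stand-in for @@keywordsLimit

# Keywords that are contained in some other, different keyword of the list.
def repeated_keywords(list)
  list.select { |kw| list.any? { |other| other != kw && other.include?(kw) } }
end

# Takes the first `limit` keywords; if any of them is contained in another
# keyword of that window, removes it from the whole list and tries again.
def non_repeated_keywords(list, limit = KEYWORDS_LIMIT)
  top = list[0, limit]
  repeated = repeated_keywords(top)
  return top if list.length <= limit || repeated.empty?

  non_repeated_keywords(list - repeated, limit)
end

keywords = ['machine learning', 'learning', 'neural networks', 'data', 'big data']
puts non_repeated_keywords(keywords).inspect
```

On the first pass "learning" is removed because it is contained in "machine learning"; the second pass returns "machine learning", "neural networks" and "data". Two properties of the logic are worth noting: only the leading window is checked, so "data" is kept even though "big data" appears later in the list, and a list no longer than the limit is returned as-is even if it contains overlapping keywords.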
#getPrimalRequest(urlOrText) ⇒ Object
Receives a string that represents a Web page URL or some text, and returns a Primal topic URI.
Returns nil on error
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 68
    def getPrimalRequest(urlOrText)
      if isURI(urlOrText)
        getPrimalRequestURL(urlOrText)
      else
        getPrimalRequestTEXT(urlOrText)
      end
    end
#getPrimalRequestTEXT(textToProcess) ⇒ Object
Processes the given Text at Alchemy and then translates the results to a valid Primal URL
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 128
    def getPrimalRequestTEXT(textToProcess)
      if @@debugMe
        $stderr.puts "Extracting information from text..."
      end

      # get category
      categoryJSON = postAlchemy("#{@@alchemyText}/TextGetCategory", :query => {
        :outputMode => 'json',
        :apikey => @alchemyApiKey,
        :text => textToProcess
      })

      # get entities
      entitiesJSON = postAlchemy("#{@@alchemyText}/TextGetRankedNamedEntities", :query => {
        :outputMode => 'json',
        :apikey => @alchemyApiKey,
        :text => textToProcess
      })

      # get keywords
      keywordsJSON = postAlchemy("#{@@alchemyText}/TextGetRankedKeywords", :query => {
        :outputMode => 'json',
        :apikey => @alchemyApiKey,
        :text => textToProcess
      })

      buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON)
    end
#getPrimalRequestURL(urlToProcess) ⇒ Object
Processes the given URL at Alchemy and then translates the results to a valid Primal URL
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 92
    def getPrimalRequestURL(urlToProcess)
      if @@debugMe
        $stderr.puts "Extracting information from URL..."
      end

      # get category of the Web page
      categoryJSON = getAlchemy("#{@@alchemyURL}/URLGetCategory", :query => {
        :outputMode => 'json',
        :apikey => @alchemyApiKey,
        :url => urlToProcess
      })

      # get entities in the Web page
      entitiesJSON = getAlchemy("#{@@alchemyURL}/URLGetRankedNamedEntities", :query => {
        :outputMode => 'json',
        :apikey => @alchemyApiKey,
        :url => urlToProcess
      })

      # get keywords from the Web page
      keywordsJSON = getAlchemy("#{@@alchemyURL}/URLGetRankedKeywords", :query => {
        :outputMode => 'json',
        :apikey => @alchemyApiKey,
        :url => urlToProcess
      })

      buildPrimalRequest(categoryJSON, entitiesJSON, keywordsJSON)
    end
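The three calls differ only in the endpoint name, which the following sketch makes explicit. It builds the request descriptions without sending anything. ALCHEMY_URL and url_extraction_requests are hypothetical names used for illustration, and the API key and page URL are placeholders; the endpoint names and query keys are taken from the listing.

```ruby
ALCHEMY_URL = 'http://access.alchemyapi.com/calls/url' # stand-in for @@alchemyURL

# Returns one [service_url, options] pair per extraction call, in the order
# category, entities, keywords. No HTTP request is made here.
def url_extraction_requests(api_key, page_url)
  query = { :outputMode => 'json', :apikey => api_key, :url => page_url }
  %w[URLGetCategory URLGetRankedNamedEntities URLGetRankedKeywords].map do |call|
    ["#{ALCHEMY_URL}/#{call}", { :query => query }]
  end
end

url_extraction_requests('YOUR_API_KEY', 'http://example.com/article').each do |service_url, options|
  puts "#{service_url} #{options[:query][:url]}"
end
```

Each pair has the same shape as the arguments passed to getAlchemy in the listing.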
#getRepeatedKeywords(keywordsList) ⇒ Object
Returns the keywords in the list that are contained within another, different keyword in the same list
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 254
    def getRepeatedKeywords(keywordsList)
      keywordsList.select { |keyword|
        not(keywordsList.select { |other|
          other != keyword and other.include? keyword
        }.empty?)
      }
    end
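A short sketch of what counts as "repeated" here, assuming a hypothetical standalone helper contained_keywords with the same selection rule and invented inputs:

```ruby
# A keyword is selected when some other, non-identical keyword contains it
# as a substring.
def contained_keywords(list)
  list.select { |kw| list.any? { |other| other != kw && other.include?(kw) } }
end

puts contained_keywords(['data', 'big data', 'cloud']).inspect # "data" is inside "big data"
puts contained_keywords(['cloud', 'cloud']).inspect            # exact duplicates are not flagged
```

The match is a plain substring test, so a keyword can also be flagged when it appears inside a longer word of another keyword, not only as a whole word.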
#isURI(string) ⇒ Object
Indicates whether or not a given string represents a URL
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 79
    def isURI(string)
      uri = URI.parse(string)
      %w( http https ).include?(uri.scheme)
    rescue URI::BadURIError
      false
    rescue URI::InvalidURIError
      false
    end
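The check can be tried in isolation with a sketch like the following. http_url? is a hypothetical standalone equivalent; it rescues URI::Error, the common parent of the two error classes handled in the listing, and the inputs are invented.

```ruby
require 'uri'

# True only for strings that parse as a URI with an http or https scheme.
def http_url?(string)
  uri = URI.parse(string)
  %w[http https].include?(uri.scheme)
rescue URI::Error
  false
end

['http://example.com/page', 'ftp://example.com/file', 'example.com', 'some plain text'].each do |input|
  puts "#{input.inspect} => #{http_url?(input)}"
end
```

Only the first input is treated as a URL. A bare host name such as "example.com" parses without a scheme and text containing spaces fails to parse, so under the dispatch rule in getPrimalRequest both would be handled as plain text.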
#postAlchemy(serviceURL, parameters) ⇒ Object
Perform a POST request to Alchemy service URL and return the response as a JSON object
Returns nil on error
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 302
    def postAlchemy(serviceURL, parameters)
      response = self.class.post(serviceURL, parameters)
      returnAlchemyResponseJSON(response)
    end
#returnAlchemyResponseJSON(response) ⇒ Object
Return the body of the response in a JSON object or nil on error
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 322
    def returnAlchemyResponseJSON(response)
      code = response.code
      body = response.body
      bodyJSON = JSON.parse(body)

      # A statusInfo field contains the details of the error
      if bodyJSON['status'] != "OK"
        puts bodyJSON['statusInfo']
        nil
      else
        bodyJSON
      end
    end
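The status check can be sketched against raw response bodies. parse_alchemy_body is a hypothetical helper, and the sample bodies, including the statusInfo text, are invented for illustration. Unlike the listing, this sketch also rescues JSON::ParserError, so a non-JSON body yields nil instead of raising.

```ruby
require 'json'

# Returns the parsed body when status is "OK"; otherwise reports the
# statusInfo detail on stderr and returns nil.
def parse_alchemy_body(body)
  parsed = JSON.parse(body)
  return parsed if parsed['status'] == 'OK'

  warn parsed['statusInfo']
  nil
rescue JSON::ParserError => e
  # Assumption of this sketch: treat an unparseable body as an error too.
  warn "Unparseable response: #{e.message}"
  nil
end

ok_body    = '{"status":"OK","category":"science_technology"}'
error_body = '{"status":"ERROR","statusInfo":"sample error detail"}'

puts parse_alchemy_body(ok_body).inspect
puts parse_alchemy_body(error_body).inspect
```

Returning nil on failure is what lets buildPrimalRequest detect a failed extraction with its initial nil check.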
#rewriteCategory(category) ⇒ Object
Modifies the extracted category string to become a clear topic in the Primal request.
AlchemyAPI categorizes text into a limited set of category types.
See http://www.alchemyapi.com/api/categ/categs.html for a complete list.
Some of the category type names consist of two strings joined by an underscore character. This function selects one of the two strings (or a completely new string) to be the topic in the Primal request.
    # File 'lib/primal/AlchemyAPIWrapper.rb', line 274
    def rewriteCategory(category)
      case category
      when "unknown"
        # AlchemyAPI failed to classify the text
        category = nil
      when "arts_entertainment"
        category = "arts"       # rewrite to 'arts'
      when "computer_internet"
        category = "technology" # rewrite to 'technology', a clearer topic for this category
      when "culture_politics"
        category = "politics"   # rewrite to 'politics'
      when "law_crime"
        category = "law"        # rewrite to 'law'
      when "science_technology"
        category = "science"    # rewrite to 'science'
      else
        # The previous conditions should cover all the categories extracted by
        # Alchemy. In case of a new category that contains an underscore, replace
        # it and keep the two words as the topic.
        category = category.sub('_', ' ')
      end
    end
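The same mapping can be written as a lookup table, which may be easier to extend. This is an alternative sketch, not the method's actual implementation: CATEGORY_REWRITES and rewrite_category are hypothetical names, and "home_garden" is used only as an invented example of an unlisted category.

```ruby
# Rewrites taken from the case statement in the listing.
CATEGORY_REWRITES = {
  'arts_entertainment' => 'arts',
  'computer_internet'  => 'technology',
  'culture_politics'   => 'politics',
  'law_crime'          => 'law',
  'science_technology' => 'science'
}.freeze

# Returns nil for "unknown", the mapped topic for listed categories, and
# otherwise the category with its first underscore replaced by a space.
def rewrite_category(category)
  return nil if category == 'unknown'

  CATEGORY_REWRITES.fetch(category) { category.sub('_', ' ') }
end

%w[unknown law_crime sports home_garden].each do |name|
  puts "#{name} => #{rewrite_category(name).inspect}"
end
```

As in the listing, only the first underscore is replaced, because sub substitutes a single occurrence.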