Class: SynonymScrapper::Scrapper
- Inherits:
- Object
- Object
- SynonymScrapper::Scrapper
- Defined in:
- lib/synonym_scrapper/scrapper.rb
Overview
Base scrapper used to scrape APIs/websites
Direct Known Subclasses
Constant Summary
- USER_AGENTS =
List of user agents to select from when scraping.
[
  'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
  'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0',
  'Mozilla/5.0 (Windows; U; Win 9x 4.90; SG; rv:1.9.2.4) Gecko/20101104 Netscape/9.1.0285',
  'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0',
  'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36',
  'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
  'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.13+ (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
  'Mozilla/5.0 (iPhone; U; ru; CPU iPhone OS 4_2_1 like Mac OS X; ru) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148a Safari/6533.18.5',
  'Mozilla/5.0 (Linux; U; Android 2.3.4; fr-fr; HTC Desire Build/GRJ22) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
  'Mozilla/5.0 (BlackBerry; U; BlackBerry 9900; en) AppleWebKit/534.11+ (KHTML, like Gecko) Version/7.1.0.346 Mobile Safari/534.11+',
  'Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
]
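As a rough illustration of how a list like this can be used, the sketch below picks one entry at random for each request, the way the call method does with USER_AGENTS.sample. The shortened two-entry list and the request_headers helper are assumptions made for the example and are not part of the library.

```ruby
# Stand-in for SynonymScrapper::Scrapper::USER_AGENTS, shortened to two entries.
USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
  'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14'
].freeze

# Hypothetical helper: build the header hash for one request,
# choosing a user agent at random with Array#sample.
def request_headers(agents = USER_AGENTS)
  { 'User-Agent' => agents.sample }
end

headers = request_headers
puts headers['User-Agent']
```

Because the choice is random, consecutive requests may or may not reuse the same user agent.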
Instance Attribute Summary
-
#base_url ⇒ Object
Base URL of the API/website to be consulted.
-
#max_retries ⇒ Object
Number; the maximum number of retries for each failed request.
-
#retries_left ⇒ Object
Number; how many retries remain for the current request.
Instance Method Summary
-
#build_call_url(endpoint) ⇒ Object
Method to be overridden by subclasses. The endpoint can be anything (Array, Hash, String, etc.) as long as it is used consistently in the child class.
-
#call(endpoint) ⇒ Object
Executes a call to the given endpoint and returns its response.
-
#initialize(max_retries, base_url) ⇒ Scrapper
constructor
Initialize the scrapper with the base_url to scrape and the maximum number of retries, max_retries.
-
#retry_call(endpoint) ⇒ Object
Retry the call to the specified endpoint after waiting between 50 and 1000 milliseconds (random sleep).
Constructor Details
#initialize(max_retries, base_url) ⇒ Scrapper
Initialize the scrapper with the base_url to scrape and the maximum number of retries, max_retries.
# File 'lib/synonym_scrapper/scrapper.rb', line 43
def initialize max_retries, base_url
  @max_retries = max_retries
  @retries_left = max_retries
  @base_url = base_url
end
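A minimal usage sketch of the constructor follows. It uses a stand-in class, DemoScrapper, that mirrors the documented constructor so the example runs without the gem. The class name, the example URL, and the use of attr_accessor are assumptions. Note the argument order: max_retries comes first, then base_url.

```ruby
# Stand-in that mirrors the documented constructor; not the gem's own class.
class DemoScrapper
  # attr_accessor is an assumption; the page only shows that readers exist.
  attr_accessor :base_url, :max_retries, :retries_left

  def initialize(max_retries, base_url)
    @max_retries = max_retries
    @retries_left = max_retries # a new instance starts with the full retry budget
    @base_url = base_url
  end
end

scrapper = DemoScrapper.new(3, 'https://example.com/api')
puts scrapper.max_retries
puts scrapper.retries_left
puts scrapper.base_url
```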
Instance Attribute Details
#base_url ⇒ Object
Base URL of the API/website to be consulted.
# File 'lib/synonym_scrapper/scrapper.rb', line 37
def base_url
  @base_url
end
#max_retries ⇒ Object
Number; the maximum number of retries for each failed request.
# File 'lib/synonym_scrapper/scrapper.rb', line 29
def max_retries
  @max_retries
end
#retries_left ⇒ Object
Number; how many retries remain for the current request.
# File 'lib/synonym_scrapper/scrapper.rb', line 33
def retries_left
  @retries_left
end
Instance Method Details
#build_call_url(endpoint) ⇒ Object
Method to be overridden by subclasses. The endpoint can be anything (Array, Hash, String, etc.) as long as it is used consistently in the child class.
# File 'lib/synonym_scrapper/scrapper.rb', line 54
def build_call_url endpoint
  raise Error, "This method must be redefined in subclasses"
end
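To make the override contract concrete, here is a sketch of a subclass that redefines build_call_url. Everything in it is hypothetical: the DemoScrapping module, the Base stand-in (which only mirrors the documented constructor and the raising default), the WordLookup subclass, the /synonyms path, and the choice of a Hash endpoint with :word and :lang keys.

```ruby
require 'uri'

module DemoScrapping
  # Stand-in for the library's error class.
  class Error < StandardError; end

  # Stand-in base class mirroring the documented behaviour.
  class Base
    attr_reader :base_url

    def initialize(max_retries, base_url)
      @max_retries = max_retries
      @retries_left = max_retries
      @base_url = base_url
    end

    def build_call_url(_endpoint)
      raise Error, 'This method must be redefined in subclasses'
    end
  end

  # Hypothetical subclass: an endpoint is always a Hash with :word and :lang.
  class WordLookup < Base
    def build_call_url(endpoint)
      query = URI.encode_www_form(
        { 'lang' => endpoint.fetch(:lang), 'q' => endpoint.fetch(:word) }
      )
      "#{base_url}/synonyms?#{query}"
    end
  end
end

lookup = DemoScrapping::WordLookup.new(3, 'https://example.com')
url = lookup.build_call_url({ word: 'happy', lang: 'en' })
puts url
```

The endpoint type is a per-subclass decision; what matters is that every caller of that subclass passes the same shape.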
#call(endpoint) ⇒ Object
Executes a call to the given endpoint and returns its response.
On an HTTP error, the method retries up to @max_retries times. A 404 response is treated as not worth retrying, and an empty array is returned instead. Other kinds of errors are not retried.
# File 'lib/synonym_scrapper/scrapper.rb', line 66
def call endpoint
  uri = build_call_url(endpoint)
  begin
    response = URI.open(uri, 'User-Agent' => USER_AGENTS.sample)
  rescue OpenURI::HTTPError => e
    puts e
    return [] if e.message == '404 Not Found'
    retry_call endpoint unless @retries_left <= 0
  rescue => e
    puts e
  end
  # Reset the retries_left variable on this instance after each request
  @retries_left = @max_retries
  return response
end
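The retry policy described above (give up at once on a 404, retry other HTTP errors a bounded number of times) can be sketched without any network access. This is an alternative formulation for illustration, not the library's implementation: call_with_retries, FakeHTTPError, and the injectable sleeper are all hypothetical names. One thing worth noticing in the listing above is that the value returned by retry_call is not assigned to response, so a successful retry does not appear to reach the caller; the sketch returns the retried result directly.

```ruby
# Hypothetical error type standing in for OpenURI::HTTPError.
class FakeHTTPError < StandardError; end

# Runs the block, retrying up to max_retries times on FakeHTTPError.
# Returns [] on a 404, the block's value on success, and nil once
# the retries are used up.
def call_with_retries(max_retries, sleeper: ->(seconds) { sleep(seconds) })
  retries_left = max_retries
  begin
    yield
  rescue FakeHTTPError => e
    return [] if e.message == '404 Not Found'
    return nil if retries_left <= 0

    retries_left -= 1
    sleeper.call((50 + rand(950)) / 1000.0) # random pause, in seconds
    retry
  end
end

# Simulated request that fails twice, then succeeds. The no-op sleeper
# keeps the example fast.
attempts = 0
result = call_with_retries(3, sleeper: ->(_seconds) {}) do
  attempts += 1
  raise FakeHTTPError, '503 Service Unavailable' if attempts < 3

  'response body'
end

not_found = call_with_retries(3, sleeper: ->(_seconds) {}) do
  raise FakeHTTPError, '404 Not Found'
end
```

Keeping the retry counter in a local variable, rather than on the instance, also avoids having to reset it after each request.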
#retry_call(endpoint) ⇒ Object
Retry the call to the specified endpoint after waiting between 50 and 1000 milliseconds (random sleep).
# File 'lib/synonym_scrapper/scrapper.rb', line 86
def retry_call endpoint
  @retries_left -= 1
  sleepTime = (50 + rand(950)) / 1000
  sleep(sleepTime)
  call(endpoint)
end
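A note on the sleep arithmetic: in Ruby, dividing one Integer by another truncates, and 50 + rand(950) is at most 999, so the expression (50 + rand(950)) / 1000 evaluates to 0. To get the fractional pause the description mentions, the divisor would need to be a Float. The sketch below shows both forms; jitter_seconds is a hypothetical helper, not part of the library.

```ruby
# Integer division: the numerator is between 50 and 999, so this is 0.
integer_seconds = (50 + rand(950)) / 1000

# Hypothetical helper: the same expression with a Float divisor, which
# keeps the fractional part (0.05 up to 0.999 seconds).
def jitter_seconds(rng = Random.new)
  (50 + rng.rand(950)) / 1000.0
end

puts integer_seconds
puts jitter_seconds
```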