Class: TF2R::Scraper
- Inherits: Object
- Includes: TextHelpers
- Defined in: lib/tf2r/scraper.rb
Overview
This class provides a way to retrieve webpages from tf2r.com and scrape them for information about raffles, users, entries, etc.
Defined Under Namespace
Classes: InvalidUserPage
Instance Method Summary
- #fetch(url) ⇒ Mechanize::Page
  Fetches the page at the given URL.
- #initialize(options = {}) ⇒ Scraper (constructor)
  Creates a Scraper.
- #load_cookies(cookies_txt) ⇒ Mechanize::CookieJar
  Loads the Mechanize agent with cookies from a cookies.txt.
- #scrape_main_page ⇒ Array
  Scrapes TF2R for all active raffles.
- #scrape_raffle(page) ⇒ Array
  Scrapes a raffle for all available information.
- #scrape_raffle_for_creator(page) ⇒ Hash
  Scrapes a raffle page for information about the creator.
- #scrape_raffle_for_participants(page) ⇒ Array
  Scrapes a raffle page for all the participants.
- #scrape_raffle_for_raffle(page) ⇒ Hash
  Scrapes a raffle page for some information about the raffle.
- #scrape_ranks(info_page) ⇒ Array
  Scrapes the TF2R info page for available user ranks.
- #scrape_user(user_page) ⇒ Hash
  Scrapes a user page for information about the user.
Methods included from TextHelpers
#extract_color, #extract_link_snippet, #extract_steam_id, #raffle_link, #raffle_link_full, #user_link
Constructor Details
#initialize(options = {}) ⇒ Scraper
Creates a Scraper. Pass values using the options hash.
:user_agent a String used for the User-Agent header
:cookies_txt a File containing cookies to load into the Mechanize agent
# File 'lib/tf2r/scraper.rb', line 20
def initialize(options = {})
  @mech = Mechanize.new { |mech|
    mech.user_agent = options[:user_agent] || "TF2R::Scraper #{VERSION}"
  }

  load_cookies(options[:cookies_txt]) if options[:cookies_txt]
end
Instance Method Details
#fetch(url) ⇒ Mechanize::Page
Fetches the page at the given URL.
# File 'lib/tf2r/scraper.rb', line 44
def fetch(url)
  @mech.get(url)
end
#load_cookies(cookies_txt) ⇒ Mechanize::CookieJar
Loads the Mechanize agent with cookies from a cookies.txt.
Certain pages on TF2R require a session with a logged-in user. This requires a Netscape-style cookies.txt that contains a valid “session” cookie for “tf2r.com”.
# File 'lib/tf2r/scraper.rb', line 36
def load_cookies(cookies_txt)
  @mech.cookie_jar.load(cookies_txt, :cookiestxt)
end
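For reference, a Netscape-style cookies.txt stores one cookie per line as seven tab-separated fields: domain, subdomain flag, path, secure flag, expiry (Unix time), name, and value. The sketch below builds and parses such a line using only the standard library. The helper names and the placeholder session value are hypothetical and are not part of TF2R::Scraper.

```ruby
# Field order used by Netscape-style cookies.txt files.
COOKIE_FIELDS = %i[domain include_subdomains path secure expires name value].freeze

# Hypothetical helper: formats one cookie as a tab-separated cookies.txt line.
def cookies_txt_line(domain:, name:, value:, expires:, path: '/', secure: false)
  [domain, 'FALSE', path, secure ? 'TRUE' : 'FALSE', expires.to_i, name, value].join("\t")
end

# Hypothetical helper: parses one cookies.txt line back into a Hash of Strings.
def parse_cookies_txt_line(line)
  COOKIE_FIELDS.zip(line.chomp.split("\t", -1)).to_h
end

# 'PLACEHOLDER' stands in for a real session value copied from a logged-in browser.
line = cookies_txt_line(domain: 'tf2r.com', name: 'session',
                        value: 'PLACEHOLDER', expires: Time.at(2_000_000_000))

# A complete file is a comment header followed by one line per cookie.
cookies_txt_body = "# Netscape HTTP Cookie File\n#{line}\n"
puts cookies_txt_body
```

A file with this shape, holding a real session value, is what #load_cookies (or the :cookies_txt option) expects to receive.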
#scrape_main_page ⇒ Array
Scrapes TF2R for all active raffles.
# File 'lib/tf2r/scraper.rb', line 59
def scrape_main_page
  page = fetch('http://tf2r.com/raffles.html')

  # All raffle links begin with 'tf2r.com/k'
  raffle_links = page.links_with(href: /tf2r\.com\/k/)
  raffle_links.map! { |x| x.uri.to_s }
  raffle_links.reverse!
end
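The filter-then-reverse step can be tried without a network connection. The following is a minimal sketch on plain strings, assuming the page's hrefs have already been collected; the URLs and the helper name raffle_links_from are made up for illustration.

```ruby
# Stand-in for the hrefs Mechanize would return; these URLs are invented.
SAMPLE_HREFS = [
  'http://tf2r.com/kaaa111.html',
  'http://tf2r.com/info.html',
  'http://tf2r.com/kbbb222.html',
  'http://tf2r.com/user/123.html'
].freeze

# Hypothetical helper: keeps hrefs matching the raffle pattern, then
# reverses their order, mirroring the last three lines of the method above.
def raffle_links_from(hrefs)
  hrefs.grep(%r{tf2r\.com/k}).reverse
end

p raffle_links_from(SAMPLE_HREFS)
# => ["http://tf2r.com/kbbb222.html", "http://tf2r.com/kaaa111.html"]
```

Note that the pattern matches any path starting with "k", so a non-raffle page whose name begins with that letter would also pass the filter.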
#scrape_raffle(page) ⇒ Array
Scrapes a raffle for all available information.
# File 'lib/tf2r/scraper.rb', line 74
def scrape_raffle(page)
  [scrape_raffle_for_raffle(page),
   scrape_raffle_for_creator(page)]
end
#scrape_raffle_for_creator(page) ⇒ Hash
Scrapes a raffle page for information about the creator.
# File 'lib/tf2r/scraper.rb', line 99
def scrape_raffle_for_creator(page)
  # Reag classed some things "raffle_infomation". That's spelled right.
  infos = page.parser.css('.raffle_infomation')

  # The main 'a' element, containing the creator's username.
  user_anchor = infos[2].css('a')[0]

  steam_id = extract_steam_id(user_anchor.attribute('href').to_s)
  username = user_anchor.text
  avatar_link = infos[1].css('img')[0].attribute('src').to_s
  posrep = /(\d+)/.match(infos.css('.upvb').text)[1].to_i
  negrep = /(\d+)/.match(infos.css('.downvb').text)[1].to_i

  # The creator's username color. Corresponds to rank.
  color = extract_color(user_anchor.attribute('style').to_s)

  {steam_id: steam_id, username: username, avatar_link: avatar_link,
   posrep: posrep, negrep: negrep, color: color}
end
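The reputation values are read with /(\d+)/.match(text)[1].to_i, which raises NoMethodError when the text contains no digits, because match returns nil. A nil-safe variant might look like the sketch below; first_integer is a hypothetical helper and the sample strings are invented, since the exact text of the .upvb and .downvb elements is not shown here.

```ruby
# Hypothetical helper: returns the first run of digits in text as an Integer,
# or nil when the text contains no digits at all.
def first_integer(text)
  match = /(\d+)/.match(text)
  match && match[1].to_i
end

p first_integer('+12 rep')  # => 12
p first_integer('no rep')   # => nil
```

Returning nil leaves it to the caller to decide whether a missing count means zero or a page layout change.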
#scrape_raffle_for_participants(page) ⇒ Array
Scrapes a raffle page for all the participants.
This should rarely be used. It is only necessary when a raffle's maximum number of entries is greater than 2500.
Returns an Array of Hashes, one per participant, in chronological order (first entered to last). Each Hash contains:
* :steam_id (Fixnum): the participant's SteamID64.
* :username (String): the participant's username.
* :color (String): hex color code of the participant's username.
# File 'lib/tf2r/scraper.rb', line 174
def scrape_raffle_for_participants(page)
  participants = []
  participant_divs = page.parser.css('.pentry')
  participant_divs.each do |participant|
    user_anchor = participant.children[1]
    steam_id = extract_steam_id(user_anchor.to_s)
    username = participant.text
    color = extract_color(user_anchor.children[0].attribute('style'))

    participants << {steam_id: steam_id, username: username, color: color}
  end

  participants.reverse!
end
#scrape_raffle_for_raffle(page) ⇒ Hash
Scrapes a raffle page for some information about the raffle.
The information is incomplete. This should be used in conjunction with the API as part of TF2R::Raffle.
# File 'lib/tf2r/scraper.rb', line 140
def scrape_raffle_for_raffle(page)
  # Reag classed some things "raffle_infomation". That's spelled right.
  infos = page.parser.css('.raffle_infomation')

  # Elements of the main raffle info table.
  raffle_tds = infos[3].css('td')

  # 'kabc123' for 'http://tf2r.com/kabc123.html'
  link_snippet = extract_link_snippet(page.uri.path)
  title = extract_title(infos[0].text)
  description = raffle_tds[1].text

  # This doesn't work right now, because Miz just displays "10%" in the
  # page HTML and updates it with JS after a call to the API.
  # win_chance = /(.+)%/.match(infos.css('#winc').text)[1].to_f / 100

  start_time = extract_start_time(raffle_tds[9])
  end_time = extract_end_time(raffle_tds[11])

  {link_snippet: link_snippet, title: title, description: description,
   start_time: start_time, end_time: end_time}
end
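Two of the string transformations used here can be illustrated in isolation. This is a sketch under stated assumptions: the real extract_link_snippet lives in TextHelpers and its implementation is not shown on this page, so link_snippet_from_path is a hypothetical stand-in, and parse_win_chance simply wraps the commented-out expression with a nil check.

```ruby
# Hypothetical stand-in for extract_link_snippet:
# '/kabc123.html' becomes 'kabc123'.
def link_snippet_from_path(path)
  File.basename(path, '.html')
end

# Hypothetical helper based on the commented-out win_chance line:
# converts text such as "10%" to a fraction, or nil if no "%" is present.
def parse_win_chance(text)
  match = /(.+)%/.match(text)
  match && match[1].to_f / 100
end

p link_snippet_from_path('/kabc123.html')  # => "kabc123"
p parse_win_chance('10%')                  # => 0.1
```

As the source comment notes, the page HTML only carries a placeholder percentage, so a parsed win chance is only meaningful once the value comes from the API.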
#scrape_ranks(info_page) ⇒ Array
Scrapes the TF2R info page for available user ranks.
See tf2r.com/info.html.
# File 'lib/tf2r/scraper.rb', line 245
def scrape_ranks(info_page)
  rank_divs = info_page.parser.css('#ranks').children
  ranks = rank_divs.select { |div| div.children.size == 3 }
  ranks.map { |div| extract_rank(div) }
end
#scrape_user(user_page) ⇒ Hash
Scrapes a user page for information about the user.
# File 'lib/tf2r/scraper.rb', line 209
def scrape_user(user_page)
  if user_page.parser.css('.profile_info').empty?
    raise InvalidUserPage, 'The given page does not correspond to any user.'
  else
    infos = user_page.parser.css('.raffle_infomation') #sic
    user_anchor = infos[1].css('a')[0]

    steam_id = extract_steam_id(user_page.uri.to_s)
    username = /TF2R Item Raffles - (.+)/.match(user_page.title)[1]
    avatar_link = infos[0].css('img')[0].attribute('src').to_s

    posrep = infos.css('.upvb').text.to_i
    negrep = infos.css('.downvb').text.to_i

    color = extract_color(user_anchor.attribute('style').to_s)
  end

  {steam_id: steam_id, username: username, avatar_link: avatar_link,
   posrep: posrep, negrep: negrep, color: color}
end
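Callers will usually want to rescue InvalidUserPage rather than let it propagate. The sketch below shows the username extraction and the rescue pattern on plain strings. It defines its own stand-in exception class so that it runs without the gem, and, unlike the method above (which raises when .profile_info is missing), it raises when the title does not match; that difference is a choice of this sketch only.

```ruby
# Stand-in for TF2R::Scraper::InvalidUserPage so the sketch is self-contained.
class InvalidUserPage < StandardError; end

# Hypothetical helper: pulls the username out of a page title.
def username_from_title(title)
  match = /TF2R Item Raffles - (.+)/.match(title)
  raise InvalidUserPage, 'The given page does not correspond to any user.' unless match

  match[1]
end

# The titles below are invented examples.
['TF2R Item Raffles - Alice', 'TF2R Item Raffles'].each do |title|
  begin
    puts username_from_title(title)
  rescue InvalidUserPage => e
    warn "Skipping page: #{e.message}"
  end
end
```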