Class: TF2R::Scraper

Inherits:
Object
  • Object
Includes:
TextHelpers
Defined in:
lib/tf2r/scraper.rb

Overview

This class retrieves web pages from tf2r.com and scrapes them for information about raffles, users, and raffle entries.

Author:

  • Justin Kim

Defined Under Namespace

Classes: InvalidUserPage

Instance Method Summary

Methods included from TextHelpers

#extract_color, #extract_link_snippet, #extract_steam_id, #raffle_link, #raffle_link_full, #user_link

Constructor Details

#initialize(options = {}) ⇒ Scraper

Creates a Scraper. Pass values using the options hash:

  • :user_agent, a String used for the User-Agent header

  • :cookies_txt, a File containing cookies to load into the Mechanize agent

Parameters:

  • options (Hash) (defaults to: {})

    options to create a Scraper with

Options Hash (options):

  • :user_agent (String)

    a custom User-Agent header content

  • :cookies_txt (File)

    a cookies.txt to load the Mechanize agent with



# File 'lib/tf2r/scraper.rb', line 20

def initialize(options={})
  @mech = Mechanize.new { |mech|
    mech.user_agent = options[:user_agent] || "TF2R::Scraper #{VERSION}"
  }

  load_cookies(options[:cookies_txt]) if options[:cookies_txt]
end
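
The User-Agent fallback in the constructor relies on Ruby's `||` operator: a missing `:user_agent` key yields `nil`, so the default string is used instead. The following is a minimal, stdlib-only sketch of that behavior. The helper `default_user_agent` and the constant `SKETCH_VERSION` are hypothetical stand-ins introduced for illustration and are not part of the library.

```ruby
# Placeholder version string (assumption); the real VERSION comes from the gem.
SKETCH_VERSION = '0.0.0'

# Hypothetical helper mirroring the fallback used in #initialize.
def default_user_agent(options = {})
  options[:user_agent] || "TF2R::Scraper #{SKETCH_VERSION}"
end

puts default_user_agent                           # falls back to the default
puts default_user_agent(user_agent: 'MyBot/1.0')  # uses the custom value
```

Note that an explicit `user_agent: nil` behaves the same as omitting the key, since `nil || default` evaluates to the default.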

Instance Method Details

#fetch(url) ⇒ Mechanize::Page

Fetches the page at the given URL.

Parameters:

  • url (String)

    the desired URL.

Returns:

  • (Mechanize::Page)

    the page given by Mechanize.



# File 'lib/tf2r/scraper.rb', line 44

def fetch(url)
  @mech.get(url)
end

#load_cookies(cookies_txt) ⇒ Mechanize::CookieJar

Loads the Mechanize agent with cookies from a cookies.txt.

Certain pages on TF2R require a session with a logged-in user. This requires a Netscape-style cookies.txt that contains a valid “session” cookie for “tf2r.com”.

Parameters:

  • cookies_txt (File)

    the cookies.txt file.

Returns:

  • (Mechanize::CookieJar)

    the CookieJar of the Mechanize agent.



# File 'lib/tf2r/scraper.rb', line 36

def load_cookies(cookies_txt)
  @mech.cookie_jar.load(cookies_txt, :cookiestxt)
end
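
For reference, a Netscape-style cookies.txt stores one cookie per line as seven tab-separated fields: domain, include-subdomains flag, path, secure flag, expiry (Unix time), name, and value. The sketch below writes such a file with the standard library only. The helper `netscape_cookie_line` and the cookie value `PLACEHOLDER` are assumptions made for illustration; a real session value has to come from a logged-in browser.

```ruby
require 'tempfile'

# Hypothetical helper: builds one Netscape-format cookie line
# (seven tab-separated fields).
def netscape_cookie_line(domain:, name:, value:, expires:, path: '/', secure: false)
  include_subdomains = domain.start_with?('.') ? 'TRUE' : 'FALSE'
  [domain, include_subdomains, path, secure ? 'TRUE' : 'FALSE',
   expires.to_i, name, value].join("\t")
end

line = netscape_cookie_line(domain: 'tf2r.com', name: 'session',
                            value: 'PLACEHOLDER', expires: Time.now + 86_400)

cookies_txt = Tempfile.new(['cookies', '.txt'])
cookies_txt.puts('# Netscape HTTP Cookie File')
cookies_txt.puts(line)
cookies_txt.flush
cookies_txt.rewind

# With the gem loaded, the file could then presumably be handed to the
# scraper (not executed here):
#   TF2R::Scraper.new(cookies_txt: cookies_txt)
```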

#scrape_main_page ⇒ Array

Scrapes TF2R for all active raffles.

See tf2r.com/raffles.html.

Examples:

s.scrape_main_page #=> ['http://tf2r.com/kold.html',
                        'http://tf2r.com/knew.html',
                        'http://tf2r.com/knewest.html']

Returns:

  • (Array)

    String links of all active raffles in chronological order (oldest to newest by creation time).



# File 'lib/tf2r/scraper.rb', line 59

def scrape_main_page
  page = fetch('http://tf2r.com/raffles.html')

  # All raffle links begin with 'tf2r.com/k'
  raffle_links = page.links_with(href: /tf2r\.com\/k/)
  raffle_links.map! { |x| x.uri.to_s }
  raffle_links.reverse!
end
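
The filter-then-reverse step can be tried without a network connection. The sketch below applies the same pattern to plain strings; the `hrefs` list is made up for illustration and assumes the page lists raffles newest first, which is what the final reversal implies.

```ruby
# Hypothetical hrefs in page order (newest first), including a non-raffle link.
hrefs = [
  'http://tf2r.com/knewest.html',
  'http://tf2r.com/knew.html',
  'http://tf2r.com/raffles.html',
  'http://tf2r.com/kold.html'
]

# Same pattern as in #scrape_main_page: raffle links begin with 'tf2r.com/k'.
raffle_pattern = %r{tf2r\.com/k}

# Keep only raffle links, then flip to oldest-first order.
raffle_links = hrefs.grep(raffle_pattern).reverse
puts raffle_links
```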

#scrape_raffle(page) ⇒ Array

Scrapes a raffle for all available information.

Parameters:

  • page (Mechanize::Page)

    the raffle page.

Returns:

  • (Array)
    • the raffle Hash from scrape_raffle_for_raffle

    • the creator Hash from scrape_raffle_for_creator



# File 'lib/tf2r/scraper.rb', line 74

def scrape_raffle(page)
  [scrape_raffle_for_raffle(page),
   scrape_raffle_for_creator(page)]
end
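
Because the return value is a two-element Array, it can be unpacked with parallel assignment. The sketch below uses a hand-written stand-in for the result (values borrowed from the examples elsewhere on this page) rather than a live scrape.

```ruby
# Stand-in for the Array returned by #scrape_raffle (assumption: abbreviated
# hashes, not real scraper output).
result = [
  { link_snippet: 'kstzcbd', title: 'Just one refined [1 hour]' },
  { steam_id: 76561198061719848, username: 'Yulli' }
]

# First element is the raffle Hash, second is the creator Hash.
raffle, creator = result
puts "#{raffle[:title]} by #{creator[:username]}"
```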

#scrape_raffle_for_creator(page) ⇒ Hash

Scrapes a raffle page for information about the creator.

Examples:

p = s.fetch('http://tf2r.com/kstzcbd.html')
s.scrape_raffle_for_creator(p) #=>
{:steam_id=>76561198061719848,
 :username=>"Yulli",
 :avatar_link=>"http://media.steampowered.com/steamcommunity/public/images/avatars/bc/bc9dc4302d23f2e2f37f59c59f29c27dbc8cade6_full.jpg",
 :posrep=>11458,
 :negrep=>0,
 :color=>"70b01b"}

Parameters:

  • page (Mechanize::Page)

    the raffle page.

Returns:

  • (Hash)

    a representation of a user, the raffle creator.

    • :steam_id (Fixnum) — the creator’s SteamID64.

    • :username (String) — the creator’s username.

    • :avatar_link (String) — a link to the creator’s avatar.

    • :posrep (Fixnum) — the creator’s positive rep.

    • :negrep (Fixnum) — the creator’s negative rep.

    • :color (String) — hex color code of the creator’s username.



# File 'lib/tf2r/scraper.rb', line 99

def scrape_raffle_for_creator(page)
  # Reag classed some things "raffle_infomation" (sic). The misspelling
  # matches the site's markup, so it must stay as-is.
  infos = page.parser.css('.raffle_infomation')

  # The main 'a' element, containing the creator's username.
  user_anchor = infos[2].css('a')[0]

  steam_id = extract_steam_id(user_anchor.attribute('href').to_s)
  username = user_anchor.text
  avatar_link = infos[1].css('img')[0].attribute('src').to_s
  posrep = /(\d+)/.match(infos.css('.upvb').text)[1].to_i
  negrep = /(\d+)/.match(infos.css('.downvb').text)[1].to_i

  # The creator's username color. Corresponds to rank.
  color = extract_color(user_anchor.attribute('style').to_s)

  {steam_id: steam_id, username: username, avatar_link: avatar_link,
   posrep: posrep, negrep: negrep, color: color}
end
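
The rep values are pulled out with `/(\d+)/.match(text)[1].to_i`, which raises NoMethodError when the text contains no digits, because `match` returns `nil`. A more forgiving variant is sketched below; `first_integer` is a hypothetical helper, and the sample strings are invented since the real element text is not shown here.

```ruby
# Hypothetical helper: returns the first run of digits as an Integer,
# or 0 when the text contains no digits (nil.to_i is 0).
def first_integer(text)
  text[/\d+/].to_i
end

puts first_integer('Rep: 11458')
puts first_integer('no digits here')
```

Whether a silent 0 is preferable to an exception depends on how the caller wants to detect markup changes on the site.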

#scrape_raffle_for_participants(page) ⇒ Array

Scrapes a raffle page for all the participants.

This should rarely be needed. It is only necessary when a raffle has maximum entries greater than 2500.

Parameters:

  • page (Mechanize::Page)

    the raffle page.

Returns:

  • (Array)

    contains Hashes representing each of the participants, in chronological order (first entered to last).

    • :steam_id (Fixnum) — the participant’s SteamID64.

    • :username (String) — the participant’s username.

    • :color (String) — hex color code of the participant’s username.



# File 'lib/tf2r/scraper.rb', line 174

def scrape_raffle_for_participants(page)
  participants = []
  participant_divs = page.parser.css('.pentry')
  participant_divs.each do |participant|
    user_anchor = participant.children[1]
    steam_id = extract_steam_id(user_anchor.to_s)
    username = participant.text
    color = extract_color(user_anchor.children[0].attribute('style'))

    participants << {steam_id: steam_id, username: username, color: color}
  end

  participants.reverse!
end

#scrape_raffle_for_raffle(page) ⇒ Hash

Scrapes a raffle page for some information about the raffle.

The information is incomplete. This should be used in conjunction with the API as part of TF2R::Raffle.

Examples:

p = s.fetch('http://tf2r.com/kstzcbd.html')
s.scrape_raffle_for_raffle(p) #=>
{:link_snippet=>"kstzcbd",
 :title=>"Just one refined [1 hour]",
 :description=>"Plain and simple.",
 :start_time=>2012-10-29 09:51:45 -0400,
 :end_time=>2012-10-29 09:53:01 -0400}

Parameters:

  • page (Mechanize::Page)

    the raffle page.

Returns:

  • (Hash)

    a partial representation of the raffle.

    • :link_snippet (String) — the “raffle id” in the URL.

    • :title (String) — the raffle’s title.

    • :description (String) — the raffle’s “message”.

    • :start_time (Time) — the creation time of the raffle.

    • :end_time (Time) — the projected/observed end time for the raffle.



# File 'lib/tf2r/scraper.rb', line 140

def scrape_raffle_for_raffle(page)
  # Reag classed some things "raffle_infomation" (sic). The misspelling
  # matches the site's markup, so it must stay as-is.
  infos = page.parser.css('.raffle_infomation')

  # Elements of the main raffle info table.
  raffle_tds = infos[3].css('td')

  # 'kabc123' for 'http://tf2r.com/kabc123.html'
  link_snippet = extract_link_snippet(page.uri.path)
  title = extract_title(infos[0].text)
  description = raffle_tds[1].text

  # This doesn't work right now, because Miz just displays "10%" in the
  # page HTML and updates it with JS after a call to the API.
  # win_chance = /(.+)%/.match(infos.css('#winc').text)[1].to_f / 100

  start_time = extract_start_time(raffle_tds[9])
  end_time = extract_end_time(raffle_tds[11])

  {link_snippet: link_snippet, title: title, description: description,
   start_time: start_time, end_time: end_time}
end
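
The `:link_snippet` value is the file name of the raffle URL without its `.html` extension. The actual `extract_link_snippet` implementation lives in TextHelpers and is not shown on this page, so the sketch below is only one plausible way to get the same result with the standard library; `link_snippet_from` is a hypothetical name.

```ruby
require 'uri'

# Hypothetical equivalent of extract_link_snippet: take the URL path and
# strip the directory part and the '.html' suffix.
def link_snippet_from(url)
  File.basename(URI.parse(url).path, '.html')
end

puts link_snippet_from('http://tf2r.com/kstzcbd.html') # prints "kstzcbd"
```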

#scrape_ranks(info_page) ⇒ Array

Scrapes the TF2R info page for available user ranks.

See tf2r.com/info.html.

Examples:

p = s.fetch('http://tf2r.com/info.html')
s.scrape_ranks(p) #=>
[{:color=>"ebe2ca", :name=>"User",
  :description=>"Every new or existing user has this rank."}, ...]

Parameters:

  • info_page (Mechanize::Page)

    the info page.

Returns:

  • (Array)

    contains Hashes representing each of the ranks.

    • :name (String) — the rank’s name.

    • :description (String) — the rank’s description.

    • :color (String) — the rank’s hex color code.



# File 'lib/tf2r/scraper.rb', line 245

def scrape_ranks(info_page)
  rank_divs = info_page.parser.css('#ranks').children
  ranks = rank_divs.select { |div| div.children.size == 3 }
  ranks.map { |div| extract_rank(div) }
end

#scrape_user(user_page) ⇒ Hash

Scrapes a user page for information about the user.

Examples:

p = s.fetch('http://tf2r.com/user/76561198061719848.html')
s.scrape_user(p) #=>
{:steam_id=>76561198061719848,
 :username=>"Yulli",
 :avatar_link=>"http://media.steampowered.com/steamcommunity/public/images/avatars/bc/bc9dc4302d23f2e2f37f59c59f29c27dbc8cade6_full.jpg",
 :posrep=>11459,
 :negrep=>0,
 :color=>"70b01b"}

Parameters:

  • user_page (Mechanize::Page)

    the user page.

Returns:

  • (Hash)

    a representation of the user.

    • :steam_id (Fixnum) — the user’s SteamID64.

    • :username (String) — the user’s username.

    • :avatar_link (String) — a link to the user’s avatar.

    • :posrep (Fixnum) — the user’s positive rep.

    • :negrep (Fixnum) — the user’s negative rep.

    • :color (String) — hex color code of the user’s username.



# File 'lib/tf2r/scraper.rb', line 209

def scrape_user(user_page)
  # Guard clause: pages without a profile block do not belong to any user.
  if user_page.parser.css('.profile_info').empty?
    raise InvalidUserPage, 'The given page does not correspond to any user.'
  end

  infos = user_page.parser.css('.raffle_infomation') # sic
  user_anchor = infos[1].css('a')[0]

  steam_id = extract_steam_id(user_page.uri.to_s)
  username = /TF2R Item Raffles - (.+)/.match(user_page.title)[1]
  avatar_link = infos[0].css('img')[0].attribute('src').to_s

  posrep = infos.css('.upvb').text.to_i
  negrep = infos.css('.downvb').text.to_i

  color = extract_color(user_anchor.attribute('style').to_s)

  {steam_id: steam_id, username: username, avatar_link: avatar_link,
   posrep: posrep, negrep: negrep, color: color}
end
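
The username is taken from the page title rather than from the page body. The title-matching step can be sketched in isolation as follows; `username_from_title` is a hypothetical helper, and unlike the method above it returns `nil` instead of raising when the title does not match.

```ruby
# Same pattern as in #scrape_user.
TITLE_PATTERN = /TF2R Item Raffles - (.+)/

# Hypothetical helper: returns the captured username, or nil when the
# title does not follow the expected format.
def username_from_title(title)
  match = TITLE_PATTERN.match(title)
  match && match[1]
end

puts username_from_title('TF2R Item Raffles - Yulli') # prints "Yulli"
```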