Class: TF2R::Scraper

Inherits: Object

Includes: TextHelpers

Defined in: lib/tf2r/scraper.rb

Overview

This class provides a way to retrieve webpages from tf2r.com and scrape them for information about raffles, users, entries, etc.

Author:

Justin Kim

Defined Under Namespace

Classes: InvalidUserPage

Instance Method Summary

#fetch(url) ⇒ Mechanize::Page

Fetches the page at the given URL.

#initialize(options = {}) ⇒ Scraper constructor

Creates a Scraper.

#load_cookies(cookies_txt) ⇒ Mechanize::CookieJar

Loads the Mechanize agent with cookies from a cookies.txt.

#scrape_main_page ⇒ Array

Scrapes TF2R for all active raffles.

#scrape_raffle(page) ⇒ Array

Scrapes a raffle for all available information.

#scrape_raffle_for_creator(page) ⇒ Hash

Scrapes a raffle page for information about the creator.

#scrape_raffle_for_participants(page) ⇒ Array

Scrapes a raffle page for all the participants.

#scrape_raffle_for_raffle(page) ⇒ Hash

Scrapes a raffle page for some information about the raffle.

#scrape_ranks(info_page) ⇒ Array

Scrapes the TF2R info page for available user ranks.

#scrape_user(user_page) ⇒ Hash

Scrapes a user page for information about the user.

Methods included from TextHelpers

#extract_color, #extract_link_snippet, #extract_steam_id, #raffle_link, #raffle_link_full, #user_link

Constructor Details

#initialize(options = {}) ⇒ `Scraper`

Creates a Scraper. Pass values using the options hash.

:user_agent a String used for the User-Agent header
:cookies_txt a File containing cookies to load into the Mechanize agent

Parameters:

options (Hash) (defaults to: {}) —

options to create a Scraper with

Options Hash (options):

:user_agent (String) —

a custom User-Agent header content
:cookies_txt (File) —

a cookies.txt to load the Mechanize agent with

# File 'lib/tf2r/scraper.rb', line 20

def initialize(options={})
  @mech = Mechanize.new { |mech|
    mech.user_agent = options[:user_agent] || "TF2R::Scraper #{VERSION}"
  }

  load_cookies(options[:cookies_txt]) if options[:cookies_txt]
end
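
The User-Agent fallback in the constructor can be looked at in isolation. The following is a minimal sketch, not the library's code: `resolve_user_agent` and `SKETCH_VERSION` are hypothetical names standing in for the `options[:user_agent] || ...` expression and the gem's `VERSION` constant.

```ruby
# Hypothetical stand-in for the gem's VERSION constant.
SKETCH_VERSION = '0.0.0'

# Mirrors the `options[:user_agent] || default` pattern from #initialize.
# A missing or nil :user_agent falls back to the default string.
def resolve_user_agent(options = {})
  options[:user_agent] || "TF2R::Scraper #{SKETCH_VERSION}"
end

puts resolve_user_agent                           # default User-Agent
puts resolve_user_agent(user_agent: 'MyBot/1.0')  # custom User-Agent
```

Because the fallback uses `||`, only `nil` and `false` trigger the default; an empty String would be passed through unchanged.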

Instance Method Details

#fetch(url) ⇒ `Mechanize::Page`

Fetches the page at the given URL.

Parameters:

url (String) —

the desired URL.

Returns:

(Mechanize::Page) —

the page given by Mechanize.




# File 'lib/tf2r/scraper.rb', line 44

def fetch(url)
  @mech.get(url)
end

#load_cookies(cookies_txt) ⇒ `Mechanize::CookieJar`

Loads the Mechanize agent with cookies from a cookies.txt.

Certain pages on TF2R require a session with a logged-in user. This requires a Netscape-style cookies.txt that contains a valid “session” cookie for “tf2r.com”.

Parameters:

cookies_txt (File) —

the cookies.txt file.

Returns:

(Mechanize::CookieJar) —

the CookieJar of the Mechanize agent.




# File 'lib/tf2r/scraper.rb', line 36

def load_cookies(cookies_txt)
  @mech.cookie_jar.load(cookies_txt, :cookiestxt)
end
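
For readers unfamiliar with the Netscape cookies.txt layout, here is a rough sketch of what such a file looks like. Each row holds seven tab-separated fields: domain, include-subdomains flag, path, secure flag, expiry (Unix time), name, and value. The helper names `write_cookies_txt` and `read_cookie_rows` are hypothetical, and the cookie value and expiry are placeholders, not real session data.

```ruby
require 'tempfile'

# Writes rows of seven fields each to a temporary cookies.txt-style file.
def write_cookies_txt(rows)
  file = Tempfile.new(['cookies', '.txt'])
  file.puts('# Netscape HTTP Cookie File')
  rows.each { |row| file.puts(row.join("\t")) }
  file.flush
  file.rewind
  file
end

# Reads the rows back, skipping comment and blank lines.
def read_cookie_rows(io)
  io.each_line
    .map(&:chomp)
    .reject { |line| line.empty? || line.start_with?('#') }
    .map { |line| line.split("\t") }
end

# Placeholder "session" cookie for "tf2r.com"; value and expiry are made up.
session_row = ['tf2r.com', 'FALSE', '/', 'FALSE', '2147483647', 'session', 'PLACEHOLDER']

cookies_txt = write_cookies_txt([session_row])
rows = read_cookie_rows(cookies_txt)
puts rows.first[5] # the cookie name field
```

A file shaped like this, holding a genuine session cookie, is what the `:cookies_txt` option and `#load_cookies` expect to receive.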

#scrape_main_page ⇒ `Array`

Scrapes TF2R for all active raffles.

See tf2r.com/raffles.html

Examples:

s.scrape_main_page #=> ['http://tf2r.com/kold.html',
                        'http://tf2r.com/knew.html',
                        'http://tf2r.com/knewest.html']

Returns:

(Array) —

String links of all active raffles in chronological order (oldest to newest by creation time).

# File 'lib/tf2r/scraper.rb', line 59

def scrape_main_page
  page = fetch('http://tf2r.com/raffles.html')

  # All raffle links begin with 'tf2r.com/k'
  raffle_links = page.links_with(href: /tf2r\.com\/k/)
  raffle_links.map! { |x| x.uri.to_s }
  raffle_links.reverse!
end
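
The filter-and-reverse step can be tried without a network connection. This is a small sketch on plain Strings rather than Mechanize link objects; `active_raffle_links` is a hypothetical helper, and the input order (newest first) is an assumption inferred from the `reverse!` call above.

```ruby
# Keeps only hrefs that look like raffle links ('tf2r.com/k...') and
# reverses them so the oldest raffle comes first.
def active_raffle_links(hrefs)
  hrefs.grep(%r{tf2r\.com/k}).reverse
end

hrefs = [
  'http://tf2r.com/knewest.html',
  'http://tf2r.com/raffles.html',
  'http://tf2r.com/knew.html',
  'http://tf2r.com/user/76561198061719848.html',
  'http://tf2r.com/kold.html'
]

puts active_raffle_links(hrefs)
```

With this input the non-raffle links are dropped and the result lists `kold`, `knew`, `knewest`, in that order.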

#scrape_raffle(page) ⇒ `Array`

Scrapes a raffle for all available information.

Parameters:

page (Mechanize::Page) —

the raffle page.

Returns:

(Array) —
- the raffle Hash from scrape_raffle_for_raffle
- the creator Hash from scrape_raffle_for_creator

# File 'lib/tf2r/scraper.rb', line 74

def scrape_raffle(page)
  [scrape_raffle_for_raffle(page),
   scrape_raffle_for_creator(page)]
end
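
Since the result is a two-element Array, callers will probably want to destructure it. A brief sketch, assuming a result shaped like the documented return value; `summarize` is a hypothetical helper and the sample Hashes are trimmed versions of the examples on this page.

```ruby
# Splits the [raffle, creator] pair and builds a one-line summary.
def summarize(scrape_result)
  raffle, creator = scrape_result
  "#{raffle[:title]} by #{creator[:username]}"
end

# Stand-in for the Array returned by scrape_raffle.
sample_result = [
  { link_snippet: 'kstzcbd', title: 'Just one refined [1 hour]' },
  { steam_id: 76561198061719848, username: 'Yulli' }
]

puts summarize(sample_result)
```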

#scrape_raffle_for_creator(page) ⇒ `Hash`

Scrapes a raffle page for information about the creator.

Examples:

p = s.fetch('http://tf2r.com/kstzcbd.html')
s.scrape_raffle_for_creator(p) #=>
{:steam_id=>76561198061719848,
 :username=>"Yulli",
 :avatar_link=>"http://media.steampowered.com/steamcommunity/public/images/avatars/bc/bc9dc4302d23f2e2f37f59c59f29c27dbc8cade6_full.jpg",
 :posrep=>11458,
 :negrep=>0,
 :color=>"70b01b"}

Parameters:

page (Mechanize::Page) —

the raffle page.

Returns:

(Hash) —
a representation of a user, the raffle creator.
- :steam_id (Fixnum) — the creator’s SteamID64.
- :username (String) — the creator’s username.
- :avatar_link (String) — a link to the creator’s avatar.
- :posrep (Fixnum) — the creator’s positive rep.
- :negrep (Fixnum) — the creator’s negative rep.
- :color (String) — hex color code of the creator’s username.

# File 'lib/tf2r/scraper.rb', line 99

def scrape_raffle_for_creator(page)
  # Reag classed some things "raffle_infomation". That's spelled right.
  infos = page.parser.css('.raffle_infomation')

  # The main 'a' element, containing the creator's username.
  user_anchor = infos[2].css('a')[0]

  steam_id = extract_steam_id(user_anchor.attribute('href').to_s)
  username = user_anchor.text
  avatar_link = infos[1].css('img')[0].attribute('src').to_s
  posrep = /(\d+)/.match(infos.css('.upvb').text)[1].to_i
  negrep = /(\d+)/.match(infos.css('.downvb').text)[1].to_i

  # The creator's username color. Corresponds to rank.
  color = extract_color(user_anchor.attribute('style').to_s)

  {steam_id: steam_id, username: username, avatar_link: avatar_link,
   posrep: posrep, negrep: negrep, color: color}
end
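
One thing to note about the rep extraction above: `/(\d+)/.match(text)[1]` raises NoMethodError when the text contains no digits, because `match` returns nil. A possible nil-safe variant is sketched below; `first_integer` is a hypothetical helper, not part of TextHelpers, and the sample strings are invented since the exact text of the rep elements is not shown here.

```ruby
# Returns the first run of digits in the text as an Integer,
# or 0 when the text contains no digits at all.
def first_integer(text)
  text[/\d+/].to_i
end

puts first_integer('Rep: 11458')
puts first_integer('no digits here')
```

Returning 0 silently hides layout changes on the site, so raising may still be the better choice when a missing value should be treated as a scraping failure.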

#scrape_raffle_for_participants(page) ⇒ `Array`

Scrapes a raffle page for all the participants.

This should rarely be used. It is only necessary when a raffle's maximum entries exceed 2500.

Parameters:

page (Mechanize::Page) —

the raffle page.

Returns:

(Array) —
contains Hashes representing each of the participants, in chronological order (first entered to last).
- :steam_id (Fixnum) — the participant’s SteamID64.
- :username (String) — the participant’s username.
- :color (String) — hex color code of the participant’s username.

# File 'lib/tf2r/scraper.rb', line 174

def scrape_raffle_for_participants(page)
  participants = []
  participant_divs = page.parser.css('.pentry')
  participant_divs.each do |participant|
    user_anchor = participant.children[1]
    steam_id = extract_steam_id(user_anchor.to_s)
    username = participant.text
    color = extract_color(user_anchor.children[0].attribute('style'))

    participants << {steam_id: steam_id, username: username, color: color}
  end

  participants.reverse!
end

#scrape_raffle_for_raffle(page) ⇒ `Hash`

Scrapes a raffle page for some information about the raffle.

The information is incomplete. This should be used in conjunction with the API as part of TF2R::Raffle.

Examples:

p = s.fetch('http://tf2r.com/kstzcbd.html')
s.scrape_raffle_for_raffle(p) #=>
{:link_snippet=>"kstzcbd",
 :title=>"Just one refined [1 hour]",
 :description=>"Plain and simple.",
 :start_time=>2012-10-29 09:51:45 -0400,
 :end_time=>2012-10-29 09:53:01 -0400}

Parameters:

page (Mechanize::Page) —

the raffle page.

Returns:

(Hash) —
a partial representation of the raffle.
- :link_snippet (String) — the “raffle id” in the URL.
- :title (String) — the raffle’s title.
- :description (String) — the raffle’s “message”.
- :start_time (Time) — the creation time of the raffle.
- :end_time (Time) — the projected/observed end time for the raffle.

# File 'lib/tf2r/scraper.rb', line 140

def scrape_raffle_for_raffle(page)
  # Reag classed some things "raffle_infomation". That's spelled right.
  infos = page.parser.css('.raffle_infomation')

  # Elements of the main raffle info table.
  raffle_tds = infos[3].css('td')

  # 'kabc123' for 'http://tf2r.com/kabc123.html'
  link_snippet = extract_link_snippet(page.uri.path)
  title = extract_title(infos[0].text)
  description = raffle_tds[1].text

  # This doesn't work right now, because Miz just displays "10%" in the
  # page HTML and updates it with JS after a call to the API.
  # win_chance = /(.+)%/.match(infos.css('#winc').text)[1].to_f / 100

  start_time = extract_start_time(raffle_tds[9])
  end_time = extract_end_time(raffle_tds[11])

  {link_snippet: link_snippet, title: title, description: description,
   start_time: start_time, end_time: end_time}
end
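
The `:link_snippet` value is the file name of the raffle URL without its extension. The implementation of `extract_link_snippet` lives in TextHelpers and is not shown on this page, so the following is only one way the same result could be obtained with the standard library; `link_snippet_from` is a hypothetical name.

```ruby
require 'uri'

# Takes the path of the URL and strips the directory part and '.html'.
def link_snippet_from(url)
  File.basename(URI.parse(url).path, '.html')
end

puts link_snippet_from('http://tf2r.com/kstzcbd.html')
```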

#scrape_ranks(info_page) ⇒ `Array`

Scrapes the TF2R info page for available user ranks.

See tf2r.com/info.html.

Examples:

p = s.fetch('http://tf2r.com/info.html')
s.scrape_ranks(p) #=>
[{:color=>"ebe2ca", :name=>"User",
  :description=>"Every new or existing user has this rank."}, ...]

Parameters:

info_page (Mechanize::Page) —

the info page.

Returns:

(Array) —
contains Hashes representing each of the ranks.
- :name (String) — the rank’s name.
- :description (String) — the rank’s description.
- :color (String) — the rank’s hex color code.

# File 'lib/tf2r/scraper.rb', line 245

def scrape_ranks(info_page)
  rank_divs = info_page.parser.css('#ranks').children
  ranks = rank_divs.select { |div| div.children.size == 3 }
  ranks.map { |div| extract_rank(div) }
end

#scrape_user(user_page) ⇒ `Hash`

Scrapes a user page for information about the user.

Examples:

p = s.fetch('http://tf2r.com/user/76561198061719848.html')
s.scrape_user(p) #=>
{:steam_id=>76561198061719848,
 :username=>"Yulli",
 :avatar_link=>"http://media.steampowered.com/steamcommunity/public/images/avatars/bc/bc9dc4302d23f2e2f37f59c59f29c27dbc8cade6_full.jpg",
 :posrep=>11459,
 :negrep=>0,
 :color=>"70b01b"}

Parameters:

user_page (Mechanize::Page) —

the user page.

Returns:

(Hash) —
a representation of the user.
- :steam_id (Fixnum) — the user’s SteamID64.
- :username (String) — the user’s username.
- :avatar_link (String) — a link to the user’s avatar.
- :posrep (Fixnum) — the user’s positive rep.
- :negrep (Fixnum) — the user’s negative rep.
- :color (String) — hex color code of the user’s username.

# File 'lib/tf2r/scraper.rb', line 209

def scrape_user(user_page)
  if user_page.parser.css('.profile_info').empty?
    raise InvalidUserPage, 'The given page does not correspond to any user.'
  else
    infos = user_page.parser.css('.raffle_infomation') #sic
    user_anchor = infos[1].css('a')[0]

    steam_id = extract_steam_id(user_page.uri.to_s)
    username = /TF2R Item Raffles - (.+)/.match(user_page.title)[1]
    avatar_link = infos[0].css('img')[0].attribute('src').to_s

    posrep = infos.css('.upvb').text.to_i
    negrep = infos.css('.downvb').text.to_i

    color = extract_color(user_anchor.attribute('style').to_s)
  end

  {steam_id: steam_id, username: username, avatar_link: avatar_link,
   posrep: posrep, negrep: negrep, color: color}
end
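
Callers that scrape arbitrary user URLs will likely want to rescue InvalidUserPage. The sketch below illustrates the rescue pattern together with the title-based username extraction. It is an assumption-laden stand-in: the `InvalidUserPage` class here is a local substitute (the real one is TF2R::Scraper::InvalidUserPage, whose superclass is not shown on this page), `username_from_title` and `safe_username` are hypothetical helpers, and unlike the library, this sketch raises when the title does not match.

```ruby
# Local stand-in for TF2R::Scraper::InvalidUserPage.
class InvalidUserPage < StandardError; end

# Pulls the username out of a page title such as
# 'TF2R Item Raffles - Yulli'; raises when the title has another shape.
def username_from_title(title)
  match = /TF2R Item Raffles - (.+)/.match(title)
  raise InvalidUserPage, 'The given page does not correspond to any user.' if match.nil?

  match[1]
end

# Returns nil instead of raising for pages that are not user pages.
def safe_username(title)
  username_from_title(title)
rescue InvalidUserPage
  nil
end

puts safe_username('TF2R Item Raffles - Yulli')
puts safe_username('Some other page').inspect
```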
