Class: Statement::Scraper
- Inherits: Object
  - Object
  - Statement::Scraper
- Defined in: lib/statement/scraper.rb
Class Method Summary
- .backfill_bilirakis ⇒ Object
- .backfill_boustany ⇒ Object
- .backfill_from_scrapers ⇒ Object
- .barton ⇒ Object
- .billnelson(year = 2013) ⇒ Object
- .boxer(start = 1) ⇒ Object
- .capuano ⇒ Object
- .chabot(year = Date.today.year) ⇒ Object
- .clark(year = Date.today.year) ⇒ Object
- .coburn(year = Date.today.year) ⇒ Object
- .cold_fusion(year = Date.today.year, month = nil) ⇒ Object
- .committee_methods ⇒ Object
- .committee_scrapers ⇒ Object
- .conaway(page = 1) ⇒ Object
- .costa ⇒ Object
- .crapo ⇒ Object
- .crenshaw(year = Date.today.year, month = nil) ⇒ Object
- .culberson_chabot_grisham(page = 1) ⇒ Object
- .document_query(page = 1) ⇒ Object
- .donnelly(year = Date.today.year) ⇒ Object
  deprecated.
- .edwards ⇒ Object
- .ellison(page = 0) ⇒ Object
- .farr(year = 2014) ⇒ Object
- .fischer(year = Date.today.year) ⇒ Object
- .gabbard ⇒ Object
- .house_energy_minority ⇒ Object
- .house_gop(url) ⇒ Object
- .house_homeland_security_minority ⇒ Object
- .house_judiciary_majority ⇒ Object
- .house_rules_majority ⇒ Object
- .house_ways_means_majority ⇒ Object
- .inhofe(year = Date.today.year) ⇒ Object
- .klobuchar(year) ⇒ Object
- .lautenberg(rows = 1000) ⇒ Object
  fetches the latest 1000 releases, can be altered.
- .lujan ⇒ Object
- .mcclintock ⇒ Object
- .mcnerney(page = 1) ⇒ Object
- .member_methods ⇒ Object
- .member_scrapers ⇒ Object
- .olson(year = 2014) ⇒ Object
- .open_html(url) ⇒ Object
- .palazzo(page = 1) ⇒ Object
- .roe(page = 1) ⇒ Object
- .senate_aging ⇒ Object
- .senate_approps_majority ⇒ Object
  special cases for committees without RSS feeds.
- .senate_approps_minority ⇒ Object
- .senate_banking(year = Date.today.year) ⇒ Object
- .senate_hsag_majority(year = Date.today.year) ⇒ Object
- .senate_hsag_minority(year = Date.today.year) ⇒ Object
- .senate_indian ⇒ Object
- .senate_intel(congress = 113, start_year = 2013, end_year = 2014) ⇒ Object
- .senate_smallbiz_minority ⇒ Object
- .sessions(year = Date.today.year) ⇒ Object
- .sherman_mccaul(page = 0) ⇒ Object
- .susandavis ⇒ Object
- .swalwell(page = 1) ⇒ Object
  special cases for members without RSS feeds.
- .vitter(year = Date.today.year) ⇒ Object
- .welch ⇒ Object
Class Method Details
.backfill_bilirakis ⇒ Object
# File 'lib/statement/scraper.rb', line 712
def self.backfill_bilirakis
  results = []
  domain = 'bilirakis.house.gov'
  url = 'http://bilirakis.house.gov/press-releases/'
  doc = open_html(url)
  return if doc.nil?
  doc.css("ul li[@class='article articleright']").each do |row|
    results << {:source => url, :url => 'http://bilirakis.house.gov' + row.children[3].children[1]['href'], :title => row.children[3].text.strip, :date => Date.parse(row.children[5].text), :domain => domain }
  end
  results # return the collected releases, as the other scrapers do, rather than the value of #each
end
.backfill_boustany ⇒ Object
# File 'lib/statement/scraper.rb', line 723
def self.backfill_boustany
  results = []
  domain = 'boustany.house.gov'
  url = 'http://boustany.house.gov/113th-congress/showallitems/'
  doc = open_html(url)
  return if doc.nil?
end
.backfill_from_scrapers ⇒ Object
# File 'lib/statement/scraper.rb', line 49
def self.backfill_from_scrapers
  results = [cold_fusion(2012, 0), cold_fusion(2011, 0), cold_fusion(2010, 0), billnelson(year=2012), document_query(page=3), document_query(page=4),
    coburn(year=2012), coburn(year=2011), coburn(year=2010), boxer(start=11), boxer(start=21), boxer(start=31), boxer(start=41),
    vitter(year=2012), vitter(year=2011), swalwell(page=2), swalwell(page=3), clark(year=2013), culberson_chabot_grisham(page=2),
    sherman_mccaul(page=1), sessions(year=2013), pryor(page=1), ellison(page=1), ellison(page=2), ellison(page=3), farr(year=2013), farr(year=2012), farr(year=2011),
    mcnerney(page=2), mcnerney(page=3), mcnerney(page=4), mcnerney(page=5), mcnerney(page=6), olson(year=2013)].flatten
  Utils.remove_generic_urls!(results)
end
.barton ⇒ Object
# File 'lib/statement/scraper.rb', line 570
def self.barton
  results = []
  domain = 'joebarton.house.gov'
  url = "http://joebarton.house.gov/press-releasescolumns/"
  doc = open_html(url)
  return if doc.nil?
  (doc/:h3)[0..-3].each do |row|
    results << { :source => url, :url => "http://joebarton.house.gov/"+row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.next.next.text), :domain => domain}
  end
  results
end
.billnelson(year = 2013) ⇒ Object
# File 'lib/statement/scraper.rb', line 376
def self.billnelson(year=2013)
  results = []
  base_url = "http://www.billnelson.senate.gov/news/"
  year_url = base_url + "media.cfm?year=#{year}"
  doc = open_html(year_url)
  return if doc.nil?
  doc.xpath('//li').each do |row|
    results << { :source => year_url, :url => base_url + row.children[0]['href'], :title => row.children[0].text.strip, :date => Date.parse(row.children.last.text), :domain => "billnelson.senate.gov" }
  end
  results
end
.boxer(start = 1) ⇒ Object
# File 'lib/statement/scraper.rb', line 437
def self.boxer(start=1)
  results = []
  url = "http://www.boxer.senate.gov/en/press/releases.cfm?start=#{start}"
  domain = 'www.boxer.senate.gov'
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//div[@class='left']")[1..-1].each do |row|
    results << { :source => url, :url => domain + row.next.next.children[1].children[0]['href'], :title => row.next.next.children[1].children[0].text, :date => Date.parse(row.text.strip), :domain => domain}
  end
  results
end
.capuano ⇒ Object
# File 'lib/statement/scraper.rb', line 244
def self.capuano
  results = []
  base_url = "http://www.house.gov/capuano/news/"
  list_url = base_url + 'date.shtml'
  doc = open_html(list_url)
  return if doc.nil?
  doc.xpath("//a").select{|l| !l['href'].nil? and l['href'].include?('/pr')}[1..-5].each do |link|
    begin
      year = link['href'].split('/').first
      date = Date.parse(link.text.split(' ').first+'/'+year)
    rescue
      date = nil
    end
    results << { :source => list_url, :url => base_url + link['href'], :title => link.text.split(' ',2).last, :date => date, :domain => "www.house.gov/capuano/" }
  end
  return results[0..-5]
end
.chabot(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 324
def self.chabot(year=Date.today.year)
  results = []
  base_url = "http://chabot.house.gov/news/"
  url = base_url + "documentquery.aspx?DocumentTypeID=2508&Year=#{year}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//li")[40..48].each do |row|
    next if not row.text.include?('Posted')
    results << { :source => url, :url => base_url + row.children[1]['href'], :title => row.children[1].children.text.strip, :date => Date.parse(row.children[3].text.strip), :domain => "chabot.house.gov" }
  end
  results
end
.clark(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 515
def self.clark(year=Date.today.year)
  results = []
  domain = 'katherineclark.house.gov'
  url = "http://katherineclark.house.gov/index.cfm/press-releases?MonthDisplay=0&YearDisplay=#{year}"
  doc = open_html(url)
  return if doc.nil?
  (doc/:tr)[1..-1].each do |row|
    next if row.children[1].text.strip == 'Date'
    results << { :source => url, :date => Date.parse(row.children[1].text.strip), :title => row.children[3].children.text, :url => row.children[3].children[0]['href'], :domain => domain}
  end
  results
end
.coburn(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 425
def self.coburn(year=Date.today.year)
  results = []
  url = "http://www.coburn.senate.gov/public/index.cfm?p=PressReleases&ContentType_id=d741b7a7-7863-4223-9904-8cb9378aa03a&Group_id=7a55cb96-4639-4dac-8c0c-99a4a227bd3a&MonthDisplay=0&YearDisplay=#{year}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr")[2..-1].each do |row|
    next if row.text.strip[0..3] == "Date"
    results << { :source => url, :url => row.children[3].children[0]['href'], :title => row.children[3].text.strip, :date => Date.strptime(row.children[1].text.strip, "%m/%d/%y"), :domain => "coburn.senate.gov" }
  end
  results
end
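Several of the Senate scrapers on this page (.coburn, .fischer, .inhofe, .klobuchar, .lautenberg, .donnelly) read dates with Date.strptime and the format "%m/%d/%y" instead of Date.parse. A minimal sketch of that step, assuming the table cell holds text such as 01/15/13; the helper name and the sample string are hypothetical and not part of Statement::Scraper:

```ruby
require 'date'

# Hypothetical helper, not part of Statement::Scraper.
# Parses a two-digit-year US-style date such as "01/15/13".
def parse_release_date(text)
  Date.strptime(text.strip, "%m/%d/%y")
end

puts parse_release_date(" 01/15/13 ")   # prints 2013-01-15
```

Date.parse has to guess the field order of an all-numeric string, so an explicit format removes that ambiguity when the page layout is known.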
.cold_fusion(year = Date.today.year, month = nil) ⇒ Object
# File 'lib/statement/scraper.rb', line 282
def self.cold_fusion(year=Date.today.year, month=nil)
  results = []
  year = Date.today.year if not year
  domains = ['www.ronjohnson.senate.gov/public/','www.risch.senate.gov/public/']
  domains.each do |domain|
    if domain == 'www.risch.senate.gov/public/'
      if not month
        url = "http://www.risch.senate.gov/public/index.cfm/pressreleases"
      else
        url = "http://www.risch.senate.gov/public/index.cfm/pressreleases?YearDisplay=#{year}&MonthDisplay=#{month}&page=1"
      end
    else
      if not month
        url = "http://www.ronjohnson.senate.gov/public/index.cfm/press-releases"
      else
        url = "http://www.ronjohnson.senate.gov/public/index.cfm/press-releases?YearDisplay=#{year}&MonthDisplay=#{month}&page=1"
      end
    end
    doc = Statement::Scraper.open_html(url)
    return if doc.nil?
    doc.xpath("//tr")[2..-1].each do |row|
      date_text, title = row.children.map{|c| c.text.strip}.reject{|c| c.empty?}
      next if date_text == 'Date' or date_text.size > 10
      date = Date.parse(date_text)
      results << { :source => url, :url => row.children[3].children.first['href'], :title => title, :date => date, :domain => domain }
    end
  end
  results.flatten
end
.committee_methods ⇒ Object
# File 'lib/statement/scraper.rb', line 35
def self.committee_methods
  [:senate_approps_majority, :senate_approps_minority, :senate_banking, :senate_hsag_majority, :senate_hsag_minority, :senate_indian, :senate_aging, :senate_smallbiz_minority, :senate_intel, :house_energy_minority, :house_homeland_security_minority, :house_judiciary_majority, :house_rules_majority, :house_ways_means_majority]
end
.committee_scrapers ⇒ Object
# File 'lib/statement/scraper.rb', line 58
def self.committee_scrapers
  year = Date.today.year
  results = [senate_approps_majority, senate_approps_minority, senate_banking(year), senate_hsag_majority(year), senate_hsag_minority(year),
    senate_indian, senate_aging, senate_smallbiz_minority, senate_intel(113, 2013, 2014), house_energy_minority,
    house_homeland_security_minority, house_judiciary_majority, house_rules_majority, house_ways_means_majority].flatten
  Utils.remove_generic_urls!(results)
end
.conaway(page = 1) ⇒ Object
# File 'lib/statement/scraper.rb', line 312
def self.conaway(page=1)
  results = []
  base_url = "http://conaway.house.gov/news/"
  page_url = base_url + "documentquery.aspx?DocumentTypeID=1279&Page=#{page}"
  doc = open_html(page_url)
  return if doc.nil?
  doc.xpath("//li")[41..50].each do |row|
    results << { :source => page_url, :url => base_url + row.children[1]['href'], :title => row.children[1].children.text.strip, :date => Date.parse(row.children[3].text.strip), :domain => "conaway.house.gov" }
  end
  results
end
.costa ⇒ Object
# File 'lib/statement/scraper.rb', line 635
def self.costa
  results = []
  domain = 'costa.house.gov'
  url = "http://costa.house.gov/index.php/newsroom30/press-releases12"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//div[@class='nspArt']").each do |row|
    results << { :source => url, :url => "http://costa.house.gov" + row.children[0].children[1].children[0]['href'], :title => row.children[0].children[1].children[0].text.strip, :date => Date.parse(row.children[0].children[0].text), :domain => domain}
  end
  results
end
.crapo ⇒ Object
# File 'lib/statement/scraper.rb', line 401
def self.crapo
  results = []
  base_url = "http://www.crapo.senate.gov/media/newsreleases/"
  url = base_url + "release_all.cfm"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr").each do |row|
    results << { :source => url, :url => base_url + row.children[3].children[0]['href'], :title => row.children[3].text.strip, :date => Date.parse(row.children[1].text.strip.gsub('-','/')), :domain => "crapo.senate.gov" }
  end
  results
end
.crenshaw(year = Date.today.year, month = nil) ⇒ Object
# File 'lib/statement/scraper.rb', line 262
def self.crenshaw(year=Date.today.year, month=nil)
  results = []
  year = Date.today.year if not year
  domain = 'crenshaw.house.gov'
  if month
    url = "http://crenshaw.house.gov/index.cfm/pressreleases?YearDisplay=#{year}&MonthDisplay=#{month}&page=1"
  else
    url = "http://crenshaw.house.gov/index.cfm/pressreleases"
  end
  doc = Statement::Scraper.open_html(url)
  return if doc.nil?
  doc.xpath("//tr")[2..-1].each do |row|
    date_text, title = row.children.map{|c| c.text.strip}.reject{|c| c.empty?}
    next if date_text == 'Date' or date_text.size > 10
    date = Date.parse(date_text)
    results << { :source => url, :url => row.children[3].children.first['href'], :title => title, :date => date, :domain => domain }
  end
  results
end
.culberson_chabot_grisham(page = 1) ⇒ Object
# File 'lib/statement/scraper.rb', line 554
def self.culberson_chabot_grisham(page=1)
  results = []
  domains = [{'culberson.house.gov' => 2573}, {'chabot.house.gov' => 2508}, {'lujangrisham.house.gov' => 2447}]
  domains.each do |domain|
    doc = open_html("http://"+domain.keys.first+"/news/documentquery.aspx?DocumentTypeID=#{domain.values.first}&Page=#{page}")
    return if doc.nil?
    doc.css('ul.UnorderedNewsList li').each do |row|
      link = "http://"+domain.keys.first+"/news/" + row.children[1]['href']
      title = row.children[1].text.strip
      date = Date.parse(row.children[3].text.strip)
      results << { :source => "http://"+domain.keys.first+"/news/"+"documentquery.aspx?DocumentTypeID=#{domain.values.first}&Page=#{page}", :title => title, :url => link, :date => date, :domain => domain.keys.first }
    end
  end
  results.flatten
end
.document_query(page = 1) ⇒ Object
# File 'lib/statement/scraper.rb', line 699
def self.document_query(page=1)
  results = []
  domains = [{"thornberry.house.gov" => 1776}, {"wenstrup.house.gov" => 2491}, {"clawson.house.gov" => 2641}]
  domains.each do |domain|
    doc = open_html("http://"+domain.keys.first+"/news/documentquery.aspx?DocumentTypeID=#{domain.values.first}&Page=#{page}")
    return if doc.nil?
    doc.xpath("//div[@class='middlecopy']//li").each do |row|
      results << { :source => "http://"+domain.keys.first+"/news/"+"documentquery.aspx?DocumentTypeID=#{domain.values.first}&Page=#{page}", :url => "http://"+domain.keys.first+"/news/" + row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.children[3].text.strip), :domain => domain.keys.first }
    end
  end
  results.flatten
end
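.document_query and .culberson_chabot_grisham both build the same documentquery.aspx URL from a host name and a DocumentTypeID. The sketch below isolates that URL pattern, using the host/ID pairs listed in .document_query. It keeps the pairs in one hash rather than an array of single-pair hashes; that layout, the constant, and the helper name are illustrative assumptions, not the gem's code:

```ruby
# Hypothetical constant and helper, not part of Statement::Scraper.
# Host => DocumentTypeID pairs copied from .document_query.
DOCUMENT_QUERY_SITES = {
  "thornberry.house.gov" => 1776,
  "wenstrup.house.gov"   => 2491,
  "clawson.house.gov"    => 2641
}

# Builds one listing URL per site for the requested results page.
def document_query_urls(page = 1)
  DOCUMENT_QUERY_SITES.map do |host, type_id|
    "http://#{host}/news/documentquery.aspx?DocumentTypeID=#{type_id}&Page=#{page}"
  end
end

puts document_query_urls(2)
```

Note that in the listing above, `return if doc.nil?` sits inside the `domains.each` block, so one site failing to load makes the whole method return nil; `next` would skip only that site.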
.donnelly(year = Date.today.year) ⇒ Object
deprecated
# File 'lib/statement/scraper.rb', line 463
def self.donnelly(year=Date.today.year)
  results = []
  url = "http://www.donnelly.senate.gov/newsroom/"
  domain = "www.donnelly.senate.gov"
  doc = open_html(url+"press?year=#{year}")
  return if doc.nil?
  doc.xpath("//tr")[1..-1].each do |row|
    next if row.text.strip.size < 30
    results << { :source => url, :url => "http://www.donnelly.senate.gov"+row.children[3].children[1]['href'].strip, :title => row.children[3].text.strip, :date => Date.strptime(row.children[1].text, "%m/%d/%y"), :domain => domain}
  end
  results
end
.edwards ⇒ Object
# File 'lib/statement/scraper.rb', line 541
def self.edwards
  results = []
  domain = 'donnaedwards.house.gov'
  url = "http://donnaedwards.house.gov/index.php?option=com_content&view=category&id=10&Itemid=18"
  doc = open_html(url)
  return if doc.nil?
  table = (doc/:table)[4]
  (table/:tr).each do |row|
    results << { :source => url, :url => "http://donnaedwards.house.gov/"+row.children.children[1]['href'], :title => row.children.children[1].text.strip, :date => Date.parse(row.children.children[3].text.strip), :domain => domain}
  end
  results
end
.ellison(page = 0) ⇒ Object
# File 'lib/statement/scraper.rb', line 622
def self.ellison(page=0)
  results = []
  domain = 'ellison.house.gov'
  url = "http://ellison.house.gov/media-center/press-releases?page=#{page}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//div[@class='views-field views-field-created datebar']").each do |row|
    next if row.nil?
    results << { :source => url, :url => "http://ellison.house.gov" + row.next.next.children[1].children[0]['href'], :title => row.next.next.children[1].children[0].text.strip, :date => Date.parse(row.text.strip), :domain => domain}
  end
  results
end
.farr(year = 2014) ⇒ Object
# File 'lib/statement/scraper.rb', line 647
def self.farr(year=2014)
  results = []
  domain = 'www.farr.house.gov'
  if year == 2014
    url = "http://www.farr.house.gov/index.php/newsroom/press-releases"
  else
    url = "http://www.farr.house.gov/index.php/newsroom/press-releases-archive/#{year.to_s}-press-releases"
  end
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr[@class='cat-list-row0']").each do |row|
    results << { :source => url, :url => "http://farr.house.gov" + row.children[1].children[1]['href'], :title => row.children[1].children[1].text.strip, :date => Date.parse(row.children[3].text.strip), :domain => domain}
  end
  results
end
.fischer(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 413
def self.fischer(year=Date.today.year)
  results = []
  url = "http://www.fischer.senate.gov/public/index.cfm/press-releases?MonthDisplay=0&YearDisplay=#{year}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr")[2..-1].each do |row|
    next if row.text.strip[0..3] == "Date"
    results << { :source => url, :url => row.children[3].children[0]['href'], :title => row.children[3].text.strip, :date => Date.strptime(row.children[1].text.strip, "%m/%d/%y"), :domain => "fischer.senate.gov" }
  end
  results
end
.gabbard ⇒ Object
# File 'lib/statement/scraper.rb', line 610
def self.gabbard
  results = []
  domain = 'gabbard.house.gov'
  url = "http://gabbard.house.gov/index.php/news/press-releases"
  doc = open_html(url)
  return if doc.nil?
  doc.css('ul.fc_leading li').each do |row|
    results << {:source => url, :url => "http://gabbard.house.gov"+row.children[0].children[1]['href'], :title => row.children[0].children[1].text.strip, :date => Date.parse(row.children[2].text), :domain => domain}
  end
  results
end
.house_energy_minority ⇒ Object
# File 'lib/statement/scraper.rb', line 173
def self.house_energy_minority
  results = []
  url = "http://democrats.energycommerce.house.gov/index.php?q=news-releases"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//div[@class='views-field-title']").each do |row|
    results << { :source => url, :url => "http://democrats.energycommerce.house.gov"+row.children[1].children[0]['href'], :title => row.children[1].children[0].text, :date => Date.parse(row.next.next.text.strip), :domain => "http://energycommerce.house.gov/", :party => 'minority' }
  end
  results
end
.house_gop(url) ⇒ Object
# File 'lib/statement/scraper.rb', line 18
def self.house_gop(url)
  doc = open_html(url)
  return unless doc
  uri = URI.parse(url)
  date = Date.parse(uri.query.split('=').last)
  links = doc.xpath("//ul[@id='membernews']").search('a')
  results = links.map do |link|
    abs_link = Utils.absolute_link(url, link["href"])
    { :source => url, :url => abs_link, :title => link.text.strip, :date => date, :domain => URI.parse(link["href"]).host }
  end
  Utils.remove_generic_urls!(results)
end
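.house_gop takes the release date from the URL itself: it parses the value after the last `=` in the query string and applies that date to every link on the page. A minimal sketch of that step in isolation; the helper name and the example.com URL are made up for illustration, and the nil guard is an addition that .house_gop does not have:

```ruby
require 'uri'
require 'date'

# Hypothetical helper, not part of Statement::Scraper.
# Returns the date encoded as the last query value, or nil without a query.
def date_from_query(url)
  query = URI.parse(url).query
  return nil if query.nil?            # .house_gop itself does not guard this case
  Date.parse(query.split('=').last)
end

puts date_from_query("http://example.com/news?date=2013-03-29")   # prints 2013-03-29
```

A URL passed to .house_gop therefore needs a query string whose final value is a parseable date.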
.house_homeland_security_minority ⇒ Object
# File 'lib/statement/scraper.rb', line 184
def self.house_homeland_security_minority
  results = []
  url = "http://chsdemocrats.house.gov/press/index.asp?subsection=1"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//li[@class='article']").each do |row|
    results << { :source => url, :url => "http://chsdemocrats.house.gov"+row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.children[3].text), :domain => "http://chsdemocrats.house.gov/", :party => 'minority' }
  end
  results
end
.house_judiciary_majority ⇒ Object
# File 'lib/statement/scraper.rb', line 195
def self.house_judiciary_majority
  results = []
  url = "http://judiciary.house.gov/news/press2013.html"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//p")[3..60].each do |row|
    next if row.text.size < 30
    results << { :source => url, :url => row.children[5]['href'], :title => row.children[0].text, :date => Date.parse(row.children[1].text.strip), :domain => "http://judiciary.house.gov/", :party => 'majority' }
  end
  results
end
.house_rules_majority ⇒ Object
# File 'lib/statement/scraper.rb', line 207
def self.house_rules_majority
  results = []
  url = "http://www.rules.house.gov/News/Default.aspx"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr")[1..-2].each do |row|
    next if row.text.strip.size < 30
    results << { :source => url, :url => "http://www.rules.house.gov/News/"+row.children[0].children[1].children[0]['href'], :title => row.children[0].children[1].children[0].text, :date => Date.parse(row.children[2].children[1].text.strip), :domain => "http://www.rules.house.gov/", :party => 'majority' }
  end
  results
end
.house_ways_means_majority ⇒ Object
# File 'lib/statement/scraper.rb', line 219
def self.house_ways_means_majority
  results = []
  url = "http://waysandmeans.house.gov/news/documentquery.aspx?DocumentTypeID=1496"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//ul[@class='UnorderedNewsList']").children.each do |row|
    next if row.text.strip.size < 10
    results << { :source => url, :url => "http://waysandmeans.house.gov"+row.children[1].children[1]['href'], :title => row.children[1].children[1].text, :date => Date.parse(row.children[3].children[0].text.strip), :domain => "http://waysandmeans.house.gov/", :party => 'majority' }
  end
  results
end
.inhofe(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 476
def self.inhofe(year=Date.today.year)
  results = []
  url = "http://www.inhofe.senate.gov/newsroom/press-releases?year=#{year}"
  domain = "www.inhofe.senate.gov"
  doc = open_html(url)
  return if doc.nil?
  if doc.xpath("//tr")[1..-1]
    doc.xpath("//tr")[1..-1].each do |row|
      next if row.text.strip.size < 30
      results << { :source => url, :url => row.children[3].children[0]['href'].strip, :title => row.children[3].text, :date => Date.strptime(row.children[1].text, "%m/%d/%y"), :domain => domain}
    end
  end
  results
end
.klobuchar(year) ⇒ Object
# File 'lib/statement/scraper.rb', line 349
def self.klobuchar(year)
  results = []
  base_url = "http://www.klobuchar.senate.gov/"
  [year.to_i-1,year.to_i].each do |year|
    year_url = base_url + "public/news-releases?MonthDisplay=0&YearDisplay=#{year}"
    doc = open_html(year_url)
    return if doc.nil?
    doc.xpath("//tr")[1..-1].each do |row|
      next if row.children[3].children[0].text.strip == 'Title'
      results << { :source => year_url, :url => row.children[3].children[0]['href'], :title => row.children[3].children[0].text.strip, :date => Date.strptime(row.children[1].text, "%m/%d/%y"), :domain => "klobuchar.senate.gov" }
    end
  end
  results
end
.lautenberg(rows = 1000) ⇒ Object
fetches the latest 1000 releases, can be altered
# File 'lib/statement/scraper.rb', line 389
def self.lautenberg(rows=1000)
  results = []
  base_url = 'http://www.lautenberg.senate.gov/newsroom/'
  url = base_url + "releases.cfm?maxrows=#{rows}&startrow=1&&type=1"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr")[4..-2].each do |row|
    results << { :source => url, :url => base_url + row.children[2].children[0]['href'], :title => row.children[2].text.strip, :date => Date.strptime(row.children[0].text.strip, "%m/%d/%y"), :domain => "lautenberg.senate.gov" }
  end
  results
end
.lujan ⇒ Object
# File 'lib/statement/scraper.rb', line 364
def self.lujan
  results = []
  base_url = 'http://lujan.house.gov/'
  doc = open_html(base_url+'index.php?option=com_content&view=article&id=981&Itemid=78')
  return if doc.nil?
  doc.xpath('//ul')[1].children.each do |row|
    next if row.text.strip == ''
    results << { :source => base_url+'index.php?option=com_content&view=article&id=981&Itemid=78', :url => base_url + row.children[0]['href'], :title => row.children[0].text, :date => nil, :domain => "lujan.house.gov" }
  end
  results
end
.mcclintock ⇒ Object
# File 'lib/statement/scraper.rb', line 663
def self.mcclintock
  results = []
  domain = 'mcclintock.house.gov'
  url = "http://mcclintock.house.gov/press-all.shtml"
  doc = open_html(url)
  return if doc.nil?
  doc.css("ul li").first(152).each do |row|
    results << { :source => url, :url => row.children[0].children[1]['href'], :title => row.children[0].children[1].text.strip, :date => Date.parse(row.children[0].children[0].text), :domain => domain}
  end
  results
end
.mcnerney(page = 1) ⇒ Object
# File 'lib/statement/scraper.rb', line 687
def self.mcnerney(page=1)
  results = []
  domain = 'mcnerney.house.gov'
  url = "http://mcnerney.house.gov/media-center/press-releases"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//div[@class='views-field views-field-title']").each do |row|
    results << {:source => url, :url => 'http://mcnerney.house.gov' + row.children[1].children[0]['href'], :title => row.children[1].children[0].text.strip, :date => Date.parse(row.next.next.text.strip), :domain => domain }
  end
  results
end
.member_methods ⇒ Object
# File 'lib/statement/scraper.rb', line 31
def self.member_methods
  [:crenshaw, :capuano, :cold_fusion, :conaway, :chabot, :susandavis, :freshman_senators, :klobuchar, :billnelson, :crapo, :boxer, :vitter, :inhofe, :palazzo, :roe, :document_query, :swalwell, :fischer, :clark, :edwards, :culberson_chabot_grisham, :barton, :sherman_mccaul, :welch, :sessions, :gabbard, :ellison, :costa, :farr, :mcclintock, :mcnerney, :olson]
end
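.member_methods returns scraper names as symbols, and at least one listed name (:freshman_senators) has no entry in the method summary above. A minimal sketch of checking which names are actually callable before dispatching on them; DemoScraper and callable_scrapers are hypothetical stand-ins so the example runs without the gem or network access:

```ruby
# Hypothetical stand-in for Statement::Scraper.
module DemoScraper
  def self.member_methods
    [:alpha, :missing_scraper]   # :missing_scraper has no method defined
  end

  def self.alpha
    [{ :title => "A" }]
  end
end

# Keeps only the names the object actually responds to.
def callable_scrapers(scraper)
  scraper.member_methods.select { |name| scraper.respond_to?(name) }
end

p callable_scrapers(DemoScraper)   # prints [:alpha]
```

On the real class, a name-based call would also have to supply arguments where they are required; .klobuchar(year), for instance, has no default.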
.member_scrapers ⇒ Object
# File 'lib/statement/scraper.rb', line 39
def self.member_scrapers
  year = Date.today.year
  results = [crenshaw, capuano, cold_fusion(year, nil), conaway, chabot, susandavis, klobuchar(year), palazzo(page=1), roe(page=1), billnelson(year=year),
    document_query(page=1), document_query(page=2), swalwell(page=1), crapo, coburn, boxer(start=1), vitter(year=year), inhofe(year=2014), fischer,
    clark(year=year), edwards, culberson_chabot_grisham(page=1), barton, sherman_mccaul, welch, sessions(year=year), gabbard, ellison(page=0), costa, farr,
    mcclintock, olson, mcnerney].flatten
  results = results.compact
  Utils.remove_generic_urls!(results)
end
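Each scraper returns either an array of hashes or nil (when open_html fails), which is why .member_scrapers calls flatten and then compact before cleaning URLs. A minimal sketch of that combining step with made-up data; the helper name is hypothetical:

```ruby
# Hypothetical helper, not part of Statement::Scraper.
# Merges per-scraper results and drops the nils left by failed fetches.
def combine(*scraper_results)
  scraper_results.flatten.compact
end

ok     = [{ :title => "First" }, { :title => "Second" }]
failed = nil   # what a scraper returns when the page could not be loaded

p combine(ok, failed).size   # prints 2
```

.backfill_from_scrapers and .committee_scrapers, as listed here, flatten without compacting, so a nil from a failed scraper would stay in their results.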
.olson(year = 2014) ⇒ Object
# File 'lib/statement/scraper.rb', line 675
def self.olson(year=2014)
  results = []
  domain = 'olson.house.gov'
  url = "http://olson.house.gov/#{year}-press-releases/"
  doc = open_html(url)
  return if doc.nil?
  (doc/:h3).each do |row|
    results << {:source => url, :url => 'http://olson.house.gov' + row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.next.next.text), :domain => domain }
  end
  results
end
.open_html(url) ⇒ Object
# File 'lib/statement/scraper.rb', line 10
def self.open_html(url)
  begin
    Nokogiri::HTML(open(url).read)
  rescue
    nil
  end
end
.palazzo(page = 1) ⇒ Object
# File 'lib/statement/scraper.rb', line 491
def self.palazzo(page=1)
  results = []
  domain = "palazzo.house.gov"
  url = "http://palazzo.house.gov/news/documentquery.aspx?DocumentTypeID=2519&Page=#{page}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//div[@class='middlecopy']//li").each do |row|
    results << { :source => url, :url => "http://palazzo.house.gov/news/" + row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.children[3].text.strip), :domain => domain }
  end
  results
end
.roe(page = 1) ⇒ Object
# File 'lib/statement/scraper.rb', line 503
def self.roe(page=1)
  results = []
  domain = 'roe.house.gov'
  url = "http://roe.house.gov/news/documentquery.aspx?DocumentTypeID=1532&Page=#{page}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//div[@class='middlecopy']//li").each do |row|
    results << { :source => url, :url => "http://roe.house.gov/news/" + row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.children[3].text.strip), :domain => domain }
  end
  results
end
.senate_aging ⇒ Object
# File 'lib/statement/scraper.rb', line 140
def self.senate_aging
  results = []
  url = "http://www.aging.senate.gov/pressroom.cfm?maxrows=100&startrow=1&&type=1"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr")[6..104].each do |row|
    results << { :source => url, :url => "http://www.aging.senate.gov/"+row.children[2].children[0]['href'], :title => row.children[2].text.strip, :date => Date.parse(row.children[0].text), :domain => "http://www.aging.senate.gov/" }
  end
  results
end
.senate_approps_majority ⇒ Object
special cases for committees without RSS feeds
# File 'lib/statement/scraper.rb', line 68
def self.senate_approps_majority
  results = []
  url = "http://www.appropriations.senate.gov/news.cfm"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//div[@class='newsDateUnderlined']").each do |date|
    date.next.next.children.reject{|c| c.text.strip.empty?}.each do |row|
      results << { :source => url, :url => url + row.children[0]['href'], :title => row.text, :date => Date.parse(date.text), :domain => "http://www.appropriations.senate.gov/", :party => 'majority' }
    end
  end
  results
end
.senate_approps_minority ⇒ Object
# File 'lib/statement/scraper.rb', line 81
def self.senate_approps_minority
  results = []
  url = "http://www.appropriations.senate.gov/republican.cfm"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//div[@class='newsDateUnderlined']").each do |date|
    date.next.next.children.reject{|c| c.text.strip.empty?}.each do |row|
      results << { :source => url, :url => url + row.children[0]['href'], :title => row.text, :date => Date.parse(date.text), :domain => "http://www.appropriations.senate.gov/", :party => 'minority' }
    end
  end
  results
end
.senate_banking(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 94
def self.senate_banking(year=Date.today.year)
  results = []
  url = "http://www.banking.senate.gov/public/index.cfm?FuseAction=Newsroom.PressReleases&ContentRecordType_id=b94acc28-404a-4fc6-b143-a9e15bf92da4&Region_id=&Issue_id=&MonthDisplay=0&YearDisplay=#{year}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr").each do |row|
    results << { :source => url, :url => "http://www.banking.senate.gov/public/" + row.children[2].children[1]['href'], :title => row.children[2].text.strip, :date => Date.parse(row.children[0].text.strip+", #{year}"), :domain => "http://www.banking.senate.gov/", :party => 'majority' }
  end
  results
end
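.senate_banking appends ", #{year}" to the date cell before parsing, which suggests the page lists dates without a year. A minimal sketch of that step, assuming cell text of the form "March 29" (the real page format is not shown here); the helper name is hypothetical:

```ruby
require 'date'

# Hypothetical helper, not part of Statement::Scraper.
# Completes a year-less date such as "March 29" with the requested year.
def date_with_year(text, year)
  Date.parse(text.strip + ", #{year}")
end

puts date_with_year("March 29", 2013)   # prints 2013-03-29
```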
.senate_hsag_majority(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 105
def self.senate_hsag_majority(year=Date.today.year)
  results = []
  url = "http://www.hsgac.senate.gov/media/majority-media?year=#{year}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr").each do |row|
    next if row.text.strip.size < 30
    results << { :source => url, :url => row.children[2].children[0]['href'].strip, :title => row.children[2].children[0].text, :date => Date.parse(row.children[0].text), :domain => "http://www.hsgac.senate.gov/", :party => 'majority' }
  end
  results
end
.senate_hsag_minority(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 117
def self.senate_hsag_minority(year=Date.today.year)
  results = []
  url = "http://www.hsgac.senate.gov/media/minority-media?year=#{year}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr").each do |row|
    next if row.text.strip.size < 30
    results << { :source => url, :url => row.children[2].children[0]['href'].strip, :title => row.children[2].children[0].text, :date => Date.parse(row.children[0].text), :domain => "http://www.hsgac.senate.gov/", :party => 'minority' }
  end
  results
end
.senate_indian ⇒ Object
# File 'lib/statement/scraper.rb', line 129
def self.senate_indian
  results = []
  url = "http://www.indian.senate.gov/news/index.cfm"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//h3").each do |row|
    results << { :source => url, :url => "http://www.indian.senate.gov"+row.children[0]['href'], :title => row.children[0].text, :date => Date.parse(row.previous.previous.text), :domain => "http://www.indian.senate.gov/", :party => 'majority' }
  end
  results
end
.senate_intel(congress = 113, start_year = 2013, end_year = 2014) ⇒ Object
# File 'lib/statement/scraper.rb', line 162
def self.senate_intel(congress=113, start_year=2013, end_year=2014)
  results = []
  url = "http://www.intelligence.senate.gov/press/releases.cfm?congress=#{congress}&y1=#{start_year}&y2=#{end_year}"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//tr[@valign='top']")[7..-1].each do |row|
    results << { :source => url, :url => "http://www.intelligence.senate.gov/press/"+row.children[2].children[0]['href'], :title => row.children[2].children[0].text.strip, :date => Date.parse(row.children[0].text), :domain => "http://www.intelligence.senate.gov/" }
  end
  results
end
.senate_smallbiz_minority ⇒ Object
# File 'lib/statement/scraper.rb', line 151
def self.senate_smallbiz_minority
  results = []
  url = "http://www.sbc.senate.gov/public/index.cfm?p=RepublicanPressRoom"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//ul[@class='recordList']").each do |row|
    results << { :source => url, :url => row.children[0].children[2].children[0]['href'], :title => row.children[0].children[2].children[0].text, :date => Date.parse(row.children[0].children[0].text), :domain => "http://www.sbc.senate.gov/", :party => 'minority' }
  end
  results
end
.sessions(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 528
def self.sessions(year=Date.today.year)
  results = []
  domain = 'sessions.senate.gov'
  url = "http://www.sessions.senate.gov/public/index.cfm/news-releases?YearDisplay=#{year}"
  doc = open_html(url)
  return if doc.nil?
  (doc/:tr)[1..-1].each do |row|
    next if row.children[1].text.strip == 'Date'
    results << { :source => url, :date => Date.parse(row.children[1].text), :title => row.children[3].children.text, :url => row.children[3].children[0]['href'], :domain => domain}
  end
  results
end
.sherman_mccaul(page = 0) ⇒ Object
# File 'lib/statement/scraper.rb', line 582
def self.sherman_mccaul(page=0)
  results = []
  domains = ['sherman.house.gov', 'mccaul.house.gov']
  domains.each do |domain|
    url = "http://#{domain}/media-center/press-releases?page=#{page}"
    doc = open_html(url)
    return if doc.nil?
    # compact rather than compact!: the bang form returns nil when no element was removed
    dates = doc.xpath('//span[@class="field-content"]').map {|s| s.text if s.text.strip.include?("201")}.compact
    (doc/:h3).first(10).each_with_index do |row, i|
      date = Date.parse(dates[i])
      results << {:source => url, :url => "http://"+domain+row.children.first['href'], :title => row.children.first.text.strip, :date => date, :domain => domain}
    end
  end
  results.flatten
end
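The date-extraction step in `sherman_mccaul` maps each span to its text when that text contains "201" and then drops the `nil` entries. The sketch below reproduces that pattern on plain strings. The helper name `date_strings` and the sample inputs are made up for illustration. It uses the non-bang `compact`, since `Array#compact!` returns `nil` when the array had no `nil` elements to remove.

```ruby
# Hypothetical helper: keep only strings that contain "201", mirroring the
# filter applied to the "field-content" spans in sherman_mccaul.
def date_strings(texts)
  texts.map { |t| t if t.strip.include?("201") }.compact
end

# Made-up span texts; the real pages may contain other content.
mixed      = ["Press Release", "June 3, 2014", "May 28, 2014"]
only_dates = ["June 3, 2014", "May 28, 2014"]

p date_strings(mixed)
p date_strings(only_dates)

# The bang form yields nil here because nothing was removed.
p only_dates.map { |t| t if t.include?("201") }.compact!
```

A page on which every matching span is a date would therefore leave a `compact!`-based variable set to `nil`, and the later index lookup would raise `NoMethodError`.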
.susandavis ⇒ Object
# File 'lib/statement/scraper.rb', line 337
def self.susandavis
  results = []
  base_url = "http://www.house.gov/susandavis/"
  doc = open_html(base_url+'news.shtml')
  return if doc.nil?
  doc.search("ul")[6].children.each do |row|
    next if row.text.strip == ''
    results << { :source => base_url+'news.shtml', :url => base_url + row.children[1]['href'], :title => row.children[1].text.split.join(' '), :date => Date.parse(row.children.first.text), :domain => "house.gov/susandavis" }
  end
  results
end
.swalwell(page = 1) ⇒ Object
special cases for members without RSS feeds
# File 'lib/statement/scraper.rb', line 233
def self.swalwell(page=1)
  results = []
  url = "http://swalwell.house.gov/category/press-releases/page/#{page}/"
  doc = open_html(url)
  return if doc.nil?
  doc.xpath("//h3")[0..4].each do |row|
    results << { :source => url, :url => row.children[0]['href'], :title => row.children[0].text, :date => nil, :domain => 'swalwell.house.gov'}
  end
  results
end
.vitter(year = Date.today.year) ⇒ Object
# File 'lib/statement/scraper.rb', line 449
def self.vitter(year=Date.today.year)
  results = []
  url = "http://www.vitter.senate.gov/newsroom/"
  domain = "www.vitter.senate.gov"
  doc = open_html(url+"press?year=#{year}")
  return if doc.nil?
  doc.xpath("//tr")[1..-1].each do |row|
    next if row.text.strip.size < 30
    results << { :source => url, :url => row.children[3].children[0]['href'].strip, :title => row.children[3].text, :date => Date.strptime(row.children[1].text, "%m/%d/%y"), :domain => domain}
  end
  results
end
.welch ⇒ Object
# File 'lib/statement/scraper.rb', line 598
def self.welch
  results = []
  domain = 'welch.house.gov'
  url = "http://www.welch.house.gov/press-releases/"
  doc = open_html(url)
  return if doc.nil?
  (doc/:h3).each do |row|
    results << { :source => url, :url => "http://www.welch.house.gov/"+row.children[1]['href'], :title => row.children[1].text.strip, :date => Date.parse(row.next.next.text), :domain => domain}
  end
  results
end
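Each scraper listed here returns `nil` when `open_html` yields no document, and otherwise an array of hashes with `:source`, `:url`, `:title`, `:date` and `:domain` keys (with `:date => nil` in the case of `swalwell`). A caller may want to guard against both cases before sorting. The following is a minimal sketch of one way to do that. The helper name `usable_results` and the sample data are hypothetical and not part of the library.

```ruby
require 'date'

# Hypothetical helper: accept a scraper's return value (nil or an array of
# result hashes), drop entries without a date, and sort newest first.
def usable_results(results)
  Array(results).reject { |r| r[:date].nil? }.sort_by { |r| r[:date] }.reverse
end

# Made-up data shaped like the hashes the scrapers build.
sample = [
  { :title => "Older release", :date => Date.new(2014, 1, 10), :domain => "example.house.gov" },
  { :title => "Undated release", :date => nil, :domain => "example.house.gov" },
  { :title => "Newer release", :date => Date.new(2014, 3, 2), :domain => "example.house.gov" }
]

usable_results(sample).each { |r| puts "#{r[:date]} #{r[:title]}" }
puts usable_results(nil).size
```

`Array(nil)` evaluates to an empty array, so a failed fetch is treated the same as a page with no releases.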