Class: APWArticles::Scraper
- Inherits: Object
- Defined in: lib/apw_articles/scraper.rb
Class Method Summary
- .scrape_article(url) ⇒ Object
  This method takes an article URL, creates an article hash, then populates that hash with information scraped from the URL.
- .scrape_categories(url = "https://apracticalwedding.com/category/marriage-essays/?listas=list") ⇒ Object
  This method takes in a URL of a list page of essays at APW and creates an array of all link attributes on the essay links in the list.
- .scrape_list(category, page = 1) ⇒ Object
  This class method defines variables i and j to determine which URL numbers need to be scraped based on the number of items on each page (at time of publication, 66 articles per page).
Class Method Details
.scrape_article(url) ⇒ Object
This method takes an article URL, creates an article hash, then populates that hash with information scraped from the URL. The method returns the article hash.
# File 'lib/apw_articles/scraper.rb', line 22

def self.scrape_article(url)
  article = {}
  doc = Nokogiri::HTML(open(url))  # open(url) relies on open-uri to fetch remote pages
  article[:title] = doc.css("h1").text
  article[:author] = doc.css(".staff-info h2").text
  article[:url] = url
  article[:blurb] = doc.css(".entry p").text[0, 400]
  categories = []
  doc.css(".categories a").each do |link|
    categories << link.attribute("href").value.split("/")[-1]
  end
  article[:categories] = categories
  article
end
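The blurb field is built by truncation: the text of all entry paragraphs is joined into one string, and the [0, 400] slice keeps its first 400 characters. A minimal sketch of that slice, using a stand-in string rather than a real scraped page:

```ruby
# Stand-in for doc.css(".entry p").text -- not real scraped content.
entry_text = "A practical wedding essay. " * 40   # well over 400 characters

# scrape_article keeps only the first 400 characters as the blurb.
blurb = entry_text[0, 400]

puts blurb.length   # => 400
```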
.scrape_categories(url = "https://apracticalwedding.com/category/marriage-essays/?listas=list") ⇒ Object
This method takes in a URL of a list page of essays at APW and creates an array of all link attributes on the essay links in the list. It then breaks apart each string of link attributes and returns an array of the attribute names that carry the prefix "category-" (with the prefix removed).
# File 'lib/apw_articles/scraper.rb', line 38

def self.scrape_categories(url = "https://apracticalwedding.com/category/marriage-essays/?listas=list")
  doc = Nokogiri::HTML(open(url)).css(".type-post")
  link_attributes = []
  categories = []
  doc.each { |link| link_attributes << link.attribute("class").value }
  link_attributes.each do |attributes_list|
    attributes_array = attributes_list.split(/ category-/)
    attributes_array.slice!(0)
    attributes_array.each do |category|
      categories << category.split[0]
    end
  end
  categories.uniq
end
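Each post's class attribute mixes category names in with other WordPress classes in the "category-<name>" style. A small sketch of the splitting step on a made-up attribute string (the class list below is illustrative, not taken from a real APW page):

```ruby
# Hypothetical WordPress class attribute for one post.
attrs = "post-123 type-post status-publish category-marriage-essays category-advice tag-love"

categories = []
attributes_array = attrs.split(/ category-/)  # element 0 holds the non-category classes
attributes_array.slice!(0)                    # drop them
attributes_array.each do |category|
  categories << category.split[0]             # keep only the name, not any trailing classes
end

p categories.uniq   # => ["marriage-essays", "advice"]
```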
.scrape_list(category, page = 1) ⇒ Object
This class method defines variables i and j to determine which URL numbers need to be scraped based on the number of items on each page (at time of publication, 66 articles per page). The method then scrapes a URL based on the category and URL number and uses that page's list of articles to create a new article object per article link. Each article object includes a title, URL, and category. The method returns nil.
# File 'lib/apw_articles/scraper.rb', line 4

def self.scrape_list(category, page = 1)
  # NOTE: probably I should only scrape for the # of articles I need
  # for the given request / call - this is very laggy
  i = 1 if page.between?(1, 6)
  i = 2 if page.between?(7, 13)
  i = 3 if page > 13
  j = 1 if page.between?(1, 5)
  j = 2 if page.between?(6, 12)
  j = 3 if page > 12
  until i > j
    Nokogiri::HTML(open("https://apracticalwedding.com/category/marriage-essays/#{category}/page/#{i}/?listas=list")).css(".type-post").each do |post|
      APWArticles::Article.new({url: post.css("a").attribute("href").value, title: post.css("h2").text, categories: [category]})
    end
    i += 1
  end
  nil
end
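The i/j arithmetic maps a requested article page onto the range of APW list pages that must be fetched, with i as the first list page and j as the last. A sketch of those branches in isolation (page_bounds is a hypothetical helper, not a method the gem defines):

```ruby
# Mirrors scrape_list's branch logic for choosing which list pages to
# fetch; page_bounds itself is hypothetical and not part of the gem.
def page_bounds(page)
  i = 1                     # first list page to fetch
  i = 2 if page.between?(7, 13)
  i = 3 if page > 13
  j = 1                     # last list page to fetch
  j = 2 if page.between?(6, 12)
  j = 3 if page > 12
  [i, j]
end

p page_bounds(1)    # => [1, 1]  only list page 1 is needed
p page_bounds(6)    # => [1, 2]  the request straddles list pages 1 and 2
p page_bounds(14)   # => [3, 3]
```

Note that the i and j thresholds are offset by one page, so requests near a boundary (such as page 6) pull in two list pages.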