Module: RelatonGb::GbScrapper
- Extended by: Scrapper
- Defined in: lib/relaton_gb/gb_scrapper.rb
Overview
Scraper for Chinese national (GB) standards listed on std.gov.cn.
Class Method Summary
- .get_committee(doc) ⇒ Hash
  Returns a Hash with the keys :type [String] and :name [String].
- .scrape_doc(pid) ⇒ RelatonGb::GbBibliographicItem
- .scrape_page(text) ⇒ RelatonGb::HitCollection
Methods included from Scrapper
fetch_structuredidentifier, get_contributors, get_docid, get_status, get_titles, get_type, scrapped_data
Class Method Details
.get_committee(doc) ⇒ Hash
Returns a Hash with the keys :type [String] and :name [String].
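# File 'lib/relaton_gb/gb_scrapper.rb', line 44
def get_committee(doc)
  # Read the text that follows the first link in a paragraph and keep
  # only the part enclosed in parentheses (the committee name).
  name = doc.xpath("//p/a[1]/following-sibling::text()").text.
    match(/(?<=\()[^)]+/).to_s
  { type: "technical", name: name }
end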
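The name extraction in get_committee can be tried on plain strings without Nokogiri. The following is a minimal sketch, not the library's code: `extract_committee` is a hypothetical helper, and accepting full-width brackets alongside ASCII ones is an assumption about how the source page may be typeset.

```ruby
# Hypothetical helper, not part of relaton_gb. It keeps the text enclosed in
# the first pair of parentheses; matching full-width brackets as well as ASCII
# ones is an assumption about the page markup.
def extract_committee(text)
  # The lookbehind anchors the match just after an opening bracket;
  # the negated class stops at the first closing bracket.
  name = text[/(?<=[(（])[^)）]+/].to_s
  { type: "technical", name: name }
end

p extract_committee("TC260 (Information security)")
p extract_committee("no committee listed")
```

When no parenthesised text is present, the match is nil and `to_s` turns it into an empty string, so :name is always a String as documented.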
.scrape_doc(pid) ⇒ RelatonGb::GbBibliographicItem
# File 'lib/relaton_gb/gb_scrapper.rb', line 32
def scrape_doc(pid)
  src = "http://www.std.gov.cn/gb/search/gbDetailed?id=" + pid
  doc = Nokogiri::HTML OpenURI.open_uri(src)
  GbBibliographicItem.new scrapped_data(doc, src: src)
rescue OpenURI::HTTPError, SocketError
  raise RelatonBib::RequestError, "Cannot access #{src}"
end
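The rescue clause in scrape_doc converts low-level network failures into a single domain error. A minimal sketch of that pattern, runnable without the gem or network access, might look like this. `RequestError` here is a local stand-in for RelatonBib::RequestError, and `with_request_error` is a hypothetical wrapper introduced only for illustration.

```ruby
require "open-uri"
require "socket"

# Local stand-in for RelatonBib::RequestError so the sketch runs on its own.
class RequestError < StandardError; end

# Hypothetical wrapper: yields the URL to the block and re-raises HTTP or
# socket failures as RequestError, mirroring the rescue clause in scrape_doc.
def with_request_error(src)
  yield src
rescue OpenURI::HTTPError, SocketError
  raise RequestError, "Cannot access #{src}"
end

begin
  # Simulate a DNS failure instead of performing a real request.
  with_request_error("http://example.invalid/doc") do |_url|
    raise SocketError, "lookup failed"
  end
rescue RequestError => e
  puts e.message
end
```

Callers then only need to rescue one error class, regardless of whether the failure was an HTTP status or a connection problem.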
.scrape_page(text) ⇒ RelatonGb::HitCollection
# File 'lib/relaton_gb/gb_scrapper.rb', line 17
def scrape_page(text)
  search_html = OpenURI.open_uri(
    "http://www.std.gov.cn/search/stdPage?q=" + text
  )
  result = Nokogiri::HTML search_html
  hits = result.css(".s-title a").map do |h|
    Hit.new pid: h[:pid], title: h.text, scrapper: self
  end
  HitCollection.new hits
rescue OpenURI::HTTPError, SocketError
  raise RelatonBib::RequestError,
        "Cannot access http://www.std.gov.cn/search/stdPage"
end
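scrape_page appends the search text to the URL unchanged, so a query containing spaces, slashes, or non-ASCII characters goes out unescaped. The sketch below shows one way a caller could escape the text first. `search_url` is a hypothetical helper, not part of relaton_gb, and whether the search endpoint accepts form-encoded input (for example "+" for a space) is an assumption.

```ruby
require "uri"

SEARCH_ENDPOINT = "http://www.std.gov.cn/search/stdPage"

# Hypothetical helper: builds the search URL with the query form-encoded.
# Assumes the server decodes standard application/x-www-form-urlencoded input.
def search_url(text)
  "#{SEARCH_ENDPOINT}?q=#{URI.encode_www_form_component(text)}"
end

puts search_url("GB/T 20223")
```

Escaping in the caller keeps the scraper method itself untouched. Passing an already escaped string to scrape_page would work only if the method does not escape it again, which holds for the listing above.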