Class: Ietf::Data::Importer::Scrapers::IrtfScraper
- Inherits: BaseScraper
  - Object
  - BaseScraper
  - Ietf::Data::Importer::Scrapers::IrtfScraper
- Defined in: lib/ietf/data/importer/scrapers/irtf_scraper.rb
Overview
Scraper for IRTF groups from irtf.org
Constant Summary
- BASE_URL = "https://www.irtf.org/groups.html"
  Base URL for IRTF website
Instance Method Summary
- #extract_from_dropdown(doc) ⇒ Array<Ietf::Data::Importer::Group>
  Extract groups from the dropdown menu.
- #fetch ⇒ Array<Ietf::Data::Importer::Group>
  Fetch all IRTF groups.
Methods inherited from BaseScraper
Instance Method Details
#extract_from_dropdown(doc) ⇒ Array<Ietf::Data::Importer::Group>
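Extract groups from the "Research Groups" dropdown menu. Returns an empty array when the dropdown toggle or its menu is not found. Each menu link ending in .html becomes one group with organization 'irtf', type 'rg', and status 'active'; a group whose details page cannot be fetched is logged and skipped.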
# File 'lib/ietf/data/importer/scrapers/irtf_scraper.rb', line 87

def extract_from_dropdown(doc)
  groups = []

  # Look for the dropdown menu containing research groups
  dropdown = doc.css('a.dropdown-toggle').find do |el|
    el.text.include?('Research Groups')
  end
  return [] unless dropdown

  # Find the dropdown menu
  # (the local variable name is missing from the source listing; `menu` is assumed)
  dropdown_parent = dropdown.parent
  menu = dropdown_parent.css('.dropdown-menu')
  return [] unless menu.any?

  log "Found dropdown menu with research groups", 1

  # Extract groups from the dropdown menu
  menu.css('a.dropdown-item').each do |link|
    next unless link && link['href']

    name = link.text.strip
    href = link['href']

    # Extract abbreviation from href (e.g., cfrg.html -> CFRG)
    if href =~ /(\w+)\.html$/
      abbreviation = $1.upcase
    else
      next # Skip if we can't determine abbreviation
    end

    # Construct full URL if it's a relative path
    details_url = href
    unless details_url.start_with?('http')
      if details_url.start_with?('/')
        details_url = "https://www.irtf.org#{details_url}"
      else
        details_url = "https://www.irtf.org/#{details_url}"
      end
    end

    begin
      details = fetch_group_details(details_url)

      group = Importer::Group.new(
        abbreviation: abbreviation,
        name: name,
        organization: 'irtf',
        type: 'rg',
        area: nil,
        status: 'active', # Assume active since it's in the menu
        description: nil, # Will be populated from details page if available
        chairs: details[:chairs],
        mailing_list: details[:mailing_list],
        mailing_list_archive: details[:mailing_list_archive],
        website_url: details_url,
        charter_url: details[:charter_url],
        concluded_date: details[:concluded_date]
      )

      groups << group
    rescue => e
      log "Error fetching details for #{abbreviation} (#{details_url}): #{e.message}", 2
    end
  end

  groups
end
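The two string rules inside #extract_from_dropdown (deriving the abbreviation from the href and making the href absolute) can be tried in isolation. The following is a minimal sketch, not part of the documented class: the helper names `abbreviation_from_href` and `absolute_url` and the `root:` keyword are hypothetical, and only the regular expression and the prefix checks are taken from the listing above.

```ruby
# Hypothetical helpers that mirror the string handling in #extract_from_dropdown.

# Returns the upper-cased abbreviation for an href ending in "<name>.html",
# or nil when the href does not match (the scraper skips such links).
def abbreviation_from_href(href)
  match = href.match(/(\w+)\.html$/)
  match && match[1].upcase
end

# Returns an absolute URL. Hrefs that already start with "http" pass through
# unchanged; relative hrefs are joined to the site root with exactly one slash.
# The `root:` keyword is an assumption added for illustration.
def absolute_url(href, root: "https://www.irtf.org")
  return href if href.start_with?("http")

  href.start_with?("/") ? "#{root}#{href}" : "#{root}/#{href}"
end

abbreviation_from_href("cfrg.html") # => "CFRG"
absolute_url("cfrg.html")           # => "https://www.irtf.org/cfrg.html"
absolute_url("/cfrg.html")          # => "https://www.irtf.org/cfrg.html"
```

Because `\w` does not match a hyphen, an href such as "my-group.html" would yield only "GROUP" under this pattern, which may matter if the site ever uses hyphenated file names.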
#fetch ⇒ Array<Ietf::Data::Importer::Group>
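Fetch all IRTF groups. The dropdown menu is tried first; if it yields nothing, the method falls back to the 'Active Research Groups' and 'Concluded Research Groups' sections, then to alternative section titles, and finally to any unordered list that contains links. Returns an empty array if the page cannot be loaded; other errors are logged and the groups collected so far are returned.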
# File 'lib/ietf/data/importer/scrapers/irtf_scraper.rb', line 17

def fetch
  groups = []
  log "Fetching IRTF groups..."

  begin
    doc = fetch_html(BASE_URL)
    return [] unless doc

    # First try to extract from the dropdown menu
    dropdown_groups = extract_from_dropdown(doc)
    if dropdown_groups.any?
      log "Found #{dropdown_groups.size} groups in dropdown menu", 1
      groups.concat(dropdown_groups)
      return groups
    end

    # If dropdown extraction fails, fall back to traditional section-based extraction
    # Debug the page structure
    headings = doc.css('h3').map(&:text).join(', ')
    log "Found headings on IRTF page: #{headings}", 1

    # Extract active groups
    active_groups = extract_groups(doc, 'Active Research Groups', 'active')
    log "Found #{active_groups.size} active IRTF groups", 1

    # Extract concluded groups
    concluded_groups = extract_groups(doc, 'Concluded Research Groups', 'concluded')
    log "Found #{concluded_groups.size} concluded IRTF groups", 1

    groups.concat(active_groups)
    groups.concat(concluded_groups)

    # If still no groups found, try alternative selectors
    if groups.empty?
      log "No groups found with standard selectors, trying alternatives...", 1

      # Try different section titles
      ['Current Research Groups', 'Research Groups', 'IRTF Groups'].each do |title|
        section_groups = extract_groups(doc, title, 'active')
        if section_groups.any?
          log "Found #{section_groups.size} groups with section title: #{title}", 1
          groups.concat(section_groups)
        end
      end

      # Try a more generic approach if still no groups
      if groups.empty?
        log "Using generic list item selector...", 1

        # Find any unordered list with links
        doc.css('ul').each do |list|
          if list.css('li a').any?
            generic_groups = extract_groups_from_list(list, 'active')
            if generic_groups.any?
              log "Found #{generic_groups.size} groups using generic list selector", 1
              groups.concat(generic_groups)
            end
          end
        end
      end
    end
  rescue => e
    log "Error fetching IRTF groups: #{e.message}", 1
  end

  groups
end
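The control flow of #fetch is essentially a fallback chain: each extraction stage runs only if the previous ones produced nothing. The sketch below shows that pattern in a simplified form, assuming each stage can be wrapped in a callable; `first_non_empty` and the lambdas are hypothetical names used for illustration. The real method differs in that some stages accumulate results from several selectors before moving on.

```ruby
# Runs each strategy in order and returns the first non-empty result.
# Later strategies are never called once an earlier one succeeds.
# `strategies` is a hypothetical list of callables, not part of the class.
def first_non_empty(strategies)
  strategies.each do |strategy|
    result = strategy.call
    return result unless result.empty?
  end
  [] # every strategy came back empty
end

stages = [
  -> { [] },            # stands in for a dropdown extraction that found nothing
  -> { %w[CFRG HRPC] }, # stands in for a section-based extraction that succeeded
  -> { %w[UNUSED] }     # stands in for the generic list fallback; not reached
]

first_non_empty(stages) # => ["CFRG", "HRPC"]
```

Ordering the stages from most specific to most generic keeps the generic list selector, which is the most likely to pick up unrelated links, as a last resort.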