Module: Nokogiri::HTML5
- Defined in:
- lib/nokogumbo.rb
Class Method Summary
- .fragment(string) ⇒ Object
  Parse an HTML fragment. While fragment parsing is on the Gumbo TODO list, simulate it by doing a full document parse, ignoring the enclosing <html>, <head>, and <body> tags, and collecting the children of each.
- .get(uri, options = {}) ⇒ Object
  Fetch and parse an HTML document from the web, following redirects, handling https, and determining the character encoding using HTML5 rules.
- .parse(string) ⇒ Object
  Parse an HTML document.
Class Method Details
.fragment(string) ⇒ Object
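Parse an HTML fragment. While fragment parsing is on the Gumbo TODO list, simulate it by doing a full document parse, ignoring the enclosing <html>, <head>, and <body> tags, and collecting the children of each. If the parsed document does not have a single <html> root element, the document is returned as is.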
    # File 'lib/nokogumbo.rb', line 86

    def self.fragment(string)
      doc = parse(string)
      fragment = Nokogiri::HTML::DocumentFragment.new(doc)

      if doc.children.length != 1 or doc.children.first.name != 'html'
        # no HTML? Return document as is
        fragment = doc
      else
        # examine children of HTML element
        children = doc.children.first.children

        # head is always first. If present, take children but otherwise
        # ignore the head element
        if children.length > 0 and children.first.name == 'head'
          fragment << children.shift.children
        end

        # body may be next, or last. If found, take children but otherwise
        # ignore the body element. Also take any remaining elements, taking
        # care to preserve order.
        if children.length > 0 and children.first.name == 'body'
          fragment << children.shift.children
          fragment << children
        elsif children.length > 0 and children.last.name == 'body'
          body = children.pop
          fragment << children
          fragment << body.children
        else
          fragment << children
        end
      end

      # return result
      fragment
    end
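The unwrapping order used by fragment can be sketched without Nokogiri. This is a minimal illustration and not the library's implementation: `Node` and `unwrap` are hypothetical names, and a plain Struct stands in for parsed elements.

```ruby
# Hypothetical stand-in for a parsed element: a name plus child nodes.
Node = Struct.new(:name, :children)

# Collect the nodes a fragment would contain, given the <html> element:
# head's children first, then body's children, keeping any stray siblings
# in document order.
def unwrap(html)
  children = html.children.dup
  result = []

  # head, when present, is always the first child
  result.concat(children.shift.children) if children.first&.name == 'head'

  if children.first&.name == 'body'
    # body comes next: take its children, then whatever follows
    result.concat(children.shift.children)
    result.concat(children)
  elsif children.last&.name == 'body'
    # body comes last: keep the preceding nodes ahead of its children
    body = children.pop
    result.concat(children)
    result.concat(body.children)
  else
    result.concat(children)
  end

  result
end

html = Node.new('html', [
  Node.new('head', [Node.new('title', [])]),
  Node.new('body', [Node.new('p', []), Node.new('em', [])])
])

names = unwrap(html).map(&:name)
puts names.inspect # prints ["title", "p", "em"]
```

The wrapper elements themselves never appear in the result; only their children do.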
.get(uri, options = {}) ⇒ Object
Fetch and parse an HTML document from the web, following redirects, handling https, and determining the character encoding using HTML5 rules. uri may be a String or a URI. options contains HTTP headers and special options. Everything that is not a special option is considered a header. Special options include:
* :follow_limit => limit on the number of redirects followed (defaults to 10)
* :basic_auth => [username, password]
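As an illustration of how a single options hash can be divided into special options, Net::HTTP overrides, and request headers, here is a minimal sketch. `split_options` is a hypothetical helper, not part of the library, and the host and header values are made up; no network connection is opened.

```ruby
require 'net/http'

# Hypothetical helper: separate special options and Net::HTTP overrides
# from the request headers, without contacting the network.
def split_options(http, options)
  headers = options.dup
  limit = headers.key?(:follow_limit) ? headers.delete(:follow_limit).to_i : 10
  auth = headers.delete(:basic_auth)

  # Any key with a matching Net::HTTP setter is applied to the connection.
  options.each_key do |key|
    setter = "#{key}="
    http.send(setter, headers.delete(key)) if http.respond_to?(setter)
  end

  # Everything left over is sent as a request header.
  { limit: limit, auth: auth, headers: headers }
end

http = Net::HTTP.new('example.com', 443) # no connection is opened here
opts = {
  :follow_limit => 3,
  :basic_auth => ['user', 'secret'],
  :read_timeout => 5,
  'User-Agent' => 'example-agent'
}
result = split_options(http, opts)

puts result[:limit]
puts http.read_timeout
puts result[:headers].inspect
```

Only the `User-Agent` entry remains as a header here; `:read_timeout` is consumed because Net::HTTP responds to `read_timeout=`.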
    # File 'lib/nokogumbo.rb', line 34

    def self.get(uri, options={})
      headers = options.clone
      headers = {:follow_limit => headers} if Numeric === headers # deprecated
      limit = headers[:follow_limit] ? headers.delete(:follow_limit).to_i : 10

      require 'net/http'
      uri = URI(uri) unless URI === uri

      http = Net::HTTP.new(uri.host, uri.port)

      # TLS / SSL support
      http.use_ssl = true if uri.scheme == 'https'

      # Pass through Net::HTTP override values, which currently include:
      #   :ca_file, :ca_path, :cert, :cert_store, :ciphers,
      #   :close_on_empty_response, :continue_timeout, :key, :open_timeout,
      #   :read_timeout, :ssl_timeout, :ssl_version, :use_ssl,
      #   :verify_callback, :verify_depth, :verify_mode
      options.each do |key, value|
        http.send "#{key}=", headers.delete(key) if http.respond_to? "#{key}="
      end

      request = Net::HTTP::Get.new(uri.request_uri)

      # basic authentication
      auth = headers.delete(:basic_auth)
      auth ||= [uri.user, uri.password] if uri.user and uri.password
      request.basic_auth auth.first, auth.last if auth

      # remaining options are treated as headers
      headers.each {|key, value| request[key.to_s] = value.to_s}

      response = http.request(request)

      case response
      when Net::HTTPSuccess
        doc = parse(reencode(response.body, response['content-type']))
        doc.instance_variable_set('@response', response)
        doc.class.send(:attr_reader, :response)
        doc
      when Net::HTTPRedirection
        # raise the redirect as an error once the limit is exhausted
        response.value if limit <= 1
        location = URI.join(uri, response['location'])
        get(location, options.merge(:follow_limit => limit-1))
      else
        response.value
      end
    end
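The redirect handling can be illustrated offline. The sketch below is not the library's code: `follow` is a hypothetical function, and a hash of canned responses (made-up URLs and statuses) stands in for real HTTP traffic. It shows the decreasing limit and the use of `URI.join` to resolve a relative Location value.

```ruby
require 'uri'

# Follow redirects until a success response arrives or the limit runs out.
# `responses` maps a URL string to a hash with :status and either
# :location (redirects) or :body (success).
def follow(uri, responses, limit = 10)
  uri = URI(uri) unless URI === uri
  response = responses.fetch(uri.to_s)

  case response[:status]
  when 200
    [uri.to_s, response[:body]]
  when 301, 302
    raise "too many redirects at #{uri}" if limit <= 1
    # Location may be relative, so resolve it against the current URI
    follow(URI.join(uri, response[:location]), responses, limit - 1)
  else
    raise "unexpected status #{response[:status]}"
  end
end

# Hypothetical canned responses.
canned = {
  'http://example.com/start' => { status: 302, location: '/middle' },
  'http://example.com/middle' => { status: 301, location: 'http://example.org/end' },
  'http://example.org/end' => { status: 200, body: '<p>done</p>' }
}

final_url, body = follow('http://example.com/start', canned)
puts final_url # prints http://example.org/end
```

With this counting scheme a limit of 2 permits a single redirect, so the two-hop chain above raises when called with `limit = 2`.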
.parse(string) ⇒ Object
Parse an HTML document. string contains the document; it may also be an IO-like object. Returns a Nokogiri::HTML::Document.
    # File 'lib/nokogumbo.rb', line 14

    def self.parse(string)
      if string.respond_to? :read
        string = string.read
      end

      # convert to UTF-8 (Ruby 1.9+)
      if string.respond_to?(:encoding) and string.encoding != Encoding::UTF_8
        string = reencode(string)
      end

      Nokogumbo.parse(string.to_s)
    end
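The input handling in parse (read IO-like objects, then convert to UTF-8) can be sketched with the standard library alone. `normalize_input` is a hypothetical helper, and `String#encode` is used as a stand-in for the library's `reencode`, whose implementation is not shown here and may behave differently.

```ruby
require 'stringio'

# Hypothetical helper mirroring the input handling: read IO-like objects,
# then make sure the text is UTF-8. String#encode is an assumption standing
# in for the library's reencode.
def normalize_input(input)
  input = input.read if input.respond_to?(:read)
  input = input.encode(Encoding::UTF_8) unless input.encoding == Encoding::UTF_8
  input.to_s
end

latin1 = "<p>caf\u00E9</p>".encode(Encoding::ISO_8859_1)
from_string = normalize_input(latin1)
from_io = normalize_input(StringIO.new('<p>hello</p>'))

puts from_string.encoding # prints UTF-8
puts from_io              # prints <p>hello</p>
```

Either kind of input ends up as a UTF-8 String, which is the form handed to the parser.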