Module: Nokogiri::HTML5

Defined in:
lib/nokogumbo.rb

Class Method Details

.fragment(string) ⇒ Object

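Parse an HTML fragment. Fragment parsing is still on the Gumbo TODO list, so this method simulates it: it parses string as a full document, ignores the enclosing <html>, <head>, and <body> tags, and collects the children of each, in order, into a fragment. If the parsed document does not consist of a single <html> root element, the document itself is returned as is.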


# File 'lib/nokogumbo.rb', line 86

def self.fragment(string)
  doc = parse(string)
  fragment = Nokogiri::HTML::DocumentFragment.new(doc)

  if doc.children.length != 1 or doc.children.first.name != 'html'
    # no HTML?  Return document as is
    fragment = doc
  else
    # examine children of HTML element
    children = doc.children.first.children

    # head is always first.  If present, take children but otherwise
    # ignore the head element
    if children.length > 0 and children.first.name == 'head'
      fragment << children.shift.children
    end

    # body may be next, or last.  If found, take children but otherwise
    # ignore the body element.  Also take any remaining elements, taking
    # care to preserve order.
    if children.length > 0 and children.first.name == 'body'
      fragment << children.shift.children
      fragment << children
    elsif children.length > 0 and children.last.name == 'body'
      body = children.pop
      fragment << children
      fragment << body.children
    else
      fragment << children
    end
  end

  # return result
  fragment
end
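
The unwrapping order described above can be illustrated without a parser. The following is a minimal sketch only: Node and flatten_html_children are hypothetical names, not part of nokogumbo, and the sketch simplifies the real method by unwrapping <head> and <body> wherever they appear among the children of <html>.

```ruby
# Hypothetical stand-in for a parsed element: a name plus its child nodes.
Node = Struct.new(:name, :children)

# Replace each <head> or <body> child with its own children and keep every
# other child in place, so that document order is preserved.
def flatten_html_children(html)
  html.children.flat_map do |child|
    %w[head body].include?(child.name) ? child.children : [child]
  end
end

html = Node.new('html', [
  Node.new('head', [Node.new('title', [])]),
  Node.new('body', [Node.new('p', []), Node.new('div', [])])
])

names = flatten_html_children(html).map(&:name)
puts names.inspect  # prints ["title", "p", "div"]
```

The real method appends the same nodes to a Nokogiri::HTML::DocumentFragment rather than returning an Array.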

.get(uri, options = {}) ⇒ Object

Fetch and parse an HTML document from the web, following redirects, handling HTTPS, and determining the character encoding using HTML5 rules. uri may be a String or a URI. options contains HTTP headers and special options. Every entry that is not a special option is sent as a request header. Special options include:

* :follow_limit => maximum number of redirects to follow (default: 10)
* :basic_auth => [username, password]
* any Net::HTTP setting that has a writer method (for example :read_timeout or :verify_mode) => applied to the connection instead of being sent as a header

# File 'lib/nokogumbo.rb', line 34

def self.get(uri, options={})
  headers = options.clone
  headers = {:follow_limit => headers} if Numeric === headers # deprecated
  limit=headers[:follow_limit] ? headers.delete(:follow_limit).to_i : 10

  require 'net/http'
  uri = URI(uri) unless URI === uri

  http = Net::HTTP.new(uri.host, uri.port)

  # TLS / SSL support
  http.use_ssl = true if uri.scheme == 'https'

  # Pass through Net::HTTP override values, which currently include:
  #   :ca_file, :ca_path, :cert, :cert_store, :ciphers,
  #   :close_on_empty_response, :continue_timeout, :key, :open_timeout,
  #   :read_timeout, :ssl_timeout, :ssl_version, :use_ssl,
  #   :verify_callback, :verify_depth, :verify_mode
  options.each do |key, value|
    http.send "#{key}=", headers.delete(key) if http.respond_to? "#{key}="
  end

  request = Net::HTTP::Get.new(uri.request_uri)

  # basic authentication
  auth = headers.delete(:basic_auth)
  auth ||= [uri.user, uri.password] if uri.user and uri.password
  request.basic_auth auth.first, auth.last if auth

  # remaining options are treated as headers
  headers.each {|key, value| request[key.to_s] = value.to_s}

  response = http.request(request)

  case response
  when Net::HTTPSuccess
    doc = parse(reencode(response.body, response['content-type']))
    doc.instance_variable_set('@response', response)
    doc.class.send(:attr_reader, :response)
    doc
  when Net::HTTPRedirection
    response.value if limit <= 1
    location = URI.join(uri, response['location'])
    get(location, options.merge(:follow_limit => limit-1))
  else
    response.value
  end
end
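
How one options hash is divided between special options, Net::HTTP settings, and request headers can be sketched on its own. This is an illustration under assumptions, not the library's code: split_get_options is a hypothetical helper, and the host name and header values are placeholders. No network connection is opened.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of nokogumbo): split an options hash into
# the redirect limit, basic-auth credentials, Net::HTTP settings, and the
# remaining entries, which would be sent as request headers.
def split_get_options(http, options)
  headers = options.dup
  limit = headers.key?(:follow_limit) ? headers.delete(:follow_limit).to_i : 10
  auth = headers.delete(:basic_auth)

  # Anything Net::HTTP has a writer for is treated as a connection setting.
  settings = {}
  options.each_key do |key|
    next unless headers.key?(key) && http.respond_to?("#{key}=")
    settings[key] = headers.delete(key)
  end

  { limit: limit, auth: auth, settings: settings, headers: headers }
end

uri = URI('https://example.com/')
http = Net::HTTP.new(uri.host, uri.port) # creates the object only

parts = split_get_options(http, {
  :follow_limit => 3,
  :basic_auth   => ['user', 'secret'],
  :read_timeout => 5,
  'User-Agent'  => 'example-agent'
})

puts parts.inspect
```

Here :read_timeout ends up under settings because Net::HTTP defines read_timeout=, while 'User-Agent' has no matching writer and is left as a header.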

.parse(string) ⇒ Object

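Parse an HTML document. string contains the document, either as a String or as an IO-like object that responds to read. Input that is not already encoded as UTF-8 is re-encoded before parsing. Returns a Nokogiri::HTML::Document.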


# File 'lib/nokogumbo.rb', line 14

def self.parse(string)
  if string.respond_to? :read
    string = string.read
  end

  # convert to UTF-8 (Ruby 1.9+) 
  if string.respond_to?(:encoding) and string.encoding != Encoding::UTF_8
    string = reencode(string)
  end

  Nokogumbo.parse(string.to_s)
end
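
The input handling that parse performs before calling the parser can be sketched with the standard library alone. This is a rough illustration, not the library's code: normalize_input is a hypothetical name, and String#encode stands in for nokogumbo's own reencode helper, which is assumed to do more than a plain transcoding.

```ruby
require 'stringio'

# Hypothetical stand-in for the normalization step: accept a String or an
# IO-like object and return a UTF-8 String.
def normalize_input(input)
  input = input.read if input.respond_to?(:read)
  input = input.to_s
  # String#encode is used here in place of nokogumbo's reencode helper.
  input = input.encode(Encoding::UTF_8) unless input.encoding == Encoding::UTF_8
  input
end

# A Latin-1 encoded String is transcoded to UTF-8.
latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
from_string = normalize_input(latin1)

# An IO-like object is read first, then handled like a String.
from_io = normalize_input(StringIO.new('<p>hello</p>'))

puts from_string.encoding
puts from_io
```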