Module: PageHub::Markdown::Embedder

Defined in:
lib/pagehub-markdown/processors/embedder.rb

Overview

Downloads remote textual resources and extracts content from HTML pages so that it can be embedded cleanly in another page.

Defined Under Namespace

Classes: EmbeddingError, GithubWikiProcessor, InvalidSizeError, InvalidTypeError, PageHubProcessor, Processor

Constant Summary

AllowedTypes =

Resources whose content-type is not specified in this list will be rejected

[/text\/plain/, /text\/html/, /application\/html/]
MaximumLength =

Resources larger than 1 MByte will be rejected

1 * 1024 * 1024
FilteredHosts =

Resources served by any of the hosts specified in this list will be rejected

[]
Timeout =

Number of seconds to wait when opening the connection and when reading the response

5
MATCH =

Matches embed directives of the form [!include ...!](uri) or [!embed ...!](uri)

/^\B\[\!(?:include|embed)\s?(.*)\!\]\((.*)\)/
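As a rough illustration of the directive syntax that MATCH targets, the sketch below parses a single line with a local pattern. `DIRECTIVE` and `parse_directive` are hypothetical names, and the pattern is a stand-in modelled on MATCH that uses an alternation group for the keyword; it is not the module's own constant. How the module maps the captured label onto the source and args of get_resource is not shown on this page.

```ruby
# Hypothetical stand-in modelled on MATCH; not the module's own constant.
DIRECTIVE = /^\[!(?:include|embed)\s?(.*)!\]\((.*)\)/

# Returns [label, uri] for a directive line, or nil when the line is not one.
def parse_directive(line)
  m = DIRECTIVE.match(line)
  m && [m[1], m[2]]
end

label, uri = parse_directive("[!include github-wiki!](https://example.com/wiki/Home)")
puts label  # the text between the keyword and "!]"
puts uri    # the text between the parentheses
```

A line that does not start with a directive yields nil, so a caller can skip it without further checks.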

Class Method Summary

Class Method Details

.allowed?(ctype) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/pagehub-markdown/processors/embedder.rb', line 103

def allowed?(ctype)
  # true if any allowed pattern matches the content type
  AllowedTypes.any? { |t| t.match(ctype) }
end
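A short sketch of how this check behaves, using a local copy of the AllowedTypes list so that it runs without the library. `ALLOWED_TYPES` and `type_allowed?` are hypothetical stand-ins for the constant and for `.allowed?`. Because the patterns are unanchored, a header value that carries a charset parameter still passes.

```ruby
# Local copy of the documented list, so the sketch runs standalone.
ALLOWED_TYPES = [/text\/plain/, /text\/html/, /application\/html/]

# Hypothetical stand-in for Embedder.allowed?
def type_allowed?(ctype)
  ALLOWED_TYPES.any? { |t| t.match(ctype.to_s) }
end

puts type_allowed?("text/html; charset=utf-8")  # true: the parameter does not matter
puts type_allowed?("application/json")          # false: not in the list
puts type_allowed?("")                          # false: a missing header is rejected
```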

.get_resource(raw_uri, source = "", args = "") ⇒ Object

Performs a HEAD request to validate the resource. If it passes the checks, the resource is downloaded and handed to the first registered Embedder::Processor that applies to it, if any.

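Arguments:

  1. raw_uri: the full raw URI of the file to be embedded

  2. source: an optional identifier to specify the Processor that should be used to post-process the content

  3. args: options that can be meaningful to the Processor, if any

Returns: A string containing the extracted data, or an empty string if the host is filtered

Raises: InvalidTypeError or InvalidSizeError when the HEAD checks fail; any other failure is re-raised as a generic EmbeddingError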
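The checks that get_resource applies before downloading can be sketched without any network access. Everything below is a simplified stand-in: `precheck`, the local constants, and the plain Hash of headers are assumptions for illustration, and the error classes are local definitions that only mirror the names listed under the namespace (their superclass relationship is assumed).

```ruby
require 'uri'

# Local stand-ins; the superclass relationship is an assumption.
class EmbeddingError   < RuntimeError;   end
class InvalidTypeError < EmbeddingError; end
class InvalidSizeError < EmbeddingError; end

ALLOWED_TYPES  = [/text\/plain/, /text\/html/, /application\/html/]
MAXIMUM_LENGTH = 1 * 1024 * 1024
FILTERED_HOSTS = ["blocked.example.com"]  # hypothetical entry; the documented list is empty

# Hypothetical helper: returns false for a filtered host, raises for a bad
# type or size, and returns true when the resource may be downloaded.
def precheck(raw_uri, headers)
  return false if FILTERED_HOSTS.include?(URI.parse(raw_uri).host)

  ctype   = headers["content-type"].to_s
  clength = headers["content-length"].to_i

  raise InvalidTypeError, ctype unless ALLOWED_TYPES.any? { |t| t.match(ctype) }
  raise InvalidSizeError, clength.to_s if clength > MAXIMUM_LENGTH
  true
end

headers = { "content-type" => "text/plain", "content-length" => "120" }
puts precheck("http://example.com/a.txt", headers)
```

Note that a response without a content-length header counts as length 0 here, so it passes the size check; the same holds for the `to_i` conversion in the method's source.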



# File 'lib/pagehub-markdown/processors/embedder.rb', line 49

def get_resource(raw_uri, source = "", args = "")
  begin
    uri = URI.parse(raw_uri)

    # reject if the host is banned
    return "" if FilteredHosts.include?(uri.host)

    http = Net::HTTP.new(uri.host, uri.port)
    http.open_timeout = Timeout
    http.read_timeout = Timeout
    http.use_ssl      = (uri.scheme == 'https')

    http.start do

      # get the content type and length from the HEAD response;
      # request_uri keeps the query string and is never empty
      head    = http.head(uri.request_uri)
      ctype   = head['content-type'].to_s
      clength = head['content-length'].to_i

      raise InvalidTypeError.new(ctype) unless allowed?(ctype)
      raise InvalidSizeError.new(clength) if clength > MaximumLength

      # URI.open comes from open-uri; a bare Kernel#open no longer
      # accepts URLs on Ruby 3.0 and later
      URI.open(raw_uri) { |f|
        content = f.read

        # invoke the first processor that applies
        keys = []
        keys << source unless source.empty?
        keys << raw_uri
        (@@processors ||= []).each { |p|
          if p.applies_to?(keys)
            content = p.process(content, raw_uri, args)
            break
          end
        }

        return content
      }
    end
  rescue EmbeddingError => e
    # we want to escalate these errors
    raise e
  rescue StandardError => e
    # mask as a generic EmbeddingError
    raise EmbeddingError.new "generic: #{e.class}##{e.message}"
  end

  ""
end
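The two rescue branches at the end of the method can be illustrated in isolation. This is a sketch under stated assumptions: `guarded` is a hypothetical wrapper, the error classes are local stand-ins that mirror the names listed under the namespace (with InvalidTypeError assumed to inherit from EmbeddingError), and the sketch rescues StandardError for the generic branch.

```ruby
# Local stand-ins; the superclass relationship is an assumption.
class EmbeddingError   < RuntimeError;   end
class InvalidTypeError < EmbeddingError; end

# Hypothetical wrapper showing the two rescue branches.
def guarded
  yield
rescue EmbeddingError
  raise  # escalate embedder errors untouched
rescue StandardError => e
  # mask anything else as a generic EmbeddingError
  raise EmbeddingError, "generic: #{e.class}##{e.message}"
end

begin
  guarded { Integer("not a number") }  # raises ArgumentError inside the block
rescue EmbeddingError => e
  puts e.message                       # message begins with "generic: ArgumentError#"
end
```

With this arrangement a caller only needs to rescue EmbeddingError, and can still distinguish type and size rejections by rescuing the more specific classes first.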

.register_processor(proc) ⇒ Object



# File 'lib/pagehub-markdown/processors/embedder.rb', line 108

def register_processor(proc)
  @@processors ||= []
  @@processors << proc
end
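Judging from the calls made in get_resource, a processor needs to respond to `applies_to?(keys)` and `process(content, raw_uri, args)`. The sketch below shows that shape with a standalone registry. `UpcaseProcessor` and `Registry` are hypothetical names that are not part of the library, and registering an instance (rather than a class) is an assumption.

```ruby
# Hypothetical processor: the method names follow the calls made in
# get_resource, but this class is not part of the library.
class UpcaseProcessor
  # keys holds the optional source identifier followed by the raw URI
  def applies_to?(keys)
    keys.include?("upcase")
  end

  def process(content, raw_uri, args)
    content.upcase
  end
end

# Minimal stand-in for the registry and for the dispatch loop in get_resource.
module Registry
  @processors = []

  def self.register_processor(processor)
    @processors << processor
  end

  # Runs the first processor that applies; returns content unchanged otherwise.
  def self.run(content, raw_uri, source = "", args = "")
    keys = []
    keys << source unless source.empty?
    keys << raw_uri
    processor = @processors.find { |p| p.applies_to?(keys) }
    processor ? processor.process(content, raw_uri, args) : content
  end
end

Registry.register_processor(UpcaseProcessor.new)
puts Registry.run("hello", "http://example.com/a.txt", "upcase")  # HELLO
puts Registry.run("hello", "http://example.com/a.txt")            # hello
```

Since only the first applicable processor runs, registration order matters when several processors could claim the same keys.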