Class: PrettyProxy

Inherits:

Rack::Proxy

Object
Rack::Proxy
PrettyProxy

Defined in: lib/pretty_proxy.rb

Overview

The PrettyProxy class aggregates and validates the configuration of a proxy based on simple, pretty-URL-oriented rewriting rules. It is also a Rack app, and offers an abstract method for rewriting the responses returned by the proxy. (X)HTML responses are rewritten so that the hyperlinks point to the proxy version of the page, if one exists.

If you want to build a Rack app that uses the proxy to point to another path of the same app, you have to run the server in multithreaded mode; otherwise requests to the proxy will end in a deadlock. The proxy requests the original page, but the server doesn't respond because it is waiting for the proxy request to be resolved. The proxy request doesn't finish because it needs the original page. A timeout error occurs.

What this class can't do yet but may do in the future: smart handling of 3xx status responses and of chunked encoding (the chunks are concatenated in the proxy and the transfer-encoding header is removed); support for encodings other than deflate and gzip; exception classes that carry more than a message.

The exception classes (except Error) inherit from Error, and Error inherits from ArgumentError. They are empty for now and only carry a message.
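
To illustrate the hierarchy described above, here is a minimal sketch that defines equivalent classes under a hypothetical `HierarchySketch` module (an assumption for illustration, not the gem's own namespace). It shows that rescuing `ArgumentError` also catches the more specific errors.

```ruby
# Hypothetical stand-in for the hierarchy described above; the real classes
# live under PrettyProxy and are not redefined here.
module HierarchySketch
  class Error < ArgumentError; end
  class ConfigError < Error; end
  class ProxyError < Error; end
end

# Runs the block and returns the class of any ArgumentError it raises.
def rescued_class
  yield
rescue ArgumentError => e
  e.class
end

caught = rescued_class { raise HierarchySketch::ConfigError, 'bad proxy_path' }
puts caught # prints HierarchySketch::ConfigError
```

Because everything descends from `ArgumentError`, code that already rescues `ArgumentError` needs no changes to handle these errors.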

Glossary:

‘a valid proxy url/path’: The path (or the path of the URL) starts with the proxy_path and is followed by an original_path.

‘in(side)/out(side) the proxy control’: The URL has (or does not have) a path starting with an original_path, and its scheme, port and host are the same as those of the original_domain.

CHANGELOG:

3.0.0
  * unproxify_url returns a String (and no longer a URI);
     because this changes the API (and can break code) the major
     version is now 3; if you don't use this method you can safely upgrade
  * depends on the addressable gem
  * correctly handles URIs without a scheme (but with a host)
    like '//duckduckgo.com/' (spec added for that)

Examples:

A terrible example

# You can run this example with 'rake heresy_example' in the gem folder
# and see the result in localhost:9292/proxy/
require 'pretty_proxy'

class Heresy < PrettyProxy
  def sugared_rewrite_response(triplet, requested_to_proxy_env, rewritten_env)
    status, headers, page = triplet
    page = page.gsub(/(MTG )?Magic(: The Gathering)?/, 'Yu-Gi-Oh')
    [status, headers, page]
  end
end

run Heresy.new('/proxy/', 'http://magiccards.info', '/')

Author:

Henrique Becker

Defined Under Namespace

Classes: ConfigError, Error, ProxyError

Instance Attribute Summary

#original_domain ⇒ Object

Returns a clone of the internal value.

#original_paths ⇒ Object

Returns a clone of the internal value (always a Set, no matter what is passed to initialize).

#proxy_path ⇒ Object

Returns a clone of the internal value.

Instance Method Summary

#call(env) ⇒ Object

Makes this class a Rack app.

#initialize(proxy_path, original_domain, original_paths) ⇒ PrettyProxy constructor

Creates a new PrettyProxy instance or raises a ConfigError.

#inside_proxy_control?(uri) ⇒ Boolean

Checks whether the URI::HTTP(S) is a page that can be accessed through the proxy.

#point_to_a_proxy_page?(hyperlink, proxy_domain) ⇒ Boolean

Takes a URL and the proxy domain (scheme, host and port) and returns whether the URL points to a valid proxy page.

#proxify_html(html, proxy_url) ⇒ String

Takes an (X)HTML document and applies proxify_hyperlink to the ‘href’ attribute of each ‘a’ element.

#proxify_hyperlink(hyperlink, proxy_page_url) ⇒ String

Takes a hyperlink and the URL of the proxy page (not the original page) it comes from, and returns the rewritten hyperlink.

#rewrite_env(env) ⇒ Hash{String => String}

Modifies a Rack environment hash for a request to the proxy version of a page into a request to the original page.

#rewrite_response(triplet, requested_to_proxy_env, rewritten_env) ⇒ Array<(Integer, Hash{String => String}, #each)>

Mainly applies proxify_html to the body of the response if it is HTML.

#same_domain_as_original?(uri) ⇒ Boolean

Checks whether the #scheme, #host, and #port of the argument are equal to those of the original_domain.

#sugared_rewrite_response(triplet, requested_to_proxy_env, rewritten_env) ⇒ Array<(Integer, Hash{String => String}, String)> abstract

Called only for (X)HTML responses, after they are decompressed and the hyperlinks are proxified.

#unproxify_url(url) ⇒ String

Takes a proxy URL and returns the original URL behind the proxy.

#valid_path_for_proxy?(absolute_path) ⇒ Boolean

Checks whether the absolute path begins with the proxy_path and is followed by an original_paths element.

Constructor Details

#initialize(proxy_path, original_domain, original_paths) ⇒ `PrettyProxy`

Note:

See the specs (pretty_proxy_spec) for examples and the complete definition of invalid arguments.

Creates a new PrettyProxy instance or raises a ConfigError. The arguments are cloned.

Parameters:

proxy_path (String) —

Starts and ends with a slash; represents the path on the proxy site that maps to the proxy app (and, in consequence, to another path on the same or another site).
original_domain (String, URI) —

A URL without path (no trailing slash), query or fragment (it can have a scheme such as http, a domain and a port); the site to which the proxy maps.
original_paths (String, #each) —

The path (or paths) to be mapped right inside the proxy_path (each has to begin with a slash).

Raises:

PrettyProxy::ConfigError

# File 'lib/pretty_proxy.rb', line 90

def initialize(proxy_path, original_domain, original_paths)
  Utils.validate_proxy_path(proxy_path)
  Utils.validate_original_domain_and_paths(original_domain, original_paths)

  @proxy_path = proxy_path.clone
  @original_domain = Addressable::URI.parse(original_domain.clone)
  @original_paths = Set.new 
  if original_paths.respond_to? :each
    original_paths.each { | value | @original_paths << value.clone }
  else
    @original_paths << original_paths.clone
  end
end
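
The constructor always stores original_paths as a Set, whether it receives a single String or a collection. The following sketch isolates that branching in a hypothetical helper, `normalize_original_paths` (a name introduced here for illustration only), so the behaviour can be tried without the gem.

```ruby
require 'set'

# Hypothetical helper mirroring the branching in the constructor above:
# anything that responds to #each is treated as a collection, everything
# else as a single path.
def normalize_original_paths(original_paths)
  paths = Set.new
  if original_paths.respond_to?(:each)
    original_paths.each { |value| paths << value.clone }
  else
    paths << original_paths.clone
  end
  paths
end

single = normalize_original_paths('/')
many   = normalize_original_paths(['/about', '/blog/', '/about'])

puts single.to_a.inspect # ["/"]
puts many.size           # 2, the duplicate '/about' collapses
```

A String does not respond to `#each` in current Ruby versions, which is why a single path takes the `else` branch.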

Instance Attribute Details

#original_domain ⇒ `Object`

Returns a clone of the internal value.

# File 'lib/pretty_proxy.rb', line 111

[:proxy_path, :original_domain, :original_paths].each do | reader |
  define_method(reader) { instance_variable_get("@#{reader.to_s}").clone }
end

#original_paths ⇒ `Object`

Returns a clone of the internal value (always a Set, no matter what is passed to initialize).

# File 'lib/pretty_proxy.rb', line 111

[:proxy_path, :original_domain, :original_paths].each do | reader |
  define_method(reader) { instance_variable_get("@#{reader.to_s}").clone }
end

#proxy_path ⇒ `Object`

Returns a clone of the internal value.

# File 'lib/pretty_proxy.rb', line 111

[:proxy_path, :original_domain, :original_paths].each do | reader |
  define_method(reader) { instance_variable_get("@#{reader.to_s}").clone }
end
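
Because every reader returns a clone, callers cannot alter the proxy configuration by mutating what they receive. A small sketch of the same idiom, using a hypothetical `ReaderSketch` class rather than PrettyProxy itself:

```ruby
# Hypothetical class that reuses the reader-generation idiom shown above.
class ReaderSketch
  def initialize(proxy_path)
    @proxy_path = proxy_path.clone
  end

  [:proxy_path].each do |reader|
    define_method(reader) { instance_variable_get("@#{reader}").clone }
  end
end

sketch = ReaderSketch.new(String.new('/proxy/'))
copy = sketch.proxy_path
copy << 'mutated' # changes only the returned copy

puts sketch.proxy_path # /proxy/
```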

Instance Method Details

#call(env) ⇒ `Object`

Makes this class a Rack app. It is overridden to pass rewrite_response both the original Rack environment (the request to the proxy) and the rewritten env (modified to point to the original page request). If you are not familiar with the parameters and return value of this method, please read http://rack.rubyforge.org/doc/SPEC.html.

# File 'lib/pretty_proxy.rb', line 364

def call(env)
  # in theory we only need to repass the rewritten_env, any original env info
  #  needed can be passed as a environment application variable
  #  example: (env['app_name.original_path'] = env['PATH_INFO'])
  #  but to avoid this to be a common idiom we repass the original env too
  rewritten_env = rewrite_env(env)
  rewrite_response(perform_request(rewritten_env), env, rewritten_env)
end

#inside_proxy_control?(uri) ⇒ `Boolean`

Checks whether the URI::HTTP(S) is a page that can be accessed through the proxy.

Returns:

(Boolean)

# File 'lib/pretty_proxy.rb', line 380

def inside_proxy_control?(uri)
  same_domain_as_original?(uri) &&
    valid_path_for_proxy?(@proxy_path + uri.path[1..-1])
end

#point_to_a_proxy_page?(hyperlink, proxy_domain) ⇒ `Boolean`

Takes a URL and the proxy domain (scheme, host and port) and returns whether the URL points to a valid proxy page.

Returns:

(Boolean)

# File 'lib/pretty_proxy.rb', line 405

def point_to_a_proxy_page?(hyperlink, proxy_domain)
  Utils.same_domain?(hyperlink, proxy_domain) &&
    valid_path_for_proxy?(hyperlink.path)
end

#proxify_html(html, proxy_url) ⇒ `String`

Takes an (X)HTML document and applies proxify_hyperlink to the ‘href’ attribute of each ‘a’ element.

Parameters:

html (String) —

An (X)HTML document.
proxy_url (String, URI::HTTP, URI::HTTPS) —

The URL where the proxified version of the page will be displayed.

Returns:

(String) —

A copy of the document with the changes applied.

Raises:

PrettyProxy::ProxyError

# File 'lib/pretty_proxy.rb', line 207

def proxify_html(html, proxy_url)
  parsed_html = nil

  # If you parse XHTML as HTML with Nokogiri and call to_s afterwards, the markup can be messed up.
  #
  # Example:     <meta name="description" content="not important" />
  #   becomes    <meta name="description" content="not important" >
  # To avoid this, we parse a document that is valid XML as XML and, otherwise, as HTML.
  begin
    # this also isn't a great way to do this
    # the Nokogiri don't have exception classes, this way any StandardError will be silenced
    options = Nokogiri::XML::ParseOptions::DEFAULT_XML &
                Nokogiri::XML::ParseOptions::STRICT &
                Nokogiri::XML::ParseOptions::DTDVALID
    parsed_html = Nokogiri::XML::Document.parse(html, nil, nil, options)
  rescue
    parsed_html = Nokogiri::HTML(html)
  end

  parsed_html.css('a').each do | hyperlink |
    hyperlink['href'] = proxify_hyperlink(hyperlink['href'], proxy_url)
  end

  parsed_html.to_s
end

#proxify_hyperlink(hyperlink, proxy_page_url) ⇒ `String`

Takes a hyperlink and the URL of the proxy page (not the original page) it comes from, and returns the rewritten hyperlink. If the page the hyperlink points to is inside the proxy control, the rewritten hyperlink points to the proxified version; otherwise it points to the original version.

Parameters:

hyperlink (String, URI::HTTP, URI::HTTPS) —

A string with a relative path, or a URL (String or URI).
proxy_page_url (String, URI::HTTP, URI::HTTPS) —

The URL of the proxy page the hyperlink comes from.

Returns:

(String) —

A relative path or a URL.

Raises:

PrettyProxy::ProxyError

# File 'lib/pretty_proxy.rb', line 161

def proxify_hyperlink(hyperlink, proxy_page_url)
  hyperlink = Addressable::URI.parse(hyperlink.clone)
  proxy_page_url = Addressable::URI.parse(proxy_page_url)

  # this is URI relative ('//duckduckgo.com', '/path', '../path')
  if hyperlink.relative?
    absolute_hyperlink = Addressable::URI.parse(unproxify_url(proxy_page_url))
                                         .join(hyperlink)
    if inside_proxy_control? absolute_hyperlink
      # this is path relative ('../path', 'path', but not '//duckduckgo.com' or '/path')
      if Pathname.new(hyperlink.path).relative?
        if point_to_a_proxy_page?(absolute_hyperlink, proxy_page_url)
          # in the case of a relative path in the original page who points
          # to a proxy page, and the proxy page is inside the proxy control
          # we have to use the absolute_hyperlink or the page will be double
          # proxified. Example: ../proxy/content in http://example.com/proxy/content,
          # with original_path as '/' is http://example.com/proxy/proxy/content
          hyperlink = absolute_hyperlink
        end
      else
        hyperlink.path = @proxy_path[0..-2] + absolute_hyperlink.path
        hyperlink.host = proxy_page_url.host if hyperlink.host
        hyperlink.port = proxy_page_url.port if hyperlink.port
      end
    else
      hyperlink = absolute_hyperlink
    end
  else # the hyperlink is absolute
    if inside_proxy_control? hyperlink
      # if points to the proxy itself we don't double-proxify
      unless point_to_a_proxy_page?(hyperlink, proxy_page_url)
        hyperlink = proxify_uri(hyperlink, proxy_page_url)
      end
    end
  end

  hyperlink.to_s
end
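
The first step above resolves a relative hyperlink against the unproxified page URL before deciding whether it is inside the proxy control. The sketch below shows only that resolution step. It uses the standard library's `URI.join` as a stand-in for `Addressable::URI#join` (an assumption that the two agree on these simple inputs), and `http://example.com/dir/page.html` is a made-up original page URL.

```ruby
require 'uri'

# Assumed original page URL, i.e. what unproxify_url would return for the
# proxy page the hyperlink was found on.
original_page = 'http://example.com/dir/page.html'

resolved = {}
['sibling.html', '../other', '/root', '//duckduckgo.com/'].each do |link|
  # All four count as "URI relative" because none of them has a scheme.
  relative = URI.parse(link).relative?
  resolved[link] = URI.join(original_page, link).to_s
  puts "#{link} (relative: #{relative}) -> #{resolved[link]}"
end
```

Note that `//duckduckgo.com/` is relative in the URI sense (it has no scheme) but still changes the host, which is why the method distinguishes URI-relative from path-relative hyperlinks.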

#rewrite_env(env) ⇒ `Hash{String => String}`

Modifies a Rack environment hash for a request to the proxy version of a page into a request to the original page. As in Rack::Proxy, it is used by #call to request the original page before rewrite_response is called on the response. If you want to use your own rewrite rules, it may be wiser to subclass Rack::Proxy instead of this class. The purpose of this class is mainly to implement and enforce these rules for you.

Parameters:

env (Hash{String => String}) —

A Rack environment hash. (see: http://rack.rubyforge.org/doc/SPEC.html)

Returns:

(Hash{String => String}) —

An unproxified copy of the argument.

Raises:

PrettyProxy::ProxyError

# File 'lib/pretty_proxy.rb', line 243

def rewrite_env(env)
  env = env.clone
  url_requested_to_proxy = Rack::Request.new(env).url
  # Using URI, and not Addressable::URI because the port value is incorrect in the last
  unproxified_url = Addressable::URI.parse(unproxify_url(url_requested_to_proxy))

  if env['HTTP_HOST']
    env['HTTP_HOST'] = unproxified_url.host
  end
  env['SERVER_NAME'] = unproxified_url.host
  env['SERVER_PORT'] = unproxified_url.inferred_port.to_s

  if env['SCRIPT_NAME'].empty? && !env['PATH_INFO'].empty?
    env['PATH_INFO'] = unproxified_url.path
  end
  if !env['SCRIPT_NAME'].empty? && env['PATH_INFO'].empty?
    env['SCRIPT_NAME'] = unproxified_url.path
  end
  # Seriously, i don't know how to split again the unproxified url, so PATH_INFO gonna have the full path
  if (!env['SCRIPT_NAME'].empty? && !env['PATH_INFO'].empty?) ||
      (env['SCRIPT_NAME'].empty? && env['PATH_INFO'].empty?)
    env['PATH_INFO'] = unproxified_url.path
    env['SCRIPT_NAME'] = ''
  end

  env['REQUEST_PATH'] = unproxified_url.path
  env['REQUEST_URI'] = unproxified_url.path

  env
end
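
To make the effect on the environment concrete, here is a simplified sketch that works on a plain Hash. The hypothetical `rewrite_env_sketch` takes the already-unproxified URL as an argument instead of deriving it through Rack::Request and unproxify_url, and it always puts the full path in PATH_INFO, so it covers only one of the SCRIPT_NAME/PATH_INFO cases handled above.

```ruby
require 'uri'

# Hypothetical, simplified take on the rewriting above (not the gem's code).
def rewrite_env_sketch(env, unproxified_url)
  target = URI.parse(unproxified_url)
  rewritten = env.clone # shallow copy, the caller's Hash is left untouched

  rewritten['HTTP_HOST']   = target.host if rewritten['HTTP_HOST']
  rewritten['SERVER_NAME'] = target.host
  rewritten['SERVER_PORT'] = target.port.to_s
  rewritten['SCRIPT_NAME'] = ''
  rewritten['PATH_INFO']   = target.path
  rewritten
end

env = {
  'HTTP_HOST'   => 'localhost:9292',
  'SERVER_NAME' => 'localhost',
  'SERVER_PORT' => '9292',
  'SCRIPT_NAME' => '',
  'PATH_INFO'   => '/proxy/cards'
}

rewritten = rewrite_env_sketch(env, 'http://magiccards.info/cards')
puts rewritten['SERVER_NAME'] # magiccards.info
puts rewritten['PATH_INFO']   # /cards
```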

#rewrite_response(triplet, requested_to_proxy_env, rewritten_env) ⇒ `Array<(Integer, Hash{String => String}, #each)>`

Mainly applies proxify_html to the body of the response if it is HTML. Raises an error if the ‘content-encoding’ is anything other than deflate, gzip or identity. Updates the ‘content-length’ header to the bytesize of the new body. Removes the ‘transfer-encoding’ header if it is chunked, and behaves as if the response were not chunked. This method is inherited from Rack::Proxy, but the original takes only the first parameter (the triplet). This version takes the Rack env of the request to the proxy and the rewritten Rack env as second and third parameters, respectively.

Parameters:

triplet (Array<(Integer, Hash{String => String}, #each)>) —

A Rack response (see http://rack.rubyforge.org/doc/SPEC.html) for the request to the original site.
requested_to_proxy_env (Hash{String => String}) —

A Rack environment hash: the version requested from the proxy.
rewritten_env (Hash{String => String}) —

A Rack environment hash: the version rewritten by the proxy to point to the original page.

Returns:

(Array<(Integer, Hash{String => String}, #each)>) —

An unproxified copy of the first argument.

Raises:

PrettyProxy::ProxyError

# File 'lib/pretty_proxy.rb', line 292

def rewrite_response(triplet, requested_to_proxy_env, rewritten_env)
  status, headers, body = triplet
  content_type = headers['content-type']
  return triplet unless %r{text/html} =~ content_type ||
                        %r{application/xhtml\+xml} =~ content_type

  # the #each method of body can't be called twice, but we need to call it here and it is called
  # after this method return, so we fake the body with a array of one string
  # we can't return a string (even it responds to #each) see: http://rack.rubyforge.org/doc/SPEC.html (section 'The Body')
  page = ''
  body.each do | chunk |
    page << chunk
  end

  case headers['content-encoding']
  when 'gzip' then page = Zlib::GzipReader.new(StringIO.new(page)).read
  when 'deflate' then page = Zlib::Inflate.inflate(page)
  when 'identity' then page = page
  when nil then page = page
  else
    fail ProxyError, 'unknown content-encoding, only encodings known are gzip, deflate and identity'
  end

  page = proxify_html(page, Rack::Request.new(requested_to_proxy_env).url)
  status, headers, page = sugared_rewrite_response([status, headers, page],
                                                    requested_to_proxy_env,
                                                    rewritten_env)

  case headers['content-encoding']
  when 'gzip'
    page_ = page.clone
    gzip_stream = Zlib::GzipWriter.new(StringIO.new(page_))
    gzip_stream.write page
    gzip_stream.close
    page = page_
  when 'deflate' then page = Zlib::Deflate.deflate(page)
  end

  headers['content-length'] = page.bytesize.to_s if headers['content-length']

  # TODO: find a way to make the code work with chunked encoding
  if 'chunked' == headers['transfer-encoding']
    headers.delete('transfer-encoding') 
    headers['content-length'] = page.bytesize.to_s
  end

  [status, headers, [page]]
end
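
The sequence above (collect the chunks, decompress, rewrite, recompress, fix content-length) can be tried in isolation. The sketch below is an illustration under assumptions, not the gem's code: it uses the `Zlib.gzip` and `Zlib.gunzip` convenience methods instead of GzipReader/GzipWriter, a plain Array as the Rack body, and a simple string substitution in place of proxify_html.

```ruby
require 'zlib'

# Stand-in for a Rack body: any object whose #each yields String chunks.
gzipped = Zlib.gzip('<a href="/about">About</a>')
body = [gzipped[0, 10], gzipped[10..-1]]
headers = { 'content-encoding' => 'gzip', 'content-length' => gzipped.bytesize.to_s }

# 1. Concatenate the chunks (String.new gives an empty binary string).
page = String.new
body.each { |chunk| page << chunk }

# 2. Decompress, 3. rewrite (stand-in for proxify_html), 4. recompress.
html = Zlib.gunzip(page)
html = html.sub('href="/about"', 'href="/proxy/about"')
page = Zlib.gzip(html)

# 5. The compressed size changed, so content-length must be recomputed.
headers = headers.merge('content-length' => page.bytesize.to_s)
response = [200, headers, [page]]

puts Zlib.gunzip(response[2].first) # <a href="/proxy/about">About</a>
```

The body is wrapped in an Array at the end because a Rack body must respond to `#each` and yield Strings; a bare String is not a valid body.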

#same_domain_as_original?(uri) ⇒ `Boolean`

Checks whether the #scheme, #host, and #port of the argument are equal to those of the original_domain.

Returns:

(Boolean)

# File 'lib/pretty_proxy.rb', line 375

def same_domain_as_original?(uri)
  Utils.same_domain?(@original_domain, uri)
end
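
The body of Utils.same_domain? is not shown on this page, so the following is only a sketch of the documented rule (same scheme, host and port), written with the standard library's URI and a hypothetical method name.

```ruby
require 'uri'

# Hypothetical comparison illustrating the documented rule; it is not the
# implementation of Utils.same_domain?.
def same_domain_sketch?(first, second)
  a = URI.parse(first.to_s)
  b = URI.parse(second.to_s)
  [a.scheme, a.host, a.port] == [b.scheme, b.host, b.port]
end

puts same_domain_sketch?('http://example.com', 'http://example.com:80/path') # true
puts same_domain_sketch?('http://example.com', 'https://example.com/')       # false
puts same_domain_sketch?('http://example.com', 'http://www.example.com/')    # false
```

In this sketch an omitted port compares equal to the scheme's default port, because `URI#port` fills in the default.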

#sugared_rewrite_response(triplet, requested_to_proxy_env, rewritten_env) ⇒ `Array<(Integer, Hash{String => String}, String)>`

This method is abstract.

This method is called only for (X)HTML responses, after they are decompressed and the hyperlinks are proxified, and before they are compressed again and the new content-length is calculated.

Note:

The body of the triplet is a String and not an object that responds to #each; the same has to be true of the return value. Return a modified clone of the response; don’t change the argument.

Parameters:

triplet (Array<(Integer, Hash{String => String}, String)>) —

Not a valid Rack response: the third element is a String with the response body.
requested_to_proxy_env (Hash{String => String}) —

A Rack environment hash: the version requested from the proxy.
rewritten_env (Hash{String => String}) —

A Rack environment hash: the version rewritten by the proxy to point to the original page.

Returns:

(Array<(Integer, Hash{String => String}, String)>) —

An unproxified copy of the first argument.

# File 'lib/pretty_proxy.rb', line 355

def sugared_rewrite_response(triplet, requested_to_proxy_env, rewritten_env)
  triplet
end
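
The contract in the note above (String body in, String body out, argument left untouched) can be sketched with a plain method so that it runs without the gem. `add_banner` and the banner markup are hypothetical; in a real subclass the same body would go inside an override of sugared_rewrite_response.

```ruby
# Hypothetical override body. The third element of the triplet is a String,
# not a Rack body, and the method returns new objects instead of mutating.
def add_banner(triplet, _requested_to_proxy_env, _rewritten_env)
  status, headers, page = triplet
  new_page = page.sub('<body>', '<body><p>Served through a proxy</p>')
  [status, headers.clone, new_page]
end

original = [200, { 'content-type' => 'text/html' }, '<html><body>Hi</body></html>']
rewritten = add_banner(original, {}, {})

puts rewritten[2] # <html><body><p>Served through a proxy</p>Hi</body></html>
puts original[2]  # <html><body>Hi</body></html>
```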

#unproxify_url(url) ⇒ `String`

Takes a proxy URL and returns the original URL behind the proxy. Preserves the query and fragment, if any. For rewriting a request, see rewrite_env.

Parameters:

url (String, URI::HTTP, URI::HTTPS) —

A URL.

Returns:

(String) —

The unproxified URI in a string.

Raises:

PrettyProxy::ProxyError

# File 'lib/pretty_proxy.rb', line 135

def unproxify_url(url)
  url = Addressable::URI.parse(url.clone)
  
  unless valid_path_for_proxy? url.path
    fail ProxyError, "'#{url.to_s}' isn't inside the proxy control, it can't be unproxified"
  end

  url.site = @original_domain.site
  url.path = url.path.slice((@proxy_path.size-1)..-1)

  url.to_s
rescue Addressable::URI::InvalidURIError
  raise ArgumentError, "the url argument isn't a valid uri"
end
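
As a rough illustration of the transformation, the sketch below swaps in the original site and strips the proxy prefix using only the standard library. `unproxify_sketch` is a hypothetical function: it takes the configuration as arguments and checks only the proxy_path prefix, whereas the method above validates the path against original_paths as well.

```ruby
require 'uri'

# Hypothetical, simplified version of the transformation above.
def unproxify_sketch(url, proxy_path, original_domain)
  uri = URI.parse(url)
  unless uri.path.start_with?(proxy_path)
    raise ArgumentError, "'#{url}' does not start with #{proxy_path}"
  end

  rebuilt = URI.parse(original_domain)
  # Keep the slash that ends proxy_path as the leading slash of the new path.
  rebuilt.path = uri.path[(proxy_path.size - 1)..-1]
  rebuilt.query = uri.query
  rebuilt.fragment = uri.fragment
  rebuilt.to_s
end

puts unproxify_sketch('http://localhost:9292/proxy/cards?q=1#top',
                      '/proxy/', 'http://magiccards.info')
# http://magiccards.info/cards?q=1#top
```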

#valid_path_for_proxy?(absolute_path) ⇒ `Boolean`

Checks whether the absolute path begins with the proxy_path and is followed by an original_paths element.

Returns:

(Boolean)

# File 'lib/pretty_proxy.rb', line 387

def valid_path_for_proxy?(absolute_path)
  return false unless absolute_path.start_with?(@proxy_path)

  path_without_proxy_prefix = absolute_path[(@proxy_path.size-1)..-1]

  @original_paths.any? do | original_path |
    # if we don't test this '/about' and '/about_us' will match
    if original_path.end_with? '/'
      path_without_proxy_prefix.start_with? original_path
    else
      path_without_proxy_prefix == original_path ||
        path_without_proxy_prefix.start_with?("#{original_path}/")
    end
  end
end
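
The '/about' versus '/about_us' case mentioned in the comment is easiest to see with concrete inputs. The following standalone restatement of the rule takes the configuration as arguments instead of instance variables; the function name, the argument names and the sample paths are assumptions made for the example.

```ruby
require 'set'

# Standalone restatement of the rule above (not the gem's method).
def valid_path_sketch?(absolute_path, proxy_path, original_paths)
  return false unless absolute_path.start_with?(proxy_path)

  # Keep the trailing slash of proxy_path as the leading slash of the rest.
  rest = absolute_path[(proxy_path.size - 1)..-1]
  original_paths.any? do |original_path|
    if original_path.end_with?('/')
      rest.start_with?(original_path)
    else
      # Without this exact-or-subpath test, '/about' would match '/about_us'.
      rest == original_path || rest.start_with?("#{original_path}/")
    end
  end
end

paths = Set['/about', '/blog/']
puts valid_path_sketch?('/proxy/about', '/proxy/', paths)      # true
puts valid_path_sketch?('/proxy/about/team', '/proxy/', paths) # true
puts valid_path_sketch?('/proxy/about_us', '/proxy/', paths)   # false
puts valid_path_sketch?('/proxy/blog/2013', '/proxy/', paths)  # true
puts valid_path_sketch?('/other/about', '/proxy/', paths)      # false
```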
