Google Ajax Crawler
Rack middleware adhering to the Google Ajax Crawling Scheme. It uses a headless browser to render JS-heavy pages and serves a DOM snapshot of the rendered state to the requesting search engine.
Details of the scheme can be found at: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
Using
install
gem install google_ajax_crawler
In your config.ru
require 'google_ajax_crawler'
use GoogleAjaxCrawler::Crawler do |config|
config.page_loaded_test = lambda {|driver| driver.page.evaluate_script('document.getElementById("loading") == null') }
end
app = lambda {|env| [200, {'Content-Type' => 'text/plain'}, ["b"] ] }
run app
Examples
In the examples folder, each driver has a rackup file, which can be launched:
rackup examples/[driver_name].ru
Then open a browser to http://localhost:9292/#!test and view source. This is how a search engine will see your page. NOTE: don't look at the markup through a web inspector, as it will most likely display DOM elements rendered on the fly by JS.
Change the URL to http://localhost:9292/?_escaped_fragment_=test, then view source again to see how the DOM state has been captured.
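The two URLs above are related by the mapping the Ajax Crawling Scheme defines: a crawler that sees a #! URL requests the same page with the fragment moved into an _escaped_fragment_ query parameter. As a rough sketch of that mapping (the helper names escape_fragment and escaped_fragment_url are hypothetical and not part of this gem, and the set of escaped characters is our reading of the scheme):

```ruby
# Hypothetical helper, not part of google_ajax_crawler.
# Percent-escapes the bytes the scheme asks to be escaped in a fragment:
# 0x00-0x20, '#', '%', '&', '+', and 0x7F-0xFF (assumed from the scheme docs).
def escape_fragment(fragment)
  fragment.bytes.map do |b|
    if b <= 0x20 || b >= 0x7F || [0x23, 0x25, 0x26, 0x2B].include?(b)
      format('%%%02X', b)
    else
      b.chr
    end
  end.join
end

# Hypothetical helper: turns a "pretty" #! URL into the URL a crawler requests.
def escaped_fragment_url(pretty_url)
  base, fragment = pretty_url.split('#!', 2)
  return pretty_url if fragment.nil? # no hash-bang, nothing to map

  # Append to an existing query string if there is one.
  separator = base.include?('?') ? '&' : '?'
  "#{base}#{separator}_escaped_fragment_=#{escape_fragment(fragment)}"
end

puts escaped_fragment_url('http://localhost:9292/#!test')
```

Requests carrying the _escaped_fragment_ parameter are the ones the middleware is meant to answer with a rendered snapshot.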
Configuration Options
page_loaded_test
Tells the crawler when your page has finished loading and rendering. Deciding this can depend on a number of qualitative factors (e.g. all ajax requests have received responses, certain content has been displayed, or no loaders / spinners remain visible on the page). The page loaded test lets you specify when the crawler should treat the page as finished and return a snapshot of the rendered DOM at that time.
The current crawler driver is passed to the lambda to allow querying of the current page's DOM state.
A good pattern is to test your page state in a JS function that returns a boolean and is accessible from the window context, e.g.
use GoogleAjaxCrawler::Crawler do |config|
config.page_loaded_test = lambda {|driver| driver.page.evaluate_script('myApp.isPageLoaded()') }
end
timeout
The maximum time the crawler should wait before returning a response.
driver
The configured Google Ajax Crawler driver used to query the current page state. Presently there is only one driver, CapybaraWebkit (now taking pull requests!).
poll_interval
How often (in seconds) to test the page state with the configured page_loaded_test
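To illustrate how timeout and poll_interval might interact, here is a minimal sketch of a polling loop. It is not the gem's actual implementation: wait_until_loaded is a hypothetical helper, and it assumes both values are given in seconds.

```ruby
# Hypothetical helper, not part of google_ajax_crawler.
# Calls the given block every poll_interval seconds until it returns a truthy
# value (returns true) or until timeout seconds have elapsed (returns false).
def wait_until_loaded(timeout:, poll_interval:, &page_loaded_test)
  # A monotonic clock is unaffected by system clock adjustments.
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if page_loaded_test.call
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep poll_interval
  end
end

# Usage: a stand-in check that passes on the third poll.
attempts = 0
loaded = wait_until_loaded(timeout: 1, poll_interval: 0.01) { (attempts += 1) >= 3 }
puts "loaded=#{loaded} after #{attempts} checks"
```

In a loop like this, a shorter poll_interval returns the snapshot sooner after the page becomes ready at the cost of more frequent checks, while timeout bounds how long a request can be held open when the check never passes.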
response_headers
Which response headers should be returned with the DOM snapshot. The default headers specify the content-type text/html.
License
All free. Use, modify, fork to your heart's content. See LICENSE.txt for further details.