SnapSearch-Client-Ruby

SnapSearch Client Ruby is a Ruby-based, framework-agnostic HTTP client library for SnapSearch (https://snapsearch.io/).

SnapSearch provides similar libraries in other languages: https://github.com/SnapSearch/Snapsearch-Clients

Installation
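
Assuming the gem is published to RubyGems under the name used in the release steps below, it can be installed with:

gem install snapsearch-client-ruby

Or added to an application's Gemfile:

gem 'snapsearch-client-ruby'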

Usage

Development

Get the Bundler dependency management tool:

gem install bundler

Install/update all dependencies:

bundle install

See all build tasks:

bundle exec rake -T

Make your changes. Then release a new version tag with the following (see also the other rake version:bump:... tasks):

bundle exec rake version:bump

Push the release commit and the new tag to GitHub:

git push
git push --tags

Create the gem package:

bundle exec rake gem

Push the gem to RubyGems:

gem push pkg/snapsearch-client-ruby-MAJOR.MINOR.PATCH.gem

Setting Up the Detector

The Detector class detects whether the incoming request came from a search engine robot. It will intercept in cascading order:

  1. on a GET request
  2. on an HTTP or HTTPS protocol
  3. not on any ignored robot user agents
  4. not on any route not matching the whitelist
  5. not on any route matching the blacklist
  6. not on any static files that are not a PHP file, if one is detected
  7. on requests with the _escaped_fragment_ query parameter (see the example after this list)
  8. on any matched robot user agents
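
For example, under the AJAX crawling scheme a hash-bang URL is requested by a crawler with the hash fragment moved into the _escaped_fragment_ query parameter, which is what rule 7 checks for:

http://example.com/page#!key=value                       (URL as served to users)
http://example.com/page?_escaped_fragment_=key=value     (same URL as requested by a crawler)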

You can customize a few aspects of this process:

User Agents

Most robots send a unique User-Agent HTTP header that we match against to confirm whether the request is indeed from a robot.
We also ignore certain user agents, such as the SnapSearch robot itself.
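
For example, Google's web crawler identifies itself with a User-Agent header along these lines (the exact string varies between crawlers and versions):

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)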

The list of user agents to match and ignore is contained in resources/robots.json. You can customize this list through the Detector instance you are working with:

# Retrieve the list of user agents to match and ignore:
detector.robots # => { 'match' => ['SomeRobot', 'AnotherRobot'], 'ignore' => ['SnapSearch'] }

# Add a user agent to match against:
detector.robots['match'] << 'NewRobot'

# Add a user agent to ignore:
detector.robots['ignore'] << 'MyRobot'

# Set a new list of user agents to match and ignore:
detector.robots = { 'match' => ['WebScraper', 'SillyBot'], 'ignore' => ['MyBotToIgnore'] }

# Load from a custom JSON file:
detector.robots_json = './my_robots.json'
detector.robots # => { 'match' => ['MyCustomBot', 'AnotherRobot'], 'ignore' => ['MyLoadedBotFromJSON'] }
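
The custom JSON file mirrors the same 'match'/'ignore' structure as the hash above. A minimal sketch that writes such a file, assuming that structure (the bot names are placeholders):

require 'json'

# Same 'match'/'ignore' structure the Detector exposes via detector.robots
custom_robots = {
  'match'  => ['MyCustomBot', 'AnotherRobot'],   # user agents to intercept
  'ignore' => ['MyLoadedBotFromJSON']            # user agents to skip
}

# Write the file so it can be loaded with detector.robots_json = './my_robots.json'
File.write('./my_robots.json', JSON.pretty_generate(custom_robots))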

Tests

Tests are written with RSpec. Run them with:

bundle exec rspec spec/