Class: ScraperWiki::API

Inherits:
Object
Includes:
HTTParty
Defined in:
lib/scraperwiki-api.rb,
lib/scraperwiki-api/version.rb,
lib/scraperwiki-api/matchers.rb

Overview

A Ruby wrapper for the ScraperWiki API.

Defined Under Namespace

Modules: Matchers

Constant Summary

RUN_INTERVALS =
{
  :never   => -1,
  :monthly => 2678400,
  :weekly  => 604800,
  :daily   => 86400,
  :hourly  => 3600,
}
VERSION =
"0.0.7"

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(apikey = nil) ⇒ API

Initializes a ScraperWiki API object.



# File 'lib/scraperwiki-api.rb', line 37

def initialize(apikey = nil)
  @apikey = apikey
end
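
A minimal usage sketch. The require path matches the file listed above; the key is a hypothetical placeholder, and it can be omitted entirely since +apikey+ defaults to nil:

require 'scraperwiki-api'

api = ScraperWiki::API.new('your-api-key')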

Class Method Details

.edit_scraper_url(shortname) ⇒ String

Returns the URL to edit the scraper.

Parameters:

  • shortname (String)

    the scraper's shortname

Returns:

  • (String)

    the URL to edit the scraper



# File 'lib/scraperwiki-api.rb', line 31

def edit_scraper_url(shortname)
  "https://scraperwiki.com/scrapers/#{shortname}/edit/"
end

.scraper_url(shortname) ⇒ String

Returns the URL to the scraper's overview.

Parameters:

  • shortname (String)

    the scraper's shortname

Returns:

  • (String)

    the URL to the scraper's overview



# File 'lib/scraperwiki-api.rb', line 23

def scraper_url(shortname)
  "https://scraperwiki.com/scrapers/#{shortname}/"
end
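
Both URL helpers are plain string builders, so no API key is needed; a sketch with a hypothetical shortname:

ScraperWiki::API.scraper_url('example-scraper')
# => "https://scraperwiki.com/scrapers/example-scraper/"

ScraperWiki::API.edit_scraper_url('example-scraper')
# => "https://scraperwiki.com/scrapers/example-scraper/edit/"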

Instance Method Details

#datastore_sqlite(shortname, query, opts = {}) ⇒ Array, ...

Note:

The query string parameter is +name+, not +shortname+ as in the ScraperWiki docs.

Queries and extracts data via a general purpose SQL interface.

To produce an RSS feed, use SQL's +AS+ keyword (e.g. "SELECT name AS description") to alias columns as +title+, +link+, +description+, +guid+ (optional; +link+ is used if it is absent) and +pubDate+ or +date+, as in the sketch below.
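
A sketch of such a query; the scraper shortname and the source column names are hypothetical:

api.datastore_sqlite('example-scraper',
  'SELECT name AS title, url AS link, summary AS description, updated AS pubDate FROM swdata',
  :format => 'rss2')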

+jsondict+ example output:

[
  {
    "fieldA": "valueA",
    "fieldB": "valueB",
    "fieldC": "valueC",
  },
  ...
]

+jsonlist+ example output:

{
  "keys": ["fieldA", "fieldB", "fieldC"],
  "data": [
    ["valueA", "valueB", "valueC"],
    ...
  ]
}

+csv+ example output:

fieldA,fieldB,fieldC
valueA,valueB,valueC
...

Parameters:

  • shortname (String)

    the scraper's shortname (as it appears in the URL)

  • query (String)

    a SQL query

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :format (String)

    one of "jsondict", "jsonlist", "csv", "htmltable" or "rss2"

  • :attach (Array, String)

    ";"-delimited list of shortnames of other scrapers whose data you need to access

Returns:

  • (Array, Hash, String)




# File 'lib/scraperwiki-api.rb', line 86

def datastore_sqlite(shortname, query, opts = {})
  if Array === opts[:attach]
    opts[:attach] = opts[:attach].join ';'
  end
  request_with_apikey '/datastore/sqlite', {:name => shortname, :query => query}.merge(opts)
end
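
A usage sketch with hypothetical shortnames; as the code above shows, an Array passed as :attach is joined with ";" for you:

api = ScraperWiki::API.new
api.datastore_sqlite('example-scraper', 'SELECT * FROM swdata LIMIT 10',
  :format => 'jsondict',
  :attach => ['other-scraper-a', 'other-scraper-b'])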

#scraper_getinfo(shortname, opts = {}) ⇒ Array

Note:

Returns an array, although it seems always to contain only one item

Note:

The +tags+ field seems always to be an empty array.

Note:

Fields like +last_run+ seem to follow British Summer Time.

Note:

The query string parameter is +name+, not +shortname+ as in the ScraperWiki docs.

Extracts data about a scraper's code, owner, history, etc.

  • +runid+ is a Unix timestamp (with microseconds) and a UUID, joined by an underscore.
  • The value of +records+ is the same as that of +total_rows+ under +datasummary+.
  • +run_interval+ is the number of seconds between runs (see RUN_INTERVALS). It is one of:
    • -1 (never)
    • 2678400 (monthly)
    • 604800 (weekly)
    • 86400 (daily)
    • 3600 (hourly)
  • +privacy_status+ is one of:
    • "public" (everyone can see and edit the scraper and its data)
    • "visible" (everyone can see the scraper, but only contributors can edit it)
    • "private" (only contributors can see and edit the scraper and its data)
  • An individual +runevents+ hash will have an +exception_message+ key if there was an error during that run.

Example output:

[
  {
    "code": "require 'nokogiri'\n...",
    "datasummary": {
      "tables": {
        "swdata": {
          "keys": [
            "fieldA",
            ...
          ],
          "count": 42,
          "sql": "CREATE TABLE `swdata` (...)"
        },
        "swvariables": {
          "keys": [
            "value_blob",
            "type",
            "name"
          ],
          "count": 2,
          "sql": "CREATE TABLE `swvariables` (`value_blob` blob, `type` text, `name` text)"
        },
        ...
      },
      "total_rows": 44,
      "filesize": 1000000
    },
    "description": "Scrapes websites for data.",
    "language": "ruby",
    "title": "Example scraper",
    "tags": [],
    "short_name": "example-scraper",
    "userroles": {
      "owner": [
        "johndoe"
      ],
      "editor": [
        "janedoe",
        ...
      ]
    },
    "last_run": "1970-01-01T00:00:00",
    "created": "1970-01-01T00:00:00",
    "runevents": [
      {
        "still_running": false,
        "pages_scraped": 5,
        "run_started": "1970-01-01T00:00:00",
        "last_update": "1970-01-01T00:00:00",
        "runid": "1325394000.000000_xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx",
        "records_produced": 42
      },
      ...
    ],
    "records": 44,
    "wiki_type": "scraper",
    "privacy_status": "visible",
    "run_interval": 604800,
    "attachable_here": [],
    "attachables": [],
    "history": [
      ...,
      {
        "date": "1970-01-01T00:00:00",
        "version": 0,
        "user": "johndoe",
        "session": "Thu, 1 Jan 1970 00:00:08 GMT"
      }
    ]
  }
]

Parameters:

  • shortname (String)

    the scraper's shortname (as it appears in the URL)

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :version (String)

    version number (-1 for most recent) [default -1]

  • :history_start_date (String)

restricts +history+ and +runevents+ to this date or later (format: YYYY-MM-DD)

  • :quietfields (Array, String)

    "|"-delimited list of fields to exclude from the output. Must be a subset of 'code|runevents|datasummary|userroles|history'

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 198

def scraper_getinfo(shortname, opts = {})
  if Array === opts[:quietfields]
    opts[:quietfields] = opts[:quietfields].join '|'
  end
  request_with_apikey '/scraper/getinfo', {:name => shortname}.merge(opts)
end
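
A usage sketch with a hypothetical shortname; as the code above shows, an Array passed as :quietfields is joined with "|" for you:

api.scraper_getinfo('example-scraper',
  :quietfields => ['code', 'history']).first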

#scraper_getruninfo(shortname, opts = {}) ⇒ Array

Note:

Returns an array, although it seems always to contain only one item

Note:

The query string parameter is +name+, not +shortname+ as in the ScraperWiki docs.

See what the scraper did during each run.

Example output:

[
  {
    "run_ended": "1970-01-01T00:00:00",
    "first_url_scraped": "http://www.iana.org/domains/example/",
    "pages_scraped": 5,
    "run_started": "1970-01-01T00:00:00",
    "runid": "1325394000.000000_xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx",
    "domainsscraped": [
      {
        "domain": "http://example.com",
        "bytes": 1000000,
        "pages": 5
      },
      ...
    ],
    "output": "...",
    "records_produced": 42
  }
]

Parameters:

  • shortname (String)

    the scraper's shortname (as it appears in the URL)

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :runid (String)

    a run ID

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 237

def scraper_getruninfo(shortname, opts = {})
  request_with_apikey '/scraper/getruninfo', {:name => shortname}.merge(opts)
end
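
A usage sketch with a hypothetical shortname; the run ID is the placeholder from the example output above:

api.scraper_getruninfo('example-scraper')
api.scraper_getruninfo('example-scraper',
  :runid => '1325394000.000000_xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx')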

#scraper_getuserinfo(username) ⇒ Array

Note:

Returns an array, although it seems always to contain only one item

Note:

The date joined field is +date_joined+ (with underscore) on #scraper_usersearch.

Find out information about a user.

Example output:

[
  {
    "username": "johndoe",
    "profilename": "John Doe",
    "coderoles": {
      "owner": [
        "johndoe.emailer",
        "example-scraper",
        ...
      ],
      "email": [
        "johndoe.emailer"
      ],
      "editor": [
        "yet-another-scraper",
        ...
      ]
    },
    "datejoined": "1970-01-01T00:00:00"
  }
]

Parameters:

  • username (String)

    a username

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 273

def scraper_getuserinfo(username)
  request_with_apikey '/scraper/getuserinfo', :username => username
end
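
A usage sketch using the username from the example output:

api.scraper_getuserinfo('johndoe').first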

#scraper_search(opts = {}) ⇒ Array

Search the titles and descriptions of all the scrapers.

Example output:

[
  {
    "description": "Scrapes websites for data.",
    "language": "ruby",
    "created": "1970-01-01T00:00:00",
    "title": "Example scraper",
    "short_name": "example-scraper",
    "privacy_status": "public"
  },
  ...
]

Parameters:

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :searchquery (String)

    search terms

  • :maxrows (Integer)

    number of results to return [default 5]

  • :requestinguser (String)

    the name of the user making the search, which changes the order of the matches

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 299

def scraper_search(opts = {})
  request_with_apikey '/scraper/search', opts
end
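
A usage sketch with hypothetical search terms:

api.scraper_search(:searchquery => 'example', :maxrows => 10)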

#scraper_usersearch(opts = {}) ⇒ Array

Note:

The date joined field is +datejoined+ (without underscore) on #scraper_getuserinfo.

Search for a user by name.

Example output:

[
  {
    "username": "johndoe",
    "profilename": "John Doe",
    "date_joined": "1970-01-01T00:00:00"
  },
  ...
]

Parameters:

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :searchquery (String)

    search terms

  • :maxrows (Integer)

    number of results to return [default 5]

  • :nolist (Array, String)

    space-separated list of usernames to exclude from the output

  • :requestinguser (String)

    the name of the user making the search, which changes the order of the matches

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 327

def scraper_usersearch(opts = {})
  if Array === opts[:nolist]
    opts[:nolist] = opts[:nolist].join ' '
  end
  request '/scraper/usersearch', opts
end
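
A usage sketch with hypothetical usernames; as the code above shows, an Array passed as :nolist is joined with spaces for you:

api.scraper_usersearch(:searchquery => 'doe',
  :nolist => ['johndoe', 'janedoe'])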