Class: SciRuby::Data::Guardian

Inherits:
PublicSearcher
Defined in:
lib/sciruby/data/guardian.rb

Overview

World Government Data from the Guardian.

Defined Under Namespace

Classes: DatasetInfo

Constant Summary

QUERY_DOMAIN =
%q{www.guardian.co.uk}
QUERY_PATH =
%q{/world-government-data/search.json}
FOUR_OH_FOUR_MESSAGE =
'404 Page not found'
ALLOWED_FORMATS =
[:csv, :excel]

Instance Attribute Summary

Attributes inherited from PublicSearcher

#search_result

Instance Method Summary

Methods inherited from PublicSearcher

#download_dataset, #search

Constructor Details

#initialize(args = {}) ⇒ Guardian

Search the site or database using some set of parameters.

Redefine this method if you want to require certain parameters, or if some parameters depend on others. Ultimately, you call `search_internal(params)`.

Arguments

  • q: keywords (default: '', if no other parameters are supplied)

  • facet_country: country code abbreviation to search

  • facet_source_title: e.g., for data from the Australian government, this would be data.nsw.org.au

  • facet_format: e.g., csv, excel, xml, shapefile, kml
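
As a rough illustration of how these arguments could map onto a search request, the sketch below builds a query URL from the QUERY_DOMAIN and QUERY_PATH constants listed above. The `build_query_url` helper, the plain `http` scheme, and the parameter encoding are assumptions made for illustration only; the real request is issued by the inherited #search, whose internals are not shown on this page.

```ruby
require 'uri'

QUERY_DOMAIN = 'www.guardian.co.uk'
QUERY_PATH   = '/world-government-data/search.json'

# Hypothetical helper (not part of SciRuby): turn an args hash into a URL.
def build_query_url(args = {})
  params = { q: '' }.merge(args) # q falls back to an empty string
  URI::HTTP.build(
    host:  QUERY_DOMAIN,
    path:  QUERY_PATH,
    query: URI.encode_www_form(params)
  ).to_s
end

url = build_query_url(q: 'health', facet_format: 'csv')
puts url
```

The same hash could presumably be passed straight to the constructor, e.g. `Guardian.new(q: 'health', facet_format: 'csv')`, since #initialize forwards its args to #search.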



# File 'lib/sciruby/data/guardian.rb', line 31

def initialize args={}
  #args[:facet_format] ||= :csv
  #@require_format ||= args[:facet_format] # This should be removed when we can interpret other formats.

  @search_result = search(args)
end

Instance Method Details

#dataset(source_id) ⇒ Object

Download a specific dataset by source_id and cache it in the searcher. Returns a Statsample::Dataset.

If this raises an exception, you can try this:

links = raw_dataset_links_cached(source_id)

Then, for each link, call `raw_dataset(source_id, link)` to see what was actually downloaded. This is useful for debugging: did the page move? Is something wrong with Ruby's CSV interpreter? Or is the data in some other format altogether?

Right now, this function only handles CSV. TODO: Add more format handlers!
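
The debugging loop described above might look like the sketch below. To keep it runnable offline, it uses a stand-in `StubSearcher` class: the method names `raw_dataset_links_cached` and `raw_dataset` come from the text, but their bodies, the return types (plain strings), and the `inspect_links` helper are assumptions, not the library's behavior.

```ruby
# Stand-in for a Guardian searcher so the loop can run without a network.
# Method names follow the text above; the bodies are invented.
class StubSearcher
  def raw_dataset_links_cached(source_id)
    ["http://example.org/#{source_id}.csv", "http://example.org/#{source_id}.xls"]
  end

  def raw_dataset(source_id, link)
    link.end_with?('.csv') ? "year,value\n2010,1\n" : '<html>404 Page not found</html>'
  end
end

# Hypothetical helper: pair each link with the first line of its payload.
def inspect_links(searcher, source_id)
  searcher.raw_dataset_links_cached(source_id).map do |link|
    raw = searcher.raw_dataset(source_id, link)
    [link, raw.to_s.lines.first.to_s.strip]
  end
end

inspect_links(StubSearcher.new, 'abc123').each do |link, head|
  puts "#{link}: #{head}"
end
```

Looking only at the first line of each payload is usually enough to tell a CSV header from an error page such as the one named in FOUR_OH_FOUR_MESSAGE.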



# File 'lib/sciruby/data/guardian.rb', line 61

def dataset source_id
  @dataset ||= {}
  @dataset[source_id] ||= begin # Datasets are stored by source ID
    pos = 0
    datasets[source_id].download_links.each do |link_info|

      unless ALLOWED_FORMATS.include?(link_info.format)
        pos += 1
        next # Format is incorrect.
      end

      # Format appears to be correct, prior to actually downloading. Proceed.

      # Attempt to read the cached one first, and if that fails, try downloading.
      raw = cached_dataset(source_id) || download_dataset(link_info.link)
      
      begin
        ds  = parse_dataset link_info.format, raw, datasets[source_id].title
        cache_dataset(source_id, raw, link_info.format)
      rescue TypeError => e
        if pos == datasets[source_id].download_links.size - 1
          raise DatasetNotFoundError.new(e)
        end
      ensure
        pos += 1
      end

      return ds unless ds.nil?

    end
  end
end
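
One detail of the listing above is worth noting: `return ds` sits inside the `begin` block whose value feeds `@dataset[source_id] ||=`. In Ruby, `return` leaves the method immediately, so the assignment does not run on that path. The sketch below uses a hypothetical `MemoDemo` class, unrelated to the library, to show the difference; it is an observation about Ruby semantics, not a statement about how the library is intended to behave.

```ruby
class MemoDemo
  attr_reader :cache

  def initialize
    @cache = {}
  end

  # `return` exits the method before `||=` can store anything.
  def with_return(key)
    @cache[key] ||= begin
      [1, 2, 3].each { |v| return v * 10 if v == 2 }
      nil
    end
  end

  # Letting the block evaluate to the result allows the assignment to run.
  def with_value(key)
    @cache[key] ||= begin
      hit = [1, 2, 3].find { |v| v == 2 }
      hit && hit * 10
    end
  end
end

demo = MemoDemo.new
demo.with_return(:a) # returns 20, but nothing is cached under :a
demo.with_value(:b)  # returns 20 and caches it under :b
p demo.cache
```

Both methods return the same value, but only the second leaves an entry in the cache.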

#datasets ⇒ Object

Returns dataset metadata found in the search, keyed by source_id. Call `datasets.keys` for a list of source_ids.



# File 'lib/sciruby/data/guardian.rb', line 40

def datasets
  @datasets ||= begin
    h = {}
    search_result["results"].each do |res|
      h[res['source_id']] = DatasetInfo.new(res)
    end
    h
  end
end
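
To make the shape of the returned hash concrete, here is a self-contained sketch that indexes a made-up search result by source_id. `DatasetInfoSketch` is a hypothetical stand-in for DatasetInfo, and the sample data, including the 'title' key, is invented; only the 'results' and 'source_id' keys are taken from the listing above.

```ruby
# Hypothetical stand-in for SciRuby::Data::Guardian::DatasetInfo.
DatasetInfoSketch = Struct.new(:source_id, :title)

# Invented sample shaped like search_result.
search_result = {
  'results' => [
    { 'source_id' => 'a1', 'title' => 'Population' },
    { 'source_id' => 'b2', 'title' => 'Rainfall' }
  ]
}

# Build a hash of dataset metadata keyed by source_id.
def index_by_source_id(search_result)
  search_result['results'].each_with_object({}) do |res, h|
    h[res['source_id']] = DatasetInfoSketch.new(res['source_id'], res['title'])
  end
end

datasets = index_by_source_id(search_result)
p datasets.keys # prints ["a1", "b2"]
```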