Class: Infoboxer::MediaWiki

Inherits: Object
Defined in:
lib/infoboxer/media_wiki.rb,
lib/infoboxer/media_wiki/page.rb,
lib/infoboxer/media_wiki/traits.rb

Overview

MediaWiki client class.

Usage:

client = Infoboxer::MediaWiki.new('http://en.wikipedia.org/w/api.php', user_agent: 'My Own Project')
page = client.get('Argentina')

Consider using shortcuts like #wiki, #wikipedia, #wp and so on instead of direct instantiation of this class (though you can instantiate it directly if you want to!)
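
For illustration, the same fetch via a shortcut might look like this (a sketch, using the Infoboxer.wikipedia and Infoboxer.wp shortcuts mentioned above):

page = Infoboxer.wikipedia.get('Argentina') # or Infoboxer.wp.get('Argentina')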

Defined Under Namespace

Classes: Page, Traits

Constant Summary

UA =
  "Infoboxer/#{Infoboxer::VERSION} (https://github.com/molybdenum-99/infoboxer; [email protected])"

Default Infoboxer User-Agent header. You can set yours as an option to Infoboxer#wiki and its shortcuts, or to #initialize.


Constructor Details

#initialize(api_base_url, options = {}) ⇒ MediaWiki

Creates a new MediaWiki client. Infoboxer#wiki provides a shortcut for it, as well as shortcuts for some well-known wikis, like Infoboxer#wikipedia.

Parameters:

  • api_base_url

URL of the api.php file in your MediaWiki installation. Typically it is <domain>/w/api.php, but it can vary between wikis.

  • options (defaults to: {})

    Only one option is currently supported:

    • :user_agent (also aliased as :ua) -- custom User-Agent header, as shown in the sketch below.
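
For illustration, passing a custom User-Agent might look like this (a sketch; the URL and header value are examples):

client = Infoboxer::MediaWiki.new(
  'https://en.wikipedia.org/w/api.php',
  user_agent: 'MyResearchBot/1.0 (https://example.com/bot)' # or ua: '...'
)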


# File 'lib/infoboxer/media_wiki.rb', line 52

def initialize(api_base_url, options = {})
  @api_base_url = Addressable::URI.parse(api_base_url)
  @client = MediaWiktory::Client.new(api_base_url, user_agent: user_agent(options))
  @traits = Traits.get(@api_base_url.host, namespaces: extract_namespaces)
end

Class Attribute Details

.user_agent ⇒ Object

User agent getter/setter.

Default value is UA.

You can also use a per-instance option; see #initialize.



# File 'lib/infoboxer/media_wiki.rb', line 38

def user_agent
  @user_agent
end
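
Since this is documented as a getter/setter, overriding the default for all subsequently created clients might look like this (a sketch, assuming the matching setter that the getter/setter note implies):

Infoboxer::MediaWiki.user_agent = 'MyProject/1.0 (https://example.com)'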

Instance Attribute Details

#api_base_url ⇒ Object (readonly)

Returns the value of attribute api_base_url.



# File 'lib/infoboxer/media_wiki.rb', line 41

def api_base_url
  @api_base_url
end

#traits ⇒ Object (readonly)

Returns the value of attribute traits.



# File 'lib/infoboxer/media_wiki.rb', line 41

def traits
  @traits
end

Instance Method Details

#category(title) ⇒ Tree::Nodes<Page>

Receives a list of parsed MediaWiki pages from the specified category.

NB: currently, this API always fetches all pages in the category; there is no option to "take the first 20 pages". Pages are fetched in 50-page batches, then parsed, so for a large category it can take a while to fetch everything.

Parameters:

  • title

    Category title. You can use a namespaceless title (like "Countries in South America"), a title with a namespace (like "Category:Countries in South America"), or a title with a localized namespace (like "Catégorie:Argentine" for French Wikipedia).

Returns:

  • (Tree::Nodes<Page>)

# File 'lib/infoboxer/media_wiki.rb', line 149

def category(title)
  title = normalize_category_title(title)
  
  list(categorymembers: {title: title, limit: 50})
end
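
For illustration, reusing the client from the overview (a sketch):

pages = client.category('Category:Countries in South America')
pages.map(&:title) # titles of all pages in the category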

#get(*titles) ⇒ Tree::Nodes<Page>

Receives a list of parsed MediaWiki pages for the titles provided. All pages are fetched with a single query to the MediaWiki API.

NB: if you are requesting more than 50 titles at once (the MediaWiki limit for a single request), Infoboxer will perform as many queries as necessary to fetch them all (that is, (titles.count / 50.0).ceil requests).

Returns:

  • (Tree::Nodes<Page>)

    array of parsed pages. Notes:

    • if you call get with only one title, a single page will be returned instead of an array;
    • if some of the pages do not exist in the wiki, they will not be returned, so the resulting array can be shorter than the titles array; you can always check pages.map(&:title) to see what you've really received. This approach allows you to write absent-minded code like this:

      Infoboxer.wp.get('Argentina', 'Chile', 'Something non-existing').
        infobox.fetch('some value')

      and obtain meaningful results instead of a NoMethodError or some NotFound.



# File 'lib/infoboxer/media_wiki.rb', line 102

def get(*titles)
  pages = raw(*titles).
    tap{|pages| pages.detect(&:invalid?).tap{|i| i && fail(i.raw.invalidreason)}}.
    select(&:exists?).
    map{|raw|
      Page.new(self,
        Parser.paragraphs(raw.content, traits),
        raw)
    }
  titles.count == 1 ? pages.first : Tree::Nodes[*pages]
end
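
To illustrate the single-vs-multiple behavior (a sketch):

page  = client.get('Argentina')           # a single Page
pages = client.get('Argentina', 'Chile')  # Tree::Nodes of Pages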

#get_h(*titles) ⇒ Hash<String, Page>

Same as #get, but returns a hash of title => page.

Useful quirks:

  • when a requested page does not exist, its key will still be present in the resulting hash (the value will be nil);
  • when a requested page redirects to another, the key will still be the requested title. For example, get_h('Einstein') will return a hash with the key 'Einstein' and a page titled 'Albert Einstein'.

This allows you to stay in full control of which pages from a large list you've actually received.

Returns:

  • (Hash<String, Page>)


# File 'lib/infoboxer/media_wiki.rb', line 128

def get_h(*titles)
  pages = [*get(*titles)]
  titles.map{|t|
    [t, pages.detect{|p| p.source.alt_titles.map(&:downcase).include?(t.downcase)}]
  }.to_h
end
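
A sketch of both quirks in action, using the example titles above:

pages = client.get_h('Einstein', 'Something non-existing')
pages['Einstein'].title         # => "Albert Einstein" (redirect followed)
pages['Something non-existing'] # => nil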

#inspectObject



# File 'lib/infoboxer/media_wiki.rb', line 192

def inspect
  "#<#{self.class}(#{@api_base_url.host})>"
end

#prefixsearch(prefix) ⇒ Tree::Nodes<Page>

Receives a list of parsed MediaWiki pages whose titles start with the given prefix. See MediaWiki API docs for details.

NB: currently, this API always fetches all matching pages; there is no option to "take the first 20 pages". Pages are fetched in 50-page batches, then parsed, so for a large result set it can take a while to fetch everything.

Parameters:

  • prefix

    page title prefix.

Returns:

  • (Tree::Nodes<Page>)

# File 'lib/infoboxer/media_wiki.rb', line 188

def prefixsearch(prefix)
  list(prefixsearch: {search: prefix, limit: 100})
end
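
For illustration (a sketch):

pages = client.prefixsearch('Argentin')
pages.map(&:title) # titles starting with "Argentin"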

#raw(*titles) ⇒ Array<Hash>

Receive "raw" data from Wikipedia (without parsing or wrapping in classes).

Returns:

  • (Array<Hash>)


# File 'lib/infoboxer/media_wiki.rb', line 62

def raw(*titles)
  return [] if titles.empty? # could emerge on "automatically" created page lists, should work
  
  titles.each_slice(50).map{|part|
    @client.query.
      titles(*part).
      prop(revisions: {prop: :content}, info: {prop: :url}).
      redirects(true). # FIXME: should be done transparently by MediaWiktory?
      perform.pages
  }.inject(:concat). # somehow flatten(1) fails!
  sort_by{|page|
    res_title = page.alt_titles.detect{|t| titles.map(&:downcase).include?(t.downcase)} # FIXME?..
    titles.index(res_title) || 1_000
  }
end
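
A quick sketch of working with the raw result (the element structure comes from MediaWiktory; content is the same field that #get parses):

raw_pages = client.raw('Argentina', 'Chile')
raw_pages.first.content # raw wikitext of the first page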

#search(query) ⇒ Tree::Nodes<Page>

Receives a list of parsed MediaWiki pages for the provided search query. See MediaWiki API docs for details.

NB: currently, this API always fetches all matching pages; there is no option to "take the first 20 pages". Pages are fetched in 50-page batches, then parsed, so for a large result set it can take a while to fetch everything.

Parameters:

  • query

    Search query string.

Returns:

  • (Tree::Nodes<Page>)


# File 'lib/infoboxer/media_wiki.rb', line 171

def search(query)
  list(search: {search: query, limit: 50})
end
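
For illustration (a sketch):

pages = client.search('Argentina history')
pages.map(&:title) # titles of pages matching the query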