Class: Wgit::Indexer

Inherits: Object
Defined in:
lib/wgit/indexer.rb

Overview

Class which crawls URLs and saves the resulting Documents to a database. Can be thought of as a combination of Wgit::Crawler and Wgit::Database.

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new) ⇒ Indexer

Initialize the Indexer.

Parameters:

  • database (Wgit::Database) (defaults to: Wgit::Database.new)

    The database instance (already initialized and connected) used to index.

  • crawler (Wgit::Crawler) (defaults to: Wgit::Crawler.new)

    The crawler instance used to index.



# File 'lib/wgit/indexer.rb', line 21

def initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new)
  @db      = database
  @crawler = crawler
end
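
For example, a minimal initialization sketch; it assumes a MongoDB instance is reachable and the connection details are already configured for Wgit::Database (e.g. via an environment variable, depending on your Wgit version):

require 'wgit'

db      = Wgit::Database.new # assumes a configured, connected database
crawler = Wgit::Crawler.new
indexer = Wgit::Indexer.new(db, crawler)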

Instance Attribute Details

#crawler ⇒ Object (readonly)

The crawler used to index the WWW.



# File 'lib/wgit/indexer.rb', line 11

def crawler
  @crawler
end

#db ⇒ Object (readonly) Also known as: database

The database instance used to store Urls and Documents.



# File 'lib/wgit/indexer.rb', line 14

def db
  @db
end

Instance Method Details

#index_site(url, insert_externals: false, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer Also known as: index_r

Crawls a single website's pages and stores them into the database. There is no max download limit, so be careful which sites you index. Logs info on the crawl using Wgit.logger as it goes along.

Parameters:

  • url (Wgit::Url)

    The base Url of the website to crawl.

  • insert_externals (Boolean) (defaults to: false)

    Whether or not to insert the website's external Urls into the database.

  • follow (String) (defaults to: :default)

    The xpath used to extract the links to be followed during the crawl. This changes how a site is crawled. Only links pointing to the site's domain are followed. The :default is any <a> href returning HTML.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the followed links, selecting them only if their path matches (via File.fnmatch?) one of the allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the followed links, rejecting them if their path matches (via File.fnmatch?) one of the disallow_paths.

Yields:

  • (doc)

    Given the Wgit::Document of each crawled web page before it's inserted into the database, allowing for prior manipulation. Return nil or false from the block to prevent the document from being saved into the database.

Returns:

  • (Integer)

    The total number of webpages/documents indexed.



# File 'lib/wgit/indexer.rb', line 112

def index_site(
  url, insert_externals: false, follow: :default,
  allow_paths: nil, disallow_paths: nil
)
  crawl_opts = {
    follow: follow,
    allow_paths: allow_paths,
    disallow_paths: disallow_paths
  }
  total_pages_indexed = 0

  ext_urls = @crawler.crawl_site(url, **crawl_opts) do |doc|
    result = block_given? ? yield(doc) : true

    if result && !doc.empty?
      write_doc_to_db(doc)
      total_pages_indexed += 1
    end
  end

  @db.upsert(url)

  if insert_externals && ext_urls
    num_inserted_urls = write_urls_to_db(ext_urls)
    Wgit.logger.info("Found and saved #{num_inserted_urls} external url(s)")
  end

  Wgit.logger.info("Crawled and indexed #{total_pages_indexed} documents \
for the site: #{url}")

  total_pages_indexed
end
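
For example, a usage sketch; indexer is assumed to be an initialized Wgit::Indexer and the url and block logic are illustrative only:

url = Wgit::Url.new('https://example.com')

total = indexer.index_site(url, insert_externals: true) do |doc|
  # Returning nil/false prevents this particular doc from being saved.
  !doc.title.nil?
end

Wgit.logger.info("Indexed #{total} page(s) for #{url}")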

#index_url(url, insert_externals: false) {|doc| ... } ⇒ Object

Crawls a single webpage and stores it into the database. There is no max download limit, so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along.

Parameters:

  • url (Wgit::Url)

    The webpage Url to crawl.

  • insert_externals (Boolean) (defaults to: false)

    Whether or not to insert the webpage's external Urls into the database.

Yields:

  • (doc)

    Given the Wgit::Document of the crawled webpage, before it's inserted into the database, allowing for prior manipulation. Return nil or false from the block to prevent the document from being saved into the database.



# File 'lib/wgit/indexer.rb', line 177

def index_url(url, insert_externals: false)
  document = @crawler.crawl_url(url) do |doc|
    result = block_given? ? yield(doc) : true
    write_doc_to_db(doc) if result && !doc.empty?
  end

  @db.upsert(url)

  ext_urls = document&.external_links
  if insert_externals && ext_urls
    num_inserted_urls = write_urls_to_db(ext_urls)
    Wgit.logger.info("Found and saved #{num_inserted_urls} external url(s)")
  end

  nil
end
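
A usage sketch, assuming indexer is an initialized Wgit::Indexer; the url is illustrative only:

indexer.index_url(Wgit::Url.new('https://example.com/about')) do |doc|
  # Only save docs containing some text; nil/false skips the DB write.
  !doc.text.empty?
end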

#index_urls(*urls, insert_externals: false) {|doc| ... } ⇒ Object Also known as: index

Crawls one or more webpages and stores them into the database. There is no max download limit, so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along.

Parameters:

  • urls (*Wgit::Url)

    The webpage Urls to crawl.

  • insert_externals (Boolean) (defaults to: false)

    Whether or not to insert the webpages' external Urls into the database.

Yields:

  • (doc)

    Given the Wgit::Document of the crawled webpage, before it's inserted into the database, allowing for prior manipulation. Return nil or false from the block to prevent the document from being saved into the database.

Raises:

  • (StandardError)

    if no urls are provided.



# File 'lib/wgit/indexer.rb', line 157

def index_urls(*urls, insert_externals: false, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { insert_externals: insert_externals }
  Wgit::Utils.each(urls) { |url| index_url(url, **opts, &block) }

  nil
end
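
A usage sketch, assuming indexer is an initialized Wgit::Indexer; the urls are illustrative only:

urls = [
  Wgit::Url.new('https://example.com'),
  Wgit::Url.new('https://example.org')
]

indexer.index_urls(*urls, insert_externals: false) do |doc|
  Wgit.logger.info("Crawled #{doc.url}")
end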

#index_www(max_sites: -1, max_data: 1_048_576_000) ⇒ Object

Retrieves uncrawled urls from the database and recursively crawls each site, storing their internal pages into the database and adding their external urls to be crawled later on. Logs info on the crawl using Wgit.logger as it goes along.

Parameters:

  • max_sites (Integer) (defaults to: -1)

    The number of separate and whole websites to be crawled before the method exits. Defaults to -1, which means the crawl will continue until manually stopped (Ctrl+C etc.).

  • max_data (Integer) (defaults to: 1_048_576_000)

    The maximum number of bytes that will be scraped from the web (default is 1GB). Note that this value is used to determine when to stop crawling; it's not a guarantee of the max data that will be obtained.



# File 'lib/wgit/indexer.rb', line 38

def index_www(max_sites: -1, max_data: 1_048_576_000)
  if max_sites.negative?
    Wgit.logger.info("Indexing until the database has been filled or it \
runs out of urls to crawl (which might be never).")
  end
  site_count = 0

  while keep_crawling?(site_count, max_sites, max_data)
    Wgit.logger.info("Current database size: #{@db.size}")

    uncrawled_urls = @db.uncrawled_urls(limit: 100)

    if uncrawled_urls.empty?
      Wgit.logger.info('No urls to crawl, exiting.')

      return
    end
    Wgit.logger.info("Starting crawl loop for: #{uncrawled_urls}")

    docs_count = 0
    urls_count = 0

    uncrawled_urls.each do |url|
      unless keep_crawling?(site_count, max_sites, max_data)
        Wgit.logger.info("Reached max number of sites to crawl or \
database capacity, exiting.")

        return
      end
      site_count += 1

      site_docs_count = 0
      ext_links = @crawler.crawl_site(url) do |doc|
        unless doc.empty?
          write_doc_to_db(doc)
          docs_count += 1
          site_docs_count += 1
        end
      end

      raise 'Error updating url' unless @db.update(url) == 1

      urls_count += write_urls_to_db(ext_links)
    end

    Wgit.logger.info("Crawled and indexed documents for #{docs_count} \
url(s) overall for this iteration.")
    Wgit.logger.info("Found and saved #{urls_count} external url(s) for \
the next iteration.")

    nil
  end
end
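
A usage sketch, assuming indexer is an initialized Wgit::Indexer with a database that already contains some uncrawled urls; the limits below are illustrative only:

# Stop after 10 whole sites have been crawled, or once the database
# size reaches roughly 500MB, whichever happens first.
indexer.index_www(max_sites: 10, max_data: 524_288_000)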

#keep_crawling?(site_count, max_sites, max_data) ⇒ Boolean (protected)

Returns whether or not to keep crawling based on the DB size and current loop iteration.

Parameters:

  • site_count (Integer)

    The current number of crawled sites.

  • max_sites (Integer)

    The maximum number of sites to crawl before stopping. Use -1 for an infinite number of sites.

  • max_data (Integer)

    The maximum amount of data to crawl before stopping.

Returns:

  • (Boolean)

    True if the crawl should continue, false otherwise.



# File 'lib/wgit/indexer.rb', line 205

def keep_crawling?(site_count, max_sites, max_data)
  return false if @db.size >= max_data
  return true  if max_sites.negative?

  site_count < max_sites
end
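
An illustrative sketch of the limit logic. keep_crawling? is protected, so it's shown here via send, and it queries @db.size, so a connected database whose size is under max_data is assumed:

indexer.send(:keep_crawling?, 0, 5, 1_048_576_000)  # => true  (0 < 5, DB under the byte limit)
indexer.send(:keep_crawling?, 5, 5, 1_048_576_000)  # => false (max_sites reached)
indexer.send(:keep_crawling?, 9, -1, 1_048_576_000) # => true  (-1 means no site limit)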

#write_doc_to_db(doc) ⇒ Object (protected)

Write the doc to the DB. Note that the unique url index on the documents collection deliberately prevents duplicate inserts.

Parameters:

  • doc (Wgit::Document)

    The document to write to the DB.

# File 'lib/wgit/indexer.rb', line 216

def write_doc_to_db(doc)
  if @db.upsert(doc)
    Wgit.logger.info("Saved document for url: #{doc.url}")
  else
    Wgit.logger.info("Updated document for url: #{doc.url}")
  end
end

#write_urls_to_db(urls) ⇒ Integer (protected)

Write the urls to the DB. Note that the unique url index on the urls collection deliberately prevents duplicate inserts.

Parameters:

  • urls (Array<Wgit::Url>)

    The urls to write to the DB.

Returns:

  • (Integer)

    The number of inserted urls.



# File 'lib/wgit/indexer.rb', line 229

def write_urls_to_db(urls)
  count = 0

  return count unless urls.respond_to?(:each)

  urls.each do |url|
    if url.invalid?
      Wgit.logger.info("Ignoring invalid external url: #{url}")
      next
    end

    @db.insert(url)
    count += 1

    Wgit.logger.info("Inserted external url: #{url}")
  rescue Mongo::Error::OperationFailure
    Wgit.logger.info("External url already exists: #{url}")
  end

  count
end