Class: Wgit::Database

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/database/database.rb

Overview

Class providing a DB connection and CRUD operations for the Url and Document collections.

Constant Summary

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::NON_ENUMERABLE_MSG

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(connection_string = nil) ⇒ Database

Initializes a connected database client using the provided connection_string or ENV['WGIT_CONNECTION_STRING'].

Parameters:

  • connection_string (String) (defaults to: nil)

    The connection string needed to connect to the database.

Raises:

  • (StandardError)

    If a connection string isn't provided, either as a parameter or via the environment.


# File 'lib/wgit/database/database.rb', line 30

def initialize(connection_string = nil)
  connection_string ||= ENV['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
  unless connection_string

  @client = Database.establish_connection(connection_string)
  @connection_string = connection_string
end
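
The fallback order described above can be sketched without a live database. The helper below is a hypothetical stand-in (resolve_connection_string is not part of Wgit) that mirrors only the parameter-then-environment lookup; the env keyword is an assumption added so the example does not depend on the real ENV.

```ruby
# Hypothetical helper, not part of Wgit: mirrors the lookup order used by
# Wgit::Database#initialize without opening a connection.
def resolve_connection_string(connection_string = nil, env: ENV)
  # An explicit argument wins; otherwise fall back to the environment.
  connection_string ||= env['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
    unless connection_string

  connection_string
end

# Illustrative values only; any MongoDB connection string would do.
env      = { 'WGIT_CONNECTION_STRING' => 'mongodb://localhost/from_env' }
explicit = resolve_connection_string('mongodb://localhost/explicit', env: env)
fallback = resolve_connection_string(env: env)
```

Here explicit holds the string passed as an argument and fallback holds the value taken from the environment hash. With neither source set, the helper raises, as the constructor does.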

Instance Attribute Details

#client ⇒ Object (readonly)

The database client object. Gets set when a connection is established.


# File 'lib/wgit/database/database.rb', line 21

def client
  @client
end

#connection_string ⇒ Object (readonly)

The connection string for the database.


# File 'lib/wgit/database/database.rb', line 18

def connection_string
  @connection_string
end

Class Method Details

.connect(connection_string = nil) ⇒ Wgit::Database

A class alias for Database.new.

Parameters:

  • connection_string (String) (defaults to: nil)

    The connection string needed to connect to the database.

Returns:

  • (Wgit::Database)

    The connected database instance.

Raises:

  • (StandardError)

    If a connection string isn't provided, either as a parameter or via the environment.


# File 'lib/wgit/database/database.rb', line 46

def self.connect(connection_string = nil)
  new(connection_string)
end

.establish_connection(connection_string) ⇒ Mongo::Client

Initializes a connected database client using the connection string.

Parameters:

  • connection_string (String)

    The connection string needed to connect to the database.

Returns:

  • (Mongo::Client)

    The connected MongoDB client.

Raises:

  • (StandardError)

    If a connection cannot be established.


# File 'lib/wgit/database/database.rb', line 56

def self.establish_connection(connection_string)
  # Only log for error (and more severe) scenarios.
  Mongo::Logger.logger          = Wgit.logger.clone
  Mongo::Logger.logger.progname = 'mongo'
  Mongo::Logger.logger.level    = Logger::ERROR

  # Connects to the database here.
  Mongo::Client.new(connection_string)
end
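
The logger setup above relies on cloning so that the driver's settings do not leak back into the application logger. A minimal sketch of that pattern using only the standard library follows; app_logger is an assumed stand-in for Wgit.logger, and no Mongo classes are involved.

```ruby
require 'logger'

# Stand-in for Wgit.logger (an assumption; any Logger instance works here).
app_logger = Logger.new($stdout)
app_logger.level = Logger::INFO

# Clone first so that the changes below stay local to the copy.
driver_logger          = app_logger.clone
driver_logger.progname = 'mongo'
driver_logger.level    = Logger::ERROR
```

After this runs, app_logger still logs at INFO while driver_logger only emits ERROR and above, tagged with the 'mongo' program name.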

Instance Method Details

#clear_db ⇒ Integer

Deletes everything in the urls and documents collections. This empties the entire database, so use it with care.

Returns:

  • (Integer)

    The number of deleted records.


# File 'lib/wgit/database/database.rb', line 323

def clear_db
  clear_urls + clear_docs
end

#clear_docs ⇒ Integer

Deletes everything in the documents collection.

Returns:

  • (Integer)

    The number of deleted records.


# File 'lib/wgit/database/database.rb', line 315

def clear_docs
  @client[:documents].delete_many({}).n
end

#clear_urls ⇒ Integer

Deletes everything in the urls collection.

Returns:

  • (Integer)

    The number of deleted records.


# File 'lib/wgit/database/database.rb', line 308

def clear_urls
  @client[:urls].delete_many({}).n
end

#crawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>

Returns Url records that have been crawled.

Parameters:

  • limit (Integer) (defaults to: 0)

    The max number of Urls to return. 0 returns all.

  • skip (Integer) (defaults to: 0)

    The number of Urls to skip.

Yields:

  • (url)

    Given each Url object (Wgit::Url) returned from the DB.

Returns:

  • (Array<Wgit::Url>)

    The crawled Urls obtained from the DB.


# File 'lib/wgit/database/database.rb', line 122

def crawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: true, limit: limit, skip: skip, &block)
end

#doc?(doc) ⇒ Boolean Also known as: document?

Returns whether or not a record with the given doc 'url.url' field (which is unique) exists in the database's 'documents' collection.

Parameters:

  • doc (Wgit::Document)

    The Document to search the DB for.

Returns:

  • (Boolean)

    True if doc exists, otherwise false.


# File 'lib/wgit/database/database.rb', line 278

def doc?(doc)
  assert_type(doc, Wgit::Document)
  hash = { 'url.url' => doc.url }
  @client[:documents].find(hash).any?
end

#insert(data) ⇒ Object

Insert one or more Url or Document objects into the DB.

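Parameters:

  • data (Wgit::Url, Wgit::Document, Enumerable<Wgit::Url>, Enumerable<Wgit::Document>)

    The record or records to insert.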

Raises:

  • (StandardError)

    If data isn't valid.


# File 'lib/wgit/database/database.rb', line 74

def insert(data)
  data = data.dup # Avoid modifying by reference.
  type = data.is_a?(Enumerable) ? data.first : data

  case type
  when Wgit::Url
    insert_urls(data)
  when Wgit::Document
    insert_docs(data)
  else
    raise "Unsupported type - #{data.class}: #{data}"
  end
end
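
The dispatch above keys off the first element when given a collection, and off the object itself otherwise. The sketch below reproduces just that routing decision; FakeUrl, FakeDocument and target_collection are hypothetical names introduced for illustration and are not part of Wgit.

```ruby
# Hypothetical stand-ins for Wgit::Url (a String subclass) and Wgit::Document.
class FakeUrl < String; end
class FakeDocument; end

# Returns the collection name a record, or a batch of records, is routed to.
def target_collection(data)
  # For a batch, only the first element decides the route.
  type = data.is_a?(Enumerable) ? data.first : data

  case type
  when FakeUrl      then :urls
  when FakeDocument then :documents
  else raise "Unsupported type - #{data.class}: #{data}"
  end
end

single = target_collection(FakeUrl.new('http://example.com'))
batch  = target_collection([FakeDocument.new, FakeDocument.new])
```

Because only the first element is inspected at this stage, a mixed batch is routed by its first item; in the source, insert_urls and insert_docs then assert the type of every element.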

#insert_docs(data) ⇒ Integer (protected) Also known as: insert_doc

Insert one or more Document objects into the DB.

Parameters:

  • data (Wgit::Document, Array<Wgit::Document>)

    The Document or Documents to insert.

Returns:

  • (Integer)

    The number of inserted Documents.

Raises:

  • (StandardError)

    If data type isn't supported.


# File 'lib/wgit/database/database.rb', line 352

def insert_docs(data)
  if data.respond_to?(:map!)
    assert_arr_type(data, Wgit::Document)
    data.map! { |doc| Wgit::Model.document(doc) }
  else
    assert_types(data, Wgit::Document)
    data = Wgit::Model.document(data)
  end

  create(:documents, data)
end

#insert_urls(data) ⇒ Integer (protected) Also known as: insert_url

Insert one or more Url objects into the DB.

Parameters:

  • data (Wgit::Url, Array<Wgit::Url>)

    The Url or Urls to insert.

Returns:

  • (Integer)

    The number of inserted Urls.

Raises:

  • (StandardError)

    If data type isn't supported.


# File 'lib/wgit/database/database.rb', line 334

def insert_urls(data)
  if data.respond_to?(:map!)
    assert_arr_type(data, Wgit::Url)
    data.map! { |url| Wgit::Model.url(url) }
  else
    assert_type(data, Wgit::Url)
    data = Wgit::Model.url(data)
  end

  create(:urls, data)
end

#num_docs ⇒ Integer

Returns the total number of Document records in the DB.

Returns:

  • (Integer)

    The current number of Document records.


# File 'lib/wgit/database/database.rb', line 251

def num_docs
  @client[:documents].count
end

#num_records ⇒ Integer Also known as: num_objects

Returns the total number of records (urls + docs) in the DB.

Returns:

  • (Integer)

    The current number of URL and Document records.


# File 'lib/wgit/database/database.rb', line 258

def num_records
  num_urls + num_docs
end

#num_urls ⇒ Integer

Returns the total number of URL records in the DB.

Returns:

  • (Integer)

    The current number of URL records.


# File 'lib/wgit/database/database.rb', line 244

def num_urls
  @client[:urls].count
end

#search(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>

Searches the database's Documents for the given query.

The searched fields are decided by the text index set up on the documents collection. By default, the following fields are searched: "author", "keywords", "title" and "text".

The MongoDB search algorithm ranks/sorts the results in order (highest first) based on each document's "textScore" (which records the number of query hits). The "textScore" is then stored in each Document result object for use elsewhere if needed; accessed via Wgit::Document#score.

Parameters:

  • query (String)

    The text query to search with.

  • case_sensitive (Boolean) (defaults to: false)

    Whether character case must match.

  • whole_sentence (Boolean) (defaults to: true)

    Whether multiple words should be searched for as one phrase rather than separately.

  • limit (Integer) (defaults to: 10)

    The max number of results to return.

  • skip (Integer) (defaults to: 0)

    The number of results to skip.

Yields:

  • (doc)

    Given each search result (Wgit::Document) returned from the DB.

Returns:

  • (Array<Wgit::Document>)

    The search results obtained from the DB.


# File 'lib/wgit/database/database.rb', line 156

def search(
  query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
)
  query = query.to_s.strip
  query.replace('"' + query + '"') if whole_sentence

  # Sort based on the most search hits (aka "textScore").
  # We use the sort_proj hash as both a sort and a projection below.
  sort_proj = { score: { :$meta => 'textScore' } }
  query = { :$text => {
    :$search => query,
    :$caseSensitive => case_sensitive
  } }

  results = retrieve(:documents, query,
                     sort: sort_proj, projection: sort_proj,
                     limit: limit, skip: skip)
  return [] if results.count < 1 # respond_to? :empty? == false

  # results.respond_to? :map! is false so we use map and overwrite the var.
  results = results.map do |mongo_doc|
    doc = Wgit::Document.new(mongo_doc)
    yield(doc) if block_given?
    doc
  end

  results
end
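
To make the query construction above easier to inspect, the sketch below builds the same filter and sort/projection hashes without touching a database. build_text_search is a hypothetical helper name, not a Wgit method.

```ruby
# Hypothetical helper, not part of Wgit: builds the filter and the shared
# sort/projection hash in the same way as the method body above.
def build_text_search(query, case_sensitive: false, whole_sentence: true)
  query = query.to_s.strip
  # Wrapping the query in double quotes asks MongoDB for a phrase match.
  query = "\"#{query}\"" if whole_sentence

  filter = { :$text => {
    :$search => query,
    :$caseSensitive => case_sensitive
  } }
  # Used as both the sort and the projection so each result carries its score.
  sort_proj = { score: { :$meta => 'textScore' } }

  [filter, sort_proj]
end

filter, sort_proj = build_text_search('  ruby gems ')
```

With the defaults, the search term becomes the quoted phrase "ruby gems"; passing whole_sentence: false leaves the words unquoted so they are matched individually.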

#search!(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>

Searches the database's Documents for the given query and then searches each result in turn using doc.search!. This method is therefore the equivalent of calling Wgit::Database#search and then Wgit::Document#search! in turn. See their documentation for more info.

Parameters:

  • query (String)

    The text query to search with.

  • case_sensitive (Boolean) (defaults to: false)

    Whether character case must match.

  • whole_sentence (Boolean) (defaults to: true)

    Whether multiple words should be searched for as one phrase rather than separately.

  • limit (Integer) (defaults to: 10)

    The max number of results to return.

  • skip (Integer) (defaults to: 0)

    The number of results to skip.

  • sentence_limit (Integer) (defaults to: 80)

    The max length of each search result sentence.

Yields:

  • (doc)

    Given each search result (Wgit::Document) returned from the DB having called doc.search!(query).

Returns:

  • (Array<Wgit::Document>)

    The search results obtained from the DB having called doc.search!(query).


# File 'lib/wgit/database/database.rb', line 202

def search!(
  query, case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80
)
  results = search(
    query,
    case_sensitive: case_sensitive,
    whole_sentence: whole_sentence,
    limit: limit,
    skip: skip
  )

  results.each do |doc|
    doc.search!(
      query,
      case_sensitive: case_sensitive,
      whole_sentence: whole_sentence,
      sentence_limit: sentence_limit
    )
    yield(doc) if block_given?
  end

  results
end

#size ⇒ Integer Also known as: count, length

Returns the current size of the database.

Returns:

  • (Integer)

    The current size of the DB.


# File 'lib/wgit/database/database.rb', line 237

def size
  stats[:dataSize]
end

#stats ⇒ BSON::Document#[]#fetch

Returns statistics about the database.

Returns:

  • (BSON::Document#[]#fetch)

    Similar to a Hash instance.


# File 'lib/wgit/database/database.rb', line 230

def stats
  @client.command(dbStats: 0).documents[0]
end

#uncrawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>

Returns Url records that haven't yet been crawled.

Parameters:

  • limit (Integer) (defaults to: 0)

    The max number of Urls to return. 0 returns all.

  • skip (Integer) (defaults to: 0)

    The number of Urls to skip.

Yields:

  • (url)

    Given each Url object (Wgit::Url) returned from the DB.

Returns:

  • (Array<Wgit::Url>)

    The uncrawled Urls obtained from the DB.


# File 'lib/wgit/database/database.rb', line 132

def uncrawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: false, limit: limit, skip: skip, &block)
end

#update(data) ⇒ Object

Update a Url or Document object in the DB.

Parameters:

  • data (Wgit::Url, Wgit::Document)

    The record to update.

Raises:

  • (StandardError)

    If the data is not valid.


# File 'lib/wgit/database/database.rb', line 290

def update(data)
  data = data.dup # Avoid modifying by reference.

  case data
  when Wgit::Url
    update_url(data)
  when Wgit::Document
    update_doc(data)
  else
    raise "Unsupported type - #{data.class}: #{data}"
  end
end

#update_doc(doc) ⇒ Integer (protected)

Update a Document record in the DB.

Parameters:

  • doc (Wgit::Document)

    The Document to update.

Returns:

  • (Integer)

    The number of updated records.


# File 'lib/wgit/database/database.rb', line 380

def update_doc(doc)
  assert_type(doc, Wgit::Document)
  selection = { 'url.url' => doc.url }
  doc_hash = Wgit::Model.document(doc).merge(Wgit::Model.common_update_data)
  update = { '$set' => doc_hash }
  mutate(true, :documents, selection, update)
end

#update_url(url) ⇒ Integer (protected)

Update a Url record in the DB.

Parameters:

  • url (Wgit::Url)

    The Url to update.

Returns:

  • (Integer)

    The number of updated records.


# File 'lib/wgit/database/database.rb', line 368

def update_url(url)
  assert_type(url, Wgit::Url)
  selection = { url: url }
  url_hash = Wgit::Model.url(url).merge(Wgit::Model.common_update_data)
  update = { '$set' => url_hash }
  mutate(true, :urls, selection, update)
end
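
Both update methods pair a selection hash with a '$set' update document. The sketch below shows how those two hashes are assembled; build_set_update is a hypothetical helper, and the contents of model_hash and common_data are assumed shapes, since the real output of Wgit::Model.url and Wgit::Model.common_update_data is not shown here.

```ruby
# Hypothetical helper, not part of Wgit: pairs a selection with a $set update.
def build_set_update(selection, model_hash, common_data)
  # merge returns a new hash, so model_hash itself is left untouched.
  [selection, { '$set' => model_hash.merge(common_data) }]
end

# Assumed shapes, for illustration only.
model_hash  = { 'url' => 'http://example.com', 'crawled' => true }
common_data = { 'date_modified' => Time.now }

selection, update = build_set_update(
  { url: 'http://example.com' }, model_hash, common_data
)
```

In MongoDB, '$set' changes only the listed fields, so any field absent from the merged hash keeps its stored value.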

#url?(url) ⇒ Boolean

Returns whether or not a record with the given 'url' field (which is unique) exists in the database's 'urls' collection.

Parameters:

  • url (Wgit::Url)

    The Url to search the DB for.

Returns:

  • (Boolean)

    True if url exists, otherwise false.


# File 'lib/wgit/database/database.rb', line 267

def url?(url)
  assert_type(url, String) # This includes Wgit::Url's.
  hash = { 'url' => url }
  @client[:urls].find(hash).any?
end

#urls(crawled: nil, limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>

Returns Url records from the DB.

All Urls are sorted by date_added ascending, in other words the first url returned is the first one that was inserted into the DB.

Parameters:

  • crawled (Boolean) (defaults to: nil)

    Filter by Url#crawled value. nil returns all.

  • limit (Integer) (defaults to: 0)

    The max number of Urls to return. 0 returns all.

  • skip (Integer) (defaults to: 0)

    The number of Urls to skip.

Yields:

  • (url)

    Given each Url object (Wgit::Url) returned from the DB.

Returns:

  • (Array<Wgit::Url>)

    The Urls obtained from the DB.


# File 'lib/wgit/database/database.rb', line 100

def urls(crawled: nil, limit: 0, skip: 0)
  query = crawled.nil? ? {} : { crawled: crawled }
  sort = { date_added: 1 }

  results = retrieve(:urls, query,
                     sort: sort, projection: {},
                     limit: limit, skip: skip)
  return [] if results.count < 1 # results#empty? doesn't exist.

  # results.respond_to? :map! is false so we use map and overwrite the var.
  results = results.map { |url_doc| Wgit::Url.new(url_doc) }
  results.each { |url| yield(url) } if block_given?

  results
end
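
The filter, ordering, skip and limit semantics described for this method can be tried out against plain Ruby data. The sketch below is an in-memory approximation only: select_urls and the records array are hypothetical, and integers stand in for date_added timestamps.

```ruby
# Hypothetical in-memory records standing in for the urls collection.
records = [
  { url: 'http://c.example', crawled: false, date_added: 3 },
  { url: 'http://a.example', crawled: true,  date_added: 1 },
  { url: 'http://b.example', crawled: false, date_added: 2 }
]

# Mimics the filter, sort, skip and limit behaviour described above.
def select_urls(records, crawled: nil, limit: 0, skip: 0)
  # nil means "no filter"; true or false filters on the crawled flag.
  matches = crawled.nil? ? records : records.select { |r| r[:crawled] == crawled }
  matches = matches.sort_by { |r| r[:date_added] }.drop(skip)
  matches = matches.first(limit) if limit.positive? # 0 means "no limit"

  urls = matches.map { |r| r[:url] }
  urls.each { |url| yield(url) } if block_given?
  urls
end

all_urls  = select_urls(records)
uncrawled = select_urls(records, crawled: false)
```

Here all_urls is ordered oldest first regardless of insertion order in the array, and uncrawled contains only the two records whose crawled flag is false.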