Class: Wgit::Database
- Inherits:
- Object
- Wgit::Database < Object
- Includes:
- Assertable
- Defined in:
- lib/wgit/database/database.rb
Overview
Class providing a DB connection and CRUD operations for the Url and Document collections.
Constant Summary
- URLS_COLLECTION =
The default name of the urls collection.
:urls
- DOCUMENTS_COLLECTION =
The default name of the documents collection.
:documents
- TEXT_INDEX =
The default name of the documents collection text search index.
'text_search'
- UNIQUE_INDEX =
The default name of the urls and documents collections unique index.
'unique_url'
- DEFAULT_TEXT_INDEX =
The documents collection default text search index. Use db.text_index = Wgit::Database::DEFAULT_TEXT_INDEX to revert changes.
{ title: 2, description: 2, keywords: 2, text: 1 }.freeze
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::NON_ENUMERABLE_MSG
Instance Attribute Summary
-
#client ⇒ Object
readonly
The database client object.
-
#connection_string ⇒ Object
readonly
The connection string for the database.
-
#last_result ⇒ Object
readonly
The raw MongoDB result of the most recent operation.
-
#text_index ⇒ Object
The documents collection text index, used to search the DB.
Class Method Summary
-
.connect(connection_string = nil) ⇒ Wgit::Database
A class alias for Database.new.
-
.establish_connection(connection_string) ⇒ Mongo::Client
Initializes a connected database client using the connection string.
Instance Method Summary
-
#clear_db ⇒ Integer
(also: #clear_db!)
Deletes everything in the urls and documents collections.
-
#clear_docs ⇒ Integer
Deletes everything in the documents collection.
-
#clear_urls ⇒ Integer
Deletes everything in the urls collection.
-
#crawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that have been crawled.
-
#create_collections ⇒ nil
Creates the 'urls' and 'documents' collections.
-
#create_unique_indexes ⇒ nil
Creates the urls and documents unique 'url' indexes.
-
#delete(obj) ⇒ Integer
Deletes a record from the database with the matching 'url' field.
-
#doc?(doc) ⇒ Boolean
Returns whether or not a record with the given doc 'url.url' field (which is unique) exists in the database's 'documents' collection.
-
#docs(limit: 0, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Returns all Document records from the DB.
-
#exists?(obj) ⇒ Boolean
Returns whether a record exists with the given obj's url.
-
#get(obj) ⇒ Wgit::Url, ...
Returns a record from the database with the matching 'url' field; or nil.
-
#initialize(connection_string = nil) ⇒ Database
constructor
Initializes a connected database client using the provided connection_string or ENV['WGIT_CONNECTION_STRING'].
-
#insert(data) ⇒ Object
Insert one or more Url or Document objects into the DB.
-
#num_docs ⇒ Integer
Returns the total number of Document records in the DB.
-
#num_records ⇒ Integer
(also: #num_objects)
Returns the total number of records (urls + docs) in the DB.
-
#num_urls ⇒ Integer
Returns the total number of URL records in the DB.
-
#search(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query.
-
#search!(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query and then searches each result in turn using doc.search!.
-
#search_text(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80, top_result_only: false) {|doc| ... } ⇒ Hash<String, String | Array<String>>
Searches the database's Documents for the given query and then searches each result in turn using doc.search.
-
#size ⇒ Integer
Returns the current size of the database.
-
#stats ⇒ BSON::Document#[]#fetch
Returns statistics about the database.
-
#uncrawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that haven't yet been crawled.
-
#update(obj) ⇒ Integer
Update a Url or Document object in the DB.
-
#upsert(obj) ⇒ Boolean
Inserts or updates the object in the database.
-
#url?(url) ⇒ Boolean
Returns whether or not a record with the given 'url' field (which is unique) exists in the database's 'urls' collection.
-
#urls(crawled: nil, limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns all Url records from the DB.
Methods included from Assertable
#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(connection_string = nil) ⇒ Database
Initializes a connected database client using the provided connection_string or ENV['WGIT_CONNECTION_STRING'].
# File 'lib/wgit/database/database.rb', line 58
def initialize(connection_string = nil)
  connection_string ||= ENV['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
    unless connection_string

  @client = Database.establish_connection(connection_string)
  @connection_string = connection_string
  @text_index = DEFAULT_TEXT_INDEX
end
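The fallback order used by the constructor can be tried in isolation. The following is a minimal sketch, not part of Wgit: `resolve_connection_string` is a hypothetical helper that mirrors only the `||=` and `raise` lines above, and it takes the environment as a parameter so it runs without a MongoDB instance. The connection strings are placeholders.

```ruby
# Hypothetical helper (not part of Wgit): mirrors how the constructor
# picks a connection string before connecting.
def resolve_connection_string(connection_string = nil, env = ENV)
  # An explicit argument wins; otherwise fall back to the environment.
  connection_string ||= env['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
    unless connection_string

  connection_string
end

# Placeholder values for illustration only.
fake_env = { 'WGIT_CONNECTION_STRING' => 'mongodb://localhost/from_env' }

explicit = resolve_connection_string('mongodb://localhost/explicit', fake_env)
fallback = resolve_connection_string(nil, fake_env)
```

With both the argument and the environment variable missing, the helper raises a RuntimeError, just as the constructor does.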
Instance Attribute Details
#client ⇒ Object (readonly)
The database client object. Gets set when a connection is established.
# File 'lib/wgit/database/database.rb', line 42
def client
  @client
end
#connection_string ⇒ Object (readonly)
The connection string for the database.
# File 'lib/wgit/database/database.rb', line 39
def connection_string
  @connection_string
end
#last_result ⇒ Object (readonly)
The raw MongoDB result of the most recent operation.
# File 'lib/wgit/database/database.rb', line 49
def last_result
  @last_result
end
#text_index ⇒ Object
The documents collection text index, used to search the DB. A custom setter method is also provided for changing the search logic.
# File 'lib/wgit/database/database.rb', line 46
def text_index
  @text_index
end
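Since the constant summary says that assigning DEFAULT_TEXT_INDEX reverts changes, the setter evidently accepts a Hash of field weights. A minimal sketch of deriving a custom weights Hash from the default follows. It assumes nothing beyond that Hash shape: the constant is redefined locally so the snippet runs without the gem, `custom_text_index` is a hypothetical helper, and the weight of 5 is arbitrary.

```ruby
# Local copy of Wgit::Database::DEFAULT_TEXT_INDEX so the snippet runs
# without the gem installed.
DEFAULT_TEXT_INDEX = { title: 2, description: 2, keywords: 2, text: 1 }.freeze

# Hypothetical helper: returns a new, unfrozen Hash with some weights changed.
# Hash#merge never mutates the receiver, so the frozen default stays intact.
def custom_text_index(overrides)
  DEFAULT_TEXT_INDEX.merge(overrides)
end

title_heavy = custom_text_index(title: 5)

# With a connected Wgit::Database (assumed, not shown here):
#   db.text_index = title_heavy
#   db.text_index = Wgit::Database::DEFAULT_TEXT_INDEX # revert
```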
Class Method Details
.connect(connection_string = nil) ⇒ Wgit::Database
A class alias for Database.new.
# File 'lib/wgit/database/database.rb', line 75
def self.connect(connection_string = nil)
  new(connection_string)
end
.establish_connection(connection_string) ⇒ Mongo::Client
Initializes a connected database client using the connection string.
# File 'lib/wgit/database/database.rb', line 85
def self.establish_connection(connection_string)
  # Only log for error (and more severe) scenarios.
  Mongo::Logger.logger = Wgit.logger.clone
  Mongo::Logger.logger.progname = 'mongo'
  Mongo::Logger.logger.level = Logger::ERROR

  # Connects to the database here.
  Mongo::Client.new(connection_string)
end
Instance Method Details
#clear_db ⇒ Integer Also known as: clear_db!
Deletes everything in the urls and documents collections. This will nuke the entire database so yeah... be careful.
# File 'lib/wgit/database/database.rb', line 540
def clear_db
  clear_urls + clear_docs
end
#clear_docs ⇒ Integer
Deletes everything in the documents collection.
# File 'lib/wgit/database/database.rb', line 529
def clear_docs
  result = @client[DOCUMENTS_COLLECTION].delete_many({})
  result.n
ensure
  @last_result = result
end
#clear_urls ⇒ Integer
Deletes everything in the urls collection.
# File 'lib/wgit/database/database.rb', line 519
def clear_urls
  result = @client[URLS_COLLECTION].delete_many({})
  result.n
ensure
  @last_result = result
end
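Both #clear_docs and #clear_urls rely on a Ruby detail: an `ensure` clause always runs after the method body, but its value does not replace the method's return value. The sketch below shows that shape with stand-in objects. `FakeResult` and `FakeStore` are hypothetical and no MongoDB is involved.

```ruby
# Stand-in for the driver's delete result; only #n (deleted count) is modelled.
FakeResult = Struct.new(:n)

# Hypothetical class mirroring the result/ensure shape of #clear_urls.
class FakeStore
  attr_reader :last_result

  def initialize(records)
    @records = records
  end

  def clear
    result = FakeResult.new(@records.size)
    @records.clear
    result.n # This is the method's return value...
  ensure
    @last_result = result # ...and this still runs, without replacing it.
  end
end

store = FakeStore.new(%w[a b c])
deleted = store.clear
```

The same pattern is why #last_result reflects the most recent operation after each call that uses it.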
#crawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that have been crawled.
# File 'lib/wgit/database/database.rb', line 251
def crawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: true, limit: limit, skip: skip, &block)
end
#create_collections ⇒ nil
Creates the 'urls' and 'documents' collections.
# File 'lib/wgit/database/database.rb', line 100
def create_collections
  @client[URLS_COLLECTION].create
  @client[DOCUMENTS_COLLECTION].create

  nil
end
#create_unique_indexes ⇒ nil
Creates the urls and documents unique 'url' indexes.
# File 'lib/wgit/database/database.rb', line 110
def create_unique_indexes
  @client[URLS_COLLECTION].indexes.create_one(
    { url: 1 }, name: UNIQUE_INDEX, unique: true
  )

  @client[DOCUMENTS_COLLECTION].indexes.create_one(
    { 'url.url' => 1 }, name: UNIQUE_INDEX, unique: true
  )

  nil
end
#delete(obj) ⇒ Integer
Deletes a record from the database with the matching 'url' field. Pass either a Wgit::Url or Wgit::Document instance.
# File 'lib/wgit/database/database.rb', line 508
def delete(obj)
  collection, query = get_type_info(obj)
  result = @client[collection].delete_one(query)
  result.n
ensure
  @last_result = result
end
#doc?(doc) ⇒ Boolean
Returns whether or not a record with the given doc 'url.url' field (which is unique) exists in the database's 'documents' collection.
# File 'lib/wgit/database/database.rb', line 455
def doc?(doc)
  assert_type(doc, Wgit::Document)
  query = { 'url.url' => doc.url }
  retrieve(DOCUMENTS_COLLECTION, query, limit: 1).any?
end
#docs(limit: 0, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Returns all Document records from the DB. Use #search to filter based on the text_index of the collection.
All Documents are sorted by date_added ascending, in other words the first doc returned is the first one that was inserted into the DB.
# File 'lib/wgit/database/database.rb', line 208
def docs(limit: 0, skip: 0)
  results = retrieve(DOCUMENTS_COLLECTION, {},
                     sort: { date_added: 1 }, limit: limit, skip: skip)
  return [] if results.count < 1 # results#empty? doesn't exist.

  # results.respond_to? :map! is false so we use map and overwrite the var.
  results = results.map { |doc_hash| Wgit::Document.new(doc_hash) }
  results.each { |doc| yield(doc) } if block_given?

  results
end
#exists?(obj) ⇒ Boolean
Returns whether a record exists with the given obj's url.
# File 'lib/wgit/database/database.rb', line 466
def exists?(obj)
  obj.is_a?(String) ? url?(obj) : doc?(obj)
end
#get(obj) ⇒ Wgit::Url, ...
Returns a record from the database with the matching 'url' field; or nil. Pass either a Wgit::Url or Wgit::Document instance.
# File 'lib/wgit/database/database.rb', line 476
def get(obj)
  collection, query = get_type_info(obj)

  record = retrieve(collection, query, limit: 1).first
  return nil unless record

  obj.class.new(record)
end
#insert(data) ⇒ Object
Insert one or more Url or Document objects into the DB.
# File 'lib/wgit/database/database.rb', line 164
def insert(data)
  data = data.dup # Avoid modifying by reference.
  collection = nil

  if data.respond_to?(:map!)
    data.map! do |obj|
      collection, _, model = get_type_info(obj)
      model
    end
  else
    collection, _, model = get_type_info(data)
    data = model
  end

  create(collection, data)
end
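#insert duplicates its argument before calling `map!`, so the caller's Array keeps its original objects while the copy is converted. A minimal sketch of that shape follows. `to_model` is a hypothetical stand-in for the conversion that the private get_type_info performs, and the Hash it returns is an assumed shape, not Wgit's real model.

```ruby
# Hypothetical stand-in for the model conversion done inside #insert.
def to_model(url)
  { url: url, crawled: false }
end

# Mirrors the dup + map! shape of #insert: Arrays are converted element by
# element, anything else is converted as a single object.
def models_for(data)
  data = data.dup # Shallow copy, so map! below cannot touch the caller's Array.

  if data.respond_to?(:map!)
    data.map! { |obj| to_model(obj) }
  else
    data = to_model(data)
  end

  data
end

urls = ['http://example.com', 'http://example.org']
models = models_for(urls)
```

After the call, `urls` still holds the two Strings and `models` holds the converted Hashes.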
#num_docs ⇒ Integer
Returns the total number of Document records in the DB.
# File 'lib/wgit/database/database.rb', line 428
def num_docs
  @client[DOCUMENTS_COLLECTION].count
end
#num_records ⇒ Integer Also known as: num_objects
Returns the total number of records (urls + docs) in the DB.
# File 'lib/wgit/database/database.rb', line 435
def num_records
  num_urls + num_docs
end
#num_urls ⇒ Integer
Returns the total number of URL records in the DB.
# File 'lib/wgit/database/database.rb', line 421
def num_urls
  @client[URLS_COLLECTION].count
end
#search(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query.
The searched fields are decided by the text index set up on the documents collection. By default (see DEFAULT_TEXT_INDEX) we search against the following fields: "title", "description", "keywords" and "text".
The MongoDB search algorithm ranks/sorts the results in order (highest first) based on each document's "textScore" (which records the number of query hits). The "textScore" is then stored in each Document result object for use elsewhere if needed; accessed via Wgit::Document#score.
# File 'lib/wgit/database/database.rb', line 285
def search(
  query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
)
  query = query.to_s.strip
  query.replace('"' + query + '"') if whole_sentence

  # Sort based on the most search hits (aka "textScore").
  # We use the sort_proj hash as both a sort and a projection below.
  sort_proj = { score: { :$meta => 'textScore' } }
  query = { :$text => {
    :$search => query,
    :$caseSensitive => case_sensitive
  } }

  results = retrieve(DOCUMENTS_COLLECTION, query,
                     sort: sort_proj, projection: sort_proj,
                     limit: limit, skip: skip)

  results.map do |mongo_doc|
    doc = Wgit::Document.new(mongo_doc)
    yield(doc) if block_given?
    doc
  end
end
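The effect of `whole_sentence` and `case_sensitive` is easiest to see in the filter Hash itself. The sketch below extracts just the query-building lines of #search into a hypothetical helper, `build_text_query`, which is not part of Wgit, so they can be inspected without a database. Wrapping the search string in double quotes is how MongoDB's `$text` operator is asked for a phrase match rather than a match on any of the individual terms.

```ruby
# Hypothetical helper (not part of Wgit): builds the same filter Hash that
# #search passes to MongoDB.
def build_text_query(query, case_sensitive: false, whole_sentence: true)
  query = query.to_s.strip
  # Wrapping the query in double quotes asks MongoDB for a phrase match.
  query = "\"#{query}\"" if whole_sentence

  { :$text => { :$search => query, :$caseSensitive => case_sensitive } }
end

phrase = build_text_query('  ruby crawler ')
terms = build_text_query('ruby crawler', whole_sentence: false)
```

Here `phrase` searches for the exact phrase, while `terms` leaves the string unquoted so either word can match.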
#search!(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query and then searches each result in turn using doc.search!. This method is therefore the equivalent of calling Wgit::Database#search and then Wgit::Document#search! in turn. See their documentation for more info.
# File 'lib/wgit/database/database.rb', line 327
def search!(
  query, case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80
)
  results = search(
    query,
    case_sensitive: case_sensitive,
    whole_sentence: whole_sentence,
    limit: limit,
    skip: skip
  )

  results.each do |doc|
    doc.search!(
      query,
      case_sensitive: case_sensitive,
      whole_sentence: whole_sentence,
      sentence_limit: sentence_limit
    )
    yield(doc) if block_given?
  end

  results
end
#search_text(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80, top_result_only: false) {|doc| ... } ⇒ Hash<String, String | Array<String>>
Searches the database's Documents for the given query and then searches each result in turn using doc.search. Instead of an Array of Documents, this method returns a Hash of each doc's url => search_results, creating a search-engine-like result set for quick access to text matches.
# File 'lib/wgit/database/database.rb', line 372
def search_text(
  query, case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80, top_result_only: false
)
  results = search(
    query,
    case_sensitive: case_sensitive,
    whole_sentence: whole_sentence,
    limit: limit,
    skip: skip
  )

  results
    .map do |doc|
      yield(doc) if block_given?

      results = doc.search(
        query,
        case_sensitive: case_sensitive,
        whole_sentence: whole_sentence,
        sentence_limit: sentence_limit
      )

      # Only return result if its text has a match - compact is called below.
      next nil if results.empty?

      [doc.url, (top_result_only ? results.first : results)]
    end
    .compact
    .to_h
end
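The shape of the returned Hash, and the effect of `top_result_only`, can be illustrated with stand-in documents. This is only a sketch: `FakeDoc` is hypothetical, its `search` is a plain substring filter rather than Wgit::Document#search, and `text_matches` reproduces only the map/compact/to_h pipeline of the method above.

```ruby
# Stand-in for Wgit::Document: only #url and #search are modelled, and
# #search here is a plain substring filter (an assumption for illustration).
FakeDoc = Struct.new(:url, :text) do
  def search(query)
    text.select { |sentence| sentence.include?(query) }
  end
end

# Mirrors the map/compact/to_h pipeline of #search_text.
def text_matches(docs, query, top_result_only: false)
  docs
    .map do |doc|
      matches = doc.search(query)
      next nil if matches.empty? # Dropped by compact below.

      [doc.url, (top_result_only ? matches.first : matches)]
    end
    .compact
    .to_h
end

docs = [
  FakeDoc.new('http://a.example', ['ruby is fun', 'ruby crawls']),
  FakeDoc.new('http://b.example', ['nothing relevant'])
]
all_hits = text_matches(docs, 'ruby')
top_hits = text_matches(docs, 'ruby', top_result_only: true)
```

Documents whose text has no match are left out of the Hash entirely. With `top_result_only: true` each value is a single String instead of an Array, which is why the documented return type is Hash<String, String | Array<String>>.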
#size ⇒ Integer
Returns the current size of the database.
# File 'lib/wgit/database/database.rb', line 414
def size
  stats[:dataSize]
end
#stats ⇒ BSON::Document#[]#fetch
Returns statistics about the database.
# File 'lib/wgit/database/database.rb', line 407
def stats
  @client.command(dbStats: 0).documents[0]
end
#uncrawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that haven't yet been crawled.
# File 'lib/wgit/database/database.rb', line 261
def uncrawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: false, limit: limit, skip: skip, &block)
end
#update(obj) ⇒ Integer
Update a Url or Document object in the DB.
# File 'lib/wgit/database/database.rb', line 492
def update(obj)
  collection, query, model = get_type_info(obj.dup)
  data_hash = model.merge(Wgit::Model.common_update_data)

  mutate(collection, query, { '$set' => data_hash })
end
#upsert(obj) ⇒ Boolean
Inserts or updates the object in the database.
# File 'lib/wgit/database/database.rb', line 185
def upsert(obj)
  collection, query, model = get_type_info(obj.dup)
  data_hash = model.merge(Wgit::Model.common_update_data)
  result = @client[collection].replace_one(query, data_hash, upsert: true)

  result.matched_count.zero?
ensure
  @last_result = result
end
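The Boolean returned by #upsert comes from `matched_count.zero?`: it is true when no existing record matched the query (the object was inserted) and false when an existing record was replaced. The sketch below reproduces that logic against a hypothetical in-memory collection. `FakeReplaceResult`, `FakeCollection` and `fake_upsert` are illustrative names only, not Wgit or driver classes.

```ruby
# Stand-in for the driver's replace result; only #matched_count is modelled.
FakeReplaceResult = Struct.new(:matched_count)

# Hypothetical in-memory collection keyed by url.
class FakeCollection
  def initialize
    @records = {}
  end

  # Replaces the record for query[:url], inserting it when absent.
  def replace_one(query, data)
    matched = @records.key?(query[:url]) ? 1 : 0
    @records[query[:url]] = data
    FakeReplaceResult.new(matched)
  end
end

# Mirrors the return value logic of #upsert.
def fake_upsert(collection, data)
  result = collection.replace_one({ url: data[:url] }, data)
  result.matched_count.zero? # true => inserted, false => replaced
end

collection = FakeCollection.new
first = fake_upsert(collection, { url: 'http://example.com', crawled: false })
second = fake_upsert(collection, { url: 'http://example.com', crawled: true })
```

The first call inserts and the second call replaces the same record, so the two results differ.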
#url?(url) ⇒ Boolean
Returns whether or not a record with the given 'url' field (which is unique) exists in the database's 'urls' collection.
# File 'lib/wgit/database/database.rb', line 444
def url?(url)
  assert_type(url, String) # This includes Wgit::Url's.
  query = { url: url }
  retrieve(URLS_COLLECTION, query, limit: 1).any?
end
#urls(crawled: nil, limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns all Url records from the DB.
All Urls are sorted by date_added ascending, in other words the first url returned is the first one that was inserted into the DB.
# File 'lib/wgit/database/database.rb', line 230
def urls(crawled: nil, limit: 0, skip: 0)
  query = crawled.nil? ? {} : { crawled: crawled }
  sort = { date_added: 1 }

  results = retrieve(URLS_COLLECTION, query,
                     sort: sort, limit: limit, skip: skip)
  return [] if results.count < 1 # results#empty? doesn't exist.

  # results.respond_to? :map! is false so we use map and overwrite the var.
  results = results.map { |url_doc| Wgit::Url.new(url_doc) }
  results.each { |url| yield(url) } if block_given?

  results
end
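The `crawled:` parameter is three-valued: nil matches every Url, while true and false each add a filter. The sketch below emulates that behaviour, together with the date_added ordering, in memory. `crawled_filter` and `select_urls` are hypothetical helpers. Treating `limit: 0` as "no limit" follows MongoDB's cursor semantics and assumes the private retrieve method passes the value straight through.

```ruby
# Hypothetical helper mirroring the first line of #urls.
def crawled_filter(crawled = nil)
  crawled.nil? ? {} : { crawled: crawled }
end

# In-memory emulation of the filter, sort, skip and limit applied by #urls.
def select_urls(records, crawled: nil, limit: 0, skip: 0)
  query = crawled_filter(crawled)
  matches = records.select { |r| query.all? { |key, value| r[key] == value } }
  matches = matches.sort_by { |r| r[:date_added] }.drop(skip)
  limit.zero? ? matches : matches.first(limit) # 0 means "no limit".
end

records = [
  { url: 'b', crawled: true, date_added: 2 },
  { url: 'a', crawled: false, date_added: 1 },
  { url: 'c', crawled: false, date_added: 3 }
]
uncrawled = select_urls(records, crawled: false)
```

Checking `crawled.nil?` rather than relying on truthiness matters here, because `crawled: false` must still produce a filter. That is what lets #uncrawled_urls delegate to #urls.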