Class: Wgit::Database
- Inherits: Object (hierarchy: Object, Wgit::Database)
- Includes: Assertable
- Defined in: lib/wgit/database/database.rb
Overview
Class providing a DB connection and CRUD operations for the Url and Document collections.
Constant Summary
- URLS_COLLECTION =
The default name of the urls collection.
:urls
- DOCUMENTS_COLLECTION =
The default name of the documents collection.
:documents
- TEXT_INDEX =
The default name of the documents collection text search index.
'text_search'
- UNIQUE_INDEX =
The default name of the urls and documents collections unique index.
'unique_url'
- DEFAULT_TEXT_INDEX =
The documents collection default text search index. Use db.text_index = Wgit::Database::DEFAULT_TEXT_INDEX to revert changes.
{ title: 2, description: 2, keywords: 2, text: 1 }.freeze
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::NON_ENUMERABLE_MSG
Instance Attribute Summary
-
#client ⇒ Object
readonly
The database client object.
-
#connection_string ⇒ Object
readonly
The connection string for the database.
-
#last_result ⇒ Object
readonly
The raw MongoDB result of the most recent operation.
-
#text_index ⇒ Object
The documents collection text index, used to search the DB.
Class Method Summary
-
.connect(connection_string = nil) ⇒ Wgit::Database
A class alias for Database.new.
-
.establish_connection(connection_string) ⇒ Mongo::Client
Initializes a connected database client using the connection string.
Instance Method Summary
-
#bulk_upsert(objs) ⇒ Integer
Bulk upserts the objects in the database collection.
-
#clear_db ⇒ Integer
(also: #clear_db!)
Deletes everything in the urls and documents collections.
-
#clear_docs ⇒ Integer
Deletes everything in the documents collection.
-
#clear_urls ⇒ Integer
Deletes everything in the urls collection.
-
#crawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that have been crawled.
-
#create_collections ⇒ nil
Creates the 'urls' and 'documents' collections.
-
#create_unique_indexes ⇒ nil
Creates the urls and documents unique 'url' indexes.
-
#delete(obj) ⇒ Integer
Deletes a record from the database with the matching 'url' field.
-
#doc?(doc) ⇒ Boolean
Returns whether or not a record with the given doc 'url.url' field (which is unique) exists in the database's 'documents' collection.
-
#docs(limit: 0, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Returns all Document records from the DB.
-
#exists?(obj) ⇒ Boolean
Returns whether a record exists with the given obj's url.
-
#get(obj) ⇒ Wgit::Url, ...
Returns a record from the database with the matching 'url' field; or nil.
-
#initialize(connection_string = nil) ⇒ Database
constructor
Initializes a connected database client using the provided connection_string or ENV['WGIT_CONNECTION_STRING'].
-
#insert(data) ⇒ Object
Insert one or more Url or Document objects into the DB.
-
#num_docs ⇒ Integer
Returns the total number of Document records in the DB.
-
#num_records ⇒ Integer
(also: #num_objects)
Returns the total number of records (urls + docs) in the DB.
-
#num_urls ⇒ Integer
Returns the total number of URL records in the DB.
-
#search(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query.
-
#search!(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query and then searches each result in turn using doc.search!.
-
#search_text(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80, top_result_only: false) {|doc| ... } ⇒ Hash<String, String | Array<String>>
Searches the database's Documents for the given query and then searches each result in turn using doc.search.
-
#size ⇒ Integer
Returns the current size of the database.
-
#stats ⇒ BSON::Document#[]#fetch
Returns statistics about the database.
-
#uncrawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that haven't yet been crawled.
-
#update(obj) ⇒ Integer
Update a Url or Document object in the DB.
-
#upsert(obj) ⇒ Boolean
Inserts or updates the object in the database.
-
#url?(url) ⇒ Boolean
Returns whether or not a record with the given 'url' field (which is unique) exists in the database's 'urls' collection.
-
#urls(crawled: nil, limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns all Url records from the DB.
Methods included from Assertable
#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(connection_string = nil) ⇒ Database
Initializes a connected database client using the provided connection_string or ENV['WGIT_CONNECTION_STRING'].
# File 'lib/wgit/database/database.rb', line 58

def initialize(connection_string = nil)
  connection_string ||= ENV['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
  unless connection_string

  @client = Database.establish_connection(connection_string)
  @connection_string = connection_string
  @text_index = DEFAULT_TEXT_INDEX
end
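The fallback from the argument to the environment variable can be tried in isolation. The following is a minimal sketch, not the library's code: resolve_connection_string is a hypothetical helper, and the environment is passed in as a plain Hash so the example runs without MongoDB.

```ruby
# Hypothetical helper mirroring the constructor's fallback logic.
# `env` defaults to ENV but accepts any Hash-like object, which keeps the
# sketch runnable without a real environment variable or database.
def resolve_connection_string(connection_string = nil, env = ENV)
  connection_string ||= env['WGIT_CONNECTION_STRING']
  unless connection_string
    raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil"
  end

  connection_string
end
```

An explicit argument takes precedence over the environment; an error is raised only when both are nil.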
Instance Attribute Details
#client ⇒ Object (readonly)
The database client object. Gets set when a connection is established.
# File 'lib/wgit/database/database.rb', line 42

def client
  @client
end
#connection_string ⇒ Object (readonly)
The connection string for the database.
# File 'lib/wgit/database/database.rb', line 39

def connection_string
  @connection_string
end
#last_result ⇒ Object (readonly)
The raw MongoDB result of the most recent operation.
# File 'lib/wgit/database/database.rb', line 49

def last_result
  @last_result
end
#text_index ⇒ Object
The documents collection text index, used to search the DB. A custom setter method is also provided for changing the search logic.
# File 'lib/wgit/database/database.rb', line 46

def text_index
  @text_index
end
Class Method Details
.connect(connection_string = nil) ⇒ Wgit::Database
A class alias for Database.new.
# File 'lib/wgit/database/database.rb', line 75

def self.connect(connection_string = nil)
  new(connection_string)
end
.establish_connection(connection_string) ⇒ Mongo::Client
Initializes a connected database client using the connection string.
# File 'lib/wgit/database/database.rb', line 85

def self.establish_connection(connection_string)
  # Only log for error (and more severe) scenarios.
  Mongo::Logger.logger = Wgit.logger.clone
  Mongo::Logger.logger.progname = 'mongo'
  Mongo::Logger.logger.level = Logger::ERROR

  # Connects to the database here.
  Mongo::Client.new(connection_string)
end
Instance Method Details
#bulk_upsert(objs) ⇒ Integer
Bulk upserts the objects in the database collection. You cannot mix object types: all objs must be Urls or all must be Documents.
# File 'lib/wgit/database/database.rb', line 201

def bulk_upsert(objs)
  assert_arr_types(objs, [Wgit::Url, Wgit::Document])
  raise 'objs is empty' if objs.empty?

  collection = nil
  request_objs = objs.map do |obj|
    collection, query, model = get_type_info(obj)
    data_hash = model.merge(Wgit::Model.common_update_data)

    {
      update_many: {
        filter: query,
        update: { '$set' => data_hash },
        upsert: true
      }
    }
  end

  result = @client[collection].bulk_write(request_objs)
  result.upserted_count + result.modified_count
ensure
  @last_result = result
end
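The shape of the requests handed to bulk_write can be seen without a database. This is a rough sketch under stated assumptions: build_upsert_requests is a hypothetical helper, the models are plain Hashes keyed by 'url', and the filter is simplified to that single key (the library derives the real filter through get_type_info).

```ruby
# Hypothetical helper: turns plain model Hashes into the request Hashes
# that #bulk_upsert builds for bulk_write. Assumes each model has a 'url'
# key that uniquely identifies the record.
def build_upsert_requests(models, common_update_data)
  models.map do |model|
    # Merge in the shared update fields, as the method above does.
    data_hash = model.merge(common_update_data)

    {
      update_many: {
        filter: { 'url' => model['url'] },
        update: { '$set' => data_hash },
        upsert: true
      }
    }
  end
end
```

Each object becomes one update operation with upsert enabled, so records that match the filter are modified and the rest are inserted.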
#clear_db ⇒ Integer Also known as: clear_db!
Deletes everything in the urls and documents collections. This will nuke the entire database so yeah... be careful.
# File 'lib/wgit/database/database.rb', line 550

def clear_db
  clear_urls + clear_docs
end
#clear_docs ⇒ Integer
Deletes everything in the documents collection.
# File 'lib/wgit/database/database.rb', line 539

def clear_docs
  result = @client[DOCUMENTS_COLLECTION].delete_many({})
  result.n
ensure
  @last_result = result
end
#clear_urls ⇒ Integer
Deletes everything in the urls collection.
# File 'lib/wgit/database/database.rb', line 529

def clear_urls
  result = @client[URLS_COLLECTION].delete_many({})
  result.n
ensure
  @last_result = result
end
#crawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that have been crawled.
# File 'lib/wgit/database/database.rb', line 280

def crawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: true, limit:, skip:, &block)
end
#create_collections ⇒ nil
Creates the 'urls' and 'documents' collections.
# File 'lib/wgit/database/database.rb', line 100

def create_collections
  @client[URLS_COLLECTION].create
  @client[DOCUMENTS_COLLECTION].create

  nil
end
#create_unique_indexes ⇒ nil
Creates the urls and documents unique 'url' indexes.
# File 'lib/wgit/database/database.rb', line 110

def create_unique_indexes
  @client[URLS_COLLECTION].indexes.create_one(
    { url: 1 }, name: UNIQUE_INDEX, unique: true
  )

  @client[DOCUMENTS_COLLECTION].indexes.create_one(
    { 'url.url' => 1 }, name: UNIQUE_INDEX, unique: true
  )

  nil
end
#delete(obj) ⇒ Integer
Deletes a record from the database with the matching 'url' field. Pass either a Wgit::Url or Wgit::Document instance.
# File 'lib/wgit/database/database.rb', line 518

def delete(obj)
  collection, query = get_type_info(obj)
  result = @client[collection].delete_one(query)
  result.n
ensure
  @last_result = result
end
#doc?(doc) ⇒ Boolean
Returns whether or not a record with the given doc 'url.url' field (which is unique) exists in the database's 'documents' collection.
# File 'lib/wgit/database/database.rb', line 465

def doc?(doc)
  assert_type(doc, Wgit::Document)
  query = { 'url.url' => doc.url }
  retrieve(DOCUMENTS_COLLECTION, query, limit: 1).any?
end
#docs(limit: 0, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Returns all Document records from the DB. Use #search to filter based on the text_index of the collection.
All Documents are sorted by date_added ascending, in other words the first doc returned is the first one that was inserted into the DB.
# File 'lib/wgit/database/database.rb', line 238

def docs(limit: 0, skip: 0, &block)
  results = retrieve(DOCUMENTS_COLLECTION, {},
                     sort: { date_added: 1 }, limit:, skip:)
  return [] if results.count < 1 # results#empty? doesn't exist.

  # results.respond_to? :map! is false so we use map and overwrite the var.
  results = results.map { |doc_hash| Wgit::Document.new(doc_hash) }
  results.each(&block) if block_given?

  results
end
#exists?(obj) ⇒ Boolean
Returns whether a record exists with the given obj's url.
# File 'lib/wgit/database/database.rb', line 476

def exists?(obj)
  obj.is_a?(String) ? url?(obj) : doc?(obj)
end
#get(obj) ⇒ Wgit::Url, ...
Returns a record from the database with the matching 'url' field; or nil. Pass either a Wgit::Url or Wgit::Document instance.
# File 'lib/wgit/database/database.rb', line 486

def get(obj)
  collection, query = get_type_info(obj)

  record = retrieve(collection, query, limit: 1).first
  return nil unless record

  obj.class.new(record)
end
#insert(data) ⇒ Object
Insert one or more Url or Document objects into the DB.
# File 'lib/wgit/database/database.rb', line 164

def insert(data)
  collection = nil
  request_obj = nil

  if data.respond_to?(:map)
    request_obj = data.map do |obj|
      collection, _, model = get_type_info(obj)
      model
    end
  else
    collection, _, model = get_type_info(data)
    request_obj = model
  end

  create(collection, request_obj)
end
#num_docs ⇒ Integer
Returns the total number of Document records in the DB.
# File 'lib/wgit/database/database.rb', line 438

def num_docs
  @client[DOCUMENTS_COLLECTION].count
end
#num_records ⇒ Integer Also known as: num_objects
Returns the total number of records (urls + docs) in the DB.
# File 'lib/wgit/database/database.rb', line 445

def num_records
  num_urls + num_docs
end
#num_urls ⇒ Integer
Returns the total number of URL records in the DB.
# File 'lib/wgit/database/database.rb', line 431

def num_urls
  @client[URLS_COLLECTION].count
end
#search(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query.
The searched fields are determined by the text index set up on the documents collection. By default (see DEFAULT_TEXT_INDEX) we search against the following fields: "title", "description", "keywords" and "text".
The MongoDB search algorithm ranks/sorts the results in order (highest first) based on each document's "textScore" (which records the number of query hits). The "textScore" is then stored in each Document result object for use elsewhere if needed; accessed via Wgit::Document#score.
# File 'lib/wgit/database/database.rb', line 314

def search(
  query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
)
  query = query.to_s.strip
  query.replace("\"#{query}\"") if whole_sentence

  # Sort based on the most search hits (aka "textScore").
  # We use the sort_proj hash as both a sort and a projection below.
  sort_proj = { score: { :$meta => 'textScore' } }
  query = {
    :$text => { :$search => query, :$caseSensitive => case_sensitive }
  }

  results = retrieve(DOCUMENTS_COLLECTION, query,
                     sort: sort_proj, projection: sort_proj,
                     limit:, skip:)

  results.map do |mongo_doc|
    doc = Wgit::Document.new(mongo_doc)
    yield(doc) if block_given?
    doc
  end
end
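The query preparation step can be examined on its own. Below is a minimal sketch that reproduces it outside the class; build_text_query is a hypothetical name, and the sketch reassigns the string rather than calling replace, which yields the same query Hash.

```ruby
# Hypothetical helper reproducing how #search prepares its MongoDB query.
def build_text_query(query, case_sensitive: false, whole_sentence: true)
  query = query.to_s.strip
  # Wrapping the query in double quotes asks MongoDB's $text operator
  # for a phrase match instead of matching the individual words.
  query = "\"#{query}\"" if whole_sentence

  { :$text => { :$search => query, :$caseSensitive => case_sensitive } }
end
```

With whole_sentence: false the words are passed through unquoted, so MongoDB matches documents containing any of them.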
#search!(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query and then searches
each result in turn using doc.search!. This method is therefore the
equivalent of calling Wgit::Database#search and then
Wgit::Document#search! in turn. See their documentation for more info.
# File 'lib/wgit/database/database.rb', line 358

def search!(
  query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0,
  sentence_limit: 80
)
  results = search(query, case_sensitive:, whole_sentence:, limit:, skip:)

  results.each do |doc|
    doc.search!(query, case_sensitive:, whole_sentence:, sentence_limit:)
    yield(doc) if block_given?
  end

  results
end
#search_text(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80, top_result_only: false) {|doc| ... } ⇒ Hash<String, String | Array<String>>
Searches the database's Documents for the given query and then searches
each result in turn using doc.search. Instead of an Array of Documents,
this method returns a Hash of the docs url => search_results creating a
search engine like result set for quick access to text matches.
# File 'lib/wgit/database/database.rb', line 392

def search_text(
  query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0,
  sentence_limit: 80, top_result_only: false
)
  results = search(query, case_sensitive:, whole_sentence:, limit:, skip:)

  results
    .map do |doc|
      yield(doc) if block_given?

      # Only return result if its text has a match - compact is called below.
      results = doc.search(
        query, case_sensitive:, whole_sentence:, sentence_limit:
      )
      next nil if results.empty?

      [doc.url, (top_result_only ? results.first : results)]
    end
    .compact
    .to_h
end
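The url => search_results shape is easier to see with stand-in data. The sketch below is illustrative only: search_text_sketch is a hypothetical name, the docs are plain Hashes rather than Wgit::Document instances, and the per-document search is simplified to a case-insensitive substring match over a list of sentences.

```ruby
# Hypothetical, in-memory version of the map/compact/to_h pipeline.
# Each doc is assumed to be a Hash with :url and :sentences keys.
def search_text_sketch(docs, query, top_result_only: false)
  docs
    .map do |doc|
      # Simplified stand-in for doc.search: keep sentences containing
      # the query, ignoring case.
      results = doc[:sentences].select do |sentence|
        sentence.downcase.include?(query.downcase)
      end
      # Docs without a match yield nil and are dropped by compact below.
      next nil if results.empty?

      [doc[:url], (top_result_only ? results.first : results)]
    end
    .compact
    .to_h
end
```

Documents without a text match never appear as keys, and top_result_only switches each value from an Array of matches to the first match alone.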
#size ⇒ Integer
Returns the current size of the database.
# File 'lib/wgit/database/database.rb', line 424

def size
  stats[:dataSize]
end
#stats ⇒ BSON::Document#[]#fetch
Returns statistics about the database.
# File 'lib/wgit/database/database.rb', line 417

def stats
  @client.command(dbStats: 0).documents[0]
end
#uncrawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that haven't yet been crawled.
# File 'lib/wgit/database/database.rb', line 290

def uncrawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: false, limit:, skip:, &block)
end
#update(obj) ⇒ Integer
Update a Url or Document object in the DB.
# File 'lib/wgit/database/database.rb', line 502

def update(obj)
  collection, query, model = get_type_info(obj)
  data_hash = model.merge(Wgit::Model.common_update_data)

  mutate(collection, query, { '$set' => data_hash })
end
#upsert(obj) ⇒ Boolean
Inserts or updates the object in the database.
# File 'lib/wgit/database/database.rb', line 185

def upsert(obj)
  collection, query, model = get_type_info(obj)
  data_hash = model.merge(Wgit::Model.common_update_data)
  result = @client[collection].replace_one(query, data_hash, upsert: true)

  result.matched_count.zero?
ensure
  @last_result = result
end
#url?(url) ⇒ Boolean
Returns whether or not a record with the given 'url' field (which is unique) exists in the database's 'urls' collection.
# File 'lib/wgit/database/database.rb', line 454

def url?(url)
  assert_type(url, String) # This includes Wgit::Url's.
  query = { url: }
  retrieve(URLS_COLLECTION, query, limit: 1).any?
end
#urls(crawled: nil, limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns all Url records from the DB.
All Urls are sorted by date_added ascending, in other words the first url returned is the first one that was inserted into the DB.
# File 'lib/wgit/database/database.rb', line 260

def urls(crawled: nil, limit: 0, skip: 0, &block)
  query = crawled.nil? ? {} : { crawled: }
  sort = { date_added: 1 }

  results = retrieve(URLS_COLLECTION, query, sort:, limit:, skip:)
  return [] if results.count < 1 # results#empty? doesn't exist.

  # results.respond_to? :map! is false so we use map and overwrite the var.
  results = results.map { |url_doc| Wgit::Url.new(url_doc) }
  results.each(&block) if block_given?

  results
end
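How crawled, limit and skip interact can be approximated in memory. This is a sketch under assumptions, not the library's behaviour verbatim: urls_sketch is a hypothetical name, records are plain Hashes, and a limit of 0 is treated as "no limit", which is how MongoDB interprets it.

```ruby
# Hypothetical in-memory stand-in for the urls collection query.
# Each record is assumed to be a Hash with :url, :crawled and :date_added.
def urls_sketch(records, crawled: nil, limit: 0, skip: 0)
  # crawled: nil means no filter, matching the empty query Hash above.
  matched = crawled.nil? ? records : records.select { |r| r[:crawled] == crawled }
  # Oldest first, mirroring sort = { date_added: 1 }.
  sorted = matched.sort_by { |r| r[:date_added] }
  skipped = sorted.drop(skip)
  # Assumption: a limit of 0 returns everything.
  limit.zero? ? skipped : skipped.first(limit)
end
```

Under the same assumptions, crawled_urls and uncrawled_urls correspond to calling this with crawled: true and crawled: false respectively.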