Module: Wgit::Utils
Defined in: lib/wgit/utils.rb
Overview
Utility module containing generic methods that don't belong to a Class.
Class Method Summary
- .build_search_regex(query, case_sensitive: false, whole_sentence: true) ⇒ Regexp
  Build a regular expression from a query string, for searching text with.
- .each(obj_or_objs) {|el| ... } ⇒ Object
  An improved :each method which supports both singleton and Enumerable objects (as opposed to just an Enumerable object).
- .fetch(hash, key, default = nil) ⇒ Object
  An improved Hash :fetch method which checks for multiple formats of the given key and returns the value, or the default value (nil unless provided).
- .format_sentence_length(sentence, index, sentence_limit) ⇒ String
  Formats the sentence (modifies the receiver) and returns its value.
- .pprint(identifier, display: true, stream: $stdout, prefix: 'DEBUG', new_line: false, **vars) ⇒ Object
  Pretty prints a log statement, used for debugging purposes.
- .pprint_all_search_results(results, keyword_limit: 5, include_score: false, stream: $stdout) ⇒ Integer
  Prints out the search results listing all of the matching text in each document.
- .pprint_top_search_results(results, keyword_limit: 5, include_score: false, stream: $stdout) ⇒ Integer
  Prints out the search results in a search engine like format.
- .sanitize(obj, encode: true) ⇒ Object
  Sanitises the obj to make it uniform by calling the correct sanitize_* method for its type.
- .sanitize_arr(arr, encode: true) ⇒ Enumerable
  Sanitises an Array to make it uniform.
- .sanitize_str(str, encode: true) ⇒ String
  Sanitises a String to make it uniform.
- .sanitize_url(url, encode: true) ⇒ Wgit::Url
  Sanitises a Wgit::Url to make it uniform.
- .time_stamp ⇒ Time
  Returns the current time stamp.
- .to_h(obj, ignore: [], use_strings_as_keys: true) ⇒ Hash
  Returns a Hash created from obj's instance vars and values.
Class Method Details
.build_search_regex(query, case_sensitive: false, whole_sentence: true) ⇒ Regexp
Build a regular expression from a query string, for searching text with.
Searches using this regex always match whole words only; whether the query must also match as a whole sentence (all words, in order) is configurable using the whole_sentence: param. For example:
text = "hello world"
query = "world hello", whole_sentence: true # => No match
query = "world hello", whole_sentence: false # => Match
query = "he" # => Never matches
# File 'lib/wgit/utils.rb', line 318

def self.build_search_regex(
  query, case_sensitive: false, whole_sentence: true
)
  return query if query.is_a?(Regexp)

  # query: "hello world", whole_sentence: false produces:
  # (?<=^|\s|[^a-zA-Z0-9])hello(?=$|\s|[^a-zA-Z0-9])|(?<=^|\s|[^a-zA-Z0-9])world(?=$|\s|[^a-zA-Z0-9])
  sep = whole_sentence ? " " : "|"
  segs = query.split(" ").map do |word|
    word = Regexp.escape(word)
    "(?<=^|\\s|[^a-zA-Z0-9])#{word}(?=$|\\s|[^a-zA-Z0-9])"
  end
  query = segs.join(sep)

  Regexp.new(query, !case_sensitive)
end
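The matching rules described above can be checked with a small script. This is a sketch only: it uses a standalone copy of the method body (so it runs without the gem installed), and the local variable names are illustrative rather than part of Wgit.

```ruby
# Standalone copy of the method shown above, so the sketch runs without the gem.
def build_search_regex(query, case_sensitive: false, whole_sentence: true)
  return query if query.is_a?(Regexp)

  sep = whole_sentence ? ' ' : '|'
  segs = query.split(' ').map do |word|
    word = Regexp.escape(word)
    "(?<=^|\\s|[^a-zA-Z0-9])#{word}(?=$|\\s|[^a-zA-Z0-9])"
  end

  Regexp.new(segs.join(sep), !case_sensitive)
end

text = 'hello world'

# Whole sentence: the words must appear in the given order.
sentence_match = text.match?(build_search_regex('world hello'))
# Not whole sentence: any one whole word is enough.
word_match = text.match?(build_search_regex('world hello', whole_sentence: false))
# Partial words never match.
partial_match = text.match?(build_search_regex('he'))
# Case is ignored unless case_sensitive: true is given.
case_match = text.match?(build_search_regex('HELLO'))
strict_match = text.match?(build_search_regex('HELLO', case_sensitive: true))

puts [sentence_match, word_match, partial_match, case_match, strict_match].inspect
```

With the gem loaded, the equivalent call would be Wgit::Utils.build_search_regex with the same arguments.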
.each(obj_or_objs) {|el| ... } ⇒ Object
An improved :each method which supports both singleton and Enumerable objects (as opposed to just an Enumerable object).
# File 'lib/wgit/utils.rb', line 38

def self.each(obj_or_objs, &block)
  if obj_or_objs.respond_to?(:each)
    obj_or_objs.each(&block)
  else
    yield(obj_or_objs)
  end

  obj_or_objs
end
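A short sketch of the singleton-versus-Enumerable behaviour follows. The method is a standalone copy renamed each_obj (a hypothetical name, chosen to avoid clashing with anything else) so the script runs without the gem.

```ruby
# Standalone copy of the method shown above, renamed for this sketch.
def each_obj(obj_or_objs, &block)
  if obj_or_objs.respond_to?(:each)
    obj_or_objs.each(&block)
  else
    yield(obj_or_objs)
  end

  obj_or_objs
end

seen = []

# A String does not respond to :each, so it is yielded once as-is.
single = each_obj('a') { |el| seen << el }
# An Array responds to :each, so each element is yielded in turn.
many = each_obj(%w[b c]) { |el| seen << el }

puts seen.inspect
```

In both cases the original argument is returned, which allows the call to be chained.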
.fetch(hash, key, default = nil) ⇒ Object
An improved Hash :fetch method which checks for multiple formats of the given key and returns the value, or the default value (nil unless provided).
For example, if key == :foo, hash is searched for: :foo, 'foo', 'Foo', 'FOO' in that order. The first value found is returned. If no value is found, the default value is returned.
# File 'lib/wgit/utils.rb', line 61

def self.fetch(hash, key, default = nil)
  key = key.to_s.downcase

  # Try (in order): :foo, 'foo', 'Foo', 'FOO'.
  [key.to_sym, key, key.capitalize, key.upcase].each do |k|
    value = hash[k]

    return value if value
  end

  default
end
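The key lookup order can be illustrated as below. The method is a standalone copy renamed fetch_any (a hypothetical name) so the sketch runs without the gem. One consequence of the `return value if value` check in the source is that a stored false or nil is treated as not found, so the default is returned instead.

```ruby
# Standalone copy of the method shown above, renamed for this sketch.
def fetch_any(hash, key, default = nil)
  key = key.to_s.downcase

  # Try (in order): :foo, 'foo', 'Foo', 'FOO'.
  [key.to_sym, key, key.capitalize, key.upcase].each do |k|
    value = hash[k]
    return value if value
  end

  default
end

by_symbol = fetch_any({ foo: 1 }, 'FOO')          # found under :foo
by_upcase = fetch_any({ 'FOO' => 2 }, :foo)       # found under 'FOO'
missing   = fetch_any({ bar: 3 }, :foo, 'none')   # no variant of the key exists
falsy     = fetch_any({ foo: false }, :foo, 'none') # falsy values are skipped

puts [by_symbol, by_upcase, missing, falsy].inspect
```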
.format_sentence_length(sentence, index, sentence_limit) ⇒ String
Formats the sentence (modifies the receiver) and returns its value. The formatting is essentially to shorten the sentence and ensure that the index is present somewhere in the sentence. Used for search query results with the index of the matching query.
# File 'lib/wgit/utils.rb', line 87

def self.format_sentence_length(sentence, index, sentence_limit)
  raise 'A sentence value must be provided' if sentence.empty?
  raise 'The sentence length value must be even' if sentence_limit.odd?
  if index.negative? || (index > sentence.length)
    raise "Incorrect index value: #{index}"
  end

  return sentence if sentence_limit.zero?

  start  = 0
  finish = sentence.length

  if sentence.length > sentence_limit
    start  = index - (sentence_limit / 2)
    finish = index + (sentence_limit / 2)

    if start.negative?
      diff = 0 - start
      if (finish + diff) > sentence.length
        finish = sentence.length
      else
        finish += diff
      end
      start = 0
    elsif finish > sentence.length
      diff = finish - sentence.length
      if (start - diff).negative?
        start = 0
      else
        start -= diff
      end
      finish = sentence.length
    end

    raise if sentence[start..(finish - 1)].length != sentence_limit
  end

  sentence.replace(sentence[start..(finish - 1)])
end
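The windowing logic amounts to centring a window of sentence_limit characters on index and then sliding it back inside the sentence if it overhangs either end. The sketch below is not the library's code: it is a condensed, non-mutating equivalent using Comparable#clamp, with the hypothetical name sentence_window, and it omits the argument validation (non-empty sentence, even limit, index in range) that the real method performs.

```ruby
# Condensed sketch of the windowing logic; assumes the arguments are valid
# and returns a new String rather than modifying the receiver.
def sentence_window(sentence, index, sentence_limit)
  return sentence.dup if sentence_limit.zero? || sentence.length <= sentence_limit

  # Centre the window on index, then keep it within the sentence bounds.
  start = (index - (sentence_limit / 2)).clamp(0, sentence.length - sentence_limit)
  sentence[start, sentence_limit]
end

sentence = 'The quick brown fox jumps over the lazy dog'

middle = sentence_window(sentence, sentence.index('fox'), 10) # window centred on 'fox'
head   = sentence_window(sentence, 0, 10)                     # window pinned to the start
tail   = sentence_window(sentence, sentence.index('dog'), 10) # window pinned to the end

puts [middle, head, tail].inspect
```

Unlike this sketch, Wgit::Utils.format_sentence_length replaces the contents of the sentence passed in, so callers that need the original text should pass a copy.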
.pprint(identifier, display: true, stream: $stdout, prefix: 'DEBUG', new_line: false, **vars) ⇒ Object
Pretty prints a log statement, used for debugging purposes.
# File 'lib/wgit/utils.rb', line 359

def self.pprint(identifier, display: true, stream: $stdout, prefix: 'DEBUG', new_line: false, **vars)
  return unless display

  sep1 = new_line ? "\n" : ' - '
  sep1 = '' if vars.empty?
  sep2 = new_line ? "\n" : ' | '

  stream.print "\n#{prefix}_#{identifier}#{sep1}"

  vars.each_with_index do |arr, i|
    is_last_item = (i + 1) == vars.size
    sep3 = sep2
    sep3 = new_line ? "\n" : '' if is_last_item
    k, v = arr

    stream.print "#{k}: #{v.inspect}#{sep3}"
  end

  stream.puts "\n"
  stream.puts "\n" unless new_line

  nil
end
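To see the shape of the log line, the output can be captured in a StringIO instead of $stdout. This is a sketch using a standalone copy of the method (so it runs without the gem); the identifier 'START' and the count and url variables are made up for illustration.

```ruby
require 'stringio'

# Standalone copy of the method shown above, so the sketch runs without the gem.
def pprint(identifier, display: true, stream: $stdout, prefix: 'DEBUG', new_line: false, **vars)
  return unless display

  sep1 = new_line ? "\n" : ' - '
  sep1 = '' if vars.empty?
  sep2 = new_line ? "\n" : ' | '

  stream.print "\n#{prefix}_#{identifier}#{sep1}"

  vars.each_with_index do |(k, v), i|
    is_last_item = (i + 1) == vars.size
    sep3 = sep2
    sep3 = new_line ? "\n" : '' if is_last_item

    stream.print "#{k}: #{v.inspect}#{sep3}"
  end

  stream.puts "\n"
  stream.puts "\n" unless new_line

  nil
end

io = StringIO.new
result = pprint('START', stream: io, count: 1, url: 'http://x')
single_line = io.string

silent_io = StringIO.new
pprint('START', display: false, stream: silent_io, count: 1)
silent = silent_io.string

print single_line
```

Any keyword that is not one of display, stream, prefix or new_line is collected into vars and printed as a `name: value.inspect` pair.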
.pprint_all_search_results(results, keyword_limit: 5, include_score: false, stream: $stdout) ⇒ Integer
Prints out the search results listing all of the matching text in each document.
The given results should be matching documents from a DB and should have
doc.search_text! called for each document - to turn doc.text into only
matching text, which this method uses.
The format for each result looks something like:
Title
Keywords (if there are some)
URL
Score (if include_score: true)
"<text_snippet_1>"
"<text_snippet_2>"
...
<separator>
# File 'lib/wgit/utils.rb', line 207

def self.pprint_all_search_results(
  results, keyword_limit: 5, include_score: false, stream: $stdout
)
  raise 'stream must respond_to? :puts' unless stream.respond_to?(:puts)

  results.each_with_index do |doc, i|
    last_result = i == (results.size-1)

    title    = doc.title || '<no title>'
    keywords = doc.keywords&.take(keyword_limit)&.join(', ')
    url      = doc.url
    score    = doc.score

    stream.puts title
    stream.puts keywords if keywords
    stream.puts url
    stream.puts score if include_score
    stream.puts
    doc.text.each { |text| stream.puts text }
    stream.puts
    stream.puts "-----" unless last_result
    stream.puts unless last_result
  end

  results.size
end
.pprint_top_search_results(results, keyword_limit: 5, include_score: false, stream: $stdout) ⇒ Integer
Prints out the search results in a search engine like format.
The given results should be matching documents from a DB and should have
doc.search_text! called for each document - to turn doc.text into only
matching text, which this method uses.
The format for each result looks something like:
Title
Keywords (if there are some)
Text Snippet (formatted to show the searched for query)
URL
Score (if include_score: true)
<empty_line_separator>
# File 'lib/wgit/utils.rb', line 153

def self.pprint_top_search_results(
  results, keyword_limit: 5, include_score: false, stream: $stdout
)
  raise 'stream must respond_to? :puts' unless stream.respond_to?(:puts)

  results.each do |doc|
    title    = doc.title || '<no title>'
    keywords = doc.keywords&.take(keyword_limit)&.join(', ')
    sentence = doc.text.first
    url      = doc.url
    score    = doc.score

    stream.puts title
    stream.puts keywords if keywords
    stream.puts sentence
    stream.puts url
    stream.puts score if include_score
    stream.puts
  end

  results.size
end
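The result format described above can be reproduced without a database. This is a sketch only: Doc is a hypothetical Struct standing in for a document (it provides just the readers the method calls), the method is a standalone copy so the script runs without the gem, and the sample values are invented.

```ruby
require 'stringio'

# Hypothetical stand-in for a document; only the readers used below exist.
Doc = Struct.new(:title, :keywords, :text, :url, :score, keyword_init: true)

# Standalone copy of the method shown above, so the sketch runs without the gem.
def pprint_top_search_results(results, keyword_limit: 5, include_score: false, stream: $stdout)
  raise 'stream must respond_to? :puts' unless stream.respond_to?(:puts)

  results.each do |doc|
    title    = doc.title || '<no title>'
    keywords = doc.keywords&.take(keyword_limit)&.join(', ')

    stream.puts title
    stream.puts keywords if keywords
    stream.puts doc.text.first # only the first matching snippet is shown
    stream.puts doc.url
    stream.puts doc.score if include_score
    stream.puts
  end

  results.size
end

docs = [
  Doc.new(title: nil, keywords: %w[a b c], text: ['snippet one', 'snippet two'],
          url: 'http://example.com', score: 2.5)
]

io = StringIO.new
count = pprint_top_search_results(docs, keyword_limit: 2, include_score: true, stream: io)
printed = io.string

print printed
```

A missing title falls back to '<no title>', keywords are truncated to keyword_limit, and the return value is the number of results.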
.sanitize(obj, encode: true) ⇒ Object
Sanitises the obj to make it uniform by calling the correct sanitize_* method for its type, e.g. if obj.is_a?(String) then sanitize_str(obj) is called. Any type not handled by the case statement (anything other than Wgit::Url, String or Array) is returned as is. Call this method if unsure what obj's type is.
# File 'lib/wgit/utils.rb', line 243

def self.sanitize(obj, encode: true)
  case obj
  when Wgit::Url
    sanitize_url(obj, encode:)
  when String
    sanitize_str(obj, encode:)
  when Array
    sanitize_arr(obj, encode:)
  else
    obj
  end
end
.sanitize_arr(arr, encode: true) ⇒ Enumerable
Sanitises an Array to make it uniform. Sanitises each element using Wgit::Utils.sanitize, then removes empty Strings, nils and duplicates.
# File 'lib/wgit/utils.rb', line 290

def self.sanitize_arr(arr, encode: true)
  return arr unless arr.is_a?(Array)

  arr
    .map { |str| sanitize(str, encode:) }
    .reject { |str| str.is_a?(String) && str.empty? }
    .compact
    .uniq
end
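The Array clean-up steps can be sketched as follows. The helpers are simplified standalone copies so the script runs without the gem: the Wgit::Url branch of sanitize is left out, and elements are dispatched directly to sanitize_str or sanitize_arr.

```ruby
# Simplified standalone copies of the sanitize_* methods (no Wgit::Url branch).
def sanitize_str(str, encode: true)
  return str unless str.is_a?(String)

  str = str.encode('UTF-8', undef: :replace, invalid: :replace) if encode
  str.strip
end

def sanitize_arr(arr, encode: true)
  return arr unless arr.is_a?(Array)

  arr
    .map { |el| el.is_a?(Array) ? sanitize_arr(el, encode: encode) : sanitize_str(el, encode: encode) }
    .reject { |el| el.is_a?(String) && el.empty? } # drop Strings that are empty after stripping
    .compact                                        # drop nils
    .uniq                                           # drop duplicates
end

cleaned = sanitize_arr(['  foo ', '', nil, 'foo', " bar\n", 42])

puts cleaned.inspect
```

Because stripping happens before the empty check, a whitespace-only String is also removed, and non-String elements such as 42 pass through untouched.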
.sanitize_str(str, encode: true) ⇒ String
Sanitises a String to make it uniform. Strips any leading/trailing white
space. Also applies UTF-8 encoding (replacing invalid characters) if
encode: true.
# File 'lib/wgit/utils.rb', line 277

def self.sanitize_str(str, encode: true)
  return str unless str.is_a?(String)

  str = str.encode('UTF-8', undef: :replace, invalid: :replace) if encode
  str.strip
end
.sanitize_url(url, encode: true) ⇒ Wgit::Url
Sanitises a Wgit::Url to make it uniform. First sanitises the Url as a String, then replaces the Url's value with the sanitised version. This method therefore modifies the given url param and also returns it.
# File 'lib/wgit/utils.rb', line 264

def self.sanitize_url(url, encode: true)
  str = sanitize_str(url.to_s, encode:)
  url.replace(str)
end
.time_stamp ⇒ Time
Returns the current time stamp.
# File 'lib/wgit/utils.rb', line 9

def self.time_stamp
  Time.new
end
.to_h(obj, ignore: [], use_strings_as_keys: true) ⇒ Hash
Returns a Hash created from obj's instance vars and values.
# File 'lib/wgit/utils.rb', line 20

def self.to_h(obj, ignore: [], use_strings_as_keys: true)
  obj.instance_variables.reduce({}) do |hash, var|
    next hash if ignore.include?(var.to_s)

    key = var.to_s[1..] # Remove the @ prefix.
    key = key.to_sym unless use_strings_as_keys
    hash[key] = obj.instance_variable_get(var)
    hash
  end
end
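A brief sketch of the conversion follows. The method is a standalone copy renamed vars_to_h (a hypothetical name) so the script runs without the gem, and Page is an invented class. Judging by the source, entries in ignore are compared against the variable name including its @ prefix.

```ruby
# Standalone copy of the method shown above, renamed for this sketch.
def vars_to_h(obj, ignore: [], use_strings_as_keys: true)
  obj.instance_variables.reduce({}) do |hash, var|
    next hash if ignore.include?(var.to_s)

    key = var.to_s[1..] # Remove the @ prefix.
    key = key.to_sym unless use_strings_as_keys
    hash[key] = obj.instance_variable_get(var)
    hash
  end
end

# Hypothetical class used only to supply some instance variables.
class Page
  def initialize(url, title)
    @url = url
    @title = title
  end
end

page = Page.new('http://example.com', 'Example')

all_vars = vars_to_h(page)
filtered = vars_to_h(page, ignore: ['@title'], use_strings_as_keys: false)

puts all_vars.inspect
puts filtered.inspect
```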