Module: MMETools::Webparse
Overview
Methods for processing strings while parsing web pages.
Instance Method Summary
- #acronymize(str) ⇒ Object
  Transforms the string str into an acronym.
- #asciify(str) ⇒ Object
  Attempts to convert str to plain ASCII.
- #clear_string(str, opts = {}) ⇒ Object
  Removes unnecessary whitespace and HTML codes from the middle and both ends of a string: non-printable characters at the ends are dropped, and each run of them in the middle is replaced by a single space.
- #clear_uri(uri) ⇒ Object
  Returns the URI with any JavaScript invocation stripped, if there is one.
- #datify(str) ⇒ Object
  Extracts and returns the first recognizable date in a string as a DateTime.
- #shorten(str) ⇒ Object
  Transforms str into a shortened version: strips all non-alphanumeric characters, non-ASCII characters and spaces, then joins the first two letters of every word, capitalized.
Instance Method Details
#acronymize(str) ⇒ Object
Transforms the string str into an acronym.
    # File 'lib/mme_tools/webparse.rb', line 54
    def acronymize(str)
      cleared_str = clear_string(str, :encoding => 'ASCII').gsub(/\W/," ")
      # option 1: drop single-letter words and the unwanted words "de" and "en"
      unwanted_words_pttrn = %w[de en].map {|w| "\\b#{w}\\b"}.join("|")
      res = cleared_str.gsub(/\b\w\b|#{unwanted_words_pttrn}/i," ")
      res = res.split(" ").map {|s| s[0..0].upcase}.join
      # option 2: if nothing is left, use the initial of every word
      if res == ""
        res = cleared_str.split(" ").map {|s| s[0..0].upcase}.join
      end
      res
    end
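To make the two-step fallback concrete, here is a minimal stand-alone sketch. It assumes the input is already ASCII, so the call to clear_string(str, :encoding => 'ASCII') is replaced by a plain whitespace squeeze; the name acronymize_sketch and the sample strings are hypothetical and not part of the module.

```ruby
# Stand-alone sketch of the acronymize logic.
# ASSUMPTION: input is already ASCII, so clear_string is replaced by a
# simple whitespace squeeze. acronymize_sketch is a hypothetical name.
def acronymize_sketch(str)
  cleared = str.gsub(/\s+/, " ").strip.gsub(/\W/, " ")

  # Option 1: drop single-letter words and the stop words "de" and "en".
  stop_words = %w[de en].map { |w| "\\b#{w}\\b" }.join("|")
  res = cleared.gsub(/\b\w\b|#{stop_words}/i, " ")
  res = res.split(" ").map { |s| s[0].upcase }.join

  # Option 2: if nothing survived option 1, use the initial of every word.
  res = cleared.split(" ").map { |s| s[0].upcase }.join if res.empty?
  res
end

puts acronymize_sketch("Departament de Medi Ambient")  # DMA
puts acronymize_sketch("de en")                        # DE
```

The second call shows why option 2 exists: when every word is filtered out, the method falls back to the unfiltered initials instead of returning an empty string.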
#asciify(str) ⇒ Object
Attempts to convert str to plain ASCII.
    # File 'lib/mme_tools/webparse.rb', line 49
    def asciify(str)
      # Iconv shipped with Ruby's standard library up to 1.9; later versions need the iconv gem
      Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', str)
    end
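Since Iconv left the standard library in Ruby 2.0, the following sketch shows one possible substitute on Ruby 2.2 or later. It is a different technique from the one documented here (Unicode decomposition plus removal of combining marks, not Iconv transliteration), and the name asciify_sketch is hypothetical. Characters without a canonical decomposition, such as "ß", are dropped rather than transliterated.

```ruby
# Hypothetical Iconv-free alternative, NOT the module's implementation.
# Requires Ruby >= 2.2 for String#unicode_normalize.
def asciify_sketch(str)
  # Split accented letters into base letter + combining mark (NFD).
  decomposed = str.unicode_normalize(:nfd)
  # Remove the combining marks (Unicode category Mn).
  stripped = decomposed.gsub(/\p{Mn}/, "")
  # Drop whatever still has no ASCII representation.
  stripped.encode("ASCII", invalid: :replace, undef: :replace, replace: "")
end

puts asciify_sketch("Català és fàcil")  # Catala es facil
```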
#clear_string(str, opts = {}) ⇒ Object
Removes unnecessary whitespace and HTML codes from the middle and both ends of a string: non-printable characters at the ends are dropped, and each run of them in the middle is replaced by a single space. The options opts can be:
+:encoding+ => "ASCII" | "UTF8" (default)
  "ASCII" converts every character to its closest ASCII equivalent (with Iconv)
  "UTF8" returns a UTF8 string
(based on an idea of Obie Fernandez www.jroller.com/obie/tags/unicode)
    # File 'lib/mme_tools/webparse.rb', line 37
    def clear_string(str, opts={})
      opts = {:encoding=>'UTF8'}.merge opts # default option :encoding=>'UTF8'
      str=str.chars.map { |c| (c.bytes[0] <= 127) ? c : translation_hash[c] }.join if opts[:encoding]=='ASCII'
      str.gsub(/[\s\302\240]+/mu," ").strip # the UTF8 character "\302\240" corresponds to HTML's &nbsp;
    end
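The whitespace-collapsing step can be tried in isolation. The sketch below covers only that step (no ASCII conversion) and writes the non-breaking space as \u00A0 instead of the byte escapes \302\240; squeeze_whitespace is a hypothetical name used for illustration.

```ruby
# Sketch of the whitespace normalisation only; the :encoding option is omitted.
# squeeze_whitespace is a hypothetical name, not part of the module.
def squeeze_whitespace(str)
  # \u00A0 is the non-breaking space (HTML &nbsp;), whose UTF-8 bytes are \302\240.
  # Every run of whitespace or non-breaking spaces becomes one space,
  # then leading and trailing spaces are stripped.
  str.gsub(/[\s\u00A0]+/, " ").strip
end

puts squeeze_whitespace("  Hello\u00A0\u00A0 world \n").inspect  # "Hello world"
```

The non-breaking space has to be listed explicitly because \s alone does not match it in Ruby regular expressions.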
#clear_uri(uri) ⇒ Object
Returns the URI with any JavaScript invocation stripped, if there is one. For example:
"javascript:openDoc('/gisa/documentos/cartes/PT.DOC')" -> "/gisa/documentos/cartes/PT.DOC"
    # File 'lib/mme_tools/webparse.rb', line 22
    def clear_uri uri
      case uri
      when /Doc\('.*'\)/ then uri.match(/Doc\('(.*)'\)/).captures[0]
      else uri
      end
    end
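A short usage sketch, with the method copied outside the module so it runs on its own. The first call uses the example from the description; the second path is a made-up plain URI that shows the pass-through branch.

```ruby
# Copy of clear_uri as a top-level method, for illustration only.
def clear_uri(uri)
  case uri
  when /Doc\('.*'\)/ then uri.match(/Doc\('(.*)'\)/).captures[0]
  else uri
  end
end

# JavaScript invocation: the quoted argument is extracted.
puts clear_uri("javascript:openDoc('/gisa/documentos/cartes/PT.DOC')")  # /gisa/documentos/cartes/PT.DOC

# Plain URI (hypothetical path): returned unchanged.
puts clear_uri("/gisa/index.html")  # /gisa/index.html
```

Note that the pattern only recognizes calls whose function name ends in Doc followed by a single-quoted argument; any other javascript: link is returned as is.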
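#datify(str) ⇒ Object
Extracts and returns the first recognizable date in a string as a DateTime. The date is read as day, month, year, separated by "/" or "-", and may be followed by a time given as hour:minute.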
    # File 'lib/mme_tools/webparse.rb', line 78
    def datify(str)
      pttrn = /(\d+)[\/-](\d+)[\/-](\d+)(\W+(\d+)\:(\d+))?/
      day, month, year, dummy, hour, min = str.match(pttrn).captures.map {|d| d ? d.to_i : 0 }
      case year
      when 0..69
        year += 2000
      when 70..99
        year += 1900
      end
      DateTime.civil year, month, day, hour, min
    end
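The two-digit year handling is easiest to see with a few inputs. The sketch below copies the method to the top level and adds the require it needs; the sample strings are invented for illustration.

```ruby
require 'date'

# Copy of datify as a top-level method, for illustration only.
def datify(str)
  pttrn = /(\d+)[\/-](\d+)[\/-](\d+)(\W+(\d+)\:(\d+))?/
  day, month, year, dummy, hour, min = str.match(pttrn).captures.map { |d| d ? d.to_i : 0 }
  case year
  when 0..69  then year += 2000  # 00..69 -> 2000..2069
  when 70..99 then year += 1900  # 70..99 -> 1970..1999
  end
  DateTime.civil year, month, day, hour, min
end

fmt = "%Y-%m-%d %H:%M"
puts datify("Published 07/03/09 10:30").strftime(fmt)  # 2009-03-07 10:30
puts datify("12-11-1998").strftime(fmt)                # 1998-11-12 00:00
puts datify("01/02/75").strftime(fmt)                  # 1975-02-01 00:00
```

As written, a string that contains no date makes str.match return nil, so the call raises NoMethodError; callers may want to guard against that.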
#shorten(str) ⇒ Object
Transforms str into a shortened version: strips all non-alphanumeric characters, non-ASCII characters and spaces, then joins the first two letters of every word, capitalized.
    # File 'lib/mme_tools/webparse.rb', line 72
    def shorten(str)
      cleared_str = clear_string(str, :encoding => 'ASCII').gsub(/\W/," ")
      cleared_str.split(" ").map {|s| s[0..1].capitalize}.join
    end
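A minimal stand-alone sketch of the same transformation, assuming ASCII input so that clear_string can be replaced by a whitespace squeeze. The name shorten_sketch and the sample string are hypothetical.

```ruby
# Stand-alone sketch of the shorten logic.
# ASSUMPTION: input is already ASCII, so clear_string is replaced by a
# simple whitespace squeeze. shorten_sketch is a hypothetical name.
def shorten_sketch(str)
  # Turn punctuation into spaces so it separates words.
  cleared = str.gsub(/\s+/, " ").strip.gsub(/\W/, " ")
  # Keep the first two letters of each word, capitalized, and join them.
  cleared.split(" ").map { |s| s[0..1].capitalize }.join
end

puts shorten_sketch("medi ambient, habitatge")  # MeAmHa
```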