Module: MMETools::Webparse

Extended by:
Webparse
Included in:
Webparse
Defined in:
lib/mme_tools/webparse.rb

Overview

Methods for processing strings while parsing web pages.

Instance Method Details

#acronymize(str) ⇒ Object

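Transforms the string str into an acronym built from the uppercased initial of each word, skipping single-letter words and the stop words "de" and "en". If that leaves nothing, it falls back to the initials of every word.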



# File 'lib/mme_tools/webparse.rb', line 54

def acronymize(str)
  cleared_str = clear_string(str, :encoding => 'ASCII').gsub(/\W/," ")

  # Option 1: drop single-letter words and the stop words "de" and "en"
  unwanted_words_pttrn = %w[de en].map {|w| "\\b#{w}\\b"}.join("|")
  res = cleared_str.gsub(/\b\w\b|#{unwanted_words_pttrn}/i," ")
  res = res.split(" ").map {|s| s[0..0].upcase}.join

  # Option 2: if nothing is left, use the initials of every word
  if res == ""
    res = cleared_str.split(" ").map {|s| s[0..0].upcase}.join
  end
  res
end
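
The two passes can be tried out in isolation. The following is a minimal sketch, not the library code itself: it assumes ASCII-only input and swaps clear_string for a hypothetical stand-in, clear_ascii, that only collapses whitespace.

```ruby
# Hypothetical stand-in for clear_string: collapses whitespace only and
# assumes the input is already ASCII (no transliteration is attempted).
def clear_ascii(str)
  str.gsub(/\s+/, " ").strip
end

# Copy of the acronymize logic, wired to the stand-in above.
def acronymize(str)
  cleared_str = clear_ascii(str).gsub(/\W/, " ")

  # First pass: ignore single-letter words and the stop words "de" and "en".
  unwanted = %w[de en].map { |w| "\\b#{w}\\b" }.join("|")
  res = cleared_str.gsub(/\b\w\b|#{unwanted}/i, " ")
  res = res.split(" ").map { |s| s[0..0].upcase }.join

  # Second pass: if everything was filtered out, use every word.
  res = cleared_str.split(" ").map { |s| s[0..0].upcase }.join if res == ""
  res
end

puts acronymize("Departament de Medi Ambient i Habitatge")  # prints DMAH
puts acronymize("de en")                                    # prints DE (fallback)
```

The second call shows why the fallback exists: a string made only of stop words would otherwise yield an empty acronym.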

#asciify(str) ⇒ Object

Tries to convert str to plain ASCII.



# File 'lib/mme_tools/webparse.rb', line 49

def asciify(str)
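  # //TRANSLIT picks the closest ASCII character; //IGNORE drops what cannot be converted.
  # Needs the iconv library (a separate gem since Ruby 2.0).
  Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', str)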
end
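
Where Iconv is not available, a similar effect can be approximated with the standard library alone. The sketch below uses a different technique (Unicode decomposition followed by removal of combining marks), so its output will not always match Iconv's transliteration. The name asciify_sketch is hypothetical and not part of MMETools::Webparse.

```ruby
# Hypothetical Iconv-free approximation of asciify (Ruby 2.2 or later).
def asciify_sketch(str)
  str.unicode_normalize(:nfd)      # split "ó" into "o" plus a combining accent
     .gsub(/\p{Mn}/, "")           # remove the combining marks
     .encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
                                   # drop whatever still is not ASCII
end

puts asciify_sketch("Informaci\u00F3 t\u00E8cnica")  # prints Informacio tecnica
```

Characters without a canonical decomposition, such as "ß" or "€", are simply dropped here, whereas Iconv with //TRANSLIT may substitute a lookalike.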

#clear_string(str, opts = {}) ⇒ Object

Removes unnecessary whitespace and HTML space codes from the middle and both ends of a string. It strips all non-printable characters from the ends and replaces each run of them in the middle with a single space. The options in opts can be:

+:encoding+ => "ASCII" | "UTF8" (default)
  "ASCII" converts every character to its closest ASCII equivalent (with Iconv)
  "UTF8" returns a UTF8 string

(Based on an idea by Obie Fernandez: www.jroller.com/obie/tags/unicode)



# File 'lib/mme_tools/webparse.rb', line 37

def clear_string(str, opts={})
  options = {:encoding=>'UTF8'}.merge opts  # default option :encoding=>'UTF8'
  str=str.chars.map { |c| (c.bytes[0] <= 127) ? c : translation_hash[c] }.join if options[:encoding]=='ASCII'
  str.gsub(/[\s\302\240]+/mu," ").strip # the UTF8 character "\302\240" is HTML's &nbsp;
end
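
The helper translation_hash is not shown on this page, so the following sketch substitutes a deliberately tiny, hypothetical table (TRANSLATION) to illustrate both options. The name clear_string_sketch is also an assumption made for this example.

```ruby
# Hypothetical stand-in for translation_hash; the real table is not shown here.
TRANSLATION = {
  "\u00E0" => "a", "\u00E8" => "e", "\u00F3" => "o", "\u00E7" => "c",
  "\u00A0" => " "   # keep non-breaking spaces as separators
}.freeze

def clear_string_sketch(str, opts = {})
  options = { :encoding => 'UTF8' }.merge(opts)
  if options[:encoding] == 'ASCII'
    # Keep 7-bit characters; map the rest through the table and
    # drop anything the table does not know about.
    str = str.chars.map { |c| c.ord <= 127 ? c : TRANSLATION.fetch(c, "") }.join
  end
  # Collapse runs of whitespace, including U+00A0 (&nbsp;), then trim the ends.
  str.gsub(/[\s\u00A0]+/, " ").strip
end

puts clear_string_sketch("  Informaci\u00F3\u00A0\n general ", :encoding => 'ASCII')
# prints Informacio general
```

With the default "UTF8" option only the whitespace step runs, so accented characters are preserved.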

#clear_uri(uri) ⇒ Object

Returns a URI with any JavaScript invocation removed. For example:

"javascript:openDoc('/gisa/documentos/cartes/PT.DOC')" -> "/gisa/documentos/cartes/PT.DOC"


# File 'lib/mme_tools/webparse.rb', line 22

def clear_uri uri
  case uri
  when /Doc\('.*'\)/ then uri.match(/Doc\('(.*)'\)/).captures[0]
  else uri
  end
end
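
A quick usage sketch follows. The method body is copied so that the snippet runs on its own; in a real project the module would presumably be included instead.

```ruby
# Standalone copy of clear_uri so the example runs without the library.
def clear_uri(uri)
  case uri
  when /Doc\('.*'\)/ then uri.match(/Doc\('(.*)'\)/).captures[0]
  else uri
  end
end

# A JavaScript wrapper is unwrapped to the path it carries.
puts clear_uri("javascript:openDoc('/gisa/documentos/cartes/PT.DOC')")
# prints /gisa/documentos/cartes/PT.DOC

# A plain URI does not match the pattern and comes back unchanged.
puts clear_uri("/gisa/documentos/cartes/PT.DOC")
# prints /gisa/documentos/cartes/PT.DOC
```

Note that the pattern only looks for a function whose name ends in Doc, so other JavaScript invocations pass through untouched.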

#datify(str) ⇒ Object

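Extracts and returns the first probable DateTime found in a string. It expects day, month and year separated by "/" or "-", optionally followed by hour:minute. Two-digit years 00 to 69 are read as 2000 to 2069, and 70 to 99 as 1970 to 1999.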



# File 'lib/mme_tools/webparse.rb', line 78

def datify(str)
  pttrn = /(\d+)[\/-](\d+)[\/-](\d+)(\W+(\d+)\:(\d+))?/
  day, month, year, dummy, hour, min = str.match(pttrn).captures.map {|d| d ? d.to_i : 0 }
  case year
  when 0..69
    year += 2000
  when 70..99
    year += 1900
  end
  DateTime.civil year, month, day, hour, min
end
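
As written, datify raises NoMethodError when the string contains no date, because match returns nil. A nil-safe variant might look like the sketch below. The name safe_datify is hypothetical, and returning nil on failure is a design choice made for this example, not the library's behavior.

```ruby
require 'date'

# Hypothetical nil-safe variant of datify; not part of MMETools::Webparse.
def safe_datify(str)
  m = str.match(/(\d+)[\/-](\d+)[\/-](\d+)(\W+(\d+):(\d+))?/)
  return nil unless m                  # no date-like text found

  day, month, year = m[1].to_i, m[2].to_i, m[3].to_i
  hour, min = m[5].to_i, m[6].to_i     # nil.to_i is 0 when the time is absent

  case year
  when 0..69  then year += 2000        # 07 -> 2007
  when 70..99 then year += 1900        # 85 -> 1985
  end
  DateTime.civil(year, month, day, hour, min)
rescue ArgumentError                   # impossible dates such as 32/13/2001
  nil
end

puts safe_datify("Published 03/12/07 10:30")   # 3 December 2007, 10:30
puts safe_datify("no date here").inspect       # prints nil
```

The day-first ordering is kept from the original, so "03/12/07" is read as 3 December, not 12 March.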

#shorten(str) ⇒ Object

Transforms str into a shortened version: converts it to ASCII, treats every non-alphanumeric character as a word separator, and joins the first two letters of each word, capitalized.



# File 'lib/mme_tools/webparse.rb', line 72

def shorten(str)
  cleared_str = clear_string(str, :encoding => 'ASCII').gsub(/\W/," ")
  cleared_str.split(" ").map {|s| s[0..1].capitalize}.join
end
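
A small sketch of the behavior, assuming ASCII-only input. clear_string is replaced by a hypothetical whitespace-only stand-in, clear_ascii, so that the snippet runs on its own.

```ruby
# Hypothetical stand-in for clear_string: collapses whitespace only and
# assumes the input is already ASCII.
def clear_ascii(str)
  str.gsub(/\s+/, " ").strip
end

# Copy of the shorten logic, wired to the stand-in above.
def shorten(str)
  cleared_str = clear_ascii(str).gsub(/\W/, " ")
  cleared_str.split(" ").map { |s| s[0..1].capitalize }.join
end

puts shorten("medi ambient i habitatge")  # prints MeAmIHa
puts shorten("Web-parse tools")           # prints WePaTo
```

Unlike acronymize, shorten keeps single-letter words and stop words, which is why "i" contributes an "I" in the first example.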