Module: Sterile

Defined in:
lib/sterile/tags.rb,
lib/sterile/version.rb,
lib/sterile/entities.rb,
lib/sterile/titlecase.rb,
lib/sterile/utilities.rb,
lib/sterile/plain_format.rb,
lib/sterile/smart_format.rb,
lib/sterile/transliterate.rb,
lib/sterile/string_extensions.rb,
lib/sterile/data/codepoints_data.rb,
lib/sterile/data/html_entities_data.rb,
lib/sterile/data/plain_format_rules.rb,
lib/sterile/data/smart_format_rules.rb

Defined Under Namespace

Modules: StringExtensions

Constant Summary

VERSION = "1.0.26"

Class Method Summary

Class Method Details

.decode_entities(string) ⇒ Object

The reverse of encode_entities. Turns named and numeric (decimal or hexadecimal) HTML entities into their Unicode counterparts. Unrecognized named entities are left unchanged.



# File 'lib/sterile/entities.rb', line 26

def decode_entities(string)
  string.gsub!(/&#x([a-zA-Z0-9]{1,7});/) { [$1.to_i(16)].pack("U") }
  string.gsub!(/&#(\d{1,7});/) { [$1.to_i].pack("U") }
  string.gsub(/&([a-zA-Z0-9]+);/) do
    codepoint = html_entities_data[$1]
    codepoint ? [codepoint].pack("U") : $&
  end
end
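
The three decoding passes above can be sketched without the gem. This is a minimal illustration, not the library's implementation: `SAMPLE_ENTITIES` and `sample_decode_entities` are hypothetical names, the table holds only three entries, and non-bang `gsub` is used so the argument is not modified.

```ruby
# Hypothetical miniature entity table; the gem ships a much larger one.
SAMPLE_ENTITIES = { "amp" => 38, "ldquo" => 8220, "rdquo" => 8221 }.freeze

# Decode hex, decimal, and named entities, returning a new string.
def sample_decode_entities(string)
  string
    .gsub(/&#x([0-9a-fA-F]{1,6});/) { [$1.to_i(16)].pack("U") } # hex: &#x41;
    .gsub(/&#(\d{1,7});/) { [$1.to_i].pack("U") }               # decimal: &#66;
    .gsub(/&([a-zA-Z0-9]+);/) do
      codepoint = SAMPLE_ENTITIES[$1]
      codepoint ? [codepoint].pack("U") : $& # unknown names pass through
    end
end

puts sample_decode_entities("&#x41;&#66; &ldquo;hi&rdquo; &bogus;")
# prints: AB “hi” &bogus;
```

Because the sketch never calls `gsub!`, it also accepts frozen strings.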

.encode_entities(string) ⇒ Object

Turn Unicode characters into their HTML equivalents. If no named HTML entity exists for a character, a numeric entity is created instead.

%q{“Economy Hits Bottom,” ran the headline}.encode_entities # => &ldquo;Economy Hits Bottom,&rdquo; ran the headline


# File 'lib/sterile/entities.rb', line 12

def encode_entities(string)
  transmogrify(string) do |mapping, codepoint|
    if (32..126).include?(codepoint)
      mapping[0]
    else
      "&" + (mapping[2] || "#" + codepoint.to_s) + ";"
    end
  end
end
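
The branch above (printable ASCII passes through, everything else becomes a named or numeric entity) can be sketched in isolation. The names `SAMPLE_ENTITY_NAMES` and `sample_encode_entities` are hypothetical, and the three-entry table stands in for the gem's data.

```ruby
# Hypothetical codepoint => entity-name table; illustrative only.
SAMPLE_ENTITY_NAMES = { 233 => "eacute", 8220 => "ldquo", 8221 => "rdquo" }.freeze

def sample_encode_entities(string)
  string.unpack("U*").map do |codepoint|
    if (32..126).include?(codepoint)
      codepoint.chr # printable ASCII is kept as-is
    else
      # Prefer a named entity; fall back to a decimal numeric entity.
      "&" + (SAMPLE_ENTITY_NAMES[codepoint] || "#" + codepoint.to_s) + ";"
    end
  end.join
end

puts sample_encode_entities("\u201C\u00E9\u201D \u2603")
# prints: &ldquo;&eacute;&rdquo; &#9731;
```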

.gsub_tags(string, &block) ⇒ Object

Similar to gsub, except it works in between HTML/XML tags and yields text to a block. Text will be replaced by what the block returns. Warning: does not work in some degenerate cases.



# File 'lib/sterile/tags.rb', line 57

def gsub_tags(string, &block)
  raise "No block given" unless block_given?

  fragment = Nokogiri::HTML::DocumentFragment.parse string
  fragment.traverse do |node|
    node.content = yield(node.content) if node.text?
  end
  fragment.to_html
end
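
The idea of rewriting only the text between tags can be illustrated without Nokogiri. The sketch below swaps in a plain regex split, which is not what the listing above does and is less robust than a real parser; `sample_gsub_tags` is a hypothetical name.

```ruby
# Split on tags; the capture group keeps the tags in the result,
# so pieces alternate text, tag, text, tag, ...
def sample_gsub_tags(html)
  html.split(/(<[^>]*>)/).each_with_index.map do |piece, index|
    index.odd? ? piece : yield(piece) # odd positions hold the captured tags
  end.join
end

puts sample_gsub_tags("<p>hello <b>world</b></p>") { |text| text.upcase }
# prints: <p>HELLO <b>WORLD</b></p>
```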

.plain_format(string) ⇒ Object



# File 'lib/sterile/plain_format.rb', line 7

def plain_format(string)
  string = string.encode_entities
  plain_format_rules.each do |rule|
    string.gsub! rule[0], rule[1]
  end
  string
end

.plain_format_tags(string) ⇒ Object

Like plain_format, but works with HTML/XML (somewhat).



# File 'lib/sterile/plain_format.rb', line 18

def plain_format_tags(string)
  string.gsub_tags do |text|
    text.plain_format.decode_entities
  end.encode_entities
end

.scan_tags(string, &block) ⇒ Object

Iterates over all text in between HTML/XML tags and yields it to a block. Warning: does not work in some degenerate cases.



# File 'lib/sterile/tags.rb', line 72

def scan_tags(string, &block)
  raise "No block given" unless block_given?

  fragment = Nokogiri::HTML::DocumentFragment.parse string
  fragment.traverse do |node|
    yield(node.content) if node.text?
  end
  nil
end

.sluggerize(string, options = {}) ⇒ Object Also known as: to_slug

Transliterate to ASCII, downcase and format for URL permalink/slug by stripping out all non-alphanumeric characters and replacing spaces with a delimiter (defaults to ‘-’).

"Hello World!".sluggerize # => "hello-world"


# File 'lib/sterile/utilities.rb', line 32

def sluggerize(string, options = {})
  options = {
    :delimiter => "-"
  }.merge!(options)

  sterilize(string).strip.gsub(/\s+/, "-").gsub(/[^a-zA-Z0-9\-]/, "").gsub(/-+/, options[:delimiter]).downcase
end
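
The chain of substitutions above is easier to follow one step at a time. This sketch assumes the input is already ASCII (the gem first runs sterilize) and uses a hypothetical `sample_slug` with a keyword argument in place of the options hash.

```ruby
def sample_slug(string, delimiter: "-")
  string.strip
        .gsub(/\s+/, "-")              # whitespace runs become hyphens
        .gsub(/[^a-zA-Z0-9\-]/, "")    # drop everything but letters, digits, hyphens
        .gsub(/-+/, delimiter)         # collapse hyphen runs into the delimiter
        .downcase
end

puts sample_slug("Hello World!")
# prints: hello-world
puts sample_slug("  Ruby --  on   Rails ", delimiter: "_")
# prints: ruby_on_rails
```

Note that the delimiter is only applied in the last substitution, so hyphens already present in the input are converted to it as well.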

.smart_format(string) ⇒ Object

Format text with proper “curly” quotes, em dashes, copyright and trademark symbols, etc.

%q{"He said, 'Away with you, Drake!'"}.smart_format # => “He said, ‘Away with you, Drake!’”


# File 'lib/sterile/smart_format.rb', line 11

def smart_format(string)
  string = string.to_s
  string = string.dup if string.frozen?
  smart_format_rules.each do |rule|
    string.gsub! rule[0], rule[1]
  end
  string
end
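
The rule loop above treats each rule as a [pattern, replacement] pair. A minimal sketch of that shape follows; the three rules in `SAMPLE_RULES` are invented for illustration and are not taken from the gem's smart_format_rules.

```ruby
# Hypothetical rules in [pattern, replacement] form.
SAMPLE_RULES = [
  [/\(c\)/i,  "\u00A9"], # (c)  -> copyright sign
  [/\(tm\)/i, "\u2122"], # (tm) -> trademark sign
  [/\.\.\./,  "\u2026"], # ...  -> ellipsis
].freeze

def sample_format(string)
  result = string.to_s.dup # work on a copy so the caller's string is untouched
  SAMPLE_RULES.each { |pattern, replacement| result.gsub!(pattern, replacement) }
  result
end

puts sample_format("Acme(tm) (C) 2024...")
# prints: Acme™ © 2024…
```

Rules run in order, so an earlier rule can change what a later one sees.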

.smart_format_tags(string) ⇒ Object

Like smart_format, but works with HTML/XML (somewhat).



# File 'lib/sterile/smart_format.rb', line 23

def smart_format_tags(string)
  string = string.gsub(/[\p{Z}\s]+(<\/[a-zA-Z]+>)(['"][a-zA-Z])/, "\\1 \\2") # Fixes quote after whitespace + tag "<em>Dan. </em>'And"
  string.gsub_tags do |text|
    text.smart_format
  end.encode_entities.gsub(/(\<\/\w+\>)&ldquo;/, "\\1&rdquo;").gsub(/(\<\/\w+\>)&lsquo;/, "\\1&rsquo;")
end

.sterilize(string) ⇒ Object

Transliterate to ASCII and strip out any HTML/XML tags.

"<b>nåsty</b>".sterilize # => "nasty"


# File 'lib/sterile/utilities.rb', line 21

def sterilize(string)
  strip_tags(transliterate(string))
end

.strip_tags(string, options = {}) ⇒ Object

Remove HTML/XML tags from text. Also strips out comments, PHP and ERB style tags. CDATA is considered text unless :keep_cdata => false is specified. Redundant whitespace will be removed unless :keep_whitespace => true is specified.



# File 'lib/sterile/tags.rb', line 13

def strip_tags(string, options = {})
  options = {
    :keep_whitespace => false,
    :keep_cdata      => true
  }.merge!(options)

  string.gsub!(/<[%?](php)?[^>]*>/, '') # strip php, erb et al
  string.gsub!(/<!--[^-]*-->/, '')      # strip comments

  string.gsub!(
    /
      <!\[CDATA\[
      ([^\]]*)
      \]\]>
    /xi,
    options[:keep_cdata] ? '\\1' : ''
  )

  html_name = /[\w:-]+/
  html_data = /([A-Za-z0-9]+|('[^']*?'|"[^"]*?"))/
  html_attr = /(#{html_name}(\s*=\s*#{html_data})?)/

  string.gsub!(
    /
      <
      [\/]?
      #{html_name}
      (\s+(#{html_attr}(\s+#{html_attr})*))?
      \s*
      [\/]?
      >
    /xi,
    ''
  )

  options[:keep_whitespace] ? string : trim_whitespace(string)
end
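
How the two options interact can be shown with a much simpler stripper. This is a sketch under stated assumptions, not the listing above: `sample_strip_tags` is hypothetical, it uses a loose tag pattern, and it matches comments and CDATA non-greedily (the listed patterns use `[^-]*` and `[^\]]*`, which stop at the first hyphen or closing bracket in the body).

```ruby
def sample_strip_tags(html, keep_whitespace: false, keep_cdata: true)
  text = html.dup                                           # leave the argument untouched
  text.gsub!(/<!--.*?-->/m, "")                              # comments
  text.gsub!(/<!\[CDATA\[(.*?)\]\]>/m, keep_cdata ? '\1' : "") # CDATA: keep body or drop
  text.gsub!(/<\/?[\w:-]+[^>]*>/, "")                        # opening and closing tags
  keep_whitespace ? text : text.gsub(/\s+/, " ").strip
end

sample = "<p>Hi <!-- note --> <![CDATA[raw]]> there</p>"
puts sample_strip_tags(sample)
# prints: Hi raw there
puts sample_strip_tags(sample, keep_cdata: false)
# prints: Hi there
```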

.titlecase(string) ⇒ Object Also known as: titleize

Format text appropriately for titles. This method is much smarter than ActiveSupport’s titlecase. The algorithm is based on work by John Gruber et al. (daringfireball.net/2008/08/title_case_update).



# File 'lib/sterile/titlecase.rb', line 11

def titlecase(string)

  lsquo = [8216].pack("U")
  rsquo = [8217].pack("U")
  ldquo = [8220].pack("U")
  rdquo = [8221].pack("U")
  ndash = [8211].pack("U")

  string.strip!
  string.gsub!(/\s+/, " ")
  string.downcase! unless string =~ /[[:lower:]]/

  small_words = %w{ a an and as at(?!&t) but by en for if in nor of on or the to v[.]? via vs[.]? }.join("|")
  apos = / (?: ['#{rsquo}] [[:lower:]]* )? /xu

  string.gsub!(
    /
      \b
      ([_\*]*)
      (?:
        ( [-\+\w]+ [@.\:\/] [-\w@.\:\/]+ #{apos} )      # URL, domain, or email
        |
        ( (?i: #{small_words} ) #{apos} )               # or small word, case-insensitive
        |
        ( [[:alpha:]] [[:lower:]'#{rsquo}()\[\]{}]* #{apos} )  # or word without internal caps
        |
        ( [[:alpha:]] [[:alpha:]'#{rsquo}()\[\]{}]* #{apos} )  # or some other word
      )
      ([_\*]*)
      \b
    /xu
  ) do
    ($1 ? $1 : "") +
    ($2 ? $2 : ($3 ? $3.downcase : ($4 ? $4.downcase.capitalize : $5))) +
    ($6 ? $6 : "")
  end

  if RUBY_VERSION < "1.9.0"
    string.gsub!(
      /
        \b
        ([:alpha:]+)
        (#{ndash})
        ([:alpha:]+)
        \b
      /xu
    ) do
      $1.downcase.capitalize + $2 + $1.downcase.capitalize
    end
  end

  string.gsub!(
    /
      (
        \A [[:punct:]]*     # start of title
        | [:.;?!][ ]+       # or of subsentence
        | [ ]['"#{ldquo}#{lsquo}(\[][ ]*  # or of inserted subphrase
      )
      ( #{small_words} )    # followed by a small-word
      \b
    /xiu
  ) do
    $1 + $2.downcase.capitalize
  end

  string.gsub!(
    /
      \b
      ( #{small_words} )    # small-word
      (?=
        [[:punct:]]* \Z     # at the end of the title
        |
        ['"#{rsquo}#{rdquo})\]] [ ]       # or of an inserted subphrase
      )
    /xu
  ) do
    $1.downcase.capitalize
  end

  string.gsub!(
    /
      (
        \b
        [[:alpha:]]         # single first letter
        [\-#{ndash}]               # followed by a dash
      )
      ( [[:alpha:]] )       # followed by a letter
    /xu
  ) do
    $1 + $2.downcase
  end

  string.gsub!(/q&a/i, 'Q&A')

  string
end
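
The core small-word rule (lowercase small words unless they start or end the title, and leave words with existing capitals alone) can be sketched far more simply than the listing above. `sample_titlecase` and `SAMPLE_SMALL_WORDS` are hypothetical; punctuation, URLs, and subphrases are not handled here.

```ruby
SAMPLE_SMALL_WORDS = %w[a an and as at but by for if in nor of on or the to via].freeze

def sample_titlecase(title)
  words = title.strip.split(/\s+/)
  words.each_with_index.map do |word, index|
    edge = index.zero? || index == words.length - 1
    if SAMPLE_SMALL_WORDS.include?(word.downcase) && !edge
      word.downcase                  # small word in the middle of the title
    elsif word =~ /[[:upper:]]/
      word                           # already has capitals (e.g. iPhone): keep
    else
      word.capitalize
    end
  end.join(" ")
end

puts sample_titlecase("the lord of the rings")
# prints: The Lord of the Rings
puts sample_titlecase("what is it for")
# prints: What Is It For
```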

.transliterate(string, options = {}) ⇒ Object Also known as: to_ascii

Transliterate Unicode (and accented ASCII) characters to their plain-text ASCII equivalents. This is based on data from the stringex gem (github.com/rsl/stringex), which is in turn a port of Perl’s Unidecode and ostensibly provides superior results to iconv. The optical conversion data is based on work by Eric Boehs at github.com/ericboehs/to_slug. Passing :optical => true prefers the optical mapping over the more pedantic match.

"ýůçký".transliterate # => "yucky"


# File 'lib/sterile/transliterate.rb', line 34

def transliterate(string, options = {})
  options = {
    :optical => false
  }.merge!(options)

  if options[:optical]
    transmogrify(string) do |mapping, codepoint|
      mapping[1] || mapping[0] || ""
    end
  else
    transmogrify(string) do |mapping, codepoint|
      mapping[0] || mapping[1] || ""
    end
  end
end
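
The only difference between the two branches above is which mapping slot is tried first. A small sketch of that fallback order follows; `SAMPLE_PAIRS` and `sample_transliterate` are hypothetical, and the mapping values are invented for illustration rather than copied from the gem's data.

```ruby
# Hypothetical [pedantic, optical] pairs; values are invented, not the gem's.
SAMPLE_PAIRS = {
  "\u00E4" => ["ae", "a"], # a with diaeresis
  "\u00E9" => ["e", nil],  # e with acute: no separate optical form
}.freeze

def sample_transliterate(string, optical: false)
  string.each_char.map do |char|
    pedantic, optical_form = SAMPLE_PAIRS.fetch(char, [char, nil])
    # Try the preferred slot first, then fall back to the other one.
    optical ? (optical_form || pedantic) : (pedantic || optical_form)
  end.join
end

puts sample_transliterate("caf\u00E9 b\u00E4r")
# prints: cafe baer
puts sample_transliterate("caf\u00E9 b\u00E4r", optical: true)
# prints: cafe bar
```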

.transmogrify(string, &block) ⇒ Object



# File 'lib/sterile/transliterate.rb', line 7

def transmogrify(string, &block)
  raise "No block given" unless block_given?

  result = ""
  string.unpack("U*").each do |codepoint|
    cg = codepoint >> 8
    cp = codepoint & 0xFF
    begin
      mapping = Array(codepoints_data[cg][cp])
      result << yield(mapping, codepoint)
    rescue
    end
  end

  result
end
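
The lookup above splits each codepoint into a high part (`codepoint >> 8`) and a low byte (`codepoint & 0xFF`) and uses them as two keys. The sketch below shows that split against a hypothetical nested hash; the real shape of codepoints_data is not shown on this page, so `SAMPLE_CODEPOINTS` is an assumption, as is the name `sample_transmogrify`.

```ruby
# Hypothetical two-level table: high part => { low byte => [ascii, optical, entity] }
SAMPLE_CODEPOINTS = {
  0x00 => { 0xE5 => ["a", nil, "aring"], 0xFD => ["y", nil, "yacute"] },
}.freeze

def sample_transmogrify(string)
  string.unpack("U*").map do |codepoint|
    group = codepoint >> 8    # which 256-entry page
    index = codepoint & 0xFF  # position within that page
    mapping = Array(SAMPLE_CODEPOINTS.dig(group, index)) # [] when unmapped
    yield(mapping, codepoint)
  end.join
end

# Fall back to the original character when the sample table has no entry.
puts sample_transmogrify("n\u00E5sty") { |mapping, cp| mapping[0] || [cp].pack("U") }
# prints: nasty
```

Unlike the listing above, the sketch has no bare rescue, so an exception raised inside the block surfaces instead of silently dropping the character.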

.trim_whitespace(string) ⇒ Object

Trim whitespace from the start and end of a string and collapse any redundant whitespace in between.

" Hello  world! ".trim_whitespace # => "Hello world!"


# File 'lib/sterile/utilities.rb', line 12

def trim_whitespace(string)
  string.gsub(/\s+/, " ").strip
end