Module: Sterile
Defined in:
- lib/sterile/tags.rb
- lib/sterile/version.rb
- lib/sterile/entities.rb
- lib/sterile/titlecase.rb
- lib/sterile/utilities.rb
- lib/sterile/plain_format.rb
- lib/sterile/smart_format.rb
- lib/sterile/transliterate.rb
- lib/sterile/string_extensions.rb
- lib/sterile/data/codepoints_data.rb
- lib/sterile/data/html_entities_data.rb
- lib/sterile/data/plain_format_rules.rb
- lib/sterile/data/smart_format_rules.rb
Defined Under Namespace
Modules: StringExtensions
Constant Summary
- VERSION = "1.0.26"
Class Method Summary
- .decode_entities(string) ⇒ Object
  The reverse of encode_entities.
- .encode_entities(string) ⇒ Object
  Turn Unicode characters into their HTML equivalents.
- .gsub_tags(string, &block) ⇒ Object
  Similar to gsub, except it works in between HTML/XML tags and yields text to a block.
- .plain_format(string) ⇒ Object
- .plain_format_tags(string) ⇒ Object
  Like plain_format, but works with HTML/XML (somewhat).
- .scan_tags(string, &block) ⇒ Object
  Iterates over all text in between HTML/XML tags and yields it to a block.
- .sluggerize(string, options = {}) ⇒ Object (also: to_slug)
  Transliterate to ASCII, downcase and format for URL permalink/slug by stripping out all non-alphanumeric characters and replacing spaces with a delimiter (defaults to ‘-’).
- .smart_format(string) ⇒ Object
  Format text with proper “curly” quotes, em dashes, copyright, trademark, etc.
- .smart_format_tags(string) ⇒ Object
  Like smart_format, but works with HTML/XML (somewhat).
- .sterilize(string) ⇒ Object
  Transliterate to ASCII and strip out any HTML/XML tags.
- .strip_tags(string, options = {}) ⇒ Object
  Remove HTML/XML tags from text.
- .titlecase(string) ⇒ Object (also: titleize)
  Format text appropriately for titles.
- .transliterate(string, options = {}) ⇒ Object (also: to_ascii)
  Transliterate Unicode [and accented ASCII] characters to their plain-text ASCII equivalents.
- .transmogrify(string, &block) ⇒ Object
- .trim_whitespace(string) ⇒ Object
  Trim whitespace from start and end of string and remove any redundant whitespace in between.
Class Method Details
.decode_entities(string) ⇒ Object
The reverse of encode_entities. Turns named HTML entities or numeric entities into their Unicode counterparts.
# File 'lib/sterile/entities.rb', line 26

def decode_entities(string)
  string.gsub!(/&#x([a-zA-Z0-9]{1,7});/) { [$1.to_i(16)].pack("U") }
  string.gsub!(/&#(\d{1,7});/) { [$1.to_i].pack("U") }
  string.gsub(/&([a-zA-Z0-9]+);/) do
    codepoint = html_entities_data[$1]
    codepoint ? [codepoint].pack("U") : $&
  end
end
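The numeric branches above rely on Array#pack with the "U" directive to turn a codepoint into a UTF-8 character. The following is a minimal standalone sketch of only that step. The name decode_numeric_entities is hypothetical and not part of Sterile; as assumptions of this sketch, it returns a new string rather than modifying its argument, and it accepts only hexadecimal digits in the hex form.

```ruby
# Hypothetical helper (not part of Sterile): decodes numeric entities only.
# Named entities such as &amp; are left untouched because no lookup table is used.
def decode_numeric_entities(string)
  string
    .gsub(/&#x([0-9a-fA-F]{1,6});/) { [$1.to_i(16)].pack("U") } # hexadecimal form
    .gsub(/&#(\d{1,7});/) { [$1.to_i].pack("U") }                # decimal form
end

puts decode_numeric_entities("&#x201C;Hi&#8221; &amp; bye")
```

Because gsub is used instead of gsub!, the sketch also works on frozen strings.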
.encode_entities(string) ⇒ Object
Turn Unicode characters into their HTML equivalents. If no named HTML entity exists for a character, a numeric entity is used instead.
%q{“Economy Hits Bottom,” ran the headline}.encode_entities # => &ldquo;Economy Hits Bottom,&rdquo; ran the headline
# File 'lib/sterile/entities.rb', line 12

def encode_entities(string)
  transmogrify(string) do |mapping, codepoint|
    if (32..126).include?(codepoint)
      mapping[0]
    else
      "&" + (mapping[2] || "#" + codepoint.to_s) + ";"
    end
  end
end
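The numeric fallback can be illustrated without the gem's entity table. In this sketch, encode_numeric_entities is a hypothetical name, and every character outside printable ASCII becomes a decimal entity; the real method prefers a named entity when its data provides one.

```ruby
# Hypothetical helper (not part of Sterile): numeric entities only, no named-entity table.
def encode_numeric_entities(string)
  string.unpack("U*").map do |codepoint|
    if (32..126).include?(codepoint)
      codepoint.chr                     # printable ASCII passes through unchanged
    else
      "&#" + codepoint.to_s + ";"       # everything else becomes a decimal entity
    end
  end.join
end

puts encode_numeric_entities("caf\u00E9")
```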
.gsub_tags(string, &block) ⇒ Object
Similar to gsub, except it operates on the text between HTML/XML tags and yields each piece of text to a block. The text is replaced by whatever the block returns. Warning: does not work in some degenerate cases.
# File 'lib/sterile/tags.rb', line 57

def gsub_tags(string, &block)
  raise "No block given" unless block_given?

  fragment = Nokogiri::HTML::DocumentFragment.parse string
  fragment.traverse do |node|
    node.content = yield(node.content) if node.text?
  end
  fragment.to_html
end
.plain_format(string) ⇒ Object
# File 'lib/sterile/plain_format.rb', line 7

def plain_format(string)
  string = string.encode_entities
  plain_format_rules.each do |rule|
    string.gsub! rule[0], rule[1]
  end
  string
end
.plain_format_tags(string) ⇒ Object
Like plain_format, but works with HTML/XML (somewhat).
# File 'lib/sterile/plain_format.rb', line 18

def plain_format_tags(string)
  string.gsub_tags do |text|
    text.plain_format.decode_entities
  end.encode_entities
end
.scan_tags(string, &block) ⇒ Object
Iterates over all text in between HTML/XML tags and yields it to a block. Warning: does not work in some degenerate cases.
# File 'lib/sterile/tags.rb', line 72

def scan_tags(string, &block)
  raise "No block given" unless block_given?

  fragment = Nokogiri::HTML::DocumentFragment.parse string
  fragment.traverse do |node|
    yield(node.content) if node.text?
  end
  nil
end
.sluggerize(string, options = {}) ⇒ Object Also known as: to_slug
Transliterate to ASCII, downcase and format for URL permalink/slug by stripping out all non-alphanumeric characters and replacing spaces with a delimiter (defaults to ‘-’).
"Hello World!".sluggerize # => "hello-world"
# File 'lib/sterile/utilities.rb', line 32

def sluggerize(string, options = {})
  options = {
    :delimiter => "-"
  }.merge!(options)

  sterilize(string).strip.gsub(/\s+/, "-").gsub(/[^a-zA-Z0-9\-]/, "").gsub(/-+/, options[:delimiter]).downcase
end
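The chain of substitutions in sluggerize can be followed on plain ASCII input without the gem. The sketch below assumes the text has already been transliterated and stripped of tags (the job of sterilize); slug_from_ascii is a hypothetical name used only for illustration.

```ruby
# Hypothetical helper (not part of Sterile): the slug steps applied to ASCII-only input.
def slug_from_ascii(string, delimiter: "-")
  string
    .strip
    .gsub(/\s+/, "-")              # runs of whitespace become a single hyphen
    .gsub(/[^a-zA-Z0-9\-]/, "")    # drop everything that is not alphanumeric or a hyphen
    .gsub(/-+/, delimiter)         # collapse hyphen runs into the chosen delimiter
    .downcase
end

puts slug_from_ascii("Hello World!")
puts slug_from_ascii("  Foo   Bar ", delimiter: "_")
```

Hyphens are used as the intermediate separator, so a custom delimiter is applied only in the final substitution.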
.smart_format(string) ⇒ Object
Format text with proper “curly” quotes, em dashes, copyright, trademark, etc.
%q{"He said, 'Away with you, Drake!'"}.smart_format # => “He said, ‘Away with you, Drake!’”
# File 'lib/sterile/smart_format.rb', line 11

def smart_format(string)
  string = string.to_s
  string = string.dup if string.frozen?
  smart_format_rules.each do |rule|
    string.gsub! rule[0], rule[1]
  end
  string
end
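Both smart_format and plain_format walk a list of rules, each a pair of pattern and replacement, and apply them in order. The sketch below shows that pattern with three made-up rules; SAMPLE_RULES and apply_rules are hypothetical and do not reproduce the contents of Sterile's smart_format_rules.

```ruby
# Hypothetical rules for illustration only; the gem's actual rule data is not shown here.
SAMPLE_RULES = [
  [/\(c\)/i, "\u00A9"],   # (c)  -> copyright sign
  [/--/,     "\u2014"],   # --   -> em dash
  [/\.\.\./, "\u2026"]    # ...  -> ellipsis
].freeze

def apply_rules(string, rules = SAMPLE_RULES)
  result = string.to_s.dup   # work on a copy so the caller's string is untouched
  rules.each { |pattern, replacement| result.gsub!(pattern, replacement) }
  result
end

puts apply_rules("Copyright (c) 2010 -- all rights reserved...")
```

Rule order matters in such a scheme, since later patterns see the output of earlier ones.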
.smart_format_tags(string) ⇒ Object
Like smart_format, but works with HTML/XML (somewhat).
# File 'lib/sterile/smart_format.rb', line 23

def smart_format_tags(string)
  string = string.gsub(/[\p{Z}\s]+(<\/[a-zA-Z]+>)(['"][a-zA-Z])/, "\\1 \\2") # Fixes quote after whitespace + tag "<em>Dan. </em>'And"
  string.gsub_tags do |text|
    text.smart_format
  end.encode_entities.gsub(/(\<\/\w+\>)&ldquo;/, "\\1&rdquo;").gsub(/(\<\/\w+\>)&lsquo;/, "\\1&rsquo;")
end
.sterilize(string) ⇒ Object
Transliterate to ASCII and strip out any HTML/XML tags.
"<b>nåsty</b>".sterilize # => "nasty"
# File 'lib/sterile/utilities.rb', line 21

def sterilize(string)
  strip_tags(transliterate(string))
end
.strip_tags(string, options = {}) ⇒ Object
Remove HTML/XML tags from text. Also strips out comments, PHP and ERB style tags. CDATA is considered text unless :keep_cdata => false is specified. Redundant whitespace will be removed unless :keep_whitespace => true is specified.
# File 'lib/sterile/tags.rb', line 13

def strip_tags(string, options = {})
  options = {
    :keep_whitespace => false,
    :keep_cdata      => true
  }.merge!(options)

  string.gsub!(/<[%?](php)?[^>]*>/, '') # strip php, erb et al
  string.gsub!(/<!--[^-]*-->/, '') # strip comments

  string.gsub!(
    /
      <!\[CDATA\[
      ([^\]]*)
      \]\]>
    /xi,
    options[:keep_cdata] ? '\\1' : ''
  )

  html_name = /[\w:-]+/
  html_data = /([A-Za-z0-9]+|('[^']*?'|"[^"]*?"))/
  html_attr = /(#{html_name}(\s*=\s*#{html_data})?)/

  string.gsub!(
    /
      <
      [\/]?
      #{html_name}
      (\s+(#{html_attr}(\s+#{html_attr})*))?
      \s*
      [\/]?
      >
    /xi,
    ''
  )

  options[:keep_whitespace] ? string : trim_whitespace(string)
end
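The effect of the two options can be seen in a reduced, standalone sketch. strip_markup_sketch is a hypothetical name; it reuses the comment and CDATA patterns shown above but substitutes a much simpler tag pattern, so it is an illustration of the option handling rather than a replacement for strip_tags.

```ruby
# Hypothetical, simplified stand-in for strip_tags (not the gem's implementation).
def strip_markup_sketch(string, keep_cdata: true, keep_whitespace: false)
  text = string
    .gsub(/<!--[^-]*-->/, "")                                    # comments
    .gsub(/<!\[CDATA\[([^\]]*)\]\]>/i) { keep_cdata ? $1 : "" }  # CDATA kept as text or dropped
    .gsub(/<\/?[\w:-]+[^>]*>/, "")                               # simplified tag pattern (assumption)
  keep_whitespace ? text : text.gsub(/\s+/, " ").strip
end

puts strip_markup_sketch("<p>Hello <b>world</b><!-- note --></p>")
puts strip_markup_sketch("a <![CDATA[raw]]> b", keep_cdata: false)
```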
.titlecase(string) ⇒ Object Also known as: titleize
Format text appropriately for titles. This method is much smarter than ActiveSupport’s titlecase. The algorithm is based on work done by John Gruber et al. (daringfireball.net/2008/08/title_case_update).
# File 'lib/sterile/titlecase.rb', line 11

def titlecase(string)

  lsquo = [8216].pack("U")
  rsquo = [8217].pack("U")
  ldquo = [8220].pack("U")
  rdquo = [8221].pack("U")
  ndash = [8211].pack("U")

  string.strip!
  string.gsub!(/\s+/, " ")
  string.downcase! unless string =~ /[[:lower:]]/

  small_words = %w{ a an and as at(?!&t) but by en for if in nor of on or the to v[.]? via vs[.]? }.join("|")
  apos = / (?: ['#{rsquo}] [[:lower:]]* )? /xu

  string.gsub!(
    /
      \b
      ([_\*]*)
      (?:
        ( [-\+\w]+ [@.\:\/] [-\w@.\:\/]+ #{apos} )      # URL, domain, or email
        |
        ( (?i: #{small_words} ) #{apos} )               # or small word, case-insensitive
        |
        ( [[:alpha:]] [[:lower:]'#{rsquo}()\[\]{}]* #{apos} )  # or word without internal caps
        |
        ( [[:alpha:]] [[:alpha:]'#{rsquo}()\[\]{}]* #{apos} )  # or some other word
      )
      ([_\*]*)
      \b
    /xu
  ) do
    ($1 ? $1 : "") +
    ($2 ? $2 : ($3 ? $3.downcase : ($4 ? $4.downcase.capitalize : $5))) +
    ($6 ? $6 : "")
  end

  if RUBY_VERSION < "1.9.0"
    string.gsub!(
      /
        \b
        ([:alpha:]+)
        (#{ndash})
        ([:alpha:]+)
        \b
      /xu
    ) do
      $1.downcase.capitalize + $2 + $1.downcase.capitalize
    end
  end

  string.gsub!(
    /
      (
        \A [[:punct:]]*     # start of title
        | [:.;?!][ ]+       # or of subsentence
        | [ ]['"#{ldquo}#{lsquo}(\[][ ]*  # or of inserted subphrase
      )
      ( #{small_words} )    # followed by a small-word
      \b
    /xiu
  ) do
    $1 + $2.downcase.capitalize
  end

  string.gsub!(
    /
      \b
      ( #{small_words} )    # small-word
      (?=
        [[:punct:]]* \Z     # at the end of the title
        |
        ['"#{rsquo}#{rdquo})\]] [ ]  # or of an inserted subphrase
      )
    /xu
  ) do
    $1.downcase.capitalize
  end

  string.gsub!(
    /
      (
        \b
        [[:alpha:]]         # single first letter
        [\-#{ndash}]        # followed by a dash
      )
      ( [[:alpha:]] )       # followed by a letter
    /xu
  ) do
    $1 + $2.downcase
  end

  string.gsub!(/q&a/i, 'Q&A')

  string
end
.transliterate(string, options = {}) ⇒ Object Also known as: to_ascii
Transliterate Unicode [and accented ASCII] characters to their plain-text ASCII equivalents. This is based on data from the stringex gem (github.com/rsl/stringex), which is in turn a port of Perl’s Unidecode and ostensibly provides superior results to iconv. The optical conversion data is based on work by Eric Boehs at github.com/ericboehs/to_slug. Passing :optical => true prefers the optical mapping over the more pedantic matches.
"ýůçký".transliterate # => "yucky"
# File 'lib/sterile/transliterate.rb', line 34

def transliterate(string, options = {})
  options = {
    :optical => false
  }.merge!(options)

  if options[:optical]
    transmogrify(string) do |mapping, codepoint|
      mapping[1] || mapping[0] || ""
    end
  else
    transmogrify(string) do |mapping, codepoint|
      mapping[0] || mapping[1] || ""
    end
  end
end
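The :optical option only changes which slot of a mapping is tried first. The sketch below reproduces that preference order with a tiny made-up table; SAMPLE_MAPPINGS, its values, pick_mapping and transliterate_sketch are all hypothetical and are not taken from the gem's codepoint data.

```ruby
# Hypothetical table: codepoint => [pedantic, optical].
# The values are illustrative only and are NOT the gem's data.
SAMPLE_MAPPINGS = {
  0xE4 => ["ae", "a"],   # a with diaeresis
  0xE9 => ["e", nil]     # e with acute accent, no optical entry in this sketch
}.freeze

def pick_mapping(codepoint, optical: false)
  return codepoint.chr(Encoding::UTF_8) if codepoint < 128   # ASCII passes through here
  mapping = Array(SAMPLE_MAPPINGS[codepoint])
  chosen = optical ? (mapping[1] || mapping[0]) : (mapping[0] || mapping[1])
  chosen || ""                                               # unknown characters are dropped
end

def transliterate_sketch(string, optical: false)
  string.unpack("U*").map { |cp| pick_mapping(cp, optical: optical) }.join
end

puts transliterate_sketch("B\u00E4r")
puts transliterate_sketch("B\u00E4r", optical: true)
```

When the preferred slot is empty, the other slot is used, and a character with no mapping at all contributes an empty string.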
.transmogrify(string, &block) ⇒ Object
# File 'lib/sterile/transliterate.rb', line 7

def transmogrify(string, &block)
  raise "No block given" unless block_given?

  result = ""
  string.unpack("U*").each do |codepoint|
    cg = codepoint >> 8
    cp = codepoint & 0xFF
    begin
      mapping = Array(codepoints_data[cg][cp])
      result << yield(mapping, codepoint)
    rescue
    end
  end

  result
end
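The lookup above splits each codepoint into a high part (codepoint >> 8) and a low byte (codepoint & 0xFF), which suggests the data is organized in pages of 256 entries. The following sketch shows only that arithmetic; page_and_offset is a hypothetical helper name.

```ruby
# Hypothetical helper: splits a codepoint the same way transmogrify does before its lookup.
def page_and_offset(codepoint)
  [codepoint >> 8, codepoint & 0xFF]
end

"\u201CHi".unpack("U*").each do |codepoint|
  page, offset = page_and_offset(codepoint)
  puts format("U+%04X -> page 0x%02X, offset 0x%02X", codepoint, page, offset)
end
```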
.trim_whitespace(string) ⇒ Object
Trim whitespace from start and end of string and remove any redundant whitespace in between.
"  Hello   world!  ".trim_whitespace # => "Hello world!"
# File 'lib/sterile/utilities.rb', line 12

def trim_whitespace(string)
  string.gsub(/\s+/, " ").strip
end