Module: Normatron::Filters::KeepFilter

Extended by:
Helpers
Defined in:
lib/normatron/filters/keep_filter.rb

Class Method Summary collapse

Methods included from Helpers

acronym_regex, acronyms, evaluate_regexp, inflections, mb_send

Class Method Details

.evaluate(input, *properties) ⇒ String

TODO:

Raise exception for empty properties

Remove the characters that doesn’t match the given properties. The character properties follow the rule of \p{} construct described in Regexp class. The \p{} construct matches characters with the named property, much like POSIX bracket classes.

To pass named properties to this filter, use them as Symbols:

  • :Alnum - Alphabetic and numeric character

  • :Alpha - Alphabetic character

  • :Blank - Space or tab

  • :Cntrl - Control character

  • :Digit - Digit

  • :Graph - Non-blank character (excludes spaces, control characters, and similar)

  • :Lower - Lowercase alphabetical character

  • :Print - Like :Graph, but includes the space character

  • :Punct - Punctuation character

  • :Space - Whitespace character ([:blank:], newline, carriage return, etc.)

  • :Upper - Uppercase alphabetical

  • :XDigit - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)

  • :Word - A member of one of the following Unicode general category Letter, Mark, Number, Connector_Punctuation

  • :ASCII - A character in the ASCII character set

  • :Any - Any Unicode character (including unassigned characters)

  • :Assigned - An assigned character

A Unicode character’s General Category value can also be matched with :Ab where Ab is the category’s abbreviation as described below:

  • :L - ‘Letter’

  • :Ll - ‘Letter: Lowercase’

  • :Lm - ‘Letter: Mark’

  • :Lo - ‘Letter: Other’

  • :Lt - ‘Letter: Titlecase’

  • :Lu - ‘Letter: Uppercase

  • :Lo - ‘Letter: Other’

  • :M - ‘Mark’

  • :Mn - ‘Mark: Nonspacing’

  • :Mc - ‘Mark: Spacing Combining’

  • :Me - ‘Mark: Enclosing’

  • :N - ‘Number’

  • :Nd - ‘Number: Decimal Digit’

  • :Nl - ‘Number: Letter’

  • :No - ‘Number: Other’

  • :P - ‘Punctuation’

  • :Pc - ‘Punctuation: Connector’

  • :Pd - ‘Punctuation: Dash’

  • :Ps - ‘Punctuation: Open’

  • :Pe - ‘Punctuation: Close’

  • :Pi - ‘Punctuation: Initial Quote’

  • :Pf - ‘Punctuation: Final Quote’

  • :Po - ‘Punctuation: Other’

  • :S - ‘Symbol’

  • :Sm - ‘Symbol: Math’

  • :Sc - ‘Symbol: Currency’

  • :Sc - ‘Symbol: Currency’

  • :Sk - ‘Symbol: Modifier’

  • :So - ‘Symbol: Other’

  • :Z - ‘Separator’

  • :Zs - ‘Separator: Space’

  • :Zl - ‘Separator: Line’

  • :Zp - ‘Separator: Paragraph’

  • :C - ‘Other’

  • :Cc - ‘Other: Control’

  • :Cf - ‘Other: Format’

  • :Cn - ‘Other: Not Assigned’

  • :Co - ‘Other: Private Use’

  • :Cs - ‘Other: Surrogate’

Lastly, this method matches a character’s Unicode script. The following scripts are supported:

Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

Examples:

KeepFilter.evaluate("Doom 3", :L)      #=> "Doom"    keep only letters
KeepFilter.evaluate("Doom 3", :N)      #=> "3"       keep only numbers
KeepFilter.evaluate("Doom 3", :L, :N)  #=> "Doom3"   keep only letters and numbers
KeepFilter.evaluate("Doom 3", :Lu, :N) #=> "D3"      keep only uppercased letters or numbers
KeepFilter.evaluate("Doom ˩", :Latin)  #=> "Doom"    keep only latin characters

Using as ActiveRecord::Base normalizer

normalize :attribute_a, :with => [[:keep, :Lu]]
normalize :attribute_b, :with => [{:keep =>[:Lu]}]
normalize :attribute_c, :with => [:custom_filter, [:keep, :Ll, :Space]]
normalize :attribute_d, :with => [:custom_filter, {:keep => [:Ll, :Space]}]

Parameters:

  • input (String)

    A character sequence

  • properties ([Symbol]*)

    Array of Symbols equivalent to Regexp property for \p{} construct.

Returns:

  • (String)

    The clean character sequence or the object itself

See Also:



95
96
97
# File 'lib/normatron/filters/keep_filter.rb', line 95

def self.evaluate(input, *properties)
  input.kind_of?(String) ? evaluate_regexp(input, :keep, properties) : input
end