Class: Regex::Character

Inherits:
AtomicExpression
Defined in:
lib/regex/character.rb

Overview

A regular expression that matches a specific character in a given character set

Constant Summary

DigramSequences =

Table of all special two-character escape sequences, mapped to their codepoints

{
  "\\a" => 0x7, # alarm
  "\\n" => 0xA, # newline
  "\\r" => 0xD, # carriage return
  "\\t" => 0x9, # tab
  "\\e" => 0x1B, # escape
  "\\f" => 0xC, # form feed
  "\\v" => 0xB, # vertical tab
  # Single octal digit literals
  "\\0" => 0,
  "\\1" => 1,
  "\\2" => 2,
  "\\3" => 3,
  "\\4" => 4,
  "\\5" => 5,
  "\\6" => 6,
  "\\7" => 7
}.freeze
MetaChars =
'\^$+?.'.freeze
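
The keys of DigramSequences are two-character strings (a backslash followed by a letter or digit), not the control characters themselves. The short sketch below illustrates the quoting rules that make this work. The names SKETCH_DIGRAMS, SKETCH_META, ESCAPED_N and REAL_NEWLINE are hypothetical and exist only for this illustration; the hash is a partial local copy, not the library constant.

```ruby
# Illustrative copies of two entries; the full table is DigramSequences above.
SKETCH_DIGRAMS = { "\\n" => 0xA, "\\t" => 0x9 }.freeze
SKETCH_META = '\^$+?.'.freeze # same literal as MetaChars

# Single quotes keep the backslash, so this is a two-character string.
ESCAPED_N = '\n'.freeze
# Double quotes interpret the escape, so this is a one-character newline.
REAL_NEWLINE = "\n".freeze

puts ESCAPED_N.size                    # 2
puts REAL_NEWLINE.size                 # 1
puts ESCAPED_N == "\\n"                # true
puts SKETCH_DIGRAMS.fetch(ESCAPED_N)   # 10
puts SKETCH_DIGRAMS.key?(REAL_NEWLINE) # false
p SKETCH_META.chars                    # ["\\", "^", "$", "+", "?", "."]
```

In practice this means an escape sequence should be passed in single quotes (or with a doubled backslash in double quotes) so that the backslash reaches the lookup table intact.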

Instance Attribute Summary

Attributes inherited from Expression

#begin_anchor, #end_anchor

Class Method Summary

Instance Method Summary

Methods inherited from AtomicExpression

#atomic?

Methods inherited from Expression

#atomic?, #cardinality, #options, #to_str

Constructor Details

#initialize(aValue) ⇒ Character

Constructor.

[aValue] Initialize the character with either a String literal or a codepoint value.

Examples:

Initializing with a codepoint value:
  Regex::Character.new(0x3a3) # Represents: Σ (Unicode GREEK CAPITAL LETTER SIGMA)
  Regex::Character.new(931)   # Also represents: Σ (931 dec == 3a3 hex)

Initializing with a single-character string:
  Regex::Character.new(?\u03a3) # Also represents: Σ
  Regex::Character.new('Σ')     # Obviously, represents a Σ

Initializing with an escape sequence string:
  Regex::Character.new('\n')     # Represents a newline
  Regex::Character.new('\u03a3') # Represents a Σ

Recognized escaped characters are:
  \a (alarm, 0x07), \n (newline, 0xA), \r (carriage return, 0xD), \t (tab, 0x9), \e (escape, 0x1B), \f (form feed, 0xC), \v (vertical tab, 0xB)
  \uXXXX where XXXX is a 4-hex-digit integer value, \uX..., \ooo (octal), \xXX (hex)
Any other escaped character is treated as a literal character.



# File 'lib/regex/character.rb', line 58

def initialize(aValue)
  case aValue
    when String
      if aValue.size == 1
        # Literal single character case...
        @codepoint = self.class.char2codepoint(aValue)
      else
        # Should be an escape sequence...
        @codepoint = self.class.esc2codepoint(aValue)
      end
      @lexeme = aValue

    when Integer
      @codepoint = aValue
    else
      raise StandardError, "Cannot initialize a Character with a '#{aValue}'."
  end
end

Instance Attribute Details

#codepoint ⇒ Object (readonly)

The integer value that uniquely identifies the character.



# File 'lib/regex/character.rb', line 31

def codepoint
  @codepoint
end

#lexeme ⇒ Object (readonly)

The initial text representation of the character (if any).



# File 'lib/regex/character.rb', line 34

def lexeme
  @lexeme
end

Class Method Details

.char2codepoint(aChar) ⇒ Object

Conversion method that returns the codepoint for the given single character.

Example:
  Regex::Character::char2codepoint('Σ') # Returns: 0x3a3



# File 'lib/regex/character.rb', line 88

def self.char2codepoint(aChar)
  return aChar.ord
end
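
The conversion relies on String#ord, which returns the codepoint of the first character of a string. A minimal standalone sketch follows; the method name sketch_char2codepoint is hypothetical and stands in for the class method so the snippet runs without the library.

```ruby
# Stand-in for Character.char2codepoint, for illustration only.
def sketch_char2codepoint(a_char)
  a_char.ord
end

puts sketch_char2codepoint("\u03a3")          # 931
puts sketch_char2codepoint("\u03a3").to_s(16) # 3a3
puts sketch_char2codepoint('A')               # 65
puts sketch_char2codepoint('AB')              # 65 (only the first character counts)
```

Because only the first character is inspected, the caller is responsible for passing a single-character string, which is what the constructor checks before calling this method.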

.codepoint2char(aCodepoint) ⇒ Object

Conversion method that returns a character given a codepoint (integer) value.

Example:
  Regex::Character::codepoint2char(0x3a3) # Returns: Σ (the Unicode GREEK CAPITAL LETTER SIGMA)



# File 'lib/regex/character.rb', line 81

def self.codepoint2char(aCodepoint)
  # Remark: Integer#chr without an encoding argument fails for codepoints > 255
  return [aCodepoint].pack('U')
end
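
The remark about chr can be checked with a small standalone sketch. The method names sketch_codepoint2char and plain_chr_fails? are hypothetical helpers written for this illustration; only Array#pack and Integer#chr are standard Ruby.

```ruby
# Same technique as codepoint2char: pack one codepoint as a UTF-8 string.
def sketch_codepoint2char(a_codepoint)
  [a_codepoint].pack('U')
end

# Returns true when plain Integer#chr (no encoding argument) rejects the value.
def plain_chr_fails?(a_codepoint)
  a_codepoint.chr
  false
rescue RangeError
  true
end

puts sketch_codepoint2char(0x3a3)          # Σ
puts sketch_codepoint2char(0x3a3).encoding # UTF-8
puts 0x3a3.chr(Encoding::UTF_8)            # Σ
puts plain_chr_fails?(0x41)                # false
puts plain_chr_fails?(0x3a3)               # true when Encoding.default_internal is unset
```

Passing an explicit encoding to chr would be an alternative to pack('U'); the documented method uses pack, so the sketch does too.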

.esc2codepoint(anEscapeSequence) ⇒ Object

Conversion method that returns the codepoint for the given escape sequence (a String).

Recognized escaped characters are:
  \a (alarm, 0x07), \n (newline, 0xA), \r (carriage return, 0xD), \t (tab, 0x9), \e (escape, 0x1B), \f (form feed, 0xC), \v (vertical tab, 0xB)
  \uXXXX where XXXX is a 4-hex-digit integer value, \uX..., \ooo (octal), \xXX (hex)
Any other escaped character is treated as a literal character.

Example:
  Regex::Character::esc2codepoint('\n') # Returns: 0xa

Raises:

  • (StandardError)


# File 'lib/regex/character.rb', line 102

def self.esc2codepoint(anEscapeSequence)
  # The backslash is doubled so that it actually appears in the message
  msg = "Escape sequence #{anEscapeSequence} does not begin with a backslash (\\)."
  raise StandardError, msg unless anEscapeSequence[0] == "\\"
  result = (anEscapeSequence.length == 2) ? digram2codepoint(anEscapeSequence) : esc_number2codepoint(anEscapeSequence)

  return result
end
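
The helpers digram2codepoint and esc_number2codepoint are not shown on this page. The following is a rough sketch of how the two-way dispatch could behave, under stated assumptions: ESC_SKETCH_TABLE and sketch_esc2codepoint are hypothetical names, the numeric decoding via regular expressions with String#hex and String#oct is a guess at one possible approach, and raising on unsupported longer sequences is a choice made for this sketch rather than documented behavior.

```ruby
# Partial local copy of the digram table, for illustration only.
ESC_SKETCH_TABLE = { "\\n" => 0xA, "\\t" => 0x9 }.freeze

# Hypothetical stand-in for esc2codepoint and its private helpers.
def sketch_esc2codepoint(seq)
  msg = "Escape sequence #{seq} does not begin with a backslash (\\)."
  raise StandardError, msg unless seq[0] == "\\"

  if seq.length == 2
    # Unknown digrams fall back to the literal character after the backslash.
    ESC_SKETCH_TABLE.fetch(seq) { seq[1].ord }
  else
    case seq
    when /\A\\u\h{4}\z/     then seq[2..-1].hex # \uXXXX
    when /\A\\x\h{2}\z/     then seq[2..-1].hex # \xXX
    when /\A\\[0-7]{2,3}\z/ then seq[1..-1].oct # \ooo
    else raise StandardError, "Unsupported escape sequence #{seq}."
    end
  end
end

puts sketch_esc2codepoint('\n')     # 10
puts sketch_esc2codepoint('\u03a3') # 931
puts sketch_esc2codepoint('\x41')   # 65
puts sketch_esc2codepoint('\101')   # 65
puts sketch_esc2codepoint('\.')     # 46
```

All arguments are written in single quotes so that the backslash is part of the string handed to the method.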

Instance Method Details

#==(other) ⇒ Object

Returns true iff this Character and the parameter 'other' represent the same character.

[other] Any Object. The way equality is tested depends on the class of 'other'.

Example:
  newOne = Character.new(?\u03a3)
  newOne == newOne                 # true. Identity
  newOne == Character.new(?\u03a3) # true. Both have the same codepoint
  newOne == ?\u03a3                # true. The single-character String matches the char attribute exactly
  newOne == 0x03a3                 # true. The Integer is compared to the codepoint value

Equality is also tested against any other Object that responds to to_s.



# File 'lib/regex/character.rb', line 124

def ==(other)
  result = case other
    when Character
      self.to_str == other.to_str

    when Integer
      self.codepoint == other

    when String
      other.size > 1 ? false : to_str == other

    else
      # Unknown type: try with a conversion
      self == other.to_s # Recursive call
  end

  return result
end
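
The type-based dispatch can be tried out with a small stand-in class. SketchChar is hypothetical and exists only for this illustration; in particular, its to_str is assumed to return the character as a one-character String, whereas the real to_str is inherited from Expression and is not shown on this page.

```ruby
# Minimal stand-in mirroring the comparison rules of Character#==.
class SketchChar
  attr_reader :codepoint

  def initialize(codepoint)
    @codepoint = codepoint
  end

  # Assumption: the string form is the character itself.
  def to_str
    [codepoint].pack('U')
  end

  def ==(other)
    case other
    when SketchChar then to_str == other.to_str
    when Integer    then codepoint == other
    when String     then other.size > 1 ? false : to_str == other
    else self == other.to_s # Unknown type: retry with its string form
    end
  end
end

SIGMA = SketchChar.new(0x3a3)

puts SIGMA == SketchChar.new(931) # true
puts SIGMA == 0x3a3               # true
puts SIGMA == "\u03a3"            # true
puts SIGMA == "\u03a3\u03a3"      # false (more than one character)
puts SketchChar.new(0x61) == :a   # true (compared via :a.to_s)
```

The last line shows the fallback branch: a Symbol is neither a Character, an Integer nor a String, so it is converted with to_s and compared again.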

#char ⇒ Object

Returns the character as a String object.



# File 'lib/regex/character.rb', line 111

def char()
  self.class.codepoint2char(@codepoint)
end

#explain ⇒ Object

Returns a plain English description of the character.



# File 'lib/regex/character.rb', line 144

def explain()
  return "the character '#{to_str}'"
end