Class: PlainText::ParseRule
- Inherits:
- Object
- Defined in:
- lib/plain_text/parse_rule.rb
Overview
Note: it would be smarter if each instance (Regexp and Part) had its own “name”, rather than this class holding @names as an Array.
Class to describe rules to parse a String (and Array of them)
An instance (say, pr) of this class describes how a String (or Array of them) is parsed to a structure, that is, an Array of String or maybe Part, PlainText::Part::Paragraph, PlainText::Part::Boundary. Once pr is created, a String str is parsed as
ary = pr.apply(str)
which returns an Array (referred to as ary hereafter).
The returned array ary may contain plain Strings at the basic level. In that case, counting from 1, every even-numbered element (index 1, 3, ...) is semantically a PlainText::Part::Boundary and every odd-numbered element (index 0, 2, ...) is semantically a PlainText::Part::Paragraph or Part, which can be parsed further in later processing.
Alternatively, the returned array ary may contain PlainText::Part::Paragraph, PlainText::Part::Boundary, or even Part objects, depending on how the instance pr is constructed.
An instance pr consists of an Array of rules (which can be retrieved with #rules); each rule is either a Proc or a Regexp. A rule is applied to either a String (for the first application only) or an Array (for any subsequent application), the latter of which is usually, though not necessarily, the result of the previous applications, and an Array is returned.
Elements of #rules (most commonly #rules[0]) can be Regexp, in which case String#split is applied, as defined in the rule, to either the given String or every even-index element (starting from 0; these are all semantically Paragraphs) of the given Array, returning an Array. When an Array is given as the argument, this String#split manipulation in general increases the number of elements (Array#size). For example, suppose the given Array initially has two elements, String#split is applied to the first element only, and the split creates 5 elements. Then the returned Array has 6 elements.
For the second or later application, a Proc rule must assume that its argument is an Array (of String or even PlainText::SOMETHING objects) and process it accordingly.
For example, the predefined constant RuleConsecutiveLbs is one of the instances and it splits a String based on any consecutive linebreaks (it is typical to regard paragraphs as being separated by consecutive linebreaks). An example is like this:
pr.rules[0] # => The rule is: PlainText::ParseRule::RuleConsecutiveLbs.rules[0]
# Once applied, the returned Array is like
# ["My story\n======\nHere is my report.",
# "\n\n", "abc", "\n\n", "xyz"]
pr.rules[1] # => /(\n={4,}\n)/
# Once applied, the returned Array is like
# ["My story", "\n======\n", "Here is my report.",
# "\n\n", "abc", "\n\n", "xyz"]
Or another example may be like this:
pr.rules[0] # => The rule: PlainText::ParseRule::RuleConsecutiveLbs.rules[0]
# Once applied, the returned Array is like
# ["# Breaking! #\nBy Mary Smith\n======\nHere is my report.",
# "\n\n", "abc", "\n\n", "xyz"]
pr.rules[1] # => The rule: For the first element of the input argument (Array), if it has one "\n======\n",
# it is regarded as a (the first) boundary, and the text before
# is regarded as {PlainText::Part}. The returned Array is like
# [Part("# Breaking! #\nBy Mary Smith"),
# Boundary("\n======\n"),
# Paragraph("Here is my report."),
# "\n\n", "abc", "\n\n", "xyz"]
pr.rules[2] # => The rule: For the first element of the input argument (Array), if it satisfies /# (.+) #/,
# it is regarded as a title of a header. The returned Array is like
# [Part::Header(Paragraph(""), Boundary("# "), Paragraph::Title("Breaking!"), Boundary(" #\n")),
# Boundary(""),
# Paragraph("By Mary Smith"),
# Boundary("\n======\n"),
# Paragraph("Here is my report."),
# "\n\n", "abc", "\n\n", "xyz"]
With this, a Part instance can be created like:
pt1 = PlainText::Part.parse(str, rule: pr)
Then,
pt1.parts[0].parts[1] # => Paragraph::Title("Breaking!")
pt1.boundaries[1] # => Boundary("\n======\n")
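A Proc rule such as pr.rules[1] in the example above receives the whole Array and returns a new one. The following is a minimal sketch in plain Ruby that keeps every element a String rather than building Part, Boundary, or Paragraph objects; header_rule and its use of String#partition are illustrative assumptions, not the library's code.

```ruby
# Hypothetical Proc rule: split the first element at its first "\n======\n".
header_rule = lambda do |ary|
  head, sep, body = ary[0].partition("\n======\n")
  next ary if sep.empty?          # no header separator: leave the Array untouched
  [head, sep, body] + ary[1..-1]  # unfold in place, keeping Paragraph/Boundary order
end

input = ["# Breaking! #\nBy Mary Smith\n======\nHere is my report.",
         "\n\n", "abc", "\n\n", "xyz"]
result = header_rule.call(input)
p result
```

Because the rule returns an Array in the same Paragraph/Boundary alternation, a further rule can be applied to its result in the same way.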
Constant Summary
- RuleConsecutiveLbs =
PlainText::ParseRule instance to split a String with 2 or more linebreaks (with potentially white-spaces in between). This instance can be dup-ped and used normally. However, if it is clone-d, the cloned instance would be unmodifiable.
self.new(/((?:#{def_lb_q})(?:#{def_lb_q}|[[:blank:]])*(?:#{def_lb_q}))/, name: 'ConsecutiveLbs')
- RuleEachLineStrip =
PlainText::ParseRule instance to split a String with 1 linebreak that is potentially sandwiched with white-spaces (or a whitespace(s) at the very beginning or end). Essentially, each line (after Ruby-strip-ped) is treated as Paragraph. This instance can be dup-ped and used normally. However, if it is clone-d, the cloned instance would be unmodifiable.
self.new(/(\A[[:space:]]+|[[:space:]]*\n[[:space:]]*|[[:space:]]+\z)/, name: 'EachLineStrip')
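To see what the RuleEachLineStrip pattern does, the Regexp literal shown above can be passed directly to String#split. This is only a sketch of the underlying split, assuming a plain String input; the variable names are illustrative and the library itself is not loaded.

```ruby
# The Regexp literal shown for RuleEachLineStrip, used directly with String#split.
each_line_strip = /(\A[[:space:]]+|[[:space:]]*\n[[:space:]]*|[[:space:]]+\z)/

lines = "abc  \n  def \n ghi".split(each_line_strip)
p lines

# Even-index elements are the stripped lines; odd-index elements are the boundaries.
paragraphs = lines.each_slice(2).map(&:first)
p paragraphs
```

The whitespace around each linebreak ends up in the boundary elements, so the even-index elements are already stripped.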
Instance Attribute Summary
-
#names ⇒ Object
readonly
User-specified human-readable names Array, corresponding to each element of #rules.
-
#rules ⇒ Object
readonly
Main Array of rules (Proc or Regexp).
Instance Method Summary
-
#apply(inprm, index: nil, from_string: true, from_array: true) ⇒ Array
Apply the rules to a given String.
-
#clone ⇒ Object
Deeper clone.
- #clone_original_b4_parse_rule? ⇒ Object
-
#dup ⇒ Object
Deeper dup.
- #dup_original_b4_parse_rule? ⇒ Object
-
#initialize(rule = nil, name: nil) {|inprm| ... } ⇒ ParseRule
constructor
Constructor.
-
#pop(*rest) ⇒ Proc+
Pop a rule(s).
-
#push(*rule, name: nil) {|inprm| ... } ⇒ self
Add a rule(s).
-
#rule_at(key) ⇒ Proc, ...
Get a rule for the specified index or human-readable key.
-
#set_name_at(name, index_rules) ⇒ Integer
Set (or reset) a human-readable name for #rules at a specified index.
-
#size ⇒ Integer
The number of defined rules.
Constructor Details
#initialize(rule = nil, name: nil) {|inprm| ... } ⇒ ParseRule
Constructor
The main argument is a single Proc or Regexp, or an Array of them. Alternatively, a block can be given. If a Regexp is given, it should include grouping (usually enclosing the entire Regexp). If not, grouping is added forcibly.
Note that the private method #add_grouping wrongly recognizes patterns like /[(x]/ as containing grouping. Also, it does not issue a warning when more than one grouping is defined. In fact, multiple groupings might be useful in some cases; for example,
/(\n{2,})([^\n]*\S)([[:blank:]]*={2,}\n{2,})/
would produce, when applied, a series of
[Paragraph, Boundary("\n\n"), Paragraph::Title, Boundary("==\n\n")]
Just make sure the number of groupings is an odd number, though.
Optionally, when a non-Array argument or block is given, a name can be specified as the human-readable name for the rule.
# File 'lib/plain_text/parse_rule.rb', line 128
def initialize(rule=nil, name: nil, &rule_block)
  if defined?(rule.rules) && defined?(rule.names)
    # ParseRule given
    @rules = rule.rules.clone.map{|i| i.clone rescue i}  # Deep copy
    @names = rule.names.clone.map{|i| i.clone rescue i}  # Deep copy
    return
  end

  if defined? rule.to_ary
    # Array given
    @rules = rule
    @names = Array.new(@rules.size)
    return
  end

  @rules = []
  @names = []
  push(rule, name: name, &rule_block)
end
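The reason grouping matters can be shown with String#split alone: without a capture group the separators are discarded, so no Boundary would survive. The sketch below also runs the three-group pattern quoted above on a sample String. wrap_in_group is a hypothetical helper written for this illustration; it is not the library's private #add_grouping, whose implementation is not shown here.

```ruby
# Without a capture group, String#split discards the separators (boundaries).
p "a\n\nb".split(/\n{2,}/)      # => ["a", "b"]
p "a\n\nb".split(/(\n{2,})/)    # => ["a", "\n\n", "b"]

# Hypothetical helper: wrap a whole Regexp in one group
# (an illustration only, not the library's #add_grouping).
def wrap_in_group(regexp)
  Regexp.new("(#{regexp.source})", regexp.options)
end

grouped = wrap_in_group(/\n{2,}/)

# Three groups (an odd number) keep Paragraphs and Boundaries alternating.
multi = /(\n{2,})([^\n]*\S)([[:blank:]]*={2,}\n{2,})/
parts = "Intro\n\nMy Title ==\n\nBody".split(multi)
p parts
```

With an even number of groups, the element after each match would land on a Boundary position instead of a Paragraph position, which is why an odd number is required.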
Instance Attribute Details
#names ⇒ Object (readonly)
User-specified human-readable names Array, corresponding to each element of #rules. The elements of this Array are either String or nil, though a name can be referred to, or set with #set_name_at, as a Symbol. In other words, an element of #rules can be specified with a human-readable name, if set, as well as with its index. Use #rule_at to get a rule for the index/key.
# File 'lib/plain_text/parse_rule.rb', line 103
def names
  @names
end
#rules ⇒ Object (readonly)
Main Array of rules (Proc or Regexp). Do not add or delete its elements directly, as doing so would have a knock-on effect, especially on #names! Use #rule_at to get a rule for the index/key. The private method #rule_at(-1) is the same as #rules, but is more versatile and can be called like #rules_at(:my_rule1, :my_rule2).
# File 'lib/plain_text/parse_rule.rb', line 96
def rules
  @rules
end
Instance Method Details
#apply(inprm, index: nil, from_string: true, from_array: true) ⇒ Array
Apply the rules to a given String
By default, all the rules are applied in the registered order, unless an option (for example, index) is specified.
This method receives either a String (for the first application only) or an Array (for any subsequent application), the latter of which is usually, though not necessarily, the result of previous applications; it applies the #rules one by one sequentially and returns an Array.
Elements of #rules can be Regexp (most commonly #rules[0]). In that case, if the given argument is a String, String#split is simply applied. If it is an Array, String#split is applied to every even-index element (starting from 0; n.b., all even-index elements are semantically Paragraphs). Importantly, this String#split manipulation on an Array unfolds the result of the split on the spot, which means that in general it increases the number of elements (Array#size) relative to the given Array.
For example, suppose the given Array initially has two elements, so String#split is applied to the first element only (because it is the only even-index element). Suppose the split creates 3 elements. They are interpreted as a sequence of Paragraph, Boundary, and Paragraph, and the returned Array then contains 4 elements. Or, suppose the split of the first element of the given Array results in an Array of 4 elements. Then the last element of this Array and the next element of the original Array are both Boundaries. In this case, the two Boundaries are merged so that the elements of the returned Array are in the right order of Paragraphs and Boundaries.
# File 'lib/plain_text/parse_rule.rb', line 400
def apply(inprm, index: nil, from_string: true, from_array: true)
  allrules = (index ? rules_at(index) : @rules)
  arret = (inprm.class.method_defined?(:to_ary) ? inprm : [inprm])

  allrules.each do |each_r|
    arret = (defined?(each_r.match) ? apply_split(arret, each_r) : each_r.call(arret))
  end
  arret
end
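The unfolding and Boundary merging described above can be sketched in plain Ruby. apply_split_sketch is a hypothetical re-implementation written for illustration; it is not the library's private apply_split. It assumes that every element is a String and that merging two Boundaries means concatenating them.

```ruby
# Hypothetical re-implementation of the unfolding step (String elements only).
def apply_split_sketch(ary, regexp)
  out = []
  ary.each_with_index do |elem, i|
    if i.even?
      # Paragraph: split it and unfold the pieces on the spot.
      pieces = elem.split(regexp)
      pieces = [""] if pieces.empty?  # "".split returns []; keep a placeholder Paragraph
      out.concat(pieces)
    elsif out.size.even?
      # The split ended on a Boundary, so merge it with the following Boundary
      # (assumption: merging means concatenation).
      out[-1] += elem
    else
      out << elem
    end
  end
  out
end

rule = /(\n{2,})/
three = apply_split_sketch(["abc\n\nxyz", "\n======\n"], rule)
four  = apply_split_sketch(["abc\n\nxyz\n\n", "\n======\n"], rule)
p three
p four
```

In the first call the split yields 3 pieces and the result has 4 elements. In the second call the split yields 4 pieces ending on a Boundary, which is merged with the next Boundary, so the result again has 4 elements.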
#clone ⇒ Object
Deeper clone
Without this, if @rules or @names are modified in a cloned instance, even the original is affected.
# File 'lib/plain_text/parse_rule.rb', line 182
def clone
  ret = clone_original_b4_parse_rule?
  begin
    ret.instance_eval{ @rules = rules.clone }
    ret.instance_eval{ @names = names.clone }
  rescue FrozenError
    warn "Instances in the original remain frozen after clone."
  end
  ret
end
#clone_original_b4_parse_rule? ⇒ Object
# File 'lib/plain_text/parse_rule.rb', line 174
alias_method :clone_original_b4_parse_rule?, :clone
#dup ⇒ Object
Deeper dup
Without this, if @rules or @names are modified in a dupped instance, even the original is affected.
# File 'lib/plain_text/parse_rule.rb', line 202
def dup
  ret = dup_original_b4_parse_rule?
  ret.instance_eval{ @rules = rules.dup }
  ret.instance_eval{ @names = names.dup }
  ret
end
#dup_original_b4_parse_rule? ⇒ Object
# File 'lib/plain_text/parse_rule.rb', line 194
alias_method :dup_original_b4_parse_rule?, :dup
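The need for a deeper dup, and the difference between dup and clone on a frozen instance, can be shown with a small stand-in class. RuleHolder and shallow_dup are hypothetical names used only in this sketch; the pattern of aliasing the original method before overriding it follows the source listings above.

```ruby
# Hypothetical stand-in class holding an Array in an instance variable.
class RuleHolder
  attr_reader :rules

  def initialize(*rules)
    @rules = rules
  end

  # Keep Ruby's default (shallow) dup reachable under another name.
  alias_method :shallow_dup, :dup

  # Deeper dup: the copy gets its own Array.
  def dup
    copy = shallow_dup
    copy.instance_eval { @rules = rules.dup }
    copy
  end
end

orig = RuleHolder.new(/(a)/)

deep = orig.dup
deep.rules << /(b)/          # does not touch orig.rules
size_after_deep = orig.rules.size

shallow = orig.shallow_dup
shallow.rules << /(c)/       # shares the Array, so orig.rules grows too
size_after_shallow = orig.rules.size

frozen = RuleHolder.new(/(a)/).freeze
p frozen.clone.frozen?       # clone keeps the frozen state
p frozen.dup.frozen?         # dup does not
```

The last two lines illustrate why a frozen instance can be dup-ped and then used normally, whereas its clone stays unmodifiable.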
#pop(*rest) ⇒ Proc+
Pop a rule(s)
# File 'lib/plain_text/parse_rule.rb', line 333
def pop(*rest)
  if (rest.size == 0)
    (@rules.size > 0) ? @names.slice!((@rules.size-1)..-1) : @names.clear
  else
    i_beg = @rules.size - rest[0]
    i_beg = 0 if i_beg < 0
    @names.slice!(i_beg..-1)
  end
  (rest.size == 0) ? @rules.pop : @rules.pop(*rest)
end
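The listing above trims the names before popping the rules so that the two Arrays stay the same length. A minimal sketch of that bookkeeping with two plain Arrays follows; pop_both is a hypothetical stand-in, not a library method.

```ruby
rules = [/(a)/, /(b)/, /(c)/]
names = ["first", nil, "third"]

# Hypothetical stand-in mirroring #pop: trim the names first, then pop the rules.
pop_both = lambda do |n = nil|
  count = n || 1
  start = [rules.size - count, 0].max
  names.slice!(start..-1)
  n ? rules.pop(n) : rules.pop   # Array#pop(n) returns an Array, Array#pop an element
end

popped_two = pop_both.call(2)
names_after_two = names.dup
popped_one = pop_both.call
p popped_two
p popped_one
```

As with Array#pop, the return value is a single rule when no argument is given and an Array of rules otherwise.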
#push(*rule, name: nil) {|inprm| ... } ⇒ self
Add a rule(s)
If Regexp is given, it should include grouping (to enclose the entire Regexp usually). If not, grouping is added forcibly. Or, Proc or block can be given. Consecutive rules can be given. Note if a rule(s) is given, a block is ignored even if present.
Any given rules, except the very first one, where the Proc argument is a String, should assume the Proc argument is an Array. If Regexp is given for the second or later one, it will raise an Exception when #apply-ed.
Optionally, provided a non-Array argument or a block is given, a name can be specified as the human-readable name for the rule.
# File 'lib/plain_text/parse_rule.rb', line 228
def push(*rule, name: nil, &rule_block)
  #if rule.size > 1
  #  rule.each do |each_r|
  #    push each_r, rule_block
  #  end
  #  return self
  #end
  push_rule_core(*rule, &rule_block)
  set_name_at(name, -1) if !rules.empty?

  # rulesize = ((0 != rule.size) ? rule.size : (block_given? ? 1 : 0))
  ### print "DEBUG-p: rulesize=#{rulesize}\n"
  # arnames = (name ? [name].flatten : [])
  # ((-rulesize)..-1).each_with_index do |i_rule, i_given|
  #   set_name_at(arnames[i_given], i_rule)
  # end if !rule.empty?
  self
end
#rule_at(key) ⇒ Proc, ...
Get a rule for the specified index or human-readable key
# File 'lib/plain_text/parse_rule.rb', line 303
def rule_at(key)
  begin
    ( defined?(key.to_int) ? @rules[key.to_int] : @rules[@names.find_index(key.to_s)] )
  rescue TypeError  # no implicit conversion from nil to integer
    nil
    # raise TypeError, "Specified key (#{key.inspect}) is not found for the rules among the registered names=#{@names.inspect}"
  end
end
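The lookup in the listing above treats Integer-like keys as indices and everything else as a name, compared after conversion with to_s. A sketch with two plain Arrays follows; the local rule_at lambda is a hypothetical stand-in that uses respond_to? in place of defined?.

```ruby
rules = [/(\n{2,})/, ->(ary) { ary }]
names = ["ConsecutiveLbs", nil]

# Hypothetical stand-in mirroring #rule_at.
rule_at = lambda do |key|
  begin
    key.respond_to?(:to_int) ? rules[key.to_int] : rules[names.find_index(key.to_s)]
  rescue TypeError
    nil  # unknown name: find_index returned nil, which cannot index an Array
  end
end

p rule_at.call(0)
p rule_at.call(:ConsecutiveLbs)
p rule_at.call("missing")
```

A Symbol and a String key find the same rule, and an unknown name yields nil rather than raising.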
#set_name_at(name, index_rules) ⇒ Integer
Set (or reset) a human-readable name for #rules at a specified index
# File 'lib/plain_text/parse_rule.rb', line 285
def set_name_at(name, index_rules)
  index = PlainText::Util.positive_array_index_checked(index_rules, @rules, accept_too_big: false, varname: 'rules')
  if !name
    @names[index] = nil
    return index
  end
  ns = name.to_s
  index_exist = @names.find_index(ns)
  raise "Name #{ns} is already used for the index #{index}" if index_exist && (index_exist != index)
  @names[index] = ns
  index
end
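The naming logic above (normalize the index, allow nil to reset a name, reject a name already used elsewhere) can be sketched without the library. set_name_sketch is a hypothetical function, and its simple index handling is only an assumed stand-in for PlainText::Util.positive_array_index_checked, whose exact behavior is not shown here.

```ruby
# Hypothetical stand-in for #set_name_at, operating on two plain Arrays.
def set_name_sketch(names, rules, name, index)
  # Assumed stand-in for the library's index check: accept negative indices,
  # reject anything outside the rules Array.
  idx = index.negative? ? rules.size + index : index
  raise IndexError, "index #{index} is out of range" unless (0...rules.size).cover?(idx)

  if name.nil?
    names[idx] = nil   # reset the name
    return idx
  end

  ns = name.to_s
  taken = names.find_index(ns)
  raise "Name #{ns} is already used at index #{taken}" if taken && taken != idx
  names[idx] = ns
  idx
end

rules = [/(a)/, /(b)/]
names = [nil, nil]
first_index = set_name_sketch(names, rules, :first, 0)
last_index  = set_name_sketch(names, rules, "second", -1)
p names
```

Names are stored as Strings even when given as Symbols, which is why a later lookup can accept either form.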
#size ⇒ Integer
Returns the number of defined rules.
# File 'lib/plain_text/parse_rule.rb', line 458
def size
  si_rules = rules.size
  si_names = names.size
  if si_rules != si_names
    warn "WARNING: Inconsistent sizes for between rules (#{si_rules}) and names (#{si_names})."
  end
  si_rules
end