Module: PlainText

Includes:
Util
Defined in:
lib/plain_text.rb,
lib/plain_text/part.rb,
lib/plain_text/util.rb,
lib/plain_text/split.rb,
lib/plain_text/parse_rule.rb,
lib/plain_text/part/boundary.rb,
lib/plain_text/part/paragraph.rb

Overview

Utility methods for mainly line-based processing of String

This module contains methods useful for processing a String object read from a text file, that is, a String that contains all of a text file or a multi-line part of one. The methods include normalizing the line-break codes, removing extra spaces from each line, and so on. Many of the methods work on a line-by-line basis. For example, the #head and #tail methods work like the respective UNIX shell commands, returning the specified number of lines from the head or tail of self.

Most methods in this module are meant to be included in String, except for a few module functions. It is, however, debatable whether including a third-party module in a core class is good practice. As an alternative, this module provides a helper module function, PlainText.extend_this, with which an individual object can extend this module in its singleton class if the module is not already included.

A few methods in this module assume that Split is included in String, which is the case by default as soon as this file is read (by Ruby’s require).

Author:

  • Masa Sakano (Wise Babel Ltd)

Defined Under Namespace

Modules: Split, Util

Classes: ParseRule, Part

Constant Summary

DefLineBreaks =

List of the default line breaks.

[ "\r\n", "\n", "\r" ]
DEF_HEADTAIL_N_LINES =

Default number of lines to extract for #head and #tail

10

Class Method Summary

Instance Method Summary

Methods included from Util

even_odd_arrays, positive_array_index, positive_array_index_checked, raise_typeerror

Class Method Details

.__call_inst_method__(method, instr, *rest, **k) ⇒ #instr

Call instance method as a Module function

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 35

def self.__call_inst_method__(method, instr, *rest, **k)
  newself = instr.clone
  PlainText.extend_this(newself)
  newself.public_send(method, *rest, **k)
end
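
The clone-extend-dispatch pattern used here can be sketched in isolation. In the sketch below, the module Shout and its methods are hypothetical stand-ins, not part of PlainText. The point is only that the input is cloned, the clone alone is extended, and the method is then sent to the clone, so the caller's object is left untouched.

```ruby
# Hypothetical stand-in module (NOT part of PlainText).
module Shout
  def shout(suffix = "!")
    upcase + suffix
  end

  # Clone the input, extend only the clone, then dispatch the method to it.
  def self.call_inst_method(method, instr, *rest, **k)
    newself = instr.clone
    newself.extend(Shout) unless newself.respond_to?(method)
    newself.public_send(method, *rest, **k)
  end
end

original = "hello".dup
result = Shout.call_inst_method(:shout, original, "?")
puts result                        # the shouted copy
puts original.respond_to?(:shout)  # false: the original is not extended
```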

.clean_text(prt, preserve_paragraph: true, boundary_style: true, lbs_style: :truncate, lb_is_space: false, sps_style: :truncate, delete_asian_space: true, linehead_style: :delete, linetail_style: :delete, firstsps_style: :delete, lastsps_style: :truncate, lb: $/, lb_out: nil, is_debug: false) ⇒ Object

Cleans the text

For example, it removes extra spaces and normalises the linebreaks.

By default,

  • Paragraphs (separated by 2 or more consecutive \n) are taken into account: preserve_paragraph=true

  • Blank lines are truncated into one line with no white spaces: boundary_style=lb_out*2(=$/*2)

  • Consecutive white spaces are truncated into a single space: sps_style=:truncate

  • White spaces before or after a CJK character are deleted: delete_asian_space=true

  • Preceding white spaces in each line are deleted: linehead_style=:delete

  • Trailing white spaces in each line are deleted: linetail_style=:delete

  • Preceding line-breaks and white spaces at the beginning of the entire input string are deleted: firstsps_style=:delete

  • Trailing white spaces and line-breaks at the end of the entire input string are truncated into a single linebreak: lastsps_style=:truncate

For a String with predominantly CJK characters, the following setting is recommended:

  • lbs_style: :delete

  • delete_asian_space: true (Default)

Note that each Symbol value for the optional arguments can be abbreviated to its first character, e.g., :d instead of :delete (nb., :t2 for :truncate2).

For more detail, see the description.
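
To make the defaults listed above concrete, here is a rough plain-Ruby approximation of a few of them. It is a sketch under stated assumptions, not the library's implementation. The helper name rough_clean is hypothetical, and it ignores paragraph parsing, CJK handling and all the style options.

```ruby
# Rough approximation of a few default steps; NOT the library code.
def rough_clean(text)
  s = text.gsub(/\r\n?/, "\n")                   # normalise line breaks to \n
  s = s.gsub(/^[[:blank:]]+|[[:blank:]]+$/, "")  # strip the head and tail of each line
  s = s.gsub(/[[:blank:]]{2,}/, " ")             # runs of blanks -> a single space
  s = s.gsub(/\n{3,}/, "\n\n")                   # runs of blank lines -> one blank line
  s = s.sub(/\A\s+/, "")                         # delete leading blanks and line breaks
  s.sub(/\s+\z/, "\n")                           # trailing ones -> a single line break
end

input = "\n  ab   cd \r\n \n\n  ef\n \n\n"
puts rough_clean(input).inspect   # => "ab cd\n\nef\n"
```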



# File 'lib/plain_text.rb', line 105

def self.clean_text(
      prt,
      preserve_paragraph: true,
      boundary_style: true,  # If unspecified, will be replaced with lb_out * 2
      lbs_style: :truncate,
      lb_is_space: false,
      sps_style: :truncate,
      delete_asian_space: true,
      linehead_style: :delete, 
      linetail_style: :delete, 
      firstsps_style: :delete,
      lastsps_style:  :truncate,
      lb: $/,
      lb_out: nil,           # If unspecified, will be replaced with lb
      is_debug: false
    )

#isdebug = true if prt == "\n  ab\n  \ncd\n \n  \n ef\n \n  \n   \n  gh\n \n \n \n" #DEBUG
  lb_out ||= lb  # Output linebreak
  boundary_style = lb_out*2 if true       == boundary_style
  boundary_style = ""       if [:delete, :d].include? boundary_style
  lastsps_style  = lb_out   if :linebreak == lastsps_style 

  if !prt.class.method_defined? :last_significant_element
    # Construct a Part instance from the given String.
    ret = ''
    prt = prt.unicode_normalize
    prt = normalize_lb(prt, "\n", lb_from: (DefLineBreaks.include?(lb) ? nil : lb)).dup
    kwd = (["\r\n", "\r", "\n"].include?(lb) ? {} : { rules: /#{Regexp.quote lb}{2,}/})
    prt = (preserve_paragraph ? Part.parse(prt, **kwd) : Part.new([prt]))
  else
    # If not preserve_paragraph, reconstructs it as a Part with a single Paragraph.
    # Also, deepcopy is needed, as this method is destructive.
    prt = (preserve_paragraph ? prt : Part.new([prt.join])).deepcopy
  end
  prt.squash_boundaryies!  # Boundaries are squashed.

  # Handles Boundary
  clean_text_boundary!(prt, boundary_style: boundary_style)

  # Handles linebreaks and spaces (within Paragraphs)
  clean_text_lbs_sps!( prt,
    lbs_style: lbs_style,
    lb_is_space: lb_is_space,
    sps_style: sps_style,
    delete_asian_space: delete_asian_space,
  )
  # Handles the line head/tails.
  clean_text_line_head_tail!( prt,
    linehead_style: linehead_style,
    linetail_style: linetail_style
  )

  # Handles the file head/tail.
  clean_text_file_head_tail!( prt,
    firstsps_style: firstsps_style,
    lastsps_style:  lastsps_style,
  )

  # Replaces the linebreaks to the specified one
  prt.map{ |i| i.gsub!(/\n/m, lb_out) }

  (ret ? prt.join : prt)  # prt.to_s may be different from prt.join
end

.count_char(instr, *rest, lbs_style: :delete, lastsps_style: :delete, lb_out: "\n", **k) ⇒ Integer

Module function of #count_char



# File 'lib/plain_text.rb', line 56

def self.count_char(instr, *rest,
               lbs_style: :delete,
               lastsps_style: :delete,
               lb_out: "\n",
               **k)
  clean_text(instr, *rest, lbs_style: lbs_style, lastsps_style: lastsps_style, lb_out: lb_out, **k).size
end

.delete_spaces_bw_cjk_european(instr, *rest) ⇒ Object

Module function of #delete_spaces_bw_cjk_european



# File 'lib/plain_text.rb', line 174

def self.delete_spaces_bw_cjk_european(instr, *rest)
  __call_inst_method__(:delete_spaces_bw_cjk_european, instr, *rest)
end

.extend_this(obj) ⇒ TrueClass, NilClass

If the class of obj does not “include” this module, extends obj with it in obj's singleton class.



# File 'lib/plain_text.rb', line 45

def self.extend_this(obj)
  return nil if defined? obj.delete_spaces_bw_cjk_european! 
  obj.extend(PlainText)
  true
end
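
The guard and the return values can be illustrated with a self-contained sketch. The module LineTools and its first_line method are hypothetical stand-ins for PlainText. The sketch assumes only what the listing above shows: the object is extended once, true is returned when that happens, and nil is returned when nothing had to be done.

```ruby
# Hypothetical stand-in for a mix-in module (NOT part of PlainText).
module LineTools
  def first_line
    lines.first.to_s.chomp
  end

  # Extend obj in its singleton class unless it already has the methods.
  # Returns true if obj was extended, nil if nothing had to be done.
  def self.extend_this(obj)
    return nil if obj.respond_to?(:first_line)
    obj.extend(LineTools)
    true
  end
end

text = "alpha\nbeta\n".dup
first_call  = LineTools.extend_this(text)   # extends text
second_call = LineTools.extend_this(text)   # already extended
puts first_call.inspect, second_call.inspect, text.first_line
puts "other".respond_to?(:first_line)       # false: String itself is untouched
```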

.head(instr, *rest, **k) ⇒ Object

Module function of #head

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 186

def self.head(instr, *rest, **k)
  return PlainText.__call_inst_method__(:head, instr, *rest, **k)
end

.head_inverse(instr, *rest, **k) ⇒ Object

Module function of #head_inverse

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 198

def self.head_inverse(instr, *rest, **k)
  return PlainText.__call_inst_method__(:head_inverse, instr, *rest, **k)
end

.normalize_lb(instr, *rest, **k) ⇒ Object

Module function of #normalize_lb

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 210

def self.normalize_lb(instr, *rest, **k)
  return PlainText.__call_inst_method__(:normalize_lb, instr, *rest, **k)
end

.tail(instr, *rest, **k) ⇒ Object

Module function of #tail

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 222

def self.tail(instr, *rest, **k)
  return PlainText.__call_inst_method__(:tail, instr, *rest, **k)
end

.tail_inverse(instr, *rest, **k) ⇒ Object

Module function of #tail_inverse

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 234

def self.tail_inverse(instr, *rest, **k)
  return PlainText.__call_inst_method__(:tail_inverse, instr, *rest, **k)
end

Instance Method Details

#count_char(*rest, lbs_style: :delete, lastsps_style: :none, lb_out: "\n", **k) ⇒ Integer

Count the number of characters

See #clean_text! for the optional parameters. The defaults of a few of the optional parameters differ from those of #clean_text!; for example, the default for lb_out is “\n” (so that a line-break is 1 byte in size). The defaults are chosen with East-Asian (CJK) characters in mind, given that this method is most useful for CJK Strings, whereas for European alphabets counting the number of words, rather than characters as in this method, would be more standard.
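
The effect of the line-break setting on a character count can be seen with plain String methods. This is only an illustration of why the line break is normalised (or deleted) before counting. It does not call the library, and the sample text is made up.

```ruby
text = "日本語\r\nテキスト\r\n"

normalized = text.gsub(/\r\n?/, "\n")     # every line break becomes a single "\n"
without_lb = normalized.delete("\n")      # line breaks removed altogether

puts text.size         # 11: each "\r\n" counts as two characters
puts normalized.size   # 9:  each line break counts as one character
puts without_lb.size   # 7:  only the CJK characters are counted
```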



# File 'lib/plain_text.rb', line 439

def count_char(*rest,
               lbs_style: :delete,
               lastsps_style: :none,
               lb_out: "\n",
               **k)
  PlainText.clean_text(self, *rest, lbs_style: lbs_style, lastsps_style: lastsps_style, lb_out: lb_out, **k).size
end

#delete_spaces_bw_cjk_european(*rest) ⇒ Object

Non-destructive version of #delete_spaces_bw_cjk_european!



# File 'lib/plain_text.rb', line 466

def delete_spaces_bw_cjk_european(*rest)
  newself = clone
  newself.delete_spaces_bw_cjk_european!(*rest)
  newself
end

#delete_spaces_bw_cjk_european!(repl = "") ⇒ MatchData, NilClass

Delete all the spaces between CJK and European characters or numbers.

All the spaces between CJK and European characters, numbers or punctuation marks are deleted or converted into a specified replacement character. In short, any spaces between, before, or after CJK characters are deleted. If the return is non-nil, there is at least one match.
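
The behaviour described above can be sketched in plain Ruby as follows. The helper name drop_cjk_european_spaces and the two character-class constants are hypothetical. The sketch applies both directions (CJK then European, and European then CJK) unconditionally and honours the replacement string, which is an assumption about the intended behaviour rather than a copy of the library code.

```ruby
# Character classes written as strings so that they can be interpolated.
CJK = '[\p{Hiragana}\p{Katakana}\p{Han}\p{Hangul}ー]'
EUR = '[[:upper:][:lower:][:digit:][:punct:]]'

# Non-destructive sketch; NOT the library code.
def drop_cjk_european_spaces(str, repl = "")
  str.gsub(/(#{CJK})[[:blank:]]+(#{EUR})/) { "#{$1}#{repl}#{$2}" }   # CJK, blanks, European
     .gsub(/(#{EUR})[[:blank:]]+(#{CJK})/) { "#{$1}#{repl}#{$2}" }   # European, blanks, CJK
end

puts drop_cjk_european_spaces("日本語 abc 漢字")   # 日本語abc漢字
puts drop_cjk_european_spaces("東京 2020", "_")    # 東京_2020
puts drop_cjk_european_spaces("abc def")           # unchanged: no CJK character involved
```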



# File 'lib/plain_text.rb', line 456

def delete_spaces_bw_cjk_european!(repl="")
  ret = gsub!(/(\p{Hiragana}|\p{Katakana}|[ー-]|[一-龠々]|\p{Han}|\p{Hangul})([[:blank:]]+)([[:upper:][:lower:][:digit:][:punct:]])/, '\1\3')
  ret ||= gsub!(/([[:upper:][:lower:][:digit:][:punct:]])([[:blank:]]+)(\p{Hiragana}|\p{Katakana}|[ー-]|[一-龠々]|\p{Han}|\p{Hangul})/, '\1\3')
end

#head(num_in = DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, linebreak: $/) ⇒ String

Returns the first num lines (or characters, bytes), or everything but the last n lines.

If “byte” is specified as the return unit, the encoding is the same as self, though the returned String may no longer be valid in that encoding. Note that for the units of “char” and “byte” it is probably better practice to use string[0, 5] and string#byteslice(0, 5), respectively, instead of this method.

For num, a negative number means counting from the last (e.g., -1 (lines, if unit is :line) means everything but the last 1 line, and -5 means everything but the last 5 lines), whereas 0 is forbidden. If too large a negative number is given, such as -9 for a String of 2 lines, an empty string is returned.

If unit is :line, num can be a Regexp, in which case the string of the lines up to the first line that matches the given Regexp is returned, where the process is based on the lines. For example, if num is /ABC/ (Regexp), a String of the lines from the beginning up to the line that contains the string “ABC” is returned.
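
A minimal plain-Ruby sketch of the line-based behaviour, for a positive Integer and for a Regexp, might look as follows. The helper name rough_head is hypothetical, the matching line is treated as inclusive, and returning the whole text when the Regexp does not match is an assumption of this sketch, not documented behaviour.

```ruby
# Rough stand-in for line-based head; NOT the library code.
def rough_head(text, num = 10, linebreak: "\n")
  lines = text.lines(linebreak)            # each element keeps its line break
  case num
  when Integer
    raise ArgumentError, "num must be positive" if num < 1
    lines.first(num).join
  when Regexp
    idx = lines.index { |line| line =~ num }
    idx ? lines[0..idx].join : text        # no match: whole text (an assumption)
  else
    raise TypeError, "Integer or Regexp expected"
  end
end

text = "one\ntwo ABC\nthree\n"
puts rough_head(text, 2).inspect       # => "one\ntwo ABC\n"
puts rough_head(text, /ABC/).inspect   # => "one\ntwo ABC\n"
puts rough_head(text, 9).inspect       # => the whole text
```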



# File 'lib/plain_text.rb', line 501

def head(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, linebreak: $/)
  if num_in.class.method_defined? :to_int
    num = num_in.to_int
    raise ArgumentError, "Non-positive num (#{num_in}) is given in #{__method__}" if num.to_int < 1
  elsif num_in.class.method_defined? :named_captures
    re_in = num_in
  else
    raise raise_typeerror(num_in, 'Integer or Range')
  end

  case unit
  when :line, "-n"
    # Regexp (for boundary)
    return head_regexp(re_in, inclusive: inclusive, linebreak: linebreak) if re_in

    # Integer (a number of lines)
    ret = split(linebreak)[0..(num-1)].join(linebreak)
    return ret if size <= ret.size  # Specified line is larger than the original or the last NL is missing.
    return(ret << linebreak)  # NL is added to the tail as in the original.
  when :char
    return self[0..(num-1)]
  when :byte, "-c"
    return self.byteslice(0..(num-1))
  else
    raise ArgumentError, "Specified unit (#{unit.inspect}) is invalid in #{__method__}"
  end
end

#head!(*rest, **key) ⇒ self

Destructive version of #head



# File 'lib/plain_text.rb', line 477

def head!(*rest, **key)
  replace(head(*rest, **key))
end

#head_inverse(*rest, **key) ⇒ Object

Inverse of head - returns the content except for the first num lines (or characters, bytes)



# File 'lib/plain_text.rb', line 542

def head_inverse(*rest, **key)
  s2 = head(*rest, **key)
  (s2.size >= size) ? '' : self[s2.size..-1]
end
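
The relationship between the two methods is that the head part and its inverse together reproduce the original string. The sketch below shows this with plain String methods. The first two lines stand in for head(2), and no library call is made.

```ruby
text = "a\nb\nc\n"

# Stand-in for head(2): the first two lines, line breaks included.
head_part = text.lines.first(2).join

# Same slicing as in the listing above: everything after the head part.
rest_part = head_part.size >= text.size ? "" : text[head_part.size..-1]

puts head_part.inspect                 # => "a\nb\n"
puts rest_part.inspect                 # => "c\n"
puts(head_part + rest_part == text)    # true
```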

#head_inverse!(*rest, **key) ⇒ self

Destructive version of #head_inverse



# File 'lib/plain_text.rb', line 534

def head_inverse!(*rest, **key)
  replace(head_inverse(*rest, **key))
end

#normalize_lb(*rest, **k) ⇒ Object

Non-destructive version of #normalize_lb!



# File 'lib/plain_text.rb', line 571

def normalize_lb(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are retained.
  newself.normalize_lb!(*rest, **k)
  newself
end

#normalize_lb!(repl = $/, lb_from: nil) ⇒ MatchData, NilClass

Normalizes line-breaks

All the line-breaks of self are converted into a new character or \n. If the return is non-nil, self contains unexpected line-break characters for the OS.
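
As an illustration of the conversion itself, the following plain-Ruby sketch replaces every kind of default line break in a single pass. The helper name rough_normalize_lb is hypothetical. Because it uses one alternation rather than one pass per line-break type, a replacement that itself contains a line break is not converted a second time.

```ruby
# Non-destructive sketch; NOT the library code.
def rough_normalize_lb(text, repl = "\n")
  # "\r\n" is listed first so that it is treated as one line break.
  text.gsub(/\r\n|\r|\n/, repl)
end

mixed = "a\r\nb\rc\n"
puts rough_normalize_lb(mixed).inspect           # => "a\nb\nc\n"
puts rough_normalize_lb(mixed, "\r\n").inspect   # => "a\r\nb\r\nc\r\n"
```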



# File 'lib/plain_text.rb', line 556

def normalize_lb!(repl=$/, lb_from: nil)
  ret = nil
  lb_from ||= DefLineBreaks
  lb_from = [lb_from].flatten
  lb_from.each do |ea_lb|
    gsub!(/#{ea_lb}/, repl) if ($/ != ea_lb) || ($/ == ea_lb && repl != ea_lb)
    ret = $~ if ($/ != ea_lb) && !ret
  end
  ret
end

#strip_at_lines(*rest, **k) ⇒ Object

Non-destructive version of #strip_at_lines!



# File 'lib/plain_text.rb', line 596

def strip_at_lines(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are retained.
  newself.strip_at_lines!(*rest, **k)
  newself
end

#strip_at_lines!(strip_head: true, strip_tail: true, markdown: false, linebreak: $/) ⇒ self, NilClass

String#strip! for each line



# File 'lib/plain_text.rb', line 585

def strip_at_lines!(strip_head: true, strip_tail: true, markdown: false, linebreak: $/)
  strip_head = false if markdown
  r1 = strip_at_lines_head!(                    linebreak: linebreak) if strip_head
  r2 = strip_at_lines_tail!(markdown: markdown, linebreak: linebreak) if strip_tail
  (r1 || r2) ? self : nil
end

#strip_at_lines_head(*rest, **k) ⇒ Object

Non-destructive version of #strip_at_lines_head!



# File 'lib/plain_text.rb', line 616

def strip_at_lines_head(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are retained.
  newself.strip_at_lines_head!(*rest, **k)
  newself
end

#strip_at_lines_head!(linebreak: $/) ⇒ self, NilClass

String#strip! for each line but only for the head part (NOT tail part)



# File 'lib/plain_text.rb', line 607

def strip_at_lines_head!(linebreak: $/)
  lb_quo = Regexp.quote linebreak
  gsub!(/(\A|#{lb_quo})[[:blank:]]+/m, '\1')
end

#strip_at_lines_tail(*rest, **k) ⇒ Object

Non-destructive version of #strip_at_lines_tail!



# File 'lib/plain_text.rb', line 640

def strip_at_lines_tail(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are retained.
  newself.strip_at_lines_tail!(*rest, **k)
  newself
end

#strip_at_lines_tail!(markdown: false, linebreak: $/) ⇒ self, NilClass

String#strip! for each line but only for the tail part (NOT head part)



# File 'lib/plain_text.rb', line 627

def strip_at_lines_tail!(markdown: false, linebreak: $/)
  lb_quo = Regexp.quote linebreak
  return gsub!(/(?<=^|[^[:blank:]])[[:blank:]]+(#{lb_quo}|\z)/m, '\1') if ! markdown

  r1 = gsub!(/(?<=^|[^[:blank:]])[[:blank:]]{3,}(#{lb_quo}|\z)/m, '\1')
  r2 = gsub!(/(?<=^|[^[:blank:]])[[:blank:]](#{lb_quo}|\z)/m, '\1')
  (r1 || r2) ? self : nil
end
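
With markdown: true, the two regular expressions in the listing above remove runs of three or more trailing blanks and single trailing blanks, so a run of exactly two trailing blanks (a Markdown hard line break) survives. A per-line plain-Ruby sketch of that rule follows. The helper name rough_strip_tail is hypothetical and the sketch is non-destructive, unlike the method above.

```ruby
# Rough per-line stand-in; NOT the library code.
def rough_strip_tail(text, markdown: false)
  text.lines.map { |line|
    body  = line.chomp                       # the line without its line break
    lb    = line[body.size..-1]              # the line break itself ("" if none)
    trail = body[/[[:blank:]]*\z/]           # trailing blanks, possibly ""
    keep  = (markdown && trail.size == 2) ? trail : ""
    body.sub(/[[:blank:]]*\z/, "") + keep + lb
  }.join
end

sample = "a \nb  \nc   \n"
puts rough_strip_tail(sample).inspect                   # => "a\nb\nc\n"
puts rough_strip_tail(sample, markdown: true).inspect   # => "a\nb  \nc\n"
```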

#tail(num_in = DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, linebreak: $/) ⇒ String

Returns the last num lines (or characters, bytes), or everything but the first n lines.

If “byte” is specified as the return unit, the encoding is the same as self, though the returned String may no longer be valid in that encoding. Note that for the units of “char” and “byte” it is probably better practice to use string[-5..-1] and string#byteslice(-5, 5), respectively, instead of this method.

For num, a negative number means counting from the first (e.g., -1 [lines, if unit is :line] means everything but the first 1 line, and -5 means everything but the first 5 lines), whereas 0 is forbidden. If too large a negative number is given, such as -9 for a String of 2 lines, an empty string is returned.

If unit is :line, num can be a Regexp, in which case the string of the lines after the first line that matches the given Regexp is returned (not inclusive), where the process is based on the lines. For example, if num is /ABC/, a String of the lines from the line following the first line that contains the string “ABC” to the last line is returned. “The next line” means (1) the line immediately after the match if the matched string has the linebreak at the end, or (2) the line after the first linebreak after the matched string, where the trailing characters from the end of the matched string to the linebreak (inclusive) are ignored.

Tips

To specify the last line that matches the Regexp, consider prefixing (?:.*) with the option m, e.g., /(?:.*)ABC/m

Note for developers

The line that matches the Regexp has to be exclusive, because otherwise it would be impossible in principle to specify the last line that matches. For example, to specify the last line that matches ABC, the given Regexp should be /(?:.*)ABC/m (see the Tips above); in this case, if the matched line were inclusive, all the lines from Line 1 would be included, which is most likely not what the caller wants.
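
The rules above, including the tip for selecting the last matching line, can be tried out with a small plain-Ruby sketch. The helper name rough_tail is hypothetical, only the :line unit with "\n" line breaks is covered, and returning an empty string when the Regexp does not match is an assumption of this sketch.

```ruby
# Rough stand-in for line-based tail; NOT the library code.
def rough_tail(text, num = 10)
  lines = text.lines
  case num
  when Integer
    raise ArgumentError, "num of zero is not allowed" if num.zero?
    # Positive: the last num lines. Negative: everything but the first -num lines.
    num > 0 ? lines.last(num).join : lines.drop(-num).join
  when Regexp
    m = num.match(text)
    return "" unless m                        # no match: "" is an assumption
    rest = m.post_match
    return rest if m[0].end_with?("\n")       # case (1): match ends with the line break
    i = rest.index("\n")                      # case (2): skip the rest of the matched line
    i ? rest[(i + 1)..-1] : ""
  else
    raise TypeError, "Integer or Regexp expected"
  end
end

text = "one\ntwo ABC\nthree\nfour ABC\nfive\n"
puts rough_tail(text, 2).inspect              # => "four ABC\nfive\n"
puts rough_tail(text, -1).inspect             # => everything but the first line
puts rough_tail(text, /ABC/).inspect          # => "three\nfour ABC\nfive\n"
puts rough_tail(text, /(?:.*)ABC/m).inspect   # => "five\n" (after the LAST match)
```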



# File 'lib/plain_text.rb', line 689

def tail(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, linebreak: $/)
  if num_in.class.method_defined? :to_int
    num = num_in.to_int
    raise ArgumentError, "num of zero is given in #{__method__}" if num == 0
    num += 1 if num < 0
  elsif num_in.class.method_defined? :named_captures
    re_in = num_in
  else
    raise raise_typeerror(num_in, 'Integer or Range')
  end

  case unit
  when :line, '-n'
    # Regexp (for boundary)
    return tail_regexp(re_in, inclusive: inclusive, linebreak: linebreak) if re_in

    # Integer (a number of lines)
    return tail_linenum(num_in, num, linebreak: linebreak)
  when :char
    num = 0 if num >= size && num_in > 0
    return self[(-num)..-1]
  when :byte, '-c'
    num = 0 if num >= bytesize && num_in > 0
    return self.byteslice((-num)..-1)
  else
    raise ArgumentError, "Specified unit (#{unit.inspect}) is invalid in #{__method__}"
  end
end

#tail!(*rest, **key) ⇒ self

Destructive version of #tail



# File 'lib/plain_text.rb', line 651

def tail!(*rest, **key)
  replace(tail(*rest, **key))
end

#tail_inverse(*rest, **key) ⇒ Object

Inverse of tail - returns the content except for the last num lines (or characters, bytes)



# File 'lib/plain_text.rb', line 730

def tail_inverse(*rest, **key)
  s2 = tail(*rest, **key)
  (s2.size >= size) ? '' : self[0..(size-s2.size-1)]
end

#tail_inverse!(*rest, **key) ⇒ self

Destructive version of #tail_inverse



# File 'lib/plain_text.rb', line 722

def tail_inverse!(*rest, **key)
  replace(tail_inverse(*rest, **key))
end