Module: PlainText

Includes:
Util
Defined in:
lib/plain_text.rb,
lib/plain_text/part.rb,
lib/plain_text/util.rb,
lib/plain_text/split.rb,
lib/plain_text/parse_rule.rb,
lib/plain_text/part/boundary.rb,
lib/plain_text/part/paragraph.rb

Overview

Utility methods for mainly line-based processing of String

This module contains methods useful for processing a String object that holds a text file, that is, a String that contains all of, or a multiple-line part of, a text file. The methods include normalizing the line-break codes, removing extra spaces from each line, etc. Many of the methods work on a line-by-line basis. For example, the #head and #tail methods work like the respective UNIX-shell commands, returning the specified number of lines from the head/tail of self.

Most methods in this module are meant to be included in String, except for a few module functions. It is however debatable whether it is a good practice to include a third-party module in the core class. This module contains a helper module function PlainText.extend_this, with which an object extends this module easily as Singleton if this module is not already included.
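The difference between including a module in String and extending a single object can be illustrated with plain Ruby. The sketch below uses a hypothetical stand-in module, LineTools, rather than PlainText itself, and shows that Object#extend adds the methods to one object's singleton class only.

```ruby
# LineTools is a hypothetical stand-in module, not part of PlainText.
module LineTools
  # Returns the first line without its line terminator.
  def first_line
    lines.first.to_s.chomp
  end
end

text  = String.new("alpha\nbeta\n")
other = String.new("gamma\n")

text.extend(LineTools)  # only this one object gains #first_line

puts text.first_line                           # prints "alpha"
puts other.respond_to?(:first_line)            # prints "false"
puts text.singleton_class.include?(LineTools)  # prints "true"
puts String.include?(LineTools)                # prints "false"
```

The core String class stays untouched, which is the point of extending per object.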

A few methods in this module assume that Split is included in String, which is the case by default as soon as this file is read (by Ruby’s require).

Author:

  • Masa Sakano (Wise Babel Ltd)

Defined Under Namespace

Modules: Split, Util

Classes: ParseRule, Part

Constant Summary

DefLineBreaks =

List of the default line breaks.

[ "\r\n", "\n", "\r" ]
DEF_HEADTAIL_N_LINES =

Default number of lines to extract for #head and #tail

10
DEF_METHOD_OPTS =

Default options for class/instance methods

{
  :clean_text => {
    preserve_paragraph: true,
    boundary_style: true,  # If unspecified, will be replaced with lb_out * 2
    lbs_style: :truncate,
    lb_is_space: false,
    sps_style: :truncate,
    delete_asian_space: true,
    linehead_style: :none,
    linetail_style: :delete,
    firstlbs_style: :delete,
    lastsps_style:  :truncate,
    lb: $/,
    lb_out: nil,           # If unspecified, will be replaced with lb
  },
  :count_char => {
    lbs_style: :delete,
    linehead_style: :delete,
    lastsps_style: :delete,
    lb_out: "\n",
  },
}

Class Method Summary

Instance Method Summary

Methods included from Util

even_odd_arrays, positive_array_index, positive_array_index_checked, raise_typeerror

Class Method Details

.__call_inst_method__(method, instr, *rest, **k) ⇒ #instr

Call instance method as a Module function

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 65

def self.__call_inst_method__(method, instr, *rest, **k)
  newself = instr.clone
  PlainText.extend_this(newself)
  newself.public_send(method, *rest, **k)
end
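The pattern in the listing above (clone the input, extend the clone, dispatch by method name) can be tried without the library. The following is a minimal sketch with a hypothetical module Shout and a hypothetical wrapper call_on_copy; neither name exists in PlainText.

```ruby
# Shout and call_on_copy are hypothetical names used for illustration only.
module Shout
  def shout(suffix = "!")
    upcase + suffix
  end

  # Module-function wrapper: work on a clone so the caller's object is untouched.
  def self.call_on_copy(method, str, *rest, **opts)
    copy = str.clone
    copy.extend(Shout) unless copy.respond_to?(method)
    copy.public_send(method, *rest, **opts)
  end
end

original = String.new("hello")
result   = Shout.call_on_copy(:shout, original, "?")

puts result                        # prints "HELLO?"
puts original.respond_to?(:shout)  # prints "false"
```

Because only the clone is extended, the original object neither changes nor gains new methods.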

.clean_text(prt, preserve_paragraph: , boundary_style: , lbs_style: , lb_is_space: , sps_style: , delete_asian_space: , linehead_style: , linetail_style: , firstlbs_style: , lastsps_style: , lb: , lb_out: , is_debug: false) ⇒ Object

Cleans the text

This includes removing extra spaces, normalising the line breaks, etc.

By default:

  • Paragraphs (more than 2 \n) are taken into account (one \n between two): preserve_paragraph=true

  • Blank lines are truncated into one line with no white spaces: boundary_style=lb_out*2(=$/*2)

  • Consecutive white spaces are truncated into a single space: sps_style=:truncate

  • White spaces before or after a CJK character are deleted: delete_asian_space=true

  • Preceding white spaces in each line are preserved: linehead_style=:none

  • Trailing white spaces in each line are deleted: linetail_style=:delete

  • Line-breaks at the beginning of the entire input string are deleted: firstlbs_style=:delete

  • Trailing white spaces and line-breaks at the end of the entire input string are truncated into a single linebreak: lastsps_style=:truncate

For a String with predominantly CJK characters, the following setting is recommended:

  • lbs_style: :delete

  • delete_asian_space: true (Default)

Note that for the Symbols in optional arguments, an abbreviated Symbol consisting of the first character only is also accepted, e.g., :d instead of :delete (n.b., :t2 for :truncate2).

For more detail, see the description of each command-line option.

Note that for traditional genko-yoshi-style Japanese texts, in which each new paragraph is marked with “jisage” (indentation), probably the best way is to make your own Part instance to give to this method, where the rule for the Part should be something like:

/(\A[[:blank:]]+|\n[[:space:]]+)/
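To see what the suggested rule matches, the sketch below applies it with String#split to a short, made-up sample in which each paragraph starts with an ideographic space (U+3000). This only illustrates the regular expression; how Part consumes the rule is not shown here.

```ruby
rule = /(\A[[:blank:]]+|\n[[:space:]]+)/

# Two made-up paragraphs, each indented with an ideographic space (U+3000).
sample = "\u3000First paragraph.\n\u3000Second paragraph."

# With a capture group, String#split keeps the separators in the result.
pieces = sample.split(rule).reject(&:empty?)
p pieces
```

The indentation at the very beginning and each line break followed by indentation come out as separate elements, alternating with the paragraph bodies.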


# File 'lib/plain_text.rb', line 148

def self.clean_text(
      prt,
      preserve_paragraph: DEF_METHOD_OPTS[:clean_text][:preserve_paragraph],
      boundary_style:     DEF_METHOD_OPTS[:clean_text][:boundary_style], # If unspecified, will be replaced with lb_out * 2
      lbs_style:      DEF_METHOD_OPTS[:clean_text][:lbs_style],
      lb_is_space:    DEF_METHOD_OPTS[:clean_text][:lb_is_space],
      sps_style:      DEF_METHOD_OPTS[:clean_text][:sps_style],
      delete_asian_space: DEF_METHOD_OPTS[:clean_text][:delete_asian_space],
      linehead_style: DEF_METHOD_OPTS[:clean_text][:linehead_style],
      linetail_style: DEF_METHOD_OPTS[:clean_text][:linetail_style],
      firstlbs_style: DEF_METHOD_OPTS[:clean_text][:firstlbs_style],
      lastsps_style:  DEF_METHOD_OPTS[:clean_text][:lastsps_style],
      lb:     DEF_METHOD_OPTS[:clean_text][:lb],
      lb_out: DEF_METHOD_OPTS[:clean_text][:lb_out], # If unspecified, will be replaced with lb
      is_debug: false
    )

#isdebug = true if prt == "foo\n\n\nbar\n"
  lb_out ||= lb  # Output linebreak
  boundary_style = lb_out*2 if true       == boundary_style
  boundary_style = ""       if [:delete, :d].include? boundary_style
  lastsps_style  = lb_out   if :linebreak == lastsps_style

  if !prt.class.method_defined? :last_significant_element
    # Construct a Part instance from the given String.
    ret = ''
    begin
      prt = prt.unicode_normalize
    rescue ArgumentError  # (invalid byte sequence in UTF-8)
      warn "The given String in (#{self.name}\##{__method__}) seems wrong."
      raise
    end
    prt = normalize_lb(prt, "\n", lb_from: (DefLineBreaks.include?(lb) ? nil : lb)).dup
    kwd = (["\r\n", "\r", "\n"].include?(lb) ? {} : { rules: /#{Regexp.quote lb}{2,}/})
    prt = (preserve_paragraph ? Part.parse(prt, **kwd) : Part.new([prt]))
  else
    # If not preserve_paragraph, reconstructs it as a Part with a single Paragraph.
    # Also, deepcopy is needed, as this method is destructive.
    prt = (preserve_paragraph ? prt : Part.new([prt.join])).deepcopy
  end
  prt.squash_boundaryies!  # Boundaries are squashed.

  # Handles Boundary
  clean_text_boundary!(prt, boundary_style: boundary_style)

  # Handles linebreaks and spaces (within Paragraphs)
  clean_text_lbs_sps!( prt,
    lbs_style: lbs_style,
    lb_is_space: lb_is_space,
    sps_style: sps_style,
    delete_asian_space: delete_asian_space,
    is_debug: is_debug
  )
  # Handles the line head/tails.
  clean_text_line_head_tail!( prt,
    linehead_style: linehead_style,
    linetail_style: linetail_style
  )

  # Handles the file head/tail.
  clean_text_file_head_tail!( prt,
    firstlbs_style: firstlbs_style,
    lastsps_style:  lastsps_style,
    is_debug: is_debug
  )

  # Replaces the linebreaks to the specified one
  prt.map{ |i| i.gsub!(/\n/m, lb_out) }

  (ret ? prt.join : prt)  # prt.to_s may be different from prt.join
end
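A few of the default rules described for clean_text can be approximated with plain String substitutions. The sketch below is a simplified, hypothetical stand-in (rough_clean is not a PlainText method) that ignores Part, paragraphs and CJK handling; it is meant only to make the default behaviour easier to picture.

```ruby
# Simplified, hypothetical approximation of a few default rules;
# it does not reproduce PlainText.clean_text.
def rough_clean(text, lb: "\n")
  s = text.gsub(/\r\n?/, "\n")     # normalise CRLF and CR to LF
  s = s.gsub(/[[:blank:]]+$/, "")  # delete trailing blanks on each line
  s = s.gsub(/\n{3,}/, "\n\n")     # collapse runs of blank lines into one
  # Truncate runs of blanks inside a line, keeping line-head indentation.
  s = s.gsub(/(?<=[^[:blank:]\n])[[:blank:]]{2,}/, " ")
  s = s.sub(/\A\n+/, "")           # delete line breaks at the very beginning
  s = s.sub(/\s*\z/, "\n")         # end with exactly one line break
  s.gsub("\n", lb)                 # emit the requested output line break
end

p rough_clean("\n\n  foo   bar  \r\n\r\n\r\n\r\nbaz \n\n\n")
```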

.count_char(instr, *rest, lbs_style: , linehead_style: , lastsps_style: , lb_out: , **k) ⇒ Integer

Count the number of characters

See #clean_text! for the optional parameters. The defaults of a few of the optional parameters differ from it; for example, the default for lb_out is “\n” (newline, so that a line-break is 1 byte in size). This makes the method better suited to East-Asian (CJK) characters, given that it is most useful for CJK Strings, whereas for European alphabets, counting the number of words, rather than characters as in this method, would be more standard.



# File 'lib/plain_text.rb', line 91

def self.count_char(instr, *rest,
      lbs_style:      DEF_METHOD_OPTS[:count_char][:lbs_style],
      linehead_style: DEF_METHOD_OPTS[:count_char][:linehead_style],
      lastsps_style:  DEF_METHOD_OPTS[:count_char][:lastsps_style],
      lb_out:         DEF_METHOD_OPTS[:count_char][:lb_out],
      **k
    )
  clean_text(instr, *rest, lbs_style: lbs_style, linehead_style: linehead_style, lastsps_style: lastsps_style, lb_out: lb_out, **k).size
end
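The effect of lb_out on the count can be checked with core String methods alone. The sketch below does not call PlainText; it only shows that normalising CRLF to "\n" makes each line break count as one character, and that String#size counts characters rather than bytes for CJK text. The sample text is made up.

```ruby
# Three CJK characters, a CRLF, three ASCII letters, another CRLF.
text       = "\u65E5\u672C\u8A9E\r\nABC\r\n"
normalised = text.gsub(/\r\n?/, "\n")

puts text.size            # prints 10 (each CRLF counts as two characters)
puts normalised.size      # prints 8  (each line break is now one character)
puts normalised.bytesize  # prints 14 (each CJK character is 3 bytes in UTF-8)
```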

.delete_spaces_bw_cjk_european(instr, *rest) ⇒ Object

Module function of #delete_spaces_bw_cjk_european



# File 'lib/plain_text.rb', line 224

def self.delete_spaces_bw_cjk_european(instr, *rest)
  __call_inst_method__(:delete_spaces_bw_cjk_european, instr, *rest)
end

.extend_this(obj) ⇒ TrueClass, NilClass

If the class of the obj does not “include” this module, do so in the singleton class.



# File 'lib/plain_text.rb', line 75

def self.extend_this(obj)
  return nil if defined? obj.delete_spaces_bw_cjk_european!
  obj.extend(PlainText)
  true
end

.head(instr, *rest, **k) ⇒ Object

Module function of #head

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 236

def self.head(instr, *rest, **k)
  return PlainText.__call_inst_method__(:head, instr, *rest, **k)
end

.head_inverse(instr, *rest, **k) ⇒ Object

Module function of #head_inverse

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 248

def self.head_inverse(instr, *rest, **k)
  return PlainText.__call_inst_method__(:head_inverse, instr, *rest, **k)
end

.normalize_lb(instr, *rest, **k) ⇒ Object

Module function of #normalize_lb

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 260

def self.normalize_lb(instr, *rest, **k)
  return PlainText.__call_inst_method__(:normalize_lb, instr, *rest, **k)
end

.tail(instr, *rest, **k) ⇒ Object

Module function of #tail

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 272

def self.tail(instr, *rest, **k)
  return PlainText.__call_inst_method__(:tail, instr, *rest, **k)
end

.tail_inverse(instr, *rest, **k) ⇒ Object

Module function of #tail_inverse

The return String includes PlainText as Singleton.



# File 'lib/plain_text.rb', line 284

def self.tail_inverse(instr, *rest, **k)
  return PlainText.__call_inst_method__(:tail_inverse, instr, *rest, **k)
end

Instance Method Details

#count_char(*rest, **k) ⇒ Integer

Count the number of characters

See count_char and, further, clean_text! for the optional parameters. The defaults of a few of the optional parameters differ from those of the latter; for example, the default for lb_out is “\n” (newline, so that a line-break is 1 byte in size). This makes the method better suited to East-Asian (CJK) characters, given that it is most useful for CJK Strings, whereas for European alphabets, counting the number of words, rather than characters as in this method, would be more standard.



# File 'lib/plain_text.rb', line 536

def count_char(*rest, **k)
  PlainText.public_send(__method__, self, *rest, **k)
end

#delete_spaces_bw_cjk_european(*rest) ⇒ Object

Non-destructive version of #delete_spaces_bw_cjk_european!



# File 'lib/plain_text.rb', line 559

def delete_spaces_bw_cjk_european(*rest)
  newself = clone
  newself.delete_spaces_bw_cjk_european!(*rest)
  newself
end

#delete_spaces_bw_cjk_european!(repl = "") ⇒ MatchData, NilClass

Delete all the spaces between CJK and European characters or numbers.

All the spaces between CJK and European characters, numbers or punctuation are deleted or converted into a specified replacement character. In short, any spaces between, before, and after a CJK character are deleted. If the return value is non-nil, there is at least one match.



# File 'lib/plain_text.rb', line 549

def delete_spaces_bw_cjk_european!(repl="")
  ret = gsub!(/(\p{Hiragana}|\p{Katakana}|[ー-]|[一-龠々]|\p{Han}|\p{Hangul})([[:blank:]]+)([[:upper:][:lower:][:digit:][:punct:]])/, '\1\3')
  ret ||= gsub!(/([[:upper:][:lower:][:digit:][:punct:]])([[:blank:]]+)(\p{Hiragana}|\p{Katakana}|[ー-]|[一-龠々]|\p{Han}|\p{Hangul})/, '\1\3')
end
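The idea can be tried in isolation with two substitutions, one per direction. Note that in the listing above the second gsub! runs only when the first one finds no match (because of `||=`); the sketch below applies both unconditionally, which is an assumption about the intended behaviour. The names CJK, EUR and squeeze_cjk_gaps are hypothetical, and the character classes are a simplified subset of those in the listing.

```ruby
# Hypothetical, simplified character classes for illustration.
CJK = /[\p{Hiragana}\p{Katakana}\p{Han}\p{Hangul}]/
EUR = /[[:upper:][:lower:][:digit:][:punct:]]/

# Returns a copy with blanks removed between CJK and European characters.
def squeeze_cjk_gaps(str)
  str.gsub(/(#{CJK})[[:blank:]]+(#{EUR})/, '\1\2')   # CJK followed by European
     .gsub(/(#{EUR})[[:blank:]]+(#{CJK})/, '\1\2')   # European followed by CJK
end

sample = "\u6771\u4EAC 2020 \u5E74 ABC"
puts squeeze_cjk_gaps(sample)
```

Spaces between two European words are left alone, since neither pattern matches them.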

#head(num_in = DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, linebreak: $/) ⇒ String

Returns the first num lines (or characters, bytes), or everything before the last n-th line.

If “byte” is specified as the return unit, the encoding is the same as self, though the encoding for the returned String may not be valid anymore. Note that it is probably the better practice to use string[ 0..5 ] and string#byteslice(0,5) instead of this method for the units of “char” and “byte”, respectively.

For num, a negative number means counting from the last (e.g., -1 (lines, if unit is :line) means everything but the last 1 line, and -5 means everything but the last 5 lines), whereas 0 is forbidden. If a too big negative number is given, such as -9 for String of 2 lines, a null string is returned.

If unit is :line, num can be a Regexp, in which case the lines up to the first line that matches the given Regexp are returned as a String, where the process is line-based. For example, if num is /ABC/ (Regexp), a String of the lines from the beginning up to the line that contains the string “ABC” is returned.



# File 'lib/plain_text.rb', line 594

def head(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, linebreak: $/)
  if num_in.class.method_defined? :to_int
    num = num_in.to_int
    raise ArgumentError, "Non-positive num (#{num_in}) is given in #{__method__}" if num.to_int < 1
  elsif num_in.class.method_defined? :named_captures
    re_in = num_in
  else
    raise raise_typeerror(num_in, 'Integer or Range')
  end

  case unit
  when :line, "-n"
    # Regexp (for boundary)
    return head_regexp(re_in, inclusive: inclusive, linebreak: linebreak) if re_in

    # Integer (a number of lines)
    ret = split(linebreak)[0..(num-1)].join(linebreak)
    return ret if size <= ret.size  # Specified line is larger than the original or the last NL is missing.
    return(ret << linebreak)  # NL is added to the tail as in the original.
  when :char
    return self[0..(num-1)]
  when :byte, "-c"
    return self.byteslice(0..(num-1))
  else
    raise ArgumentError, "Specified unit (#{unit}.inspect) is invalid in #{__method__}"
  end
end
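The integer, line-unit branch of the listing above can be reproduced as a standalone function. This is a minimal sketch under that assumption; head_lines is a hypothetical name, and the Regexp and char/byte branches are left out.

```ruby
# Hypothetical standalone version of the line-based branch of #head.
def head_lines(str, num = 10, linebreak: "\n")
  raise ArgumentError, "num must be positive" if num < 1

  ret = str.split(linebreak)[0..(num - 1)].join(linebreak)
  # Nothing was cut off (or the final line break was missing): return as is.
  return ret if str.size <= ret.size

  ret + linebreak  # restore the line break that split removed
end

p head_lines("a\nb\nc\n", 2)  # two lines, with the trailing line break
p head_lines("a\nb", 5)       # fewer lines than requested
```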

#head!(*rest, **key) ⇒ self

Destructive version of #head



# File 'lib/plain_text.rb', line 570

def head!(*rest, **key)
  replace(head(*rest, **key))
end

#head_inverse(*rest, **key) ⇒ Object

Inverse of head - returns the content except for the first num lines (or characters, bytes)



# File 'lib/plain_text.rb', line 635

def head_inverse(*rest, **key)
  s2 = head(*rest, **key)
  (s2.size >= size) ? '' : self[s2.size..-1]
end

#head_inverse!(*rest, **key) ⇒ self

Destructive version of #head_inverse



# File 'lib/plain_text.rb', line 627

def head_inverse!(*rest, **key)
  replace(head_inverse(*rest, **key))
end

#normalize_lb(*rest, **k) ⇒ Object

Non-destructive version of #normalize_lb!



# File 'lib/plain_text.rb', line 664

def normalize_lb(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are carried over.
  newself.normalize_lb!(*rest, **k)
  newself
end

#normalize_lb!(repl = $/, lb_from: nil) ⇒ MatchData, NilClass

Normalizes line-breaks

All the line-breaks of self are converted into a new character or \n. If the return value is non-nil, self contains unexpected line-break characters for the OS.



# File 'lib/plain_text.rb', line 649

def normalize_lb!(repl=$/, lb_from: nil)
  ret = nil
  lb_from ||= DefLineBreaks
  lb_from = [lb_from].flatten
  lb_from.each do |ea_lb|
    gsub!(/#{ea_lb}/, repl) if ($/ != ea_lb) || ($/ == ea_lb && repl != ea_lb)
    ret = $~ if ($/ != ea_lb) && !ret
  end
  ret
end
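As a point of comparison, line breaks can also be unified in a single pass with Regexp.union, so that text already replaced is never matched again. This is an alternative sketch, not the library's method; unify_breaks and LINE_BREAKS are hypothetical names.

```ruby
# Same candidates as DefLineBreaks; "\r\n" comes first so that it is
# matched as one unit rather than as "\r" followed by "\n".
LINE_BREAKS = ["\r\n", "\n", "\r"].freeze

# Hypothetical helper: returns a copy with every line break replaced by repl.
def unify_breaks(str, repl = "\n")
  str.gsub(Regexp.union(LINE_BREAKS), repl)
end

p unify_breaks("a\r\nb\rc\n")
p unify_breaks("a\nb", "\r\n")
```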

#strip_at_lines(*rest, **k) ⇒ Object

Non-destructive version of #strip_at_lines!



# File 'lib/plain_text.rb', line 689

def strip_at_lines(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are carried over.
  newself.strip_at_lines!(*rest, **k)
  newself
end

#strip_at_lines!(strip_head: true, strip_tail: true, markdown: false, linebreak: $/) ⇒ self, NilClass

String#strip! for each line



# File 'lib/plain_text.rb', line 678

def strip_at_lines!(strip_head: true, strip_tail: true, markdown: false, linebreak: $/)
  strip_head = false if markdown
  r1 = strip_at_lines_head!(                    linebreak: linebreak) if strip_head
  r2 = strip_at_lines_tail!(markdown: markdown, linebreak: linebreak) if strip_tail
  (r1 || r2) ? self : nil
end

#strip_at_lines_head(*rest, **k) ⇒ Object

Non-destructive version of #strip_at_lines_head!



# File 'lib/plain_text.rb', line 709

def strip_at_lines_head(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are carried over.
  newself.strip_at_lines_head!(*rest, **k)
  newself
end

#strip_at_lines_head!(linebreak: $/) ⇒ self, NilClass

String#strip! for each line but only for the head part (NOT tail part)



# File 'lib/plain_text.rb', line 700

def strip_at_lines_head!(linebreak: $/)
  lb_quo = Regexp.quote linebreak
  gsub!(/(\A|#{lb_quo})[[:blank:]]+/m, '\1')
end

#strip_at_lines_tail(*rest, **k) ⇒ Object

Non-destructive version of #strip_at_lines_tail!



# File 'lib/plain_text.rb', line 733

def strip_at_lines_tail(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are carried over.
  newself.strip_at_lines_tail!(*rest, **k)
  newself
end

#strip_at_lines_tail!(markdown: false, linebreak: $/) ⇒ self, NilClass

String#strip! for each line but only for the tail part (NOT head part)



# File 'lib/plain_text.rb', line 720

def strip_at_lines_tail!(markdown: false, linebreak: $/)
  lb_quo = Regexp.quote linebreak
  return gsub!(/(?<=^|[^[:blank:]])[[:blank:]]+(#{lb_quo}|\z)/m, '\1') if ! markdown

  r1 = gsub!(/(?<=^|[^[:blank:]])[[:blank:]]{3,}(#{lb_quo}|\z)/m, '\1')
  r2 = gsub!(/(?<=^|[^[:blank:]])[[:blank:]](#{lb_quo}|\z)/m, '\1')
  (r1 || r2) ? self : nil
end
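Judging from the regular expressions in the listing, the markdown option appears to keep a run of exactly two trailing blanks (a Markdown hard line break) while deleting runs of one or of three and more. The sketch below shows that behaviour line by line; strip_tails is a hypothetical helper written for illustration, not the library's implementation.

```ruby
# Hypothetical helper: strips trailing blanks from every line of text.
# With markdown: true, a run of exactly two trailing blanks is kept.
def strip_tails(text, markdown: false)
  text.each_line.map { |line|
    body     = line.chomp
    eol      = line[body.size..-1]        # "" or the line terminator
    trailing = body[/[[:blank:]]*\z/]     # trailing blanks, possibly ""
    keep     = (markdown && trailing.size == 2) ? trailing : ""
    body.sub(/[[:blank:]]*\z/, "") + keep + eol
  }.join
end

sample = "a  \nb \n c   \n"
p strip_tails(sample)
p strip_tails(sample, markdown: true)
```

Leading blanks, such as the one before "c", are left untouched in both modes.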

#tail(num_in = DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, linebreak: $/) ⇒ String

Returns the last num lines (or characters, bytes), or everything after the first n lines.

If “byte” is specified as the return unit, the encoding is the same as self, though the encoding for the returned String may not be valid anymore. Note that it is probably the better practice to use string[ -5..-1 ] and string#byteslice(-5,5) instead of this method for the units of “char” and “byte”, respectively.

For num, a negative number means counting from the first (e.g., -1 [lines, if unit is :line] means everything but the first 1 line, and -5 means everything but the first 5 lines), whereas 0 is forbidden. If a too big negative number is given, such as -9 for String of 2 lines, a null string is returned.

If unit is :line, num can be a Regexp, in which case the lines after the first line that matches the given Regexp are returned as a String (not inclusive), where the process is line-based. For example, if num is /ABC/, a String of the lines from the line after the first line that contains the string “ABC” to the last one is returned. “The next line” means (1) the line immediately after the match if the matched string has the linebreak at the end, or (2) the line after the first linebreak after the matched string, where the trailing characters after the matched string up to the linebreak (inclusive) are ignored.

Tips

To specify the last line that matches the Regexp, consider prefixing (?:.*) with the option m, e.g., /(?:.*)ABC/m

Note for developers

The line that matches the Regexp has to be exclusive, because otherwise specifying the last line that matches would be impossible in principle. For example, to specify the last line that matches ABC, the given regexp should be /(?:.*)ABC/m (see the above Tips); in this case, if the matched line were inclusive, all the lines from Line 1 would be included, which is most likely not what the caller wants.
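The difference between /ABC/ and /(?:.*)ABC/m can be checked with a small standalone helper. The sketch below follows the two "next line" cases described above; lines_after_match is a hypothetical name, and the code is an illustration rather than the library's tail_regexp.

```ruby
# Hypothetical helper: returns everything after the line in which the
# first match of re ends.
def lines_after_match(text, re)
  m = text.match(re)
  return "" unless m

  rest = m.post_match
  return rest if m[0].end_with?("\n")  # case (1): the match already ends the line

  idx = rest.index("\n")
  idx ? rest[(idx + 1)..-1] : ""       # case (2): skip the rest of the matched line
end

text = "x ABC 1\nmid\ny ABC 2\nlast\n"
p lines_after_match(text, /ABC/)          # from the first matching line
p lines_after_match(text, /(?:.*)ABC/m)   # from the last matching line
```

With the m option the greedy `.*` also consumes line breaks, so the match extends to the last occurrence of ABC.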



# File 'lib/plain_text.rb', line 782

def tail(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, linebreak: $/)
  if num_in.class.method_defined? :to_int
    num = num_in.to_int
    raise ArgumentError, "num of zero is given in #{__method__}" if num == 0
    num += 1 if num < 0
  elsif num_in.class.method_defined? :named_captures
    re_in = num_in
  else
    raise raise_typeerror(num_in, 'Integer or Range')
  end

  case unit
  when :line, '-n'
    # Regexp (for boundary)
    return tail_regexp(re_in, inclusive: inclusive, linebreak: linebreak) if re_in

    # Integer (a number of lines)
    return tail_linenum(num_in, num, linebreak: linebreak)
  when :char
    num = 0 if num >= size && num_in > 0
    return self[(-num)..-1]
  when :byte, '-c'
    num = 0 if num >= bytesize && num_in > 0
    return self.byteslice((-num)..-1)
  else
    raise ArgumentError, "Specified unit (#{unit}.inspect) is invalid in #{__method__}"
  end
end
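The helper tail_linenum is not shown in the listing, so the line-based behaviour described above is sketched here with core methods only. tail_lines is a hypothetical name; a positive number keeps the last lines, and a negative number drops that many lines from the beginning.

```ruby
# Hypothetical standalone sketch of the documented line-based behaviour.
def tail_lines(str, num = 10)
  raise ArgumentError, "num must be non-zero" if num.zero?

  lines  = str.lines  # each element keeps its line terminator
  picked = num > 0 ? lines.last(num) : lines.drop(-num)
  picked.join
end

p tail_lines("a\nb\nc\n", 2)   # the last two lines
p tail_lines("a\nb\nc\n", -1)  # everything but the first line
p tail_lines("a\nb\nc\n", -9)  # too large a negative number: empty string
```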

#tail!(*rest, **key) ⇒ self

Destructive version of #tail



# File 'lib/plain_text.rb', line 744

def tail!(*rest, **key)
  replace(tail(*rest, **key))
end

#tail_inverse(*rest, **key) ⇒ Object

Inverse of tail - returns the content except for the last num lines (or characters, bytes)



# File 'lib/plain_text.rb', line 823

def tail_inverse(*rest, **key)
  s2 = tail(*rest, **key)
  (s2.size >= size) ? '' : self[0..(size-s2.size-1)]
end

#tail_inverse!(*rest, **key) ⇒ self

Destructive version of #tail_inverse



# File 'lib/plain_text.rb', line 815

def tail_inverse!(*rest, **key)
  replace(tail_inverse(*rest, **key))
end