Module: PlainText

Includes:: Util

Defined in:: lib/plain_text.rb,
lib/plain_text/part.rb,
lib/plain_text/util.rb,
lib/plain_text/error.rb,
lib/plain_text/split.rb,
lib/plain_text/parse_rule.rb,
lib/plain_text/builtin_type.rb,
lib/plain_text/part/boundary.rb,
lib/plain_text/part/paragraph.rb,
lib/plain_text/part/string_type.rb

Overview

Utility methods for mainly line-based processing of String

This module contains methods useful in processing a String object of a text file, that is, a String that contains an entire or a multiple-line part of a text file. The methods include normalizing the line-break codes, removing extra spaces from each line, etc. Many of the methods work on the basis of a line. For example, the #head and #tail methods work like the respective UNIX-shell commands, returning a specified number of lines from the head/tail part of self.

Many of the methods contained directly in this module are meant to be included in String. Obviously, though, it is debatable if it is a good practice to include a third-party module in the core class.

Several module functions are also available. This module contains a helper module function PlainText.extend_this, with which an object can easily be extended with this module on its singleton class if this module is not already included.

A few methods in this module assume that Split is included in String, which is the case by default as soon as this file is read (by Ruby’s require). This specification may be subject to change in a future release.

Author:

Masa Sakano (Wise Babel Ltd)

Defined Under Namespace

Modules: BuiltinType, Split, Util

Classes: ParseRule, Part, PartNormalizeError

Constant Summary

DefLineBreaks = List of the default line breaks.

[ "\r\n", "\n", "\r" ]

DEF_HEADTAIL_N_LINES = Default number of lines to extract for #head and #tail

DEF_METHOD_OPTS = Default options for class/instance methods

{
  :clean_text => {
    preserve_paragraph: true,
    boundary_style: true,  # If unspecified, will be replaced with lb_out * 2
    lbs_style: :truncate,
    lb_is_space: false,
    sps_style: :truncate,
    delete_asian_space: true,
    linehead_style: :none,
    linetail_style: :delete,
    firstlbs_style: :delete,
    lastsps_style:  :truncate,
    lb: $/,
    lb_out: nil,           # If unspecified, will be replaced with lb
  },
  :count_char => {
    lbs_style: :delete,
    linehead_style: :delete,
    lastsps_style: :delete,
    lb_out: "\n",
  },
}

Class Method Summary

.__call_inst_method__(method, instr, *rest, **k) ⇒ #instr

Call instance method as a Module function.
.clean_text(prt, preserve_paragraph: , boundary_style: , lbs_style: , lb_is_space: , sps_style: , delete_asian_space: , linehead_style: , linetail_style: , firstlbs_style: , lastsps_style: , lb: , lb_out: , is_debug: false) ⇒ Object

Cleans the text.
.count_char(instr, *rest, lbs_style: , linehead_style: , lastsps_style: , lb_out: , **k) ⇒ Integer

Count the number of characters.
.delete_spaces_bw_cjk_european(instr, *rest) ⇒ Object

Module function of #delete_spaces_bw_cjk_european.
.extend_this(obj) ⇒ TrueClass, NilClass

If the class of obj does not “include” this module, extends obj with it on its singleton class.
.head(instr, *rest, **k) ⇒ Object

Module function of #head.
.head_inverse(instr, *rest, **k) ⇒ Object

Module function of #head_inverse.
.normalize_lb(instr, *rest, **k) ⇒ Object

Module function of #normalize_lb.
.tail(instr, *rest, **k) ⇒ Object

Module function of #tail.
.tail_inverse(instr, *rest, **k) ⇒ Object

Module function of #tail_inverse.

Instance Method Summary

#count_char(*rest, **k) ⇒ Integer

Count the number of characters.
#delete_spaces_bw_cjk_european(*rest) ⇒ Object

Non-destructive version of #delete_spaces_bw_cjk_european!.
#delete_spaces_bw_cjk_european!(repl = "") ⇒ MatchData, NilClass

Delete all the spaces between CJK and European characters or numbers.
#head(num_in = DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/) ⇒ String

Returns the first num lines (or characters, bytes), or the content before the last n-th line.
#head!(*rest, **key) ⇒ self

Destructive version of #head.
#head_inverse(*rest, **key) ⇒ Object

Inverse of head - returns the content except for the first num lines (or characters, bytes).
#head_inverse!(*rest, **key) ⇒ self

Destructive version of #head_inverse.
#normalize_lb(*rest, **k) ⇒ Object

Non-destructive version of #normalize_lb!.
#normalize_lb!(repl = $/, lb_from: nil) ⇒ MatchData, NilClass

Normalizes line-breaks.
#strip_at_lines(*rest, **k) ⇒ Object

Non-destructive version of #strip_at_lines!.
#strip_at_lines!(strip_head: true, strip_tail: true, markdown: false, linebreak: $/) ⇒ self, NilClass

String#strip! for each line.
#strip_at_lines_head(*rest, **k) ⇒ Object

Non-destructive version of #strip_at_lines_head!.
#strip_at_lines_head!(linebreak: $/) ⇒ self, NilClass

String#strip! for each line but only for the head part (NOT tail part).
#strip_at_lines_tail(*rest, **k) ⇒ Object

Non-destructive version of #strip_at_lines_tail!.
#strip_at_lines_tail!(markdown: false, linebreak: $/) ⇒ self, NilClass

String#strip! for each line but only for the tail part (NOT head part).
#tail(num_in = DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/) ⇒ String

Returns the last num lines (or characters, bytes), or the content from the n-th line onwards.
#tail!(*rest, **key) ⇒ self

Destructive version of #tail.
#tail_inverse(*rest, **key) ⇒ Object

Inverse of tail - returns the content except for the last num lines (or characters, bytes).
#tail_inverse!(*rest, **key) ⇒ self

Destructive version of #tail_inverse.

Class Method Details

.__call_inst_method__(method, instr, *rest, **k) ⇒ `#instr`

Call instance method as a Module function

The return String includes PlainText as Singleton.

Parameters:

method (Symbol) —

module method name
instr (String) —

String that is examined.

Returns:

(#instr)

# File 'lib/plain_text.rb', line 70

def self.__call_inst_method__(method, instr, *rest, **k)
  newself = instr.clone
  PlainText.extend_this(newself)
  newself.public_send(method, *rest, **k)
end
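
The clone, extend, and public_send sequence can be tried in isolation. The sketch below is only an illustration under stated assumptions: DemoText is a hypothetical stand-in module and call_inst_method is a hypothetical helper, neither of which is part of PlainText, and only positional arguments are forwarded.

```ruby
# Hypothetical stand-in for PlainText, with a single instance method.
module DemoText
  # Returns the first line of self, including its line break if present.
  def first_line
    self[/\A.*(?:\n|\z)/]
  end
end

# Follows the clone / extend / public_send sequence of the listing above.
def call_inst_method(mod, method, instr, *rest)
  newself = instr.clone   # leave the caller's String untouched
  newself.extend(mod)     # add the module to the clone's singleton class only
  newself.public_send(method, *rest)
end

src = "alpha\nbeta\n"
first = call_inst_method(DemoText, :first_line, src)
puts first.inspect                   # "alpha\n"
puts src.respond_to?(:first_line)    # false: the original was not extended
```

Because the clone is what gets extended, the input String keeps its original set of methods.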

.clean_text(prt, preserve_paragraph: , boundary_style: , lbs_style: , lb_is_space: , sps_style: , delete_asian_space: , linehead_style: , linetail_style: , firstlbs_style: , lastsps_style: , lb: , lb_out: , is_debug: false) ⇒ `Object`

Cleans the text

For example, it removes extra spaces, normalises the line-breaks, etc.

By default:

Paragraphs (separated by 2 or more \n) are taken into account (one \n between two): preserve_paragraph=true
Blank lines are truncated into one line with no white spaces: boundary_style=lb_out*2(=$/*2)
Consecutive white spaces are truncated into a single space: sps_style=:truncate
White spaces before or after a CJK character are deleted: delete_asian_space=true
Preceding white spaces in each line are preserved: linehead_style=:none
Trailing white spaces in each line are deleted: linetail_style=:delete
Line-breaks at the beginning of the entire input string are deleted: firstlbs_style=:delete
Trailing white spaces and line-breaks at the end of the entire input string are truncated into a single linebreak: lastsps_style=:truncate
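
Several of these defaults can be approximated with plain regular expressions. The following is a rough sketch only: rough_clean is a hypothetical helper, it is not PlainText.clean_text, it ignores the CJK handling and the Part-based paragraph parsing, and it assumes ASCII-style input.

```ruby
# Hypothetical helper: approximates a few of the defaults listed above.
def rough_clean(text, lb: "\n")
  out = text.gsub(/\r\n|\r/, "\n")               # work on LF internally
  out = out.sub(/\A(?:[[:blank:]]*\n)+/, "")     # firstlbs_style: :delete
  out = out.gsub(/[[:blank:]]+$/, "")            # linetail_style: :delete
  out = out.gsub(/\n{2,}/, "\n\n")               # paragraph boundary becomes one blank line
  out = out.gsub(/(?<=\S)[[:blank:]]{2,}/, " ")  # sps_style: :truncate (line heads kept)
  out = out.sub(/\s*\z/, "\n")                   # lastsps_style: :truncate
  out.gsub("\n", lb)                             # output line break
end

sample  = "\n\nfoo   bar  \n\n\n\n  baz\n\n\n"
cleaned = rough_clean(sample)
puts cleaned.inspect   # "foo bar\n\n  baz\n"
```

The leading spaces of "  baz" survive because the sketch, like linehead_style=:none, leaves the line head alone.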

For a String with predominantly CJK characters, the following setting is recommended:

lbs_style: :delete
delete_asian_space: true (Default)

Note that for the Symbols in optional arguments, an abbreviation consisting of the first character only is also accepted, e.g., :d instead of :delete (n.b., :t2 for :truncate2).

For more detail, see the description of each command-line option.

Note that for traditional genko-yoshi-style Japanese texts, where “jisage” (indentation) marks each new paragraph, probably the best way is to make your own Part instance to give to this method, where the rule for the Part should be something like:

/(\A[[:blank:]]+|\n[[:space:]]+)/

Parameters:

prt (PlainText::Part, String) —

Part or String to examine.
preserve_paragraph (Boolean) (defaults to: ) —

Paragraphs are taken into account if true (Def: true). In the input, paragraphs are defined to be separated by more than one lb, potentially with some space characters in between. Their output style is specified with boundary_style.
boundary_style (String, Symbol) (defaults to: ) —

One of (:truncate|:truncate2|:delete|:none) or String. If String, the boundaries between paragraphs are replaced with this String (Def: lb_out*2). If :truncate, consecutive linebreaks and spaces are truncated into 2 linebreaks. :truncate2 is similar, but they are not truncated beyond 3 linebreaks (i.e., up to 2 blank lines between Paragraphs). If :none, nothing is done about them. Unless :none, all the white spaces between linebreaks are deleted.
lbs_style (Symbol) (defaults to: ) —

One of (:truncate|:delete|:none) (Def: :truncate). If :delete, all the linebreaks within paragraphs are deleted. :truncate is meaningful only when preserve_paragraph=false and consecutive linebreaks are truncated into 1 linebreak.
sps_style (Symbol) (defaults to: ) —

One of (:truncate|:delete|:none) (Def: :truncate). If :truncate, the consecutive white spaces within paragraphs, except for those at the line-head or line-tail (which are controlled by linehead_style and linetail_style, respectively), are truncated into a single white space. If :delete, they are deleted.
lb_is_space (Boolean) (defaults to: ) —

If true, a line-break, except those for the boundaries (unless preserve_paragraph is false), is equivalent to a space (Def: False).
delete_asian_space (Boolean) (defaults to: ) —

Any spaces between, before, or after Asian characters (except punctuation) are deleted, if true (Default).
linehead_style (Symbol) (defaults to: ) —

One of (:truncate|:delete|:none) (Def: :none). Determines how to handle consecutive white spaces at the beginning of each line.
linetail_style (Symbol) (defaults to: ) —

One of (:truncate|:delete|:markdown|:none) (Def: :delete). Determines how to handle consecutive white spaces at the end of each line. If :markdown, 1 space is always deleted, and two or more spaces are truncated into two ASCII whitespaces if the last two spaces are ASCII whitespaces, or else left untouched.
firstlbs_style (Symbol, String) (defaults to: ) —

One of (:truncate|:delete|:none) or String (Def: :delete). If :truncate, any linebreaks at the very beginning of self (and whitespaces in between), if they exist, are truncated to a single linebreak. If String, they are replaced with the specified String (such as a linebreak), even if none exist. If :delete, they are deleted. Note that this option has nothing to do with the whitespaces at the beginning of the first significant line (hence the name of the option). Note also that if a (random) Part is given, this option only considers the first significant element of it.
lastsps_style (Symbol, String) (defaults to: ) —

One of (:truncate|:delete|:none|:linebreak) or String (Def: :truncate). If :truncate, any linebreaks AND white spaces at the tail of self, if they exist, are truncated to a single linebreak. If :delete, they are deleted. If String, they are replaced with the specified String, even if none exist. If :linebreak, lb_out is used as the String, i.e., it guarantees that exactly 1 linebreak exists at the end of the String. Note that if a (random) Part is given, this option only considers the last significant element of it.
lb (String) (defaults to: ) —

Linebreak character like \n etc (Default: $/). If this is one of the standard line-breaks, irregular line-breaks (for example, existence of CR when only LF should be there) are corrected.
lb_out (String) (defaults to: ) —

Linebreak used for output (Default: lb)

Returns:

same as prt

# File 'lib/plain_text.rb', line 153

def self.clean_text(
      prt,
      preserve_paragraph: DEF_METHOD_OPTS[:clean_text][:preserve_paragraph],
      boundary_style:     DEF_METHOD_OPTS[:clean_text][:boundary_style], # If unspecified, will be replaced with lb_out * 2
      lbs_style:      DEF_METHOD_OPTS[:clean_text][:lbs_style],
      lb_is_space:    DEF_METHOD_OPTS[:clean_text][:lb_is_space],
      sps_style:      DEF_METHOD_OPTS[:clean_text][:sps_style],
      delete_asian_space: DEF_METHOD_OPTS[:clean_text][:delete_asian_space],
      linehead_style: DEF_METHOD_OPTS[:clean_text][:linehead_style],
      linetail_style: DEF_METHOD_OPTS[:clean_text][:linetail_style],
      firstlbs_style: DEF_METHOD_OPTS[:clean_text][:firstlbs_style],
      lastsps_style:  DEF_METHOD_OPTS[:clean_text][:lastsps_style],
      lb:     DEF_METHOD_OPTS[:clean_text][:lb],
      lb_out: DEF_METHOD_OPTS[:clean_text][:lb_out], # If unspecified, will be replaced with lb
      is_debug: false
    )

#isdebug = true if prt == "foo\n\n\nbar\n"
  lb_out ||= lb  # Output linebreak
  boundary_style = lb_out*2 if true       == boundary_style
  boundary_style = ""       if [:delete, :d].include? boundary_style
  lastsps_style  = lb_out   if :linebreak == lastsps_style

  if !prt.class.method_defined? :last_significant_element
    # Construct a Part instance from the given String.
    ret = ''
    begin
      prt = prt.unicode_normalize
    rescue ArgumentError  # (invalid byte sequence in UTF-8)
      warn "The given String in (#{self.name}\##{__method__}) seems wrong."
      raise
    end
    prt = normalize_lb(prt, "\n", lb_from: (DefLineBreaks.include?(lb) ? nil : lb)).dup
    kwd = (["\r\n", "\r", "\n"].include?(lb) ? {} : { rules: /#{Regexp.quote lb}{2,}/})
    prt = (preserve_paragraph ? Part.parse(prt, **kwd) : Part.new([prt]))
  else
    # If not preserve_paragraph, reconstructs it as a Part with a single Paragraph.
    # Also, deepcopy is needed, as this method is destructive.
    prt = (preserve_paragraph ? prt : Part.new([prt.join])).deepcopy
  end
  prt.squash_boundaries!  # Boundaries are squashed.

  # Handles Boundary
  clean_text_boundary!(prt, boundary_style: boundary_style)

  # Handles linebreaks and spaces (within Paragraphs)
  clean_text_lbs_sps!( prt,
    lbs_style: lbs_style,
    lb_is_space: lb_is_space,
    sps_style: sps_style,
    delete_asian_space: delete_asian_space,
    is_debug: is_debug
  )
  # Handles the line head/tails.
  clean_text_line_head_tail!( prt,
    linehead_style: linehead_style,
    linetail_style: linetail_style
  )

  # Handles the file head/tail.
  clean_text_file_head_tail!( prt,
    firstlbs_style: firstlbs_style,
    lastsps_style:  lastsps_style,
    is_debug: is_debug
  )

  # Replaces the linebreaks to the specified one
  prt.map{ |i| i.gsub!(/\n/m, lb_out) }

  (ret ? prt.join : prt)  # prt.to_s may be different from prt.join
end

.count_char(instr, *rest, lbs_style: , linehead_style: , lastsps_style: , lb_out: , **k) ⇒ `Integer`

Count the number of characters

See .clean_text for the optional parameters. The defaults of a few of the optional parameters differ from those of .clean_text; for example, the default for lb_out is \n (newline, so that a line-break is 1 byte in size). The defaults are tuned for East-Asian (CJK) characters, given that this method is most useful for CJK Strings, whereas, for European alphabets, counting the number of words, rather than characters as in this method, would be more standard.

Parameters:

instr (String) —

String for which the number of chars is counted

Returns:

(Integer)

# File 'lib/plain_text.rb', line 96

def self.count_char(instr, *rest,
      lbs_style:      DEF_METHOD_OPTS[:count_char][:lbs_style],
      linehead_style: DEF_METHOD_OPTS[:count_char][:linehead_style],
      lastsps_style:  DEF_METHOD_OPTS[:count_char][:lastsps_style],
      lb_out:         DEF_METHOD_OPTS[:count_char][:lb_out],
      **k
    )
  clean_text(instr, *rest, lbs_style: lbs_style, linehead_style: linehead_style, lastsps_style: lastsps_style, lb_out: lb_out, **k).size
end
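
The effect of these defaults (line breaks and line-edge blanks dropped before counting) can be sketched without the library. rough_count_char below is a hypothetical helper under that assumption; it does not call clean_text and skips the CJK-specific space handling.

```ruby
# Hypothetical helper, not the library method: removes line breaks and
# the blanks at both ends of each line, then counts characters (not bytes).
def rough_count_char(text)
  out = text.gsub(/\r\n|\r/, "\n")      # normalize line breaks to LF
  out = out.gsub(/^[[:blank:]]+/, "")   # linehead_style: :delete
  out = out.gsub(/[[:blank:]]+$/, "")   # blanks at each line tail
  out.delete("\n").size                 # lbs_style: :delete, then count
end

n = rough_count_char("  吾輩は猫である。\n名前はまだ無い。  \n")
puts n   # 16
```

String#size counts characters, so each CJK character contributes 1 regardless of its byte length.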

.delete_spaces_bw_cjk_european(instr, *rest) ⇒ `Object`

Module function of #delete_spaces_bw_cjk_european

Parameters:

repl (String) —

Replacement character (Default: “”).

Returns:

as instr



# File 'lib/plain_text.rb', line 229

def self.delete_spaces_bw_cjk_european(instr, *rest)
  __call_inst_method__(:delete_spaces_bw_cjk_european, instr, *rest)
end

.extend_this(obj) ⇒ `TrueClass`, `NilClass`

If the class of obj does not “include” this module, extends obj with it on its singleton class.

Parameters:

obj (Object) —

Typically a String. It is extended with this module on its singleton class if the condition is met.

Returns:

(TrueClass, NilClass) —

true if obj was extended. Else nil.

# File 'lib/plain_text.rb', line 80

def self.extend_this(obj)
  return nil if defined? obj.delete_spaces_bw_cjk_european!
  obj.extend(PlainText)
  true
end
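
The guard-then-extend idea keeps the change local to one object rather than to every String. A minimal sketch, assuming a hypothetical module DemoShout and a hypothetical helper extend_once (neither belongs to PlainText):

```ruby
# Hypothetical stand-in module; the real PlainText is not loaded here.
module DemoShout
  def shout
    upcase + "!"
  end
end

# Same idea as extend_this: nil if the probe method is already available,
# otherwise extend the object's singleton class and return true.
def extend_once(obj, mod, probe)
  return nil if obj.respond_to?(probe)
  obj.extend(mod)
  true
end

a = String.new("hello")
b = String.new("world")
first_call  = extend_once(a, DemoShout, :shout)
second_call = extend_once(a, DemoShout, :shout)

puts first_call.inspect          # true
puts second_call.inspect         # nil
puts a.shout                     # HELLO!
puts b.respond_to?(:shout)       # false: other Strings are unaffected
```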

.head(instr, *rest, **k) ⇒ `Object`

Module function of #head

The return String includes PlainText as Singleton.

Parameters:

instr (String) —

String that is examined.
num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line.
unit (Symbol, String) —

One of :line (or “-n”), :char, :byte (or “-c”)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

as instr



# File 'lib/plain_text.rb', line 241

def self.head(instr, *rest, **k)
  return PlainText.__call_inst_method__(:head, instr, *rest, **k)
end

.head_inverse(instr, *rest, **k) ⇒ `Object`

Module function of #head_inverse

The return String includes PlainText as Singleton.

Parameters:

instr (String) —

String that is examined.
num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line.
unit (Symbol, String) —

One of :line (or “-n”), :char, :byte (or “-c”)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

as instr



# File 'lib/plain_text.rb', line 253

def self.head_inverse(instr, *rest, **k)
  return PlainText.__call_inst_method__(:head_inverse, instr, *rest, **k)
end

.normalize_lb(instr, *rest, **k) ⇒ `Object`

Module function of #normalize_lb

The return String includes PlainText as Singleton.

Parameters:

instr (String) —

String that is examined.
repl (String) —

Replacement character (Default: $/ which is \n in UNIX).
lb_from (String, Array, NilClass) —

Candidate line-break(s) (Default: [CRLF, CR, LF]+)

Returns:

as instr



# File 'lib/plain_text.rb', line 265

def self.normalize_lb(instr, *rest, **k)
  return PlainText.__call_inst_method__(:normalize_lb, instr, *rest, **k)
end

.tail(instr, *rest, **k) ⇒ `Object`

Module function of #tail

The return String includes PlainText as Singleton.

Parameters:

instr (String) —

String that is examined.
num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line. If positive, the last num_in lines are returned. If negative, the lines from the num-in-th line from the head are returned. In short, calling this method as tail(3) and tail(-3) is similar to the UNIX commands “tail -n 3” and “tail -n +3”, respectively.
unit (Symbol) —

One of :line (as in -n option), :char, :byte (-c option)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

as instr



# File 'lib/plain_text.rb', line 277

def self.tail(instr, *rest, **k)
  return PlainText.__call_inst_method__(:tail, instr, *rest, **k)
end

.tail_inverse(instr, *rest, **k) ⇒ `Object`

Module function of #tail_inverse

The return String includes PlainText as Singleton.

Parameters:

instr (String) —

String that is examined.
num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line. If positive, the last num_in lines are returned. If negative, the lines from the num-in-th line from the head are returned. In short, calling this method as tail(3) and tail(-3) is similar to the UNIX commands “tail -n 3” and “tail -n +3”, respectively.
unit (Symbol) —

One of :line (as in -n option), :char, :byte (-c option)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

as instr



# File 'lib/plain_text.rb', line 289

def self.tail_inverse(instr, *rest, **k)
  return PlainText.__call_inst_method__(:tail_inverse, instr, *rest, **k)
end

Instance Method Details

#count_char(*rest, **k) ⇒ `Integer`

Count the number of characters

See .count_char and further .clean_text for the optional parameters. The defaults of a few of the optional parameters differ from those of the latter; for example, the default for lb_out is \n (newline, so that a line-break is 1 byte in size). The defaults are tuned for East-Asian (CJK) characters, given that this method is most useful for CJK Strings, whereas, for European alphabets, counting the number of words, rather than characters as in this method, would be more standard.

Returns:

(Integer)



# File 'lib/plain_text.rb', line 540

def count_char(*rest, **k)
  PlainText.public_send(__method__, self, *rest, **k)
end

#delete_spaces_bw_cjk_european(*rest) ⇒ `Object`

Non-destructive version of #delete_spaces_bw_cjk_european!

Parameters:

repl (String) —

Replacement character (Default: “”).

Returns:

same class as self

# File 'lib/plain_text.rb', line 563

def delete_spaces_bw_cjk_european(*rest)
  newself = clone
  newself.delete_spaces_bw_cjk_european!(*rest)
  newself
end

#delete_spaces_bw_cjk_european!(repl = "") ⇒ `MatchData`, `NilClass`

Delete all the spaces between CJK and European characters or numbers.

All the spaces between CJK and European characters, numbers, or punctuation are deleted or converted into a specified replacement character. In short, any spaces between, before, and after a CJK character are deleted. If the return is non-nil, there is at least one match.

Parameters:

repl (String) (defaults to: "") —

Replacement character (Default: “”).

Returns:

(MatchData, NilClass) —

MatchData of (one of) the last match if there is a positive match, else nil.

# File 'lib/plain_text.rb', line 553

def delete_spaces_bw_cjk_european!(repl="")
  ret = gsub!(/(\p{Hiragana}|\p{Katakana}|[ー－]|[一-龠々]|\p{Han}|\p{Hangul})([[:blank:]]+)([[:upper:][:lower:][:digit:][:punct:]])/, '\1\3')
  ret ||= gsub!(/([[:upper:][:lower:][:digit:][:punct:]])([[:blank:]]+)(\p{Hiragana}|\p{Katakana}|[ー－]|[一-龠々]|\p{Han}|\p{Hangul})/, '\1\3')
end
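
The two-pass substitution can be tried on a small sample. The sketch below is a simplified, non-destructive illustration: squeeze_cjk_european is a hypothetical helper, and its CJK character class omits the extra alternatives ([ー－] and [一-龠々]) used in the listing above.

```ruby
# Hypothetical helper: deletes blanks between a CJK character and a
# European letter, digit, or punctuation mark, in either order.
def squeeze_cjk_european(str)
  cjk = /\p{Hiragana}|\p{Katakana}|\p{Han}|\p{Hangul}/
  eur = /[[:upper:][:lower:][:digit:][:punct:]]/
  step1 = str.gsub(/(#{cjk})[[:blank:]]+(#{eur})/, '\1\2')   # CJK, then European
  step1.gsub(/(#{eur})[[:blank:]]+(#{cjk})/, '\1\2')         # European, then CJK
end

squeezed = squeeze_cjk_european("日本語 ABC と 123 の間")
puts squeezed                              # 日本語ABCと123の間
puts squeeze_cjk_european("foo bar")       # foo bar (no CJK neighbour, unchanged)
```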

#head(num_in = DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/) ⇒ `String`

Returns the first num lines (or characters, bytes), or the content before the last n-th line.

If “byte” is specified as the return unit, the encoding is the same as self, though the encoding for the returned String may not be valid anymore. Note that it is probably the better practice to use string[ 0..5 ] and string#byteslice(0,5) instead of this method for the units of “char” and “byte”, respectively.

For num, a negative number means counting from the last (e.g., -1 (lines, if unit is :line) means everything but the last line, and -5 means everything but the last 5 lines), whereas 0 is forbidden. If a negative number whose magnitude exceeds the length is given, such as -9 for a String of 2 lines, an empty string is returned.

If unit is :line, num can be a Regexp, in which case the lines up to the first line that matches the given Regexp are returned, where the process is based on the lines. For example, if num is /ABC/ (Regexp), a String of the lines from the beginning up to the line that contains the string “ABC” is returned.

Parameters:

num_in (Integer, Regexp) (defaults to: DEF_HEADTAIL_N_LINES) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line.
unit (Symbol, String) (defaults to: :line) —

One of :line (or “-n”), :char, :byte (or “-c”)
inclusive (Boolean) (defaults to: true) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) (defaults to: $/) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

(String) —

as self

# File 'lib/plain_text.rb', line 598

def head(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/)
  if num_in.class.method_defined? :to_int
    num = num_in.to_int
    raise ArgumentError, "Non-positive num (#{num_in}) is given in #{__method__}" if num.to_int < 1
  elsif num_in.class.method_defined? :named_captures
    re_in = num_in
  else
    raise raise_typeerror(num_in, 'Integer or Range')
  end

  case unit
  when :line, "-n"
    # Regexp (for boundary)
    return head_regexp(re_in, inclusive: inclusive, padding: padding, linebreak: linebreak) if re_in

    # Integer (a number of lines)
    ret = split(linebreak, -1)[0..(num-1)].join(linebreak)  # -1 is specified to preserve the last linebreak(s).
    return ret if size <= ret.size  # Specified line is larger than the original or the last NL is missing.
    return(ret << linebreak)  # NL is added to the tail as in the original.
  when :char
    return self[0..(num-1)]
  when :byte, "-c"
    return self.byteslice(0..(num-1))
  else
    raise ArgumentError, "Specified unit (#{unit}.inspect) is invalid in #{__method__}"
  end
end
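
The Integer branch for unit :line relies on split with a limit of -1 so that a trailing line break is not lost. A standalone sketch of just that branch, assuming a hypothetical helper head_lines (positive num only, as in the listing above):

```ruby
# Hypothetical helper mirroring the Integer / :line branch shown above.
def head_lines(str, num = 10, linebreak: "\n")
  raise ArgumentError, "num must be positive" if num < 1
  # The -1 limit keeps the trailing empty field, preserving a final line break.
  ret = str.split(linebreak, -1)[0..(num - 1)].join(linebreak)
  return ret if str.size <= ret.size   # fewer lines than requested, or no final line break
  ret << linebreak                     # restore the line break after the last kept line
end

text = "a\nb\nc\nd\n"
puts head_lines(text, 2).inspect     # "a\nb\n"
puts head_lines(text, 9).inspect     # "a\nb\nc\nd\n"
puts head_lines("a\nb", 5).inspect   # "a\nb"
```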

#head!(*rest, **key) ⇒ `self`

Destructive version of #head

Parameters:

num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line.
unit (Symbol, String) —

One of :line (or “-n”), :char, :byte (or “-c”)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

(self)



# File 'lib/plain_text.rb', line 574

def head!(*rest, **key)
  replace(head(*rest, **key))
end

#head_inverse(*rest, **key) ⇒ `Object`

Inverse of head - returns the content except for the first num lines (or characters, bytes)

Parameters:

num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line.
unit (Symbol, String) —

One of :line (or “-n”), :char, :byte (or “-c”)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

same as self

# File 'lib/plain_text.rb', line 639

def head_inverse(*rest, **key)
  s2 = head(*rest, **key)
  (s2.size >= size) ? self[0,0] : self[s2.size..-1]
end
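
The same "take the head, then return what is left" logic can be sketched with String#lines standing in for #head. head_inverse_lines is a hypothetical helper that handles the line unit with a positive count only.

```ruby
# Hypothetical helper: everything after the first num lines.
def head_inverse_lines(str, num = 10)
  kept = str.lines.first(num).join   # what a line-based head would return
  kept.size >= str.size ? str[0, 0] : str[kept.size..-1]
end

text = "a\nb\nc\n"
puts head_inverse_lines(text, 1).inspect   # "b\nc\n"
puts head_inverse_lines(text, 5).inspect   # ""
```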

#head_inverse!(*rest, **key) ⇒ `self`

Destructive version of #head_inverse

Parameters:

num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line.
unit (Symbol, String) —

One of :line (or “-n”), :char, :byte (or “-c”)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

(self)



# File 'lib/plain_text.rb', line 631

def head_inverse!(*rest, **key)
  replace(head_inverse(*rest, **key))
end

#normalize_lb(*rest, **k) ⇒ `Object`

Non-destructive version of #normalize_lb!

Parameters:

repl (String) —

Replacement character (Default: $/ which is \n in UNIX).
lb_from (String, Array, NilClass) —

Candidate line-break(s) (Default: [CRLF, CR, LF]+)

Returns:

same class as self

# File 'lib/plain_text.rb', line 668

def normalize_lb(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are preserved.
  newself.normalize_lb!(*rest, **k)
  newself
end

#normalize_lb!(repl = $/, lb_from: nil) ⇒ `MatchData`, `NilClass`

Normalizes line-breaks

All the line-breaks of self are converted into a new character or \n. If the return is non-nil, self contains unexpected line-break characters for the OS.

Parameters:

repl (String) (defaults to: $/) —

Replacement character (Default: $/ which is \n in UNIX).
lb_from (String, Array, NilClass) (defaults to: nil) —

Candidate line-break(s) (Default: [CRLF, CR, LF]+)

Returns:

(MatchData, NilClass) —

MatchData of the last match if there is non-$/ match, else nil.

# File 'lib/plain_text.rb', line 653

def normalize_lb!(repl=$/, lb_from: nil)
  ret = nil
  lb_from ||= DefLineBreaks
  lb_from = [lb_from].flatten
  lb_from.each do |ea_lb|
    gsub!(/#{ea_lb}/, repl) if ($/ != ea_lb) || ($/ == ea_lb && repl != ea_lb)
    ret = $~ if ($/ != ea_lb) && !ret
  end
  ret
end
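
For comparison, line-break normalization can also be written as a single-pass substitution. This is an alternative sketch, not the loop used above: normalize_linebreaks is a hypothetical, non-destructive helper that always treats CRLF, CR, and LF as the candidates.

```ruby
# Hypothetical helper: converts CRLF, CR, and LF to one replacement.
def normalize_linebreaks(str, repl = "\n")
  # CRLF comes first in the alternation so it is not read as CR followed by LF.
  str.gsub(/\r\n|\r|\n/, repl)
end

mixed = "one\r\ntwo\rthree\n"
puts normalize_linebreaks(mixed).inspect           # "one\ntwo\nthree\n"
puts normalize_linebreaks(mixed, "\r\n").inspect   # "one\r\ntwo\r\nthree\r\n"
```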

#strip_at_lines(*rest, **k) ⇒ `Object`

Non-destructive version of #strip_at_lines!

Parameters:

strip_head (Boolean) —

if true (Default), spaces at each line head are removed.
strip_tail (Boolean) —

if true (Default), spaces at each line tail are removed (see markdown option).
markdown (Boolean) —

if true (Def: false), a double space at each tail remains and strip_head is forcibly false.
linebreak (String) —

\n etc (Default: $/)

Returns:

same class as self

# File 'lib/plain_text.rb', line 693

def strip_at_lines(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are preserved.
  newself.strip_at_lines!(*rest, **k)
  newself
end

#strip_at_lines!(strip_head: true, strip_tail: true, markdown: false, linebreak: $/) ⇒ `self`, `NilClass`

String#strip! for each line

Parameters:

strip_head (Boolean) (defaults to: true) —

if true (Default), spaces at each line head are removed.
strip_tail (Boolean) (defaults to: true) —

if true (Default), spaces at each line tail are removed (see markdown option).
markdown (Boolean) (defaults to: false) —

if true (Def: false), a double space at each tail remains and strip_head is forcibly false.
linebreak (String) (defaults to: $/) —

\n etc (Default: $/)

Returns:

(self, NilClass) —

nil if gsub! does not match at all, i.e., there are no spaces to remove.

# File 'lib/plain_text.rb', line 682

def strip_at_lines!(strip_head: true, strip_tail: true, markdown: false, linebreak: $/)
  strip_head = false if markdown
  r1 = strip_at_lines_head!(                    linebreak: linebreak) if strip_head
  r2 = strip_at_lines_tail!(markdown: markdown, linebreak: linebreak) if strip_tail
  (r1 || r2) ? self : nil
end

#strip_at_lines_head(*rest, **k) ⇒ `Object`

Non-destructive version of #strip_at_lines_head!

Parameters:

linebreak (String) —

\n etc (Default: $/)

Returns:

same class as self

# File 'lib/plain_text.rb', line 713

def strip_at_lines_head(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are preserved.
  newself.strip_at_lines_head!(*rest, **k)
  newself
end

#strip_at_lines_head!(linebreak: $/) ⇒ `self`, `NilClass`

String#strip! for each line but only for the head part (NOT tail part)

Parameters:

linebreak (String) (defaults to: $/) —

\n etc (Default: $/)

Returns:

(self, NilClass) —

nil if gsub! does not match at all, i.e., there are no spaces to remove.

# File 'lib/plain_text.rb', line 704

def strip_at_lines_head!(linebreak: $/)
  lb_quo = Regexp.quote linebreak
  gsub!(/(\A|#{lb_quo})[[:blank:]]+/m, '\1')
end
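
The regular expression above can be exercised directly with String#gsub. The sketch assumes "\n" as the line break and uses the non-destructive gsub so that the sample stays unchanged.

```ruby
text = "  alpha\n\tbeta\ngamma  \n"

# Blanks are removed only where they follow the start of the string or a line break.
stripped = text.gsub(/(\A|\n)[[:blank:]]+/, '\1')
puts stripped.inspect    # "alpha\nbeta\ngamma  \n"

# gsub! returns nil when nothing matched, which is where the nil return comes from.
unchanged = String.new("alpha\nbeta\n").gsub!(/(\A|\n)[[:blank:]]+/, '\1')
puts unchanged.inspect   # nil
```

Trailing blanks, such as those after "gamma", are left alone by this pattern.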

#strip_at_lines_tail(*rest, **k) ⇒ `Object`

Non-destructive version of #strip_at_lines_tail!

Parameters:

markdown (Boolean) —

if true (Def: false), a double space at each tail remains.
linebreak (String) —

\n etc (Default: $/)

Returns:

same class as self

# File 'lib/plain_text.rb', line 737

def strip_at_lines_tail(*rest, **k)
  newself = clone  # must be clone (not dup) so that singleton methods, which may include this method, are carried over.
  newself.strip_at_lines_tail!(*rest, **k)
  newself
end

#strip_at_lines_tail!(markdown: false, linebreak: $/) ⇒ `self`, `NilClass`

String#strip! for each line but only for the tail part (NOT head part)

Parameters:

markdown (Boolean) (defaults to: false) —

if true (Def: false), a double space at each tail remains.
linebreak (String) (defaults to: $/) —

\n etc (Default: $/)

Returns:

(self, NilClass) —

nil if gsub! does not match at all, i.e., there are no spaces to remove.

# File 'lib/plain_text.rb', line 724

def strip_at_lines_tail!(markdown: false, linebreak: $/)
  lb_quo = Regexp.quote linebreak
  return gsub!(/(?<=^|[^[:blank:]])[[:blank:]]+(#{lb_quo}|\z)/m, '\1') if ! markdown

  r1 = gsub!(/(?<=^|[^[:blank:]])[[:blank:]]{3,}(#{lb_quo}|\z)/m, '\1')
  r2 = gsub!(/(?<=^|[^[:blank:]])[[:blank:]](#{lb_quo}|\z)/m, '\1')
  (r1 || r2) ? self : nil
end
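
The effect of the markdown option is easiest to see side by side. The following sketch assumes a hypothetical non-destructive helper, strip_tails, which is not part of PlainText; it reuses the three patterns from the listing above on an arbitrary String, with "\n" as the default line break.

```ruby
# Hypothetical non-destructive helper reusing the patterns of
# strip_at_lines_tail! on an arbitrary String.
def strip_tails(text, markdown: false, linebreak: "\n")
  lb_quo = Regexp.quote(linebreak)
  unless markdown
    # Any run of trailing blanks is removed.
    return text.gsub(/(?<=^|[^[:blank:]])[[:blank:]]+(#{lb_quo}|\z)/m, '\1')
  end
  # Markdown mode: runs of 3+ blanks and single blanks are removed,
  # so a run of exactly two blanks (a Markdown hard break) survives.
  text.gsub(/(?<=^|[^[:blank:]])[[:blank:]]{3,}(#{lb_quo}|\z)/m, '\1')
      .gsub(/(?<=^|[^[:blank:]])[[:blank:]](#{lb_quo}|\z)/m, '\1')
end

sample = "a  \nb \nc   \n"
p strip_tails(sample)                  # => "a\nb\nc\n"
p strip_tails(sample, markdown: true)  # => "a  \nb\nc\n"
```

Note that in markdown mode a run of three or more trailing blanks is removed entirely rather than reduced to two.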

#tail(num_in = DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/) ⇒ `String`

Returns the last num lines (or characters or bytes), or, for a negative num, the part of self starting at the n-th line.

If “byte” is specified as the return unit, the returned String has the same encoding as self, though its byte sequence may no longer be valid in that encoding. Note that for the units of “char” and “byte” it is probably better practice to use string[-5..-1] and String#byteslice(-5, 5), respectively, instead of this method.

For num, a negative number means counting from the head (e.g., -1 [lines, if unit is :line] means everything but the first 1 line, and -5 means everything but the first 5 lines), whereas 0 is forbidden. If the negative number is too large in magnitude, such as -9 for a String of 2 lines, an empty string is returned.

If unit is :line, num can be a Regexp, in which case the lines after the first line that matches the given Regexp are returned (not inclusive); the processing is line-based. For example, if num is /ABC/, the String of the lines from the line following the first line that contains the characters “ABC” to the last line is returned. “The following line” means (1) the line immediately after the match if the matched string has the linebreak at its end, or (2) the line after the first linebreak after the matched string, in which case the trailing characters from the end of the matched string up to and including that linebreak are ignored.

Tips:

To specify the last line that matches the Regexp, consider prefixing the pattern with (?:.*) and using the option m, e.g., /(?:.*)ABC/m

Note for developers:

The line that matches the Regexp has to be exclusive, because otherwise it would be impossible in principle to specify the last line that matches. For example, to specify the last line that matches ABC, the given Regexp should be /(?:.*)ABC/m (see the Tips above); in this case, if the matched line were inclusive, *all the lines from Line 1* would be included, which is most likely not what the caller wants.
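
Why the (?:.*) prefix selects the last occurrence can be checked with Ruby's own Regexp, independently of this module. In the sketch below, text_after_match is a hypothetical helper and the sample string is made up; it only illustrates greedy matching with the m option and does not reproduce the line-based handling of #tail.

```ruby
# Hypothetical helper: the text following the match of +re+ in +str+,
# or nil if there is no match.
def text_after_match(str, re)
  m = re.match(str)
  m && m.post_match
end

log = "x ABC 1\ny ABC 2\nz"
p text_after_match(log, /ABC/)         # => " 1\ny ABC 2\nz"
p text_after_match(log, /(?:.*)ABC/m)  # => " 2\nz"
```

With the m option, .* also consumes line breaks, so the greedy prefix swallows everything up to the last “ABC”. The match itself then begins at the start of the String, which is the reason given above for treating the matched line as exclusive.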

Parameters:

num_in (Integer, Regexp) (defaults to: DEF_HEADTAIL_N_LINES) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line. If positive, the last num_in lines are returned. If negative, the lines from the num-in-th line from the head are returned. In short, calling this method as tail(3) and tail(-3) is similar to the UNIX commands “tail -n 3” and “tail -n +3”, respectively.
unit (Symbol) (defaults to: :line) —

One of :line (as in -n option), :char, :byte (-c option)
inclusive (Boolean) (defaults to: true) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) (defaults to: $/) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

(String) —

as self

# File 'lib/plain_text.rb', line 786

def tail(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/)

  if num_in.class.method_defined? :to_int
    num = num_in.to_int
    raise ArgumentError, "num of zero is given in #{__method__}" if num == 0
    num += 1 if num < 0
  elsif num_in.class.method_defined? :named_captures
    re_in = num_in
  else
    raise raise_typeerror(num_in, 'Integer or Regexp')  # the accepted types are Integer and Regexp
  end

  case unit
  when :line, '-n'
    # Regexp (for boundary)
    return tail_regexp(re_in, inclusive: inclusive, padding: padding, linebreak: linebreak) if re_in

    # Integer (a number of lines)
    return tail_linenum(num_in, num, linebreak: linebreak)
  when :char
    num = 0 if num >= size && num_in > 0
    return self[(-num)..-1]
  when :byte, '-c'
    num = 0 if num >= bytesize && num_in > 0
    return self.byteslice((-num)..-1)
  else
    raise ArgumentError, "Specified unit (#{unit.inspect}) is invalid in #{__method__}"
  end
end
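
The index arithmetic of the :char branch above can be followed in isolation. This is a sketch, not the library's code path: tail_chars is a hypothetical stand-alone function that copies only the integer handling and the :char slice from the listing, using Integer() in place of the to_int check.

```ruby
# Hypothetical stand-alone version of the :char branch of tail.
def tail_chars(str, num_in)
  num = Integer(num_in)
  raise ArgumentError, "num of zero is given" if num == 0
  num += 1 if num < 0                       # -3 => slice starts at index 2
  num = 0 if num >= str.size && num_in > 0  # asking for more than there is
  str[(-num)..-1]
end

p tail_chars("abcdef", 3)    # => "def"
p tail_chars("abcdef", -3)   # => "cdef"
p tail_chars("abcdef", 10)   # => "abcdef"
```

The negative case mirrors “tail -c +3”: the result starts at the third character.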

#tail!(*rest, **key) ⇒ `self`

Destructive version of #tail

Parameters:

num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line. If positive, the last num_in lines are returned. If negative, the lines from the num-in-th line from the head are returned. In short, calling this method as tail(3) and tail(-3) is similar to the UNIX commands “tail -n 3” and “tail -n +3”, respectively.
unit (Symbol) —

One of :line (as in -n option), :char, :byte (-c option)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

(self)




# File 'lib/plain_text.rb', line 748

def tail!(*rest, **key)
  replace(tail(*rest, **key))
end

#tail_inverse(*rest, **key) ⇒ `Object`

Inverse of tail - returns the content except for the last num lines (or characters, bytes), i.e., everything that precedes the part #tail would return

Parameters:

num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line. If positive, the last num_in lines are returned. If negative, the lines from the num-in-th line from the head are returned. In short, calling this method as tail(3) and tail(-3) is similar to the UNIX commands “tail -n 3” and “tail -n +3”, respectively.
unit (Symbol) —

One of :line (as in -n option), :char, :byte (-c option)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

same as self

# File 'lib/plain_text.rb', line 828

def tail_inverse(*rest, **key)
  s2 = tail(*rest, **key)
  (s2.size >= size) ? self[0,0] : self[0..(size-s2.size-1)]
end
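
The slice in the listing above returns whatever precedes the part that tail returns. A minimal sketch of that relation in plain Ruby follows; last_lines is a hypothetical stand-in for tail(n) with a positive integer and the default unit, and all_but_last_lines is a hypothetical name for the same slicing logic.

```ruby
# Hypothetical stand-in for tail(n) with a positive Integer and unit :line.
def last_lines(str, n)
  str.lines.last(n).join
end

# Same slicing as tail_inverse: drop the tail part from the end of str.
def all_but_last_lines(str, n)
  tail_part = last_lines(str, n)
  (tail_part.size >= str.size) ? str[0, 0] : str[0..(str.size - tail_part.size - 1)]
end

text = "a\nb\nc\n"
p all_but_last_lines(text, 2)   # => "a\n"
p all_but_last_lines(text, 5)   # => ""
```

Concatenating the two parts gives back the original String, which is a convenient sanity check when both methods are used together.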

#tail_inverse!(*rest, **key) ⇒ `self`

Destructive version of #tail_inverse

Parameters:

num_in (Integer, Regexp) —

Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line. If positive, the last num_in lines are returned. If negative, the lines from the num-in-th line from the head are returned. In short, calling this method as tail(3) and tail(-3) is similar to the UNIX commands “tail -n 3” and “tail -n +3”, respectively.
unit (Symbol) —

One of :line (as in -n option), :char, :byte (-c option)
inclusive (Boolean) —

read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result.
linebreak (String) —

\n etc (Default: $/), used when unit==:line (Default)

Returns:

(self)




# File 'lib/plain_text.rb', line 820

def tail_inverse!(*rest, **key)
  replace(tail_inverse(*rest, **key))
end
