PlainText - Module and classes to handle plain text

Summary

This module provides utility functions and methods to handle plain text. Its namespace defines the classes Part, Paragraph, and Boundary, which represent the logical structure of a document, and the class ParseRule, which describes the rules for parsing plain text into a Part-type Ruby instance. This package also provides a command-line program to count the number of characters, which is especially useful for documents written in Asian (CJK) characters.

Design concept

PlainText - Module and root Namespace

The original plain text is expected to be a Ruby String.

The module PlainText offers some useful methods, such as PlainText#head and PlainText#tail, which are meant to be included in String. It also contains some useful module functions, such as PlainText.clean_text and PlainText.count_char.

PlainText::Part - Core class to describe the logical structure

The namespace of this module contains the PlainText::Part class, which is the core class for describing the logical structure of documents. It is basically a container class and is in fact a subclass of Array. It can contain other PlainText::Part instances as well as the more basic components PlainText::Part::Paragraph and PlainText::Part::Boundary, both of which are subclasses of String.

An example instance looks like this:

Part (
  (0) Paragraph::Empty,
  (1) Boundary::General,
  (2) Part::ArticleHeader(
        (0) Paragraph::Title,
        (1) Boundary::Empty
      ),
  (3) Boundary::TitleMain,
  (4) Part::ArticleMain(
        (0) Part::ArticleSection(
              (0) Paragraph::Title,
              (1) Boundary::General,
              (2) Paragraph::General,
              (3) Boundary::General,
              (4) Part::ArticleSubSection(...),
              (5) Boundary::General,
              (6) Paragraph::General,
              (7) Boundary::Empty
            ),
        (1) Boundary::General,
        (2) Paragraph::General,
        (3) Boundary::Empty
      ),
  (5) Boundary::General
)

where the names of the subclasses (or constants) here are arbitrary, except for PlainText::Part::Paragraph::Empty and PlainText::Part::Boundary::Empty, which are pre-defined. Users can define their own subclasses to organize the logical structure as they wish.

Basically, at every layer, every PlainText::Part or PlainText::Part::Paragraph is sandwiched by PlainText::Part::Boundary, except for the very first one.

By calling the join method, one can retrieve the entire document as a String instance at any time.
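The container idea above can be sketched in plain Ruby. The classes MiniPart, MiniParagraph, MiniBoundary, Title, and Section below are hypothetical stand-ins written for this illustration, not the gem's own classes; the sketch only assumes what is stated above, namely that the container is an Array subclass and the leaves are String subclasses.

```ruby
# Hypothetical stand-ins (NOT the gem's classes): an Array-based container
# and String-based leaf nodes.
class MiniPart < Array; end
class MiniParagraph < String; end
class MiniBoundary < String; end

# User-defined subclasses that act purely as labels for the logical structure.
class Title < MiniParagraph; end
class Section < MiniPart; end

section = Section[
  Title.new("1. Introduction"),
  MiniBoundary.new("\n\n"),
  MiniParagraph.new("Plain text has structure."),
  MiniBoundary.new("\n")
]

doc = MiniPart[
  Title.new("My Document"),
  MiniBoundary.new("\n\n"),
  section,
  MiniBoundary.new("")
]

# Array#join descends into nested arrays, so the whole document comes back
# as one String.
text = doc.join

# At each layer, content (a part or a paragraph) alternates with a boundary.
alternates = doc.each_slice(2).all? do |content, boundary|
  !content.is_a?(MiniBoundary) && boundary.is_a?(MiniBoundary)
end
```

Because every boundary is kept as its own element, joining loses nothing: the resulting String is the original text, character for character.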

PlainText::ParseRule - Class to describe the rule of how to parse

PlainText::ParseRule is the class that describes how to parse first a String and subsequently a PlainText::Part, which is basically an Array. PlainText::ParseRule is a container class and holds an ordered set of rules, each of which is either a Proc or, as a simpler form of rule, a Regexp. A Proc rule is defined by the user and is designed to receive either a String (in the first application only) or a PlainText::Part (Array) and to return a fully (or partially) parsed PlainText::Part. In short, the rule describes how to determine where paragraphs and boundaries begin and end, and possibly what and where the sections, sub-sections, and so on are.

For example, if a rule is Regexp, it describes how to split a String; it is applied to String in the first application, but if it is applied (and maybe registered as such) at the second or later stage, it is applied to each Paragraph and Section separately to split them further.
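The two-stage behaviour of Regexp rules can be illustrated with plain String#split. This is only a sketch of the idea and does not use PlainText::ParseRule; the two patterns and the variable names are assumptions chosen for the example.

```ruby
text = "Intro. More intro.\n\nBody one. Body two.\n"

# Stage 1: the rule is applied to the whole String. The capture group keeps
# the blank-line boundaries in the result.
paragraph_rule = /(\n{2,})/
stage1 = text.split(paragraph_rule, -1)

# Stage 2: a second rule is applied to each paragraph separately (the
# even-indexed elements); the boundaries (odd-indexed) are left untouched.
sentence_rule = /(?<=\.)( +)/
stage2 = stage1.each_with_index.map do |element, i|
  i.even? ? element.split(sentence_rule, -1) : element
end

# Nothing is lost at either stage: joining restores the original text.
restored = stage2.join
```

After the first stage the result is a flat list of alternating paragraphs and boundaries; after the second stage each paragraph has itself become a nested list with the same alternating layout.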

PlainText::ParseRule#apply and PlainText::Part.parse are the standard methods to apply the rules to an object (either String or PlainText::Part).

Command-line tool

countchar

Counts the number of characters in one or more files or in STDIN.

The simplest example to run the command-line script is

countchar YourFile.txt

You may start with

countchar --help

to see the available options.
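As a rough illustration of why character counting matters for CJK text, the following plain-Ruby sketch compares byte and character counts. It is not the implementation of countchar, whose exact counting rules are not described here; excluding whitespace is just one possible convention, assumed for the example.

```ruby
jp = "日本語のテキスト"

bytes = jp.bytesize   # number of bytes in the UTF-8 encoding
chars = jp.size       # number of characters

# One possible convention (an assumption): ignore ASCII whitespace and newlines.
spaced = "日本語 の テキスト\n"
chars_with_space    = spaced.size
chars_without_space = spaced.gsub(/\s+/, "").size
```

Here each character occupies three bytes in UTF-8, so a byte-based count such as file size overstates the length of the text by a factor of three.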

Miscellaneous

Module PlainText::Split contains an instance method (and a class method of the same name) PlainText::Split#split_with_delimiter, which is included in String by default. The method performs a reversible split of a String, using an arbitrary Regexp as the delimiter.

With the standard String#split, a String instance s = "XQabXXcXQ" gives the following results:

s.split(/X+Q?/)         #=> ["", "ab", "c"]
s.split(/X+Q?/, -1)     #=> ["", "ab", "c", ""]
s.split(/X+(Q?)/, -1)   #=> ["", "Q", "ab", "", "c", "Q", ""]
s.split(/(X+(Q?))/, -1) #=> ["", "XQ", "Q", "ab", "XX", "", "c", "XQ", "Q", ""]

With this method,

s.split_with_delimiter(/X+(Q?)/)
                        #=> ["", "XQ", "ab", "XX", "c", "XQ"]

from which the original string can always be recovered with a simple join.
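The idea of a reversible split can be sketched in plain Ruby as follows. The function reversible_split is a hypothetical name for this illustration and is not the gem's implementation; it only reproduces the result shown above for this input, and its handling of zero-width matches is not meant to mirror the gem.

```ruby
# Hypothetical sketch: split str at every match of re, keeping each whole
# match (not only its capture groups) as a separate element.
def reversible_split(str, re)
  parts = []
  pos = 0
  str.scan(re) do
    m = Regexp.last_match
    parts << str[pos...m.begin(0)] << m[0]   # text before the match, then the match
    pos = m.end(0)
  end
  parts << str[pos..-1] unless pos == str.size   # trailing text, if any
  parts
end

s = "XQabXXcXQ"
pieces = reversible_split(s, /X+(Q?)/)
#=> ["", "XQ", "ab", "XX", "c", "XQ"]
```

Unlike String#split with capture groups, the elements always alternate between text and delimiter, and pieces.join returns s unchanged.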

Also, PlainText::Util contains some miscellaneous methods.

Description

Work in progress…

It is still in a preliminary state.

Install

This script requires Ruby Version 2.0 or above (possibly 2.2 or above?).

As for the command-line script file, it can be put in any directory on your command-line search path. Make sure the RUBYLIB environment variable contains the library directory of this gem, which is

/THIS/GEM/LIBRARY/PATH/plain_text/lib

You may need to modify the first line (shebang line) of the script to suit your environment (this should be unnecessary on Linux and macOS), or run it explicitly with your Ruby command as

Prompt% /YOUR/ENV/ruby /YOUR/INSTALLED/countchar

Developer’s note

Tests

The Ruby files under the directory test/ are the test scripts. You can run them individually from the top directory as ruby test/test_****.rb, or simply run make test.

Known bugs

None.

Copyright

Author: Masa Sakano < info a_t wisebabel dot com >
Versions: The versions of this package follow Semantic Versioning (2.0.0) semver.org/