PlainText - Module and classes to handle plain text
Summary
This module provides utility functions and methods to handle plain text. Its namespace defines the classes Part, Paragraph and Boundary, which represent the logical structure of a document, and another class, ParseRule, which describes the rules for parsing plain text into a Part-type Ruby instance. The package also provides a command-line program to count the number of characters, which is especially useful for documents in Asian (CJK) characters.
Design concept
PlainText - Module and root Namespace
The original plain text is expected to be a Ruby String.
The module PlainText offers some useful methods, such as PlainText#head and PlainText#tail, which are meant to be included in String. It also contains some useful module functions, such as PlainText.clean_text and PlainText.count_char.
PlainText::Part - Core class to describe the logical structure
The namespace of this module contains the PlainText::Part class, which is the heart of the description of the logical structure of documents. It is basically a container class and is indeed a subclass of Array. It can contain other PlainText::Part instances or the more basic components PlainText::Part::Paragraph and PlainText::Part::Boundary, both of which are subclasses of String.
An example instance looks like this:
Part (
  (0) Paragraph::Empty,
  (1) Boundary::General,
  (2) Part::ArticleHeader(
        (0) Paragraph::Title,
        (1) Boundary::Empty
      ),
  (3) Boundary::TitleMain,
  (4) Part::ArticleMain(
        (0) Part::ArticleSection(
              (0) Paragraph::Title,
              (1) Boundary::General,
              (2) Paragraph::General,
              (3) Boundary::General,
              (4) Part::ArticleSubSection(...),
              (5) Boundary::General,
              (6) Paragraph::General,
              (7) Boundary::Empty
            ),
        (1) Boundary::General,
        (2) Paragraph::General,
        (3) Boundary::Empty
      ),
  (5) Boundary::General
)
where the names of the subclasses (or constants) are arbitrary, except for PlainText::Part::Paragraph::Empty and PlainText::Part::Boundary::Empty, which are pre-defined. Users can define their own subclasses to organize the logical structure as they wish.
Basically, at every layer, every PlainText::Part or PlainText::Part::Paragraph is sandwiched between PlainText::Part::Boundary instances, except for the very first one, which has no preceding Boundary.
One can retrieve the entire document as a String instance at any time by calling the join method.
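The layering and the join behaviour described above can be sketched with plain Ruby. The classes DemoPart, DemoParagraph and DemoBoundary and the helper alternating? below are hypothetical stand-ins invented for this illustration. They are not the classes of this gem and only mimic the idea of an Array subclass that holds String subclasses.

```ruby
# Hypothetical stand-ins for illustration only (NOT the gem's classes).
class DemoPart < Array; end        # container, like a Part
class DemoParagraph < String; end  # textual content
class DemoBoundary < String; end   # separator between contents

doc = DemoPart.new([
  DemoParagraph.new(""),           # an empty first Paragraph
  DemoBoundary.new("\n"),
  DemoPart.new([                   # a nested Part, e.g. a header
    DemoParagraph.new("My Title"),
    DemoBoundary.new(""),
  ]),
  DemoBoundary.new("\n\n"),
  DemoParagraph.new("Body text."),
  DemoBoundary.new("\n"),
])

# Checks that, at every layer, even indices hold a Paragraph or a
# (well-formed) Part and odd indices hold a Boundary.
def alternating?(part)
  part.each_with_index.all? do |elem, i|
    if i.even?
      elem.is_a?(DemoParagraph) || (elem.is_a?(DemoPart) && alternating?(elem))
    else
      elem.is_a?(DemoBoundary)
    end
  end
end

# Array#join descends into nested Arrays, so the whole text comes back.
text = doc.join
puts text
puts alternating?(doc)
```

Because every element is ultimately a String, no information is lost by the nesting: the join simply concatenates the leaves in order.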
PlainText::ParseRule - Class to describe the rule of how to parse
PlainText::ParseRule is the class that describes how to parse first a String and subsequently a PlainText::Part, which is basically an Array. PlainText::ParseRule is a container class that holds an ordered set of rules, each of which is either a Proc or, as a simpler rule, a Regexp. A Proc rule is defined by the user and is designed to receive either a String (in the first application only) or a PlainText::Part (Array) and to return a fully (or partially) parsed PlainText::Part. In short, the rules describe where paragraphs and boundaries begin and end, and possibly what and where the sections, sub-sections and so on are.
For example, if a rule is Regexp, it describes how to split a String; it is applied to String in the first application, but if it is applied (and maybe registered as such) at the second or later stage, it is applied to each Paragraph and Section separately to split them further.
PlainText::ParseRule#apply and PlainText::Part.parse are the standard methods to apply the rules to an object (either String or PlainText::Part).
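The idea of an ordered set of Regexp and Proc rules can be sketched as follows. DemoRuleSet, its apply method and the variable names are hypothetical and written only for this illustration. They do not reproduce the interface of PlainText::ParseRule, and a plain Array stands in for PlainText::Part. The sketch also assumes that a Regexp rule wraps its whole delimiter in one capture group so that the delimiters are kept.

```ruby
# Hypothetical stand-in for an ordered rule set (NOT the gem's ParseRule).
class DemoRuleSet
  def initialize(*rules)
    @rules = rules
  end

  # Applies every rule in order. The input is a String in the first
  # application and an Array afterwards.
  def apply(input)
    @rules.inject(input) do |current, rule|
      case rule
      when Regexp then split_leaves(current, rule)
      when Proc   then rule.call(current)
      else raise ArgumentError, "unsupported rule: #{rule.inspect}"
      end
    end
  end

  private

  # Splits a String, or every String leaf of a nested Array, with the
  # Regexp. The capture group in the Regexp keeps the delimiters.
  def split_leaves(obj, regexp)
    return obj.map { |e| split_leaves(e, regexp) } if obj.is_a?(Array)
    obj.split(regexp, -1)
  end
end

source = "Title\n\nFirst paragraph.\n\nSecond paragraph.\n"

# Rule 1 (Regexp): blank lines separate paragraphs.
blank_lines = /(\n{2,})/
# Rule 2 (Proc): group the first paragraph and its boundary as a header.
header_rule = ->(ary) { [ary[0, 2], *ary[2..-1]] }

parsed = DemoRuleSet.new(blank_lines, header_rule).apply(source)
p parsed
puts parsed.join == source
```

Keeping the delimiters in the result is the design choice that matters here: it lets a later rule regroup the pieces freely while the original text remains recoverable with join.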
Command-line tool
countchar
Counts the number of characters in one or more files or in STDIN.
The simplest example to run the command-line script is
countchar YourFile.txt
You may start with
countchar --help
to see the available options.
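As background on why a character count is worth having for CJK text, the following minimal sketch uses only core Ruby String methods and shows how the number of characters differs from the number of bytes. It is not how countchar is implemented, and the variable names are chosen only for this illustration.

```ruby
# "\u65E5\u672C\u8A9E" is a three-character Japanese word written with
# Unicode escapes, so the snippet does not depend on the source encoding.
sample = "\u65E5\u672C\u8A9E abc\n"

char_count    = sample.size                 # characters, not bytes
byte_count    = sample.bytesize             # bytes in UTF-8
visible_count = sample.gsub(/\s/, "").size  # characters excluding whitespace

puts "characters: #{char_count}"
puts "bytes:      #{byte_count}"
puts "non-space:  #{visible_count}"
```

Each of the three CJK characters occupies three bytes in UTF-8, which is why a byte-based count overstates the length of such documents.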
Miscellaneous
Module PlainText::Split contains an instance method (and a class method with the same name) PlainText::Split#split_with_delimiter, which is included in String by default. The method realises a reversible split of a String with a delimiter of an arbitrary Regexp.
With the standard String#split, the results for a String instance s = "XQabXXcXQ" are as follows:
s.split(/X+Q?/) #=> ["", "ab", "c"],
s.split(/X+Q?/, -1) #=> ["", "ab", "c", ""],
s.split(/X+(Q?)/, -1) #=> ["", "Q", "ab", "", "c", "Q", ""],
s.split(/(X+(Q?))/, -1) #=> ["", "XQ", "Q", "ab", "XX", "", "c", "XQ", "Q", ""],
With this method,
s.split_with_delimiter(/X+(Q?)/)
#=> ["", "XQ", "ab", "XX", "c", "XQ"]
from which the original string is always easily recovered by a simple join.
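The idea of a reversible split can be sketched with core Ruby alone. The function demo_split_with_delimiter below is a hypothetical re-implementation written for this illustration. It is not the gem's code, and its handling of edge cases (such as zero-width matches) is an assumption that may differ from split_with_delimiter.

```ruby
# Hypothetical sketch (NOT the gem's implementation): returns an Array
# that alternates text and delimiter, [text, delim, text, delim, ...],
# so that joining the result reproduces the original String.
def demo_split_with_delimiter(str, regexp)
  result = []
  pos = 0
  while (m = regexp.match(str, pos))
    break if m[0].empty?  # assumption: stop at a zero-width match
    result << str[pos...m.begin(0)]  # text before the delimiter
    result << m[0]                   # the whole delimiter, not its groups
    pos = m.end(0)
  end
  # Append the remaining text unless the String ended with a delimiter.
  result << str[pos..-1] if pos < str.size
  result
end

s = "XQabXXcXQ"
pieces = demo_split_with_delimiter(s, /X+(Q?)/)
p pieces
puts pieces.join == s
```

Unlike String#split, the sketch stores the whole match rather than the captured groups, which is what makes the join lossless.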
Also, PlainText::Util contains some miscellaneous methods.
Description
Work in progress…
It is still in a preliminary state.
Install
This script requires Ruby Version 2.0 or above (possibly 2.2 or above?).
As for the command-line script file, it can be put in any of your command-line search paths. Make sure the RUBYLIB environment variable contains the library directory of this gem, which is
/THIS/GEM/LIBRARY/PATH/plain_text/lib
You may need to modify the first line (Shebang line) of the script to suit your environment (it should be unnecessary for Linux and MacOS), or run it explicitly with your Ruby command as
Prompt% /YOUR/ENV/ruby /YOUR/INSTALLED/countchar
Developer’s note
Tests
The Ruby files under the directory test/ are the test scripts. You can run them from the top directory as ruby test/test_****.rb or simply run make test.
Known bugs
None.
Copyright
Author: Masa Sakano < info a_t wisebabel dot com >
Versions: The versions of this package follow Semantic Versioning (2.0.0) semver.org/