Module: HexaPDF::Layout::TextBox::SimpleTextSegmentation

Defined in:
lib/hexapdf/layout/text_box.rb

Overview

Implementation of a simple text segmentation algorithm.

The algorithm breaks TextFragment objects into objects wrapped by Box, Glue or Penalty items, and inserts additional Penalty items when needed:

  • Any valid Unicode newline separator inserts a Penalty object describing a mandatory break.

    See www.unicode.org/reports/tr18/#Line_Boundaries

  • Spaces and tabs are wrapped in Glue objects, which allow a break.

  • Non-breaking spaces are wrapped in Penalty objects that prohibit line breaking.

  • Hyphens are attached to the preceding text fragment (or form a standalone text fragment) and are followed by a Penalty object to allow a break.

  • If a soft hyphen is encountered, a hyphen wrapped by a Penalty object is inserted to allow a break.

  • If a zero-width-space is encountered, a Penalty object is inserted to allow a break.
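
The rules above can be illustrated with a simplified model that works on plain strings instead of TextFragment objects. This is only a sketch: the method name segment_text and the symbol-tagged arrays are assumptions made for illustration and are not part of HexaPDF, which produces Box, Glue and Penalty objects instead.

```ruby
# Simplified model of the segmentation rules, operating on a plain string.
# segment_text and the [:box, ...] / [:glue, ...] / [:penalty, ...] tuples
# are hypothetical stand-ins for HexaPDF's Box, Glue and Penalty items.
def segment_text(text)
  result = []
  buffer = +""
  # Emits the characters collected so far as one box.
  flush = lambda do
    result << [:box, buffer.dup] unless buffer.empty?
    buffer.clear
  end

  chars = text.chars
  chars.each_with_index do |char, index|
    case char
    when " ", "\t"
      flush.call
      result << [:glue, char]
    when "\n", "\v", "\f", "\u{85}", "\u{2028}", "\u{2029}"
      flush.call
      result << [:penalty, :mandatory_break]
    when "\r"
      flush.call
      # CR directly followed by LF yields a single break, emitted for the LF.
      result << [:penalty, :mandatory_break] unless chars[index + 1] == "\n"
    when "-"
      buffer << char # the hyphen stays with the preceding text
      flush.call
      result << [:penalty, :standard]
    when "\u{00AD}"
      flush.call
      result << [:penalty, :standard, "-"] # hyphen shown only if the line breaks here
    when "\u{00A0}"
      flush.call
      result << [:penalty, :prohibited_break, " "]
    when "\u{200B}"
      flush.call
      result << [:penalty, :zero]
    else
      buffer << char
    end
  end
  flush.call
  result
end

segment_text("well-known fact")
# => [[:box, "well-"], [:penalty, :standard], [:box, "known"], [:glue, " "], [:box, "fact"]]
```

The model keeps the order of the input and never drops visible text; only the break characters are replaced by glue or penalty entries.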

Constant Summary

BREAK_RE =

Breaks are detected at: space, tab, zero-width space, non-breaking space, hyphen, soft hyphen, and any valid Unicode newline separator.

/[ \u{A}-\u{D}\u{85}\u{2028}\u{2029}\t\u{200B}\u{00AD}\u{00A0}-]/
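
As a quick check of which characters the pattern treats as break points, the following sketch copies the regular expression verbatim and applies it character by character. The helper break_characters is hypothetical and exists only for this illustration.

```ruby
# Pattern copied verbatim from the constant above.
BREAK_RE = /[ \u{A}-\u{D}\u{85}\u{2028}\u{2029}\t\u{200B}\u{00AD}\u{00A0}-]/

# Hypothetical helper, not part of HexaPDF: returns the characters of
# +text+ that match BREAK_RE, in order of appearance.
def break_characters(text)
  text.each_char.select { |char| BREAK_RE.match?(char) }
end

break_characters("well-known fact")  # => ["-", " "]
break_characters("plain")            # => []
```

Note that the trailing "-" inside the character class is a literal hyphen, while "\u{A}-\u{D}" is a range covering line feed, vertical tab, form feed and carriage return.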

Class Method Details

.call(items) ⇒ Object

Breaks the items (an array of InlineBox and TextFragment objects) into atomic pieces wrapped by Box, Glue or Penalty items, and returns those as an array.



# File 'lib/hexapdf/layout/text_box.rb', line 190

def self.call(items)
  result = []
  glues = {}
  items.each do |item|
    if item.kind_of?(InlineBox)
      result << Box.new(item)
    else
      i = 0
      while i < item.items.size
        # Collect characters and kerning values until break character is encountered
        box_items = []
        while (glyph = item.items[i]) &&
            (glyph.kind_of?(Numeric) || !BREAK_RE.match?(glyph.str))
          box_items << glyph
          i += 1
        end

        # A hyphen belongs to the text fragment
        box_items << glyph if glyph && !glyph.kind_of?(Numeric) && glyph.str == '-'.freeze

        unless box_items.empty?
          result << Box.new(TextFragment.new(items: box_items.freeze, style: item.style))
        end

        if glyph
          case glyph.str
          when ' '
            glues[item.style] ||=
              Glue.new(TextFragment.new(items: [glyph].freeze, style: item.style))
            result << glues[item.style]
          when "\n", "\v", "\f", "\u{85}", "\u{2028}", "\u{2029}"
            result << Penalty::MandatoryBreak
          when "\r"
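            # A trailing "\r" has no successor, so guard against nil before calling #str
            next_item = item.items[i + 1]
            if next_item.nil? || next_item.kind_of?(Numeric) || next_item.str != "\n"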
              result << Penalty::MandatoryBreak
            end
          when '-'
            result << Penalty::Standard
          when "\t"
            spaces = [item.style.font.decode_utf8(" ").first] * 8
            result << Glue.new(TextFragment.new(items: spaces.freeze, style: item.style))
          when "\u{00AD}"
            hyphen = item.style.font.decode_utf8("-").first
            frag = TextFragment.new(items: [hyphen].freeze, style: item.style)
            result << Penalty.new(Penalty::Standard.penalty, frag.width, item: frag)
          when "\u{00A0}"
            space = item.style.font.decode_utf8(" ").first
            frag = TextFragment.new(items: [space].freeze, style: item.style)
            result << Penalty.new(Penalty::ProhibitedBreak.penalty, frag.width, item: frag)
          when "\u{200B}"
            result << Penalty.new(0)
          end
        end
        i += 1
      end
    end
  end
  result
end