Module: Bidi2pdf::TestHelpers::PDFTextSanitizer

Defined in:
lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb

Overview

Provides utilities for sanitizing and comparing PDF text content. This module includes methods for cleaning text, comparing PDF content, and reporting differences between actual and expected PDF outputs.

The sanitization process includes normalizing whitespace, replacing typographic ligatures, and handling other common text formatting issues.

Examples:

Cleaning text

sanitized_text = Bidi2pdf::TestHelpers::PDFTextSanitizer.clean("Some text")

Comparing PDF content

match = Bidi2pdf::TestHelpers::PDFTextSanitizer.match?(actual_pdf, expected_pdf)

Class Method Details

.clean(text) ⇒ String

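Cleans the given text by applying Unicode NFKD normalization, replacing typographic ligatures, ellipses, and dashes with ASCII equivalents, re-joining words hyphenated across line breaks, and collapsing every run of whitespace into a single space.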



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 29

def clean(text)
  text = UnicodeUtils.nfkd(text)

  text.gsub("\uFB01", "fi")
      .gsub("\uFB02", "fl")
      .gsub("-\n", "")
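      .gsub(/[\u201C\u201D]/, '"') # curly double quotes become straight quotes
      .gsub(/[\u2018\u2019]/, "'") # curly single quotes become straight quotes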
      .gsub("…", "...")
      .gsub("—", "--")
      .gsub("–", "-")
      .gsub(/\s+/, " ") # Replace all whitespace sequences with a single space
      .strip
end
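To illustrate what this pipeline does to typical extracted PDF text, here is a minimal sketch that mirrors the main steps using only Ruby's standard library. The name `sketch_clean` is hypothetical and not part of Bidi2pdf, and the sketch assumes that `String#unicode_normalize(:nfkd)` behaves like `UnicodeUtils.nfkd` for this input.

```ruby
# Hypothetical stand-in for PDFTextSanitizer.clean (standard library only).
def sketch_clean(text)
  text.unicode_normalize(:nfkd) # compatibility decomposition, which also splits ligatures
      .gsub("-\n", "")          # re-join words hyphenated across a line break
      .gsub("\u2014", "--")     # em dash
      .gsub("\u2013", "-")      # en dash
      .gsub(/\s+/, " ")         # collapse whitespace runs into one space
      .strip
end

raw = "  The \uFB01rst of\uFB02ine docu-\nment \u2013 draft\n\n"
puts sketch_clean(raw)
```

The ligatures, the soft line break inside "document", and the surrounding whitespace all disappear, which is what makes text from differently rendered PDFs comparable.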

.clean_pages(actual_pdf_thingy) ⇒ Array<String>

Cleans an array of PDF page texts by applying .clean to each page's content.



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 49

def clean_pages(actual_pdf_thingy)
  Bidi2pdf::TestHelpers::PDFReaderUtils.pdf_text(actual_pdf_thingy).map { |text| clean(text) }
end

.contains?(actual_pdf_thingy, expected, page_number = nil) ⇒ Boolean

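Checks if the given PDF contains the expected text or pattern. When page_number (1-based) is given, only that page is searched; otherwise every page is searched. Returns false if page_number exceeds the number of pages.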



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 67

def contains?(actual_pdf_thingy, expected, page_number = nil)
  pages = Bidi2pdf::TestHelpers::PDFReaderUtils.pdf_text(actual_pdf_thingy)
  cleaned_pages = clean_pages(pages)

  return false if page_number && page_number > cleaned_pages.size

  # Narrow to specific page if requested
  if page_number
    text = cleaned_pages[page_number - 1]
    return match_expected?(text, expected)
  end

  # Search all pages
  cleaned_pages.any? { |page| match_expected?(page, expected) }
end
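The lookup logic above can be sketched on plain strings, without reading a real PDF. `sketch_contains?` is a hypothetical helper that takes an array of already cleaned page texts; it is not part of Bidi2pdf and only mirrors the branching shown in the source.

```ruby
# Hypothetical helper mirroring the page lookup of contains? on plain strings.
def sketch_contains?(cleaned_pages, expected, page_number = nil)
  # Same rule as match_expected?: Regexp uses match?, anything else is a substring test.
  matcher = lambda do |text|
    next false unless text

    expected.is_a?(Regexp) ? text.match?(expected) : text.include?(expected.to_s)
  end

  if page_number
    return false if page_number > cleaned_pages.size

    return matcher.call(cleaned_pages[page_number - 1]) # page numbers are 1-based
  end

  cleaned_pages.any? { |page| matcher.call(page) }
end

pages = ["Invoice 2024-001", "Total: 42.00 EUR"]
puts sketch_contains?(pages, "Total")         # searches every page
puts sketch_contains?(pages, "Total", 1)      # page 1 only
puts sketch_contains?(pages, /\d+\.\d{2}/, 2) # Regexp on page 2
puts sketch_contains?(pages, "Total", 3)      # page number out of range
```

Note that the source only guards against page numbers that are too large. A page_number of 0 would index `cleaned_pages[-1]`, which in Ruby is the last page, so callers should pass positive page numbers.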

.format_diff_output(diffs, expected, actual) ⇒ String

Formats the output of differences for display.



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 174

def format_diff_output(diffs, expected, actual)
  output = []

  changes = group_changed_diffs(diffs)

  # Output each change with context
  changes.each do |change|
    output += format_change expected, actual, change
  end

  output.join("\n")
end

.match?(actual_pdf_thingy, expected_pdf_thingy) ⇒ Boolean

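Compares the text content of two PDF objects page by page, ignoring whitespace. Returns true if every page matches; otherwise prints a difference report and returns false.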



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 99

def match?(actual_pdf_thingy, expected_pdf_thingy)
  actual = Bidi2pdf::TestHelpers::PDFReaderUtils.pdf_text actual_pdf_thingy
  expected = Bidi2pdf::TestHelpers::PDFReaderUtils.pdf_text expected_pdf_thingy

  cleaned_actual = clean_pages(actual)
  cleaned_expected = clean_pages(expected)

  # Compare without whitespace for equality check
  actual_for_comparison = cleaned_actual.map { |text| normalize(text) }
  expected_for_comparison = cleaned_expected.map { |text| normalize(text) }

  if actual_for_comparison == expected_for_comparison
    true
  else
    report_content_mismatch(cleaned_actual, cleaned_expected)
    false
  end
end

.match_expected?(text, expected) ⇒ Boolean

Matches the given text against the expected text or pattern.



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 88

def match_expected?(text, expected)
  return false unless text

  expected.is_a?(Regexp) ? text.match?(expected) : text.include?(expected.to_s)
end

.normalize(text) ⇒ String

Cleans the given text and removes all whitespace for comparison purposes.



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 57

def normalize(text)
  clean(text).gsub(/\s+/, "")
end
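A short sketch of the consequence of removing all whitespace before comparing. `sketch_normalize` is a hypothetical stand-in that performs only the whitespace removal; the real method also runs .clean first.

```ruby
# Hypothetical stand-in for normalize: drops every whitespace character.
# The real method calls clean(text) first; that step is omitted here.
def sketch_normalize(text)
  text.gsub(/\s+/, "")
end

actual   = "Total:  42.00\nEUR"
expected = "Total: 42.00 EUR"

# Layout differences (extra spaces, line breaks) no longer matter.
puts sketch_normalize(actual) == sketch_normalize(expected)

# Word boundaries are ignored as well, so these also compare equal.
puts sketch_normalize("in put") == sketch_normalize("input")
```

The second comparison shows the trade-off: the check is robust against reflowed text, but it cannot detect a missing or extra space between words.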

.print_differences(actual, expected) ⇒ void

This method returns an undefined value.

Prints detailed differences between actual and expected PDF content.



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 133

def print_differences(actual, expected)
  max_pages = [actual.length, expected.length].max

  (0...max_pages).each do |page_idx|
    actual_page = actual[page_idx] || "(missing page)"
    expected_page = expected[page_idx] || "(missing page)"

    print_differences_for_page(actual_page, expected_page, page_idx)
  end
end
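The padding behaviour for PDFs with different page counts can be sketched in isolation. The sample arrays below are made up for illustration; only the `max` / `||` pattern comes from the source.

```ruby
actual   = ["page one", "page two"]
expected = ["page one"]

# Iterate up to the longer of the two documents.
max_pages = [actual.length, expected.length].max

# Indexing past the end of an Array yields nil, so || supplies the placeholder.
pairs = (0...max_pages).map do |page_idx|
  [actual[page_idx] || "(missing page)", expected[page_idx] || "(missing page)"]
end

pairs.each_with_index do |(actual_page, expected_page), idx|
  puts "Page #{idx + 1}: actual=#{actual_page.inspect} expected=#{expected_page.inspect}"
end
```

Because the placeholder is an ordinary string, a missing page is reported as a difference against the literal text "(missing page)" rather than raising an error.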

.print_differences_for_page(actual_page, expected_page, page_idx) ⇒ void

This method returns an undefined value.

Prints the differences between actual and expected content for a specific page. This method compares the content ignoring whitespace and, if differences are found, outputs a formatted representation of those differences.



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 152

def print_differences_for_page(actual_page, expected_page, page_idx)
  # Compare without whitespace
  actual_no_space = normalize(actual_page.to_s)
  expected_no_space = normalize(expected_page.to_s)

  return if actual_no_space == expected_no_space

  puts "\nPage #{page_idx + 1} differences (ignoring whitespace):"

  # Create diffs between the two pages
  diffs = Diff::LCS.sdiff(expected_page, actual_page)

  # Format and display the differences
  puts format_diff_output(diffs, expected_page, actual_page)
end

.report_content_mismatch(actual, expected) ⇒ void

This method returns an undefined value.

Reports differences between actual and expected PDF content.



# File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 123

def report_content_mismatch(actual, expected)
  puts "--- PDF content mismatch ---"
  print_differences(actual, expected)
end