Module: Bidi2pdf::TestHelpers::PDFTextSanitizer
Defined in: lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb
Overview
Provides utilities for sanitizing and comparing PDF text content. This module includes methods for cleaning text, comparing PDF content, and reporting differences between actual and expected PDF outputs.
The sanitization process includes normalizing whitespace, replacing typographic ligatures, and handling other common text formatting issues.
Class Method Summary
- .clean(text) ⇒ String
  Cleans the given text by replacing common typographic ligatures, normalizing whitespace, and removing unnecessary characters.
- .clean_pages(actual_pdf_thingy) ⇒ Array<String>
  Cleans an array of PDF page texts by applying the `clean` method to each page's content.
- .contains?(actual_pdf_thingy, expected, page_number = nil) ⇒ Boolean
  Checks if the given PDF contains the expected text or pattern.
- .format_diff_output(diffs, expected, actual) ⇒ String
  Formats the output of differences for display.
- .match?(actual_pdf_thingy, expected_pdf_thingy) ⇒ Boolean
  Compares the content of two PDF objects for equality.
- .match_expected?(text, expected) ⇒ Boolean
  Matches the given text against the expected text or pattern.
- .normalize(text) ⇒ String
  Cleans the given text and removes all whitespace for comparison purposes.
- .print_differences(actual, expected) ⇒ void
  Prints detailed differences between actual and expected PDF content.
- .print_differences_for_page(actual_page, expected_page, page_idx) ⇒ void
  Prints the differences between actual and expected content for a specific page.
- .report_content_mismatch(actual, expected) ⇒ void
  Reports differences between actual and expected PDF content.
Class Method Details
.clean(text) ⇒ String
Cleans the given text by replacing common typographic ligatures, normalizing whitespace, and removing unnecessary characters.
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 29
    def clean(text)
      text = UnicodeUtils.nfkd(text)
      text.gsub("\uFB01", "fi")
          .gsub("\uFB02", "fl")
          .gsub("-\n", "")
          .gsub('"', '"')
          .gsub("'", "'")
          .gsub("…", "...")
          .gsub("—", "--")
          .gsub("–", "-")
          .gsub(/\s+/, " ") # Replace all whitespace sequences with a single space
          .strip
    end
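To illustrate the effect of the cleaning chain, here is a minimal sketch that uses only Ruby's standard library. The name `clean_sketch` is hypothetical, and `String#unicode_normalize(:nfkd)` stands in for `UnicodeUtils.nfkd` on the assumption that both perform NFKD normalization. It is not the module's own implementation.

```ruby
# Hypothetical stand-in for the documented clean method.
# Assumption: String#unicode_normalize(:nfkd) behaves like UnicodeUtils.nfkd.
def clean_sketch(text)
  text.unicode_normalize(:nfkd)
      .gsub("\uFB01", "fi")     # "fi" ligature
      .gsub("\uFB02", "fl")     # "fl" ligature
      .gsub("-\n", "")          # rejoin words hyphenated across a line break
      .gsub("\u2026", "...")    # horizontal ellipsis
      .gsub("\u2014", "--")     # em dash
      .gsub("\u2013", "-")      # en dash
      .gsub(/\s+/, " ")         # collapse whitespace runs to one space
      .strip
end

raw = "  The \uFB01rst of\uFB02ine docu-\nment \u2013 draft\u2026  "
puts clean_sketch(raw)
# → The first offline document - draft...
```

NFKD normalization already decomposes the ligatures and the ellipsis into their ASCII equivalents, so the explicit replacements for those characters act as a safety net rather than doing the main work.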
.clean_pages(actual_pdf_thingy) ⇒ Array<String>
Extracts the text of each page via Bidi2pdf::TestHelpers::PDFReaderUtils.pdf_text and applies the `clean` method to each page's content.
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 49
    def clean_pages(actual_pdf_thingy)
      Bidi2pdf::TestHelpers::PDFReaderUtils.pdf_text(actual_pdf_thingy).map { |text| clean(text) }
    end
.contains?(actual_pdf_thingy, expected, page_number = nil) ⇒ Boolean
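Checks if the given PDF contains the expected text or pattern. When page_number (1-based) is given, only that page is searched, and the method returns false if page_number exceeds the number of pages. Otherwise all pages are searched.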
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 67
    def contains?(actual_pdf_thingy, expected, page_number = nil)
      pages = Bidi2pdf::TestHelpers::PDFReaderUtils.pdf_text(actual_pdf_thingy)
      cleaned_pages = clean_pages(pages)

      return false if page_number && page_number > cleaned_pages.size

      # Narrow to specific page if requested
      if page_number
        text = cleaned_pages[page_number - 1]
        return match_expected?(text, expected)
      end

      # Search all pages
      cleaned_pages.any? { |page| match_expected?(page, expected) }
    end
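The page-selection logic can be tried without a PDF. This is a rough sketch that works on an array of already-cleaned page strings. The names `contains_sketch?` and `match_expected_sketch?` and the sample pages are hypothetical. The real method first extracts and cleans the text through PDFReaderUtils.

```ruby
# Hypothetical mirror of match_expected?: Regexp or substring match.
def match_expected_sketch?(text, expected)
  return false unless text

  expected.is_a?(Regexp) ? text.match?(expected) : text.include?(expected.to_s)
end

# Hypothetical mirror of contains?, taking cleaned page strings directly.
def contains_sketch?(cleaned_pages, expected, page_number = nil)
  return false if page_number && page_number > cleaned_pages.size
  return match_expected_sketch?(cleaned_pages[page_number - 1], expected) if page_number

  cleaned_pages.any? { |page| match_expected_sketch?(page, expected) }
end

pages = ["Invoice 2024-001", "Total: 42.00 EUR"]

puts contains_sketch?(pages, "Invoice")          # true: found on some page
puts contains_sketch?(pages, /Total: \d+/, 2)    # true: pattern matches page 2
puts contains_sketch?(pages, "Invoice", 2)       # false: text is on page 1 only
puts contains_sketch?(pages, "Invoice", 3)       # false: page 3 does not exist
puts contains_sketch?(pages, 42, 2)              # true: non-Regexp values go through to_s
```

Because pages are indexed with `page_number - 1`, a page_number of 0 would select the last page through Ruby's negative indexing, so callers should pass 1 or greater.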
.format_diff_output(diffs, expected, actual) ⇒ String
Formats the output of differences for display.
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 174
    def format_diff_output(diffs, expected, actual)
      output = []

      changes = group_changed_diffs(diffs)

      # Output each change with context
      changes.each do |change|
        output += format_change expected, actual, change
      end

      output.join("\n")
    end
.match?(actual_pdf_thingy, expected_pdf_thingy) ⇒ Boolean
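Compares the text content of two PDF objects page by page, ignoring all whitespace. Returns true when every page matches. Otherwise it reports the mismatch via report_content_mismatch and returns false.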
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 99
    def match?(actual_pdf_thingy, expected_pdf_thingy)
      actual = Bidi2pdf::TestHelpers::PDFReaderUtils.pdf_text actual_pdf_thingy
      expected = Bidi2pdf::TestHelpers::PDFReaderUtils.pdf_text expected_pdf_thingy

      cleaned_actual = clean_pages(actual)
      cleaned_expected = clean_pages(expected)

      # Compare without whitespace for equality check
      actual_for_comparison = cleaned_actual.map { |text| normalize(text) }
      expected_for_comparison = cleaned_expected.map { |text| normalize(text) }

      if actual_for_comparison == expected_for_comparison
        true
      else
        report_content_mismatch(cleaned_actual, cleaned_expected)
        false
      end
    end
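The whitespace-insensitive comparison has consequences worth seeing on plain strings. In this minimal sketch, `strip_whitespace_sketch` and `pages_match_sketch?` are hypothetical names. The sketch skips the `clean` step that the documented `normalize` runs first and omits the mismatch report.

```ruby
# Hypothetical simplification of normalize: removes whitespace only
# (the documented normalize also applies clean beforehand).
def strip_whitespace_sketch(text)
  text.gsub(/\s+/, "")
end

# Hypothetical mirror of the equality check inside match?.
def pages_match_sketch?(actual_pages, expected_pages)
  actual_pages.map { |t| strip_whitespace_sketch(t) } ==
    expected_pages.map { |t| strip_whitespace_sketch(t) }
end

puts pages_match_sketch?(["Hello  World\n"], ["Hello World"])  # true: spacing differs only
puts pages_match_sketch?(["HelloWorld"], ["Hello World"])      # true: word breaks are ignored
puts pages_match_sketch?(["Hello"], ["Hello", ""])             # false: page counts differ
```

Since all whitespace is removed before comparing, a missing space between two words is not detected. The page arrays are compared element by element, so a PDF with an extra blank page does not match.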
.match_expected?(text, expected) ⇒ Boolean
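Matches the given text against the expected text or pattern. A Regexp is tested with match?, and any other value is converted to a string and tested as a substring. Returns false when text is nil.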
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 88
    def match_expected?(text, expected)
      return false unless text

      expected.is_a?(Regexp) ? text.match?(expected) : text.include?(expected.to_s)
    end
.normalize(text) ⇒ String
Cleans the given text and removes all whitespace for comparison purposes.
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 57
    def normalize(text)
      clean(text).gsub(/\s+/, "")
    end
.print_differences(actual, expected) ⇒ void
This method returns an undefined value.
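Prints detailed differences between actual and expected PDF content, page by page. When one side has fewer pages than the other, the missing pages are compared as the placeholder "(missing page)".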
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 133
    def print_differences(actual, expected)
      max_pages = [actual.length, expected.length].max

      (0...max_pages).each do |page_idx|
        actual_page = actual[page_idx] || "(missing page)"
        expected_page = expected[page_idx] || "(missing page)"

        print_differences_for_page(actual_page, expected_page, page_idx)
      end
    end
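The page-pairing step can be sketched in isolation. The helper name `pair_pages_sketch` is hypothetical. It returns the pairs instead of printing a diff, so the padding behaviour is easy to inspect.

```ruby
# Hypothetical helper: pairs actual and expected pages by index,
# padding the shorter side with the "(missing page)" placeholder.
def pair_pages_sketch(actual, expected)
  max_pages = [actual.length, expected.length].max

  (0...max_pages).map do |page_idx|
    [actual[page_idx] || "(missing page)", expected[page_idx] || "(missing page)"]
  end
end

pairs = pair_pages_sketch(["Page one", "Page two"], ["Page one"])
pairs.each_with_index do |(actual_page, expected_page), idx|
  puts "Page #{idx + 1}: actual=#{actual_page.inspect} expected=#{expected_page.inspect}"
end
```

With this padding, a page that exists on only one side is diffed against the placeholder text, so it is reported as a difference.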
.print_differences_for_page(actual_page, expected_page, page_idx) ⇒ void
This method returns an undefined value.
Prints the differences between actual and expected content for a specific page. This method compares the content ignoring whitespace and, if differences are found, outputs a formatted representation of those differences.
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 152
    def print_differences_for_page(actual_page, expected_page, page_idx)
      # Compare without whitespace
      actual_no_space = normalize(actual_page.to_s)
      expected_no_space = normalize(expected_page.to_s)

      return if actual_no_space == expected_no_space

      puts "\nPage #{page_idx + 1} differences (ignoring whitespace):"

      # Create diffs between the two pages
      diffs = Diff::LCS.sdiff(expected_page, actual_page)

      # Format and display the differences
      puts format_diff_output(diffs, expected_page, actual_page)
    end
.report_content_mismatch(actual, expected) ⇒ void
This method returns an undefined value.
Reports differences between actual and expected PDF content.
    # File 'lib/bidi2pdf/test_helpers/pdf_text_sanitizer.rb', line 123
    def report_content_mismatch(actual, expected)
      puts "--- PDF content mismatch ---"
      print_differences(actual, expected)
    end