Class: Slaw::Parse::Cleanser
- Inherits:
-
Object
- Object
- Slaw::Parse::Cleanser
- Defined in:
- lib/slaw/parse/cleanser.rb
Overview
Helper class to run various cleanup routines on plain text.
Some of these routines can safely be run multiple times, others are meant to be run only once.
Instance Method Summary collapse
-
#break_lines(s) ⇒ Object
Make educated guesses about lines that should have been broken but haven’t, and break them.
-
#chomp(s) ⇒ Object
Get rid of whitespace at the end of lines and at the start and end of the entire string.
-
#cleanup(s) ⇒ Object
Run general cleanup, such as stripping bad chars and removing unnecessary whitespace.
-
#correct_newlines(s) ⇒ Object
line endings.
- #enforce_newline(s) ⇒ Object
-
#expand_tabs(s) ⇒ Object
tabs to spaces.
-
#fix_quotes(s) ⇒ Object
change weird quotes to normal ones.
-
#reformat(s) ⇒ Object
Run deeper introspections and reformat the text, such as unwrapping/re-wrapping lines.
-
#remove_boilerplate(s) ⇒ Object
Try to remove boilerplate lines found in many files, such as page numbers.
-
#remove_empty_lines(s) ⇒ Object
————————————————————————.
-
#scrub(s) ⇒ Object
strip invalid bytes and ones we don’t like.
-
#strip_toc(s) ⇒ Object
Do our best to remove table of contents at the start, it really confuses the grammer.
-
#unbreak_lines(s) ⇒ Object
Find likely candidates for unnecessarily broken lines and unbreaks them.
Instance Method Details
#break_lines(s) ⇒ Object
Make educated guesses about lines that should have been broken but haven’t, and break them.
This is very dependent on a locale’s legislation grammar, there are lots of rules of thumb that make this work.
102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 |
# File 'lib/slaw/parse/cleanser.rb', line 102 def break_lines(s) # often we find a section title munged onto the same line as its first statement # eg: # foo bar. New section title 62. (1) For the purpose s = s.gsub(/\. ([^.]+) (\d+\. \(1\) )/, ".\n" + '\1' + "\n" + '\2') # New section title 62. (1) For the purpose s = s.gsub(/(\w) (\d+\. \(1\) )/, '\1' + "\n" + '\2') # (1) foo; (2) bar # (1) foo. (2) bar s = s.gsub(/(\w{3,}[;.]) (\([0-9a-z]+\))/, "\\1\n\\2") # (1) foo; and (2) bar # (1) foo; or (2) bar s = s.gsub(/; (and|or) \(/, "; \\1\n(") # The officer-in-Charge may – (a) remove all withered natural... \n(b) # We do this last, because by now we should have reconised that (b) should already # be on a new line. s = s.gsub(/ (\(a\) .+?\n\(b\))/, "\n\\1") # "foo" means ...; "bar" means s = s.gsub(/; (["”“][^"”“]+?["”“] means)/, ";\n\\1") # CHAPTER 4 PARKING METER PARKING GROUNDS Place of parking s = s.gsub(/([A-Z0-9 ]{5,}) ([A-Z][a-z ]{5,})/, "\\1\n\\2") s end |
#chomp(s) ⇒ Object
Get rid of whitespace at the end of lines and at the start and end of the entire string.
84 85 86 87 88 89 90 |
# File 'lib/slaw/parse/cleanser.rb', line 84 def chomp(s) # trailing whitespace at end of lines s = s.gsub(/ +$/, '') # whitespace on either side s.strip end |
#cleanup(s) ⇒ Object
Run general cleanup, such as stripping bad chars and removing unnecessary whitespace. This is idempotent and safe to run multiple times.
14 15 16 17 18 19 20 21 22 |
# File 'lib/slaw/parse/cleanser.rb', line 14 def cleanup(s) s = scrub(s) s = correct_newlines(s) s = fix_quotes(s) s = (s) s = chomp(s) s = enforce_newline(s) s = remove_boilerplate(s) end |
#correct_newlines(s) ⇒ Object
line endings
41 42 43 44 |
# File 'lib/slaw/parse/cleanser.rb', line 41 def correct_newlines(s) s.gsub(/\r\n/, "\n")\ .gsub(/\r/, "\n") end |
#enforce_newline(s) ⇒ Object
92 93 94 95 |
# File 'lib/slaw/parse/cleanser.rb', line 92 def enforce_newline(s) # ensure string ends with a newline s.end_with?("\n") ? s : (s + "\n") end |
#expand_tabs(s) ⇒ Object
tabs to spaces
59 60 61 |
# File 'lib/slaw/parse/cleanser.rb', line 59 def (s) s.gsub(/\t/, ' ') end |
#fix_quotes(s) ⇒ Object
change weird quotes to normal ones
54 55 56 |
# File 'lib/slaw/parse/cleanser.rb', line 54 def fix_quotes(s) s.gsub(/‘‘|’’|''/, '"') end |
#reformat(s) ⇒ Object
Run deeper introspections and reformat the text, such as unwrapping/re-wrapping lines. These may not be safe to run multiple times.
27 28 29 30 31 32 |
# File 'lib/slaw/parse/cleanser.rb', line 27 def reformat(s) s = unbreak_lines(s) s = break_lines(s) s = strip_toc(s) s = enforce_newline(s) end |
#remove_boilerplate(s) ⇒ Object
Try to remove boilerplate lines found in many files, such as page numbers.
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 |
# File 'lib/slaw/parse/cleanser.rb', line 64 def remove_boilerplate(s) # nuke any line to do with Sabinet and the government printer s.gsub(/^.*Sabinet.*Government Printer.*$/i, '')\ .gsub(/^.*Provincial Gazette \d+.*$/i, '')\ .gsub(/^.*Provinsiale Koerant \d+.*$/i, '')\ .gsub(/^.*PROVINCIAL GAZETTE.*$/, '')\ .gsub(/^.*PROVINSIALE KOERANT.*$/, '')\ .gsub(/^\s*\d+\s*$/, '')\ .gsub(/^.*This gazette is also available.*$/, '')\ # get rid of date lines .gsub(/^\d+\s+\w+\s+\d+$/, '')\ # get rid of page number lines .gsub(/^\s*page \d+( of \d+)?\s*\n/i, '')\ .gsub(/^\s*\d*\s*No\. \d+$/, '')\ # get rid of lines with lots of ____ or ---- chars, they're usually pagebreaks .gsub(/^.*[_-]{5}.*$/, '') end |
#remove_empty_lines(s) ⇒ Object
36 37 38 |
# File 'lib/slaw/parse/cleanser.rb', line 36 def remove_empty_lines(s) s.gsub(/\n\s*$/, '') end |
#scrub(s) ⇒ Object
strip invalid bytes and ones we don’t like
47 48 49 50 51 |
# File 'lib/slaw/parse/cleanser.rb', line 47 def scrub(s) # we often get this unicode codepoint in the string, nuke it s.gsub([65532].pack('U*'), '')\ .gsub("", '') end |
#strip_toc(s) ⇒ Object
Do our best to remove table of contents at the start, it really confuses the grammer.
161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 |
# File 'lib/slaw/parse/cleanser.rb', line 161 def strip_toc(s) # first, try to find 'TABLE OF CONTENTS' anywhere within the first 4K of text, if toc_start = s[0..4096].match(/TABLE OF CONTENTS/i) # grab the first non-blank line after that, it's our end-of-TOC marker if eol = s.match(/^(.+?)$/, toc_start.end(0)) marker = eol[0] # search for the first line that is a prefix of marker (or vv), and delete # everything in between posn = eol.end(0) while m = s.match(/^(.+?)$/, posn) if marker.start_with?(m[0]) or m[0].start_with?(marker) return s[0...toc_start.begin(0)] + s[m.begin(0)..-1] end posn = m.end(0) end end end s end |
#unbreak_lines(s) ⇒ Object
Find likely candidates for unnecessarily broken lines and unbreaks them.
135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 |
# File 'lib/slaw/parse/cleanser.rb', line 135 def unbreak_lines(s) lines = s.split(/\n/) output = [] start_re = /^\s*[a-z]/ end_re = /[a-z0-9]\s*$/ prev = nil lines.each_with_index do |line, i| if i == 0 output << line else prev = output[-1] if line =~ start_re and prev =~ end_re output[-1] = prev + ' ' + line else output << line end end end output.join("\n") end |