Class: Slaw::Parse::Cleanser
- Inherits: Object
  - Object
    - Slaw::Parse::Cleanser
- Defined in: lib/slaw/parse/cleanser.rb
Overview
Helper class to run various cleanup routines on plain text.
Some of these routines can safely be run multiple times, others are meant to be run only once.
Instance Method Summary
-
#break_lines(s) ⇒ Object
Make educated guesses about lines that should have been broken but haven’t, and break them.
-
#chomp(s) ⇒ Object
Get rid of whitespace at the end of lines and at the start and end of the entire string.
-
#cleanup(s) ⇒ Object
Run general cleanup, such as stripping bad chars and removing unnecessary whitespace.
-
#correct_newlines(s) ⇒ Object
Convert all line endings to \n.
-
#enforce_newline(s) ⇒ Object
Ensure the string ends with a newline.
-
#expand_tabs(s) ⇒ Object
Convert tabs and non-breaking spaces to plain spaces.
-
#fix_quotes(s) ⇒ Object
Change doubled single quotes (‘‘, ’’ or '') to a plain double quote.
-
#reformat(s) ⇒ Object
Run deeper introspections and reformat the text, such as unwrapping/re-wrapping lines.
-
#remove_boilerplate(s) ⇒ Object
Try to remove boilerplate lines found in many files, such as page numbers.
-
#remove_empty_lines(s) ⇒ Object
Remove empty and whitespace-only lines.
-
#scrub(s) ⇒ Object
Strip invalid bytes and ones we don’t like.
-
#strip_toc(s) ⇒ Object
Do our best to remove the table of contents at the start; it really confuses the grammar.
-
#unbreak_lines(s) ⇒ Object
Find likely candidates for unnecessarily broken lines and unbreak them.
Instance Method Details
#break_lines(s) ⇒ Object
Make educated guesses about lines that should have been broken but haven’t, and break them.
This depends heavily on a locale’s legislation grammar and relies on many rules of thumb.
# File 'lib/slaw/parse/cleanser.rb', line 103

def break_lines(s)
  # often we find a section title munged onto the same line as its first statement
  # eg:
  # foo bar. New section title 62. (1) For the purpose
  s = s.gsub(/\. ([^.]+) (\d+\. ?\(1\) )/, ".\n" + '\1' + "\n" + '\2')

  # New section title 62. (1) For the purpose
  s = s.gsub(/(\w) (\d+\. ?\(1\) )/, '\1' + "\n" + '\2')

  # (1) foo; (2) bar
  # (1) foo. (2) bar
  s = s.gsub(/(\w{3,}[;.]) (\([0-9a-z]+\))/, "\\1\n\\2")

  # (1) foo; and (2) bar
  # (1) foo; or (2) bar
  s = s.gsub(/; (and|or) \(/, "; \\1\n(")

  # The officer-in-Charge may – (a) remove all withered natural... \n(b)
  # We do this last, because by now we should have recognised that (b) should already
  # be on a new line.
  s = s.gsub(/ (\(a\) .+?\n\(b\))/, "\n\\1")

  # "foo" means ...; "bar" means
  s = s.gsub(/; (["”“][^"”“]+?["”“] means)/, ";\n\\1")

  # CHAPTER 4 PARKING METER PARKING GROUNDS Place of parking
  s = s.gsub(/([A-Z0-9 ]{5,}) ([A-Z][a-z ]{5,})/, "\\1\n\\2")

  s
end
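To make the rules of thumb more concrete, here is a minimal sketch that applies just two of the substitutions from break_lines to invented sample text. The helper name break_numbered_items is hypothetical and not part of Slaw; only the two regular expressions are taken from the listing above.

```ruby
# Hypothetical helper: applies two of the break_lines rules in isolation.
def break_numbered_items(s)
  # "(1) foo; (2) bar" or "(1) foo. (2) bar": break before the next item
  s = s.gsub(/(\w{3,}[;.]) (\([0-9a-z]+\))/, "\\1\n\\2")

  # "(a) foo; and (b) bar": break after the trailing "and" / "or"
  s.gsub(/; (and|or) \(/, "; \\1\n(")
end

puts break_numbered_items("(1) keep records; (2) submit returns")
# (1) keep records;
# (2) submit returns

puts break_numbered_items("(a) sell goods; or (b) hire staff")
# (a) sell goods; or
# (b) hire staff
```

The first rule requires at least three word characters before the punctuation, which presumably keeps short abbreviations from triggering a break.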
#chomp(s) ⇒ Object
Get rid of whitespace at the end of lines and at the start and end of the entire string.
# File 'lib/slaw/parse/cleanser.rb', line 85

def chomp(s)
  # trailing whitespace at end of lines
  s = s.gsub(/ +$/, '')

  # whitespace on either side
  s.strip
end
#cleanup(s) ⇒ Object
Run general cleanup, such as stripping bad chars and removing unnecessary whitespace. This is idempotent and safe to run multiple times.
# File 'lib/slaw/parse/cleanser.rb', line 14

def cleanup(s)
  s = scrub(s)
  s = correct_newlines(s)
  s = fix_quotes(s)
  s = expand_tabs(s)
  s = chomp(s)
  s = enforce_newline(s)
end
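The idempotence claim can be checked with a small standalone sketch. It assumes the helper bodies shown on this page, copied into free functions; chomp_text and cleanup_sketch are hypothetical names, the scrub step is left out, and the sample text is invented.

```ruby
# Standalone copies of the cleanup helpers (scrub is deliberately omitted).
def correct_newlines(s)
  s.gsub(/\r\n/, "\n").gsub(/\r/, "\n")
end

def fix_quotes(s)
  s.gsub(/‘‘|’’|''/, '"')
end

def expand_tabs(s)
  s.gsub(/\t/, ' ').gsub("\u00A0", ' ')
end

# Hypothetical name, chosen to avoid shadowing Kernel#chomp.
def chomp_text(s)
  s.gsub(/ +$/, '').strip
end

def enforce_newline(s)
  s.end_with?("\n") ? s : (s + "\n")
end

# Hypothetical composition mirroring the order used in cleanup.
def cleanup_sketch(s)
  s = correct_newlines(s)
  s = fix_quotes(s)
  s = expand_tabs(s)
  s = chomp_text(s)
  enforce_newline(s)
end

raw  = "1. Short title\t \r\n''Act'' means this Act.  \r\n\r\n"
once = cleanup_sketch(raw)

p once                          # => "1. Short title\n\"Act\" means this Act.\n"
p cleanup_sketch(once) == once  # => true
```

A second pass finds nothing left to change, because each step only removes or normalises characters that the first pass has already dealt with.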
#correct_newlines(s) ⇒ Object
Convert all line endings to \n.

# File 'lib/slaw/parse/cleanser.rb', line 41

def correct_newlines(s)
  s.gsub(/\r\n/, "\n")\
   .gsub(/\r/, "\n")
end
#enforce_newline(s) ⇒ Object
Ensure the string ends with a newline.

# File 'lib/slaw/parse/cleanser.rb', line 93

def enforce_newline(s)
  # ensure string ends with a newline
  s.end_with?("\n") ? s : (s + "\n")
end
#expand_tabs(s) ⇒ Object
Convert tabs and non-breaking spaces to plain spaces.

# File 'lib/slaw/parse/cleanser.rb', line 59

def expand_tabs(s)
  s.gsub(/\t/, ' ')\
   .gsub("\u00A0", ' ') # non-breaking space
end
#fix_quotes(s) ⇒ Object
Change doubled single quotes (‘‘, ’’ or '') to a plain double quote.

# File 'lib/slaw/parse/cleanser.rb', line 54

def fix_quotes(s)
  s.gsub(/‘‘|’’|''/, '"')
end
#reformat(s) ⇒ Object
Run deeper introspections and reformat the text, such as unwrapping/re-wrapping lines. These may not be safe to run multiple times.
# File 'lib/slaw/parse/cleanser.rb', line 26

def reformat(s)
  s = remove_boilerplate(s)
  s = unbreak_lines(s)
  s = break_lines(s)
  s = strip_toc(s)
  s = enforce_newline(s)
end
#remove_boilerplate(s) ⇒ Object
Try to remove boilerplate lines found in many files, such as page numbers.
# File 'lib/slaw/parse/cleanser.rb', line 65

def remove_boilerplate(s)
  # nuke any line to do with Sabinet and the government printer
  s.gsub(/^.*Sabinet.*Government Printer.*$/i, '')\
   .gsub(/^.*Provincial Gazette \d+.*$/i, '')\
   .gsub(/^.*Provinsiale Koerant \d+.*$/i, '')\
   .gsub(/^.*PROVINCIAL GAZETTE.*$/, '')\
   .gsub(/^.*PROVINSIALE KOERANT.*$/, '')\
   .gsub(/^\s*\d+\s*$/, '')\
   .gsub(/^.*This gazette is also available.*$/, '')\
  # get rid of date lines
   .gsub(/^\d{1,2}\s+\w+\s+\d{4}$/, '')\
  # get rid of page number lines
   .gsub(/^\s*page \d+( of \d+)?\s*\n/i, '')\
   .gsub(/^\s*\d*\s*No\. \d+$/, '')\
  # get rid of lines with lots of ____ or ---- chars, they're usually pagebreaks
   .gsub(/^.*[_-]{5}.*$/, '')
end
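As an illustration, the sketch below runs three of the patterns from remove_boilerplate over invented sample text. The helper name drop_page_furniture is hypothetical; the regular expressions are copied from the listing above and applied in the same relative order.

```ruby
# Hypothetical helper: three remove_boilerplate patterns in isolation.
def drop_page_furniture(s)
  # bare page numbers on a line of their own
  s = s.gsub(/^\s*\d+\s*$/, '')

  # "Page 3 of 10" lines, including their newline
  s = s.gsub(/^\s*page \d+( of \d+)?\s*\n/i, '')

  # rules made of ____ or ---- characters
  s.gsub(/^.*[_-]{5}.*$/, '')
end

text = "1. Title\n12\nPage 3 of 10\n----------\n2. Body\n"
p drop_page_furniture(text)  # => "1. Title\n\n2. Body\n"
```

Most of the patterns blank out a line's content but keep its newline, so an empty line can be left behind where the boilerplate used to be.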
#remove_empty_lines(s) ⇒ Object
Remove empty and whitespace-only lines.

# File 'lib/slaw/parse/cleanser.rb', line 36

def remove_empty_lines(s)
  s.gsub(/\n\s*$/, '')
end
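A short sketch of what this substitution does in practice, using the method body from the listing above as a free function on invented input:

```ruby
# Standalone copy of the method body shown above.
def remove_empty_lines(s)
  s.gsub(/\n\s*$/, '')
end

p remove_empty_lines("a\n\n\nb\n")  # => "a\nb"
p remove_empty_lines("a\n  \nb")    # => "a\nb"
```

The pattern also matches the final newline of the string, so the result has no trailing newline. A caller that needs one would have to add it back, for example with enforce_newline.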
#scrub(s) ⇒ Object
Strip invalid bytes and ones we don’t like.

# File 'lib/slaw/parse/cleanser.rb', line 47

def scrub(s)
  # we often get this unicode codepoint in the string, nuke it
  s.gsub([65532].pack('U*'), '')\
   .gsub(/\n*/, '')
end
#strip_toc(s) ⇒ Object
Do our best to remove the table of contents at the start; it really confuses the grammar.

# File 'lib/slaw/parse/cleanser.rb', line 174

def strip_toc(s)
  # first, try to find 'TABLE OF CONTENTS' anywhere within the first 4K of text,
  if toc_start = s[0..4096].match(/TABLE OF CONTENTS/i)

    # grab the first non-blank line after that, it's our end-of-TOC marker
    if eol = s.match(/^(.+?)$/, toc_start.end(0))
      marker = eol[0]

      # search for the first line that is a prefix of marker (or vice versa), and delete
      # everything in between
      posn = eol.end(0)
      while m = s.match(/^(.+?)$/, posn)
        if marker.start_with?(m[0]) or m[0].start_with?(marker)
          return s[0...toc_start.begin(0)] + s[m.begin(0)..-1]
        end

        posn = m.end(0)
      end
    end
  end

  s
end
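The marker search is easier to follow on a small input. The sketch below copies the method into a free function, with comments condensed, and runs it on an invented by-law whose first body heading repeats the first table-of-contents entry.

```ruby
# Standalone copy of strip_toc for experimentation (comments condensed).
def strip_toc(s)
  if toc_start = s[0..4096].match(/TABLE OF CONTENTS/i)
    # first non-blank line after the heading is the end-of-TOC marker
    if eol = s.match(/^(.+?)$/, toc_start.end(0))
      marker = eol[0]
      posn = eol.end(0)
      # look for a later line that shares a prefix with the marker
      while m = s.match(/^(.+?)$/, posn)
        if marker.start_with?(m[0]) or m[0].start_with?(marker)
          return s[0...toc_start.begin(0)] + s[m.begin(0)..-1]
        end
        posn = m.end(0)
      end
    end
  end
  s
end

text = "BY-LAW\nTABLE OF CONTENTS\n1. Definitions\n2. Parking\n" \
       "1. Definitions\nIn this by-law...\n"

p strip_toc(text)  # => "BY-LAW\n1. Definitions\nIn this by-law...\n"
```

If no later line matches the marker, the loop runs out of lines and the text is returned unchanged.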
#unbreak_lines(s) ⇒ Object
Find likely candidates for unnecessarily broken lines and unbreak them.

# File 'lib/slaw/parse/cleanser.rb', line 136

def unbreak_lines(s)
  lines = s.split(/\n/)
  output = []

  # set of regex matcher pairs, one for the prev line, one for the current line
  matchers = [
    [/[a-z0-9]$/, /^\s*[a-z]/],  # line ends with and starts with lowercase
    [/;$/, /^\s*(and|or)/],      # ends with ; then and/or on new line
  ]

  prev = nil
  lines.each_with_index do |line, i|
    if i == 0
      output << line
    else
      prev = output[-1]
      unbreak = false

      for prev_re, curr_re in matchers
        if prev =~ prev_re and line =~ curr_re
          unbreak = true
          break
        end
      end

      if unbreak
        output[-1] = prev + ' ' + line
      else
        output << line
      end
    end
  end

  output.join("\n")
end
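The matcher pairs can be tried out with a compact sketch. unbreak_sketch is a hypothetical, condensed rewrite of the loop above, intended to behave the same way on ordinary input; the sample text is invented.

```ruby
# Hypothetical condensed version of unbreak_lines, using the same matcher pairs.
def unbreak_sketch(s)
  matchers = [
    [/[a-z0-9]$/, /^\s*[a-z]/],  # prev ends and current starts with lowercase
    [/;$/, /^\s*(and|or)/],      # prev ends with ";" and current starts with and/or
  ]

  output = []
  s.split(/\n/).each do |line|
    prev = output.last
    if prev && matchers.any? { |prev_re, curr_re| prev =~ prev_re && line =~ curr_re }
      output[-1] = prev + ' ' + line  # glue onto the previous output line
    else
      output << line
    end
  end
  output.join("\n")
end

text = "The officer may remove\nall withered flowers;\nor\n(b) any wreath.\nNext Section"
puts unbreak_sketch(text)
# The officer may remove all withered flowers; or
# (b) any wreath.
# Next Section
```

Each line is compared against the last line of the output so far rather than against the previous input line, which lets a run of several broken lines collapse into a single line.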