Class: Slaw::Parse::Cleanser
- Inherits: Object
  - Object
    - Slaw::Parse::Cleanser
- Defined in: lib/slaw/parse/cleanser.rb
Overview
Helper class to run various cleanup routines on plain text.
Some of these routines can safely be run multiple times, others are meant to be run only once.
Instance Method Summary
-
#chomp(s) ⇒ Object
Get rid of whitespace at the end of lines and at the start and end of the entire string.
-
#cleanup(s) ⇒ Object
Run general cleanup, such as stripping bad chars and removing unnecessary whitespace.
-
#correct_newlines(s) ⇒ Object
Normalize line endings: convert \r\n and bare \r to \n.
-
#enforce_newline(s) ⇒ Object
Ensure the string ends with a newline.
-
#expand_tabs(s) ⇒ Object
Convert tabs and non-breaking spaces to single spaces.
-
#remove_empty_lines(s) ⇒ Object
Remove empty and whitespace-only lines.
-
#scrub(s) ⇒ Object
Strip invalid bytes and ones we don’t like.
Instance Method Details
#chomp(s) ⇒ Object
Get rid of whitespace at the end of lines and at the start and end of the entire string.
# File 'lib/slaw/parse/cleanser.rb', line 49
def chomp(s)
  # trailing whitespace at end of lines
  s = s.gsub(/ +$/, '')

  # whitespace on either side
  s.strip
end
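To make the two stages of #chomp concrete, here is a minimal sketch. `chomp_sketch` is a hypothetical stand-in that copies the method body shown above so the example runs without the slaw gem. Note that the pattern as written matches trailing spaces only, so a trailing tab inside the text would survive the first stage.

```ruby
# Hypothetical stand-in mirroring the documented #chomp body;
# it is not loaded from the slaw gem.
def chomp_sketch(s)
  # Stage 1: drop runs of spaces at the end of every line.
  s = s.gsub(/ +$/, '')
  # Stage 2: drop whitespace (including newlines) at both ends of the string.
  s.strip
end

text = "  Chapter 1  \nDefinitions   \n\n"
p chomp_sketch(text) # => "Chapter 1\nDefinitions"
```

The result has no trailing newline, which is presumably why #cleanup runs #enforce_newline afterwards.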
#cleanup(s) ⇒ Object
Run general cleanup, such as stripping bad chars and removing unnecessary whitespace. This is idempotent and safe to run multiple times.
# File 'lib/slaw/parse/cleanser.rb', line 14
def cleanup(s)
  s = scrub(s)
  s = correct_newlines(s)
  s = expand_tabs(s)
  s = chomp(s)
  s = enforce_newline(s)
end
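The idempotence claim can be checked with a small sketch of the pipeline. Assumptions: `CleanserSketch` is a hypothetical stand-in class, not the real Slaw::Parse::Cleanser; `expand_tabs` is taken to be the third step; and `scrub` here reproduces only the U+FFFC removal, not the second substitution in the documented method.

```ruby
# Hypothetical stand-in for illustration only; the real class is defined
# in lib/slaw/parse/cleanser.rb.
class CleanserSketch
  def cleanup(s)
    s = scrub(s)
    s = correct_newlines(s)
    s = expand_tabs(s)
    s = chomp(s)
    enforce_newline(s)
  end

  # Assumption: only the U+FFFC (codepoint 65532) removal is reproduced.
  def scrub(s)
    s.gsub("\uFFFC", '')
  end

  def correct_newlines(s)
    s.gsub(/\r\n/, "\n").gsub(/\r/, "\n")
  end

  def expand_tabs(s)
    s.gsub(/\t/, ' ').gsub("\u00A0", ' ')
  end

  def chomp(s)
    s.gsub(/ +$/, '').strip
  end

  def enforce_newline(s)
    s.end_with?("\n") ? s : (s + "\n")
  end
end

cleanser = CleanserSketch.new
raw = "1.\tShort title \r\nThis Act\u00A0applies.\uFFFC  \r"

once  = cleanser.cleanup(raw)
twice = cleanser.cleanup(once)

p once          # => "1. Short title\nThis Act applies.\n"
p once == twice # => true
```

Running the pipeline on its own output changes nothing, which is what "idempotent" means for this input.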
#correct_newlines(s) ⇒ Object
Normalize line endings: convert \r\n and bare \r to \n.
# File 'lib/slaw/parse/cleanser.rb', line 29
def correct_newlines(s)
  s.gsub(/\r\n/, "\n")\
   .gsub(/\r/, "\n")
end
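The order of the two substitutions matters. The sketch below copies the method body into a hypothetical `correct_newlines_sketch` and compares it with a reversed variant (`reversed_order`, also hypothetical) to show why \r\n has to be handled first.

```ruby
# Hypothetical stand-in mirroring the documented #correct_newlines body.
def correct_newlines_sketch(s)
  s.gsub(/\r\n/, "\n").gsub(/\r/, "\n")
end

# Same substitutions in the opposite order, for comparison only.
def reversed_order(s)
  s.gsub(/\r/, "\n").gsub(/\r\n/, "\n")
end

mixed = "one\r\ntwo\rthree\n"

p correct_newlines_sketch(mixed) # => "one\ntwo\nthree\n"
p reversed_order(mixed)          # => "one\n\ntwo\nthree\n"
```

Replacing bare \r first would turn each Windows-style \r\n into two newlines and introduce an empty line.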
#enforce_newline(s) ⇒ Object
# File 'lib/slaw/parse/cleanser.rb', line 57
def enforce_newline(s)
  # ensure string ends with a newline
  s.end_with?("\n") ? s : (s + "\n")
end
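A short sketch of the behaviour, using a hypothetical `enforce_newline_sketch` that copies the method body above: text without a final newline gains one, and text that already ends in a newline is returned unchanged.

```ruby
# Hypothetical stand-in mirroring the documented #enforce_newline body.
def enforce_newline_sketch(s)
  s.end_with?("\n") ? s : (s + "\n")
end

p enforce_newline_sketch("Short title")   # => "Short title\n"
p enforce_newline_sketch("Short title\n") # => "Short title\n"
```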
#expand_tabs(s) ⇒ Object
Convert tabs and non-breaking spaces to single spaces.
# File 'lib/slaw/parse/cleanser.rb', line 42
def expand_tabs(s)
  s.gsub(/\t/, ' ')\
   .gsub("\u00A0", ' ') # non-breaking space
end
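Despite the name, the substitutions shown above replace each tab with exactly one space rather than padding to a tab stop. A minimal sketch, using a hypothetical `expand_tabs_sketch` with the same two substitutions:

```ruby
# Hypothetical stand-in using the same two substitutions as the documented method.
def expand_tabs_sketch(s)
  s.gsub(/\t/, ' ').gsub("\u00A0", ' ') # U+00A0 is the non-breaking space
end

p expand_tabs_sketch("Part\tI\u00A0Preliminary") # => "Part I Preliminary"
p expand_tabs_sketch("a\t\tb")                   # => "a  b"
```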
#remove_empty_lines(s) ⇒ Object
# File 'lib/slaw/parse/cleanser.rb', line 24
def remove_empty_lines(s)
  s.gsub(/\n\s*$/, '')
end
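The pattern removes a newline together with any whitespace up to the end of the following line, which collapses blank and whitespace-only lines. The sketch below copies the method body into a hypothetical `remove_empty_lines_sketch` to show the effect on a small input.

```ruby
# Hypothetical stand-in mirroring the documented #remove_empty_lines body.
def remove_empty_lines_sketch(s)
  s.gsub(/\n\s*$/, '')
end

text = "Section 1\n\n\nSection 2\n  \nSection 3\n"
p remove_empty_lines_sketch(text) # => "Section 1\nSection 2\nSection 3"
```

For this input the final newline is removed as well, so a caller that needs a terminating newline would have to add it back afterwards.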
#scrub(s) ⇒ Object
Strip invalid bytes and ones we don’t like.
# File 'lib/slaw/parse/cleanser.rb', line 35
def scrub(s)
  # we often get this unicode codepoint in the string, nuke it
  s.gsub([65532].pack('U*'), '')\
   .gsub(/\n*/, '')
end
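The first substitution targets codepoint 65532, which is U+FFFC (OBJECT REPLACEMENT CHARACTER). The sketch below illustrates only that first step; the variable name `object_replacement` is introduced here for illustration.

```ruby
# Array#pack with the 'U*' directive encodes the codepoints as a UTF-8 string.
object_replacement = [65532].pack('U*')

p object_replacement == "\uFFFC" # => true

# Removing the character from a sample string.
p "Section 1\uFFFC applies".gsub(object_replacement, '') # => "Section 1 applies"
```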