Class: Slaw::Parse::Cleanser

Inherits:
Object
Defined in:
lib/slaw/parse/cleanser.rb

Overview

Helper class to run various cleanup routines on plain text.

Some of these routines can safely be run multiple times; others are meant to be run only once.

Instance Method Details

#chomp(s) ⇒ Object

Remove trailing spaces from the end of each line, then strip whitespace from the start and end of the entire string.



# File 'lib/slaw/parse/cleanser.rb', line 49

def chomp(s)
  # trailing whitespace at end of lines
  s = s.gsub(/ +$/, '')

  # whitespace on either side
  s.strip
end
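
As a minimal sketch of the behaviour described above, the following standalone copy of the method (reproduced outside the class purely for illustration) shows that only trailing spaces are removed per line. A trailing tab on an inner line survives unless tabs have been expanded first.

```ruby
# Standalone copy of the listing above, for illustration only.
def chomp(s)
  # trailing spaces at end of lines, then whitespace around the whole string
  s.gsub(/ +$/, '').strip
end

text = "  first line   \nsecond line\t\nthird  \n\n"
p chomp(text)
# => "first line\nsecond line\t\nthird"
```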

#cleanup(s) ⇒ Object

Run general cleanup, such as stripping bad chars and removing unnecessary whitespace. This is idempotent and safe to run multiple times.



# File 'lib/slaw/parse/cleanser.rb', line 14

def cleanup(s)
  s = scrub(s)
  s = correct_newlines(s)
  s = expand_tabs(s)
  s = chomp(s)
  s = enforce_newline(s)
end
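
The idempotence claim can be checked with a small sketch. It assumes the step bodies shown elsewhere on this page and chains them in the same order as `cleanup`. The `scrub` step is deliberately left out, and the names `STEPS` and `run_steps` are hypothetical helpers, not part of the library.

```ruby
# Each lambda mirrors one listed method; scrub is intentionally omitted.
STEPS = [
  ->(s) { s.gsub(/\r\n/, "\n").gsub(/\r/, "\n") }, # correct_newlines
  ->(s) { s.gsub(/\t/, ' ').gsub("\u00A0", ' ') }, # expand_tabs
  ->(s) { s.gsub(/ +$/, '').strip },               # chomp
  ->(s) { s.end_with?("\n") ? s : (s + "\n") }     # enforce_newline
].freeze

# Hypothetical helper: apply the steps in order.
def run_steps(s)
  STEPS.reduce(s) { |acc, step| step.call(acc) }
end

once  = run_steps("Section 1\t\r\nText here  \r\n")
twice = run_steps(once)

p once          # => "Section 1\nText here\n"
p once == twice # => true
```

Running the chain a second time changes nothing here because every step leaves already-clean input untouched.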

#correct_newlines(s) ⇒ Object

Normalize line endings: convert CRLF (`\r\n`) and bare CR (`\r`) to LF (`\n`).



# File 'lib/slaw/parse/cleanser.rb', line 29

def correct_newlines(s)
  s.gsub(/\r\n/, "\n")\
   .gsub(/\r/, "\n")
end
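
A short sketch of the normalization, using a standalone copy of the method for illustration. Replacing `\r\n` before `\r` matters: the reverse order would turn each CRLF pair into two newlines.

```ruby
# Standalone copy of the listing above, for illustration only.
def correct_newlines(s)
  s.gsub(/\r\n/, "\n").gsub(/\r/, "\n")
end

mixed = "windows\r\nold mac\runix\n"
p correct_newlines(mixed)
# => "windows\nold mac\nunix\n"
```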

#enforce_newline(s) ⇒ Object

Ensure the string ends with a newline.



# File 'lib/slaw/parse/cleanser.rb', line 57

def enforce_newline(s)
  # ensure string ends with a newline
  s.end_with?("\n") ? s : (s + "\n")
end
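
A brief sketch of the three cases, again using a standalone copy of the method: a string without a trailing newline gains one, a string that already ends in a newline is returned unchanged, and the empty string becomes a single newline.

```ruby
# Standalone copy of the listing above, for illustration only.
def enforce_newline(s)
  s.end_with?("\n") ? s : (s + "\n")
end

p enforce_newline("abc")   # => "abc\n"
p enforce_newline("abc\n") # => "abc\n"
p enforce_newline("")      # => "\n"
```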

#expand_tabs(s) ⇒ Object

Replace each tab and each non-breaking space (U+00A0) with a single space.



# File 'lib/slaw/parse/cleanser.rb', line 42

def expand_tabs(s)
  s.gsub(/\t/, ' ')\
   .gsub("\u00A0", ' ') # non-breaking space
end
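
The sketch below, based on a standalone copy of the method, illustrates that the replacement is one-for-one: a tab becomes exactly one space and is not aligned to a tab stop.

```ruby
# Standalone copy of the listing above, for illustration only.
def expand_tabs(s)
  s.gsub(/\t/, ' ').gsub("\u00A0", ' ') # tab, then non-breaking space
end

p expand_tabs("a\tb\u00A0c") # => "a b c"
p expand_tabs("\t\tx")       # => "  x"
```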

#remove_empty_lines(s) ⇒ Object

Remove empty and whitespace-only lines, including a trailing newline at the end of the string.




# File 'lib/slaw/parse/cleanser.rb', line 24

def remove_empty_lines(s)
  s.gsub(/\n\s*$/, '')
end
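
A small sketch of the effect, using a standalone copy of the method. Blank and whitespace-only lines are dropped, and the final newline goes with them. That is presumably why this routine is not part of `cleanup`, which ends by enforcing a trailing newline.

```ruby
# Standalone copy of the listing above, for illustration only.
def remove_empty_lines(s)
  s.gsub(/\n\s*$/, '')
end

p remove_empty_lines("a\n\n\nb\n   \nc\n")
# => "a\nb\nc"
```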

#scrub(s) ⇒ Object

Strip invalid bytes and other unwanted characters, such as U+FFFC (the object replacement character).



# File 'lib/slaw/parse/cleanser.rb', line 35

def scrub(s)
  # we often get this unicode codepoint in the string, nuke it
  s.gsub([65532].pack('U*'), '')\
   .gsub(/\n*/, '')
end