Module: TextCleaner
- Defined in:
  lib/text_cleaner.rb,
  lib/text_cleaner/version.rb
Constant Summary
- DICTIONARY =
<<-EOS quotation mark " " " " " " ampersand & & & & & & less-than sign < < < < < < greater-than sign > > > > > > Latin capital ligature OE Œ Œ Œ Œ Œ Œ Latin small ligature oe œ œ œ œ œ œ Latin capital letter S with caron Š Š Š Š Š Š Latin small letter s with caron š š š š š š Latin capital letter Y with diaeresis Ÿ Ÿ Ÿ Ÿ Ÿ Ÿ modifier letter circumflex accent ˆ ˆ ˆ ˆ ˆ ˆ small tilde ˜ ˜ ˜ ˜ ˜ ˜ en space       em space       thin space       zero width non-joiner ‌ ‌ ‌ zero width joiner ‍ ‍ ‍ left-to-right mark ‎ ‎ ‎ right-to-left mark ‏ ‏ ‏ en dash – – – – – – em dash — — — — — — amster right single quotation mark ’ ’ ’ ’ ’ ’ left single quotation mark ‘ ‘ ‘ ‘ ‘ ‘ right single quotation mark ’ ’ ’ ’ ’ ’ single low-9 quotation mark ‚ ‚ ‚ ‚ ‚ ‚ left double quotation mark “ “ “ “ “ “ right double quotation mark ” ” ” ” ” ” double low-9 quotation mark „ „ „ „ „ „ dagger † † † † † † double dagger ‡ ‡ ‡ ‡ ‡ ‡ per mille sign ‰ ‰ ‰ ‰ ‰ ‰ single left-pointing angle quotation mark ‹ ‹ ‹ ‹ ‹ ‹ single right-pointing angle quotation mark › › › › › › euro sign € € € € € € Latin small f with hook = function = florin ƒ ƒ ƒ ƒ ƒ ƒ Greek capital letter alpha Α Α Α Α Α Α Greek capital letter beta Β Β Β Β Β Β Greek capital letter gamma Γ Γ Γ Γ Γ Γ Greek capital letter delta Δ Δ Δ Δ Δ Δ Greek capital letter epsilon Ε Ε Ε Ε Ε Ε Greek capital letter zeta Ζ Ζ Ζ Ζ Ζ Ζ Greek capital letter eta Η Η Η Η Η Η Greek capital letter theta Θ Θ Θ Θ Θ Θ Greek capital letter iota Ι Ι Ι Ι Ι Ι Greek capital letter kappa Κ Κ Κ Κ Κ Κ Greek capital letter lambda Λ Λ Λ Λ Λ Λ Greek capital letter mu Μ Μ Μ Μ Μ Μ Greek capital letter nu Ν Ν Ν Ν Ν Ν Greek capital letter xi Ξ Ξ Ξ Ξ Ξ Ξ Greek capital letter omicron Ο Ο Ο Ο Ο Ο Greek capital letter pi Π Π Π Π Π Π Greek capital letter rho Ρ Ρ Ρ Ρ Ρ Ρ Greek capital letter sigma Σ Σ Σ Σ Σ Σ Greek capital letter tau Τ Τ Τ Τ Τ Τ Greek capital letter upsilon Υ Υ Υ Υ Υ Υ Greek capital letter phi Φ Φ Φ Φ Φ Φ Greek capital letter chi Χ Χ Χ Χ Χ Χ Greek capital letter psi Ψ Ψ 
Ψ Ψ Ψ Ψ Greek capital letter omega Ω Ω Ω Ω Ω Ω Greek small letter alpha α α α α α α Greek small letter beta β β β β β β Greek small letter gamma γ γ γ γ γ γ Greek small letter delta δ δ δ δ δ δ Greek small letter epsilon ε ε ε ε ε ε Greek small letter zeta ζ ζ ζ ζ ζ ζ Greek small letter eta η η η η η η Greek small letter theta θ θ θ θ θ θ Greek small letter iota ι ι ι ι ι ι Greek small letter kappa κ κ κ κ κ κ Greek small letter lambda λ λ λ λ λ λ Greek small letter mu μ μ μ μ μ μ Greek small letter nu ν ν ν ν ν ν Greek small letter xi ξ ξ ξ ξ ξ ξ Greek small letter omicron ο ο ο ο ο ο Greek small letter pi π π π π π π Greek small letter rho ρ ρ ρ ρ ρ ρ Greek small letter final sigma ς ς ς ς ς ς Greek small letter sigma σ σ σ σ σ σ Greek small letter tau τ τ τ τ τ τ Greek small letter upsilon υ υ υ υ υ υ Greek small letter phi φ φ φ φ φ φ Greek small letter chi χ χ χ χ χ χ Greek small letter psi ψ ψ ψ ψ ψ ψ Greek small letter omega ω ω ω ω ω ω Greek small letter theta symbol ϑ ϑ ϑ ϑ ϑ ϑ Greek upsilon with hook symbol ϒ ϒ ϒ ϒ ϒ ϒ Greek pi symbol ϖ ϖ ϖ ϖ ϖ ϖ bullet = black small circle • • • • • • horizontal ellipsis = three dot leader … … … … … … prime = minutes = feet ′ ′ ′ ′ ′ ′ double prime = seconds = inches ″ ″ ″ ″ ″ ″ overline = spacing overscore ‾ ‾ ‾ ‾ ‾ ‾ fraction slash ⁄ ⁄ ⁄ ⁄ ⁄ ⁄ script capital P = power set = Weierstrass p ℘ ℘ ℘ ℘ ℘ ℘ blackletter capital I = imaginary part ℑ ℑ ℑ ℑ ℑ ℑ blackletter capital R = real part symbol ℜ ℜ ℜ ℜ ℜ ℜ trade mark sign ™ ™ ™ ™ ™ ™ alef symbol = first transfinite cardinal ℵ ℵ ℵ ℵ ℵ ℵ leftwards arrow ← ← ← ← ← ← upwards arrow ↑ ↑ ↑ ↑ ↑ ↑ rightwards arrow → → → → → → downwards arrow ↓ ↓ ↓ ↓ ↓ ↓ left right arrow ↔ ↔ ↔ ↔ ↔ ↔ downwards arrow with corner leftwards = carriage return ↵ ↵ ↵ ↵ ↵ ↵ leftwards double arrow ⇐ ⇐ ⇐ ⇐ ⇐ ⇐ upwards double arrow ⇑ ⇑ ⇑ ⇑ ⇑ ⇑ rightwards double arrow ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ downwards double arrow ⇓ ⇓ ⇓ ⇓ ⇓ ⇓ left right double arrow ⇔ ⇔ ⇔ ⇔ ⇔ ⇔ for all ∀ ∀ ∀ ∀ ∀ ∀ partial differential ∂ ∂ ∂ ∂ ∂ ∂ there 
exists ∃ ∃ ∃ ∃ ∃ ∃ empty set = null set = diameter ∅ ∅ ∅ ∅ ∅ ∅ nabla = backward difference ∇ ∇ ∇ ∇ ∇ ∇ element of ∈ ∈ ∈ ∈ ∈ ∈ not an element of ∉ ∉ ∉ ∉ ∉ ∉ contains as member ∋ ∋ ∋ ∋ ∋ ∋ n-ary product = product sign ∏ ∏ ∏ ∏ ∏ ∏ n-ary sumation ∑ ∑ ∑ ∑ ∑ ∑ minus sign − − − − − − asterisk operator ∗ ∗ ∗ ∗ ∗ ∗ square root = radical sign √ √ √ √ √ √ proportional to ∝ ∝ ∝ ∝ ∝ ∝ infinity ∞ ∞ ∞ ∞ ∞ ∞ angle ∠ ∠ ∠ ∠ ∠ ∠ logical and = wedge ∧ ∧ ∧ ∧ ∧ ∧ logical or = vee ∨ ∨ ∨ ∨ ∨ ∨ intersection = cap ∩ ∩ ∩ ∩ ∩ ∩ union = cup ∪ ∪ ∪ ∪ ∪ ∪ integral ∫ ∫ ∫ ∫ ∫ ∫ therefore ∴ ∴ ∴ ∴ ∴ ∴ tilde operator = varies with = similar to ∼ ∼ ∼ ∼ ∼ ∼ approximately equal to ≅ ≅ ≅ ≅ ≅ ≅ almost equal to = asymptotic to ≈ ≈ ≈ ≈ ≈ ≈ not equal to ≠ ≠ ≠ ≠ ≠ ≠ identical to ≡ ≡ ≡ ≡ ≡ ≡ less-than or equal to ≤ ≤ ≤ ≤ ≤ ≤ greater-than or equal to ≥ ≥ ≥ ≥ ≥ ≥ subset of ⊂ ⊂ ⊂ ⊂ ⊂ ⊂ superset of ⊃ ⊃ ⊃ ⊃ ⊃ ⊃ not a subset of ⊄ ⊄ ⊄ ⊄ ⊄ ⊄ subset of or equal to ⊆ ⊆ ⊆ ⊆ ⊆ ⊆ superset of or equal to ⊇ ⊇ ⊇ ⊇ ⊇ ⊇ circled plus = direct sum ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ circled times = vector product ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ up tack = orthogonal to = perpendicular ⊥ ⊥ ⊥ ⊥ ⊥ ⊥ dot operator ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ left ceiling = APL upstile ⌈ ⌈ ⌈ ⌈ ⌈ ⌈ right ceiling ⌉ ⌉ ⌉ ⌉ ⌉ ⌉ left floor = APL downstile ⌊ ⌊ ⌊ ⌊ ⌊ ⌊ right floor ⌋ ⌋ ⌋ ⌋ ⌋ ⌋ left-pointing angle bracket = bra ⟨ 〈 〈 〈 〈 〈 right-pointing angle bracket = ket ⟩ 〉 〉 〉 〉 〉 lozenge ◊ ◊ ◊ ◊ ◊ ◊ black spade suit ♠ ♠ ♠ ♠ ♠ ♠ black club suit = shamrock ♣ ♣ ♣ ♣ ♣ ♣ black heart suit = valentine ♥ ♥ ♥ ♥ ♥ ♥ black diamond suit ♦ ♦ ♦ ♦ ♦ ♦ inverted exclamation mark ¡ ¡ ¡ ¡ ¡ ¡ cent sign ¢ ¢ ¢ ¢ ¢ ¢ pound sign £ £ £ £ £ £ currency sign ¤ ¤ ¤ ¤ ¤ ¤ yen sign = yuan sign ¥ ¥ ¥ ¥ ¥ ¥ broken bar = broken vertical bar ¦ ¦ ¦ ¦ ¦ ¦ section sign § § § § § § diaeresis = spacing diaeresis ¨ ¨ ¨ ¨ ¨ ¨ copyright sign © © © © © © feminine ordinal indicator ª ª ª ª ª ª left-pointing double angle quotation mark = left pointing guillemet « « « « « « not sign ¬ ¬ ¬ ¬ ¬ ¬ soft hyphen = discretionary hyphen ­ ­ ­ registered sign = 
registered trade mark sign ® ® ® ® ® ® macron = spacing macron = overline = APL overbar ¯ ¯ ¯ ¯ ¯ ¯ degree sign ° ° ° ° ° ° plus-minus sign = plus-or-minus sign ± ± ± ± ± ± superscript two = superscript digit two = squared ² ² ² ² ² ² superscript three = superscript digit three = cubed ³ ³ ³ ³ ³ ³ acute accent = spacing acute ´ ´ ´ ´ ´ ´ micro sign µ µ µ µ µ µ pilcrow sign = paragraph sign ¶ ¶ ¶ ¶ ¶ ¶ middle dot = Georgian comma = Greek middle dot · · · · · · cedilla = spacing cedilla ¸ ¸ ¸ ¸ ¸ ¸ superscript one = superscript digit one ¹ ¹ ¹ ¹ ¹ ¹ masculine ordinal indicator º º º º º º right-pointing double angle quotation mark = right pointing guillemet » » » » » » vulgar fraction one quarter = fraction one quarter ¼ ¼ ¼ ¼ ¼ ¼ vulgar fraction one half = fraction one half ½ ½ ½ ½ ½ ½ vulgar fraction three quarters = fraction three quarters ¾ ¾ ¾ ¾ ¾ ¾ inverted question mark = turned question mark ¿ ¿ ¿ ¿ ¿ ¿ Latin capital letter A with grave = Latin capital letter A grave À À À À À À Latin capital letter A with acute Á Á Á Á Á Á Latin capital letter A with circumflex       Latin capital letter A with tilde à à à à à à Latin capital letter A with diaeresis Ä Ä Ä Ä Ä Ä Latin capital letter A with ring above = Latin capital letter A ring Å Å Å Å Å Å Latin capital letter AE = Latin capital ligature AE Æ Æ Æ Æ Æ Æ Latin capital letter C with cedilla Ç Ç Ç Ç Ç Ç Latin capital letter E with grave È È È È È È Latin capital letter E with acute É É É É É É Latin capital letter E with circumflex Ê Ê Ê Ê Ê Ê Latin capital letter E with diaeresis Ë Ë Ë Ë Ë Ë Latin capital letter I with grave Ì Ì Ì Ì Ì Ì Latin capital letter I with acute Í Í Í Í Í Í Latin capital letter I with circumflex Î Î Î Î Î Î Latin capital letter I with diaeresis Ï Ï Ï Ï Ï Ï Latin capital letter ETH Ð Ð Ð Ð Ð Ð Latin capital letter N with tilde Ñ Ñ Ñ Ñ Ñ Ñ Latin capital letter O with grave Ò Ò Ò Ò Ò Ò Latin capital letter O with acute Ó Ó Ó Ó Ó Ó Latin capital letter O with circumflex Ô Ô Ô Ô Ô Ô Latin 
capital letter O with tilde Õ Õ Õ Õ Õ Õ Latin capital letter O with diaeresis Ö Ö Ö Ö Ö Ö multiplication sign × × × × × × Latin capital letter O with stroke = Latin capital letter O slash Ø Ø Ø Ø Ø Ø Latin capital letter U with grave Ù Ù Ù Ù Ù Ù Latin capital letter U with acute Ú Ú Ú Ú Ú Ú Latin capital letter U with circumflex Û Û Û Û Û Û Latin capital letter U with diaeresis Ü Ü Ü Ü Ü Ü Latin capital letter Y with acute Ý Ý Ý Ý Ý Ý Latin capital letter THORN Þ Þ Þ Þ Þ Þ Latin small letter sharp s = ess-zed ß ß ß ß ß ß Latin small letter a with grave = Latin small letter a grave à à à à à à Latin small letter a with acute á á á á á á Latin small letter a with circumflex â â â â â â Latin small letter a with tilde ã ã ã ã ã ã Latin small letter a with diaeresis ä ä ä ä ä ä Latin small letter a with ring above = Latin small letter a ring å å å å å å Latin small letter ae = Latin small ligature ae æ æ æ æ æ æ Latin small letter c with cedilla ç ç ç ç ç ç Latin small letter e with grave è è è è è è Latin small letter e with acute é é é é é é Latin small letter e with circumflex ê ê ê ê ê ê Latin small letter e with diaeresis ë ë ë ë ë ë Latin small letter i with grave ì ì ì ì ì ì Latin small letter i with acute í í í í í í Latin small letter i with circumflex î î î î î î Latin small letter i with diaeresis ï ï ï ï ï ï Latin small letter eth ð ð ð ð ð ð Latin small letter n with tilde ñ ñ ñ ñ ñ ñ Latin small letter o with grave ò ò ò ò ò ò Latin small letter o with acute ó ó ó ó ó ó Latin small letter o with circumflex ô ô ô ô ô ô Latin small letter o with tilde õ õ õ õ õ õ Latin small letter o with diaeresis ö ö ö ö ö ö division sign ÷ ÷ ÷ ÷ ÷ ÷ Latin small letter o with stroke = Latin small letter o slash ø ø ø ø ø ø Latin small letter u with grave ù ù ù ù ù ù Latin small letter u with acute ú ú ú ú ú ú Latin small letter u with circumflex û û û û û û Latin small letter u with diaeresis ü ü ü ü ü ü Latin small letter y with acute ý ý ý ý ý ý Latin small letter thorn 
þ þ þ þ þ þ Latin small letter y with diaeresis ÿ ÿ ÿ ÿ ÿ ÿ EOS
- VERSION =
"0.0.1"
Class Method Summary
Class Method Details
.clean(input_text) ⇒ Object
# File 'lib/text_cleaner.rb', line 261

def self.clean(input_text)
  DICTIONARY.each_line do |line|
    name, html, hex, oct, display, display2, display3 = line.split(/\t/)
    input_text.gsub!(display.strip, html.strip)
  end
  input_text
end
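As listed, `clean` walks DICTIONARY row by row and calls `gsub!` on its argument, so the caller's string is modified in place and each row is applied to the output of the previous one. The following is a minimal sketch of the same idea, not the gem's implementation. It swaps in a single-pass `gsub` with a Hash so that no replacement is re-scanned by a later row. `SAMPLE_ROWS`, `entity_map`, and `clean_copy` are hypothetical names, and the column layout (name, named entity, hex reference, decimal reference, literal character) is an assumption, since the rendered constant above no longer shows its tabs or raw entity text.

```ruby
# Hypothetical miniature table; the real DICTIONARY is far larger.
# Assumed tab-separated columns: name, named entity, hex reference,
# decimal reference, literal character.
SAMPLE_ROWS = <<~ROWS
  quotation mark\t&quot;\t&#x22;\t&#34;\t"
  ampersand\t&amp;\t&#x26;\t&#38;\t&
  copyright sign\t&copy;\t&#xA9;\t&#169;\t©
  Latin small letter e with acute\t&eacute;\t&#xE9;\t&#233;\té
ROWS

# Build a { literal character => named entity } lookup from the rows.
def entity_map(rows)
  rows.each_line.with_object({}) do |line, map|
    _name, entity, _hex, _dec, char = line.chomp.split("\t")
    next if char.nil? || char.empty? # skip blank or short rows
    map[char] = entity
  end
end

# Non-mutating, single-pass variant: every literal character is looked up
# once, so an inserted "&quot;" is never re-escaped to "&amp;quot;".
def clean_copy(text, map = entity_map(SAMPLE_ROWS))
  text.gsub(Regexp.union(map.keys), map)
end

original = %q{"Café" & © 2024}
cleaned  = clean_copy(original)
puts cleaned
```

Because `clean_copy` returns a new string, `original` is left unchanged. Whether the row-by-row version double-escapes in practice depends on the real column contents, which this page does not preserve.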