Module: Greeb::Parser

Extended by:
Parser
Included in:
Parser
Defined in:
lib/greeb/parser.rb

Overview

Natural language text often contains entities that need to be located, such as URLs, e-mail addresses, and names. This module provides several helpers for finding them.

Constant Summary

URL =

A URL pattern. Not very precise, but IDN-compatible.

%r{\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\p{L}\w\d]+\)|([^.\s]|/)))}i
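
As a rough illustration of how this pattern behaves, the sketch below applies a copy of it with plain `String#scan`. The names `URL_PATTERN` and `find_urls` are hypothetical and not part of Greeb; the module's own `#urls` method returns `Greeb::Span` objects rather than strings.

```ruby
# Copy of the URL pattern shown above.
URL_PATTERN = %r{\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\p{L}\w\d]+\)|([^.\s]|/)))}i

# Hypothetical helper (not part of Greeb): return every matched substring.
def find_urls(text)
  found = []
  # The pattern contains capture groups, so take the whole match
  # from the MatchData instead of the group values.
  text.scan(URL_PATTERN) { found << Regexp.last_match[0] }
  found
end

p find_urls('See http://example.com/docs. Also www.example.org/a_(b) works.')
# => ["http://example.com/docs", "www.example.org/a_(b)"]
```

The sentence-final period after `docs` is left out of the match because the last character of a match must not be a dot or whitespace.
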
EMAIL =

A horrible e-mail pattern.

/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i
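
The following sketch shows one reason the pattern earns its description. It assumes the pattern reads `local-part`, `+@`, `domain`; the `+@` portion is a reconstruction and should be checked against the source file. `EMAIL_PATTERN` and `find_emails` are hypothetical names, not part of Greeb.

```ruby
# Assumed form of the e-mail pattern (the "+@" part is a reconstruction).
EMAIL_PATTERN = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

# Hypothetical helper (not part of Greeb): return every matched substring.
# The pattern has no capture groups, so String#scan yields whole matches.
def find_emails(text)
  text.scan(EMAIL_PATTERN)
end

p find_emails('Write to john.doe@example.com or admin@example.museum.')
# => ["john.doe@example.com", "admin@example.muse"]
```

Because the top-level domain is limited to two to four letters, the second address is cut off after `muse`.
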
ABBREV =

Another horrible pattern. Now for abbreviations.

/\b((-{0,1}\p{L}\.)*|(-{0,1}\p{L}\. )*)-{0,1}\p{L}\./i
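
A minimal sketch of the abbreviation pattern in use, assuming plain `String#scan` is an acceptable stand-in for the module's own scanning. `ABBREV_PATTERN` and `find_abbrevs` are hypothetical names.

```ruby
# Copy of the abbreviation pattern shown above.
ABBREV_PATTERN = /\b((-{0,1}\p{L}\.)*|(-{0,1}\p{L}\. )*)-{0,1}\p{L}\./i

# Hypothetical helper (not part of Greeb): return every matched substring.
def find_abbrevs(text)
  found = []
  text.scan(ABBREV_PATTERN) { found << Regexp.last_match[0] }
  found
end

p find_abbrevs('She moved to the U.S.A. in 1999, i.e. long ago.')
# => ["U.S.A.", "i.e."]
```

The final `ago.` is not matched: the pattern only accepts single letters followed by a dot, starting at a word boundary.
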
HTML =

This pattern matches anything that looks like HTML. Or not.

/<(.*?)>/i
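
The "Or not" can be made concrete with a short sketch. `HTML_PATTERN` and `find_html` are hypothetical names used only for illustration.

```ruby
# Copy of the HTML pattern shown above.
HTML_PATTERN = /<(.*?)>/i

# Hypothetical helper (not part of Greeb): return every matched substring.
def find_html(text)
  found = []
  text.scan(HTML_PATTERN) { found << Regexp.last_match[0] }
  found
end

p find_html('A <b>bold</b> claim')
# => ["<b>", "</b>"]

# Anything between "<" and the next ">" on the same line matches,
# including ordinary comparisons.
p find_html('if a < b and c > d')
# => ["< b and c >"]
```
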
TIME =

Time pattern.

/\b(\d|[0-2]\d):[0-6]\d(:[0-6]\d){0,1}\b/i
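
A small sketch of what the time pattern accepts, including one value that is not a valid time of day. `TIME_PATTERN` and `find_times` are hypothetical names.

```ruby
# Copy of the time pattern shown above.
TIME_PATTERN = /\b(\d|[0-2]\d):[0-6]\d(:[0-6]\d){0,1}\b/i

# Hypothetical helper (not part of Greeb): return every matched substring.
def find_times(text)
  found = []
  text.scan(TIME_PATTERN) { found << Regexp.last_match[0] }
  found
end

p find_times('Doors open at 9:30, show ends at 23:59:59.')
# => ["9:30", "23:59:59"]

# The digit ranges are loose: hours up to 29 and minutes up to 69 pass.
p find_times('25:61 is not a real time.')
# => ["25:61"]
```
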
APOSTROPHE =

Apostrophe pattern.

/['’]/i
TOGETHER =

Span types that #together merges into a single span.

[:letter, :integer, :apostrophe, :together]

Instance Method Details

#abbrevs(text) ⇒ Array<Greeb::Span>

Recognize abbreviations in the input text.

Parameters:

  • text (String)

    input text.

Returns:

  • (Array<Greeb::Span>)

# File 'lib/greeb/parser.rb', line 63

def abbrevs(text)
  scan(text, ABBREV, :abbrev)
end
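
Every recognizer on this page delegates to a `scan(text, regexp, type)` helper whose source is not shown here. The sketch below is one plausible shape for it, under two assumptions: that a span stores a start offset, an end offset, and a type, and that offsets come straight from the match. `Span`, `scan_spans`, and `ABBREV_SKETCH` are stand-in names, not Greeb's actual definitions.

```ruby
# Stand-in for Greeb::Span (assumed fields: from, to, type).
Span = Struct.new(:from, :to, :type)

# Hypothetical equivalent of the module's scan helper: one span per match,
# with character offsets taken from the MatchData.
def scan_spans(text, regexp, type)
  spans = []
  text.scan(regexp) do
    match = Regexp.last_match
    spans << Span.new(match.begin(0), match.end(0), type)
  end
  spans
end

# Copy of the ABBREV pattern from the constant summary.
ABBREV_SKETCH = /\b((-{0,1}\p{L}\.)*|(-{0,1}\p{L}\. )*)-{0,1}\p{L}\./i

p scan_spans('Born in the U.S.A. in 1949.', ABBREV_SKETCH, :abbrev).map(&:to_a)
# => [[12, 18, :abbrev]]
```
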

#apostrophes(text, spans) ⇒ Array<Greeb::Span>

Retrieve apostrophes from the tokenized text. The algorithm could be optimized further.

Parameters:

  • text (String)

    input text.

  • spans (Array<Greeb::Span>)

    already tokenized text.

Returns:

  • (Array<Greeb::Span>)

# File 'lib/greeb/parser.rb', line 95

def apostrophes(text, spans)
  apostrophes = scan(text, APOSTROPHE, :apostrophe)
  return [] if apostrophes.empty?

  apostrophes.each { |s| Greeb.extract_spans(spans, s) }.clear

  spans.each_with_index.each_cons(3).reverse_each do |(s1, i), (s2, j), (s3, k)|
    next unless s1 && s1.type == :letter
    next unless s2 && s2.type == :apostrophe
    next unless !s3 || s3 && s3.type == :letter
    s3, k = s2, j unless s3
    apostrophes << Greeb::Span.new(s1.from, s3.to, s1.type)
    spans[i..k] = apostrophes.last
  end

  apostrophes
end
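
To make the merging step easier to follow, here is a simplified sketch of the letter, apostrophe, letter pass on its own. It assumes the apostrophe spans have already been split out of the token list, and it leaves out both the `Greeb.extract_spans` step and the branch for a missing third span. `Span` and `merge_apostrophes` are stand-in names, not part of Greeb.

```ruby
# Stand-in for Greeb::Span (assumed fields: from, to, type).
Span = Struct.new(:from, :to, :type)

# Walk the span triples from the end and merge letter + apostrophe + letter
# into one :letter span. The input array is modified in place; the method
# returns only the newly created spans.
def merge_apostrophes(spans)
  merged = []
  spans.each_with_index.each_cons(3).reverse_each do |(s1, i), (s2, _j), (s3, k)|
    next unless s1.type == :letter && s2.type == :apostrophe && s3.type == :letter
    merged << Span.new(s1.from, s3.to, s1.type)
    spans[i..k] = merged.last
  end
  merged
end

# Tokens for "don't stop": don ' t <space> stop
tokens = [
  Span.new(0, 3, :letter), Span.new(3, 4, :apostrophe), Span.new(4, 5, :letter),
  Span.new(5, 6, :space), Span.new(6, 10, :letter)
]

p merge_apostrophes(tokens).map(&:to_a)
# => [[0, 5, :letter]]
p tokens.map(&:to_a)
# => [[0, 5, :letter], [5, 6, :space], [6, 10, :letter]]
```
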

#emails(text) ⇒ Array<Greeb::Span>

Recognize e-mail addresses in the input text.

Parameters:

  • text (String)

    input text.

Returns:

  • (Array<Greeb::Span>)

# File 'lib/greeb/parser.rb', line 53

def emails(text)
  scan(text, EMAIL, :email)
end

#html(text) ⇒ Array<Greeb::Span>

Recognize HTML-like entities in the input text.

Parameters:

  • text (String)

    input text.

Returns:

  • (Array<Greeb::Span>)

# File 'lib/greeb/parser.rb', line 73

def html(text)
  scan(text, HTML, :html)
end

#time(text) ⇒ Array<Greeb::Span>

Recognize timestamps in the input text.

Parameters:

  • text (String)

    input text.

Returns:

  • (Array<Greeb::Span>)

# File 'lib/greeb/parser.rb', line 83

def time(text)
  scan(text, TIME, :time)
end

#together(spans) ⇒ Array<Greeb::Span>

Merge adjacent spans whose types are listed in TOGETHER into single :together spans.

Parameters:

  • spans (Array<Greeb::Span>)

    already tokenized text.

Returns:

  • (Array<Greeb::Span>)

# File 'lib/greeb/parser.rb', line 119

def together(spans)
  loop do
    converged = true

    spans.each_with_index.each_cons(2).reverse_each do |(s1, i), (s2, j)|
      next unless TOGETHER.include?(s1.type) && TOGETHER.include?(s2.type)
      spans[i..j] = Greeb::Span.new(s1.from, s2.to, :together)
      converged = false
    end

    break if converged
  end

  spans
end
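
The intended effect of merging can be sketched with a single left-to-right pass. This is a different technique from the listing above: it builds a new array instead of rewriting `spans` in place, and it does not repeat until convergence. `Span`, `TOGETHER_TYPES`, and `merge_together` are stand-in names, not part of Greeb.

```ruby
# Stand-in for Greeb::Span (assumed fields: from, to, type).
Span = Struct.new(:from, :to, :type)

# Same list as the TOGETHER constant.
TOGETHER_TYPES = [:letter, :integer, :apostrophe, :together].freeze

# Fold the spans left to right: when both the previous result span and the
# current span have a mergeable type, replace the previous one with a wider
# :together span; otherwise keep the current span as it is.
def merge_together(spans)
  spans.each_with_object([]) do |span, result|
    last = result.last
    if last && TOGETHER_TYPES.include?(last.type) && TOGETHER_TYPES.include?(span.type)
      result[-1] = Span.new(last.from, span.to, :together)
    else
      result << span
    end
  end
end

# Tokens for "a1b c": a 1 b <space> c
tokens = [
  Span.new(0, 1, :letter), Span.new(1, 2, :integer), Span.new(2, 3, :letter),
  Span.new(3, 4, :space), Span.new(4, 5, :letter)
]

p merge_together(tokens).map(&:to_a)
# => [[0, 3, :together], [3, 4, :space], [4, 5, :letter]]
```

A span with no mergeable neighbor keeps its original type, which matches the listing: only spans that were actually merged become `:together`.
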

#urls(text) ⇒ Array<Greeb::Span>

Recognize URLs in the input text. Strictly speaking, URL is an obsolete standard, and this code should be rewritten around the URI concept.

Parameters:

  • text (String)

    input text.

Returns:

  • (Array<Greeb::Span>)

# File 'lib/greeb/parser.rb', line 43

def urls(text)
  scan(text, URL, :url)
end