Module: Gimme

Defined in:
lib/gimme_poc.rb,
lib/gimme_poc/poc.rb,
lib/gimme_poc/web.rb,
lib/gimme_poc/save.rb,
lib/gimme_poc/version.rb,
lib/gimme_poc/questions.rb,
lib/gimme_poc/contactpage.rb

Overview

Find the contact

Defined Under Namespace

Classes: Search

Constant Summary

PHONE_REGEX =

Simple regex that looks for ###.#### or ###-####

/(\d{3}[-]\d{4}|\d{3}[.]\d{4})/
HTTP_REGEX =

Captures http:// and https://

%r{(\A\bhttps:\/\/|\bhttp:\/\/)}
VERSION =
'0.0.5'
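
As a quick check of what the two patterns accept, the sketch below copies them into local variables and runs them against a few sample strings. The sample strings and variable names are made up for illustration.

```ruby
# Local copies of PHONE_REGEX and HTTP_REGEX from the constant summary.
phone_regex = /(\d{3}[-]\d{4}|\d{3}[.]\d{4})/
http_regex = %r{(\A\bhttps:\/\/|\bhttp:\/\/)}

# Illustrative inputs only.
phone_samples = ['Call 555-1234', 'Call 555.1234', 'Call 5551234']
phone_hits = phone_samples.map { |text| !(text =~ phone_regex).nil? }

url_samples = ['http://example.com', 'https://example.com', 'example.com']
stripped = url_samples.map { |u| u.gsub(http_regex, '') }

p phone_hits # the number without a separator is not matched
p stripped   # every entry is reduced to the bare host
```

Because the phone pattern is unanchored, it also matches a seven-digit group embedded in a longer number such as 1-800-555-1234.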

Class Attribute Details

.contact ⇒ Object

Returns the value of attribute contact.



# File 'lib/gimme_poc.rb', line 14

def contact
  @contact
end

.contact_links ⇒ Object

Returns the value of attribute contact_links.



# File 'lib/gimme_poc.rb', line 14

def contact_links
  @contact_links
end

.page ⇒ Object

Returns the value of attribute page.



# File 'lib/gimme_poc.rb', line 14

def page
  @page
end

.url ⇒ Object

Returns the value of attribute url.



# File 'lib/gimme_poc.rb', line 14

def url
  @url
end

Class Method Details

.blind_test(url) ⇒ Object

TODO: Sometimes DNS will do a redirect and not give a 404. Need to prevent redirects.

Blindly tests whether a url goes through by sending a GET request to it. If there is a 404 error, this returns nil.



# File 'lib/gimme_poc/web.rb', line 86

def blind_test(url)
  puts "\n(blind testing: #{url})"
  get(url)
end

.contact_page(url) ⇒ Object

Looks for a contact page and gets it if a contact link is found. If no contact link is available, it blind tests ‘../contact’ and, if that fails, falls back to the original domain. Returns nil if none of these requests succeeds.



# File 'lib/gimme_poc/contactpage.rb', line 17

def contact_page(url)
  puts 'now looking for contact pages'
  contact_link = link_with_href(/contact|Contact/)
  contact_test_page = merged_link('../contact')

  case
  when !contact_link.nil?
    puts "#{'Success:'.green} Found contact link!\n"
    get(merged_link(contact_link))
  else
    puts "#{'Warning:'.yellow} couldn't find contact link"
    blind_test(contact_test_page) || get(orig_domain(url))
  end
end
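
The fallback order in the else branch relies on each request returning nil on failure, so that || moves on to the next candidate. The sketch below shows that shape with a stand-in fetcher; the fetch lambda and the list of reachable URLs are assumptions made for illustration and are not part of Gimme.

```ruby
# Hypothetical stand-in for Gimme.get: returns the URL when it is
# "reachable" and nil otherwise. The reachable list is invented.
reachable = ['http://example.com']
fetch = ->(url) { reachable.include?(url) ? url : nil }

contact_link = nil # pretend no contact link was found on the page

result =
  if contact_link
    fetch.call(contact_link)
  else
    # Same shape as: blind_test(contact_test_page) || get(orig_domain(url))
    fetch.call('http://example.com/contact') || fetch.call('http://example.com')
  end

puts result
```

Here the blind test of the contact path fails, so the second request to the root domain supplies the result.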

.contactform_available? ⇒ Boolean

TODO: Build a better conditional to prevent false positives. There could be other forms, such as a newsletter signup.

Returns true if the page has a form with more than one field. Forms with one field are typically search boxes.

Returns:

  • (Boolean)


# File 'lib/gimme_poc/questions.rb', line 29

def contactform_available?
  !(page.forms.select { |x| x.fields.length > 1 }.empty?)
end
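
To make the field-count rule concrete, the sketch below applies the same predicate to plain Ruby objects. The struct is a minimal stand-in that only models a fields list, not a real Mechanize form, and the field names are invented.

```ruby
# Stand-in for a form object: only the #fields list is modelled.
form = Struct.new(:fields)

search_only = [form.new(%w[q])]
with_contact = [form.new(%w[q]), form.new(%w[name email message])]
newsletter = [form.new(%w[name email])]

# Same predicate as contactform_available?, applied to an array of forms.
multi_field = ->(forms) { !forms.select { |f| f.fields.length > 1 }.empty? }

puts multi_field.call(search_only)  # single-field search box only
puts multi_field.call(with_contact) # a multi-field form is present
puts multi_field.call(newsletter)   # the false positive the TODO warns about
```

The newsletter case shows why the TODO matters: any two-field form satisfies the check, whether or not it is a contact form.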

.delete_failures(hsh) ⇒ Object

Removes negatives from the contacts hash. Deletes any key-value pair whose value is nil or the string 'false'. Note that false is stored as a string, not as a boolean.



# File 'lib/gimme_poc/save.rb', line 38

def delete_failures(hsh)
  hsh.delete_if { |_k, v| v.nil? || v == 'false' }
end
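
The distinction between the string 'false' and the boolean false is easy to miss, so here is a small sketch using the same block as delete_failures. The hash contents are illustrative.

```ruby
# Same filter as delete_failures, wrapped in a lambda for the sketch.
delete_failures = ->(hsh) { hsh.delete_if { |_k, v| v.nil? || v == 'false' } }

contacts = {
  contactpage: nil,
  email_present: 'true',
  phone_present: 'false',
  twitter: 'https://twitter.com/example'
}
cleaned = delete_failures.call(contacts)
p cleaned

# A boolean false is left in place, because only the string is compared.
p delete_failures.call({ flag: false })
```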

.email_available? ⇒ Boolean

Boolean, returns true if a mailto link is present on the page.

Returns:

  • (Boolean)


# File 'lib/gimme_poc/questions.rb', line 12

def email_available?
  !link_with_href('mailto').nil?
end

.english_contact_page(url) ⇒ Object

Looks for an English page. Gets the page if available, then looks for an English contact page.

If no English link is available, it blind tests ‘../en’ and ‘../english’, then restarts the contact page search. Returns nil if nothing can be found.



# File 'lib/gimme_poc/contactpage.rb', line 39

def english_contact_page(url)
  puts "\nLooking for english page..."
  english_link = page.link_with(href: %r{en\/|english|English})
  test_en_page = merged_link('../en')
  test_english_page = merged_link('../english')

  case
  when !english_link.nil?
    puts "#{'Success:'.green} found english link!"
    get(merged_link(english_link.uri))
  else
    blind_test(test_en_page) || blind_test(test_english_page)
    puts "\n(restarting)\n"
    contact_page(url)
  end
end

.format_url(str) ⇒ Object

Mechanize needs absolute urls to work. If http:// or https:// isn’t present, prepend http://.



# File 'lib/gimme_poc/web.rb', line 36

def format_url(str)
  LazyDomain.autohttp(str)
end
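
The behaviour described above can be sketched without LazyDomain. The add_scheme lambda below is a hypothetical helper written for this example; it is not how LazyDomain.autohttp is implemented, only an illustration of the described result.

```ruby
# Hypothetical helper: prepend http:// unless a scheme is already present.
add_scheme = lambda do |str|
  str.start_with?('http://', 'https://') ? str : "http://#{str}"
end

puts add_scheme.call('example.com')
puts add_scheme.call('https://example.com')
```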

.get(str) ⇒ Object

Go to a page using Mechanize. Sleep for a split second to not overload any servers.

Returns nil if bad url is given.



# File 'lib/gimme_poc/web.rb', line 9

def get(str)
  url = format_url(str)
  puts "sending GET request to: #{url}"
  sleep(0.1)
  @page = Mechanize.new do |a|
    a.user_agent_alias = 'Mac Safari'
    a.open_timeout = 7
    a.read_timeout = 7
    a.idle_timeout = 7
    a.redirect_ok = true
  end.get(url)

rescue Mechanize::ResponseCodeError => e
  puts "#{'Response Error:'.red} #{e}"
rescue SocketError => e
  puts "#{'Socket Error:'.red} #{e}"
rescue Net::OpenTimeout => e
  puts "#{'Connection Timeout:'.red} #{e}"
rescue Errno::ETIMEDOUT => e
  puts "#{'Connection Timeout:'.red} #{e}"
rescue Net::HTTP::Persistent::Error
  puts "#{'Connection Timeout:'.red} read timeout, too many resets."
end
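
The nil return on failure is a side effect of the rescue clauses: each one ends with puts, and puts returns nil. The sketch below reproduces that pattern with a made-up fetcher; sketch_get and the host names are assumptions for illustration, and no network request is made.

```ruby
require 'socket' # defines SocketError

# Hypothetical fetcher: raises for one invented host, succeeds otherwise.
def sketch_get(url)
  raise SocketError, "cannot resolve #{url}" if url.include?('unreachable')
  "page for #{url}"
rescue SocketError => e
  # puts returns nil, so the method returns nil on this path.
  puts "Socket Error: #{e}"
end

ok = sketch_get('http://example.com')
failed = sketch_get('http://unreachable.invalid')
p ok
p failed
```

Callers such as poc can therefore test the result with nil? instead of handling exceptions themselves.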

.go_to_contact_page(url) ⇒ Object

Scans for a contact page. If it doesn’t work on the first try, it will look for English versions and try again. Processes left to right.

Returns nil if no contact page can be found.



# File 'lib/gimme_poc/contactpage.rb', line 9

def go_to_contact_page(url)
  contact_page(url) || english_contact_page(url)
end

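.link_with_href(str) ⇒ Object

Finds the first link whose href matches str and merges it with the current page uri, so relative paths are handled. Returns a string. If there’s nothing, returns nil.

The pattern is prefixed with a \b word boundary so that str must start at the beginning of a word.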



# File 'lib/gimme_poc/web.rb', line 70

def link_with_href(str)
  merged_link(page.link_with(href: /\b#{str}/).uri.to_s)
rescue
  nil
end
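
The effect of the leading \b can be seen on plain strings. The sketch below applies the same interpolated pattern to an invented list of hrefs; it selects every match for clarity, whereas link_with_href only uses the first matching link.

```ruby
# Illustrative hrefs only.
hrefs = [
  '/contact',
  '/about/contact-us',
  '/mycontact',
  'https://facebook.com/example',
  'https://notfacebook.com'
]

# Same pattern construction as link_with_href: /\b#{str}/
matching = ->(str) { hrefs.select { |href| href =~ /\b#{str}/ } }

p matching.call('contact')    # '/mycontact' is rejected by the boundary
p matching.call('facebook\.') # the escaped period must match a literal dot
```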

.memory ⇒ Object

Convenience method that returns the results collected for all sites.



# File 'lib/gimme_poc.rb', line 54

def memory
  Search.all_sites
end

.merged_link(url_str) ⇒ Object

Used in case of relative paths. Merging against the current page uri guarantees a correct absolute url. Takes a url string as its argument and returns the merged uri as a string.



# File 'lib/gimme_poc/web.rb', line 61

def merged_link(url_str)
  page.uri.merge(url_str).to_s
end
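
Merging follows the usual relative-reference rules of Ruby's URI library. The sketch below assumes the page uri behaves like a standard URI object and uses an invented base address.

```ruby
require 'uri'

# Invented base standing in for page.uri.
base = URI('http://example.com/about/team')

relative = base.merge('../contact').to_s
root_relative = base.merge('/contact').to_s
absolute = base.merge('http://other.example/contact').to_s

puts relative      # resolved against the base path
puts root_relative # resolved against the host root
puts absolute      # an absolute url replaces the base entirely
```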

.orig_domain(str) ⇒ Object

Outputs domain of a url. Useful if subdomains are given to GimmePOC and they don’t work.

For example: Given maps.google.com, returns ‘google.com’.



# File 'lib/gimme_poc/web.rb', line 51

def orig_domain(str)
  LazyDomain.parse(str).domain
rescue PublicSuffix::DomainInvalid => e
  puts "#{'Invalid Domain:'.red} #{e}"
end

.phone_available? ⇒ Boolean

Boolean, returns true if phone number is present.

Returns:

  • (Boolean)


# File 'lib/gimme_poc/questions.rb', line 17

def phone_available?
  !(page.body =~ PHONE_REGEX).nil?
end

.poc(arr) ⇒ Object

The main method! Takes an array of urls, or a whitespace-separated string of urls, and gets contact info for each if possible. If a url is bad, it’s converted to nil in the ‘get’ method and skipped over.



# File 'lib/gimme_poc.rb', line 26

def poc(arr)
  arr = arr.split unless arr.is_a?(Array)
  arr.each do |url|
    puts '-' * 50
    puts "starting: #{url}"
    unless LazyDomain.valid?(url)
      puts "#{'Invalid Domain:'.red} `#{url}' is not a valid domain"
      next
    end
    case
    when subdomain?(url)
      puts '(This url is a subdomain.  Will try both sub and root domain.)'
      next if get(url).nil? && get(orig_domain(url)).nil?
    else
      next if get(url).nil?
    end
    start_contact_links
    mechpage = go_to_contact_page(url)
    if mechpage.nil?
      puts '(empty page, exiting.)'
    else
      save_available_contacts(mechpage.uri.to_s)
    end
  end
  Search.all_sites # Return results from all sites.
end
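
The first line of poc normalizes its argument, which is why a plain string also works. A minimal sketch of that step, with a hypothetical normalize lambda and invented urls:

```ruby
# Same normalization as the first line of poc.
normalize = ->(arg) { arg.is_a?(Array) ? arg : arg.split }

from_string = normalize.call('example.com maps.google.com')
from_array = normalize.call(['example.com'])
from_commas = normalize.call('example.com,maps.google.com')

p from_string
p from_array
p from_commas # split uses whitespace, so commas do not separate urls
```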

.reset! ⇒ Object

Clears entire collection.



# File 'lib/gimme_poc.rb', line 59

def reset!
  Search.all_sites = []
end

.save_available_contacts(url, hsh = scan_for_contacts) ⇒ Object

Saves any available contact info to @contact_links.



# File 'lib/gimme_poc/save.rb', line 43

def save_available_contacts(url, hsh = scan_for_contacts)
  if something_to_save?(hsh)
    puts "\nsaving available contact information from #{url}"
    if hsh.is_a?(Hash)
      hsh.each do |k, v|
        save_link(k, v) # saves to @contact_links
      end
      delete_failures(@contact_links)
      puts "#{@contact_links}".cyan # same as @contact_links
    else
      fail ArgumentError, "expected hash but got #{hsh.class}"
    end
    Search::POC.new(url, @contact_links)
  else
    puts '(nothing to save)'
    return
  end
end

.save_link(key, url) ⇒ Object

Used in save_available_contacts to save each valid link.



# File 'lib/gimme_poc/save.rb', line 29

def save_link(key, url)
  return if key.nil? || url.nil?
  @contact_links[key] = url
end

.scan_for_contacts ⇒ Object

Returns a hash of anything that is possible to save; links that are not found are nil. The checks for phone, email, and contact form are stored as the strings 'true' or 'false'.

Periods are added to link hrefs to prevent false positives. They must be escaped with a backslash, or else they act as regex wildcards.



# File 'lib/gimme_poc/save.rb', line 9

def scan_for_contacts
  {
    contactpage: link_with_href('contact'),
    email_present: "#{email_available?}",
    phone_present: "#{phone_available?}",
    contact_form: "#{contactform_available?}",
    facebook: link_with_href('facebook\.'),
    twitter: link_with_href('twitter\.'),
    youtube: link_with_href('youtube\.'),
    googleplus: link_with_href('plus\.google\.'),
    linkedin: link_with_href('linkedin\.')
  }
end

.something_to_save?(hsh) ⇒ Boolean

Boolean, returns true if anything is present after running scan_for_contacts and deleting failures.

Returns:

  • (Boolean)


# File 'lib/gimme_poc/questions.rb', line 7

def something_to_save?(hsh)
  delete_failures(hsh).any?
end
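
Since the check runs delete_failures on the hash it is given, it also modifies that hash in place. The sketch below shows both effects on invented scan results; the lambdas mirror the two methods but are written only for this example.

```ruby
delete_failures = ->(hsh) { hsh.delete_if { |_k, v| v.nil? || v == 'false' } }
something_to_save = ->(hsh) { delete_failures.call(hsh).any? }

# Invented scan results.
empty_scan = { contactpage: nil, email_present: 'false', facebook: nil }
partial_scan = { contactpage: nil, email_present: 'true', phone_present: 'false' }

empty_result = something_to_save.call(empty_scan)
partial_result = something_to_save.call(partial_scan)

p empty_result
p partial_result
p partial_scan # delete_if has already removed the failed entries
```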

.start_contact_links ⇒ Object

Starts or restarts the @contact_links hash.



# File 'lib/gimme_poc/save.rb', line 24

def start_contact_links
  @contact_links = {}
end

.subdomain?(str) ⇒ Boolean

Boolean, returns true if url is not identical to original domain.

Returns:

  • (Boolean)


# File 'lib/gimme_poc/web.rb', line 77

def subdomain?(str)
  (unformat_url(str) != orig_domain(str))
end

.unformat_url(str) ⇒ Object

Used for subdomain check. Not a permanent change to url variable.



# File 'lib/gimme_poc/web.rb', line 41

def unformat_url(str)
  str.gsub(HTTP_REGEX, '')
end