Ramparts - Spam Detection Build Status Maintainability Gem Version

Parses blocks of text to find phone numbers (including phonetic numbers), emails, and spammer urls

Example

Find obfuscated phone numbers

>> message = "Contact me directly ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE  ). Hope you cracked that number code."
>> Ramparts.find_phone_numbers(message)
[{start_offset: 22, end_offset: 71, type: :phone, value: 'FOUR ONE FIVE E I G H T 9 FOUR TOO EIGHT SIX FIVE'}]

Find obfuscated emails.

>> message = "Looking for honest worker .. contact ashley73299 AT yahoo dot com for more info"
>> Ramparts.find_emails(message)
[{start_offset: 37, end_offset: 65, type: :email, value: 'ashley73299 AT yahoo dot com'}]

Find both obfuscated emails and phone numbers.

>> message = "Looking for honest worker .. contact ashley73299 AT yahoo dot com or FOUR FIVE ONE 456 8900 for more info"
>> Ramparts.find_phone_numbers_and_emails(message)
[{start_offset: 37, end_offset: 65, type: :email, value: 'ashley73299 AT yahoo dot com'}, {start_offset: 70, end_offset: 92, type: :phone, value: 'FOUR FIVE ONE 456 8900'}]

Count the occurrences of well known spam URLs and keywords

>> message = ""cialis vs viagra spam guestbook.php?action=http://cialiswalmart.shop""
>> Ramparts.count_urls(message)
3

Installation

In the root directory of your project

gem install ramparts

Remember to require ramparts as necessary

require 'ramparts'

API

count_phone_numbers(text, options = {})

  • Returns the count of the number of phone numbers in the text. Currently uses a map reduce paradigm, which incurs information loss but is cleaner to implement, achieves better results, and is ~2x faster than find_phone_numbers
  • Input:
    • text [String]
    • options [Hash]
      • parse_leet [Boolean][Default → True]
        • Parses phone numbers that contain l33t syntax. With this set to true eg. FivE 4 3 F0r On3 67 NiN3 would be caught.
      • remove_spaces [Boolean][Default → True]
        • Parses phone numbers that contain spaces between the numbers. With this set to true eg. F i v E 4 3 F 0 r O n 3 67 N i N 3 would be caught.
  • Output:
    • number of occurrences of phone numbers [Integer]
  • Example
    • Input:
      • text → "If you're interested in this position, do contact me directly on my phone number ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE ). Hope you cracked that number code."
    • Output: 1

find_phone_numbers(text, options = {})

  • Description: Finds all occurrences of emails within a block of text. Even when l33t speak, phonetics and space variations are used.
  • Input:
    • text [String]
    • options [Hash]
      • To Be Implemented
  • Output:
    • [Array]
      • match [Hash]
        • offset: [Integer]
        • value: [String]
  • Example
    • Input:
      • text → "If you're interested in this position, do contact me directly on my phone number ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE ). Hope you cracked that number code."
    • Output: [{start_offset: 84, end_offset: 133, type: :phone, value: 'FOUR ONE FIVE E I G H T 9 FOUR TOO EIGHT SIX FIVE'}]

replace_phone_numbers(text, options = {}, &block)

  • Description: Replaces all the occurrences of phone numbers within the text with what is returned in the block. Returns the redacted text. of text.
  • Input:
    • text [String]
    • insertable [String]
    • options [Hash]
      • To Be Implemented
  • Output:
    • updated text [String]
  • Example
    • Usage: altered_text = replace_phone_numbers(...) do CENSORED end
    • Input:
      • text → "If you're interested in this position, do contact me directly on my phone number ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE ). Hope you cracked that number code."
    • Output: "If you're interested in this position, do contact me directly on my phone number ( CENSORED ). Hope you cracked that number code."

count_emails(text, options = {})

  • Description: Returns the count of the number of emails in the text. Currently uses a map reduce paradigm, which incurs information loss but is cleaner to implement, achieves better results, and is ~2x faster than find_emails
    • Input:
      • text [String]
      • options [Hash]
        • aggressive [Boolean] [Default → False]
        • doesn't require a . or dot + a TLD at the end, but instead compares the last word against a well known list of email domains (eg. contact ashley @ yandex for more info would be caught)
    • Output:
      • number of occurences of emails [Integer]
    • Example
      • Input:
        • text → "Hi, Are you seriously interested ..Looking for honest worker .. My e-mail is ashley73299 AT yahoo dot com, I repeat ashley73299 @ yahoo . com ?.. Ashley"
      • Output: 2

find_emails(text, options = {})

  • Description: Finds all occurrences of emails within a block of text. Even when l33t speak, phonetics are used.
  • Input:
    • text [String]
    • options [Hash]
      • aggressive [Boolean] [Default → False]
        • doesn't require a . or dot + a TLD at the end, but instead compares the last word against a well known list of email domains (eg. contact ashley @ yandex for more info would be caught)
      • check_for_at [Boolean] [Default → False]
        • checks for the word 'at' as '@', currently can result in algorithm being overly greedy as 'at' is such a common word
  • Output:
    • [Array]
      • match [Hash]
        • offset: [Integer]
        • value: [String]
  • Example
    • Input:
      • text → "Hi, Are you seriously interested ..Looking for honest worker .. My e-mail is ashley73299 AT yahoo dot com, I repeat ashley73299 @ yahoo . com ?.. Ashley"
    • Output: [{start_offset: 78, end_offset: 106, type: :email, value: 'ashley73299 AT yahoo dot com'}, {start_offset: 118, end_offset: 143, type: :email, value: 'ashley73299 @ yahoo . com'}]

replace_emails(text, options = {}, &block)

  • Description: Replaces all the occurrences of emails within the text with what is returned in the block. Returns the redacted text of text.
  • Input:
    • text [String]
    • options [Hash]
      • aggressive [Boolean] [Default → False]
        • doesn't require a . or dot + a TLD at the end, but instead compares the last word against a well known list of email domains (eg. contact ashley @ yandex for more info would be caught)
      • check_for_at [Boolean] [Default → False]
        • checks for the word 'at' as '@', currently can result in algorithm being overly greedy as 'at' is such a common word
  • Output:
    • updated text [String]
  • Example
    • Usage: altered_text = replace_emails(...) do CENSORED end
    • Input:
      • text → "My name is Cynthia, a friend of mine needs a nanny to watch her baby in your area, her contact is ( [email protected] ) She will be waiting to hear from you kindly send her an email now!"
    • Output: My name is Cynthia, a friend of mine needs a nanny to watch her baby in your area, her contact is ( CENSORED ) She will be waiting to hear from you kindly send her an email now!

count_phone_numbers_and_emails(text, options = {})

  • Description: Returns the count of the number of emails in the text. Currently uses a map reduce paradigm, which incurs information loss but is cleaner to implement, achieves better results, and is ~2x faster than find_emails
    • Input:
      • text [String]
      • options [Hash]
        • parse_leet [Boolean][Default → True]
        • Parses phone numbers that contain l33t syntax. With this set to true eg. FivE 4 3 F0r On3 67 NiN3 would be caught.
        • remove_spaces [Boolean][Default → True]
        • Parses phone numbers that contain spaces between the numbers. With this set to true eg. F i v E 4 3 F 0 r O n 3 67 N i N 3 would be caught.
        • aggressive [Boolean] [Default → False]
        • doesn't require a . or dot + a TLD at the end, but instead compares the last word against a well known list of email domains (eg. contact ashley @ yandex for more info would be caught)
        • check_for_at [Boolean] [Default → False]
        • checks for the word 'at' as '@', currently can result in algorithm being overly greedy as 'at' is such a common word
    • Output:
      • number of occurences of emails [Integer]
    • Example
      • Input:
        • text → "Hi, Are you seriously interested ..Looking for honest worker .. My e-mail is ashley73299 AT yahoo dot com, phone 416 090 78 NINE 5 ?.. Ashley"
      • Output: 2

find_phone_numbers_and_emails(text, options = {})

  • Description: Finds all occurrences of phone numbers and emails within a block of text.
  • Input:
    • text [String]
    • options [Hash]
      • parse_leet [Boolean][Default → True]
        • Parses phone numbers that contain l33t syntax. With this set to true eg. FivE 4 3 F0r On3 67 NiN3 would be caught.
      • remove_spaces [Boolean][Default → True]
        • Parses phone numbers that contain spaces between the numbers. With this set to true eg. F i v E 4 3 F 0 r O n 3 67 N i N 3 would be caught.
      • aggressive [Boolean] [Default → False]
        • doesn't require a . or dot + a TLD at the end, but instead compares the last word against a well known list of email domains (eg. contact ashley @ yandex for more info would be caught)
      • check_for_at [Boolean] [Default → False]
        • checks for the word 'at' as '@', currently can result in algorithm being overly greedy as 'at' is such a common word
    • Output:
      • [Array]
      • match [Hash]
        • offset: [Integer]
        • value: [String]
  • Example
    • Input:
      • text → "Hi, Are you seriously interested ..Looking for honest worker .. My e-mail is ashley73299 AT yahoo dot com, phone 416 090 78 NINE 5 ?.. Ashley"
    • Output: [{start_offset: 78, end_offset: 106, type: :email, value: 'ashley73299 AT yahoo dot com'}, {start_offset: 115, end_offset: 132, type: :phone, value: 'FOUR FIVE ONE 456 8900'}]

replace_phone_numbers_and_emails(text, options = {}, &block)

  • Description: Replaces all the occurrences of phone numbers and emails within the text with what is returned from the block. Returns the redacted text of text.
  • Input:
    • text [String]
    • options [Hash]
      • parse_leet [Boolean][Default → True]
        • Parses phone numbers that contain l33t syntax. With this set to true eg. FivE 4 3 F0r On3 67 NiN3 would be caught.
      • remove_spaces [Boolean][Default → True]
        • Parses phone numbers that contain spaces between the numbers. With this set to true eg. F i v E 4 3 F 0 r O n 3 67 N i N 3 would be caught.
      • aggressive [Boolean] [Default → False]
        • doesn't require a . or dot + a TLD at the end, but instead compares the last word against a well known list of email domains (eg. contact ashley @ yandex for more info would be caught)
      • check_for_at [Boolean] [Default → False]
        • checks for the word 'at' as '@', currently can result in algorithm being overly greedy as 'at' is such a common word
  • Output:
    • updated text [String]
  • Example
    • Usage: altered_text = replace_phone_numbers_and_emails(...) do CENSORED end
    • Input:
      • text → "My name is Cynthia, a friend of mine needs a nanny to watch her baby in your area, her contact is ( [email protected] or FOUR FIVE ONE 789 4568 ) She will be waiting to hear from you kindly send her an email now!"
    • Output: My name is Cynthia, a friend of mine needs a nanny to watch her baby in your area, her contact is ( CENSORED or CENSORED ) She will be waiting to hear from you kindly send her an email now!

count_urls(text, options = {})

  • Description: Simple union regex to find if the text contains bad urls eg. viagra/cialis. Returns a count of the number of occurrences. appear in the text.
  • Input:
    • text [String]
    • options [Hash]
      • To Be Implemented
    • Output:
      • number of occurences of matches [Integer]
  • Example
    • Input:
      • text → "cialis vs cialis spam guestbook.php?action=http://cialiswalmart.shop"
    • Output: 3