ExtendedEmailReplyParser
When implementing a "reply or comment by email" feature, it's neccessary to filter out signatures and the previous conversation. One needs to extract just the relevant parts for the conversation or comment section of the application. This is what this ruby gem helps to do.
This gem is an extended version of github's email_reply_parser
. It wraps the original email_reply_parser and allows to build extensions such as support for i18n and detecting previous conversation that is not properly marked as quotation by the sender's mail client.
Usage
Parsing incoming emails
To extract the relevant text of an email reply, call ExtendedEmailReplyParser.parse
, which accepts either a path to the email file, or a Mail::Message
object, or the email body text itself, and returns the parsed, i.e. relevant text, which can be used as comment in the application
ExtendedEmailReplyParser.parse "/path/to/email.eml"
ExtendedEmailReplyParser.parse
ExtendedEmailReplyParser.parse body_text # => parsed text as String
Or, if you prefer to call #parse
on the Mail::Message
:
.parse # => parsed text as String
For example, for a incoming Mail::Message
, message
:
@comment. = User.where(email: .from.first).first
@comment.text = ExtendedEmailReplyParser.parse
@comment.save!
Extracting the body text
This gem extends Mail::Message to conveniently extract the text body as utf-8.
.extract_text
Or, to see where this comes from:
ExtendedEmailReplyParser.extract_text
ExtendedEmailReplyParser.extract_text '/path/to/email.eml'
Or, in two separate steps:
ExtendedEmailReplyParser.read("/path/to/email.eml") # => Mail::Message
ExtendedEmailReplyParser.read("/path/to/email.eml").extract_text # => String
Optional: How to handle html-only emails: There are emails that do not have a text part but only an html part. For those emails, ExtendedEmailReplyParser.parse
uses the html part. But, to give you more control over how to handle those situations, ExtendedEmailReplyParser.extract_text
returns nil
for those emails. If you want your text extraction to fall back to the html part if the text part is missing, use this:
ExtendedEmailReplyParser.extract_text_or_html
ExtendedEmailReplyParser.extract_text_or_html '/path/to/email.eml'
The Mail::Message
object it self is extended to support message.extract_text
, message.extract_html_body_content
as well as message.extract_text_or_html
.
Writing a parser
The parsing system allows you to add your own parser to the parsing chain. Just define a class inheriting from ExtendedEmailReplyParser::Parsers::Base
and implement a parse
method. The text before parsing is accessed via text
.
# app/models/email_parsers/my_parser.rb
class EmailParsers::ShoutParser < ExtendedEmailReplyParser::Parsers::Base
# This parser SHOUTS THE WHOLE EMAIL.
# The text before parsing is accessed by `text`.
#
def parse
text.upcase
end
end
Selecting parsers to use
By default, all parsers that inherit from ExtendedEmailReplyParser::Parsers::Base
are used. One simply has to call:
ExtendedEmailReplyParser.parse
In order to select specific parsers, just chain them yourself:
EmailParsers::ShoutParser.parse \
ExtendedEmailReplyParser::Parsers::Github.parse \
Installation
Add this line to your application's Gemfile:
# Gemfile
gem 'extended_email_reply_parser'
Or, for the edge version:
# Gemfile
gem 'extended_email_reply_parser', github: 'fiedl/extended_email_reply_parser'
And then execute:
➜ bundle
Or install it yourself as:
➜ gem install extended_email_reply_parser
Development
After checking out the repo, run bin/setup
to install dependencies. Then, run rake spec
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and tags, and push the .gem
file to rubygems.org.
Helper methods for writing parsers
To accomplish the most common parsing operations, there are a couple of helper methods.
This, for example, is the English parser.
module ExtendedEmailReplyParser
class Parsers::I18nEn < Parsers::Base
def parse
except_in_visible_block_quotes do
hide_everything_after ["From: ", "Sent: ", "To: "]
end
end
end
end
add_quote_header_regex
The github parser needs to know how to identify the header line of quotes, for example "On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote":
Hi,
On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote:
> Hi folks
>
> What is the best way to clear a Riak bucket of all key, values after
> running a test?
> I am currently using the Java HTTP API.
You can list the keys for the bucket and call delete for each. Or if you
put the keys (and kept track of them in your test) you can delete them
one at a time (without incurring the cost of calling list first.)
By default, it uses the regex /^On .* wrote:$/
for that. To make it recognize other header lines, specify their patterns using add_quote_header_regex
.
Since this is needed by the github parser, i.e. possibly before the parse
method of your custom parser is run, make sure to add the quote header regex in the class head:
module ExtendedEmailReplyParser
class Parsers::I18nDe < Parsers::Base
add_quote_header_regex '^Am .* schrieb.*$'
# ...
end
end
hide_everything_after
Some email clients do not quote the previous conversation.
Hi Chris,
this is great, thanks!
Cheers, John
From: Chris <[email protected]>
Sent: Saturday, July 09, 2016 3:27 PM
To: John <[email protected]>
Subject: The solution!
Hi John,
I've just found a solution to our big problem!
...
To remove the previous conversation, tell the parser expressions to identify where start of the previous conversation:
module ExtendedEmailReplyParser
class Parsers::I18nEn < Parsers::Base
def parse
except_in_visible_block_quotes do
hide_everything_after ["From: ", "Sent: ", "To: "]
end
# ...
end
end
end
(The parser will combine the expressions to a regex: /(#{expressions.join(".*?")}.*?\n)/m
, for example: /(From: .*?Sent: .*?To: .*?\n)/m
.)
To avoid cutting off the email within a visible quote, wrap the hide_everything_after
within a except_in_visible_block_quotes
block as shown above.
Hi Chris,
> From: Chris <[email protected]>
> Sent: Saturday, July 09, 2016 3:27 PM
> To: John <[email protected]>
> Subject: The solution!
>
> Hi John,
> I've just found a solution to our big problem!
this is great, thanks!
Cheers, John
If not wrapped in except_in_visible_block_quotes
, the parsed email would just be "Hi Chris,", because everything after "From: Sent: To:" would be cut off.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/fiedl/extended_email_reply_parser.
License
The gem is available as open source under the terms of the MIT License.