Method: Mapi::RTF.rtf2html

Defined in: lib/mapi/rtf.rb

.rtf2html(rtf) ⇒ `Object`

Note: this is a conversion of the original C code. It is not great: it needs tests, some refactoring, and an attempt to correct some inaccuracies. Hacky, but it works.

Returns nil if the input doesn't look like HTML encapsulated in RTF.
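
A minimal sketch of the two conditions under which the method source returns nil: a missing `\fromhtml` marker, or no `{\*\htmltag` group at all. The predicate name `encapsulated_html?` is hypothetical and is not part of Mapi::RTF; it only mirrors those two checks.

```ruby
# Hypothetical helper (not part of Mapi::RTF): true only if the RTF carries
# the \fromhtml marker and contains at least one {\*\htmltag group.
def encapsulated_html?(rtf)
  rtf.include?('\fromhtml') && rtf.match?(/\{\\\*\\htmltag/)
end

puts encapsulated_html?('{\rtf1\ansi\fromhtml1 {\*\htmltag64 <p>}}')
puts encapsulated_html?('{\rtf1\ansi\fromtext plain text body}')
```

The first sample satisfies both conditions; the second lacks `\fromhtml`, so a caller would get nil from the real method as well.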

Some cases that the original didn't deal with have been patched up, e.g. this chunk, where there are tags outside of the htmlrtf ignore block:

"\htmlrtf \line \htmlrtf0 \line {\*\htmltag84 <a href..."

We take the approach of ignoring all RTF tags not explicitly handled. A proper parse tree would be nicer to work with; finding a suitable Ruby RTF library for that remains to be done.

Some of the original comment from the C code is excerpted here:

Sometimes in MAPI, the PR_BODY_HTML property contains the HTML of a message. But more usually, the HTML is encoded inside the RTF body (which you get in the PR_RTF_COMPRESSED property). These routines concern the decoding of the HTML from this RTF body.

An encoded htmlrtf file is a valid RTF document, but which contains additional html markup information in its comments, and sometimes contains the equivalent rtf markup outside the comments. Therefore, when it is displayed by a plain simple RTF reader, the html comments are ignored and only the rtf markup has effect. Typically, this rtf markup is not as rich as the html markup would have been. But for an html-aware reader (such as the code below), we can ignore all the rtf markup, and extract the html markup out of the comments, and get a valid html document.
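
As a rough illustration of ignoring the RTF-only markup: in the method source, the fallback RTF is bracketed by `\htmlrtf` ... `\htmlrtf0` and skipped. The helper `strip_htmlrtf_spans` below is hypothetical and much cruder than the real method; it only shows that one skipping step on a made-up sample.

```ruby
# Hypothetical helper (not part of Mapi::RTF): drop every span that starts at
# \htmlrtf and runs up to and including the next \htmlrtf0.
# The \b keeps the opening pattern from matching the closing \htmlrtf0 itself.
def strip_htmlrtf_spans(rtf)
  rtf.gsub(/\\htmlrtf\b.*?\\htmlrtf0 ?/m, '')
end

sample = '{\*\htmltag64 <p>}\htmlrtf \par \htmlrtf0 Hello{\*\htmltag72 </p>}'
puts strip_htmlrtf_spans(sample)
# prints {\*\htmltag64 <p>}Hello{\*\htmltag72 </p>}
```

What remains is the HTML carried in the `\htmltag` groups plus the literal text between them, which is what the scanner loop in the method then unwraps.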

There are actually two kinds of html markup in comments. Most of them are prefixed by "*\htmltagNNN", for some number NNN. But sometimes there's one prefixed by "*\mhtmltagNNN" followed by "*\htmltagNNN". In this case, the two are equivalent, but the m-tag is for a MIME Multipart/Mixed Message and contains tags that refer to content-ids (e.g. img src="cid:072344a7") while the normal tag just refers to a name (e.g. img src="fred.jpg"). The code below keeps the m-tag and discards the normal tag. If there are any m-tags like this, then the message also contains an attachment with a PR_CONTENT_ID property e.g. "072344a7". Actually, sometimes the m-tag is e.g. img src="http://outlook/welcome.html" and the attachment has a PR_CONTENT_LOCATION "http://outlook/welcome.html" instead of a PR_CONTENT_ID.
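
The m-tag rule described above can be sketched in isolation. This is a simplified illustration, not the library method: the name `prefer_mtags` is hypothetical, and the sketch assumes each tag group is closed by the first `}` after it and ignores everything outside the groups.

```ruby
require 'strscan'

# Hypothetical, simplified illustration (not part of Mapi::RTF): keep the body
# of a \mhtmltagNNN group and drop the \htmltagNNN group with the same number.
def prefer_mtags(rtf)
  scan = StringScanner.new(rtf)
  html = +''
  pending = nil # number of the m-tag whose plain twin should be dropped
  until scan.eos?
    if scan.scan(/\{\\\*\\(m?)htmltag(\d+) ?/)
      is_mtag = !scan[1].empty?
      number  = scan[2]
      # read the group body up to its closing brace
      body = scan.scan_until(/\}/).to_s.chomp('}')
      if is_mtag
        pending = number
        html << body
      elsif pending == number
        pending = nil # twin of the m-tag just kept: discard it
      else
        html << body
      end
    else
      scan.getch # ignore everything outside the tag groups
    end
  end
  html
end

sample = '{\*\mhtmltag84 <img src="cid:072344a7">}{\*\htmltag84 <img src="fred.jpg">}'
puts prefer_mtags(sample)
# prints <img src="cid:072344a7">
```

The tag number is captured before reading the body because each new scan replaces the scanner's match groups.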

# File 'lib/mapi/rtf.rb', line 119

def rtf2html rtf
	scan = StringScanner.new rtf
	# Require \fromhtml. Is this worth keeping? Apparently you see \fromtext if it
	# was converted from plain text.
	return nil unless rtf["\\fromhtml"]
	html = ''
	ignore_tag = nil
	# skip up to the first htmltag. return nil if we don't ever find one
	return nil unless scan.scan_until /(?=\{\\\*\\htmltag)/
	until scan.eos?
		if scan.scan /\{/
		elsif scan.scan /\}/
		elsif scan.scan /\\\*\\htmltag(\d+) ?/
			#p scan[1]
			if ignore_tag == scan[1]
				scan.scan_until /\}/
				ignore_tag = nil
			end
		elsif scan.scan /\\\*\\mhtmltag(\d+) ?/
			ignore_tag = scan[1]
		elsif scan.scan /\\par ?/
			html << "\r\n"
		elsif scan.scan /\\tab ?/
			html << "\t"
		elsif scan.scan /\\'([0-9A-Fa-f]{2})/
			html << scan[1].hex.chr
		elsif scan.scan /\\pntext/
			scan.scan_until /\}/
		elsif scan.scan /\\htmlrtf/
			scan.scan_until /\\htmlrtf0 ?/
		# Generic rule: throw away any other control word.
		# The two rules above (\pntext and \htmlrtf), however, are handled specially.
		elsif scan.scan /\\[a-z-]+(\d+)? ?/
		#elsif scan.scan /\\li(\d+) ?/
		#elsif scan.scan /\\fi-(\d+) ?/
		elsif scan.scan /[\r\n]/
		elsif scan.scan /\\([{}\\])/
			html << scan[1]
		elsif scan.scan /(.)/
			html << scan[1]
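		else
			# Should be unreachable: the rules above cover every character.
			# Raise rather than loop forever without advancing the scanner.
			raise "unexpected scanner state at position #{scan.pos}"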
		end
	end
	html.strip.empty? ? nil : html
end
