Module: OllamaChat::Parsing
- Included in: Chat
- Defined in: lib/ollama_chat/parsing.rb
Overview
A module that provides content parsing functionality for OllamaChat.
The Parsing module encapsulates methods for processing various types of input sources including HTML, XML, CSV, RSS, Atom, PDF, and PostScript documents. It handles content extraction and conversion into standardized text formats suitable for chat interactions. The module supports different document policies for handling imported or embedded content and provides utilities for parsing structured data from multiple source types.
Instance Method Summary
- #parse_atom(source_io) ⇒ String
  The parse_atom method processes an Atom feed from the provided IO source and converts it into a formatted text representation.
- #parse_content(content, images) ⇒ Array<String, Documentrix::Utils::Tags>
  Parses content and processes embedded resources based on document policy.
- #parse_csv(source_io) ⇒ String
  The parse_csv method processes CSV content from an input source and converts it into a formatted string representation.
- #parse_rss(source_io) ⇒ String
  The parse_rss method processes an RSS feed source and converts it into a formatted text representation.
- #parse_source(source_io) ⇒ String?
  The parse_source method processes different types of input sources and converts them into a standardized text representation.
- #pdf_read(io) ⇒ String
  The pdf_read method extracts text content from a PDF file by reading all pages.
- #ps_read(io) ⇒ String?
  Reads PostScript content, converts it to PDF with Ghostscript, and extracts its text.
- #reverse_markdown(html) ⇒ String
  The reverse_markdown method converts HTML content into Markdown format.
Instance Method Details
#parse_atom(source_io) ⇒ String
The parse_atom method processes an Atom feed from the provided IO source and converts it into a formatted text representation. It extracts the feed title and iterates through each item to build a structured output containing titles, links, and update dates.
The content of each item is converted using reverse_markdown for better readability.
Returns a formatted text representation of the Atom feed, including its title, items, links, update dates, and content.
# File 'lib/ollama_chat/parsing.rb', line 115

def parse_atom(source_io)
  feed = RSS::Parser.parse(source_io, false, false)
  title = <<~EOT
    # #{feed.title.content}

  EOT
  feed.items.inject(title) do |text, item|
    text << <<~EOT
      ## [#{item&.title&.content}](#{item&.link&.href})

      updated on #{item&.updated&.content}

      #{reverse_markdown(item&.content&.content)}

    EOT
  end
end
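The accumulation pattern used by parse_atom (and parse_rss) can be tried without a feed parser. The following is a minimal sketch, not the library's code: FeedItem and format_feed are hypothetical names, and the items are assumed to be already parsed and already converted to Markdown. It only illustrates how inject appends one heredoc per item to the title string.

```ruby
# Hypothetical stand-in for a parsed feed entry; not part of OllamaChat.
FeedItem = Struct.new(:title, :link, :updated, :content)

# Builds a Markdown document: a level-1 heading for the feed, then one
# level-2 section per item, appended to the same string via inject.
def format_feed(feed_title, items)
  header = <<~EOT
    # #{feed_title}

  EOT
  items.inject(header) do |text, item|
    text << <<~EOT
      ## [#{item.title}](#{item.link})

      updated on #{item.updated}

      #{item.content}

    EOT
  end
end

items = [FeedItem.new('First post', 'https://example.com/1', '2024-01-01', 'Hello.')]
puts format_feed('Example feed', items)
```

Because String#<< returns the receiver, the block's value is always the growing string, which is what inject needs as the next accumulator.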
#parse_content(content, images) ⇒ Array<String, Documentrix::Utils::Tags>
Parses content and processes embedded resources based on document policy.
This method analyzes input content for URLs, tags, and file references, fetches referenced resources, and processes them according to the current document policy. It supports different processing modes for various content types.
# File 'lib/ollama_chat/parsing.rb', line 203

def parse_content(content, images)
  images.clear
  = Documentrix::Utils::Tags.new valid_tag: /\A#*([\w\]\[]+)/
  contents = [ content ]
  content.scan(%r((https?://\S+)|(?<![a-zA-Z\d])#+([\w\]\[]+)|(?:file://)?(\S*\/\S+))).each do |url, tag, file|
    case
    when tag
      .add(tag)
      next
    when file
      file = file.sub(/#.*/, '')
      file =~ %r(\A[~./]) or file.prepend('./')
      file = begin
        File.(file)
      rescue ArgumentError
        next
      end
      File.exist?(file) or next
      source = file
    when url
      links.add(url.to_s)
      source = url
    end
    fetch_source(source) do |source_io|
      case source_io&.content_type&.media_type
      when 'image'
        add_image(images, source_io, source)
      when 'text', 'application', nil
        case @document_policy
        when 'ignoring'
          nil
        when 'importing'
          contents << import_source(source_io, source)
        when 'embedding'
          (source_io, source)
        when 'summarizing'
          contents << summarize_source(source_io, source)
        end
      else
        STDERR.puts(
          "Cannot fetch #{source.to_s.inspect} with content type "\
          "#{source_io&.content_type.inspect}"
        )
      end
    end
  end
  new_content = contents.select { _1.present? rescue nil }.compact * "\n\n"
  return new_content, ( unless .empty?)
end
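To see how the scan pattern in parse_content separates URLs, tags, and file references, the sketch below reuses that regular expression on its own. classify_references and REFERENCE_PATTERN are hypothetical names introduced for illustration; the sketch only labels matches and does not fetch anything or apply a document policy.

```ruby
# Same three alternatives as the pattern in parse_content:
# 1. an http(s) URL, 2. a #tag not preceded by a letter or digit,
# 3. a path containing a slash, optionally prefixed with file://.
REFERENCE_PATTERN = %r{(https?://\S+)|(?<![a-zA-Z\d])#+([\w\]\[]+)|(?:file://)?(\S*/\S+)}

# Returns [kind, value] pairs in the order the references appear.
def classify_references(text)
  text.scan(REFERENCE_PATTERN).map do |url, tag, file|
    if tag
      [:tag, tag]
    elsif file
      [:file, file.sub(/#.*/, '')] # drop any #fragment, as parse_content does
    else
      [:url, url]
    end
  end
end

p classify_references('Read https://example.com/a #ruby ./lib/x.rb')
```

Exactly one capture group is non-nil per match, which is why a plain if/elsif chain is enough to tell the three kinds apart.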
#parse_csv(source_io) ⇒ String
The parse_csv method processes CSV content from an input source and converts it into a formatted string representation. It iterates through each row of the CSV, skipping empty rows, and constructs a structured output where each row’s fields are formatted with indentation and separated by newlines. The resulting string includes double newlines between rows for readability.
# File 'lib/ollama_chat/parsing.rb', line 61

def parse_csv(source_io)
  result = +''
  CSV.table(File.new(source_io), col_sep: ?,).each do |row|
    next if row.fields.select(&:present?).none?
    result << row.map { |pair|
      pair.compact.map { _1.to_s.strip } * ': ' if pair.last.present?
    }.select(&:present?).map { _1.prepend(' ') } * ?\n
    result << "\n\n"
  end
  result
end
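The row formatting described above can be approximated with Ruby's csv library alone. This is a sketch under stated assumptions, not the method itself: format_csv is a hypothetical name, it parses a string with CSV.parse instead of reading a file with CSV.table, it keeps header names as strings, and it replaces present? (which is not part of Ruby's standard library) with plain nil and empty checks.

```ruby
require 'csv'

# Formats each non-empty row as indented "header: value" lines,
# with a blank line between rows.
def format_csv(text)
  result = +''
  CSV.parse(text, headers: true).each do |row|
    # Keep only the fields that actually carry a value.
    fields = row.to_h.reject { |_, value| value.nil? || value.strip.empty? }
    next if fields.empty? # skip rows without any data
    result << fields.map { |header, value| "  #{header.strip}: #{value.strip}" }.join("\n")
    result << "\n\n"
  end
  result
end

puts format_csv("name,lang\nAda,Ruby\n,\nBob,\n")
```

In this input the second data row has no values and is skipped, and the empty lang field of the last row is left out of the output.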
#parse_rss(source_io) ⇒ String
The parse_rss method processes an RSS feed source and converts it into a formatted text representation. It extracts the channel title and iterates through each item in the feed to build a structured output. The method uses the RSS parser to handle the source input and formats the title, link, publication date, and description of each item into a readable text format with markdown-style headers and links.
Returns a formatted text representation of the RSS feed with its channel title and item details.
# File 'lib/ollama_chat/parsing.rb', line 85

def parse_rss(source_io)
  feed = RSS::Parser.parse(source_io, false, false)
  title = <<~EOT
    # #{feed&.channel&.title}

  EOT
  feed.items.inject(title) do |text, item|
    text << <<~EOT
      ## [#{item&.title}](#{item&.link})

      updated on #{item&.pubDate}

      #{reverse_markdown(item&.description)}

    EOT
  end
end
#parse_source(source_io) ⇒ String?
The parse_source method processes different types of input sources and converts them into a standardized text representation.
Returns nil if the content type is not supported.
# File 'lib/ollama_chat/parsing.rb', line 22

def parse_source(source_io)
  case source_io&.content_type
  when 'text/html'
    reverse_markdown(source_io.read)
  when 'text/xml', 'application/xml'
    if source_io.read(8192) =~ %r(^\s*<rss\s)
      source_io.rewind
      return parse_rss(source_io)
    end
    source_io.rewind
    source_io.read
  when 'text/csv'
    parse_csv(source_io)
  when 'application/rss+xml'
    parse_rss(source_io)
  when 'application/atom+xml'
    parse_atom(source_io)
  when 'application/postscript'
    ps_read(source_io)
  when 'application/pdf'
    pdf_read(source_io)
  when %r(\Aapplication/(json|ld\+json|x-ruby|x-perl|x-gawk|x-python|x-javascript|x-c?sh|x-dosexec|x-shellscript|x-tex|x-latex|x-lyx|x-bibtex)),
    %r(\Atext/),
    nil
    source_io.read
  else
    STDERR.puts "Cannot parse #{source_io&.content_type} document."
    return
  end
end
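The dispatch in parse_source relies on case/when comparing the content type against both strings and regular expressions, and on peeking at the first bytes of generic XML to detect RSS. The sketch below isolates those two ideas. It assumes content types are plain strings, abbreviates the list of application subtypes, and uses the hypothetical helper names parser_for and rss_document?; it returns the name of the method that would be called instead of calling it.

```ruby
require 'stringio'

# Maps a content type to the name of the parsing step that would handle it.
# Order matters: 'text/csv' must come before the generic text/ pattern.
def parser_for(content_type)
  case content_type
  when 'text/html'              then :reverse_markdown
  when 'text/csv'               then :parse_csv
  when 'application/rss+xml'    then :parse_rss
  when 'application/atom+xml'   then :parse_atom
  when 'application/postscript' then :ps_read
  when 'application/pdf'        then :pdf_read
  when %r{\Aapplication/(json|x-ruby|x-shellscript)}, %r{\Atext/}, nil
    :read # plain text and unknown types are passed through unchanged
  end
end

# Peeks at up to 8192 bytes to see whether generic XML is an RSS document,
# then rewinds so the real parser can read from the start.
def rss_document?(io)
  head = io.read(8192)
  io.rewind
  !!(head =~ %r{^\s*<rss\s})
end

p parser_for('application/pdf')
p rss_document?(StringIO.new(%(<?xml version="1.0"?>\n<rss version="2.0"></rss>)))
```

A case expression without a matching branch evaluates to nil, which mirrors the String? return type documented for parse_source.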
#pdf_read(io) ⇒ String
The pdf_read method extracts text content from a PDF file by reading all pages.
# File 'lib/ollama_chat/parsing.rb', line 139

def pdf_read(io)
  reader = PDF::Reader.new(io)
  reader.pages.inject(+'') { |result, page| result << page.text }
end
#ps_read(io) ⇒ String?
Reads PostScript content, converts it to PDF with Ghostscript, and extracts its text.
This method takes an IO object containing PostScript data, pipes it through Ghostscript's pdfwrite device into a temporary file, and returns the text extracted from the resulting PDF via pdf_read. If Ghostscript (gs) is not available in the system path, it prints an error message to standard error and returns nil.
# File 'lib/ollama_chat/parsing.rb', line 153

def ps_read(io)
  gs = `which gs`.chomp
  if gs.present?
    Tempfile.create do |tmp|
      IO.popen("#{gs} -q -sDEVICE=pdfwrite -sOutputFile=#{tmp.path} -", 'wb') do |gs_io|
        until io.eof?
          buffer = io.read(1 << 17)
          IO.select(nil, [ gs_io ], nil)
          gs_io.write buffer
        end
        gs_io.close
        File.open(tmp.path, 'rb') do |pdf|
          pdf_read(pdf)
        end
      end
    end
  else
    STDERR.puts "Cannot convert #{io&.content_type} whith ghostscript, gs not in path."
  end
end
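ps_read locates Ghostscript by shelling out to `which gs`. As an illustration of the same availability check, the sketch below swaps in a pure-Ruby search of the PATH environment variable, which avoids depending on the which command. find_executable is a hypothetical helper, not part of OllamaChat, and the sketch assumes a Unix-like system where executables have no file extension.

```ruby
# Returns the full path of the first executable called `name` found in
# the directories listed in PATH, or nil if there is none.
def find_executable(name)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

gs = find_executable('gs')
if gs
  puts "Ghostscript found at #{gs}"
else
  warn 'gs not in path, PostScript conversion unavailable'
end
```

Returning nil for a missing binary lets the caller branch the same way ps_read does when the `which` output is empty.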
#reverse_markdown(html) ⇒ String
The reverse_markdown method converts HTML content into Markdown format.
This method processes HTML input and transforms it into equivalent Markdown, using specific conversion options to ensure compatibility and formatting.
# File 'lib/ollama_chat/parsing.rb', line 183

def reverse_markdown(html)
  ReverseMarkdown.convert(
    html,
    unknown_tags: :bypass,
    github_flavored: true,
    tag_border: ''
  )
end