Module: OllamaChat::Parsing

Included in:
Chat
Defined in:
lib/ollama_chat/parsing.rb

Instance Method Details

#parse_atom(source_io) ⇒ String

The parse_atom method processes an Atom feed from the provided IO source and converts it into a formatted text representation. It extracts the feed title and iterates through each item to build a structured output containing titles, links, and update dates.

The content of each item is converted using reverse_markdown for better readability.

Parameters:

  • source_io (IO)

    the input stream containing the Atom feed data

Returns:

  • (String)

    a formatted string representation of the Atom feed with title, items, links, update dates, and content



# File 'lib/ollama_chat/parsing.rb', line 102

def parse_atom(source_io)
  feed = RSS::Parser.parse(source_io, false, false)
  title = <<~EOT
    # #{feed.title.content}

  EOT
  feed.items.inject(title) do |text, item|
    text << <<~EOT
      ## [#{item&.title&.content}](#{item&.link&.href})

      updated on #{item&.updated&.content}

      #{reverse_markdown(item&.content&.content)}

    EOT
  end
end
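
The accumulation pattern used by parse_atom (a heredoc header, then one heredoc block appended per item via inject) can be tried without a real feed. The following is only a sketch: the Entry struct and the format_entries helper are hypothetical stand-ins for the objects returned by RSS::Parser, and the body is passed through as-is instead of being converted with reverse_markdown.

```ruby
# Hypothetical stand-in for a parsed feed entry; real entries come from RSS::Parser.
Entry = Struct.new(:title, :link, :updated, :body)

# Builds the same kind of Markdown text as parse_atom: a top-level heading,
# then one second-level section per entry, appended to the accumulator.
def format_entries(feed_title, entries)
  header = +"# #{feed_title}\n\n"
  entries.inject(header) do |text, entry|
    # String#<< returns the receiver, so the block value is the accumulator.
    text << <<~EOT
      ## [#{entry&.title}](#{entry&.link})

      updated on #{entry&.updated}

      #{entry&.body}

    EOT
  end
end

entries = [Entry.new('Hello', 'https://example.com/1', '2024-01-01', 'First post')]
puts format_entries('News', entries)
```

Because every field is read with safe navigation, a missing value renders as an empty string rather than raising.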

#parse_content(content, images) ⇒ Array<String, Documentrix::Utils::Tags>

Parses content and processes embedded resources based on document policy

This method analyzes input content for URLs, tags, and file references, fetches referenced resources, and processes them according to the current document policy. It supports different processing modes for various content types.

Parameters:

  • content (String)

    The input content string to parse

  • images (Array)

    An array to collect image references (will be cleared)

Returns:

  • (Array<String, Documentrix::Utils::Tags>)

    Returns an array containing the processed content string and tags object if any tags were found



# File 'lib/ollama_chat/parsing.rb', line 190

def parse_content(content, images)
  images.clear
  tags = Documentrix::Utils::Tags.new valid_tag: /\A#*([\w\]\[]+)/

  contents = [ content ]
  content.scan(%r((https?://\S+)|(?<![a-zA-Z\d])#+([\w\]\[]+)|(?:file://)?(\S*\/\S+))).each do |url, tag, file|
    case
    when tag
      tags.add(tag)
      next
    when file
      file = file.sub(/#.*/, '')
      file =~ %r(\A[~./]) or file.prepend('./')
      File.exist?(file) or next
      source = file
    when url
      links.add(url.to_s)
      source = url
    end
    fetch_source(source) do |source_io|
      case source_io&.content_type&.media_type
      when 'image'
        add_image(images, source_io, source)
      when 'text', 'application', nil
        case @document_policy
        when 'ignoring'
          nil
        when 'importing'
          contents << import_source(source_io, source)
        when 'embedding'
          embed_source(source_io, source)
        when 'summarizing'
          contents << summarize_source(source_io, source)
        end
      else
        STDERR.puts(
          "Cannot fetch #{source.to_s.inspect} with content type "\
          "#{source_io&.content_type.inspect}"
        )
      end
    end
  end
  new_content = contents.select { _1.present? rescue nil }.compact * "\n\n"
  return new_content, (tags unless tags.empty?)
end
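
To see how the scan pattern separates URLs, tags, and file references, the regular expression can be exercised on its own. This is a minimal sketch: the pattern is copied from the method above, while the TOKEN_RE constant and the classify helper are hypothetical names introduced only for illustration. No fetching or File.exist? check is performed here.

```ruby
# The scan pattern from parse_content, copied unchanged.
TOKEN_RE = %r((https?://\S+)|(?<![a-zA-Z\d])#+([\w\]\[]+)|(?:file://)?(\S*\/\S+))

# Hypothetical helper: labels each match in the same order the case
# statement checks the captures (tag first, then file, then url).
def classify(text)
  text.scan(TOKEN_RE).map do |url, tag, file|
    if tag
      [:tag, tag]
    elsif file
      [:file, file]
    else
      [:url, url]
    end
  end
end

p classify('see https://example.com/a and #ruby plus ./notes/todo.txt')
# prints [[:url, "https://example.com/a"], [:tag, "ruby"], [:file, "./notes/todo.txt"]]
```

The negative lookbehind means a `#` directly after a letter or digit (as in `a#b`) is not treated as a tag, and an optional `file://` prefix is dropped from the captured path.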

#parse_csv(source_io) ⇒ String

The parse_csv method processes CSV content from an input source and converts it into a formatted string representation. It iterates through each row of the CSV, skipping empty rows, and constructs a structured output where each row’s fields are formatted with indentation and separated by newlines. The resulting string includes double newlines between rows for readability.

Parameters:

  • source_io (IO)

    the input source containing CSV data

Returns:

  • (String)

    a formatted string representation of the CSV content



# File 'lib/ollama_chat/parsing.rb', line 48

def parse_csv(source_io)
  result = +''
  CSV.table(File.new(source_io), col_sep: ?,).each do |row|
    next if row.fields.select(&:present?).none?
    result << row.map { |pair|
      pair.compact.map { _1.to_s.strip } * ': ' if pair.last.present?
    }.select(&:present?).map { _1.prepend('  ') } * ?\n
    result << "\n\n"
  end
  result
end
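
The row formatting described above can be approximated with the csv library alone. The sketch below is an assumption-laden variant, not the method itself: format_csv is a hypothetical helper that takes a String instead of a file, uses CSV.parse with headers instead of CSV.table (so headers are not symbolized and values are not converted), and replaces present? with plain Ruby checks.

```ruby
require 'csv'

# Hypothetical helper mirroring the output shape of parse_csv:
# one indented "header: value" line per non-empty field, and a blank
# line after each row. Rows without any non-empty field are skipped.
def format_csv(text)
  result = +''
  CSV.parse(text, headers: true).each do |row|
    # row.to_a yields [header, value] pairs; drop pairs with empty values.
    pairs = row.to_a.reject { |_, value| value.nil? || value.strip.empty? }
    next if pairs.empty?
    result << pairs.map { |header, value| "  #{header.to_s.strip}: #{value.strip}" }.join("\n")
    result << "\n\n"
  end
  result
end

puts format_csv("name,city\nAda,London\n,\nBob,\n")
```

In this input the row consisting of a lone comma is skipped entirely, and the empty city field for Bob is left out of his block.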

#parse_rss(source_io) ⇒ String

The parse_rss method processes an RSS feed source and converts it into a formatted text representation. It extracts the channel title and iterates through each item in the feed to build a structured output. The method uses the RSS parser to handle the source input and formats the title, link, publication date, and description of each item into a readable text format with markdown-style headers and links.

Parameters:

  • source_io (IO)

    the input stream containing the RSS feed data

Returns:

  • (String)

    a formatted string representation of the RSS feed with channel title and item details



# File 'lib/ollama_chat/parsing.rb', line 72

def parse_rss(source_io)
  feed = RSS::Parser.parse(source_io, false, false)
  title = <<~EOT
    # #{feed&.channel&.title}

  EOT
  feed.items.inject(title) do |text, item|
    text << <<~EOT
      ## [#{item&.title}](#{item&.link})

      updated on #{item&.pubDate}

      #{reverse_markdown(item&.description)}

    EOT
  end
end

#parse_source(source_io) ⇒ String?

The parse_source method processes different types of input sources and converts them into a standardized text representation.

Parameters:

  • source_io (IO)

    the input source to be parsed

Returns:

  • (String, nil)

    the parsed content as a string or nil if the content type is not supported



# File 'lib/ollama_chat/parsing.rb', line 9

def parse_source(source_io)
  case source_io&.content_type
  when 'text/html'
    reverse_markdown(source_io.read)
  when 'text/xml', 'application/xml'
    if source_io.read(8192) =~ %r(^\s*<rss\s)
      source_io.rewind
      return parse_rss(source_io)
    end
    source_io.rewind
    source_io.read
  when 'text/csv'
    parse_csv(source_io)
  when 'application/rss+xml'
    parse_rss(source_io)
  when 'application/atom+xml'
    parse_atom(source_io)
  when 'application/postscript'
    ps_read(source_io)
  when 'application/pdf'
    pdf_read(source_io)
  when %r(\Aapplication/(json|ld\+json|x-ruby|x-perl|x-gawk|x-python|x-javascript|x-c?sh|x-dosexec|x-shellscript|x-tex|x-latex|x-lyx|x-bibtex)), %r(\Atext/), nil
    source_io.read
  else
    STDERR.puts "Cannot parse #{source_io&.content_type} document."
    return
  end
end
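
The dispatch logic can be illustrated in isolation, including the step where generic XML is sniffed for an `<rss` element and the stream is rewound before parsing. This is a simplified sketch under stated assumptions: parser_for is a hypothetical helper, content types are passed as plain strings, only a shortened application/* pattern is shown, and the return value is just the name of the branch that would run.

```ruby
require 'stringio'

# Hypothetical helper: returns a symbol naming the branch parse_source
# would take for the given content type (a plain String or nil here).
def parser_for(content_type, io)
  case content_type
  when 'text/html'
    :reverse_markdown
  when 'text/xml', 'application/xml'
    # Peek at the first 8 KiB, then rewind so the chosen parser sees everything.
    head = io.read(8192).to_s
    io.rewind
    head.match?(%r(^\s*<rss\s)) ? :parse_rss : :read
  when 'text/csv'
    :parse_csv
  when 'application/rss+xml'
    :parse_rss
  when 'application/atom+xml'
    :parse_atom
  when 'application/postscript'
    :ps_read
  when 'application/pdf'
    :pdf_read
  when %r(\Aapplication/(json|ld\+json)), %r(\Atext/), nil
    :read
  else
    :unsupported
  end
end

xml = StringIO.new(%(<?xml version="1.0"?>\n<rss version="2.0"></rss>))
p parser_for('text/xml', xml)                      # prints :parse_rss
p parser_for('image/png', StringIO.new(''))        # prints :unsupported
```

Branch order matters: the exact matches for text/html and text/csv have to come before the catch-all `%r(\Atext/)` pattern, otherwise they would be read verbatim.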

#pdf_read(io) ⇒ String

The pdf_read method extracts text content from a PDF file by reading all pages.

Parameters:

  • io (IO)

    the input stream containing the PDF data

Returns:

  • (String)

    the concatenated text content from all pages in the PDF



# File 'lib/ollama_chat/parsing.rb', line 126

def pdf_read(io)
  reader = PDF::Reader.new(io)
  reader.pages.inject(+'') { |result, page| result << page.text }
end

#ps_read(io) ⇒ String?

Reads PostScript content by converting it to PDF with Ghostscript

This method takes an IO object containing PostScript data, pipes it through Ghostscript’s pdfwrite device into a temporary PDF file, and returns the text extracted from that file by pdf_read. If Ghostscript (gs) is not available in the system path, it prints an error message to standard error and returns nil.

Parameters:

  • io (IO)

    An IO object containing the PostScript data to be converted

Returns:

  • (String, nil)

    The text extracted from the converted PDF, or nil if Ghostscript is not available



# File 'lib/ollama_chat/parsing.rb', line 140

def ps_read(io)
  gs = `which gs`.chomp
  if gs.present?
    Tempfile.create do |tmp|
      IO.popen("#{gs} -q -sDEVICE=pdfwrite -sOutputFile=#{tmp.path} -", 'wb') do |gs_io|
        until io.eof?
          buffer = io.read(1 << 17)
          IO.select(nil, [ gs_io ], nil)
          gs_io.write buffer
        end
        gs_io.close
        File.open(tmp.path, 'rb') do |pdf|
          pdf_read(pdf)
        end
      end
    end
  else
    STDERR.puts "Cannot convert #{io&.content_type} with ghostscript, gs not in path."
  end
end
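
The loop that feeds Ghostscript reads the input in chunks of `1 << 17` bytes (128 KiB) rather than loading it all at once. The chunking on its own can be sketched with in-memory streams. Here copy_in_chunks is a hypothetical helper, StringIO stands in for both the source and the pipe, and the IO.select wait for writability is omitted because StringIO is not a selectable IO.

```ruby
require 'stringio'

# Hypothetical helper: copies src to dst in fixed-size chunks, like the
# until/eof? loop in ps_read, and returns the number of chunks written.
def copy_in_chunks(src, dst, chunk_size = 1 << 17)
  chunks = 0
  until src.eof?
    dst.write(src.read(chunk_size))
    chunks += 1
  end
  chunks
end

src = StringIO.new('x' * 300_000)
dst = StringIO.new
p copy_in_chunks(src, dst)   # prints 3 (131072 + 131072 + 37856 bytes)
p dst.string.bytesize        # prints 300000
```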

#reverse_markdown(html) ⇒ String

The reverse_markdown method converts HTML content into Markdown format.

This method processes HTML input and transforms it into equivalent Markdown, using specific conversion options to ensure compatibility and formatting.

Parameters:

  • html (String)

    the HTML string to be converted

Returns:

  • (String)

    the resulting Markdown formatted string



# File 'lib/ollama_chat/parsing.rb', line 170

def reverse_markdown(html)
  ReverseMarkdown.convert(
    html,
    unknown_tags: :bypass,
    github_flavored: true,
    tag_border: ''
  )
end