Class: Documentrix::Documents

Inherits:
Object
Includes:
Cache, Kramdown::ANSI::Width
Defined in:
lib/documentrix/documents.rb,
lib/documentrix/documents/cache/redis_backed_memory_cache.rb

Defined Under Namespace

Modules: Cache, Splitters
Classes: MemoryCache, RedisBackedMemoryCache, RedisCache

Constant Summary

Record =
Class.new Documentrix::Documents::Cache::Records::Record

Constructor Details

#initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false) ⇒ Documents

The initialize method sets up the Documentrix::Documents instance: it stores the client, model, model options, collection, and debug flag, and connects the cache.

Parameters:

  • ollama

    the client used for embedding

  • model

    the name of the model to use for embeddings

  • model_options (defaults to: nil)

    optional parameters for the model

  • collection (defaults to: nil)

    the default collection to use (:default if nil)

  • embedding_length (defaults to: 1_024)

    the length of the embeddings

  • cache (defaults to: MemoryCache)

    the cache to use for storing documents

  • database_filename (defaults to: nil)

    the filename of the SQLite database to use (':memory:' if nil)

  • redis_url (defaults to: nil)

    the URL of the Redis server to use

  • debug (defaults to: false)

    whether to enable debugging mode



# File 'lib/documentrix/documents.rb', line 37

def initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false)
  collection ||= default_collection
  @ollama, @model, @model_options, @collection, @debug =
    ollama, model, model_options, collection.to_sym, debug
  database_filename ||= ':memory:'
  @cache = connect_cache(cache, redis_url, embedding_length, database_filename)
end
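
The constructor above resolves two nil arguments to fallback values before use. A minimal sketch of just that behaviour follows; DefaultsSketch is a hypothetical stand-in class and not part of Documentrix, and it leaves out the client, model, and cache connection entirely.

```ruby
# Hypothetical stand-in, NOT the real Documentrix::Documents class: it mirrors
# only the default-resolution lines of the constructor shown above.
class DefaultsSketch
  attr_reader :collection, :database_filename

  def initialize(collection: nil, database_filename: nil)
    # A nil collection falls back to the default, then becomes a Symbol.
    collection ||= default_collection
    @collection = collection.to_sym
    # A nil filename falls back to an in-memory SQLite database.
    @database_filename = database_filename || ':memory:'
  end

  def default_collection
    :default
  end
end

plain  = DefaultsSketch.new
custom = DefaultsSketch.new(collection: 'notes', database_filename: 'docs.db')

p plain.collection         # :default
p plain.database_filename  # ":memory:"
p custom.collection        # :notes
```

Passing a String for collection works because the value is converted with to_sym before it is stored.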

Instance Attribute Details

#cache ⇒ Object (readonly)

Returns the value of attribute cache.



# File 'lib/documentrix/documents.rb', line 52

def cache
  @cache
end

#collection ⇒ Object

Returns the value of attribute collection.



# File 'lib/documentrix/documents.rb', line 52

def collection
  @collection
end

#model ⇒ Object (readonly)

Returns the value of attribute model.



# File 'lib/documentrix/documents.rb', line 52

def model
  @model
end

#ollama ⇒ Object (readonly)

Returns the value of attribute ollama.



# File 'lib/documentrix/documents.rb', line 52

def ollama
  @ollama
end

Instance Method Details

#[](text) ⇒ Object

The [] method retrieves the value associated with the given text from the cache.

Parameters:

  • text

    the text for which to retrieve the cached value

Returns:

  • the cached value, or nil if not found



# File 'lib/documentrix/documents.rb', line 130

def [](text)
  @cache[key(text)]
end

#[]=(text, record) ⇒ Object

The []= method sets the value for a given text in the cache.

Parameters:

  • text

    the text to set

  • record

    the value to store



# File 'lib/documentrix/documents.rb', line 138

def []=(text, record)
  @cache[key(text)] = record
end

#add(texts, batch_size: nil, source: nil, tags: []) ⇒ Documentrix::Documents Also known as: <<

The add method adds new texts to the documents collection in several stages. It first filters out already existing texts from the input array using the prepare_texts method, then fetches embeddings for the remaining texts using the configured model and model options. Each embedding is stored in the cache as a new record together with its text, norm, source, and tags. The texts are processed in batches of batch_size (10 if batch_size is nil), and progress information is displayed in the console. The optional source string is associated with the added texts, and its basename, with any query string removed, is added as a tag alongside the given tags. Once all texts have been processed, the method returns the Documentrix::Documents instance itself, allowing for method chaining.

Examples:

documents.add(%w[ foo bar ], batch_size: 23, source: 'https://example.com', tags: %w[tag1 tag2])

Parameters:

  • texts

    an array of input texts

  • batch_size (defaults to: nil)

    the number of texts to process in one batch

  • source (defaults to: nil)

    the source URL for the added texts

  • tags (defaults to: [])

    an array of tags associated with the added texts

Returns:

  • self



# File 'lib/documentrix/documents.rb', line 100

def add(texts, batch_size: nil, source: nil, tags: [])
  texts = prepare_texts(texts) or return self
  tags = Documentrix::Utils::Tags.new(tags, source:)
  if source
    tags.add(File.basename(source).gsub(/\?.*/, ''), source:)
  end
  batches = texts.each_slice(batch_size || 10).
    with_infobar(
      label: "Add #{truncate(tags.to_s(link: false), percentage: 25)}",
      total: texts.size
    )
  batches.each do |batch|
    embeddings = fetch_embeddings(model:, options: @model_options, input: batch)
    batch.zip(embeddings) do |text, embedding|
      norm       = @cache.norm(embedding)
      self[text] = Record[text:, embedding:, norm:, source:, tags: tags.to_a]
    end
    infobar.progress by: batch.size
  end
  infobar.newline
  self
end
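
The batching and tagging steps of add can be tried without a model or a cache. The sketch below is an approximation under stated assumptions: fake_embeddings, source_tag, and add_sketch are hypothetical helpers, a plain Hash replaces the cache, a plain Array replaces Documentrix::Utils::Tags, and the norm is assumed to be the Euclidean length of the embedding.

```ruby
# Hypothetical stand-in for the model call: one small vector per text.
def fake_embeddings(batch)
  batch.map { |text| [text.size.to_f, 1.0] }
end

# Tag derived from the source as in the method above: the basename of the
# source with any query string removed.
def source_tag(source)
  File.basename(source).gsub(/\?.*/, '')
end

# Stores one record per text in a plain Hash and returns that Hash together
# with the sizes of the batches that were processed.
def add_sketch(texts, batch_size: nil, source: nil, tags: [])
  tags  = tags + (source ? [source_tag(source)] : [])
  store = {}
  sizes = []
  texts.each_slice(batch_size || 10) do |batch|
    embeddings = fake_embeddings(batch)
    batch.zip(embeddings) do |text, embedding|
      # Assumption: the cache's norm is the Euclidean length of the vector.
      norm = Math.sqrt(embedding.sum { |x| x * x })
      store[text] = { text: text, embedding: embedding, norm: norm,
                      source: source, tags: tags }
    end
    sizes << batch.size
  end
  [store, sizes]
end

store, sizes = add_sketch(
  %w[alpha beta gamma delta epsilon],
  batch_size: 2,
  source: 'https://example.com/guide/intro.html?lang=en',
  tags: %w[tag1]
)
p sizes                  # [2, 2, 1]
p store['alpha'][:tags]  # ["tag1", "intro.html"]
```

Five texts with a batch size of 2 yield three batches, and the source URL contributes the tag intro.html once its query string is stripped.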

#clear(tags: nil) ⇒ Documentrix::Documents

The clear method clears all texts from the cache or, if tags was given, only the ones tagged with them.

Parameters:

  • tags (defaults to: nil)

    the tag name to filter by

Returns:

  • self



# File 'lib/documentrix/documents.rb', line 176

def clear(tags: nil)
  @cache.clear(tags:)
  self
end

#collections ⇒ Array

The collections method returns an array of unique collection names.

Returns:

  • An array of unique collection names



# File 'lib/documentrix/documents.rb', line 228

def collections
  ([ default_collection ] + @cache.collections('%s-' % class_prefix)).uniq
end
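
The method above always lists the default collection first and removes duplicates. A small sketch of that combination follows; collections_sketch is a hypothetical helper, and its stored argument stands in for whatever names the cache reports.

```ruby
# Hypothetical helper: `stored` replaces the names returned by the cache.
def collections_sketch(stored, default: :default)
  # Array#uniq keeps the first occurrence, so the default stays in front.
  ([default] + stored).uniq
end

p collections_sketch(%i[notes default wiki])  # [:default, :notes, :wiki]
p collections_sketch([])                      # [:default]
```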

#default_collection ⇒ :default

The default_collection method returns the default collection name.

Returns:

  • The default collection name.



# File 'lib/documentrix/documents.rb', line 48

def default_collection
  :default
end

#delete(text) ⇒ FalseClass, TrueClass

The delete method removes the specified text from the cache by calling the delete method on the underlying cache object.

Parameters:

  • text

    the text for which to remove the value

Returns:

  • true if the text was removed, false otherwise.



# File 'lib/documentrix/documents.rb', line 158

def delete(text)
  @cache.delete(key(text))
end

#exist?(text) ⇒ FalseClass, TrueClass

The exist? method checks if the given text exists in the cache.

Parameters:

  • text

    the text to check for existence

Returns:

  • true if the text exists, false otherwise.



# File 'lib/documentrix/documents.rb', line 147

def exist?(text)
  @cache.key?(key(text))
end

#find(string, tags: nil, prompt: nil, max_records: nil) ⇒ Array<Documentrix::Documents::Record>

The find method searches the cache for texts similar to the given string. It converts the string to an embedding vector and asks the cache for the matching records.

Examples:

documents.find("foo")

Parameters:

  • string

    the input string

  • tags (defaults to: nil)

    an array of tags to filter results by (optional)

  • prompt (defaults to: nil)

    a prompt to use when searching for similar strings (optional)

  • max_records (defaults to: nil)

    the maximum number of records to return (optional)

Returns:

  • the records found for the given string



# File 'lib/documentrix/documents.rb', line 193

def find(string, tags: nil, prompt: nil, max_records: nil)
  needle = convert_to_vector(string, prompt:)
  @cache.find_records(needle, tags:, max_records: nil)
end
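
How the cache scores records is not shown on this page. The sketch below assumes cosine similarity, which fits the norm stored with each record, and assumes that a record matches when it carries at least one of the requested tags. SketchRecord, sketch_record, vector_norm, cosine_similarity, and find_sketch are hypothetical names, not Documentrix API.

```ruby
# Hypothetical record type for this sketch; the real Record class may differ.
SketchRecord = Struct.new(:text, :embedding, :norm, :tags)

def vector_norm(vector)
  Math.sqrt(vector.sum { |x| x * x })
end

# Builds a record and precomputes its norm, as add does when storing.
def sketch_record(text, embedding, tags)
  SketchRecord.new(text, embedding, vector_norm(embedding), tags)
end

# Cosine similarity: dot product divided by the product of both norms.
def cosine_similarity(a, b, a_norm: vector_norm(a), b_norm: vector_norm(b))
  a.zip(b).sum { |x, y| x * y } / (a_norm * b_norm)
end

# Assumption: records are ranked by descending similarity to the needle.
def find_sketch(records, needle, tags: nil, max_records: nil)
  needle_norm = vector_norm(needle)
  candidates  = tags ? records.select { |r| (r.tags & tags).any? } : records
  ranked = candidates.sort_by do |record|
    -cosine_similarity(needle, record.embedding,
                       a_norm: needle_norm, b_norm: record.norm)
  end
  max_records ? ranked.first(max_records) : ranked
end

records = [
  sketch_record('east',      [1.0, 0.0], %w[axis]),
  sketch_record('north',     [0.0, 1.0], %w[axis]),
  sketch_record('northeast', [1.0, 1.0], %w[diagonal])
]
p find_sketch(records, [1.0, 0.1]).map(&:text)
# ["east", "northeast", "north"]
```

Storing the norm with each record means only the needle's norm has to be computed per query.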

#find_where(string, text_size: nil, text_count: nil, **opts) ⇒ Array<Documentrix::Documents::Record>

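The find_where method limits the records returned by find by total text size and by count. It keeps the records in order and stops at the first one that would exceed either limit.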

Examples:

documents.find_where('foo', text_size: 3, text_count: 1)

Parameters:

  • string

    the search query

  • text_size (defaults to: nil)

    the maximum total size of the returned texts, summed over all records

  • text_count (defaults to: nil)

    the maximum number of texts to return

Returns:

  • the filtered records



# File 'lib/documentrix/documents.rb', line 208

def find_where(string, text_size: nil, text_count: nil, **opts)
  if text_count
    opts[:max_records] =  text_count
  end
  records = find(string, **opts)
  size, count = 0, 0
  records.take_while do |record|
    if text_size and (size += record.text.size) > text_size
      next false
    end
    if text_count and (count += 1) > text_count
      next false
    end
    true
  end
end
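
The size and count limits can be exercised on their own. This sketch copies the take_while loop from the method above into a hypothetical helper, limit_records, and uses a hypothetical SizedRecord struct in place of real records.

```ruby
# Hypothetical record type exposing only the text used by the limits.
SizedRecord = Struct.new(:text)

# Same loop as in find_where, applied to an already ordered record list.
def limit_records(records, text_size: nil, text_count: nil)
  size, count = 0, 0
  records.take_while do |record|
    # text_size is a running total across records, not a per-record cap.
    next false if text_size && (size += record.text.size) > text_size
    next false if text_count && (count += 1) > text_count
    true
  end
end

records = %w[foo barbaz qux].map { |text| SizedRecord.new(text) }

p limit_records(records, text_size: 9).map(&:text)   # ["foo", "barbaz"]
p limit_records(records, text_size: 5).map(&:text)   # ["foo"]
p limit_records(records, text_count: 1).map(&:text)  # ["foo"]
```

Because take_while stops at the first rejected record, a later record that would still fit the size budget is not returned. With text_size: 5 the result ends at foo even though qux alone is only three characters long.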

#size ⇒ Integer

The size method returns the number of texts stored in the cache of this Documentrix::Documents instance.

Returns:

  • The total count of cached texts.



# File 'lib/documentrix/documents.rb', line 166

def size
  @cache.size
end

#tags ⇒ Documentrix::Utils::Tags

The tags method returns the unique tags stored in the cache.

Returns:

  • the set of unique tags



# File 'lib/documentrix/documents.rb', line 235

def tags
  @cache.tags
end