Class: Picky::Indexers::Parallel

Inherits:
Base
Defined in:
lib/picky/indexers/parallel.rb

Overview

Uses a number of categories, a source, and a tokenizer to index data.

The tokenizer is taken from each category if specified, or from the index if not.

Instance Attribute Summary

Attributes inherited from Base

#index_or_category

Instance Method Summary

Methods inherited from Base

#check, #initialize, #notify_finished, #prepare, #reset

Constructor Details

This class inherits a constructor from Picky::Indexers::Base

Instance Method Details

#flush(file, cache) ⇒ Object



# File 'lib/picky/indexers/parallel.rb', line 78

def flush file, cache
  file.write(cache.join) && cache.clear
end
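Note that `IO#write` returns the number of bytes written, which is always truthy in Ruby (even `0`), so the `&&` effectively just sequences the write and the clear. A small self-contained demonstration using StringIO in place of the prepared index file:

```ruby
require 'stringio'

# Same body as the documented #flush above.
def flush file, cache
  file.write(cache.join) && cache.clear
end

file  = StringIO.new
cache = ["1,hello\n", "2,world\n"]

flush file, cache

file.string  # => "1,hello\n2,world\n" -- cache contents written out
cache        # => []                   -- and the cache emptied
```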

#index_flush(objects, file, category, cache, tokenizer) ⇒ Object



# File 'lib/picky/indexers/parallel.rb', line 58

def index_flush objects, file, category, cache, tokenizer
  comma   = ?,
  newline = ?\n

  # Optimized, therefore duplicate code.
  #
  id = category.id
  from = category.from
  objects.each do |object|
    tokens = object.send from
    tokens, _ = tokenizer.tokenize tokens if tokenizer # Note: Originals not needed. TODO Optimize?
    tokens.each do |token_text|
      next unless token_text
      cache << object.send(id) << comma << token_text << newline
    end
  end

  flush file, cache
end
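The cache accumulates `id,token\n` lines for the prepared index file. The following sketch reproduces that inner loop with hypothetical stand-ins for the category and tokenizer (the `Person`, `Category`, and `tokenize` definitions here are illustrative assumptions, not part of the Picky API):

```ruby
# Minimal sketch of the prepared-index format built by #index_flush.
Person   = Struct.new(:id, :name)
Category = Struct.new(:id, :from)   # assumption: only #id and #from matter here

category  = Category.new(:id, :name)
tokenizer = Object.new
def tokenizer.tokenize text
  [text.downcase.split, nil]        # [tokens, originals] -- originals unused
end

cache   = []
objects = [Person.new(1, 'Anna Blue'), Person.new(2, 'Bob Green')]

objects.each do |object|
  tokens, _ = tokenizer.tokenize object.send(category.from)
  tokens.each do |token_text|
    # ?, and ?\n are Ruby character literals for "," and "\n".
    cache << object.send(category.id) << ?, << token_text << ?\n
  end
end

cache.join  # => "1,anna\n1,blue\n2,bob\n2,green\n"
```

Each object thus contributes one line per token, keyed by its id; #flush then writes the joined cache to the category's prepared index file.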

#process(source_for_prepare, categories, scheduler = Scheduler.new) ⇒ Object

Process does the actual indexing.

Parameters:

* categories: An Enumerable of Category instances.


# File 'lib/picky/indexers/parallel.rb', line 18

def process source_for_prepare, categories, scheduler = Scheduler.new
  # Prepare a combined object - array.
  #
  combined = categories.map do |category|
    [category, category.prepared_index_file, [], category.tokenizer]
  end

  # Go through each object in the source.
  #
  objects = []

  reset source_for_prepare

  source_for_prepare.each do |object|

    # Accumulate objects.
    #
    objects << object
    next if objects.size < 10_000

    # THINK Is it a good idea that the tokenizer does not
    # control when it gets the next text?
    #
    combined.each do |category, file, cache, tokenizer|
      index_flush objects, file, category, cache, tokenizer
    end

    objects.clear

  end

  # Close all files.
  #
  combined.each do |category, file, cache, tokenizer|
    index_flush objects, file, category, cache, tokenizer
    yield file
    file.close
  end
end
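The accumulate-and-flush pattern above can be sketched in isolation. This is an illustrative reduction, not Picky code: the batch size is shrunk from the real 10_000 to 3, and `flushed << objects.dup` stands in for calling index_flush on each category:

```ruby
# Illustrative sketch of #process's batching: accumulate objects,
# flush a full batch, then flush the final partial batch at close time.
BATCH_SIZE = 3   # the real code uses 10_000

flushed = []
objects = []

(1..7).each do |object|
  objects << object
  next if objects.size < BATCH_SIZE
  flushed << objects.dup   # stands in for index_flush per category
  objects.clear
end

# In #process, the remaining objects are flushed in the loop that
# closes the prepared index files.
flushed << objects.dup

flushed  # => [[1, 2, 3], [4, 5, 6], [7]]
```

Batching keeps memory bounded for large sources while still writing the prepared index files in few, large writes.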