Class: MartSearch::IndexBuilder

Inherits:
Object
Includes:
MartSearch, IndexBuilderUtils, Utils
Defined in:
lib/martsearch/index_builder.rb

Overview

This class is responsible for building and updating a Solr search index for use with a MartSearch application.

Author:

  • Darren Oakley

Constant Summary


Constants included from MartSearch

ENVIRONMENT

Instance Attribute Summary

Instance Method Summary

Methods included from IndexBuilderUtils

#extract_value_to_index, #index_concatenated_ontology_terms, #index_extracted_attributes, #index_grouped_attributes, #index_ontology_terms, #new_document, #open_daily_directory, #process_attribute_map, #setup_and_move_to_work_directory, #solr_document_xml

Methods included from Utils

#build_http_client, #convert_array_to_hash

Constructor Details

- (IndexBuilder) initialize



# File 'lib/martsearch/index_builder.rb', line 16

def initialize()
  ms_config           = MartSearch::Controller.instance().config
  @index_config       = ms_config[:index]
  @builder_config     = ms_config[:index_builder]
  @datasources_config = ms_config[:datasources]

  @builder_config[:number_of_docs_per_xml_file] = 1000

  @log                 = Logger.new(STDOUT)
  @log.level           = Logger::DEBUG
  @log.datetime_format = "%Y-%m-%d %H:%M:%S "

  # Create a document cache, plus helper lookup variables
  @file_based_cache      = false
  @document_cache        = {}
  @document_cache_keys   = {}
  @document_cache_lookup = {}

  # Set up an in-memory ontology cache - this will reduce the amount
  # of repetitive graph traversal and computation we need to do
  @ontology_cache = {}
end

Instance Attribute Details

- (Object) builder_config (readonly)

Returns the value of attribute builder_config



# File 'lib/martsearch/index_builder.rb', line 14

def builder_config
  @builder_config
end

- (Object) document_cache (readonly)

Returns the value of attribute document_cache



# File 'lib/martsearch/index_builder.rb', line 14

def document_cache
  @document_cache
end

- (Object) index_config (readonly)

Returns the value of attribute index_config



# File 'lib/martsearch/index_builder.rb', line 14

def index_config
  @index_config
end

- (Object) log (readonly)

Returns the value of attribute log



# File 'lib/martsearch/index_builder.rb', line 14

def log
  @log
end

Instance Method Details

- (Object) fetch_datasets

Function to control the dataset download process. Determines whether each dataset needs downloading (configured via the 'days_between_downlads' option), then downloads only those datasets that are due for a refresh.



# File 'lib/martsearch/index_builder.rb', line 42

def fetch_datasets
  @log.info "Starting dataset downloads..."

  pwd = Dir.pwd
  setup_and_move_to_work_directory()

  # First see which datasets we need to download (based on the age 
  # of the 'current' dump file).
  Dir.chdir('dataset_dowloads/current')
  datasets_to_download = []

  @builder_config[:datasets_to_index].each do |ds|
    ds_conf = @builder_config[:datasets][ds.to_sym]

    if File.exists?("#{ds}.marshal")
      file_timestamp   = File.new("#{ds}.marshal").mtime
      now_timestamp    = Time.now()
      file_age_in_days = ( ( ( (now_timestamp - file_timestamp).round / 60 ) / 60 ) / 24 )

      if file_age_in_days >= ds_conf[:indexing][:days_between_downlads]
        datasets_to_download.push(ds)
      end
    else
      datasets_to_download.push(ds)
    end
  end

  open_daily_directory( 'dataset_dowloads', false )
  Parallel.each( datasets_to_download, :in_threads => 10 ) do |ds|
  # datasets_to_download.each do |ds|
    # puts " - #{ds}: requesting data"
    @log.info " - #{ds}: requesting data"
    results = fetch_dataset( ds )
    # puts " - #{ds}: #{results[:data].size} rows of data returned"
    @log.info " - #{ds}: #{results[:data].size} rows of data returned"
  end

  @log.info "Dataset downloads completed."
  Dir.chdir(pwd)
end
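The staleness check above divides the file age in seconds down to whole days, using integer division at each step so partial days round down. A minimal sketch of that calculation (the helper name is hypothetical):

```ruby
# Mirrors the (secs / 60) / 60 / 24 chain in fetch_datasets:
# integer division at each step, so partial days are discarded.
def file_age_in_days(mtime, now = Time.now)
  (((now - mtime).round / 60) / 60) / 24
end
```

A file written 25 hours ago reports an age of 1 day, while one written 23 hours ago reports 0, so a dataset is re-downloaded only once a full 'days_between_downlads' period has elapsed.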

- (Object) process_datasets

Function to control the processing of the dataset downloads. Once processing is complete, it also saves the @document_cache to disk.



# File 'lib/martsearch/index_builder.rb', line 85

def process_datasets
  @log.info "Starting dataset processing..."

  pwd = Dir.pwd
  setup_and_move_to_work_directory()
  Dir.chdir('dataset_dowloads/current')

  @builder_config[:datasets_to_index].each do |ds|
    @log.info " - #{ds}: processing results"
    process_dataset(ds)
    clean_document_cache()
    @log.info " - #{ds}: processing results complete"
  end

  @log.info "Finished dataset processing."

  @log.info "Saving @document_cache to disk."
  save_document_cache()

  Dir.chdir(pwd)
end
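The dump files carry a `.marshal` extension, which suggests the cache is serialised with Ruby's built-in Marshal. A sketch of that round trip (the file path and cache contents here are illustrative only, not the builder's real data):

```ruby
require 'tmpdir'

# Illustrative cache contents -- the real @document_cache maps document
# keys to hashes of indexed attributes.
cache = { 'MGI:105369' => { 'marker_symbol' => ['Cbx1'] } }

path = File.join(Dir.tmpdir, 'document_cache.marshal')

# Dump in binary mode, then read it back the same way.
File.open(path, 'wb') { |f| f.write(Marshal.dump(cache)) }
restored = Marshal.load(File.binread(path))
```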

- (Object) save_solr_document_xmls

Function to build and store the XML files needed to update a Solr index, based on the @document_cache stored in the current instance.



# File 'lib/martsearch/index_builder.rb', line 109

def save_solr_document_xmls
  pwd = Dir.pwd
  open_daily_directory( 'solr_xml' )

  batch_size = @builder_config[:number_of_docs_per_xml_file]
  @log.info "Creating Solr XML files (#{batch_size} docs per file)..."

  open_stored_document_cache if @document_cache_keys.empty?
  doc_chunks      = @document_cache_keys.keys.chunk( batch_size )
  doc_chunks_size = doc_chunks.size - 1

  Parallel.each( (0..doc_chunks_size), :in_threads => 5 ) do |chunk_number|
    @log.info " - writing solr-xml-#{chunk_number+1}.xml"

    doc_names = doc_chunks[chunk_number]
    docs      = []
    doc_names.each do |name|
      docs.push( get_document( name ) )
    end

    file = File.open( "solr-xml-#{chunk_number+1}.xml", "w" )
    file.print solr_document_xml(docs)
    file.close
  end

  Dir.chdir(pwd)
end
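The `chunk( batch_size )` call above is not standard Ruby Array behaviour, so it presumably comes from a core extension bundled with MartSearch. Plain Ruby's `each_slice` produces the same batches:

```ruby
# 2,500 document names split into batches of 1,000 gives three chunks,
# the last holding the remaining 500 names.
doc_names  = (1..2_500).map { |i| "doc-#{i}" }
batch_size = 1_000
doc_chunks = doc_names.each_slice(batch_size).to_a
```

Each chunk then becomes one `solr-xml-N.xml` file, written in parallel as shown above.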

- (Object) send_xml_to_solr

Function to send all of the XML files to the Solr instance.



# File 'lib/martsearch/index_builder.rb', line 138

def send_xml_to_solr
  pwd = Dir.pwd
  open_daily_directory( 'solr_xml', false )

  client    = build_http_client()
  index_url = "#{@index_config[:builder_url]}/update"
  url       = URI.parse( index_url )

  client.start( url.host, url.port ) do |http|
    @log.info "Sending XML files to Solr (#{index_url})"
    Dir.glob("solr-xml-*.xml").each do |file|
      @log.info "  - #{file}"
      data = File.read( file )
      res  = http.post( url.path, data, { 'Content-type' => 'text/xml; charset=utf-8' } )

      if res.code.to_i != 200
        raise "Error uploading #{file} to index!\ncode: #{res.code}\nbody: #{res.body}"
      end
    end

    @log.info "  - commiting and optimising updates"
    ['<commit/>','<optimize/>'].each do |task|
      res = http.post( url.path, task, { 'Content-type' => 'text/xml; charset=utf-8' } )

      if res.code.to_i != 200
        raise "Error sending #{task} instruction to index!\ncode: #{res.code}\nbody: #{res.body}"
      end
    end
  end

  Dir.chdir(pwd)
end
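Solr's XML update protocol, as used above, amounts to plain HTTP POSTs of XML bodies to the `/update` endpoint: one per document batch, then `<commit/>` and `<optimize/>`. A sketch of the endpoint and request construction (the base URL is a placeholder; in IndexBuilder it comes from @index_config[:builder_url], and the helper names here are hypothetical):

```ruby
require 'uri'
require 'net/http'

# Build the update endpoint the same way send_xml_to_solr does.
def solr_update_uri(base_url)
  URI.parse("#{base_url}/update")
end

# Each upload -- document batch, commit, optimize -- is a POST of XML
# with the same content type; anything but a 200 is treated as fatal.
def post_to_solr(http, uri, xml)
  res = http.post(uri.path, xml, 'Content-type' => 'text/xml; charset=utf-8')
  raise "Solr returned #{res.code}: #{res.body}" unless res.code.to_i == 200
  res
end

uri = solr_update_uri('http://localhost:8983/solr')
```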