Class: BrowserWebData::EntitySumarization::Statistic

Inherits:
Object
Includes:
BrowserWebData::EntitySumarizationConfig
Defined in:
lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb

Overview

The Statistic class allows finding, collecting, and generating knowledge for entity summarization. Entity summarization is based on datasets in the NLP Interchange Format (NIF), for example the datasets from wiki.dbpedia.org/nif-abstract-datasets. Knowledge is generated from information in DBpedia.

Constant Summary

Constants included from BrowserWebData::EntitySumarizationConfig

BrowserWebData::EntitySumarizationConfig::COMMON_PROPERTIES, BrowserWebData::EntitySumarizationConfig::IDENTICAL_PROPERTY_LIMIT, BrowserWebData::EntitySumarizationConfig::IMPORTANCE_TO_IDENTIFY_MAX_COUNT, BrowserWebData::EntitySumarizationConfig::NO_SENSE_PROPERTIES, BrowserWebData::EntitySumarizationConfig::SCAN_REGEXP

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(nif_dataset_path = nil, results_dir_path = nil, console_output = false) ⇒ Statistic

Create new instance of Statistic class.

Parameters:

  • nif_dataset_path (String) (defaults to: nil)

    Optional param. Default value is nil.

  • results_dir_path (String) (defaults to: nil)

    Optional param. Default value is Temp/BROWSER_WEB_DATA/results.

  • console_output (TrueClass, FalseClass) (defaults to: false)

    Allows printing progress info to the console. Defaults to false.



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 27

def initialize(nif_dataset_path = nil, results_dir_path = nil, console_output = false)
  nif_dataset_path = nif_dataset_path.gsub('\\', '/') if nif_dataset_path
  results_dir_path = results_dir_path.gsub('\\', '/').chomp('/') if results_dir_path

  unless results_dir_path && Dir.exist?(results_dir_path)
    cache_dir_path = "#{Dir.tmpdir}/#{BrowserWebData::TMP_DIR}"
    Dir.mkdir(cache_dir_path) unless Dir.exist?(cache_dir_path)
    results_dir_path = "#{cache_dir_path}/results"
    Dir.mkdir(results_dir_path) unless Dir.exist?(results_dir_path)
  end

  @nif_file_path = nif_dataset_path
  @results_dir_path = results_dir_path
  @console_output = console_output

  @query = SPARQLRequest.new
  @predicates_similarity = PredicatesSimilarity.new(@results_dir_path, IDENTICAL_PROPERTY_LIMIT, console_output)
end
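The path normalization and temp-directory fallback in the constructor can be illustrated standalone. This is a sketch, not the library's code; `resolve_results_dir` is a hypothetical helper and the `BROWSER_WEB_DATA` directory name is assumed from the `BrowserWebData::TMP_DIR` constant:

```ruby
require 'tmpdir'
require 'fileutils'

# Normalize a Windows-style path and fall back to a results directory
# under the system temp dir when the given one does not exist.
def resolve_results_dir(results_dir_path = nil)
  results_dir_path = results_dir_path.gsub('\\', '/').chomp('/') if results_dir_path

  unless results_dir_path && Dir.exist?(results_dir_path)
    cache_dir_path = "#{Dir.tmpdir}/BROWSER_WEB_DATA" # assumed value of TMP_DIR
    results_dir_path = "#{cache_dir_path}/results"
    FileUtils.mkdir_p(results_dir_path)
  end

  results_dir_path
end

resolve_results_dir('C:\\data\\results\\') # falls back to the temp dir unless C:/data/results exists
```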

Instance Attribute Details

#nif_file_path ⇒ Object (readonly)

Returns the value of attribute nif_file_path.



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 19

def nif_file_path
  @nif_file_path
end

#results_dir_path ⇒ Object (readonly)

Returns the value of attribute results_dir_path.



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 19

def results_dir_path
  @results_dir_path
end

Instance Method Details

#create_complete_knowledge_base(params) ⇒ Object

The method runs the complete pipeline: it finds resource links in the given NIF dataset, generates literal statistics, and builds a knowledge base for each entity type.

Parameters:

  • params (Hash)

Options Hash (params):

  • :entity_types (Array<String>, String)
  • :entity_count (Fixnum)

    Best ranked resources by every entity type.

  • :demand_reload (TrueClass, FalseClass)
  • :identify_identical_predicates (TrueClass, FalseClass)


# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 54

def create_complete_knowledge_base(params)
  params[:entity_types] = [params[:entity_types]] unless params[:entity_types].is_a?(Array)

  generate_statistics_from_nif(params[:entity_types], params[:entity_count], params[:demand_reload])

  params[:entity_types].each { |type|
    generate_literal_statistics(type)
  }

  params[:entity_types].each { |type|
    generate_knowledge_base_for_entity(type, params[:identify_identical_predicates])
  }
end

#generate_knowledge_base_for_entity(type, identify_identical = true) ⇒ Object

The method processes all generated result files from the NIF dataset (for one entity class type) into a single knowledge base file.

Parameters:

  • type (String)

  • identify_identical (TrueClass, FalseClass) (defaults to: true)

# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 300

def generate_knowledge_base_for_entity(type, identify_identical = true)
  puts "_____ #{type} _____" if @console_output
  files = Dir.glob("#{@results_dir_path}/#{type}/*.json")
  type = type.to_s.to_sym

  knowledge_data = {type => []}

  global_properties = get_global_statistic_by_type(type) || {}

  if identify_identical
    try_this_identical = {}

    files.each { |file_path|
      file_data = JSON.parse(File.read(file_path).force_encoding('utf-8'), symbolize_names: true)
      file_data[:nif_data].each { |data|
        try_this_identical.merge!(data[:properties][type]) { |_, x, y| x + y }
      }
    }

    try_this_identical.merge!(global_properties) { |_, x, y| x + y }

    if try_this_identical.size > 0
      try_this_identical = Hash[try_this_identical.sort_by { |_, v| v }.reverse]
      puts "- prepare to identify identical: total count #{try_this_identical.size}" if @console_output
      @predicates_similarity.identify_identical_predicates(try_this_identical.keys)
    end
  end

  puts "- calculate: files count #{files.size}" if @console_output
  files.each { |file_path|
    file_data = JSON.parse(File.read(file_path).force_encoding('utf-8'), symbolize_names: true)

    file_data[:nif_data].each { |found|

      properties = found[:properties][type.to_sym]
      strict_properties = (found[:strict_properties] || {})[type] || {}
      weight = found[:weight]

      strict_properties.each { |property, count|
        property = property.to_s
        value = count.to_i * weight

        prepare_property_to_knowledge(property, knowledge_data[type]) { |from_knowledge|
          old_score = from_knowledge[:score] * from_knowledge[:counter]
          from_knowledge[:counter] += 1
          (old_score + value) / from_knowledge[:counter]
        }
      }

      properties.each { |property, count|
        property = property.to_s
        value = count.to_i * weight

        prepare_property_to_knowledge(property, knowledge_data[type]) { |from_knowledge|
          old_score = from_knowledge[:score] * from_knowledge[:counter]
          from_knowledge[:counter] += 1
          (old_score + value) / from_knowledge[:counter]
        }
      }
    }

    unless knowledge_data[type].empty?
      max_weight = knowledge_data[type].max_by { |data| data[:score] }[:score]
      knowledge_data[type] = knowledge_data[type].map { |hash|
        hash[:score] = (hash[:score] / max_weight).round(4)
        hash
      }
    end
  }


  if global_properties.size > 0
    max_count = global_properties.max_by { |_, count| count }[1].to_f
    global_properties.each { |property, count|

      value = count / max_count

      prepare_property_to_knowledge(property, knowledge_data[type]) { |from_knowledge|
        from_knowledge[:score] > 0 ? ((from_knowledge[:score] + value) / 2.0).round(4) : value.round(4)
      }
    }
  end

  knowledge_data[type].map! { |hash|
    hash.delete(:counter)
    hash
  }

  knowledge_data[type] = knowledge_data[type].keep_if { |hash|
    hash[:score] > 0
  }.sort_by { |hash|
    hash[:score]
  }.reverse

  if identify_identical
    @predicates_similarity.reduce_identical
  end

  update_knowledge_base(knowledge_data)
end
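Two small patterns carry most of the work above: occurrence counts from multiple files are summed with `Hash#merge!` and a conflict block, and each property's score is maintained as a running average over the number of contributions. A minimal sketch with hypothetical data:

```ruby
# Summing occurrence counts across result files (merge! with a block).
totals = {}
[{ 'dbo:birthPlace' => 2, 'dbo:team' => 1 },
 { 'dbo:birthPlace' => 3 }].each do |counts|
  totals.merge!(counts) { |_key, old, new| old + new }
end
# totals => { 'dbo:birthPlace' => 5, 'dbo:team' => 1 }

# Running average, as in prepare_property_to_knowledge's block:
# new_mean = (old_mean * n + value) / (n + 1)
entry = { score: 0.4, counter: 2 }
value = 1.0
old_score = entry[:score] * entry[:counter]
entry[:counter] += 1
entry[:score] = (old_score + value) / entry[:counter]
# entry[:score] => 0.6 (approximately, floating point)
```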

#generate_literal_statistics(type = nil, count = 10000) ⇒ Object

The method generates simple statistics containing all predicates that link to literals. Predicates are grouped by entity class type, together with their total occurrence counts. Predicates are collected from the best ranked resources.

Parameters:

  • type (String, Array<String>, nil) (defaults to: nil)

  • count (Fixnum) (defaults to: 10000)

    Count of best ranked resources per entity type.

# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 141

def generate_literal_statistics(type = nil, count = 10000)
  unless type
    type = get_all_classes
  end

  type = [type] unless type.is_a?(Array)

  type.each_with_index { |entity_type, index|
    all_properties = {}
    puts "#{__method__} - start process entity type: #{entity_type} [#{(index / type.size.to_f).round(2)}]" if @console_output
    entity_type = entity_type.to_s.to_sym

    get_best_ranked_resources(entity_type, count).each { |resource, _|
      properties = @query.get_all_predicates_by_subject(resource.to_s, true).map { |solution_prop|
        solution_prop[:property].to_s
      } || []

      properties.uniq.each { |prop|
        next if Predicate.unimportant?(prop)
        all_properties[entity_type] ||= {}
        all_properties[entity_type][prop] ||= 0
        all_properties[entity_type][prop] += 1
      }

    }

    update_global_statistic(all_properties)
  }
end
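The accumulation above counts, per entity type, in how many resources each predicate occurs; `uniq` ensures a resource contributes at most once per predicate. The same idea in isolation, with hypothetical data:

```ruby
# Predicates returned for each best-ranked resource (stand-in data).
resources_predicates = {
  'http://dbpedia.org/resource/A' => ['dbo:name', 'dbo:name', 'dbo:birthDate'],
  'http://dbpedia.org/resource/B' => ['dbo:name']
}

all_properties = Hash.new(0)
resources_predicates.each_value do |predicates|
  predicates.uniq.each { |prop| all_properties[prop] += 1 }
end
# all_properties => { 'dbo:name' => 2, 'dbo:birthDate' => 1 }
```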

#generate_result_file(resource_uri, type, result_relations, this_time) ⇒ Object

The method helps to store found information from the NIF dataset for the given resource.

Parameters:

  • resource_uri (String)
  • type (String)
  • result_relations (Hash)

    Hash generated by method #find_relations

  • this_time (Float)

    Time spent finding the resource in the NIF dataset.

Options Hash (result_relations):

  • :sections (Hash)

    Maps 'section_type' keys to 'position' values

  • :relations (Array<Hash>)

    Hashes generated by method #get_predicates_by_link



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 261

def generate_result_file(resource_uri, type, result_relations, this_time)
  section_degradation = result_relations[:sections].map { |section_type, position|
    index = result_relations[:sections].keys.index(section_type)

    # recognize value of degradation by relative position paragraphs in document

    position[:degradation] = 1 - ((index / result_relations[:sections].size.to_f) / 10.0)

    {section_type => position}
  }.reduce(:merge)

  total_size = section_degradation.max_by { |_, v| v[:to] }[1][:to].to_f

  result_nif_data = result_relations[:relations].map { |relation|
    paragraph_position = section_degradation[relation[:section]]

    # weight is lowest by relative distance from document start

    position_weight = (1 - ((relation[:indexes][0].to_i) / total_size))
    # weight is also degraded by index of paragraph

    relation[:weight] = (position_weight * paragraph_position[:degradation]).round(4)

    relation
  }

  result = {
    process_time: {nif_find: this_time, relations_find: result_relations[:time]},
    resource_uri: resource_uri,
    nif_data: result_nif_data
  }

  result_path = get_resource_file_path(resource_uri, type)
  File.open(result_path, 'w:utf-8') { |f| f << JSON.pretty_generate(result) }
end
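The weighting arithmetic can be traced by hand. A sketch with two hypothetical sections (the degradation formula assumes the float division shown here):

```ruby
sections = { abstract: { to: 500 }, career: { to: 2000 } }

# Each later section is degraded slightly by its index.
degradation = sections.each_with_index.map { |(name, pos), index|
  d = 1 - ((index / sections.size.to_f) / 10.0)
  [name, pos.merge(degradation: d)]
}.to_h
# :abstract => 1.0, :career => 0.95

total_size = degradation.values.map { |v| v[:to] }.max.to_f

# A link starting at character 500, inside the :career section:
position_weight = 1 - (500 / total_size)  # closer to the start => higher
weight = (position_weight * degradation[:career][:degradation]).round(4)
```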

#generate_statistics_from_nif(entity_types, count = 10, demand_reload = false) ⇒ Object

The method finds links in the given NIF dataset, then collects relations via #find_relations. For each resource it generates a file in @results_dir_path.

Parameters:

  • entity_types (Array<String>, String)
  • count (Fixnum) (defaults to: 10)

    Count of best ranked resources

  • demand_reload (TrueClass, FalseClass) (defaults to: false)


# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 75

def generate_statistics_from_nif(entity_types, count = 10, demand_reload = false)
  unless @nif_file_path
    raise RuntimeError.new('Instance has no nif_dataset_path defined. Cannot start generating from the NIF dataset. Please create a new instance with a dataset path.')
  end

  resources = get_best_ranked_resources(entity_types, count)
  resources = keep_unloaded(resources) unless demand_reload

  actual_resource_data = []
  lines_group = []

  begin
    time_start = Time.now
    nif_file = File.open(@nif_file_path, 'r')
    line = nif_file.readline

    until nif_file.eof?
      line = nif_file.readline

      if lines_group.size == 7
        # evaluate group (7 lines)

        this_resource_uri = NIFLineParser.parse_resource_uri(lines_group[0])

        if resources.keys.include?(this_resource_uri)
          # process group, is requested

          resource_uri = this_resource_uri
          actual_resource_data << NIFLineParser.parse_line_group(lines_group)

        elsif !actual_resource_data.empty?
          # resource changed, process actual_resource_data

          resource_hash = resources.delete(resource_uri)
          type = resource_hash[:type]

          this_time = (Time.now - time_start).round(2)
          puts "\n#{resource_uri}\n- nif found in #{this_time}\n- resources to find #{resources.size}" if @console_output

          result_relations = find_relations(resource_uri, actual_resource_data, type)
          generate_result_file(resource_uri, type, result_relations, this_time)

          actual_resource_data = []
          time_start = Time.now
        end

        # start new group

        lines_group = [line]
      else

        # join line to group

        lines_group << line
      end

      break if resources.empty?
    end

  ensure
    nif_file.close if nif_file && !nif_file.closed?
  end
end
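The reading loop above skips the first line, then gathers fixed-size groups of 7 lines, flushing a group only when the first line of the next one arrives. The same mechanics with a StringIO stand-in (the line contents are not real NIF triples):

```ruby
require 'stringio'

nif = StringIO.new((1..15).map { |i| "line#{i}\n" }.join)

groups = []
lines_group = []
nif.readline # skip the first line, as the original loop does

until nif.eof?
  line = nif.readline
  if lines_group.size == 7
    groups << lines_group # group complete: hand it off for evaluation
    lines_group = [line]  # start the next group with the current line
  else
    lines_group << line
  end
end
# groups holds one full group (lines 2..8); lines 9..15 remain pending,
# which is why the original flushes pending data when the resource changes.
```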

#get_all_classes(path = File.join(__dir__, '../knowledge/classes_hierarchy.json')) ⇒ Array<String>

The method loads all entity class types defined at mappings.dbpedia.org/server/ontology/classes/

Parameters:

  • path (String) (defaults to: File.join(__dir__, '../knowledge/classes_hierarchy.json'))

Returns:

  • (Array<String>)

    classes



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 407

def get_all_classes(path = File.join(__dir__, '../knowledge/classes_hierarchy.json'))
  data = ensure_load_json(path, {})
  HashHelper.recursive_map_keys(data)
end
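`HashHelper.recursive_map_keys` is assumed to flatten the nested class hierarchy into a list of class names; its actual implementation is not shown here. A sketch of that behavior:

```ruby
# Collect all keys of a nested hash, depth-first (hypothetical helper,
# illustrating the assumed behavior of HashHelper.recursive_map_keys).
def recursive_keys(hash)
  hash.flat_map { |key, value|
    [key] + (value.is_a?(Hash) ? recursive_keys(value) : [])
  }
end

hierarchy = { 'Agent' => { 'Person' => { 'Athlete' => {} }, 'Organisation' => {} } }
recursive_keys(hierarchy)
# => ["Agent", "Person", "Athlete", "Organisation"]
```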

#get_best_ranked_resources(entity_types, count = 10) ⇒ Hash

The method returns a list of the best ranked resources for the required entity types.

Parameters:

  • entity_types (Array<String>, String)

  • count (Fixnum) (defaults to: 10)
Returns:

  • (Hash)

    resources



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 178

def get_best_ranked_resources(entity_types, count = 10)
  resources = {}
  entity_types = [entity_types] unless entity_types.is_a?(Array)

  entity_types.each { |type|
    top_ranked_entities = @query.get_resources_by_dbpedia_page_rank(type, count)

    top_ranked_entities.each { |solution|
      resources[solution.entity.value] = {type: type, rank: solution.rank.value.to_f}
    }
  }

  resources
end

#get_predicates_by_link(resource_uri, link, type) ⇒ Hash

The method finds predicates for the given link. Strict predicates are those in the relation: <resource> ?predicate <link> . General predicates are those in: ?subject a <type> . ?subject ?predicate <link>

Parameters:

  • resource_uri (String)

    Resource for which strict properties will be found

  • link (String)

    Link that has some importance to resource or entity type.

  • type (String)

Returns:

  • (Hash)

    result



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 223

def get_predicates_by_link(resource_uri, link, type)
  properties = {type => {}}
  strict_properties = {type => {}}

  @query.get_all_predicates_by_subject_object(resource_uri, link).each { |solution|
    predicate = solution.to_h
    property = predicate[:property].to_s.force_encoding('utf-8')

    next if Predicate.unimportant?(property)

    count = @query.get_count_predicate_by_entity(type, property)[0].to_h[:count].to_f
    strict_properties[type][property] = count if count > 0
  }

  @query.get_all_predicates_by_object(link).each { |solution|
    predicate = solution.to_h
    property = predicate[:property].to_s.force_encoding('utf-8')

    next if Predicate.unimportant?(property) || strict_properties[type][property]

    count = @query.get_count_predicate_by_entity(type, property)[0].to_h[:count].to_f
    properties[type][property] = count if count > 0
  }


  {properties: properties, strict_properties: strict_properties}
end

#refresh_statistics_in_files(entity_types, count = 10) ⇒ Object

The method helps to recollect relations for already generated result files.

Parameters:

  • entity_types (Array<String>, String)

  • count (Fixnum) (defaults to: 10)

# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 198

def refresh_statistics_in_files(entity_types, count = 10)
  resources = get_best_ranked_resources(entity_types, count)

  resources = keep_loaded(resources)

  resources.each { |resource_uri, resource_info|
    puts "_____ #{resource_uri} _____" if @console_output

    update_nif_file_properties(resource_uri, resource_info[:type]) { |link|
      get_predicates_by_link(resource_uri, link, resource_info[:type])
    }
  }

end