Class: BrowserWebData::EntitySumarization::Statistic

Inherits:
Object
Includes:
BrowserWebData::EntitySumarizationConfig
Defined in:
lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb

Overview

The Statistic class allows finding, collecting, and generating knowledge for entity summarization. Entity summarization is based on datasets in the NLP Interchange Format (NIF), for example the datasets from wiki.dbpedia.org/nif-abstract-datasets. Knowledge is generated from information in DBpedia.

Constant Summary

Constants included from BrowserWebData::EntitySumarizationConfig

BrowserWebData::EntitySumarizationConfig::COMMON_PROPERTIES, BrowserWebData::EntitySumarizationConfig::IDENTICAL_PROPERTY_LIMIT, BrowserWebData::EntitySumarizationConfig::IMPORTANCE_TO_IDENTIFY_MAX_COUNT, BrowserWebData::EntitySumarizationConfig::NO_SENSE_PROPERTIES, BrowserWebData::EntitySumarizationConfig::SCAN_REGEXP

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(nif_dataset_path = nil, results_dir_path = nil, console_output = false) ⇒ Statistic

Create new instance of Statistic class.

Parameters:

  • nif_dataset_path (String) (defaults to: nil)

    Optional param. Default value is nil.

  • results_dir_path (String) (defaults to: nil)

    Optional param. Default value is Temp/BROWSER_WEB_DATA/results.

  • console_output (TrueClass, FalseClass) (defaults to: false)

    Allows printing progress info to the console. Defaults to false.



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 27

def initialize(nif_dataset_path = nil, results_dir_path = nil, console_output = false)
  nif_dataset_path = nif_dataset_path.gsub('\\', '/') if nif_dataset_path
  results_dir_path = results_dir_path.gsub('\\', '/').chomp('/') if results_dir_path

  unless results_dir_path && Dir.exist?(results_dir_path)
    cache_dir_path = "#{Dir.tmpdir}/#{BrowserWebData::TMP_DIR}"
    Dir.mkdir(cache_dir_path) unless Dir.exist?(cache_dir_path)
    results_dir_path = "#{cache_dir_path}/results"
    Dir.mkdir(results_dir_path) unless Dir.exist?(results_dir_path)
  end

  @nif_file_path = nif_dataset_path
  @results_dir_path = results_dir_path
  @console_output = console_output

  @query = SPARQLRequest.new
  @predicates_similarity = PredicatesSimilarity.new(@results_dir_path, IDENTICAL_PROPERTY_LIMIT, console_output)
end
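The path normalization and temp-directory fallback in the constructor can be illustrated standalone. This is a sketch, not the library's code; `resolve_results_dir` is a hypothetical helper and the `BROWSER_WEB_DATA` directory name is assumed from the `BrowserWebData::TMP_DIR` constant:

```ruby
require 'tmpdir'
require 'fileutils'

# Normalize a Windows-style path and fall back to a results directory
# under the system temp dir when the given one does not exist.
def resolve_results_dir(results_dir_path = nil)
  results_dir_path = results_dir_path.gsub('\\', '/').chomp('/') if results_dir_path

  unless results_dir_path && Dir.exist?(results_dir_path)
    cache_dir_path = "#{Dir.tmpdir}/BROWSER_WEB_DATA" # assumed value of TMP_DIR
    results_dir_path = "#{cache_dir_path}/results"
    FileUtils.mkdir_p(results_dir_path)
  end

  results_dir_path
end

resolve_results_dir('C:\\data\\results\\') # falls back to the temp dir unless C:/data/results exists
```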

Instance Attribute Details

#nif_file_path ⇒ Object (readonly)

Returns the value of attribute nif_file_path.



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 19

def nif_file_path
  @nif_file_path
end

#results_dir_path ⇒ Object (readonly)

Returns the value of attribute results_dir_path.



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 19

def results_dir_path
  @results_dir_path
end

Instance Method Details

#create_complete_knowledge_base(params) ⇒ Object

The method runs the complete pipeline: it finds resource links in the given NIF dataset, generates literal statistics, and builds a knowledge base for each entity type.

Parameters:

  • params (Hash)

Options Hash (params):

  • :entity_types (Array<String>, String)
  • :entity_count (Fixnum)

    Best ranked resources by every entity type.

  • :demand_reload (TrueClass, FalseClass)
  • :identify_identical_predicates (TrueClass, FalseClass)


# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 54

def create_complete_knowledge_base(params)
  params[:entity_types] = [params[:entity_types]] unless params[:entity_types].is_a?(Array)

  generate_statistics_from_nif(params[:entity_types], params[:entity_count], params[:demand_reload])

  params[:entity_types].each { |type|
    generate_literal_statistics(type)
  }

  params[:entity_types].each { |type|
    generate_knowledge_base_for_entity(type, params[:identify_identical_predicates])
  }
end

#generate_knowledge_base_for_entity(type, identify_identical = true) ⇒ Object

The method processes all generated result files from the NIF dataset (for one entity class type) into a single knowledge base file.

Parameters:

  • type (String)

  • identify_identical (TrueClass, FalseClass) (defaults to: true)

# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 300

def generate_knowledge_base_for_entity(type, identify_identical = true)
  puts "_____ #{type} _____" if @console_output
  files = Dir.glob("#{@results_dir_path}/#{type}/*.json")
  type = type.to_s.to_sym

  knowledge_data = {type => []}

  global_properties = get_global_statistic_by_type(type) || {}

  if identify_identical
    try_this_identical = {}

    files.each { |file_path|
      file_data = JSON.parse(File.read(file_path).force_encoding('utf-8'), symbolize_names: true)
      file_data[:nif_data].each { |data|
        try_this_identical.merge!(data[:properties][type]) { |_, x, y| x + y }
      }
    }

    try_this_identical.merge!(global_properties) { |_, x, y| x + y }

    if try_this_identical.size > 0
      try_this_identical = Hash[try_this_identical.sort_by { |_, v| v }.reverse]
      puts "- prepare to identify identical: total count #{try_this_identical.size}" if @console_output
      @predicates_similarity.identify_identical_predicates(try_this_identical.keys)
    end
  end

  puts "- calculate: files count #{files.size}" if @console_output
  files.each { |file_path|
    file_data = JSON.parse(File.read(file_path).force_encoding('utf-8'), symbolize_names: true)

    file_data[:nif_data].each { |found|

      properties = found[:properties][type.to_sym]
      strict_properties = (found[:strict_properties] || {})[type] || {}
      weight = found[:weight]

      strict_properties.each { |property, count|
        property = property.to_s
        value = count.to_i * weight

        prepare_property_to_knowledge(property, knowledge_data[type]) { |from_knowledge|
          old_score = from_knowledge[:score] * from_knowledge[:counter]
          from_knowledge[:counter] += 1
          (old_score + value) / from_knowledge[:counter]
        }
      }

      properties.each { |property, count|
        property = property.to_s
        value = count.to_i * weight

        prepare_property_to_knowledge(property, knowledge_data[type]) { |from_knowledge|
          old_score = from_knowledge[:score] * from_knowledge[:counter]
          from_knowledge[:counter] += 1
          (old_score + value) / from_knowledge[:counter]
        }
      }
    }

    unless knowledge_data[type].empty?
      max_weight = knowledge_data[type].max_by { |data| data[:score] }[:score]
      knowledge_data[type] = knowledge_data[type].map { |hash|
        hash[:score] = (hash[:score] / max_weight).round(4)
        hash
      }
    end
  }


  if global_properties.size > 0
    max_count = global_properties.max_by { |_, count| count }[1].to_f
    global_properties.each { |property, count|

      value = count / max_count

      prepare_property_to_knowledge(property, knowledge_data[type]) { |from_knowledge|
        from_knowledge[:score] > 0 ? ((from_knowledge[:score] + value) / 2.0).round(4) : value.round(4)
      }
    }
  end

  knowledge_data[type].map! { |hash|
    hash.delete(:counter)
    hash
  }

  knowledge_data[type] = knowledge_data[type].keep_if { |hash|
    hash[:score] > 0
  }.sort_by { |hash|
    hash[:score]
  }.reverse

  if identify_identical
    @predicates_similarity.reduce_identical
  end

  update_knowledge_base(knowledge_data)
end
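Two small patterns carry most of the work above: occurrence counts from multiple files are summed with `Hash#merge!` and a conflict block, and each property's score is maintained as a running average over the number of contributions. A minimal sketch with hypothetical data:

```ruby
# Summing occurrence counts across result files (merge! with a block).
totals = {}
[{ 'dbo:birthPlace' => 2, 'dbo:team' => 1 },
 { 'dbo:birthPlace' => 3 }].each do |counts|
  totals.merge!(counts) { |_key, old, new| old + new }
end
# totals => { 'dbo:birthPlace' => 5, 'dbo:team' => 1 }

# Running average, as in prepare_property_to_knowledge's block:
# new_mean = (old_mean * n + value) / (n + 1)
entry = { score: 0.4, counter: 2 }
value = 1.0
old_score = entry[:score] * entry[:counter]
entry[:counter] += 1
entry[:score] = (old_score + value) / entry[:counter]
# entry[:score] => 0.6 (approximately, floating point)
```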

#generate_literal_statistics(type = nil, count = 10000) ⇒ Object

The method generates simple statistics containing all predicates that link to literals. Predicates are grouped by entity class type, together with their total occurrence counts. Predicates are collected from the best ranked resources.

Parameters:

  • type (String, Array<String>, nil) (defaults to: nil)

  • count (Fixnum) (defaults to: 10000)

    Count of best ranked resources per entity type.

# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 141

def generate_literal_statistics(type = nil, count = 10000)
  unless type
    type = get_all_classes
  end

  type = [type] unless type.is_a?(Array)

  type.each_with_index { |entity_type, index|
    all_properties = {}
    puts "#{__method__} - start process entity type: #{entity_type} [#{(index / type.size.to_f).round(2)}]" if @console_output
    entity_type = entity_type.to_s.to_sym

    get_best_ranked_resources(entity_type, count).each { |resource, _|
      properties = @query.get_all_predicates_by_subject(resource.to_s, true).map { |solution_prop|
        solution_prop[:property].to_s
      } || []

      properties.uniq.each { |prop|
        next if Predicate.unimportant?(prop)
        all_properties[entity_type] ||= {}
        all_properties[entity_type][prop] ||= 0
        all_properties[entity_type][prop] += 1
      }

    }

    update_global_statistic(all_properties)
  }
end
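The accumulation above counts, per entity type, in how many resources each predicate occurs; `uniq` ensures a resource contributes at most once per predicate. The same idea in isolation, with hypothetical data:

```ruby
# Predicates returned for each best-ranked resource (stand-in data).
resources_predicates = {
  'http://dbpedia.org/resource/A' => ['dbo:name', 'dbo:name', 'dbo:birthDate'],
  'http://dbpedia.org/resource/B' => ['dbo:name']
}

all_properties = Hash.new(0)
resources_predicates.each_value do |predicates|
  predicates.uniq.each { |prop| all_properties[prop] += 1 }
end
# all_properties => { 'dbo:name' => 2, 'dbo:birthDate' => 1 }
```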

#generate_result_file(resource_uri, type, result_relations, this_time) ⇒ Object

The method helps to store found information from the NIF dataset for the given resource.

Parameters:

  • resource_uri (String)
  • type (String)
  • result_relations (Hash)

    Hash generated by method #find_relations

  • this_time (Float)

    Time spent finding the resource in the NIF dataset.

Options Hash (result_relations):

  • :sections (Hash)

    Maps 'section_type' keys to 'position' values

  • :relations (Array<Hash>)

    Hashes generated by method #get_predicates_by_link



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 261

def generate_result_file(resource_uri, type, result_relations, this_time)
  section_degradation = result_relations[:sections].map { |section_type, position|
    index = result_relations[:sections].keys.index(section_type)

    # recognize value of degradation by relative position paragraphs in document

    position[:degradation] = 1 - ((index / result_relations[:sections].size.to_f) / 10.0)

    {section_type => position}
  }.reduce(:merge)

  total_size = section_degradation.max_by { |_, v| v[:to] }[1][:to].to_f

  result_nif_data = result_relations[:relations].map { |relation|
    paragraph_position = section_degradation[relation[:section]]

    # weight is lowest by relative distance from document start

    position_weight = (1 - ((relation[:indexes][0].to_i) / total_size))
    # weight is also degraded by index of paragraph

    relation[:weight] = (position_weight * paragraph_position[:degradation]).round(4)

    relation
  }

  result = {
    process_time: {nif_find: this_time, relations_find: result_relations[:time]},
    resource_uri: resource_uri,
    nif_data: result_nif_data
  }

  result_path = get_resource_file_path(resource_uri, type)
  File.open(result_path, 'w:utf-8') { |f| f << JSON.pretty_generate(result) }
end
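The weighting arithmetic can be traced by hand. A sketch with two hypothetical sections (the degradation formula assumes the float division shown here):

```ruby
sections = { abstract: { to: 500 }, career: { to: 2000 } }

# Each later section is degraded slightly by its index.
degradation = sections.each_with_index.map { |(name, pos), index|
  d = 1 - ((index / sections.size.to_f) / 10.0)
  [name, pos.merge(degradation: d)]
}.to_h
# :abstract => 1.0, :career => 0.95

total_size = degradation.values.map { |v| v[:to] }.max.to_f

# A link starting at character 500, inside the :career section:
position_weight = 1 - (500 / total_size)  # closer to the start => higher
weight = (position_weight * degradation[:career][:degradation]).round(4)
```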

#generate_statistics_from_nif(entity_types, count = 10, demand_reload = false) ⇒ Object

The method finds links in the given NIF dataset, then collects relations via #find_relations. For each resource it generates a file in @results_dir_path.

Parameters:

  • entity_types (Array<String>, String)
  • count (Fixnum) (defaults to: 10)

    Count of best ranked resources

  • demand_reload (TrueClass, FalseClass) (defaults to: false)


# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 75

def generate_statistics_from_nif(entity_types, count = 10, demand_reload = false)
  unless @nif_file_path
    raise RuntimeError.new('Instance has no nif_dataset_path defined. Cannot start generating from the NIF dataset. Please create a new instance with a dataset path.')
  end

  resources = get_best_ranked_resources(entity_types, count)
  resources = keep_unloaded(resources) unless demand_reload

  actual_resource_data = []
  lines_group = []

  begin
    time_start = Time.now
    nif_file = File.open(@nif_file_path, 'r')
    line = nif_file.readline

    until nif_file.eof?
      line = nif_file.readline

      if lines_group.size == 7
        # evaluate group (7 lines)

        this_resource_uri = NIFLineParser.parse_resource_uri(lines_group[0])

        if resources.keys.include?(this_resource_uri)
          # process group, is requested

          resource_uri = this_resource_uri
          actual_resource_data << NIFLineParser.parse_line_group(lines_group)

        elsif !actual_resource_data.empty?
          # resource changed, process actual_resource_data

          resource_hash = resources.delete(resource_uri)
          type = resource_hash[:type]

          this_time = (Time.now - time_start).round(2)
          puts "\n#{resource_uri}\n- nif found in #{this_time}\n- resources to find #{resources.size}" if @console_output

          result_relations = find_relations(resource_uri, actual_resource_data, type)
          generate_result_file(resource_uri, type, result_relations, this_time)

          actual_resource_data = []
          time_start = Time.now
        end

        # start new group

        lines_group = [line]
      else

        # join line to group

        lines_group << line
      end

      break if resources.empty?
    end

  ensure
    nif_file.close if nif_file && !nif_file.closed?
  end
end
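The reading loop above skips the first line, then gathers fixed-size groups of 7 lines, flushing a group only when the first line of the next one arrives. The same mechanics with a StringIO stand-in (the line contents are not real NIF triples):

```ruby
require 'stringio'

nif = StringIO.new((1..15).map { |i| "line#{i}\n" }.join)

groups = []
lines_group = []
nif.readline # skip the first line, as the original loop does

until nif.eof?
  line = nif.readline
  if lines_group.size == 7
    groups << lines_group # group complete: hand it off for evaluation
    lines_group = [line]  # start the next group with the current line
  else
    lines_group << line
  end
end
# groups holds one full group (lines 2..8); lines 9..15 remain pending,
# which is why the original flushes pending data when the resource changes.
```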

#get_all_classes(path = File.join(__dir__, '../knowledge/classes_hierarchy.json')) ⇒ Array<String>

The method loads all entity class types defined at mappings.dbpedia.org/server/ontology/classes/

Parameters:

  • path (String) (defaults to: File.join(__dir__, '../knowledge/classes_hierarchy.json'))

Returns:

  • (Array<String>)

    classes



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 407

def get_all_classes(path = File.join(__dir__, '../knowledge/classes_hierarchy.json'))
  data = ensure_load_json(path, {})
  HashHelper.recursive_map_keys(data)
end
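`HashHelper.recursive_map_keys` is assumed to flatten the nested class hierarchy into a list of class names; its actual implementation is not shown here. A sketch of that behavior:

```ruby
# Collect all keys of a nested hash, depth-first (hypothetical helper,
# illustrating the assumed behavior of HashHelper.recursive_map_keys).
def recursive_keys(hash)
  hash.flat_map { |key, value|
    [key] + (value.is_a?(Hash) ? recursive_keys(value) : [])
  }
end

hierarchy = { 'Agent' => { 'Person' => { 'Athlete' => {} }, 'Organisation' => {} } }
recursive_keys(hierarchy)
# => ["Agent", "Person", "Athlete", "Organisation"]
```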

#get_best_ranked_resources(entity_types, count = 10) ⇒ Hash

The method returns a list of the best ranked resources for the required entity types.

Parameters:

  • entity_types (Array<String>, String)

  • count (Fixnum) (defaults to: 10)
Returns:

  • (Hash)

    resources



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 178

def get_best_ranked_resources(entity_types, count = 10)
  resources = {}
  entity_types = [entity_types] unless entity_types.is_a?(Array)

  entity_types.each { |type|
    top_ranked_entities = @query.get_resources_by_dbpedia_page_rank(type, count)

    top_ranked_entities.each { |solution|
      resources[solution.entity.value] = {type: type, rank: solution.rank.value.to_f}
    }
  }

  resources
end

#get_predicates_by_link(resource_uri, link, type) ⇒ Hash

The method finds predicates for the given link. Strict predicates are those in the relation: <resource> ?predicate <link> . General predicates are those in: ?subject a <type> . ?subject ?predicate <link>

Parameters:

  • resource_uri (String)

    Resource for which strict properties will be found

  • link (String)

    Link that has some importance to resource or entity type.

  • type (String)

Returns:

  • (Hash)

    result



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 223

def get_predicates_by_link(resource_uri, link, type)
  properties = {type => {}}
  strict_properties = {type => {}}

  @query.get_all_predicates_by_subject_object(resource_uri, link).each { |solution|
    predicate = solution.to_h
    property = predicate[:property].to_s.force_encoding('utf-8')

    next if Predicate.unimportant?(property)

    count = @query.get_count_predicate_by_entity(type, property)[0].to_h[:count].to_f
    strict_properties[type][property] = count if count > 0
  }

  @query.get_all_predicates_by_object(link).each { |solution|
    predicate = solution.to_h
    property = predicate[:property].to_s.force_encoding('utf-8')

    next if Predicate.unimportant?(property) || strict_properties[type][property]

    count = @query.get_count_predicate_by_entity(type, property)[0].to_h[:count].to_f
    properties[type][property] = count if count > 0
  }


  {properties: properties, strict_properties: strict_properties}
end

#refresh_statistics_in_files(entity_types, count = 10) ⇒ Object

The method helps to recollect relations for already generated result files.

Parameters:

  • entity_types (Array<String>, String)

  • count (Fixnum) (defaults to: 10)

# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 198

def refresh_statistics_in_files(entity_types, count = 10)
  resources = get_best_ranked_resources(entity_types, count)

  resources = keep_loaded(resources)

  resources.each { |resource_uri, resource_info|
    puts "_____ #{resource_uri} _____" if @console_output

    update_nif_file_properties(resource_uri, resource_info[:type]) { |link|
      get_predicates_by_link(resource_uri, link, resource_info[:type])
    }
  }

end