Class: BrowserWebData::EntitySumarization::Statistic
- Inherits: Object
  - Object
  - BrowserWebData::EntitySumarization::Statistic
- Includes:
- BrowserWebData::EntitySumarizationConfig
- Defined in:
- lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb
Overview
The Statistic class finds, collects, and generates knowledge for entity summarization. Entity summarization is based on datasets in the NLP Interchange Format (NIF), for example the datasets from wiki.dbpedia.org/nif-abstract-datasets. The knowledge is generated from information in DBpedia.
Constant Summary
Constants included from BrowserWebData::EntitySumarizationConfig
BrowserWebData::EntitySumarizationConfig::COMMON_PROPERTIES, BrowserWebData::EntitySumarizationConfig::IDENTICAL_PROPERTY_LIMIT, BrowserWebData::EntitySumarizationConfig::IMPORTANCE_TO_IDENTIFY_MAX_COUNT, BrowserWebData::EntitySumarizationConfig::NO_SENSE_PROPERTIES, BrowserWebData::EntitySumarizationConfig::SCAN_REGEXP
Instance Attribute Summary
-
#nif_file_path ⇒ Object
readonly
Returns the value of attribute nif_file_path.
-
#results_dir_path ⇒ Object
readonly
Returns the value of attribute results_dir_path.
Instance Method Summary
-
#create_complete_knowledge_base(params) ⇒ Object
Finds resource links in the given NIF dataset, then generates literal statistics and the knowledge base for each entity type.
-
#generate_knowledge_base_for_entity(type, identify_identical = true) ⇒ Object
Processes all result files generated from the NIF dataset (per entity class type) into a single knowledge base file.
-
#generate_literal_statistics(type = nil, count = 10000) ⇒ Object
Generates simple statistics containing all predicates that link to literals.
-
#generate_result_file(resource_uri, type, result_relations, this_time) ⇒ Object
Stores information found in the NIF dataset for the given resource.
-
#generate_statistics_from_nif(entity_types, count = 10, demand_reload = false) ⇒ Object
Finds links in the given NIF dataset.
-
#get_all_classes(path = File.join(__dir__, '../knowledge/classes_hierarchy.json')) ⇒ Array<String>
Loads all entity class types defined at mappings.dbpedia.org/server/ontology/classes/.
-
#get_best_ranked_resources(entity_types, count = 10) ⇒ Hash
Returns the list of best-ranked resources for the required entity types.
-
#get_predicates_by_link(resource_uri, link, type) ⇒ Hash
Finds predicates for the given link.
-
#initialize(nif_dataset_path = nil, results_dir_path = nil, console_output = false) ⇒ Statistic
constructor
Creates a new instance of the Statistic class.
-
#refresh_statistics_in_files(entity_types, count = 10) ⇒ Object
Recollects relations from already generated result files.
Constructor Details
#initialize(nif_dataset_path = nil, results_dir_path = nil, console_output = false) ⇒ Statistic
Creates a new instance of the Statistic class.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 27

def initialize(nif_dataset_path = nil, results_dir_path = nil, console_output = false)
  nif_dataset_path = nif_dataset_path.gsub('\\', '/') if nif_dataset_path
  results_dir_path = results_dir_path.gsub('\\', '/').chomp('/') if results_dir_path

  # fall back to a temp directory when no usable results dir is given
  unless results_dir_path && Dir.exist?(results_dir_path)
    cache_dir_path = "#{Dir.tmpdir}/#{BrowserWebData::TMP_DIR}"
    Dir.mkdir(cache_dir_path) unless Dir.exist?(cache_dir_path)

    results_dir_path = "#{cache_dir_path}/results"
    Dir.mkdir(results_dir_path) unless Dir.exist?(results_dir_path)
  end

  @nif_file_path = nif_dataset_path
  @results_dir_path = results_dir_path
  @console_output = console_output

  @query = SPARQLRequest.new
  @predicates_similarity = PredicatesSimilarity.new(@results_dir_path, IDENTICAL_PROPERTY_LIMIT, console_output)
end
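The constructor's results-directory fallback can be sketched in isolation. This is a standalone approximation, not the library's code: the cache directory name `browser_web_data_example` stands in for `BrowserWebData::TMP_DIR`, and `FileUtils.mkdir_p` replaces the two guarded `Dir.mkdir` calls.

```ruby
require 'tmpdir'
require 'fileutils'

# Normalize a Windows-style path and fall back to a temp results directory
# when the requested one does not exist (mirrors the constructor's fallback).
def resolve_results_dir(results_dir_path)
  results_dir_path = results_dir_path.gsub('\\', '/').chomp('/') if results_dir_path

  unless results_dir_path && Dir.exist?(results_dir_path)
    # hypothetical cache dir name standing in for BrowserWebData::TMP_DIR
    cache_dir_path = File.join(Dir.tmpdir, 'browser_web_data_example')
    results_dir_path = File.join(cache_dir_path, 'results')
    FileUtils.mkdir_p(results_dir_path)
  end

  results_dir_path
end

puts resolve_results_dir(nil)
```

Calling it with `nil` (the constructor's default) yields a usable temp directory instead of raising.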
Instance Attribute Details
#nif_file_path ⇒ Object (readonly)
Returns the value of attribute nif_file_path.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 19

def nif_file_path
  @nif_file_path
end
#results_dir_path ⇒ Object (readonly)
Returns the value of attribute results_dir_path.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 19

def results_dir_path
  @results_dir_path
end
Instance Method Details
#create_complete_knowledge_base(params) ⇒ Object
Finds resource links in the given NIF dataset, then generates literal statistics and the knowledge base for each entity type.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 54

def create_complete_knowledge_base(params)
  params[:entity_types] = [params[:entity_types]] unless params[:entity_types].is_a?(Array)

  generate_statistics_from_nif(params[:entity_types], params[:entity_count], params[:demand_reload])

  params[:entity_types].each { |type| generate_literal_statistics(type) }

  params[:entity_types].each { |type| generate_knowledge_base_for_entity(type, params[:identify_identical_predicates]) }
end
#generate_knowledge_base_for_entity(type, identify_identical = true) ⇒ Object
Processes all result files generated from the NIF dataset (per entity class type) into a single knowledge base file.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 300

def generate_knowledge_base_for_entity(type, identify_identical = true)
  puts "_____ #{type} _____" if @console_output

  files = Dir.glob("#{@results_dir_path}/#{type}/*.json")
  type = type.to_s.to_sym
  knowledge_data = {type => []}

  global_properties = get_global_statistic_by_type(type) || {}

  if identify_identical
    try_this_identical = {}

    files.each { |file_path|
      file_data = JSON.parse(File.read(file_path).force_encoding('utf-8'), symbolize_names: true)
      file_data[:nif_data].each { |data|
        try_this_identical.merge!(data[:properties][type]) { |_, x, y| x + y }
      }
    }

    try_this_identical.merge!(global_properties) { |_, x, y| x + y }

    if try_this_identical.size > 0
      try_this_identical = Hash[try_this_identical.sort_by { |_, v| v }.reverse]
      puts "- prepare to identify identical: total count #{try_this_identical.size}" if @console_output
      @predicates_similarity.identify_identical_predicates(try_this_identical.keys)
    end
  end

  puts "- calculate: files count #{files.size}" if @console_output

  files.each { |file_path|
    file_data = JSON.parse(File.read(file_path).force_encoding('utf-8'), symbolize_names: true)

    file_data[:nif_data].each { |found|
      properties = found[:properties][type.to_sym]
      strict_properties = (found[:strict_properties] || {})[type] || {}
      weight = found[:weight]

      strict_properties.each { |property, count|
        property = property.to_s
        value = count.to_i * weight

        prepare_property_to_knowledge(property, knowledge_data[type]) { |from_knowledge|
          old_score = from_knowledge[:score] * from_knowledge[:counter]
          from_knowledge[:counter] += 1
          (old_score + value) / from_knowledge[:counter]
        }
      }

      properties.each { |property, count|
        property = property.to_s
        value = count.to_i * weight

        prepare_property_to_knowledge(property, knowledge_data[type]) { |from_knowledge|
          old_score = from_knowledge[:score] * from_knowledge[:counter]
          from_knowledge[:counter] += 1
          (old_score + value) / from_knowledge[:counter]
        }
      }
    }

    unless knowledge_data[type].empty?
      max_weight = knowledge_data[type].max_by { |data| data[:score] }[:score]
      knowledge_data[type] = knowledge_data[type].map { |hash|
        hash[:score] = (hash[:score] / max_weight).round(4)
        hash
      }
    end
  }

  if global_properties.size > 0
    max_count = global_properties.max_by { |_, count| count }[1].to_f

    global_properties.each { |property, count|
      value = count / max_count

      prepare_property_to_knowledge(property, knowledge_data[type]) { |from_knowledge|
        from_knowledge[:score] > 0 ? ((from_knowledge[:score] + value) / 2.0).round(4) : value.round(4)
      }
    }
  end

  knowledge_data[type].map! { |hash|
    hash.delete(:counter)
    hash
  }

  knowledge_data[type] = knowledge_data[type].keep_if { |hash| hash[:score] > 0 }.sort_by { |hash| hash[:score] }.reverse

  @predicates_similarity.reduce_identical if identify_identical

  update_knowledge_base(knowledge_data)
end
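The score bookkeeping inside the `prepare_property_to_knowledge` blocks is a running mean: each new weighted count is folded into the existing average without keeping the full history. A minimal standalone sketch (the `entry` hash is a hypothetical reduction of the per-property record):

```ruby
# Running-mean update used for property scores: multiply the current mean
# back out, add the new value, and divide by the new counter.
def fold_score(entry, value)
  old_score = entry[:score] * entry[:counter]
  entry[:counter] += 1
  entry[:score] = (old_score + value) / entry[:counter]
  entry
end

entry = {property: 'http://dbpedia.org/ontology/birthPlace', score: 0.0, counter: 0}
fold_score(entry, 4.0)
fold_score(entry, 2.0)
puts entry[:score]  # => 3.0 (mean of 4.0 and 2.0)
```

This is why `:counter` is carried alongside `:score` and deleted only at the end, once all files have contributed.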
#generate_literal_statistics(type = nil, count = 10000) ⇒ Object
Generates simple statistics containing all predicates that link to literals. Predicates are grouped by entity class type, together with their total occurrence counts, and are collected from the best-ranked resources.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 141

def generate_literal_statistics(type = nil, count = 10000)
  type = get_all_classes unless type
  type = [type] unless type.is_a?(Array)

  type.each_with_index { |entity_type, index|
    all_properties = {}

    puts "#{__method__} - start process entity type: #{entity_type} [#{(index / type.size.to_f).round(2)}]" if @console_output

    entity_type = entity_type.to_s.to_sym

    get_best_ranked_resources(entity_type, count).each { |resource, _|
      properties = @query.get_all_predicates_by_subject(resource.to_s, true).map { |solution_prop|
        solution_prop[:property].to_s
      } || []

      properties.uniq.each { |prop|
        next if Predicate.unimportant?(prop)

        all_properties[entity_type] ||= {}
        all_properties[entity_type][prop] ||= 0
        all_properties[entity_type][prop] += 1
      }
    }

    update_global_statistic(all_properties)
  }
end
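The counting step reduces to a per-type occurrence tally: each resource contributes each of its distinct predicates once. A self-contained sketch with invented predicate lists:

```ruby
# Tally predicate occurrences per entity type, as the statistics pass does.
# The resource predicate lists below are made up for illustration.
all_properties = {}
entity_type = :Person
predicates_per_resource = [
  ['http://dbpedia.org/ontology/birthDate', 'http://xmlns.com/foaf/0.1/name'],
  ['http://dbpedia.org/ontology/birthDate']
]

predicates_per_resource.each { |properties|
  properties.uniq.each { |prop|
    all_properties[entity_type] ||= {}
    all_properties[entity_type][prop] ||= 0
    all_properties[entity_type][prop] += 1
  }
}

puts all_properties[:Person]['http://dbpedia.org/ontology/birthDate']  # => 2
```

The resulting hash has the same shape the method passes to `update_global_statistic`: `{type => {predicate_uri => count}}`.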
#generate_result_file(resource_uri, type, result_relations, this_time) ⇒ Object
Stores information found in the NIF dataset for the given resource.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 261

def generate_result_file(resource_uri, type, result_relations, this_time)
  section_degradation = result_relations[:sections].map { |section_type, position|
    index = result_relations[:sections].keys.index(section_type)

    # recognize value of degradation by relative position of paragraphs in document
    position[:degradation] = 1 - ((index.to_f / result_relations[:sections].size) / 10.0)

    {section_type => position}
  }.reduce(:merge)

  total_size = section_degradation.max_by { |_, v| v[:to] }[1][:to].to_f

  result_nif_data = result_relations[:relations].map { |relation|
    paragraph_position = section_degradation[relation[:section]]

    # weight is lowest by relative distance from document start
    position_weight = 1 - (relation[:indexes][0].to_i / total_size)

    # weight is also degraded by index of paragraph
    relation[:weight] = (position_weight * paragraph_position[:degradation]).round(4)

    relation
  }

  result = {
    process_time: {nif_find: this_time, relations_find: result_relations[:time]},
    resource_uri: resource_uri,
    nif_data: result_nif_data
  }

  result_path = get_resource_file_path(resource_uri, type)

  File.open(result_path, 'w:utf-8') { |f| f << JSON.pretty_generate(result) }
end
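The weighting combines two factors: a link's weight falls with its character offset in the document, and the paragraph's degradation falls with its index among the sections. A worked sketch with made-up numbers:

```ruby
# Position-based weighting from generate_result_file, on invented inputs:
# a 4-section document of 2000 characters, a link at offset 500 in section 2.
sections_count = 4
total_size = 2000.0

section_index = 2
degradation = 1 - ((section_index.to_f / sections_count) / 10.0)  # 0.95

link_offset = 500
position_weight = 1 - (link_offset / total_size)                  # 0.75

weight = (position_weight * degradation).round(4)
puts weight  # => 0.7125
```

Note that the degradation factor shrinks slowly (a tenth of the relative section position), so paragraph order nudges the weight while the character offset dominates it.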
#generate_statistics_from_nif(entity_types, count = 10, demand_reload = false) ⇒ Object
Finds links in the given NIF dataset, then collects relations via #find_relations. For each resource a result file is generated in @results_dir_path.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 75

def generate_statistics_from_nif(entity_types, count = 10, demand_reload = false)
  unless @nif_file_path
    raise RuntimeError.new('Instance has no defined nif_dataset_path. Can not start generating from nif dataset. Please create new instance.')
  end

  resources = get_best_ranked_resources(entity_types, count)
  resources = keep_unloaded(resources) unless demand_reload

  actual_resource_data = []
  lines_group = []

  begin
    time_start = Time.now
    nif_file = File.open(@nif_file_path, 'r')
    line = nif_file.readline

    until nif_file.eof?
      line = nif_file.readline

      if lines_group.size == 7
        # evaluate group (7 lines)
        this_resource_uri = NIFLineParser.parse_resource_uri(lines_group[0])

        if resources.keys.include?(this_resource_uri)
          # process group, is requested
          resource_uri = this_resource_uri
          actual_resource_data << NIFLineParser.parse_line_group(lines_group)
        elsif !actual_resource_data.empty?
          # resource changed, process actual_resource_data
          resource_hash = resources.delete(resource_uri)
          type = resource_hash[:type]
          this_time = (Time.now - time_start).round(2)

          puts "\n#{resource_uri}\n- nif found in #{this_time}\n- resources to find #{resources.size}" if @console_output

          result_relations = find_relations(resource_uri, actual_resource_data, type)
          generate_result_file(resource_uri, type, result_relations, this_time)

          actual_resource_data = []
          time_start = Time.now
        end

        # start new group
        lines_group = [line]
      else
        # join line to group
        lines_group << line
      end

      break if resources.empty?
    end
  ensure
    nif_file.close if nif_file && !nif_file.closed?
  end
end
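The streaming loop above consumes the dataset in fixed-size line groups (each NIF annotation spans 7 lines) rather than loading the file into memory. A simplified, self-contained sketch of that grouping, using an in-memory sample and a group size of 3 to keep it short:

```ruby
require 'stringio'

# Stream input in fixed-size line groups, simplified from the NIF pass
# (no resource filtering; the real parser uses groups of 7 lines).
GROUP_SIZE = 3

io = StringIO.new((1..9).map { |i| "line#{i}\n" }.join)
groups = []
lines_group = []

until io.eof?
  line = io.readline.chomp

  if lines_group.size == GROUP_SIZE
    # group complete: hand it off, start the next group with this line
    groups << lines_group
    lines_group = [line]
  else
    lines_group << line
  end
end
groups << lines_group unless lines_group.empty?

p groups.size  # => 3
```

Because groups are flushed as soon as they fill, the method can process multi-gigabyte NIF dumps while holding only one annotation block at a time.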
#get_all_classes(path = File.join(__dir__, '../knowledge/classes_hierarchy.json')) ⇒ Array<String>
Loads all entity class types defined at mappings.dbpedia.org/server/ontology/classes/.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 407

def get_all_classes(path = File.join(__dir__, '../knowledge/classes_hierarchy.json'))
  data = ensure_load_json(path, {})
  HashHelper.recursive_map_keys(data)
end
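`HashHelper.recursive_map_keys` is not shown here, but from its use it flattens the nested class hierarchy into a list of class names. An approximation under that assumption, with an invented hierarchy sample:

```ruby
# Collect every key from a nested hierarchy hash, approximating what
# HashHelper.recursive_map_keys does for classes_hierarchy.json.
def collect_keys(hash)
  hash.flat_map { |key, children|
    [key.to_s] + (children.is_a?(Hash) ? collect_keys(children) : [])
  }
end

# Invented fragment in the shape of a DBpedia ontology class tree.
hierarchy = {
  'Thing' => {
    'Agent' => {'Person' => {}, 'Organisation' => {}},
    'Place' => {}
  }
}

p collect_keys(hierarchy)
# => ["Thing", "Agent", "Person", "Organisation", "Place"]
```

The returned array is what `generate_literal_statistics` iterates over when no explicit type is given.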
#get_best_ranked_resources(entity_types, count = 10) ⇒ Hash
Returns the list of best-ranked resources for the required entity types.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 178

def get_best_ranked_resources(entity_types, count = 10)
  resources = {}
  entity_types = [entity_types] unless entity_types.is_a?(Array)

  entity_types.each { |type|
    top_ranked_entities = @query.get_resources_by_dbpedia_page_rank(type, count)

    top_ranked_entities.each { |solution|
      resources[solution.entity.value] = {type: type, rank: solution.rank.value.to_f}
    }
  }

  resources
end
#get_predicates_by_link(resource_uri, link, type) ⇒ Hash
Finds predicates for the given link. Strict predicates are those in the relation: <resource> ?predicate <link>. Other predicates are those in the relation: ?subject a <type> . ?subject ?predicate <link>.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 223

def get_predicates_by_link(resource_uri, link, type)
  properties = {type => {}}
  strict_properties = {type => {}}

  @query.get_all_predicates_by_subject_object(resource_uri, link).each { |solution|
    predicate = solution.to_h
    property = predicate[:property].to_s.force_encoding('utf-8')

    next if Predicate.unimportant?(property)

    count = @query.get_count_predicate_by_entity(type, property)[0].to_h[:count].to_f
    strict_properties[type][property] = count if count > 0
  }

  @query.get_all_predicates_by_object(link).each { |solution|
    predicate = solution.to_h
    property = predicate[:property].to_s.force_encoding('utf-8')

    next if Predicate.unimportant?(property) || strict_properties[type][property]

    count = @query.get_count_predicate_by_entity(type, property)[0].to_f if false # (unused guard removed)
    count = @query.get_count_predicate_by_entity(type, property)[0].to_h[:count].to_f
    properties[type][property] = count if count > 0
  }

  {properties: properties, strict_properties: strict_properties}
end
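The two relation patterns in the description correspond roughly to the SPARQL below. These query strings are illustrative reconstructions, not the library's actual queries (those are issued internally by `SPARQLRequest`), and the example URIs are invented:

```ruby
# Strict pattern: predicates linking this resource directly to the link target.
def strict_predicates_query(resource_uri, link)
  "SELECT DISTINCT ?property WHERE { <#{resource_uri}> ?property <#{link}> }"
end

# Loose pattern: predicates linking any subject of the given type to the target.
def typed_predicates_query(type, link)
  "SELECT DISTINCT ?property WHERE { " \
  "?subject a <http://dbpedia.org/ontology/#{type}> . " \
  "?subject ?property <#{link}> }"
end

puts strict_predicates_query(
  'http://dbpedia.org/resource/Prague',
  'http://dbpedia.org/resource/Czech_Republic'
)
```

The method keeps the two result sets separate so that strict matches (direct statements about the resource) can be weighted above the type-level ones.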
#refresh_statistics_in_files(entity_types, count = 10) ⇒ Object
Recollects relations from already generated result files.
# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_statistics.rb', line 198

def refresh_statistics_in_files(entity_types, count = 10)
  resources = get_best_ranked_resources(entity_types, count)
  resources = keep_loaded(resources)

  resources.each { |resource_uri, resource_info|
    puts "_____ #{resource_uri} _____" if @console_output

    update_nif_file_properties(resource_uri, resource_info[:type]) { |link|
      get_predicates_by_link(resource_uri, link, resource_info[:type])
    }
  }
end