Class: BrowserWebData::EntitySumarization::PredicatesSimilarity

Inherits:
Object
Includes:
BrowserWebData::EntitySumarizationConfig
Defined in:
lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb

Overview

This class provides methods for identifying identical predicates.

Constant Summary

Constants included from BrowserWebData::EntitySumarizationConfig

BrowserWebData::EntitySumarizationConfig::COMMON_PROPERTIES, BrowserWebData::EntitySumarizationConfig::IDENTICAL_PROPERTY_LIMIT, BrowserWebData::EntitySumarizationConfig::IMPORTANCE_TO_IDENTIFY_MAX_COUNT, BrowserWebData::EntitySumarizationConfig::NO_SENSE_PROPERTIES, BrowserWebData::EntitySumarizationConfig::SCAN_REGEXP

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(results_dir_path, identical_limit = IDENTICAL_PROPERTY_LIMIT, console_output = false) ⇒ PredicatesSimilarity

Creates a new instance of the PredicatesSimilarity class.

Parameters:

  • results_dir_path (String)
  • identical_limit (Float) (defaults to: IDENTICAL_PROPERTY_LIMIT)

    Minimum identity rate at which two predicates are marked as identical.

  • console_output (TrueClass, FalseClass) (defaults to: false)

    Enables printing of progress information to the console. Defaults to false.



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 22

def initialize(results_dir_path, identical_limit = IDENTICAL_PROPERTY_LIMIT, console_output = false)
  @results_dir_path = results_dir_path
  @console_output = console_output
  @identical_limit = identical_limit

  @query = SPARQLRequest.new

  load_identical_predicates
  load_different_predicates
  load_counts
end

Class Method Details

.get_key(predicates) ⇒ String

Returns the key for a group of identical predicates.

Parameters:

  • predicates (Array<String>)

Returns:

  • (String)

    key



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 40

def self.get_key(predicates)
  predicates = [predicates] unless predicates.is_a?(Array)
  "<#{predicates.sort.join('><')}>" if predicates && !predicates.empty?
end
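
To illustrate the key format, the following standalone sketch copies the logic shown above into a free function. The name `key_for` and the example URIs are assumptions made for illustration; they are not part of the library.

```ruby
# Standalone sketch of the key format. key_for is a hypothetical name
# that mirrors the body of PredicatesSimilarity.get_key.
def key_for(predicates)
  predicates = [predicates] unless predicates.is_a?(Array)
  "<#{predicates.sort.join('><')}>" if predicates && !predicates.empty?
end

pair = ['http://dbpedia.org/property/name', 'http://dbpedia.org/ontology/name']

# Each predicate is wrapped in angle brackets and the list is sorted first.
puts key_for(pair)

# Sorting makes the key independent of the input order.
puts key_for(pair) == key_for(pair.reverse)

# A single String is wrapped in an Array before the key is built.
puts key_for('http://dbpedia.org/ontology/name')

# An empty Array yields nil.
p key_for([])
```

Because the predicates are sorted before joining, the same group always maps to the same key, which is what allows keys to be compared with a plain include? check.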

.parse_key(key) ⇒ Array<String>

Returns the identical predicates encoded in a key.

Parameters:

  • key (String)

Returns:

  • (Array<String>)

    predicates



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 51

def self.parse_key(key)
  key.to_s.scan(SCAN_REGEXP[:identical_key]).reduce(:+)
end
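
The value of SCAN_REGEXP[:identical_key] is not shown on this page. The sketch below therefore assumes a hypothetical pattern with one capture group that matches the text between `<` and `>`; `IDENTICAL_KEY_PATTERN` and `parse_key_sketch` are illustrative names only.

```ruby
# Assumption: the real SCAN_REGEXP[:identical_key] is not shown here.
# This hypothetical pattern captures the text between each "<" and ">".
IDENTICAL_KEY_PATTERN = /<([^>]+)>/

def parse_key_sketch(key)
  # With a capture group, scan returns one Array per match,
  # e.g. [["a"], ["b"]]; reduce(:+) concatenates them into ["a", "b"].
  key.to_s.scan(IDENTICAL_KEY_PATTERN).reduce(:+)
end

p parse_key_sketch('<http://dbpedia.org/ontology/name><http://dbpedia.org/property/name>')

# nil.to_s is "", nothing matches, and reduce on an empty Array returns nil.
p parse_key_sketch(nil)
```

Under this assumption a missing key parses to nil rather than to an empty Array, which is why callers compare the result with nil?.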

Instance Method Details

#add_different(values) ⇒ Object

Adds a new group of different values to local storage.

Parameters:

  • values (Array<String>)


# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 173

def add_different(values)
  values = values.map { |p| p.to_s }.uniq.sort
  group_key = PredicatesSimilarity.get_key(values)

  unless @different_predicates.include?(group_key)
    @different_predicates << group_key

    @new_diff_counter ||= 0
    @new_diff_counter += 1

    if @new_diff_counter > 100
      store_different_predicates
      @new_diff_counter = 0
    end

  end
end
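
The counter in the method above batches writes: storage is triggered only once more than 100 new keys have accumulated. A minimal sketch of that pattern follows; the `BatchedStore` class and its `flush` method are hypothetical stand-ins, with `flush` only counting calls instead of writing a file.

```ruby
# Hypothetical illustration of counter-based batching.
class BatchedStore
  attr_reader :items, :flushes

  def initialize(batch_size)
    @batch_size = batch_size
    @items = []
    @pending = 0
    @flushes = 0
  end

  # Adds an item once; flushes when more than batch_size items are pending.
  def add(item)
    return if @items.include?(item)

    @items << item
    @pending += 1
    flush if @pending > @batch_size
  end

  # Stand-in for writing the items to disk.
  def flush
    @flushes += 1
    @pending = 0
  end
end

store = BatchedStore.new(100)
101.times { |i| store.add("<key#{i}>") }
puts store.flushes
```

With a strict greater-than comparison, the flush happens on the 101st new item rather than on the 100th.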

#add_identical(values) ⇒ Object

Adds a new group of identical values to local storage.

Parameters:

  • values (Array<String>)


# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 159

def add_identical(values)
  values = values.map { |p| p.to_s }.uniq.sort
  group_key = PredicatesSimilarity.get_key(values)

  unless @identical_predicates.include?(group_key)
    @identical_predicates << group_key
    store_identical_properties
  end
end

#find_different(value) ⇒ Array<String>, NilClass

Checks whether the given predicates are already marked as different. Returns the predicates of the matching stored group, or nil if no group matches.

Parameters:

  • value (Array<String>, String)

Returns:

  • (Array<String>, NilClass)

Raises:

  • (RuntimeError)


# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 140

def find_different(value)
  raise RuntimeError.new('No support identify identical for more than 2 predicates.') if value.is_a?(Array) && value.size > 2

  key = case value
          when Array
            value = value.map { |v| PredicatesSimilarity.get_key(v) }
            @different_predicates.find { |p| p[value[0]] && p[value[1]] }
          else
            value = PredicatesSimilarity.get_key(value)
            @different_predicates.find { |p| p[value] }
        end

  PredicatesSimilarity.parse_key(key)
end

#find_identical(value) ⇒ Array<String>, NilClass

Checks whether the given predicates are already marked as identical. Returns the predicates of the matching stored group, or nil if no group matches.

Parameters:

  • value (Array<String>, String)

Returns:

  • (Array<String>, NilClass)

Raises:

  • (RuntimeError)


# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 115

def find_identical(value)
  raise RuntimeError.new('No support identify identical for more than 2 predicates.') if value.is_a?(Array) && value.size > 2

  predicates_key = case value
                     when Array
                       value = value.map { |v| PredicatesSimilarity.get_key(v) }
                       @identical_predicates.find { |p|
                         p[value[0]] && p[value[1]]
                       }
                     else
                       value = PredicatesSimilarity.get_key(value)
                       @identical_predicates.find { |p|
                         p[value]
                       }
                   end

  PredicatesSimilarity.parse_key(predicates_key)
end
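
Both lookup methods rely on String#[] with a String argument, which returns the substring when it is present and nil otherwise. The sketch below shows the idea on a small in-memory list; `find_group` and the `ex:` predicate names are hypothetical, and the helper is simplified in that it accepts any number of values instead of at most two.

```ruby
# Stored keys, in the format produced by PredicatesSimilarity.get_key.
stored = ['<ex:birthPlace><ex:placeOfBirth>', '<ex:label><ex:name>']

# Hypothetical helper mirroring the substring lookup.
def find_group(stored, value)
  wrapped = Array(value).map { |v| "<#{v}>" }
  # String#[] with a String argument returns that substring, or nil if absent,
  # so a stored key matches only when it contains every wrapped value.
  stored.find { |key| wrapped.all? { |w| key[w] } }
end

p find_group(stored, 'ex:name')
p find_group(stored, ['ex:placeOfBirth', 'ex:birthPlace'])
p find_group(stored, ['ex:name', 'ex:birthPlace'])
```

Wrapping each value in angle brackets before the lookup prevents a predicate from matching another predicate that merely contains it as a prefix or suffix.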

#identify_identical_predicates(predicates, identical_limit = @identical_limit) ⇒ Object

Checks every combination of two predicates. The identified combinations are stored in two files, identical_predicates.json and different_predicates.json, each containing an Array of combination keys. The given predicates are reduced to the first IMPORTANCE_TO_IDENTIFY_MAX_COUNT (250) entries.

Parameters:

  • predicates (Array<String>)
  • identical_limit (Float) (defaults to: @identical_limit)


# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 62

def identify_identical_predicates(predicates, identical_limit = @identical_limit)
  combination = predicates.take(IMPORTANCE_TO_IDENTIFY_MAX_COUNT).map { |p| p.to_sym }.combination(2)
  times_count = combination.size / 10.0

  combination.each_with_index { |values, i|

    already_mark_same = find_identical(values)
    already_mark_different = find_different(values)

    if already_mark_same.nil? && already_mark_different.nil?

      # A DBpedia ontology vs. property pair with the same name
      # is automatically marked as identical.

      unless is_identical_property_ontology?(values)

        unless @counts[values[0]]
          @counts[values[0]] = @query.get_count_of_identical_predicates(values[0])
        end

        unless @counts[values[1]]
          @counts[values[1]] = @query.get_count_of_identical_predicates(values[1])
        end

        x = @counts[values[0]]
        y = @counts[values[1]]
        z = @query.get_count_of_identical_predicates(values)

        # to_f avoids truncating Integer division if the counts are Integers
        identical_level = z.to_f / [x, y].max

        if identical_level >= identical_limit
          puts "     - result[#{identical_level}] z[#{z}] x[#{x}] y[#{y}] #{values.inspect}" if @console_output
          add_identical(values)
        else
          add_different(values)
        end
      end
    end

    if @console_output && ( i == 0 || (i+1) % times_count == 0 )
      puts "#{Time.now.localtime} | #{(((i+1)/combination.size.to_f) * 100).round(0)}% | [#{(i+1)}/#{combination.size}]"
    end

  }

  store_counts
end
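
The decision in the loop above comes down to one ratio: the count for the pair divided by the larger of the two individual counts. The sketch below isolates that calculation. The helper name `identical_level`, the example counts, and the 0.8 threshold are assumptions for illustration; the sketch also converts to Float and guards against a zero denominator, which goes beyond what the listing shows.

```ruby
# Hypothetical helper: pair count divided by the larger individual count.
def identical_level(shared, count_a, count_b)
  largest = [count_a, count_b].max
  return 0.0 if largest.zero? # assumption: treat "no occurrences" as no overlap

  # to_f guards against Integer division, which would truncate the ratio.
  shared.to_f / largest
end

limit = 0.8 # example threshold, not necessarily IDENTICAL_PROPERTY_LIMIT
level = identical_level(90, 100, 95)
puts level
puts level >= limit ? 'identical' : 'different'

# Number of pairs checked when the maximum of 250 predicates is reached.
puts((1..250).to_a.combination(2).size) # 31125
```

Since the number of pairs grows quadratically with the number of predicates, the cap on the predicate count also bounds the number of pair checks.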

#is_identical_property_ontology?(values) ⇒ TrueClass, FalseClass

Automatically identifies a pair as identical when it consists of a DBpedia property predicate and an ontology predicate with the same name.

Parameters:

  • values (Array<String>)

Returns:

  • (TrueClass, FalseClass)

    result



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 198

def is_identical_property_ontology?(values)
  group_key = PredicatesSimilarity.get_key(values)

  temp = values.map { |val| val.to_s.split('/').last }.uniq
  if temp.size == 1 && group_key['property/'] && group_key['ontology/']
    add_identical(values)
    true
  else
    false
  end
end
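
As a sketch of the rule applied above, the following side-effect-free helper performs the same two checks: the last path segment of both URIs must be equal, and one URI must come from the `property/` namespace and the other from `ontology/`. The name `property_ontology_pair?` and the example URIs are assumptions; unlike the method above, the helper does not call add_identical.

```ruby
# Hypothetical, side-effect-free version of the check (no add_identical call).
def property_ontology_pair?(values)
  uris = values.map(&:to_s)
  # Compare only the last path segment, e.g. "birthPlace".
  local_names = uris.map { |uri| uri.split('/').last }.uniq

  local_names.size == 1 &&
    uris.any? { |uri| uri.include?('property/') } &&
    uris.any? { |uri| uri.include?('ontology/') }
end

same_name = %w[http://dbpedia.org/property/birthPlace http://dbpedia.org/ontology/birthPlace]
other_name = %w[http://dbpedia.org/property/birthPlace http://dbpedia.org/ontology/birthDate]

p property_ontology_pair?(same_name)  # true
p property_ontology_pair?(other_name) # false
```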

#recursive_find_identical(keys, values) ⇒ Array<String>

Collects chains of identical predicates.

Parameters:

  • keys (Array<String>)

    Array of identical key items.

  • values (Array<String>)

    All values related to the given keys.

Returns:

  • (Array<String>)

    all found values



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 238

def recursive_find_identical(keys, values)
  keys = [keys] unless keys.is_a?(Array)

  @identical_predicates.each { |this_key|
    next if keys.include?(this_key)
    temp = PredicatesSimilarity.parse_key(this_key)

    unless (temp & values).empty?
      keys << this_key
      return recursive_find_identical(keys, (values + temp).uniq)
    end
  }

  values
end

#reduce_identical ⇒ Object

Reduces the identical predicates by joining groups that share a common predicate.



# File 'lib/browser_web_data_entity_sumarization/entity_sumarization_predicates_similarity.rb', line 213

def reduce_identical
  new_identical = []

  @identical_predicates.each { |key|
    values = PredicatesSimilarity.parse_key(key)
    next if new_identical.find { |v| !(v & values).empty? }

    # find all groups that share predicates with these values
    values = recursive_find_identical(key, values)

    new_identical << values.uniq.sort
  }

  @identical_predicates = new_identical.map { |v| PredicatesSimilarity.get_key(v) }

  store_identical_properties
end
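
The effect of the reduction is that groups sharing at least one predicate end up in a single group, including chains such as a~b and b~c. The sketch below reaches the same kind of result with a different, iterative formulation rather than the recursive one used by the class; `merge_overlapping` and the single-letter predicates are hypothetical, and the helper works on plain Arrays instead of stored keys.

```ruby
# Hypothetical helper: merge groups that share at least one member.
# Invariant: the groups in `merged` never overlap each other.
def merge_overlapping(groups)
  merged = []

  groups.each do |group|
    current = group.dup
    # Split the merged groups into those overlapping the new group and the rest.
    overlapping, merged = merged.partition { |g| !(g & current).empty? }
    # Fold every overlapping group into the new one.
    overlapping.each { |g| current |= g }
    merged << current.uniq.sort
  end

  merged
end

pairs = [%w[a b], %w[c d], %w[b c], %w[x y]]
p merge_overlapping(pairs) # [["x", "y"], ["a", "b", "c", "d"]] is not the order here
```

For the input above, the pair b~c links the first two groups, so the result contains the group a, b, c, d and the separate group x, y; the order of the groups in the result depends on the order in which they are merged.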