Class: Opener::PropertyTagger::Processor

Inherits: Object

Defined in: lib/opener/property_tagger/processor.rb

Overview

Class that applies property tagging to a given input KAF file.

Constant Summary collapse

FILE_ASPECTS_CACHE =

Global cache used for storing loaded aspects.

Returns:

(Opener::PropertyTagger::FileAspectsCache)

FileAspectsCache.new

REMOTE_ASPECTS_CACHE =

RemoteAspectsCache.new

Instance Attribute Summary collapse

#aspects ⇒ Object

Returns the value of attribute aspects.
#aspects_path ⇒ Object

Returns the value of attribute aspects_path.
#aspects_url ⇒ Object

Returns the value of attribute aspects_url.
#document ⇒ Object

Returns the value of attribute document.
#pretty ⇒ Object

Returns the value of attribute pretty.
#timestamp ⇒ Object

Returns the value of attribute timestamp.

Instance Method Summary collapse

#add_features_layer ⇒ Object

Remove the features layer from the KAF file if it exists and add a new one.
#add_linguistic_processor ⇒ Object
#add_properties_layer ⇒ Object

Add the properties layer as a child to the features layer.
#add_property(key, value, index) ⇒ Object
#extract_aspects ⇒ Hash

Check which terms belong to an aspect (property). Text has priority over lemmas, overriding them if there is a conflict.
#initialize(file, params: {}, url: nil, path: nil, timestamp: true, pretty: false) ⇒ Processor constructor

A new instance of Processor.
#language ⇒ Object
#pretty_print(document) ⇒ String

Format the output document properly.
#process ⇒ String

Processes the input and returns the new KAF output.
#terms ⇒ Object

Constructor Details

#initialize(file, params: {}, url: nil, path: nil, timestamp: true, pretty: false) ⇒ `Processor`

Returns a new instance of Processor.

Parameters:

file (String|IO) —

The KAF file/input to process.
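params (Hash) (defaults to: {}) —

Additional options. The :cache_keys entry is merged with the document language and used as the lookup key for remote aspects.
url (String, nil) (defaults to: nil) —

URL of the aspects. When given, aspects are read from REMOTE_ASPECTS_CACHE instead of FILE_ASPECTS_CACHE.
path (String, nil) (defaults to: nil) —

Path to the aspects. Exposed afterwards as aspects_path.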
timestamp (TrueClass|FalseClass) (defaults to: true) —

Add timestamps to the KAF.
pretty (TrueClass|FalseClass) (defaults to: false) —

Enable pretty formatting, disabled by default due to the performance overhead.

# File 'lib/opener/property_tagger/processor.rb', line 27

def initialize file, params: {}, url: nil, path: nil, timestamp: true, pretty: false
  @document     = Nokogiri.XML file
  raise 'Error parsing input. Input is required to be KAF' unless is_kaf?
  @timestamp    = timestamp
  @pretty       = pretty

  @params       = params
  @remote       = !url.nil?
  @aspects_path = path
  @aspects_url  = url
  @cache_keys   = params[:cache_keys]
  @cache_keys.merge! lang: @document.root.attr('xml:lang')

  @aspects = if @remote then REMOTE_ASPECTS_CACHE[**@cache_keys].aspects else FILE_ASPECTS_CACHE[aspects_file] end
end
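
The cache selection in the constructor can be sketched in isolation. This is a minimal illustration, not the library's code: `build_cache_keys` and `aspects_source` are hypothetical helper names. Note that, as listed above, the constructor appears to call `merge!` on `nil` when `params` has no `:cache_keys` entry; the sketch assumes an empty hash in that case and avoids mutating the caller's hash.

```ruby
# Hypothetical helper: combine caller-supplied cache keys with the document
# language. Falls back to an empty hash (an assumption, the listing above does
# not do this) and returns a new hash instead of mutating params.
def build_cache_keys(params, lang)
  (params[:cache_keys] || {}).merge(lang: lang)
end

# Hypothetical helper: a URL selects the remote cache, otherwise the file cache.
def aspects_source(url)
  url.nil? ? :file : :remote
end

keys = build_cache_keys({ cache_keys: { property_type: 'hotel' } }, 'en')
puts keys.inspect
puts aspects_source(nil)
puts aspects_source('http://example.com/aspects')
```

The `property_type` key and the example URL are placeholders chosen for the illustration only.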

Instance Attribute Details

#aspects ⇒ `Object`

Returns the value of attribute aspects.



# File 'lib/opener/property_tagger/processor.rb', line 9

def aspects
  @aspects
end

#aspects_path ⇒ `Object`

Returns the value of attribute aspects_path.



# File 'lib/opener/property_tagger/processor.rb', line 9

def aspects_path
  @aspects_path
end

#aspects_url ⇒ `Object`

Returns the value of attribute aspects_url.



# File 'lib/opener/property_tagger/processor.rb', line 9

def aspects_url
  @aspects_url
end

#document ⇒ `Object`

Returns the value of attribute document.



# File 'lib/opener/property_tagger/processor.rb', line 8

def document
  @document
end

#pretty ⇒ `Object`

Returns the value of attribute pretty.



# File 'lib/opener/property_tagger/processor.rb', line 10

def pretty
  @pretty
end

#timestamp ⇒ `Object`

Returns the value of attribute timestamp.



# File 'lib/opener/property_tagger/processor.rb', line 10

def timestamp
  @timestamp
end

Instance Method Details

#add_features_layer ⇒ `Object`

Remove the features layer from the KAF file if it exists and add a new one.

# File 'lib/opener/property_tagger/processor.rb', line 124

def add_features_layer
  existing = document.at_xpath('KAF/features')

  existing.remove if existing

  new_node('features', 'KAF')
end

#add_linguistic_processor ⇒ `Object`

# File 'lib/opener/property_tagger/processor.rb', line 160

def add_linguistic_processor
  description = 'VUA property tagger'
  last_edited = '16jan2015'
  version     = '2.0'

  node = new_node('linguisticProcessors', 'KAF/kafHeader')
  node['layer'] = 'features'

  lp_node = new_node('lp', node)

  lp_node['version'] = "#{last_edited}-#{version}"
  lp_node['name']    = description

  if timestamp
    format = '%Y-%m-%dT%H:%M:%S%Z'

    lp_node['timestamp'] = Time.now.strftime(format)
  else
    lp_node['timestamp'] = '*'
  end
end
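
The attribute values written to the lp node can be reproduced without an XML document. The sketch below is an illustration only: `lp_attributes` is a hypothetical helper, and the `now:` argument is added here so the result is reproducible. One thing it makes visible is that `%Z` emits a zone abbreviation such as `UTC` rather than a numeric offset.

```ruby
# Hypothetical helper mirroring the values set in add_linguistic_processor.
# `now` is injectable (an addition for this sketch) to keep the output stable.
def lp_attributes(timestamp: true, now: Time.now)
  attrs = {
    'name'    => 'VUA property tagger',
    'version' => '16jan2015-2.0'
  }

  # With timestamps disabled the listing writes a literal '*'.
  attrs['timestamp'] = timestamp ? now.strftime('%Y-%m-%dT%H:%M:%S%Z') : '*'
  attrs
end

puts lp_attributes(now: Time.utc(2015, 1, 16, 12, 0, 0)).inspect
puts lp_attributes(timestamp: false).inspect
```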

#add_properties_layer ⇒ `Object`

Add the properties layer as a child to the features layer.



# File 'lib/opener/property_tagger/processor.rb', line 134

def add_properties_layer
  new_node("properties", "KAF/features")
end

#add_property(key, value, index) ⇒ `Object`

# File 'lib/opener/property_tagger/processor.rb', line 138

def add_property(key, value, index)
  property_node = new_node("property", "KAF/features/properties")

  property_node['lemma'] = key.to_s
  property_node['pid']   = "p#{index.to_s}"

  references_node = new_node("references", property_node)

  value.uniq.each do |v|
    comm_node = Nokogiri::XML::Comment.new(references_node, " #{v.last} ")
    references_node.add_child comm_node

    span_node = new_node("span", references_node)

    v.first.each do |val|
      target_node       = new_node("target", span_node)

      target_node['id'] = val.to_s
    end
  end
end
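
The shape of the `value` argument is easier to see with plain data. In the listing, each entry of `value` is a pair whose first element is a list of term ids and whose last element is the matched n-gram. The sketch below mirrors that mapping with hashes instead of XML nodes; `property_outline` is a hypothetical name used only for this illustration.

```ruby
# Hypothetical helper: describe what add_property would emit, as plain data.
# Each [ids, ngram] pair becomes one comment plus one span of targets.
def property_outline(key, value, index)
  {
    lemma: key.to_s,
    pid:   "p#{index}",
    spans: value.uniq.map do |ids, ngram|
      { comment: " #{ngram} ", targets: ids.map(&:to_s) }
    end
  }
end

matches = [[[:t1], 'battery'], [[:t1, :t2], 'battery life']]
puts property_outline(:power, matches, 1).inspect
```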

#extract_aspects ⇒ `Hash`

Check which terms belong to an aspect (property). Text has priority over lemmas, overriding them if there is a conflict.

Returns:

(Hash)

# File 'lib/opener/property_tagger/processor.rb', line 85

def extract_aspects
  term_ids     = terms.keys
  lemmas       = terms.values
  uniq_aspects = Hash.new { |hash, key| hash[key] = [] }

  [:lemma, :text].each do |k|
    current_token = 0
    # Use of n-grams to determine if a unigram (1 lemma) or bigram (2
    # lemmas) belong to a property.
    max_ngram = 2


    while current_token < terms.count
      (0..max_ngram).each do |tam_ngram|
        if current_token + tam_ngram <= terms.count
          ngram = lemmas[current_token..current_token+tam_ngram].map{|a| a[k] }.join(" ").downcase

          if aspects[ngram.to_sym]
            properties = aspects[ngram.to_sym]
            ids        = term_ids[current_token..current_token+tam_ngram]

            properties.uniq.each do |property|
              next if !property or property.strip.empty?

              uniq_aspects[property.to_sym] << [ids,ngram] unless uniq_aspects[property.to_sym].include? [ids,ngram]
            end
          end
        end
      end
      current_token += 1
    end
  end

  Hash[uniq_aspects.sort]
end
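
The n-gram lookup can be tried on plain hashes, without a KAF document. The following is a simplified sketch under stated assumptions, not the library's implementation: `match_aspects` is a hypothetical name, and the terms and aspects are made-up sample data. It limits windows to one or two terms, as the inline comment describes; the listing above iterates offsets `0..max_ngram`, so it appears to also consider three-term windows.

```ruby
# Hypothetical stand-in for extract_aspects operating on plain hashes.
# terms:   { term_id => { lemma: String, text: String } }
# aspects: { ngram_symbol => [property, ...] }
def match_aspects(terms, aspects, max_ngram: 2)
  ids    = terms.keys
  values = terms.values
  found  = Hash.new { |hash, key| hash[key] = [] }

  [:lemma, :text].each do |field|
    values.each_index do |start|
      (1..max_ngram).each do |size|
        break if start + size > values.length

        ngram = values[start, size].map { |t| t[field] }.join(' ').downcase

        (aspects[ngram.to_sym] || []).uniq.each do |property|
          next if property.nil? || property.strip.empty?

          match = [ids[start, size], ngram]
          bucket = found[property.to_sym]
          bucket << match unless bucket.include?(match)
        end
      end
    end
  end

  found.sort.to_h
end

terms = {
  t1: { lemma: 'battery', text: 'Battery' },
  t2: { lemma: 'life',    text: 'life' },
  t3: { lemma: 'be',      text: 'is' }
}
aspects = {
  :'battery'      => ['power'],
  :'battery life' => ['power', 'durability']
}

puts match_aspects(terms, aspects).inspect
```

Each property maps to a list of `[term_ids, ngram]` pairs, which is the shape `add_property` receives as `value`.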

#language ⇒ `Object`



# File 'lib/opener/property_tagger/processor.rb', line 64

def language
  @language ||= document.at_xpath('KAF').attr('xml:lang')
end

#pretty_print(document) ⇒ `String`

Format the output document properly.

TODO: this should be handled by Oga in a nice way.

Returns:

(String)

# File 'lib/opener/property_tagger/processor.rb', line 189

def pretty_print(document)
  doc = REXML::Document.new document.to_xml
  doc.context[:attribute_quote] = :quote
  out = ""
  formatter = REXML::Formatters::Pretty.new
  formatter.compact = true
  formatter.write(doc, out)

  out.strip
end
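
The same REXML calls can be exercised on a small XML string to see what the formatting step does. This is a minimal sketch: `pretty_xml` is a hypothetical wrapper that takes a string rather than a parsed document, and the sample input is invented. It assumes the `rexml` gem is available, which is the case for standard Ruby installations.

```ruby
require 'rexml/document'
require 'rexml/formatters/pretty'

# Hypothetical wrapper around the calls used in pretty_print.
def pretty_xml(xml)
  doc = REXML::Document.new(xml)
  # Ask REXML to write attributes with double quotes.
  doc.context[:attribute_quote] = :quote

  out = String.new
  formatter = REXML::Formatters::Pretty.new
  formatter.compact = true
  formatter.write(doc, out)

  out.strip
end

compact = '<KAF xml:lang="en"><terms><term tid="t1" lemma="battery"/></terms></KAF>'
puts pretty_xml(compact)
```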

#process ⇒ `String`

Processes the input and returns the new KAF output.

Returns:

(String)

# File 'lib/opener/property_tagger/processor.rb', line 47

def process
  existing_aspects = extract_aspects

  add_features_layer
  add_properties_layer

  existing_aspects.each_with_index do |(key, value), index|
    index += 1

    add_property(key, value, index)
  end

  add_linguistic_processor

  pretty ? pretty_print(document) : document.to_xml
end
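
The property ids produced by `process` follow from the 1-based index over the sorted aspects. A small sketch of just that numbering, using a hypothetical `numbered_properties` helper and sample data:

```ruby
# Hypothetical helper: pair each aspect key with the pid that process
# would pass on to add_property ("p1", "p2", ...).
def numbered_properties(aspects)
  aspects.each_with_index.map do |(key, _matches), index|
    ["p#{index + 1}", key.to_s]
  end
end

puts numbered_properties({ durability: [], power: [] }).inspect
```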

#terms ⇒ `Object`

# File 'lib/opener/property_tagger/processor.rb', line 68

def terms
  unless @terms
    @terms = {}

    document.xpath('KAF/terms/term').each do |term|
      @terms[term.attr('tid').to_sym] = { lemma: term.attr('lemma'), text: term.attr('text')}
    end
  end

  @terms
end
