Class: Ariel::Node::Structure

Inherits:
Ariel::Node
Defined in:
lib/ariel/node/structure.rb

Overview

Implements a Node object that represents the structure of the document tree. Each node stores the start and end rules used to extract the desired content from its parent node, so it can be viewed as a rule-storing object.

Instance Attribute Summary

Attributes inherited from Ariel::Node

#children, #node_name, #parent

Instance Method Summary

Methods inherited from Ariel::Node

#add_child, #each_descendant, #inspect

Constructor Details

#initialize(name = :root, type = :not_list) {|_self| ... } ⇒ Structure

Returns a new instance of Structure.

Yields:

  • (_self)

Yield Parameters:

  • _self (Ariel::Node::Structure): the object that the method was called on



# File 'lib/ariel/node/structure.rb', line 11

def initialize(name=:root, type=:not_list, &block)
  super(name)
  @node_type=type
  yield self if block_given?
end
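The yield-self pattern in the constructor is what makes the block-based definition style possible. The following is a minimal sketch of that pattern, not the library's code: SketchNode and SketchStructure are hypothetical stand-ins for Ariel::Node and Ariel::Node::Structure, and the hash-based children store and the return value of add_child are assumptions made for illustration.

```ruby
# Hypothetical stand-in for Ariel::Node (assumed: children kept in a hash
# keyed by node name, add_child returns the added node).
class SketchNode
  attr_reader :node_name, :children
  attr_accessor :parent

  def initialize(name = :root)
    @node_name = name
    @children = {}
  end

  def add_child(node)
    @children[node.node_name] = node
    node.parent = self
    node
  end
end

# Mirrors the constructor shown above: store the type, then yield self
# so the caller can declare children inside the block.
class SketchStructure < SketchNode
  attr_reader :node_type

  def initialize(name = :root, type = :not_list)
    super(name)
    @node_type = type
    yield self if block_given?
  end

  def item(name, &block)
    add_child(SketchStructure.new(name, &block))
  end
  alias list item

  def list_item(name, &block)
    add_child(SketchStructure.new(name, :list_item, &block))
  end
end

structure = SketchStructure.new do |r|
  r.item :title
  r.list :comments do |c|
    c.list_item :comment
  end
end

p structure.children.keys                                       # => [:title, :comments]
p structure.children[:comments].children[:comment].node_type    # => :list_item
```

Because each nested call forwards its block to a new constructor, arbitrarily deep structures can be declared in one expression.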

Instance Attribute Details

#node_type ⇒ Object

Returns the value of attribute node_type.



# File 'lib/ariel/node/structure.rb', line 9

def node_type
  @node_type
end

#ruleset ⇒ Object

Returns the value of attribute ruleset.



# File 'lib/ariel/node/structure.rb', line 9

def ruleset
  @ruleset
end

Instance Method Details

#apply_extraction_tree_on(root_node, extract_labels = false) ⇒ Object

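Applies the extraction rules stored in the current Node::Structure and all of its descendants to root_node. Nodes are processed breadth-first, and each newly extracted node is queued so that its own children are extracted in turn. If extract_labels is true, regions are extracted with LabelUtils.extract_labeled_region instead of the learnt rules. Returns root_node.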



# File 'lib/ariel/node/structure.rb', line 49

def apply_extraction_tree_on(root_node, extract_labels=false)
  extraction_queue = [root_node]
  until extraction_queue.empty? do
    new_parent = extraction_queue.shift
    new_parent.structure_node.children.values.each do |child|
      if extract_labels
        extractions=LabelUtils.extract_labeled_region(child, new_parent)
      else
        extractions=child.extract_from(new_parent)
      end
      extractions.each {|extracted_node| extraction_queue.push extracted_node}
    end
  end
  return root_node
end
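The queue-based traversal above visits nodes level by level. As a rough illustration of that ordering only, the sketch below replaces real rule application with a trivial stand-in: SketchRule, SketchExtracted and apply_tree are hypothetical names, and each "extraction" always produces exactly one child.

```ruby
# Hypothetical stand-ins: an extracted node remembers the rule that made it.
SketchExtracted = Struct.new(:name, :structure_node, :children)

class SketchRule
  attr_reader :name, :children

  def initialize(name, children = [])
    @name = name
    @children = children
  end

  # Pretend extraction: always yields a single node and attaches it
  # to the parent, returning the extractions as an array.
  def extract_from(parent)
    node = SketchExtracted.new(name, self, [])
    parent.children << node
    [node]
  end
end

# Same queue discipline as the method above; records the visiting order.
def apply_tree(root_node)
  visited = []
  queue = [root_node]
  until queue.empty?
    parent = queue.shift
    visited << parent.name
    parent.structure_node.children.each do |child_rule|
      child_rule.extract_from(parent).each { |extracted| queue.push(extracted) }
    end
  end
  visited
end

comment = SketchRule.new(:comment, [SketchRule.new(:author), SketchRule.new(:body)])
root_rule = SketchRule.new(:root, [comment, SketchRule.new(:title)])
root = SketchExtracted.new(:root, root_rule, [])

order = apply_tree(root)
p order # => [:root, :comment, :title, :author, :body]
```

Note that :title is visited before :author and :body even though :comment was declared first; a depth-first walk would have descended into :comment's children before reaching :title.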

#extend_structure {|_self| ... } ⇒ Object

Extends an already created Node by yielding it to the given block, e.g.

node.extend_structure do |r|
  r.item :new_field1
  r.item :new_field2
end

Yields:

  • (_self)

Yield Parameters:

  • _self (Ariel::Node::Structure): the object that the method was called on



# File 'lib/ariel/node/structure.rb', line 22

def extend_structure(&block)
  yield self if block_given?
end

#extract_from(node) ⇒ Object

Given a Node to apply its rules to, this method creates a new extracted node for each match and adds it as a child of the given node. It returns an array of the extracted nodes, which is empty if no ruleset has been learnt.



# File 'lib/ariel/node/structure.rb', line 29

def extract_from(node)
  extractions=[]
  i=0
  return extractions if @ruleset.nil? #no extractions if no rule has been learnt
  @ruleset.apply_to(node.tokenstream) do |newstream|
    if self.node_type==:list_item
      new_node_name=i
      i+=1
    else
      new_node_name=@node_name
    end
    extracted_node = Node::Extracted.new(new_node_name, newstream, self)
    node.add_child extracted_node
    extractions << extracted_node
  end
  return extractions
end
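The naming rule inside the block above is easy to miss: list items are named by a running integer index, while every other node keeps the structure node's own name. A small sketch of just that rule, using a hypothetical helper extracted_names (not part of Ariel) that takes the number of matches as a plain argument instead of applying a real ruleset:

```ruby
# Hypothetical helper: returns the names that would be given to
# match_count extracted nodes, following the branching shown above.
def extracted_names(node_type, node_name, match_count)
  i = 0
  Array.new(match_count) do
    if node_type == :list_item
      name = i
      i += 1
      name
    else
      node_name
    end
  end
end

p extracted_names(:list_item, :comment, 3) # => [0, 1, 2]
p extracted_names(:not_list, :title, 1)    # => [:title]
```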

#item(name, &block) ⇒ Object

Also known as: list

Use #item to define any object that occurs once within its parent. #list is a synonym, recommended when defining a container for list_items. The children of a list_item are just items, e.g.

structure = Ariel::Node::Structure.new do |r|
  r.list :comments do |c|  # r.item :comments would be equivalent, but less readable
    c.list_item :comment do |c|
      c.item :author  # Now these are just normal items, as they are extracted once from their parent
      c.item :date
      c.item :body
    end
  end
end



# File 'lib/ariel/node/structure.rb', line 77

def item(name, &block)
  self.add_child(Node::Structure.new(name, &block))
end

#list_item(name, &block) ⇒ Object

See the docs for #item for a discussion of when to use #item and when to use #list_item.



# File 'lib/ariel/node/structure.rb', line 86

def list_item(name, &block)
  self.add_child(Node::Structure.new(name, :list_item, &block))
end