XmlDataExtractor

This gem provides a DSL for extracting formatted data from any XML structure.

Installation

Add this line to your application's Gemfile:

gem 'xml_data_extractor'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install xml_data_extractor

Usage

The general idea is to declare a Ruby Hash that represents the field structure, with instructions for how each piece of data should be retrieved from the XML document.

structure = { schemas: { character: { path: "xml/FirstName" } } }
xml = "<xml><FirstName>Gandalf</FirstName></xml>"

result = XmlDataExtractor.new(structure).parse(xml)

# result -> { character: "Gandalf" }

For convenience, you can write the structure in YAML, which can be easily converted to a Ruby hash using YAML.load(yml).deep_symbolize_keys.
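Note that deep_symbolize_keys comes from ActiveSupport. If that dependency is not available, a minimal sketch using only Ruby's bundled YAML library could look like the following. The variable names are illustrative, and the YAML is the same sample used in this section.

```ruby
require "yaml"

yml = <<~YAML
  schemas:
    description:
      path: xml/desc
      modifier: downcase
    amount:
      path: xml/info/price
      modifier: to_f
YAML

# symbolize_names turns every mapping key into a Symbol while parsing,
# so no extra key-conversion step is needed afterwards.
structure = YAML.safe_load(yml, symbolize_names: true)

p structure[:schemas][:amount]
```

The resulting hash has the same shape as the hand-written structure in the first example, with modifier names kept as plain strings.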

Considering the following yaml and xml:

schemas:
  description:
    path: xml/desc
    modifier: downcase
  amount:  
    path: xml/info/price
    modifier: to_f

<xml>
  <desc>HELLO WORLD</desc>
  <info>
    <price>123</price>
  </info>  
</xml>

The output is:

{
  description: "hello world",
  amount: 123.0
}

Defining the structure

The structure should be defined as a hash inside the schemas key. See the complete example.

When defining the structure you can combine any available command in order to extract and format the data as needed.

The available commands fall into two general purposes:

Navigation & Extraction
Formatting

The data extraction process is based on XPath using Nokogiri.

path

Defines the XPath of the element. The path is the default command of a field definition, so this:

  schemas:
    description: 
      path: xml/desc

Is equivalent to this:

  schemas:
    description: xml/desc

It can be defined as a string:

schemas:
  description:
    path: xml/some_field

<xml>
  <some_field>ABC</some_field>
</xml>

{ description: "ABC" }

Or as a string array:

schemas:
  address:
    path: [street, info/city]

<xml>
  <street>Diagon Alley</street>
  <info>
    <city>London</city>
  </info>  
</xml>

{ address: ["Diagon Alley", "London"] }

And even as a hash array, for complex operations:

schemas:
  address:
    path:
      - path: street
        modifier: downcase
      - path: info/city
        modifier: upcase

{ address: ["diagon alley", "LONDON"] }

attr

Defines the tag attribute from which the value should be extracted, instead of the tag value itself:

schemas:
  description:
    path: xml/info 
    attr: desc

<xml>
  <info desc="ABC">some stuff</info>
</xml>

{ description: "ABC" }

Like the path, it can also be defined as a string array.

within

To define a root path for the fields:

schemas:  
  movie:
    within: info/movie_data
    title: original_title
    actor: main_actor

<xml>
  <info>
    <movie_data>
      <original_title>The Irishman</original_title>
      <main_actor>Robert De Niro</main_actor>
    </movie_data>
  </info>
</xml>

{ movie: { title: "The Irishman", actor: "Robert De Niro" } }

unescape

This option is useful when you have embedded XML or HTML inside some tag, like CDATA elements, and you need to unescape it first in order to parse its content:

schemas:  
  movie:
    unescape: response
    title: response/original_title
    actor: response/main_actor

<xml>
  <response>
    &lt;original_title&gt;The Irishman&lt;/original_title&gt;&lt;main_actor&gt;Robert De Niro&lt;/main_actor&gt;
  </response>
</xml>

This XML will be turned into this one during the parsing:

<xml>
  <response>
    <original_title>The Irishman</original_title>
    <main_actor>Robert De Niro</main_actor>
  </response>
</xml>

{ movie: { title: "The Irishman", actor: "Robert De Niro" } }
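To make the unescaping step concrete, here is a small sketch using CGI.unescapeHTML from Ruby's standard library. It only illustrates what unescaping does to an escaped fragment; whether the gem uses this exact call internally is an assumption, not something this document states.

```ruby
require "cgi"

# An escaped fragment, as it might appear inside the <response> tag.
escaped = "&lt;original_title&gt;The Irishman&lt;/original_title&gt;"

# Entities such as &lt; and &gt; become real angle brackets again,
# so the fragment can be parsed as XML afterwards.
unescaped = CGI.unescapeHTML(escaped)

puts unescaped
```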

array_of

Defines the path to an XML collection, which will be looped over to generate an array of hashes:

schemas:
  people:
    array_of: characters/character
    name: firstname
    age: age

<xml>
  <characters>
    <character>
      <firstname>Geralt</firstname>
      <age>97</age>
    </character>
    <character>
      <firstname>Yennefer</firstname>
      <age>102</age>
    </character>
  </characters>
</xml>

{
  people: [
    { name: "Geralt", age: "97" },
    { name: "Yennefer", age: "102" }
  ]
}

If you need to loop through nested collections, you can define an array of paths:

schemas:
  show:    
    within: show_data
    title: description
    people:
      array_of: [characters/character, info]
      name: name

<xml>
  <show_data>
    <description>Peaky Blinders</description>
    <characters>
      <character>
        <info>
          <name>Tommy Shelby</name>          
        </info>
      </character>
      <character>
        <info>
          <name>Arthur Shelby</name>          
        </info>
        <info>
          <name>Alfie Solomons</name>
        </info>
      </character>
    </characters>
  </show_data>
</xml>

{
  show: {
    title: "Peaky Blinders",
    people: [
      { name: "Tommy Shelby" },
      { name: "Arthur Shelby" },
      { name: "Alfie Solomons" }      
    ]
  }  
}

link

This command is useful when the XML contains references to other nodes; it works like a SQL JOIN. The path must be an expression containing the <link> identifier, which will be replaced by the value fetched from the link: command.

Example:

schemas:
  bookings:
    array_of: booking
    date: booking_date
    document: id
    products:
      array_of:
      accomodation:
        path: ../hotel[booking_id=<link>]/accomodation
        link: id

<xml>
  <booking>
    <id>1</id>
    <booking_date>2020-01-01</booking_date>
  </booking>
  <booking>
    <id>2</id>
    <booking_date>2020-01-02</booking_date>
  </booking>
  <hotel>
    <booking_id>1</booking_id>
    <accomodation>Standard</accomodation>
  </hotel>
  <hotel>
    <booking_id>2</booking_id>
    <accomodation>Premium</accomodation>
  </hotel>
</xml>

{
  bookings: [
    {
      date: "2020-01-01",
      document: "1",
      products: [
        { accomodation: "Standard" }
      ]
    },
    {
      date: "2020-01-02",
      document: "2",
      products: [
        { accomodation: "Premium" }
      ]
    }
  ]
}

In this example, without the link to select only the hotel of each booking, both accommodations would be returned for every booking, and an array with all the accommodations would be extracted instead of a single string.
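The placeholder replacement can be pictured as a plain string substitution performed once per booking. The sketch below is only an illustration: resolve_link is a hypothetical helper, not part of the gem's API, and the real implementation may differ.

```ruby
# Hypothetical helper: fills the <link> placeholder with the value
# fetched for the current item (here, the booking id).
def resolve_link(path_template, link_value)
  path_template.gsub("<link>", link_value.to_s)
end

template = "../hotel[booking_id=<link>]/accomodation"

# One resolved XPath per booking id found in the sample XML.
paths = %w[1 2].map { |id| resolve_link(template, id) }

puts paths
```

Each resolved path matches a single hotel node, which is why one string is extracted per booking.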

You can combine the link with array_of if you want to search for a list of elements filtered by some field. Just provide the path and the link:

schemas:
  bookings:
    array_of: booking
    date: date
    document: id
    products:
      array_of:
        path: ../products[booking_id=<link>]
        link: id
      ....

uniq_by

Can only be used with array_of.

This functionality is useful when some XML nodes are duplicated and you want to extract data from the first occurrence only. It behaves much like Ruby's uniq method on arrays. For each path generated from array_of, the value fetched using uniq_by is checked against the collection generated so far, and the path is discarded if the value already exists.

schemas:
  bookings:
    array_of:
      path: booking
      uniq_by: id
    date: bdate
    document: id

<xml>
  <booking>
    <id>1</id>
    <bdate>2020-01-01</bdate>
  </booking>
  <booking>
    <id>1</id>
    <bdate>2020-01-01</bdate>
  </booking>
</xml>

{
  bookings: [
    {
      date: "2020-01-01",
      document: "1"
    }
  ]
}

In this example, without uniq_by, two elements with the same data would be extracted:

{
  bookings: [
    {
      date: "2020-01-01",
      document: "1"
    },
    {
      date: "2020-01-01",
      document: "1"
    }
  ]
}
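Since the behavior is compared to Ruby's uniq, the same idea can be shown on plain hashes. This is a sketch of the semantics only, using hand-written data shaped like the sample XML; it does not touch the gem.

```ruby
# Two extracted items that share the same id, as in the XML above.
bookings = [
  { id: "1", date: "2020-01-01" },
  { id: "1", date: "2020-01-01" }
]

# uniq with a block keeps the first item for each distinct block value,
# which mirrors keeping only the first occurrence of each id.
unique = bookings.uniq { |booking| booking[:id] }

p unique.size
```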

array_presence: first_only

A field with this property will only be added to the first item of the array.

Can only be used in fields that belong to a node of array_of.

passengers:
  array_of: bookings/booking/passengers/passenger
  id:
    path: document
    modifier: to_s
  name:
    path: [FirstName, LastName]
    modifier:
      - name: join
        params: [" "]
  tax_rav:
    array_presence: first_only
    path: ../rav
    modifier: to_f

<bookings>
  <booking>
    <rav>150</rav>
    <passengers>
      <passenger>
        <document>109.111.019-79</document>
        <FirstName>Marcelo</FirstName>
        <LastName>Lauxen</LastName>
      </passenger>
      <passenger>
        <document>110.155.019-78</document>
        <FirstName>Corona</FirstName>
        <LastName>Virus</LastName>
      </passenger>
    </passengers>
  </booking>
</bookings>

{
  bookings: [
    {
      passengers: [
        { 
          id: "109.111.019-79",
          name: "Marcelo Lauxen",
          tax_rav: 150.00 
        },
        { 
          id: "110.155.019-78",
          name: "Corona Virus"
        }
      ]
    }
  ]
}

In this example the field tax_rav was only included on the first passenger because this field has the array_presence: first_only property.
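The effect can be sketched on already extracted data. The helper name apply_first_only is hypothetical and the gem may implement this differently; the sketch just shows the described outcome.

```ruby
# Hypothetical helper: keeps the listed keys on the first item only
# and drops them from every following item.
def apply_first_only(items, first_only_keys)
  items.each_with_index.map do |item, index|
    next item if index.zero?

    item.reject { |key, _value| first_only_keys.include?(key) }
  end
end

passengers = [
  { id: "109.111.019-79", name: "Marcelo Lauxen", tax_rav: 150.0 },
  { id: "110.155.019-78", name: "Corona Virus", tax_rav: 150.0 }
]

result = apply_first_only(passengers, [:tax_rav])

p result
```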

in_parent

This option allows you to navigate to a parent node of the current node.

passengers:
  array_of: bookings/booking/passengers/passenger
  id:
    path: document
    modifier: to_s
  bookings_id:
    in_parent: bookings
    path: bookings_id

<bookings>
  <bookings_id>8888</bookings_id>
  <booking>
    <passengers>
      <passenger>
        <document>109.111.019-79</document>
      </passenger>
      <passenger>
        <document>110.155.019-78</document>
      </passenger>
    </passengers>
  </booking>
</bookings>

{
  bookings: [
    {
      passengers: [
        { 
          id: "109.111.019-79",
          bookings_id: "8888"
        },
        {
          id: "110.155.019-78",
          bookings_id: "8888"
        }
      ]
    }
  ]
}

In this example the value of bookings_id is extracted starting at the node given in in_parent instead of the current node. You can also navigate to a parent node with ../ (XPath provides this), but with in_parent you only need to give the name of the parent node: it navigates up until that node is found, no matter how many levels away it is.
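The upward walk can be sketched with REXML, which ships with Ruby, standing in for Nokogiri. The find_ancestor helper is hypothetical and only illustrates the idea of climbing until a node with the given name is found.

```ruby
require "rexml/document"

# Hypothetical helper: climbs from `node` towards the root and returns
# the first ancestor element called `name`, or nil if there is none.
def find_ancestor(node, name)
  current = node.parent
  current = current.parent while current.is_a?(REXML::Element) && current.name != name
  current.is_a?(REXML::Element) ? current : nil
end

xml = <<~XML
  <bookings>
    <bookings_id>8888</bookings_id>
    <booking>
      <passengers>
        <passenger><document>109.111.019-79</document></passenger>
      </passengers>
    </booking>
  </bookings>
XML

doc = REXML::Document.new(xml)
passenger = REXML::XPath.first(doc, "//passenger")

# Three levels up, without having to write "../../..".
ancestor = find_ancestor(passenger, "bookings")
value = ancestor.elements["bookings_id"].text

puts value
```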

keep_if

This option keeps a block of the hash in the final result only if the condition matches.

schemas:
  dummy:
    within: data
    description: additional_desc
    exchange: currency_info/value
    price: price
    payment:
      type: payment_info/method
      value: payment_info/price
      keep_if: "'type' == 'invoice'"

<data>
  <additional_desc>Keep walking</additional_desc>
  <currency_info kind="USD">
    <value>4.15</value>
  </currency_info>
  <price>55.09</price>
  <payment_info>
    <method>card</method>
    <price>55.48</price>
    <payment>
      <installments>2</installments>
      <card_number>333</card_number>
    </payment>
  </payment_info>
</data>

{
  dummy: {
    description: "Keep walking",
    exchange: "4.15",
    price: "55.09"
  }
}

In this example the condition did not match, since the payment method was card rather than invoice, so the extracted payment hash was removed from the final result.
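The outcome can be reproduced on plain hashes. How the gem actually evaluates the condition string is not described here, so the comparison below is a hand-written stand-in for "'type' == 'invoice'", given as an assumption.

```ruby
# The data as it would look before keep_if is applied.
result = {
  description: "Keep walking",
  exchange: "4.15",
  price: "55.09",
  payment: { type: "card", value: "55.48" }
}

# Stand-in for the condition: keep the block only for invoices.
keep = result[:payment][:type] == "invoice"

# The whole payment block is dropped when the condition is false.
result.delete(:payment) unless keep

p result.keys
```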

Formatting:

fixed

Defines a fixed value for the field:

  currency:
    fixed: BRL

  { currency: "BRL" }

mapper

Uses a hash of predefined values to replace the extracted value with its respective option. If the extracted value is not found in any of the mapper options, it is replaced by the default value; if no default is defined, the extracted value is returned unchanged.

mappers:
  currencies:
    default: unknown
    options:      
      BRL: R$
      USD: [US$, $]
schemas:
  money:    
    array_of: curr_type/type
    path: symbol
    mapper: currencies

  <xml>
    <curr_type>
      <type>
        <symbol>US$</symbol>
      </type>
      <type>
        <symbol>R$</symbol>
      </type>
      <type>
        <symbol>RB</symbol>
      </type>      
      <type>
        <symbol>$</symbol>
      </type>      
    </curr_type>  
  </xml>

  {
    money: ["USD", "BRL", "unknown", "USD"]
  }
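The lookup rules above can be sketched in a few lines of plain Ruby. The map_value helper is hypothetical; it mirrors the described behavior (match, then default, then unchanged value) rather than the gem's actual code.

```ruby
# Hypothetical helper: returns the option name whose accepted values
# include `value`, else the default, else the value itself.
def map_value(value, mapper)
  match = mapper[:options].find { |_name, accepted| Array(accepted).include?(value) }
  return match.first.to_s if match

  mapper.fetch(:default, value)
end

currencies = {
  default: "unknown",
  options: { BRL: "R$", USD: ["US$", "$"] }
}

mapped = ["US$", "R$", "RB", "$"].map { |symbol| map_value(symbol, currencies) }

p mapped
```

Array(accepted) lets a single value and a list of values be handled the same way, which matches the mixed style used in the options above.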

modifier

Defines a method to be called on the returned value.

schemas:
  name:
    path: some_field
    modifier: upcase

<xml>
  <some_field>Lewandovski</some_field>
</xml>

{ name: "LEWANDOVSKI" }

You can also pass parameters to the method. In this case you will have to declare the modifier as an array of hashes, with the name and params keys:

schemas:
  name:
    path: [firstname, lastname]
    modifier: 
      - name: join
        params: [" "]
      - downcase

<xml>
  <firstname>Robert</firstname>
  <lastname>Martin</lastname>
</xml>

{ name: "robert martin" }
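A chain of modifiers like this one amounts to calling each method on the result of the previous one. The sketch below shows that idea with public_send; apply_modifiers is a hypothetical helper and not the gem's own code.

```ruby
# Hypothetical helper: applies each modifier in order. A modifier is
# either a method name or a hash with :name and optional :params.
def apply_modifiers(value, modifiers)
  [modifiers].flatten(1).reduce(value) do |current, modifier|
    if modifier.is_a?(Hash)
      current.public_send(modifier[:name], *Array(modifier[:params]))
    else
      current.public_send(modifier)
    end
  end
end

modifiers = [{ name: "join", params: [" "] }, "downcase"]

# The path array yields ["Robert", "Martin"] before formatting.
name = apply_modifiers(%w[Robert Martin], modifiers)

puts name
```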

If you need to use custom methods, you can pass an object containing the methods in the initialization. The custom method will receive the value as parameter:

schemas:
  price:
    path: final_price
    modifier: format_as_float

<xml>
  <final_price>R$ 12.99</final_price>  
</xml>

class MyMethods 
  def format_as_float(value)
    value.gsub(/[^\d.]/, "").to_f    
  end
end

XmlDataExtractor.new(yml, MyMethods.new).parse(xml)

{ price: 12.99 }
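One way to picture the dispatch is: use the helper object when it defines the method, otherwise call the method on the value itself. That lookup order is an assumption made for this sketch, and call_modifier is a hypothetical name.

```ruby
class MyMethods
  # Strips everything except digits and dots, then converts to Float.
  def format_as_float(value)
    value.gsub(/[^\d.]/, "").to_f
  end
end

# Hypothetical dispatcher: custom helper methods receive the value as
# an argument; anything else is called on the value directly.
def call_modifier(name, value, helpers = nil)
  if helpers && helpers.respond_to?(name)
    helpers.public_send(name, value)
  else
    value.public_send(name)
  end
end

price = call_modifier("format_as_float", "R$ 12.99", MyMethods.new)
shout = call_modifier("upcase", "abc")

p price
p shout
```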