XmlDataExtractor
This gem provides a DSL for extracting formatted data from any XML structure.
Installation
Add this line to your application's Gemfile:
gem 'xml_data_extractor'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install xml_data_extractor
Usage
The general idea is to declare a Ruby Hash that represents the field structure, with instructions for how each piece of data should be retrieved from the XML document.
structure = { schemas: { character: { path: "xml/FirstName" } } }
xml = "<xml><FirstName>Gandalf</FirstName></xml>"
result = XmlDataExtractor.new(structure).parse(xml)
# result -> { character: "Gandalf" }
For convenience, you can write the structure in YAML, which can be easily converted to a Ruby hash using YAML.load(yml).deep_symbolize_keys (deep_symbolize_keys is provided by ActiveSupport).
Considering the following yaml and xml:
schemas:
description:
path: xml/desc
modifier: downcase
amount:
path: xml/info/price
modifier: to_f
<xml>
<desc>HELLO WORLD</desc>
<info>
<price>123</price>
</info>
</xml>
The output is:
{
description: "hello world",
amount: 123.0
}
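When ActiveSupport is not available, the YAML-to-hash conversion can be sketched with the standard library alone. This assumes a Psych version that supports the symbolize_names option; load_structure is a hypothetical helper name, not part of the gem.

```ruby
require "yaml"

# Hypothetical helper (not part of the gem): parse a YAML structure into a
# Hash with symbol keys, similar in effect to
# YAML.load(yml).deep_symbolize_keys.
def load_structure(yml)
  YAML.safe_load(yml, symbolize_names: true)
end

yml = <<~STRUCTURE
  schemas:
    description:
      path: xml/desc
      modifier: downcase
    amount:
      path: xml/info/price
      modifier: to_f
STRUCTURE

structure = load_structure(yml)
puts structure[:schemas][:amount][:path]
```

The resulting hash can then be handed to XmlDataExtractor.new in place of a hand-written one.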
Defining the structure
The structure should be defined as a hash inside the schemas key. See the complete example.
When defining the structure you can combine any available command in order to extract and format the data as needed.
The available commands are divided into two general purposes:
Navigation & Extraction:
The data extraction process is based on XPath, using Nokogiri.
path
Defines the XPath of the element.
The path is the default command of a field definition, so this:
schemas:
description:
path: xml/desc
Is equivalent to this:
schemas:
description: xml/desc
It can be defined as a string:
schemas:
description:
path: xml/some_field
<xml>
<some_field>ABC</some_field>
</xml>
{ description: "ABC" }
Or as a string array:
schemas:
address:
path: [street, info/city]
<xml>
<street>Diagon Alley</street>
<info>
<city>London</city>
</info>
</xml>
{ address: ["Diagon Alley", "London"] }
And even as a hash array, for complex operations:
schemas:
address:
path:
- path: street
modifier: downcase
- path: info/city
modifier: upcase
{ address: ["diagon alley", "LONDON"] }
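One way to picture how the three forms relate is to expand each of them into the same shape. The sketch below is only an illustration under that assumption; normalize_paths is a hypothetical helper and does not reflect how the gem handles these forms internally.

```ruby
# Hypothetical helper (not part of the gem): expand any of the three accepted
# `path` forms into a list of hashes that each carry a :path key.
def normalize_paths(definition)
  case definition
  when String then [{ path: definition }]
  when Array  then definition.flat_map { |item| normalize_paths(item) }
  when Hash   then [definition]
  else raise ArgumentError, "unsupported path definition: #{definition.inspect}"
  end
end

p normalize_paths("xml/some_field")
p normalize_paths(["street", "info/city"])
p normalize_paths([{ path: "street", modifier: "downcase" }])
```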
attr
Defines the tag attribute whose value should be extracted, instead of the tag's own content:
schemas:
description:
path: xml/info
attr: desc
<xml>
<info desc="ABC">some stuff</info>
</xml>
{ description: "ABC" }
Like the path, it can also be defined as a string array.
within
To define a root path for the fields:
schemas:
movie:
within: info/movie_data
title: original_title
actor: main_actor
<xml>
<info>
<movie_data>
<original_title>The Irishman</original_title>
<main_actor>Robert De Niro</main_actor>
</movie_data>
</info>
</xml>
{ movie: { title: "The Irishman", actor: "Robert De Niro" } }
unescape
This option is useful when you have embedded XML or HTML inside a tag, as in CDATA elements, and you need to unescape it first in order to parse its content:
schemas:
movie:
unescape: response
title: response/original_title
actor: response/main_actor
<xml>
<response>
&lt;original_title&gt;The Irishman&lt;/original_title&gt;&lt;main_actor&gt;Robert De Niro&lt;/main_actor&gt;
</response>
</xml>
This XML will be turned into this one during the parsing:
<xml>
<response>
<original_title>The Irishman</original_title>
<main_actor>Robert De Niro</main_actor>
</response>
</xml>
{ movie: { title: "The Irishman", actor: "Robert De Niro" } }
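The unescaping step on its own can be sketched with the standard library. This assumes the embedded markup is escaped with HTML entities; unescape_fragment is a hypothetical name, and the gem may perform this step differently.

```ruby
require "cgi/escape"

# Hypothetical helper (not part of the gem): turn entity-escaped markup back
# into real tags so that it can be parsed as XML afterwards.
def unescape_fragment(text)
  CGI.unescapeHTML(text)
end

escaped = "&lt;original_title&gt;The Irishman&lt;/original_title&gt;"
puts unescape_fragment(escaped)
```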
array_of
Defines the path to an XML collection, which will be iterated to generate an array of hashes:
schemas:
people:
array_of: characters/character
name: firstname
age: age
<xml>
<characters>
<character>
<firstname>Geralt</firstname>
<age>97</age>
</character>
<character>
<firstname>Yennefer</firstname>
<age>102</age>
</character>
</characters>
</xml>
{
people: [
{ name: "Geralt", age: "97" },
{ name: "Yennefer", age: "102" }
]
}
If you need to loop through nested collections, you can define an array of paths:
schemas:
show:
within: show_data
title: description
people:
array_of: [characters/character, info]
name: name
<xml>
<show_data>
<description>Peaky Blinders</description>
<characters>
<character>
<info>
<name>Tommy Shelby</name>
</info>
</character>
<character>
<info>
<name>Arthur Shelby</name>
</info>
<info>
<name>Alfie Solomons</name>
</info>
</character>
</characters>
</show_data>
</xml>
{
show: {
title: "Peaky Blinders",
people: [
{ name: "Tommy Shelby" },
{ name: "Arthur Shelby" },
{ name: "Alfie Solomons" }
]
}
}
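The flattening of nested collections can be pictured with plain Ruby data. In this sketch nested hashes and arrays stand in for the XML nodes, and expand_levels is a hypothetical helper; it illustrates the idea of walking one collection per level, not the gem's implementation.

```ruby
# Hypothetical helper (not part of the gem): descend one collection level per
# key, flattening the results into a single list as it goes.
def expand_levels(nodes, keys)
  keys.reduce(nodes) { |current, key| current.flat_map { |node| node.fetch(key) } }
end

# Plain-Ruby stand-in for the <character>/<info> nodes of the XML above.
characters = [
  { info: [{ name: "Tommy Shelby" }] },
  { info: [{ name: "Arthur Shelby" }, { name: "Alfie Solomons" }] }
]

people = expand_levels(characters, [:info]).map { |info| { name: info[:name] } }
p people
```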
link
This command is useful when the XML contains references to other nodes; it works like a SQL JOIN. The path must be an expression containing the <link> identifier, which will be replaced by the value fetched from the link: command.
Example:
schemas:
bookings:
array_of: booking
date: booking_date
document: id
products:
array_of:
accomodation:
path: ../hotel[booking_id=<link>]/accomodation
link: id
<xml>
<booking>
<id>1</id>
<booking_date>2020-01-01</booking_date>
</booking>
<booking>
<id>2</id>
<booking_date>2020-01-02</booking_date>
</booking>
<hotel>
<booking_id>1</booking_id>
<accomodation>Standard</accomodation>
</hotel>
<hotel>
<booking_id>2</booking_id>
<accomodation>Premium</accomodation>
</hotel>
</xml>
{
bookings: [
{
date: "2020-01-01",
document: "1",
products: [
{ accomodation: "Standard" }
]
},
{
date: "2020-01-02",
document: "2",
products: [
{ accomodation: "Premium" }
]
}
]
}
In this example, without link to select only the hotel of each booking, both hotels would match every booking, and the accomodation field would be extracted as an array with all the accommodations instead of a single string.
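The placeholder substitution can be sketched as a simple string replacement. This is an assumption about the mechanism made for illustration only; resolve_link is a hypothetical helper, not the gem's API.

```ruby
# Hypothetical helper (not part of the gem): build the final XPath by
# replacing the <link> placeholder with the value fetched for the current node.
def resolve_link(path, link_value)
  path.gsub("<link>", link_value.to_s)
end

# One XPath per booking id, each matching only the hotel of that booking.
%w[1 2].each do |booking_id|
  puts resolve_link("../hotel[booking_id=<link>]/accomodation", booking_id)
end
```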
You can combine link with array_of to search for a list of elements filtered by some field; just provide the path and the link:
schemas:
bookings:
array_of: booking
date: date
document: id
products:
array_of:
path: ../products[booking_id=<link>]
link: id
....
uniq_by
Can only be used with array_of.
This is useful when some XML nodes are duplicated and you want to extract data from the first occurrence only. It behaves similarly to Ruby's Array#uniq.
For each path generated from array_of, the value fetched using uniq_by is checked against the collection built so far, and the path is discarded if that value already exists.
schemas:
bookings:
array_of:
path: booking
uniq_by: id
date: bdate
document: id
<xml>
<booking>
<id>1</id>
<bdate>2020-01-01</bdate>
</booking>
<booking>
<id>1</id>
<bdate>2020-01-01</bdate>
</booking>
</xml>
{
bookings: [
{
date: "2020-01-01",
document: "1"
}
]
}
In this example, without uniq_by, two elements with the same data would be extracted:
{
bookings: [
{
date: "2020-01-01",
document: "1"
},
{
date: "2020-01-01",
document: "1"
}
]
}
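For comparison, the plain-Ruby analogue of this behavior is Array#uniq with a block. The sketch below operates on already extracted hashes rather than on XML nodes, and uniq_by_key is a hypothetical helper name.

```ruby
# Hypothetical helper (not part of the gem): keep only the first item for
# each distinct value of the given key.
def uniq_by_key(items, key)
  items.uniq { |item| item[key] }
end

bookings = [
  { date: "2020-01-01", document: "1" },
  { date: "2020-01-01", document: "1" }
]

p uniq_by_key(bookings, :document)
```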
array_presence: first_only
A field with this property is added only to the first item of the array.
It can only be used in fields that belong to an array_of node.
passengers:
array_of: bookings/booking/passengers/passenger
id:
path: document
modifier: to_s
name:
attr: [FirstName, LastName]
modifier:
- name: join
params: [" "]
tax_rav:
array_presence: first_only
path: ../rav
modifier: to_f
<bookings>
<booking>
<rav>150</rav>
<passengers>
<passenger>
<document>109.111.019-79</document>
<FirstName>Marcelo</FirstName>
<LastName>Lauxen</LastName>
</passenger>
<passenger>
<document>110.155.019-78</document>
<FirstName>Corona</FirstName>
<LastName>Virus</LastName>
</passenger>
</passengers>
</booking>
</bookings>
{
bookings: [
{
passengers: [
{
id: "109.111.019-79",
name: "Marcelo Lauxen",
tax_rav: 150.00
},
{
id: "110.155.019-78",
name: "Corona Virus"
}
]
}
]
}
In this example the field tax_rav was only included on the first passenger because it has the array_presence: first_only property.
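The effect of first_only can be sketched on an already extracted array. apply_first_only is a hypothetical helper used only to illustrate the outcome; the gem may decide this while extracting rather than afterwards.

```ruby
# Hypothetical helper (not part of the gem): keep `field` on the first item
# and drop it from every following item.
def apply_first_only(items, field)
  items.each_with_index.map do |item, index|
    index.zero? ? item : item.reject { |key, _value| key == field }
  end
end

passengers = [
  { id: "109.111.019-79", name: "Marcelo Lauxen", tax_rav: 150.0 },
  { id: "110.155.019-78", name: "Corona Virus", tax_rav: 150.0 }
]

p apply_first_only(passengers, :tax_rav)
```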
in_parent
This option allows you to navigate to a parent node of the current node.
passengers:
array_of: bookings/booking/passengers/passenger
id:
path: document
modifier: to_s
bookings_id:
in_parent: bookings
path: bookings_id
<bookings>
<bookings_id>8888</bookings_id>
<booking>
<passengers>
<passenger>
<document>109.111.019-79</document>
</passenger>
<passenger>
<document>110.155.019-78</document>
</passenger>
</passengers>
</booking>
</bookings>
{
bookings: [
{
passengers: [
{
id: "109.111.019-79",
bookings_id: 8888
},
{
id: "110.155.019-78",
bookings_id: 8888
}
]
}
]
}
In this example the value of bookings_id is extracted starting at the node named in in_parent instead of the current node. It is also possible to navigate to a parent node with ../ (XPath provides this), but with in_parent you only need to provide the name of the parent node: the extractor navigates up until that node is found, no matter how many levels away it is.
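The upward search can be sketched with a tiny tree of hashes standing in for XML nodes. find_ancestor is a hypothetical helper written for illustration; the gem works on Nokogiri nodes, which this sketch does not use.

```ruby
# Hypothetical helper (not part of the gem): climb from a node until an
# ancestor with the given name is found; returns nil when there is none.
def find_ancestor(node, name)
  current = node[:parent]
  current = current[:parent] while current && current[:name] != name
  current
end

# Stand-in for <bookings><booking><passengers><passenger>.
bookings   = { name: "bookings", parent: nil }
booking    = { name: "booking", parent: bookings }
passengers = { name: "passengers", parent: booking }
passenger  = { name: "passenger", parent: passengers }

puts find_ancestor(passenger, "bookings")[:name]
```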
keep_if
This option keeps the block in the final result only if the condition matches.
schemas:
dummy:
within: data
description: additional_desc
exchange: currency_info/value
price: price
payment:
type: payment_info/method
value: payment_info/price
keep_if: "'type' == 'invoice'"
<data>
<additional_desc>Keep walking</additional_desc>
<currency_info kind="USD">
<value>4.15</value>
</currency_info>
<price>55.09</price>
<payment_info>
<method>card</method>
<price>55.48</price>
<payment>
<installments>2</installments>
<card_number>333</card_number>
</payment>
</payment_info>
</data>
{
dummy: {
description: "Keep walking",
exchange: "4.15",
price: "55.09"
}
}
In this example the condition didn't match, since the payment method was card instead of invoice, so the extracted payment hash was removed from the final result.
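The check can be sketched for conditions shaped exactly like the one above. This evaluator is hypothetical and only understands a single == comparison between a quoted field name and a quoted literal; how the gem actually evaluates keep_if is not covered here.

```ruby
# Hypothetical evaluator (not the gem's implementation) for conditions of the
# form "'field' == 'literal'": look up the field in the extracted block and
# compare its value with the literal.
def keep?(block, condition)
  match = condition.match(/\A'(?<field>[^']+)'\s*==\s*'(?<expected>[^']*)'\z/)
  raise ArgumentError, "unsupported condition: #{condition}" unless match

  block[match[:field].to_sym].to_s == match[:expected]
end

payment = { type: "card", value: "55.48" }
result = { description: "Keep walking", exchange: "4.15", price: "55.09" }
result[:payment] = payment if keep?(payment, "'type' == 'invoice'")
p result
```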
Formatting:
fixed
Defines a fixed value for the field:
currency:
fixed: BRL
{ currency: "BRL" }
mapper
Uses a hash of predefined values to replace the extracted value with its respective option.
If the extracted value is not found in any of the mapper options, it is replaced by the default value; if no default is defined, the extracted value is returned unchanged.
mappers:
currencies:
default: unknown
options:
BRL: R$
USD: [US$, $]
schemas:
money:
array_of: curr_type/type
path: symbol
mapper: currencies
<xml>
<curr_type>
<type>
<symbol>US$</symbol>
</type>
<type>
<symbol>R$</symbol>
</type>
<type>
<symbol>RB</symbol>
</type>
<type>
<symbol>$</symbol>
</type>
</curr_type>
</xml>
{
money: ["USD", "BRL", "unknown", "USD"]
}
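The lookup rules described above can be sketched in plain Ruby. map_value is a hypothetical helper that mirrors the documented behavior (match against the options, fall back to default, otherwise keep the value); it is not the gem's code.

```ruby
# Hypothetical helper (not part of the gem): return the option key whose
# accepted values include `value`, else the default, else the value itself.
def map_value(value, mapper)
  options = mapper.fetch(:options, {})
  found = options.find { |_key, accepted| Array(accepted).include?(value) }
  return found.first.to_s if found

  mapper.fetch(:default, value)
end

currencies = { default: "unknown", options: { BRL: 'R$', USD: ['US$', '$'] } }
p ['US$', 'R$', 'RB', '$'].map { |symbol| map_value(symbol, currencies) }
```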
modifier
Defines a method to be called on the returned value.
schemas:
name:
path: some_field
modifier: upcase
<xml>
<some_field>Lewandovski</some_field>
</xml>
{ name: "LEWANDOVSKI" }
You can also pass parameters to the method. In this case, declare the modifier as an array whose entries are hashes with the name and params keys; plain method names can be mixed in:
schemas:
name:
path: [firstname, lastname]
modifier:
- name: join
params: [" "]
- downcase
<xml>
<firstname>Robert</firstname>
<lastname>Martin</lastname>
</xml>
{ name: "robert martin" }
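Applying a chain of modifiers can be sketched with public_send. apply_modifiers is a hypothetical helper that assumes each entry is either a method name or a hash with name and params, as in the YAML above; it is an illustration rather than the gem's implementation.

```ruby
# Hypothetical helper (not part of the gem): call each modifier in order on
# the running value, passing params when the modifier is given as a hash.
def apply_modifiers(value, modifiers)
  list = modifiers.is_a?(Array) ? modifiers : [modifiers]
  list.reduce(value) do |current, modifier|
    if modifier.is_a?(Hash)
      current.public_send(modifier.fetch(:name), *modifier.fetch(:params, []))
    else
      current.public_send(modifier)
    end
  end
end

modifiers = [{ name: "join", params: [" "] }, "downcase"]
puts apply_modifiers(["Robert", "Martin"], modifiers)
```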
If you need custom methods, you can pass an object that implements them when initializing the extractor. Each custom method receives the extracted value as its parameter:
schemas:
price:
path: final_price
modifier: format_as_float
<xml>
<final_price>R$ 12.99</final_price>
</xml>
class MyMethods
def format_as_float(value)
value.gsub(/[^\d.]/, "").to_f
end
end
XmlDataExtractor.new(yml, MyMethods.new).parse(xml)
{ price: 12.99 }
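One plausible way to choose between a custom method and a built-in one is to check the helper object first. The sketch below assumes that order; call_modifier and CurrencyHelpers are hypothetical names, and the gem's actual dispatch rules may differ.

```ruby
# Hypothetical helper object with one custom method.
class CurrencyHelpers
  def strip_currency(value)
    value.gsub(/[^\d.]/, "").to_f
  end
end

# Hypothetical dispatcher (not part of the gem): prefer a method defined on
# the helper object, otherwise call the method on the value itself.
def call_modifier(value, name, helpers = nil)
  if helpers&.respond_to?(name)
    helpers.public_send(name, value)
  else
    value.public_send(name)
  end
end

p call_modifier('R$ 12.99', :strip_currency, CurrencyHelpers.new)
p call_modifier("lewandovski", :upcase)
```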