Feedstock
Feedstock is a Ruby library for extracting information from a webpage and converting it into an Atom feed.
Rationale
Feeds are great. But sometimes a website doesn't provide a feed or doesn't provide a feed for the specific content that you want. That's where Feedstock can help.
Feedstock is a Ruby library that you can use to create an Atom feed. It takes a URL to the webpage to check and a hash of rules. The rules tell Feedstock how to extract and transform the data it finds on the webpage.
Example
The feeds.inqk.net repository includes an example of how the Feedstock library can be used in practice.
Installation
Feedstock is available as a gem:
$ gem install feedstock
Usage
Feedstock extracts information from a given document using a collection of rules.
A collection of rules is expressed as a hash. The hash has two mandatory keys and one optional key.
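As a rough sketch of the overall shape, a collection of rules might look like the following. The URL, selectors and literal values are hypothetical and chosen only for illustration. The commented-out call at the end is an assumption about the library's entry point, not something this README confirms.

```ruby
# A hypothetical collection of rules. The selectors assume a page whose
# posts are marked up as <article class="post"> elements.
rules = {
  # Mandatory: describes the feed itself.
  info: {
    id: { literal: "https://example.org/blog" },
    title: "head title"
  },
  # Mandatory: describes each entry in the feed.
  entry: {
    title: "h2",
    summary: "p.summary"
  },
  # Optional: the root node for the paths in the entry hash.
  entries: {
    path: "article.post"
  }
}

# Assumption: the gem exposes a method along these lines that takes the URL
# and the rules. Check the gem's own documentation for the exact name.
# feed = Feedstock.feed("https://example.org/blog", rules)
```

Each of the three keys is described in the sections below.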
Info
The :info key is mandatory. It must be associated with a hash. In this
README, this hash is referred to as the info hash.
Keys
The keys in the info hash should be symbols, not strings. When used with the
default template, Feedstock will use the key as the name of the XML entity in
the resulting feed. For example, if the key is :id, the XML entity in the
resulting feed will be <id>.
Values
The value associated with each key in the info hash can be either a string or a hash.
String
If the value is a string, this defines a path to a node in the document. The path is expressed using CSS's selector syntax. Although a CSS selector can match more than one node, when used in the info hash, a path will only match the first matching node in the document.
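For instance, an info hash that uses only string values could be sketched like this. The selectors are hypothetical and assume a page with a `<title>` element and a tagline in the header.

```ruby
# Each value is a CSS selector. In the info hash, only the first node
# that matches the selector is used.
info = {
  title: "head title",
  subtitle: "header p.tagline"
}
```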
Hash
If the value is a hash, this is a data hash. A data hash defines the rules that Feedstock uses to extract data. It must contain one of two keys:
:literal: The value associated with this key is used for the content of the XML entity. This can be useful for elements that are not on the page or that don't change.
:path: The path to the node in the document expressed in CSS's selector syntax. As noted above, if the value of a key in the info hash is a string, this is treated as a path. The reason to use a data hash with a :path key is to combine the path with one or more of the keys below. In the info hash, a path matches only the first matching node in the document.
The following keys may also be defined in a data hash:
:content: The default is nil. The :content key can be set to "inner_html" or a hash of the form {attribute: "<attribute>"}. If the value is "inner_html", Feedstock will extract the content of the node as HTML. If the value is an attribute hash, Feedstock will extract the value of that attribute. This is important for links, where the link itself is typically the content of the href attribute rather than the content of the <a> element. For all other values, the plaintext content of the node is extracted.
:processor: The default is nil. The :processor key can be set to a lambda function that takes two arguments. The first is the extracted content, the second is the rule being processed. The content extracted by Feedstock for the given path is passed through the processor.
:prefix: The default is nil. If a prefix is provided, the string value of the prefix is prepended to the content extracted.
:suffix: The default is nil. If a suffix is provided, the string value of the suffix is appended to the end of the content extracted.
:type: The default is nil. A user may specify "datetime" or "cdata". If the value is "datetime", the content is parsed by the [Timeliness library][Timeliness] to return a string. If the value is "cdata", the content is wrapped in <![CDATA[ and ]]> tags.
[Timeliness]: https://github.com/adzap/timeliness "The official repository for the Timeliness library"
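The keys described above can be combined in an info hash roughly as follows. This is a sketch only: the selectors and values are hypothetical, and the processor assumes the extracted content arrives as a string.

```ruby
info = {
  # A literal is used as the content without consulting the page.
  id: { literal: "https://example.org/blog" },
  # Extract the value of the href attribute rather than the node's text.
  link: { path: "link[rel=canonical]", content: { attribute: "href" } },
  # Add a fixed string before the extracted text.
  title: { path: "h1", prefix: "Example Blog: " },
  # Pass the extracted content through a two-argument lambda.
  subtitle: {
    path: "header p.tagline",
    processor: ->(content, _rule) { content.strip }
  },
  # Parse the extracted text as a date and time.
  updated: { path: "time.updated", type: "datetime" }
}
```

A plain string such as `"h1"` is treated as a path, so it is only worth writing `{ path: "h1" }` when one of the other keys is needed.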
Formatting Order
The order for formatting content is: extract, process, wrap.
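To make the order concrete, here is a small standalone sketch. The `format_content` helper is hypothetical and is not part of Feedstock; it starts from an already-extracted string and assumes that wrapping means adding the prefix and suffix.

```ruby
# Hypothetical helper, not Feedstock's implementation: mimics the documented
# order (extract, process, wrap) on content that has already been extracted.
def format_content(extracted, rule)
  # Process: run the optional processor on the extracted content.
  processed = rule[:processor] ? rule[:processor].call(extracted, rule) : extracted
  # Wrap: add the optional prefix and suffix around the processed content.
  "#{rule[:prefix]}#{processed}#{rule[:suffix]}"
end

rule = {
  path: "span.price",
  processor: ->(content, _rule) { content.strip },
  prefix: "Price: ",
  suffix: " USD"
}

result = format_content("  42 \n", rule)
# result is "Price: 42 USD"
```

Because the processor runs before wrapping in this sketch, it sees the raw extracted text rather than the prefixed and suffixed version.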
Entry
The :entry key is mandatory. It must be associated with a hash. In this
README, this hash is referred to as the entry hash.
Keys
The keys in the entry hash should be symbols, not strings. When used with the
default template, Feedstock will use the key as the name of the XML entity in
the resulting feed. For example, if the key is :id, the XML entity in the
resulting feed will be <id>.
Values
The value associated with each key in the entry hash can be either a string or a hash.
String
If the value is a string, this defines a path to a node in the document. The path is expressed using CSS's selector syntax. Unlike with the info hash, the CSS selector will match all matching nodes, not only the first.
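A minimal entry hash using only string values might look like this. The selectors are hypothetical and assume each post sits in an `<article class="post">` element.

```ruby
# Each selector matches every corresponding node on the page, so a page
# with ten posts would presumably yield ten titles and ten summaries.
entry = {
  title: "article.post h2",
  summary: "article.post p.summary"
}
```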
Hash
If the value is a hash, this is a data hash. A data hash defines the rules that Feedstock uses to extract data. It must contain one of two keys:
:literal: The value associated with this key is used for the content of the XML entity. This can be useful for elements that are not on the page or that don't change.
:path: The path to the node in the document expressed in CSS's selector syntax. Unlike with the info hash, the CSS selector will match all nodes.
The following keys may also be defined in a data hash:
:content: The default is nil. The :content key can be set to "inner_html" or a hash of the form {attribute: "<attribute>"}. If the value is "inner_html", Feedstock will extract the content of the node as HTML. If the value is an attribute hash, Feedstock will extract the value of that attribute. This is important for links, where the link itself is typically the content of the href attribute rather than the content of the <a> element. For all other values, the plaintext content of the node is extracted.
:repeat: The default is nil. If :repeat is set to true, Feedstock will use the content provided by either :literal or :path repeatedly. Since :literal implies :repeat, it is not necessary to specify :repeat expressly alongside a literal.
:processor: The default is nil. The :processor key can be set to a lambda function that takes two arguments. The first is the extracted content, the second is the rule being processed. The content extracted by Feedstock for the given path is passed through the processor.
:prefix: The default is nil. If a prefix is provided, the string value of the prefix is prepended to the content extracted.
:suffix: The default is nil. If a suffix is provided, the string value of the suffix is appended to the end of the content extracted.
:type: The default is nil. A user may specify "datetime" or "cdata". If the value is "datetime", the content is parsed by the [Timeliness library][Timeliness] to return a string. If the value is "cdata", the content is wrapped in <![CDATA[ and ]]> tags.
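Putting these keys together, an entry hash could be sketched as follows. The selectors, the example.org prefix and the choice of elements are all hypothetical.

```ruby
entry = {
  # Build an absolute URL from a relative href found in each post.
  id: {
    path: "article.post h2 a",
    content: { attribute: "href" },
    prefix: "https://example.org"
  },
  # A plain string is treated as a path.
  title: "article.post h2",
  # Keep the post body as HTML and wrap it in a CDATA section.
  content: { path: "article.post div.body", content: "inner_html", type: "cdata" },
  # Read the machine-readable date from the datetime attribute.
  updated: {
    path: "article.post time",
    content: { attribute: "datetime" },
    type: "datetime"
  },
  # One node on the page supplies the same content to every entry.
  rights: { path: "footer p.copyright", repeat: true }
}
```

The `:rights` rule shows the intent of `:repeat`: the footer appears once on the page, yet its content is reused for each entry.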
Entries
The :entries key is optional. It can be associated with a hash. In this
README, this hash is referred to as the entries hash.
The entries hash is offered as a convenience. It allows a user to simplify the paths used in the entry hash by omitting a reference to the node containing the entries.
If an entries hash is provided, it must contain the following key:
:path: The path to the node in the document expressed in CSS's selector syntax. This path is used as the root for the paths in the entry hash.
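The convenience is easiest to see side by side. In this sketch the selectors are hypothetical and the mandatory info hash is left out for brevity. The last two lines only illustrate the idea of a root path and are not Feedstock's code.

```ruby
# Without an entries hash, every path repeats the containing node.
without_entries = {
  entry: {
    title: "article.post h2",
    summary: "article.post p.summary"
  }
}

# With an entries hash, the containing node is named once as the root.
with_entries = {
  entries: { path: "article.post" },
  entry: {
    title: "h2",
    summary: "p.summary"
  }
}

# Joining the root and each shorter path with a space (the CSS descendant
# combinator) gives back the longer selectors used above.
root = with_entries[:entries][:path]
expanded = with_entries[:entry].transform_values { |path| "#{root} #{path}" }
```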
Bugs
Found a bug? I'd love to know about it. The best way is to report it in the Issues section on GitHub.
Versioning
Feedstock uses Semantic Versioning 2.0.0.
Licence
Feedstock is released into the public domain. See LICENSE.md for more details.