Searching for a ready-to-use app which serves generated feeds via HTTP? Head over to html2rss-web!
This Ruby gem builds RSS 2.0 feeds from a feed config.
The feed config contains the URL to scrape and the CSS selectors used to extract information (title, URL, ...); html2rss builds the RSS feed from it. Extractors and chainable post processors make extracting, processing and sanitizing that information a breeze. Scraping JSON responses and setting HTTP request headers are supported, too.
Installation
| 🤩 Like it? | Star it! ⭐️ |
|---|---|
| Add this line to your application's `Gemfile`: | `gem 'html2rss'` |
| Then execute: | `bundle` |
| In your code: | `require 'html2rss'` |

😍 Love it? Feel free to donate. Thank you! 💓
Building a feed config
Here's a minimal working example:
```ruby
require 'html2rss'

rss =
  Html2rss.feed(
    channel: { url: 'https://stackoverflow.com/questions' },
    selectors: {
      items: { selector: '#hot-network-questions > ul > li' },
      title: { selector: 'a' },
      link: { selector: 'a', extractor: 'href' }
    }
  )

puts rss
```
A feed config consists of a `channel` and a `selectors` Hash. The contents of both hashes are explained below.
Looks too complicated? See html2rss-configs for ready-made feed configs!
The `channel`

| attribute | required? | type | default | remark |
|---|---|---|---|---|
| `url` | required | String | | |
| `title` | optional | String | auto-generated | |
| `description` | optional | String | auto-generated | |
| `ttl` | optional | Integer | `360` | TTL in minutes |
| `time_zone` | optional | String | `'UTC'` | TimeZone name |
| `language` | optional | String | `'en'` | Language code |
| `author` | optional | String | | Format: `email (Name)` |
| `headers` | optional | Hash | `{}` | Set HTTP request headers. See notes below. |
| `json` | optional | Boolean | `false` | Handle JSON response. See notes below. |

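The defaults in the table can be pictured as a Hash that the given channel config is merged over. The following is only a sketch of that idea in plain Ruby: `CHANNEL_DEFAULTS` and `normalize_channel` are hypothetical names, not part of html2rss, and the auto-generated `title` and `description` are left out.

```ruby
# Hypothetical defaults, copied from the table above; not html2rss's own code.
CHANNEL_DEFAULTS = {
  ttl: 360, # minutes
  time_zone: 'UTC',
  language: 'en',
  headers: {},
  json: false
}.freeze

# Returns the channel hash with defaults filled in; raises when :url is missing.
def normalize_channel(channel)
  raise ArgumentError, 'channel[:url] is required' unless channel[:url].is_a?(String)

  CHANNEL_DEFAULTS.merge(channel)
end

channel = normalize_channel(url: 'https://example.com', ttl: 60)
puts channel[:ttl]      # 60
puts channel[:language] # en
```

Values given in the channel config win over the defaults because they are the argument to `merge`.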
The `selectors`
You must provide an `items` selector hash which contains the CSS selector. `items` needs to return a collection of HTML tags. The other selectors are scoped to the tags of the `items` collection.
To build a valid RSS 2.0 item, each item has to have at least a `title` or a `description`.
Your `selectors` can contain arbitrary selector names, but only these will make it into the RSS feed:

| RSS 2.0 tag | name in html2rss | remark |
|---|---|---|
| `title` | `title` | |
| `description` | `description` | Supports HTML. |
| `link` | `link` | A URL. |
| `author` | `author` | |
| `category` | `categories` | See notes below. |
| `enclosure` | `enclosure` | See notes below. |
| `pubDate` | `update` | An instance of `Time`. |
| `guid` | `guid` | Generated from the `title`. |
| `comments` | `comments` | A URL. |
| `source` | ~~source~~ | Not yet supported. |

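To make the two rules above concrete (only known names reach the feed, and an item needs a title or a description), here is a minimal plain-Ruby sketch. `RSS_ITEM_NAMES`, `feed_attributes` and `valid_item?` are hypothetical helpers written for illustration; html2rss's internals may differ.

```ruby
# Selector names that map to RSS 2.0 tags, per the table above.
RSS_ITEM_NAMES = %i[title description link author categories enclosure update guid comments].freeze

# Keeps only the extracted values whose names map to an RSS tag (hypothetical helper).
def feed_attributes(extracted)
  extracted.slice(*RSS_ITEM_NAMES)
end

# An item needs at least a non-empty title or description.
def valid_item?(extracted)
  %i[title description].any? { |name| !extracted[name].to_s.strip.empty? }
end

extracted = { title: 'Headline', price: '9.99', link: 'https://example.com/a' }
puts feed_attributes(extracted).keys.inspect # [:title, :link]
puts valid_item?(extracted)                  # true
```

An arbitrary name such as `price` is still useful: other selectors can refer to it, for example from a template or from `categories`.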
The `selector` hash
Your selector hash can have these attributes:

| name | value |
|---|---|
| `selector` | The CSS selector to select the tag with the information. |
| `extractor` | Name of the extractor. See notes below. |
| `post_process` | A hash or array of hashes. See notes below. |

Reverse ordering of items
The `items` selector hash can have an `order` attribute. If the value is `reverse`, the order of items in the RSS will be reversed.
See a YAML feed config example
```yml
channel:
  # ... omitted
selectors:
  items:
    selector: 'ul > li'
    order: 'reverse'
  # ... omitted
```
Using extractors
Extractors help with extracting the information from the selected HTML tag.
- The default extractor is `text`, which returns the tag's inner text.
- The `html` extractor returns the tag's outer HTML.
- The `href` extractor returns a URL from the tag's `href` attribute and corrects relative ones to absolute ones.
- The `attribute` extractor returns the value of that tag's attribute.
- The `static` extractor returns the configured static value (it doesn't extract anything).
- See the file list of extractors.
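The relative-to-absolute correction that the `href` extractor performs can be illustrated with Ruby's standard library. This is a sketch of the idea only: `absolute_url` is a hypothetical helper, and using the channel URL as the base is an assumption here.

```ruby
require 'uri'

# Resolves an href against the channel URL (illustrative helper, not html2rss's code).
def absolute_url(href, channel_url)
  URI.join(channel_url, href).to_s
end

puts absolute_url('/questions/42', 'https://stackoverflow.com/questions')
# https://stackoverflow.com/questions/42
puts absolute_url('https://example.com/a', 'https://stackoverflow.com/questions')
# https://example.com/a
```

An href that is already absolute passes through unchanged, so the same step is safe for both kinds of links.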
Extractors can require additional attributes on the selector hash.
👉 Read their docs for usage examples.
See a Ruby example
```ruby
Html2rss.feed(
  channel: {},
  selectors: {
    link: { selector: 'a', extractor: 'href' }
  }
)
```
See a YAML feed config example
```yml
channel:
  # ... omitted
selectors:
  # ... omitted
  link:
    selector: 'a'
    extractor: 'href'
```
Using post processors
Extracted information can be further manipulated with post processors.

| name | description |
|---|---|
| `gsub` | Allows global substitution operations on Strings (Regexp or simple pattern). |
| `html_to_markdown` | Converts HTML to Markdown, using reverse_markdown. |
| `markdown_to_html` | Converts Markdown to HTML, using kramdown. |
| `parse_time` | Parses a String containing a time in a time zone. |
| `parse_uri` | Parses a String as URL. |
| `sanitize_html` | Strips unsafe and unneeded HTML and adds security-related attributes. |
| `substring` | Cuts a part off of a String, starting at a position. |
| `template` | Based on a template, it creates a new String filled with other selectors' values. |

⚠️ Always make use of the `sanitize_html` post processor for HTML content. Never trust the internet! ⚠️
👉 Read their docs for usage examples.
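As a rough orientation, `gsub` and `substring` correspond to familiar String operations. The sketch below shows those operations in plain Ruby; the option names the post processors actually accept are documented in their docs and are not reproduced here, and `drop_prefix` and `cut` are hypothetical helpers.

```ruby
# Global substitution with a Regexp, as a gsub-style step would do it.
def drop_prefix(text)
  text.gsub(/\ABreaking:\s*/, '')
end

# Cuts out `length` characters, starting at position `start`.
def cut(text, start, length)
  text[start, length]
end

headline = drop_prefix('Breaking: Ruby 3 released!')
puts headline            # Ruby 3 released!
puts cut(headline, 0, 6) # Ruby 3
```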
See a Ruby example
```ruby
Html2rss.feed(
  channel: {},
  selectors: {
    description: {
      selector: '.content',
      post_process: { name: 'sanitize_html' }
    }
  }
)
```
See a YAML feed config example
```yml
channel:
  # ... omitted
selectors:
  # ... omitted
  description:
    selector: '.content'
    post_process:
      - name: sanitize_html
```
Chaining post processors
Pass an array to `post_process` to chain the post processors.
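Conceptually, a chain feeds the output of one step into the next, in array order. A minimal sketch of that idea, using lambdas as stand-ins for real post processors (`STEPS` and `run_chain` are hypothetical names, not html2rss API):

```ruby
# Each step takes a String and returns a String (stand-ins for real post processors).
STEPS = [
  ->(value) { value.strip },            # trim surrounding whitespace
  ->(value) { value.gsub(/\s+/, ' ') }, # collapse runs of whitespace
  ->(value) { value[0, 14] }            # keep the first 14 characters
].freeze

# Feeds the output of one step into the next, in array order.
def run_chain(value, steps = STEPS)
  steps.reduce(value) { |current, step| step.call(current) }
end

puts run_chain("  Hello,\n   chained   world!  ") # Hello, chained
```

Order matters: truncating before collapsing the whitespace would give a different result.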
YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML
```yml
channel:
  # ... omitted
selectors:
  # ... omitted
  price:
    selector: '.price'
  description:
    selector: '.section'
    post_process:
      - name: template
        string: |
          # %self

          Price: %price
      - name: markdown_to_html
```
Note the use of `|` for a multi-line String in YAML.
Adding `<category>` tags to an item
The `categories` selector takes an array of selector names. Each value of those selectors will become a `<category>` on the RSS item.
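In other words, `categories` is a list of references to other selectors. A small sketch of the lookup in plain Ruby, with sample data; `categories_for` is a hypothetical helper, and skipping empty values is an assumption rather than documented behaviour:

```ruby
# Collects the values of the referenced selectors, skipping empty ones (assumption).
def categories_for(extracted, names)
  names.map { |name| extracted[name] }.reject { |value| value.to_s.strip.empty? }
end

# Values already extracted for one item, keyed by selector name (sample data).
extracted = { title: 'Some film', genre: 'Drama', branch: 'Cinema' }

puts categories_for(extracted, %i[genre branch]).inspect # ["Drama", "Cinema"]
```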
See a Ruby example
```ruby
Html2rss.feed(
  channel: {},
  selectors: {
    genre: {
      # ... omitted
      selector: '.genre'
    },
    branch: { selector: '.branch' },
    categories: %i[genre branch]
  }
)
```
See a YAML feed config example
```yml
channel:
  # ... omitted
selectors:
  # ... omitted
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
```
Adding an `<enclosure>` tag to an item
An enclosure can be any file, e.g. an image, audio or video.
The `enclosure` selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.
Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:
- The content-type is guessed from the file extension of the URL.
- If the content-type guessing fails, it will default to `application/octet-stream`.
- The content-length will always be undetermined and thus stated as `0` bytes.
Read the RSS 2.0 spec for further information on enclosing content.
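The trade-offs listed above can be sketched in a few lines of plain Ruby. The extension table `CONTENT_TYPES` and the helper `enclosure_attributes` are hypothetical and deliberately tiny; they show the behaviour described here, not html2rss's actual implementation.

```ruby
require 'uri'

# Small, hypothetical extension table; a real implementation would use a fuller MIME database.
CONTENT_TYPES = {
  '.jpg' => 'image/jpeg',
  '.png' => 'image/png',
  '.mp3' => 'audio/mpeg',
  '.mp4' => 'video/mp4'
}.freeze

# Builds the three attributes of an RSS <enclosure> from an extracted URL.
def enclosure_attributes(url, channel_url)
  absolute = URI.join(channel_url, url) # relative URLs use the channel URL as base
  type = CONTENT_TYPES.fetch(File.extname(absolute.path).downcase, 'application/octet-stream')

  { url: absolute.to_s, type: type, length: 0 } # the length is never determined
end

puts enclosure_attributes('/media/episode.mp3', 'https://example.com/podcast')[:type]
# audio/mpeg
puts enclosure_attributes('/download?id=1', 'https://example.com/')[:type]
# application/octet-stream
```

A URL without a recognisable file extension, like the second one, falls back to the generic content type.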
See a Ruby example
```ruby
Html2rss.feed(
  channel: {},
  selectors: {
    enclosure: { selector: 'img', extractor: 'attribute', attribute: 'src' }
  }
)
```
See a YAML feed config example
```yml
channel:
  # ... omitted
selectors:
  # ... omitted
  enclosure:
    selector: "img"
    extractor: "attribute"
    attribute: "src"
```
Scraping and handling JSON responses
Although this gem is called html2rss, it's possible to scrape and process JSON.
Adding `json: true` to the channel config will convert the JSON response to XML.
See a Ruby example
```ruby
Html2rss.feed(
  channel: { url: 'https://example.com', json: true },
  selectors: {} # ... omitted
)
```
See a YAML feed config example
```yaml
channel:
  url: https://example.com
  json: true
selectors:
  # ... omitted
```
See example of a converted JSON object
This JSON object is converted to XML:
```json
{
  "data": [{ "title": "Headline", "url": "https://example.com" }]
}
```
See example of a converted JSON array
This JSON array is converted to XML:
```json
[{ "title": "Headline", "url": "https://example.com" }]
```
Set any HTTP header in the request
You can add any HTTP header to the request sent to the channel URL. Use this, for example, to send Cookie or Authorization information or to spoof the User-Agent.
See a Ruby example
```ruby
Html2rss.feed(
  channel: {
    url: 'https://example.com',
    headers: {
      "User-Agent": "html2rss-request",
      "X-Something": "Foobar",
      "Authorization": "Token deadbea7",
      "Cookie": "monster=MeWantCookie"
    }
  },
  selectors: {}
)
```
See a YAML feed config example
```yaml
channel:
  url: https://example.com
  headers:
    "User-Agent": "html2rss-request"
    "X-Something": "Foobar"
    "Authorization": "Token deadbea7"
    "Cookie": "monster=MeWantCookie"
selectors:
  # ...
```
The headers provided by the channel are merged into the global headers.
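The merge can be pictured with `Hash#merge`. This is a sketch under one assumption that the text above does not spell out: when both sides set the same header, the channel's value wins. `GLOBAL_HEADERS` and `request_headers` are hypothetical names.

```ruby
GLOBAL_HEADERS = { 'User-Agent' => 'Mozilla/5.0', 'Accept' => 'text/html' }.freeze

# Merges channel headers into the global ones; on a conflict the channel value wins
# (an assumption about precedence, not taken from html2rss's source).
def request_headers(global, channel)
  global.merge(channel)
end

headers = request_headers(GLOBAL_HEADERS, 'User-Agent' => 'html2rss-request', 'Cookie' => 'monster=MeWantCookie')
puts headers['User-Agent'] # html2rss-request
puts headers['Accept']     # text/html
puts headers['Cookie']     # monster=MeWantCookie
```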
Usage with a YAML config file
This step is not required to work with this gem. If you're using html2rss-web and want to create your private feed configs, keep on reading!
First, create your YAML file, e.g. called `feeds.yml`. This file will contain your global config and feed configs.
Example:
```yml
headers:
  'User-Agent': "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
feeds:
  myfeed:
    channel:
    selectors:
  myotherfeed:
    channel:
    selectors:
```
Your feed configs go below `feeds`. Everything else is part of the global config.
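To illustrate that split, the following sketch parses such a file with Ruby's standard `yaml` library and separates the global config from one named feed. The sample content, its selectors and the helper `split_config` are made up for the example; this is not how html2rss itself loads the file.

```ruby
require 'yaml'

# Sample content in the shape described above (URL and selectors are made up).
FEEDS_YML = <<~YAML
  headers:
    'User-Agent': 'html2rss-request'
  feeds:
    myfeed:
      channel:
        url: https://example.com
      selectors:
        items:
          selector: 'ul > li'
        title:
          selector: 'a'
YAML

# Splits the parsed file into the global config and one named feed config.
def split_config(yaml, feed_name)
  config = YAML.safe_load(yaml)
  feeds = config.fetch('feeds')
  global = config.reject { |key, _| key == 'feeds' } # everything except `feeds`

  [global, feeds.fetch(feed_name)] # raises KeyError for an unknown feed name
end

global, feed = split_config(FEEDS_YML, 'myfeed')
puts global.keys.inspect    # ["headers"]
puts feed['channel']['url'] # https://example.com
```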
Build your feeds like this:
```ruby
require 'html2rss'

myfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myfeed')
myotherfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myotherfeed')
```
Find a full example of a `feeds.yml` at `spec/config.test.yml`.
Gotchas and tips & tricks
- Check that the channel URL does not redirect to a mobile page with a different markup structure.
- Do not rely on your web browser's developer console. html2rss does not execute JavaScript.
- Fiddling with `curl` and `pup` to find the selectors seems efficient (`curl URL | pup`).
- CSS selectors are quite versatile; here's an overview.
Development
After checking out the repository, run `bin/setup` to install dependencies. Then, run `bundle exec rspec` to run the tests.
You can also run `bin/console` for an interactive prompt that will allow you to experiment.
Releasing a new version
1. `git pull`
2. increase version in `lib/html2rss/version.rb`
3. `bundle`
4. `git add Gemfile.lock lib/html2rss/version.rb`
5. `VERSION=$(ruby -e 'require "./lib/html2rss/version.rb"; puts Html2rss::VERSION')`
6. `git commit -m "chore: release $VERSION"`
7. `git tag v$VERSION`
8. [`standard-changelog -f`](https://github.com/conventional-changelog/conventional-changelog/tree/master/packages/standard-changelog)
9. `git add CHANGELOG.md && git commit --amend`
10. `git tag v$VERSION -f`
11. `git push && git push --tags`

Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/gildesmarais/html2rss.