uchardet
DESCRIPTION:
Fast character set encoding detection using the International Components for Unicode (ICU) C++ library.
SYNOPSIS:
require 'open-uri'
require 'uchardet'
encoding = ICU::UCharsetDetector.detect(URI.open('http://google.jp').read)
encoding # => {:language=>"ja", :encoding=>"Shift_JIS", :confidence=>100}
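A common next step is converting the detected text to UTF-8. The following is a minimal sketch, not part of uchardet itself: `to_utf8` and its `min_confidence` parameter are hypothetical names, and the hard-coded hash only stands in for a result shaped like the one shown above so that the snippet runs without ICU installed. It also assumes the detected encoding name is one Ruby recognizes, which may not hold for every charset ICU reports.

```ruby
# Hypothetical helper (not part of the uchardet gem): transcode raw bytes to
# UTF-8 using a detection result shaped like the hash shown in the synopsis.
def to_utf8(bytes, detection, min_confidence = 50)
  # Refuse to guess when there is no result or the confidence is low.
  return nil if detection.nil? || detection[:confidence] < min_confidence

  # Label the bytes with the detected encoding, then convert to UTF-8.
  bytes.dup.force_encoding(detection[:encoding]).encode('UTF-8')
rescue ArgumentError, EncodingError
  # ArgumentError: Ruby does not know the encoding name.
  # EncodingError: the bytes are not valid in the detected encoding.
  nil
end

# Stand-in for the detector's return value (assumed shape, see lead-in).
detection = { :language => 'ja', :encoding => 'Shift_JIS', :confidence => 100 }

# Six bytes of Shift_JIS text, kept as a binary string.
raw = "\x93\xFA\x96\x7B\x8C\xEA".b

puts to_utf8(raw, detection)
```

Returning nil instead of raising keeps the caller's handling simple: a nil result means either the detector was not confident enough or the bytes could not be converted.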
From command line:
$ uchardet
Usage: uchardet [options] file
    -l, --list       Display list of detectable character sets.
    -s, --strip      Strip HTML or XML markup before detection.
    -e, --encoding   Hint the charset detector about possible encoding.
    -a, --all        Show all matching encodings.
    -h, --help       Show this help.
$ uchardet `which uchardet`
ISO-8859-1 (confidence 60%)
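When the command-line tool is called from a script, its output has to be turned back into data. The sketch below parses a single line in the format shown above. `parse_uchardet_line` and `LINE_FORMAT` are hypothetical names, and the pattern is an assumption based only on that one sample line; the output of `--all` or of other options may differ.

```ruby
# Assumed line format, taken from the sample output: "ISO-8859-1 (confidence 60%)".
LINE_FORMAT = /\A(?<encoding>\S+) \(confidence (?<confidence>\d+)%\)\z/

# Hypothetical helper: returns a hash with the encoding name and the
# confidence as an integer, or nil if the line does not match the format.
def parse_uchardet_line(line)
  match = LINE_FORMAT.match(line.strip)
  return nil unless match

  { :encoding => match[:encoding], :confidence => match[:confidence].to_i }
end

p parse_uchardet_line('ISO-8859-1 (confidence 60%)')
```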
REQUIREMENTS:
ICU (International Components for Unicode):
on Mac OS X:
sudo port install icu
on Debian/Ubuntu:
sudo apt-get install libicu-dev
INSTALL:
sudo gem install uchardet
LICENSE:
Copyright © 2009 Dmitri Goutnik, released under the MIT license.