Encodings

The IRC protocol doesn’t define a specific encoding that should be used, nor does it provide any information on which encodings are being used.

At the same time, lots of different encodings have become popular on IRC. This presents a big problem, because, if you’re using a different encoding than someone else on IRC, you’ll receive their text as garbage.

Cinch tries to work around this issue in two ways, while also keeping the usual Ruby behaviour.

The encoding option

By setting the encoding option, you set your expectations on what encoding other users will use. Allowed values are instances of Encoding, names of valid encodings (as strings) and the special :irc encoding, which will be explained further down.
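The three kinds of accepted values can be illustrated with a small sketch. The helper name `resolve_encoding_option` is hypothetical and is not part of Cinch; it only shows how an `Encoding` instance, an encoding name and the `:irc` symbol might be told apart using Ruby's standard library.

```ruby
# Hypothetical helper (not Cinch's code): normalizes the value given
# for the encoding option into either :irc or an Encoding instance.
def resolve_encoding_option(value)
  case value
  when :irc     then :irc                  # the special encoding described further down
  when Encoding then value                 # already an Encoding instance
  when String   then Encoding.find(value)  # raises ArgumentError for unknown names
  else
    raise ArgumentError, "unsupported encoding option: #{value.inspect}"
  end
end
```

Under these assumptions, `resolve_encoding_option("UTF-8")` and `resolve_encoding_option(Encoding::UTF_8)` yield the same `Encoding` object, while a misspelled name fails early with an `ArgumentError`.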

Encoding.default_internal

If set, Cinch will automatically convert incoming messages to the encoding defined by Encoding.default_internal, unless the encoding option is set to the special :irc encoding.

The :irc encoding

As mentioned earlier, people couldn’t decide on a single encoding to use. As such, specifying a single encoding would most likely lead to problems, especially if the bot is in more than one channel.

Luckily, even though people cannot agree on a single encoding, users in western countries usually use either CP1252 (Windows Latin-1) or UTF-8. CP1252 text that contains non-ASCII characters almost always fails validation as UTF-8 (pure ASCII text is byte-identical in both encodings), so it is easy to tell the two apart. Additionally, any CP1252 text can be losslessly re-encoded in UTF-8, and a small subset of UTF-8 text is also representable in CP1252.

If incoming text is valid UTF-8, it will be interpreted as such. If it fails validation, a CP1252 → UTF-8 conversion is performed. This ensures that you will always deal with UTF-8 in your code, even if other people use CP1252. Note, however, that we ignore Encoding.default_internal in this case and always present you with UTF-8.
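The decoding rule described above can be sketched in plain Ruby. This is a minimal illustration, not Cinch's actual implementation; the method name `decode_irc` is an assumption made for this example, as is the choice to replace bytes that CP1252 leaves undefined with question marks.

```ruby
# Hypothetical helper (not Cinch's code): interprets raw bytes as UTF-8
# when they validate, and otherwise converts them from CP1252 to UTF-8.
def decode_irc(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Not valid UTF-8, so assume CP1252. A few byte values are undefined
  # in CP1252; this sketch replaces them with "?".
  bytes.dup.force_encoding(Encoding::CP1252)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end
```

Either way, the returned string is tagged as UTF-8, which matches the guarantee that your code only ever deals with UTF-8.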

If text you send contains only characters that fit inside the CP1252 code page, the entire line will be sent that way.

If the text doesn’t fit inside the CP1252 code page (for example, if you type Eastern European or Russian characters), it will be sent as UTF-8. Only UTF-8 capable clients will be able to see these characters correctly.
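The sending side can be sketched the same way. Again, this is an illustration under assumptions rather than Cinch's real code: `encode_irc` is a hypothetical name, and the sketch expects its input to be a valid UTF-8 string.

```ruby
# Hypothetical helper (not Cinch's code): returns the bytes to send for a line.
# The whole line goes out as CP1252 if every character fits that code page;
# otherwise the whole line goes out as UTF-8.
def encode_irc(text)
  text.encode(Encoding::CP1252).b
rescue Encoding::UndefinedConversionError
  # At least one character has no CP1252 equivalent.
  text.encode(Encoding::UTF_8).b
end
```

Note that the decision is made per line, not per character, so a single character outside CP1252 switches the entire line to UTF-8.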

Invalid bytes and unsupported translations

If Cinch receives text in an encoding other than the one assumed, it can happen that the message contains bytes that are not valid in the assumed encoding. Instead of dropping the complete message, Cinch will replace offending bytes with question marks.
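As a rough illustration of this replacement behaviour, Ruby's `String#scrub` can substitute a question mark for bytes that are invalid in the assumed encoding. The helper name `scrub_incoming` is hypothetical, and Cinch may implement the replacement differently.

```ruby
# Hypothetical helper (not Cinch's code): tags raw bytes with the assumed
# encoding and replaces bytes that are invalid in it with "?".
def scrub_incoming(bytes, assumed)
  bytes.dup.force_encoding(assumed).scrub("?")
end

# "\xFF" can never appear in valid UTF-8, so only that byte is replaced
# and the rest of the message survives.
scrubbed = scrub_incoming("abc\xFFdef".b, Encoding::UTF_8)
```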

Also, if you expect messages in e.g. UTF-8 but re-encode them in CP1252 (by setting Encoding.default_internal to CP1252), it can happen that some characters cannot be represented in CP1252. In such a case, Cinch will likewise replace the offending characters with question marks.
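The same effect can be reproduced with `String#encode` and its replacement options. This is only a sketch: `reencode` is a hypothetical helper, and the target encoding is passed explicitly here instead of being read from Encoding.default_internal.

```ruby
# Hypothetical helper (not Cinch's code): converts text to the target
# encoding, substituting "?" for anything that cannot be converted.
def reencode(text, target)
  text.encode(target, invalid: :replace, undef: :replace, replace: "?")
end

# The Cyrillic letter has no CP1252 equivalent and becomes "?",
# while "é" is representable and survives the conversion.
converted = reencode("\u044F caf\u00E9", Encoding::CP1252)
```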