Encodings in a MultiLingual Web Application

Tools: MySQL as a DB, Ruby/Python/PHP as a Language

Problem: If you are working on a multilingual web application and need to store its text in a database, you will surely run into encoding issues. In Ruby 1.8.6 I haven’t found anything promising that can clearly tell you the encoding of a String. The same is easy and clearly explained for other languages, and I feel Python’s support for encodings is the best: clean and self-explanatory.

Things to remember:

  • The MySQL database and each table must be created with the UTF-8 character set. By default it is latin1, and it was very annoying to change at a later stage after realizing it.
CREATE DATABASE <database name> DEFAULT CHARACTER SET utf8;
  • Make sure all data that is already stored in the database tables is in UTF-8. If it is not, convert it (I will talk about that in a while). An existing table can be converted with:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
  • If you are struggling to display or store characters in the proper encoding, make sure you have set encoding: utf8 in your database.{yml/php}, e.g.
development:
  adapter: mysql
  database: db_name
  username: username
  password: password
  host: localhost
  encoding: utf8
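
To see why the latin1 default hurts later, here is a minimal Python 3 sketch that checks whether a piece of text can be represented in a given character set at all. The helper name `fits` and the sample strings are made up for illustration, and Python's `latin-1` codec only approximates what MySQL calls latin1.

```python
def fits(text, encoding):
    """Return True if every character of text can be encoded in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample data: Western European, Cyrillic and Japanese text.
samples = ["Hello", "Déjà vu", "Привет", "日本語"]
for s in samples:
    print(s, "| latin-1:", fits(s, "latin-1"), "| utf-8:", fits(s, "utf-8"))
```

Western European text fits in both, but Cyrillic or Japanese text only fits in UTF-8, which is why a multilingual application wants UTF-8 from the start.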

UNICODE is not UTF-8

I will try to explain succinctly that UTF-8 is not Unicode and Unicode is not UTF-8. I don’t remember where I read it, but this expert advice helped me a lot to differentiate between the two. This post has explained the Unicode philosophy.

When the computer reads characters from user input, think of them as Unicode: each character has its own unique number. Once you store the text in a variable or in the DB, an encoding comes into the picture, and the result depends on which encoding you save it in. If it is UTF-8, things are fine and as expected; if it is not, it may cause some trouble.

Unicode is a system that provides a unique number for every character of a language, no matter what the language.
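
A small Python 3 sketch may make the difference concrete: the Unicode number (code point) of a character is one thing, and the bytes that UTF-8 uses to store it are another. The helper name `describe` and the sample characters are chosen here purely for illustration.

```python
def describe(ch):
    """Return the Unicode code point label and the UTF-8 bytes of one character."""
    code_point = "U+%04X" % ord(ch)   # the character's unique Unicode number
    utf8_bytes = ch.encode("utf-8")   # how UTF-8 represents that number as bytes
    return code_point, utf8_bytes


for ch in ["g", "é", "პ"]:
    code_point, utf8_bytes = describe(ch)
    print(ch, code_point, utf8_bytes.hex(), "(%d byte(s))" % len(utf8_bytes))
```

The code point never changes, while the number of bytes depends on the encoding chosen for storage.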

The mapping of a byte value such as “0x67” to the letter “g” is called an encoding. The value is encoded as the letter. Depending on the encoding, the same byte can stand for different letters: “0xE9” is “é” in Latin-1 (as in many North American and European encodings), the Cyrillic “й” in Windows-1251, or the Greek “ι” in ISO-8859-7.
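
The idea that one byte value turns into different letters under different encodings can be checked directly. This is a minimal Python 3 sketch; the helper name `decode_as` and the choice of the byte 0xE9 and of these three codecs are assumptions made for the example.

```python
def decode_as(raw, codecs):
    """Decode the same raw bytes with each codec and collect the results."""
    return {codec: raw.decode(codec) for codec in codecs}


raw = bytes([0xE9])  # a single byte with the value 0xE9
for codec, text in decode_as(raw, ("latin-1", "cp1251", "iso8859_7")).items():
    print(codec, "->", text)
```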

The Python way is the easiest and preferred:
t = u"Héllo"                # Unicode text
data = t.encode("utf-8")    # UTF-8 bytes, ready to be stored

How to detect String/Text encoding in Ruby

Where was I stuck?

Characters (Cyrillic/Latin/funny ones) were stored wrongly in the database and needed to be converted and stored as UTF-8, after working out the current encoding of the stored text.
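
For readers who have not seen this kind of damage before, here is a minimal Python 3 sketch of one common way it arises, assuming UTF-8 bytes were read back as Latin-1 somewhere between the application and the database. The cause in any particular database may well be different, and the function names `garble` and `repair` are made up for this example.

```python
def garble(text):
    """Simulate the bug: UTF-8 bytes are interpreted as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo it: get the original bytes back, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "Déjà vu"
broken = garble(original)
print(repr(broken))                  # the "funny" characters seen in the database
print(repair(broken) == original)    # the round trip restores the text
```

The repair only works when you know (or correctly guess) which wrong encoding was applied, which is exactly why detecting the encoding matters.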

How to do it:

Certainly there are ways to solve this in MySQL itself, but none of them worked in my case, or I may need to learn more MySQL. At the same time I was more interested in how to do it the Ruby way!

So here is a way I tried out. It worked very well, and it helps me any time I need to know the encoding of a text/string or need to convert it to another encoding.

First, install the chardet gem by issuing the following command:

 $ sudo gem install chardet

Then in irb:

 require 'rubygems'
 require 'UniversalDetector'
 p UniversalDetector::chardet('Ascii text')
 p UniversalDetector::chardet('åäö')
 p UniversalDetector::chardet("Déjà vu")

The respective output from this example is:

{"encoding"=>"ascii", "confidence"=>1.0}
{"encoding"=>"utf-8", "confidence"=>0.87625}
{"encoding"=>"utf-8", "confidence"=>0.7525}

Now, to convert it into the desired encoding:

 require 'rubygems'
 require 'iconv'
 require 'UniversalDetector'
 encoding = UniversalDetector::chardet(str)["encoding"]  # detect the current encoding of str
 Iconv.conv("UTF-8", encoding, str)  # convert str from the detected encoding to UTF-8
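
For comparison, here is a rough Python 3 sketch of the same detect-then-convert idea using only the standard library. It is not statistical detection like chardet: it simply tries a list of candidate encodings in order and keeps the first one that decodes without error. The function name `to_utf8` and the candidate list are assumptions for this example, so adjust them to the encodings you actually expect.

```python
def to_utf8(raw, candidates=("utf-8", "cp1251", "latin-1")):
    """Decode raw bytes with the first candidate that works; return UTF-8 bytes."""
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next one
        return text.encode("utf-8")
    raise ValueError("none of the candidate encodings could decode the data")


stored = "Привет".encode("cp1251")      # pretend this came out of the database
print(to_utf8(stored).decode("utf-8"))
```

The order of the candidates matters: strict UTF-8 goes first because it rejects most non-UTF-8 data, while a permissive codec such as latin-1 accepts any byte sequence and so only makes sense as a last resort.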

I would love to hear your suggestions/feedback, whether it doesn’t work out for you or it helps and saves you a night’s work researching how to handle encodings.