Encodings in a MultiLingual Web Application
Tools: MySQL as the database; Ruby, Python, or PHP as the language
Problem: If you are working on a multilingual web application and need to store its content in a database, you will almost certainly run into encoding issues. In Ruby 1.8.6 I have not found anything that clearly reports the encoding of a String or other piece of data. Doing the same in other languages is easy and well explained, and I find Python's support for encodings the best: very clean and self-explanatory.
Things to remember:
- The MySQL database and each table must be created with the UTF-8 character set. The default is latin1, and it was very annoying to change at a later stage, after realizing it.
CREATE DATABASE <database name> DEFAULT CHARACTER SET utf8;
- Make sure all data being stored in the database and its tables is in UTF-8; if it is not, convert it. I will talk about how in a while.
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
- If you are struggling to display or store characters in the proper encoding, make sure you have set encoding: utf8 in your database.{yml/php}, e.g.
development:
  adapter: mysql
  database: db_name
  username: username
  password: password
  host: localhost
  encoding: utf8
UNICODE is not UTF-8
I will try to be succinct in explaining that UTF-8 is not Unicode and Unicode is not UTF-8. I don't remember where I read it, but this expert advice helped me a lot in telling the two apart. This post explains the philosophy behind Unicode.
When the computer reads characters from user input, think of them as Unicode: each character has one unique number, independent of how it is stored. Encodings come into the picture once you store the text in a variable or in the database, and the result depends on which encoding you save it in. If it is UTF-8, things work as expected; if it is not, it may cause trouble.
Unicode is a system that provides a unique number for every character of a language, no matter what the language.
The mapping of “0x67” to the letter “g” is called an encoding: the value is encoded as the letter. Depending on the encoding, the same value could be the letter “g” (as in ASCII and most North American and European encodings) or, in an encoding designed for another script, a Bengali letter or the Georgian “პ”.
The Python way is the easiest and preferred:
t = u"Héllo"                # Unicode text
data = t.encode("utf-8")    # the same text encoded as UTF-8 bytes
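The distinction between a Unicode code point and its encoded bytes can be made concrete in a few lines. The sketch below assumes Python 3, where `str` is already Unicode text; the sample character and the three encodings are choices made for this illustration, not something taken from a particular application.

```python
# A minimal sketch (Python 3): one code point, several byte representations.
ch = "é"

# Unicode gives the character a number (code point), independent of storage.
code_point = ord(ch)                  # U+00E9

# An encoding decides which bytes represent that code point.
utf8_bytes = ch.encode("utf-8")       # two bytes
latin1_bytes = ch.encode("latin-1")   # one byte

# The same byte stands for different characters under different encodings.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")     # Latin small letter e with acute
as_cp1251 = raw.decode("cp1251")      # Cyrillic small letter short i
as_greek = raw.decode("iso-8859-7")   # Greek small letter iota

print(hex(code_point), utf8_bytes, latin1_bytes)
print([hex(ord(c)) for c in (as_latin1, as_cp1251, as_greek)])
```

This is why text saved in one encoding and read back in another shows up as the wrong characters: the bytes are unchanged, only their interpretation differs.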
How to detect String/Text encoding in Ruby?
Where was I stuck?
Characters (Cyrillic, Latin, and other non-ASCII ones) were stored wrongly in the database and needed to be converted and stored as UTF-8, after working out the current encoding of the stored text.
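One common way to end up with wrongly stored characters is UTF-8 bytes being read back as if they were a Latin-1 style single-byte encoding. The Python 3 sketch below reproduces that effect and reverses it. It assumes this particular mix-up is the cause, which may not hold for every row in a real database, and the function names `garble` and `repair` are hypothetical.

```python
def garble(text):
    """Simulate the bug: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo it: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "Déjà vu"
broken = garble(original)
fixed = repair(broken)

# Each accented letter became two characters, so the garbled text is longer.
print(len(original), len(broken))   # 7 9
print(fixed == original)            # True
```

If `repair` raises a `UnicodeDecodeError` or `UnicodeEncodeError` on real data, the text was most likely damaged in some other way and this assumption does not apply.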
How to do it:
There are certainly ways to solve this in MySQL itself, but none of them worked in my case, or perhaps I need to learn more MySQL. At the same time, I was more interested in how to do it the Ruby way!
So here is an approach I tried; it worked very well and helps me any time I need to know the encoding of a text/string or convert it to another encoding.
First, install the chardet gem by issuing the following command:
$ sudo gem install chardet
Then in irb:
require 'rubygems'
require 'UniversalDetector'
p UniversalDetector::chardet('Ascii text')
p UniversalDetector::chardet('åäö')
p UniversalDetector::chardet("Déjà vu")
The respective output from this example is:
{"encoding"=>"ascii", "confidence"=>1.0}
{"encoding"=>"utf-8", "confidence"=>0.87625}
{"encoding"=>"utf-8", "confidence"=>0.7525}
Now to convert it into the desired format:
require 'rubygems'
require 'UniversalDetector'
require 'iconv'  # provides Iconv; part of the Ruby 1.8 standard library
encoding = UniversalDetector::chardet(str)["encoding"]  # detect the current encoding of str
utf8_str = Iconv.conv("UTF-8", encoding, str)  # convert str from the detected encoding to UTF-8
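For comparison, a similar detect-then-convert step can be sketched in Python 3 without any third-party library. This is not the statistical detection that chardet performs: it simply tries a list of candidate encodings in order and keeps the first one that decodes cleanly. The function name `to_utf8` and the default candidate list are assumptions for illustration and should be adapted to the encodings actually expected in the data.

```python
def to_utf8(raw, candidates=("utf-8", "cp1251", "latin-1")):
    """Return (utf8_bytes, encoding_used) for bytes of unknown encoding.

    Tries each candidate in order; the first that decodes without error wins.
    latin-1 accepts any byte sequence, so it acts as a last-resort fallback.
    """
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue
        return text.encode("utf-8"), name
    raise ValueError("none of the candidate encodings matched")


# Usage: Cyrillic text stored as cp1251 bytes is not valid UTF-8,
# so the second candidate is the one that matches.
data, used = to_utf8("Привет".encode("cp1251"))
print(used, data.decode("utf-8") == "Привет")   # cp1251 True
```

The order of candidates matters: single-byte encodings accept almost any input, so strict encodings such as UTF-8 should come first and the result should be checked by eye on a sample of the data.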
I would love to hear your suggestions and feedback, whether this does not work for you or whether it helps and saves you a night's research into how to handle encodings.