Encodings in a MultiLingual Web Application

Tools: MySQL as the database; Ruby, Python, or PHP as the language

Problem: If you are working on a multilingual web application and need to store its text in a database, you will almost certainly run into encoding issues. In Ruby 1.8.6 I haven’t found anything promising that clearly states the encoding of a string or piece of data. In other languages this is easy and clearly explained, and I feel Python’s support for encodings is the best: clean and self-explanatory.

Things to remember:

  • The MySQL database and each table must be created with the UTF-8 character set. The default was latin1, and it was very annoying to change it at a later stage, after realizing the problem.
CREATE DATABASE <database name> DEFAULT CHARACTER SET utf8;
  • Make sure all data stored in the database and its tables is in UTF-8. If it is not, convert it (converting the text itself is covered later in this post). To convert an existing table:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
  • If you are struggling to display or store characters in the proper encoding, make sure you have set encoding: utf8 in your database.{yml/php}, e.g.
development:
  adapter: mysql
  database: db_name
  username: username
  password: password
  host: localhost
  encoding: utf8

UNICODE is not UTF-8

I will try to be succinct in explaining that UTF-8 is not Unicode and Unicode is not UTF-8. I don’t remember where I read it, but this expert advice helped me a lot in telling the two apart. This post explains the Unicode philosophy.

When the computer reads characters from user input, think of them as Unicode: each character has a unique number, independent of how it is stored. Once you store the text in a variable or in the database, encodings come into the picture, and the result depends on which encoding you save it in. If it is UTF-8, things work as expected; if it is not, it may cause trouble.

Unicode is a system that provides a unique number (a code point) for every character, no matter what the language.
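
To make the distinction concrete, here is a minimal Python 3 sketch (the variable names are only illustrative). It takes one character, looks up its Unicode number, and then shows that two different encodings turn that same number into different bytes.

```python
# One character, one Unicode code point, several possible byte representations.
ch = "\u00e9"                        # the character é

code_point = ord(ch)                 # the Unicode number of the character
utf8_bytes = ch.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = ch.encode("latin-1")  # one byte in Latin-1

print(hex(code_point))   # 0xe9
print(utf8_bytes)        # b'\xc3\xa9'
print(latin1_bytes)      # b'\xe9'
```

The code point never changes; only the bytes chosen to represent it do.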

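The mapping of a byte value such as “0x67” to the letter “g” is called an encoding: the letter is encoded as that value. Depending on the encoding, “0x67” could be the letter “g” (as in ASCII and the many North American and European encodings built on it), while an encoding designed for another script could map the same value to a different character, such as “ل” or the Georgian “პ”.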
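
A small Python 3 sketch of the same idea, using a byte I picked for illustration (0xE9) and two real single-byte encodings, Latin-1 and Windows-1251. The same byte comes out as two different characters, and on its own it is not valid UTF-8 at all.

```python
# The same byte value decodes to different characters under different encodings.
raw = b"\xe9"

as_latin1 = raw.decode("latin-1")   # é in Western European Latin-1
as_cp1251 = raw.decode("cp1251")    # й in Cyrillic Windows-1251

print(hex(ord(as_latin1)), hex(ord(as_cp1251)))   # 0xe9 0x439

# The lone byte is an incomplete UTF-8 sequence, so decoding it as UTF-8 fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(valid_utf8)   # False
```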

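The Python way is the easiest and the one I prefer (Python 2 syntax):

t = "Héllo"               # byte string, assuming the source is saved as UTF-8
x = unicode(t, "utf-8")   # decode the bytes to Unicode, naming their encoding
s = x.encode("utf-8")     # encode the Unicode text back to UTF-8 bytes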
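
For readers on Python 3, where there is no unicode() built-in and str is already Unicode, the same round trip might look like the sketch below. The byte string stands in for data read from a file, socket, or database; the names are illustrative.

```python
# Python 3: bytes are decoded to str (Unicode), and str is encoded to bytes.
raw = b"H\xc3\xa9llo"           # "Héllo" as UTF-8 bytes

text = raw.decode("utf-8")      # bytes -> str
encoded = text.encode("utf-8")  # str -> UTF-8 bytes

print(len(raw), len(text))      # 6 5
print(encoded == raw)           # True
```

The length difference shows the point of the previous section: five characters, six bytes.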

How to detect string/text encoding in Ruby?

Where was I stuck?

Characters (Cyrillic, Latin, and other “funny” ones) were stored wrongly in the database and needed to be converted and stored as UTF-8, after working out the current encoding of the stored text.
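
One common way such damage arises is that UTF-8 bytes get read as Latin-1 somewhere between the application and the database. Whether that is what happened in a given database is an assumption you would have to check against your own data. The Python 3 sketch below simulates that specific mistake and then reverses it.

```python
original = "D\u00e9j\u00e0 vu"      # "Déjà vu"

# Simulate the damage: UTF-8 bytes wrongly interpreted as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# Undo it by reversing the two steps in the opposite order.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(original), len(garbled))  # 7 9
print(repaired == original)         # True
```

Each accented letter turns into two wrong characters, which is why the garbled text is longer than the original.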

How to do it:

There are certainly ways to solve this within MySQL itself, but none of them worked in my case, or perhaps I need to learn more MySQL. At the same time, I was more interested in how to do it the Ruby way!

So here is an approach I tried that worked very well, and it helps me any time I need to know the encoding of a text/string or convert it to another encoding.

First, install the chardet gem by issuing the following command:

 $ sudo gem install chardet

Then in irb:

 require 'rubygems'
 require 'UniversalDetector'
 p UniversalDetector::chardet('Ascii text')
 p UniversalDetector::chardet('åäö')
 p UniversalDetector::chardet("Déjà vu")

The respective output from this example is:

{"encoding"=>"ascii", "confidence"=>1.0}
{"encoding"=>"utf-8", "confidence"=>0.87625}
{"encoding"=>"utf-8", "confidence"=>0.7525}

Now, to convert it to the desired encoding:

 require 'rubygems'
 require 'UniversalDetector'
 require 'iconv'
 encoding = UniversalDetector::chardet(str)["encoding"]  # detect the current encoding of str
 utf8_str = Iconv.conv("UTF-8", encoding, str)            # convert str from the detected encoding to UTF-8
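
For comparison, here is a rough Python 3 sketch of the same detect-then-convert idea. It does not use chardet. Instead it tries a caller-supplied list of candidate encodings in order, which is a simpler technique than statistical detection and can guess wrong when several encodings accept the same bytes. The function name to_utf8 and the candidate list are my own assumptions for illustration.

```python
def to_utf8(raw, candidates=("utf-8", "cp1251", "latin-1")):
    """Return (utf8_bytes, encoding_used) for the first candidate that decodes raw."""
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # raw is not valid in this encoding, try the next one
        return text.encode("utf-8"), name
    raise ValueError("none of the candidate encodings could decode the input")


# Bytes that are already UTF-8 come back unchanged.
already_utf8 = "D\u00e9j\u00e0 vu".encode("utf-8")
print(to_utf8(already_utf8)[1])   # utf-8

# Cyrillic text stored as Windows-1251 is not valid UTF-8, so cp1251 is picked.
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")
converted, used = to_utf8(cyrillic)
print(used)                       # cp1251
```

The order of the candidates matters: strict encodings such as UTF-8 should come first, and a catch-all such as Latin-1, which accepts any byte sequence, should come last.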

I would love to hear your suggestions and feedback, whether it doesn’t work for you or it helps you and saves you a night’s work researching how to handle encodings.