Why does Rails do utf8=✓

I noticed that Rails apps always put utf8=✓ in their URLs. At one point Rails even used a snowman Unicode glyph. Here’s what Yehuda Katz has to say about this:

This parameter was added to forms in order to force Internet Explorer (5, 6, 7 and 8) to encode its parameters as unicode.

Specifically, this bug can be triggered if the user switches the browser’s encoding to Latin-1. To understand why a user would decide to do something seemingly so crazy, check out this google search: http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=diamond+with+a+question+mark+in+it. Once the user has put the web-site into Latin-1 mode, if they use characters that can be understood as both Latin-1 and Unicode (for instance, é or ç, common in names), Internet Explorer will encode them in Latin-1.

This means that if a user searches for “Ché Guevara”, it will come through incorrectly on the server-side. In Ruby 1.9, this will result in an encoding error when the text inevitably makes its way into the regular expression engine. In Ruby 1.8, it will result in broken results for the user.

By creating a parameter that can only be understood by IE as a unicode character, we are forcing IE to look at the accept-charset attribute, which then tells it to encode all of the characters as UTF-8, even ones that can be encoded in Latin-1.

Keep in mind that in Ruby 1.8, it is extremely trivial to get Latin-1 data into your UTF-8 database (since NOTHING in the entire stack checks that the bytes that the user sent at any point are valid UTF-8 characters). As a result, it’s extremely common for Ruby applications (and PHP applications, etc. etc.) to exhibit this user-facing bug, and therefore extremely common for users to try to change the encoding as a palliative measure.

All that said, when I wrote this patch, I didn’t realize that the name of the parameter would ever appear in a user-facing place (it does with forms that use the GET action, such as search forms). Since it does, we will rename this parameter to _e, and use a more innocuous-looking unicode character.

Very funky, although this has since become my standard way of determining whether an application is running on Ruby on Rails.
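To make the failure Katz describes concrete, here is a minimal sketch of what reaches the server when the same search term is encoded as UTF-8 versus Latin-1. It is written in Python purely for illustration (Rails itself is Ruby, and the exact errors differ), the helper name `decode_param` is hypothetical, and plain percent-encoding stands in for whatever a real browser does with form data.

```python
from urllib.parse import quote, unquote_to_bytes

query = "Ché Guevara"

# The same text, percent-encoded from UTF-8 bytes and from Latin-1 bytes.
as_utf8 = quote(query.encode("utf-8"))
as_latin1 = quote(query.encode("latin-1"))


def decode_param(raw):
    """Decode a percent-encoded value as UTF-8; return None if the bytes are not valid UTF-8."""
    try:
        return unquote_to_bytes(raw).decode("utf-8")
    except UnicodeDecodeError:
        return None


print(as_utf8, "->", decode_param(as_utf8))
print(as_latin1, "->", decode_param(as_latin1))

# The check mark has no Latin-1 representation, which is why it works as a marker:
# a browser cannot quietly send it as a single Latin-1 byte.
try:
    "✓".encode("latin-1")
    checkmark_fits_latin1 = True
except UnicodeEncodeError:
    checkmark_fits_latin1 = False

print("check mark fits in Latin-1:", checkmark_fits_latin1)
```

The Latin-1 version carries a lone 0xE9 byte for é, which is not a valid UTF-8 sequence, so a server that assumes UTF-8 either has to reject it or ends up storing garbage.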

Google moving to Unicode 5.1

More and more pages are in Unicode. Remember the times when you opened a web page full of question marks? This just shouldn’t happen. Hopefully everyone moves to Unicode soon. Lots of Chinese websites are actually still not on Unicode.

Moving to Unicode 5.1

Just last December there was an interesting milestone on the web. For the first time, we found that Unicode was the most frequent encoding found on web pages, overtaking both ASCII and Western European encodings—and by coincidence, within 10 days of one another. What’s more impressive than simply overtaking them is the speed with which this happened; take a look at the blue line in this graph. (Source: Google blog)


Unicode growth chart from Google blog.


Character encodings are important!

I’ve been reading about character encoding recently, in particular the various Unicode standards. I’ve been rather pissed off about setting up the wrong collation in MySQL: I just realized that on my other blog, I have posts in utf8_unicode_ci, latin1_general_ci and utf8_general_ci. This is what you get when you migrate a database blindly without knowing what a character set is. I regret not reading enough. Now I set everything to utf8_general_ci.

Anyway, something about another encoding set – GB2312 – caught my attention.

Here’s a bit of trivia: the older Chinese encoding GB2312 cannot write the former Chinese Premier Zhu Rongji’s name. His name has often appeared as 朱熔基. Zhu disapproves of this and prefers the correct version, 朱镕基.
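The GB2312 limitation is easy to check for yourself. Below is a small sketch in Python, assuming its built-in `gb2312` codec is a faithful stand-in for the original character set; the helper `can_encode` is a made-up name for illustration.

```python
def can_encode(text, charset):
    """Return True if every character in text can be represented in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# The common substitute spelling and the spelling Zhu prefers.
for name in ("朱熔基", "朱镕基"):
    print(name, "GB2312:", can_encode(name, "gb2312"), "UTF-8:", can_encode(name, "utf-8"))
```

Only the middle character differs between the two spellings, and 镕 is the one that GB2312 has no code point for.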