Random notes about text encoding

One of the first references that comes to my mind is this great post by Joel Spolsky. Still, I felt the need to write down a few pieces of information about that weird little corner of computing that is text encoding. Humbly, here are my 5 cents:

General notes

  • If you are building a web application, the way to go is Unicode, encoded as UTF-8. Use it everywhere from the very beginning, for every piece of text a human can type, and don’t look back at easier local encodings (ISO-8859-1). Your audience is the world; we’re not playing in the backyard anymore and, believe me, upgrading later can lead to incredible problems.
  • If you are a beginner, start by reading the article mentioned in the introduction. It’s an eye-opener on the subject that will help you understand the subtle complexity of the text encoding problem.
  • Test your application with input characters that you don’t usually type, and do it in all your target browsers.
    Some troublesome examples include: ampersands (&), single quotes ('), double quotes ("), less-than and greater-than symbols (< >), some accented letters (á ò û ŝ) and languages not based on Latin symbols (汉语/漢語 or لا أتكلم العربي for example).
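
As a starting point for that kind of testing, here is a minimal sketch that runs the sample inputs above through a round trip and reports which ones come back altered. The names samples, findCorrupted, utf8RoundTrip and asciiOnlyRoundTrip are made up for this illustration; in a real test the round trip would be "submit the text to your application and read it back", which depends on your stack.

```javascript
// Sample inputs taken from the examples above; extend with your own.
var samples = ["&", "'", "\"", "< >", "á ò û ŝ", "汉语/漢語", "لا أتكلم العربي"];

// Returns the samples that come back altered from roundTrip(text).
function findCorrupted(roundTrip) {
  return samples.filter(function (text) {
    return roundTrip(text) !== text;
  });
}

// Stand-in for a healthy pipeline: encode to UTF-8 bytes and decode back.
function utf8RoundTrip(text) {
  var bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return new TextDecoder("utf-8").decode(bytes);
}

// Stand-in for a broken pipeline that replaces everything outside ASCII.
function asciiOnlyRoundTrip(text) {
  return text.replace(/[^\x00-\x7F]/g, "?");
}

console.log(findCorrupted(utf8RoundTrip).length);      // 0
console.log(findCorrupted(asciiOnlyRoundTrip).length); // 3
```

The ASCII-only stand-in damages exactly the three samples that contain non-ASCII characters, which is the kind of failure this test is meant to catch. Characters such as & and < survive both round trips here; they tend to cause trouble at a different layer (HTML or URL escaping), so they still belong in the sample list.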

About JavaScript and sending stuff over the Internet

  • With UTF-8 forms, the browser does a great job of encoding data automatically before sending it (by GET, POST, …) to the web server. If you send data with JavaScript instead (hey AJAX!), you will notice that it is not encoded automatically. The working solution is to pass each value through the encodeURIComponent() function, which percent-encodes it as UTF-8.
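
A minimal sketch of what that looks like in practice, assuming you build the request data by hand. The helper name buildQuery and the field names are hypothetical; only encodeURIComponent() and decodeURIComponent() are standard functions.

```javascript
// Hypothetical helper: turns a plain object into "key=value&key=value",
// percent-encoding every key and value as UTF-8.
function buildQuery(params) {
  return Object.keys(params)
    .map(function (key) {
      return encodeURIComponent(key) + "=" + encodeURIComponent(params[key]);
    })
    .join("&");
}

var body = buildQuery({ name: "Ramón & Co", note: "漢語" });
console.log(body);
// name=Ram%C3%B3n%20%26%20Co&note=%E6%BC%A2%E8%AA%9E

// The server-side decoder reverses this; in JavaScript the counterpart is:
console.log(decodeURIComponent("Ram%C3%B3n%20%26%20Co")); // Ramón & Co
```

The resulting string can be appended to a URL after "?" for a GET request, or sent as the body of a POST with the Content-Type header set to application/x-www-form-urlencoded. Note that the ampersand inside the value becomes %26, so it can no longer be confused with the separator between fields.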
