Tyler vs. JavaScript/PHP/AJAX/Charsets/Form Submissions: Hard Fought Victory!
Filed Under (Everything, Software) by Tyler on 23-11-2007
Web development is not refined. The whole industry seems quite kludgy, as if the whole thing was designed by a few guys sitting in their parents’ basement getting high on cough syrup. The idea of object-oriented design seems to be lost on most projects, short of some of the newer .NET development, and even there the quality of work I have seen is somewhat lacking.
That little rant has absolutely nothing to do with my current victory. It is just to outline the fact that what has been done is not the right way to do things: what works in Firefox -a real browser- may not work in that piece of shit IE, and what works with old style form submission may not work with newfangled AJAXing.
Here is the battle that I was dealing with, and -I hope- a clear solution, which I could not find anywhere on the web. I’m currently writing a multilingual website for a ski team I coach. By multilingual I mean English/French. For most of the admin console I have been using old style form submission; I mean, why would I waste the fancy stuff on the backend? Yet I have a -quite kickass- photo manager that I have written and reused a few times now, and it uses AJAX form submissions. This shouldn’t be any different, right? Wrong!
The problem I was having is that some of my French characters (e.g. ç, é, è, etc.) were getting muddled on the way to the database. As it turns out, they were getting muddled in the transfer between the JavaScript AJAX post and the PHP server-side script. It appeared to me that I was doing everything right. I had the charset that I specified in the AJAX post correct:
ajaxRequest.setRequestHeader("Content-Type", "application/x-www-form-urlencoded;charset=ISO-8859-1");
Or perhaps I didn’t, so I switched it to UTF-8. That seemed to make no difference. A quick browse through the shitty information on the interweb led me to this:
“Your are in luck! Transforming text in ISO 8859-1 to Unicode is the identity transform (as in no change at all), as the code points they share have the same meaning in both encodings. For all other encodings (save US ASCII, in part a subset ISO 8859-1), you need to resort to laborious replace() hacks.”
Unfortunately that is a load of crap, at least for what I needed. The code points may well be the same, but the bytes that go over the wire are not. For the ASCII range, ISO-8859-1 and UTF-8 are byte-for-byte identical, but the upper range of ISO-8859-1 (0x80 to 0xFF), where the accented Latin characters live, is a single byte in ISO-8859-1 and two bytes in UTF-8. For instance:
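Here is a small JavaScript sketch that prints the bytes of a plain 'e' and of 'é' in both encodings. The helpers latin1Bytes, utf8Bytes and toHex are names made up for this illustration, not part of any library; utf8Bytes relies on encodeURIComponent() percent-encoding text as UTF-8.

```javascript
// Hypothetical helpers for illustration only; not part of any library.

// Bytes of a string in ISO-8859-1: one byte per character, which only
// works for code points up to U+00FF.
function latin1Bytes(text) {
  return Array.from(text, (ch) => {
    const code = ch.codePointAt(0);
    if (code > 0xff) {
      throw new RangeError("not representable in ISO-8859-1: " + ch);
    }
    return code;
  });
}

// Bytes of a string in UTF-8, read back out of the percent-encoding
// that encodeURIComponent() produces.
function utf8Bytes(text) {
  const encoded = encodeURIComponent(text);
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    if (encoded[i] === "%") {
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(encoded.charCodeAt(i)); // unescaped characters are ASCII
    }
  }
  return bytes;
}

// Format a byte array as space-separated upper-case hex.
function toHex(bytes) {
  return bytes
    .map((b) => b.toString(16).toUpperCase().padStart(2, "0"))
    .join(" ");
}

console.log("e :", toHex(latin1Bytes("e")), "|", toHex(utf8Bytes("e")));
// e : 65 | 65
console.log("\u00e9 :", toHex(latin1Bytes("\u00e9")), "|", toHex(utf8Bytes("\u00e9")));
// é : E9 | C3 A9
```

The ASCII letter comes out the same either way; the accented one does not.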

The encodings for ‘é’ are not the same between the two: it is the single byte 0xE9 in ISO-8859-1, but the two bytes 0xC3 0xA9 in UTF-8. This is where the challenge got interesting. Some more research let me determine that the JavaScript function encodeURI() always produces UTF-8, and I was specifying the charset to be UTF-8. Perhaps the problem was decoding the URL on the other end. I tried the PHP function urldecode(), but it produced the same two-character output: é turned into Ã©.
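The two-character result can be reproduced without PHP. This is a minimal sketch, assuming the receiving side simply percent-decodes the body into raw bytes and those bytes are then read as ISO-8859-1; percentDecodeToBytes and bytesAsLatin1 are hypothetical helper names, not real library functions.

```javascript
// Hypothetical helpers for illustration only.

// Turn a percent-encoded string into raw byte values, without assuming
// anything about which charset those bytes are in.
function percentDecodeToBytes(encoded) {
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    if (encoded[i] === "%") {
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return bytes;
}

// Read each byte as one ISO-8859-1 character, which is what happens when
// UTF-8 bytes land in a page or database that expects ISO-8859-1.
function bytesAsLatin1(bytes) {
  return String.fromCharCode(...bytes);
}

const wire = encodeURI("\u00e9");          // what the browser sends for é
const misread = bytesAsLatin1(percentDecodeToBytes(wire));

console.log(wire);            // %C3%A9
console.log(misread);         // Ã©
console.log(misread.length);  // 2
```

So nothing was being corrupted in transit. The bytes arrived intact and were just being read with the wrong charset.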
It was at this point that I realized that there was an issue in conversion from UTF-8 to ISO-8859-1. Why was my PHP script not able to decode it? The short answer is that PHP does not support Unicode natively (its strings are just sequences of bytes), so you need to convert incoming parameters yourself. There are two easy ways to do this: utf8_decode() or iconv(). iconv appears to be part of the default install only as of PHP 5. I used utf8_decode() and it worked as expected. So the transformations appear as such: ISO-8859-1 charset page > UTF-8 encoding to go over the wire > ISO-8859-1 to be usable in PHP.
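To make the last step of that chain concrete, here is a rough JavaScript stand-in for the conversion. It is only a sketch of the idea, not PHP's implementation: percentDecodeToBytes and utf8ToLatin1Bytes are hypothetical names, and replacing unrepresentable characters with '?' is an assumption based on how utf8_decode() is documented to behave.

```javascript
// Hypothetical helpers for illustration only; not PHP's actual code.

// Turn a percent-encoded string into raw byte values.
function percentDecodeToBytes(encoded) {
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    if (encoded[i] === "%") {
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return bytes;
}

// Interpret the bytes as UTF-8, then re-encode as ISO-8859-1.
// Characters above U+00FF have no ISO-8859-1 byte and become '?' (0x3F).
function utf8ToLatin1Bytes(utf8Bytes) {
  const escaped = utf8Bytes
    .map((b) => "%" + b.toString(16).padStart(2, "0"))
    .join("");
  const text = decodeURIComponent(escaped); // throws URIError on bad UTF-8
  return Array.from(text, (ch) => {
    const code = ch.codePointAt(0);
    return code <= 0xff ? code : 0x3f;
  });
}

// ISO-8859-1 page > UTF-8 over the wire > ISO-8859-1 on the server
const overTheWire = encodeURIComponent("gar\u00e7on"); // "gar%C3%A7on"
const serverBytes = utf8ToLatin1Bytes(percentDecodeToBytes(overTheWire));

console.log(overTheWire);
console.log(serverBytes.map((b) => b.toString(16)).join(" "));
// 67 61 72 e7 6f 6e  (ç is back to the single byte 0xE7)
```

The same caveat applies to the PHP version: anything outside ISO-8859-1 is lost in this step, so it only suits sites whose text fits in that charset.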
Did I mention that I find an awful lot of this I18N business very frustrating? Although I suppose that, given the multilingual nature of the world, I have the choice of getting better at it or giving up being a programmer.

Tyler, you’ve hit on many issues in this post. There are the technical “how-tos” of getting data intact through an AJAX transport, there’s the business of finding relevant information on the Web (which is sometimes about as pleasant as a walk on the freeway), and there is the question of deeper design integrity in the entire web infrastructure, which seems to be compromised by its peculiar origins as a cross between ‘basement invention’ and a tendency to regard it as a graphic designer’s medium.
At least that’s my reaction to your post. There is also a measure of frustration, which I can completely understand as one who also comes from a desktop (read ‘real’) programming background. I too spent countless hours on this transport problem, but searching the web for solutions produced nothing for me but contradictory and useless information. In the end I had to get a friend with experience to just explain the ground rules to me - then it was clear.
Part of the problem with searching the web for a solution to this problem (and any other) is that one has to wade through endless accounts of issues with no solutions, partial solutions, or outdated solutions. I found that even the official documentation would be trailed by numerous comments indicating that the ‘standard practice’ won’t work in scenarios x, y and z. Very frustrating, but I really don’t see that changing much any time soon. There’s a lot of churn in the WWW, and some of the offered solutions to these things (such as the replace() strategy you mentioned) are nothing less than hit and run accidents as far as code integrity is concerned.
I hope that with people such as yourself (i.e. with a solid programming background) being involved, the practices may evolve “properly” - i.e. more object oriented and not just quick fixes.
That’s my 2.0 cents.