Tuesday, July 12, 2005

The perils of globalisation

... from a programmer's perspective ...

Call it globalisation or internationalisation, either way it's not as straightforward as I'd like to believe.

As a Java programmer it all seems easy. Use a ResourceBundle and store all your text in handy resource files. These can easily be translated as needs require. However, once you've externalised all of your user-displayed text to properties files, that's when the fun really begins. Especially if you're dealing with a non-Western script.
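For example, looking up a greeting for a Japanese locale might look like this (the bundle name and key are made up; I've used an inline ListResourceBundle so the snippet is self-contained, but a Messages_ja.properties file on the classpath works the same way):

```java
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

// Stand-in for a Messages_ja.properties file; a class-based bundle
// behaves the same way and keeps this example self-contained.
public class Messages_ja extends ListResourceBundle {
    protected Object[][] getContents() {
        // \u3053\u3093\u306b\u3061\u306f = "hello" in Japanese
        return new Object[][] { { "greeting", "\u3053\u3093\u306b\u3061\u306f" } };
    }

    public static void main(String[] args) {
        // getBundle resolves the most specific bundle for the locale: Messages_ja
        ResourceBundle bundle = ResourceBundle.getBundle("Messages", Locale.JAPANESE);
        System.out.println(bundle.getString("greeting"));
    }
}
```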


Funny how Japanese text looks like fractions, symbols and punctuation marks, you may find yourself saying....
.....
Once you get past that stage though, it's on to the world of character encodings.

Character encoding. It's something most Western-language users blissfully ignore. It's only when you try to export your freshly translated text to, say, Japan that things come tumbling down.

The best links I found on this are here and here.

Ah ha, maybe that 3/4 symbol isn't in fact the Japanese for Hello.

In fact it's garbled character encoding. Looks like it's time to instruct your web server to serve up different character sets.
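You can reproduce the garbling for yourself (a little sketch of my own, not tied to any particular server): take the UTF-8 bytes of a Japanese greeting and decode them as Latin-1, as a misconfigured browser would.

```java
import java.io.UnsupportedEncodingException;

public class Mojibake {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String hello = "\u3053\u3093\u306b\u3061\u306f"; // "hello" in Japanese
        byte[] utf8 = hello.getBytes("UTF-8");
        // Decode the UTF-8 bytes with the wrong charset: every byte
        // becomes one Latin-1 character - fractions, symbols, punctuation
        String garbled = new String(utf8, "ISO-8859-1");
        System.out.println(garbled);
        System.out.println(garbled.length()); // 15 - one char per byte
    }
}
```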

I started out setting up Tomcat to send UTF-8. There are a few distinct steps involved.


1.
The JVM must be started using the correct charset. This sets the default encoding for the files that Java reads (includes, properties files, JSP pages etc.)

This is done with the following switch

-Dfile.encoding=UTF-8

so for Tomcat, I added the following to the catalina.bat file

set CATALINA_OPTS=-Dfile.encoding=UTF-8
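A quick sanity check of my own (not from any docs) to confirm the switch took effect: print the encoding the JVM actually picked up.

```java
public class EncodingCheck {
    public static void main(String[] args) {
        // Should print UTF-8 when the JVM was started with -Dfile.encoding=UTF-8
        System.out.println(System.getProperty("file.encoding"));
    }
}
```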


2.
It's important to realise which encoding is used by default by the OS that you are running. My version of Windows uses a Western Latin-1 script, which is unable to display Japanese characters. In order to get Japanese characters displayed I had to install Japanese character support.

This also means that files (for example properties files) may be stored in a different character set than you expect.

In my case, I was using Crimson as an editor. Crimson saves all characters as ASCII, and thus cannot display Japanese characters, or even Western characters such as é or ß. I had to find a Unicode-enabled editor. UniPad is a good one that I found.
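Within Java itself you can sidestep the editor/OS default entirely by choosing the charset explicitly when writing a file. A small sketch of my own (file name and contents made up):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class WriteUtf8 {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("greeting", ".txt");
        // Name the charset explicitly instead of trusting the platform default
        Writer w = new OutputStreamWriter(new FileOutputStream(f), "UTF-8");
        w.write("\u3053\u3093\u306b\u3061\u306f"); // "hello" in Japanese
        w.close();
        System.out.println(f.length()); // 15 bytes: five characters, three bytes each
        f.delete();
    }
}
```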

Java properties files do not support Unicode directly: they are read as ISO-8859-1 (Latin-1), so any other characters must be escaped.

Once your files are stored in a Unicode encoding, you have to run them through the native2ascii tool, which escapes out any strange foreign characters and puts in the exact Unicode code point instead. (There's an Ant task which can be used for this.)

<native2ascii encoding="UTF-8" src="${src.dir}" dest="${build.dir}" includes="**/*.properties"/>
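The converted output looks something like this (the key and value are just for illustration); since the escaped form is plain ASCII, any editor can handle it:

```
# greeting_ja.properties after native2ascii: plain-ASCII Unicode escapes
greeting=\u3053\u3093\u306b\u3061\u306f
```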

3.
Once your basic Java environment is set, you need to be able to set the character encoding for the page. This can be set with a page directive

<%@ page contentType="text/html;charset=UTF-8" %>

or in xml syntax

<jsp:directive.page contentType="text/html;charset=UTF-8" />

4.
Finally (I think) set the encoding for the request, so the server can decode any text that gets sent back to it (form submissions, for example).

Using jstl

<fmt:requestEncoding value="UTF-8" />



Now in the above example I have used UTF-8 as the encoding. UTF-8 maps each character to a variable number of bytes (from 1 to 4). Western characters, e.g. ASCII, are covered by 1 byte each, but Japanese characters take 3 bytes each.
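You can see the byte counts for yourself (a small sketch of my own, string values made up):

```java
import java.io.UnsupportedEncodingException;

public class EncodingWidth {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String ascii = "Hello";
        String kana = "\u3053\u3093\u306b\u3061\u306f"; // "hello" in Japanese
        System.out.println(ascii.getBytes("UTF-8").length);   // 5 - one byte per char
        System.out.println(kana.getBytes("UTF-8").length);    // 15 - three bytes per char
        System.out.println(kana.getBytes("UTF-16BE").length); // 10 - two bytes per char
    }
}
```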

If you are mainly directing pages to Japanese clients then UTF-8 will be quite wasteful. A fairer encoding might be UTF-16 (which doesn't seem to be supported by IE or Mozilla browsers), or even (I'm going out on a bit of a limb here) Shift-JIS.



It's all very nice if it all works well, but if it doesn't..... it can be a bit nasty.

What I'm looking at now is changing the encoding depending on the locale. The Servlet 2.4 spec has a nice mapping for this in web.xml

<locale-encoding-mapping-list>
  <locale-encoding-mapping>
    <locale>ja</locale>
    <encoding>Shift_JIS</encoding>
  </locale-encoding-mapping>
</locale-encoding-mapping-list>

which hopefully will sort things out.

If there are no more entries on this subject then it all worked out swimmingly.
