internationalization, rdf parsers, postgres and mysql

I have spent several days in the past few weeks trying to get internationalization working with my RDF toolkit-to-be. It should have been easy, as I do not have my own parser and was using ARP and Rio, which both have internationalization support. They did the hard bit so we don’t have to 🙂
Anyhow, I’m extremely chuffed that I’ve just managed to get internationalization working with the in-memory version and with Postgres, which seems to just work, provided you create the database cluster with a Unicode encoding:
./initdb -E UNICODE
or to check that your database is compatible,
./pg_encoding UNICODE
./pg_encoding UTF8
MySQL seems to require 4.1, so I’ll do that another day. So, a few notes on what I did.
I was getting ??????? printed instead of any non-English characters in FOAF files: Japanese, Arabic and French. I figured it was a problem with my Java code because I knew that both parsers I was using were good. So I spent a lot of time doing things like this:
String lit1 = new String(((Literal) val).getLabel().getBytes("UTF8"));
before I started processing from the parser, to try and track the problem down. Java is supposed to use Unicode by default so it should have been OK, but I found a bunch of examples like this and tried it. No dice.
Anyway, it turns out it was a combination of my terminal not supporting UTF-8, the locale on my Debian box not having been set up, and (I think this was the most important bit) my JSP pages not being set up to display UTF-8.
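In hindsight, a round trip like that cannot repair anything: a Java String is already Unicode, and `new String(bytes)` without a charset decodes using the platform default, so it either changes nothing or makes things worse. Here is a minimal sketch of the difference, assuming Java 7 or later for `StandardCharsets`. The class and method names are made up for illustration and are not part of my toolkit.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripCheck {
    // Encode and decode with the SAME explicit charset: the text is unchanged.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes with a different charset: this is how garbling arises.
    static String decodeAsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "café" followed by three Japanese characters, written as escapes
        // so the source file's own encoding does not matter.
        String label = "caf\u00e9 \u65e5\u672c\u8a9e";
        System.out.println(roundTrip(label).equals(label));      // prints "true"
        System.out.println(decodeAsLatin1(label).equals(label)); // prints "false"
    }
}
```

If the matched round trip returns the original text, the string in memory is fine and the problem is somewhere on the output side.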
sigh.
This is a useful page about setting your locale in Debian, and part of the Java tutorial on internationalization helped me realize my terminal wasn’t displaying UTF-8 properly. Using xterm like this:
LC_CTYPE=en_GB.UTF-8 xterm
made me realize that encoding was coming through the parsers (although that command won’t display Japanese or Arabic), and focus on getting webpages to display correctly, rather than command-line tools.
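The question marks themselves are a clue. When Java writes text through a stream whose encoding cannot represent a character, it substitutes a replacement, which for ASCII is `?`. The sketch below captures the bytes a `PrintStream` would send to the terminal under two encodings. The class and helper names are hypothetical, and it only assumes the standard `PrintStream(OutputStream, boolean, String)` constructor.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class TerminalEncodingDemo {
    // Print text through a PrintStream using the named encoding and
    // return the raw bytes that would have reached the terminal.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String name = "\u65e5\u672c\u8a9e"; // three Japanese characters
        byte[] ascii = printedBytes(name, "US-ASCII");
        byte[] utf8 = printedBytes(name, "UTF-8");
        // Each unmappable character becomes a single '?' byte.
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints "???"
        // In UTF-8 each of these characters takes three bytes.
        System.out.println(utf8.length); // prints "9"
    }
}
```

So a row of `?` on the command line can mean the output encoding is wrong even when the parser and the strings are fine, which matches what I saw.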
For JSPs you seem to need two bits of information:

at the very top of the page, and
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
doesn’t seem to go amiss either.
So I just tried outputting HTML from my tests and then tried it on JSPs and then – hurrah! – it worked 🙂
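For the "output HTML from a test" step, something along these lines would do. It is a rough sketch rather than my actual test code: the class name, method names and temp-file prefix are invented, it assumes Java 7 or later for `java.nio.file`, and it does not HTML-escape the body text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8HtmlWriter {
    // Build a tiny page that declares UTF-8 in a meta tag.
    // Note: the body text is inserted as-is, without HTML escaping.
    static String page(String body) {
        return "<html><head><meta http-equiv=\"Content-Type\" "
             + "content=\"text/html; charset=utf-8\"></head>"
             + "<body><p>" + body + "</p></body></html>\n";
    }

    // Encode explicitly as UTF-8 instead of relying on the platform default.
    static byte[] toUtf8Bytes(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String greeting = "\u0645\u0631\u062d\u0628\u0627"; // an Arabic word
        Path file = Files.createTempFile("i18n-test", ".html");
        Files.write(file, toUtf8Bytes(page(greeting)));
        // Open this file in a browser to check the characters by eye.
        System.out.println("Wrote " + file);
    }
}
```

The point is that the bytes on disk and the charset the page declares agree, so the browser has no reason to guess.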
[later, 2003-10-15]
I just found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) which clarifies a lot for me. Very nice.