Everything You Need to Know About InterBase Character Sets

Everything You Need to Know About InterBase Character Sets

The rules for using character sets in InterBase are quite simple, but failing to follow them can cause problems. If you will be storing data which uses characters not found in 7 bit ASCII you must take this into account when designing your metadata.

The Golden Rule of Using InterBase Character Sets

Pick the single most appropriate character set for your language or application. Create your database, specifying that character set as the default character set for the DB. Always specify that character set as the client character set when connecting to the DB.

Please read that section again. It’s very important.

Which Default Character Set Should I Choose For My Database?

CHARACTER SET NONE, unless you have a reason to use a different character set. Reasons to use a different character set include:

  • You want to use the UPPER function, and UPPER produces the wrong results for your language when using CHARACTER SET NONE, especially when used with words containing accent marks.
  • You need to use a collation which is not supported in CHARACTER SET NONE, but which is supported in a different character set.
  • You need to use multi-byte characters.

When Should I Specify a Default Character Set?

When you create the database, and when you connect to it. Do not specify a character set at any other time. Do not specify a character set when writing SQL queries or when creating metadata. The DB’s default character set will be used, which is what you want.

OK, but if I’m never supposed to specify a character set at any other time, why do I even have the option?

Because there may be times when you want to store data in a DB which cannot be stored (or cannot be stored efficiently in the DB’s default character set. The important thing to remember about column charsets is that they are intended to be used in addition to a default charset for the database, not in place of a default charset. See the section on column charsets below for more information.

OK, that makes sense for columns in the DB, but why would I ever want to connect to the DB with a client character set different than the DB’s default character set?

I have absolutely no idea. I don’t know why the IB client doesn’t just default to the DB’s character set, but it doesn’t.

What About Column Charsets?

Use a column charset when you want to have a certain column use a character set different than the DB’s default charset. For example, you might choose ISO-8859-1 as the default charset for your DB so that your InterBase can correctly deal with accented characters and the like. However, this special handling is not necessary for a postal code, and adds some overhead to operations on the column, and limits index length. So when you create the postal code column you can specify CHARACTER SET ASCII to override the default.

Important note: Do not use CHARACTER SET NONE as a column charset, as you will almost certainly encounter a "Cannot transliterate character between character sets" error when performing operations involving this column and anything else in the DB which uses the default charset. Instead, use CHARACTER SET ASCII or CHARACTER SET OCTETS as appropriate.

When you specify a column charset, you must carefully consider your choices when performing operations which involve columns in different charsets. If you compare or assign values in different charsets, InterBase must transliterate the characters in one charset into the other. Depending upon the charsets you have chosen, this may or may not be possible. Most of the charaters in CHARACTER SET UNICODE_FSS, for example, don’t exist in CHARACTER SET ASCII. And characters in CHARACTER SET NONE can’t be converted to anything — since there is no character set information associated with NONE, there is nothing to convert to or from.

How Can I Change the Default Character Set of an Existing Database?

There is no supported way to do this. You should recreate the database from metadata.

Which Collation Order Should I Choose?

Choose a collation order based on three factors:

  • How does the collation order behave when you use ORDER BY?
  • How does the collation order behave when you use a WHERE clause? This may or may not be the same as the behavior with ORDER BY.
  • How does the collation order affect maximum index length?

The ORDER BY and WHERE clause behavior affects a number of things, the most obvious being how accents and capital/lower-case letters are handled. In other words, are accented/unaccented or capital/lower-case letters treated as "equal," or "different." However, there are other factors, and here is no documentation for how the different collations in InterBase behave. You need to consider what is important for your language and application and try them out.

If the supplied collations don’t do what you need, you might look at this third-party library of collations.

What About Table Names?

Although InterBase stores all metadata names in CHARACTER SET UNICODE_FSS, some code in the server doesn’t deal with this correctly and will cause a transliteration error if you use non-ASCII characters in any metadata, including table names, column names, and even stored procedure or trigger source code or user exception messages! So don’t do it.

If you need to return a message to the user with non-ASCII characters, store it in a regular data field, not as a constant in source code.

I consider this a bug in InterBase. It has been reported to Borland, and I hope it is fixed in a future version of the server.