Unicode

(This entry is part of a series. The audience: SQL Server developers. The topic: SQLite on mobile devices.)


Well, Actually

First, go read this blog entry by Miguel de Icaza. Right now. I'll wait.

Welcome back. Now let me apologize. I don't want to be a pedantic jerk who quibbles about minor details. But the topic here is Unicode, so there really is no other way.

All Unicode, all the time

The relevant difference with SQLite is easy to describe:

  • In the world of Windows and SQL Server, you have all kinds of possible code pages.

  • In SQLite, everything is Unicode.

But if you don't have much context on these issues, I haven't really told you much. Let's go further.

But let's not go too far

I don't want to rewrite articles that have already been written quite well. So you should probably also go read this blog entry by Joel Spolsky. I'll be here when you get back.

OK, now let's get started

SQL Server has two basic ways you store text:

  • You can use the char/varchar types, which can be used with one of several collations, each of which implies a specific code page, which implies a specific character encoding.

  • Or you can use the nchar/nvarchar types (note the extra 'n'), which are Unicode.

SQLite has no such distinction. All text in SQLite is Unicode.

What the heck is Unicode again?

It's a character set: a collection of characters, each with a number that can be used to refer to it.

More specifically, it's the only character set which is [trying to be] complete. If you choose any character set or encoding which is not Unicode, there will be characters you cannot use.

And what's an encoding?

Saying that SQLite uses Unicode doesn't tell you how the text is actually represented. Unicode is not an encoding. It is more abstract than that. There are lots of different ways of representing Unicode as bytes.

Microsoft's sad history with Unicode

In the Windows and SQL Server world, there is a long history of encoding Unicode in 16-bit-ish ways. Originally, this was UCS-2, a simple encoding which represents each character as a 16-bit number. But then the Unicode consortium realized that 16 bits are not enough, so they expanded the space to 32 bits. This left Microsoft in an awkward spot. UCS-2 is a fundamentally defective encoding of Unicode, since there are many characters in Unicode that simply cannot be represented.

If the goal of Unicode is to be complete, it is reasonable to say that, well actually, UCS-2 is not Unicode.

The conceptual replacement for UCS-2 is to use 32 bits for every character. This encoding is called UCS-4 or UTF-32. But now the wasted space for storing a simple English string is getting out of hand. Switching the complete works of Arthur Conan Doyle from ASCII (which is also an encoding) to UCS-4 would take four times as much space.

Gradually, the world seems to be adopting UTF-8 as the most popular Unicode encoding. This is a variable width encoding. Sometimes a single character is represented with just one byte. Sometimes it needs more. That's very unfortunate, but the only fixed width alternative is UCS-4, which is also very unfortunate. Choose which problem you prefer, but keep in mind that almost everybody has chosen to accept the problems of UTF-8.

But Microsoft has so much history with UCS-2 that transitioning everything to UTF-8 would be really hard. So they have been moving from UCS-2 to UTF-16, which is basically a variable width encoding built around a 16-bit unit instead of an 8-bit unit. UTF-16 is approximately the worst correct way of representing Unicode, unless you have invested billions of dollars in the fundamentally broken UCS-2, in which case UTF-16 is a pretty awesome way out of the mess you ended up in.

Just remember that if you're going out tonight to a club for pedantic nerds and you want to impress someone, you've got to keep the terminology straight:

  • Unicode is an abstraction, not an encoding, not a code page, not a data format, and not a font.

  • Saying your text is Unicode says nothing about how it is represented. It might be UTF-8. It might UTF-16. It might be code point numbers handwritten on Post-It notes stuck on the wall. All of these are valid representations of Unicode.

  • If you ever say anything to suggest that you think Unicode is 16 bits per character, you will be identified as clueless.

  • If you say that your text is stored in Unicode, you are not entirely incorrect, but people will wonder about whether you really know the difference between Unicode and the encodings of same.

SQLite

SQLite always uses Unicode to represent text.

(Hopefully you are now screaming at me saying, "Yeah, but which encoding?!?")

The best answer to this question is: SQLite uses UTF-8. Forget about everything else.

A more correct answer to this question is: SQLite uses UTF-8 but also supports UTF-16.

Either way, there is no distinction between char and nchar. There is no way to save storage space in a column by realizing that you only use lower case english characters so it's safe to use char instead of nchar. There are no code pages. There is no way to move your Shift JIS data into SQLite without converting it to Unicode (or storing it as blobs, I suppose).

Summary

Microsoft has done a lot of things right, but its history with Unicode is very unfortunate. And it's not entirely their fault. They adopted Unicode early and it changed underneath them.

With respect to its emphasis on UTF-8, SQLite is far more typical of most non-Microsoft software today.