Unicode
Unicode is a character set containing a unified representation
of much of the world's characters. It is developed, maintained,
and promoted by the Unicode Consortium, a nonprofit computer industry
organization. The Unicode character set has been encoded using several
different encoding schemes, such as:
- fixed-width, 2-byte encoding (UCS-2, commonly used on MS Windows
platforms; often labeled UTF-16, which extends UCS-2 with variable-width
surrogate pairs)
- fixed-width, 4-byte encoding (UTF-32 a.k.a. UCS-4, commonly used
on GNU systems)
- variable-width, multibyte encoding (UTF-8, a one- to four-byte
scheme commonly used on UNIX platforms, or UTF-7, used for
compatibility with legacy email systems).
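As a quick illustration of these schemes, the following Python sketch encodes the same short string, containing an ASCII letter, an accented Latin letter, and a CJK character, under three of the encodings above and compares the resulting byte counts (the string itself is just an arbitrary sample):

```python
# Compare how the same text is represented under different Unicode
# encoding schemes. "A" is ASCII, "é" is in the accented Latin range,
# and the CJK character needs three bytes in UTF-8.
text = "Aé漢"

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)
    print(f"{encoding}: {len(data)} bytes -> {data.hex()}")
```

Note that UTF-32 always uses 4 bytes per character, UTF-16 uses 2 bytes for each of these (all three are in the 16-bit range), and UTF-8 uses 1, 2, and 3 bytes respectively.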
Unicode maps code points to characters, but does not actually specify
how the data will be represented in memory, in a database, or on
a Web page. This is where the actual encoding of Unicode data comes
into play. Some common encodings are:
- UCS-2: the 2-byte form of the Universal Character Set, which maps
each character directly to a single 16-bit code unit. UCS-2 is the
main Unicode encoding used by Microsoft Windows NT® 4.0, Microsoft
SQL Server version 7.0, and Microsoft SQL Server 2000. UCS-2 allows
for encoding of 65,536 different code points. All information that
is stored in Unicode (via NCHAR, NVARCHAR, and NTEXT) in SQL Server
2000 is stored in this encoding, which uses 2 bytes for every character,
regardless of the character being used.
- UTF-16: a UCS Transformation Format that transforms a UCS representation
so that the data can be passed more reliably through specific
environments, and that extends the repertoire beyond the 16-bit
limit through the surrogate mechanism. UTF-16 is identical to the
16-bit encoding form of Unicode (UCS-2) except that it can also
represent code points above 65,535 by using encoded pairs of 16-bit
values known as surrogate pairs. It is the primary Unicode encoding
scheme used by Microsoft Windows 2000.
- UTF-32: a.k.a. UCS-4, this form of encoding uses 32 bits per
character and, therefore, covers all of ISO 10646. It is the primary
Unicode encoding scheme used by GNU systems running on UNIX/Linux
platforms.
- UTF-8: Many ASCII and other byte-oriented systems that require
8-bit encodings (such as mail servers) must span a vast array of
computers using different encodings, byte orders, and languages.
Because it is a byte stream, UTF-8 represents Unicode data
independently of the byte ordering on the computer. Keep in mind
that when Microsoft speaks of Unicode, it means UCS-2 and UTF-16;
Microsoft considers UTF-8 to be just another multibyte character
set, whereas much of the rest of the world treats UTF-8 and the
word Unicode as synonymous.
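The surrogate mechanism mentioned above can be demonstrated directly. The sketch below (using DESERET CAPITAL LETTER LONG I, U+10400, as an arbitrary example of a character above the 16-bit range) shows UTF-16 splitting one code point into a high/low surrogate pair of 16-bit values, something UCS-2 cannot express:

```python
# A code point above 0xFFFF cannot fit in a single UCS-2 value;
# UTF-16 encodes it as a surrogate pair of two 16-bit code units.
ch = "\U00010400"                      # U+10400, outside the 16-bit range
assert ord(ch) > 0xFFFF

units = ch.encode("utf-16-be")         # big-endian for readable byte order
high = int.from_bytes(units[0:2], "big")
low = int.from_bytes(units[2:4], "big")
print(hex(high), hex(low))             # prints 0xd801 0xdc00
```

The high surrogate always falls in the range 0xD800-0xDBFF and the low surrogate in 0xDC00-0xDFFF, which is why these 2,048 code points are reserved and never assigned to characters.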
Unicode Support within Current Technologies
Many other database systems (such as Oracle and Sybase SQL Server)
support Unicode using UTF-8 storage. Depending on a server's implementation,
this can be technically easier for a database engine to implement,
since all of the existing text management code on the server that
is designed to deal with data one byte at a time does not require
major changes.
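One property behind this "byte at a time" compatibility is that every byte of a UTF-8 multi-byte sequence has its high bit set, so ASCII bytes such as NUL or a quote character never appear as fragments inside an encoded non-ASCII character. A small Python sketch, using arbitrary sample strings, checks this:

```python
# Any byte below 0x80 in a UTF-8 stream is a genuine ASCII character
# from the original text, never a fragment of a multi-byte sequence.
samples = ["résumé", "Ελληνικά", "日本語"]

for s in samples:
    encoded = s.encode("utf-8")
    ascii_bytes = [b for b in encoded if b < 0x80]
    # Every sub-0x80 byte maps back to a real character in the string.
    assert all(chr(b) in s for b in ascii_bytes)
    print(s, "->", len(encoded), "bytes")
```

This is why legacy code that scans for delimiters byte by byte generally keeps working on UTF-8 data without modification.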
In the Windows environment, UTF-8 storage has these disadvantages:
- The Component Object Model (COM) supports UTF-16/UCS-2 in its
APIs and interfaces, so UTF-8 data must undergo a simple conversion
to UTF-16 whenever it is passed through a COM interface. This issue
only applies when COM is used, as the SQL Server database engine
does not typically call COM interfaces.
- The Windows NT and Windows 2000 kernels are both Unicode-based,
using UCS-2 and UTF-16, respectively. Once again, a UTF-8 storage
format requires simple conversions to UTF-16. As with the previous
note on COM, this would not result in a conversion hit in the
SQL Server database engine, but would potentially affect many
client-side operations.
- UTF-8 can be slower for many string operations. Sorting, comparing,
and virtually any string operation can be slowed by the fact that
characters do not have a fixed width. However, UTF-8 makes things
as simple as possible by indicating the number of bytes in a multi-byte
sequence with the first byte in the sequence.
- UTF-8 requires one byte for most Latin-based characters, two
bytes for most Middle Eastern locales, and three bytes for Asian
characters. Overall, it is reasonably efficient with respect to
storage.
- XML's default encoding is UTF-8, as is Oracle's when the database
is configured to handle multilingual data within a single instance.
To include localized strings in an XML document, convert the file
and strings to UTF-8. Recent changes to the XML standard suggest
always declaring UTF-8 explicitly in the encoding declaration,
even though it is the default encoding. Microsoft XML parsers
since IE 4.0 write out XML as UTF-8.
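The lead-byte rule and the per-script byte counts described in the list above can be sketched as follows; utf8_sequence_length is a hypothetical helper written for this illustration, not a library function:

```python
# The first byte of a UTF-8 sequence encodes its length:
# 0xxxxxxx = 1 byte, 110xxxxx = 2 bytes, 1110xxxx = 3 bytes,
# 11110xxx = 4 bytes.
def utf8_sequence_length(lead_byte: int) -> int:
    if lead_byte < 0x80:
        return 1                    # plain ASCII
    if lead_byte >> 5 == 0b110:
        return 2                    # e.g. most Middle Eastern scripts
    if lead_byte >> 4 == 0b1110:
        return 3                    # e.g. most Asian scripts
    if lead_byte >> 3 == 0b11110:
        return 4                    # supplementary characters
    raise ValueError("continuation byte, not a sequence start")

# Latin, Arabic, CJK, and a supplementary character: 1, 2, 3, 4 bytes.
for ch in ("A", "ش", "漢", "\U00010400"):
    data = ch.encode("utf-8")
    assert utf8_sequence_length(data[0]) == len(data)
```

Because the length is declared up front, a scanner never has to guess where a character ends, which keeps variable-width processing manageable even if it remains slower than fixed-width indexing.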