What is Unicode?

Unicode is an international standard for representing the characters used in many different languages.  It works “out of the box” on recent operating systems from Microsoft (Windows 2000, XP) and Apple (Mac OS X).

Some earlier operating systems support Unicode as well, but only an earlier version that covers modern Greek, which has a single accent mark.  That version does not cover ancient, or polytonic, Greek with its various breathings, accents and iota subscripts.  If this is the case for you, you may need to download a Unicode font that supports polytonic Greek and configure your applications to use it.

What Problem Does Unicode Solve?

First, some terminology.  A character set is a mapping (correspondence) between numbers and glyphs, the pictures of letters drawn on your screen or a piece of paper.  You feed the software a number and it produces a glyph.  The character set is an agreement about what number produces which glyph.  Sometimes a character set is called an encoding.  A font is a set of glyphs for a particular character set.

In the bad old days, countries and computer manufacturers felt free to define their own character sets, invariably choosing numbers that fit in one byte, the numbers zero through 255.  Take a look at your browser's “View” menu and select “Character Coding” or “Text Coding” to see a long list of character sets.  English speakers most commonly encounter the ASCII character set and its extension ISO-8859-1.

There were two problems with this Babel of character sets.  First, the numbers used by the various character sets often overlapped.  For example, consider the number 97.  In the ASCII character set, this is the Latin lowercase a.  In IBM's EBCDIC character set, it is the slash /.  In the betacode character set used by the SPIonic font, it is the Greek lowercase alpha α.  Second, a single byte can hold only 256 distinct values, and there are far more characters than that to represent: the letters, digits and symbols used by English speakers already take up a good share of them, and scripts such as Japanese and Chinese use many thousands of symbols.  A single byte simply isn't enough to number them all.
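The overlap is easy to reproduce.  Here is a minimal sketch in Python, not part of the original article.  It assumes the standard-library codec cp037 as a representative EBCDIC variant, and it uses ISO-8859-7 (a single-byte Greek character set) as a stand-in for betacode, which Python does not ship as a codec.  The variable names are illustrative only.

```python
import unicodedata

# The same byte value means different things in different character sets.
raw = bytes([97])
as_ascii = raw.decode("ascii")    # Latin lowercase a
as_ebcdic = raw.decode("cp037")   # slash, in this EBCDIC variant

# A second illustration with a value above 127 (assumption: ISO-8859-7
# stands in for a Greek single-byte set; it is not betacode).
high = bytes([225])
as_latin1 = high.decode("iso8859_1")  # Latin lowercase a with acute accent
as_greek = high.decode("iso8859_7")   # Greek lowercase alpha

for label, ch in [("ascii 97", as_ascii), ("cp037 97", as_ebcdic),
                  ("iso8859_1 225", as_latin1), ("iso8859_7 225", as_greek)]:
    # Print the official character name so the output stays plain ASCII.
    print(label, "->", unicodedata.name(ch))
```

Without knowing which character set the bytes were written in, a program has no way to tell which of these readings was intended.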

The Unicode Solution

The Unicode Consortium standardized on a single set of numbers representing all characters used by all scripts in the world.  If your text is encoded in Unicode, you don't have to include information on what the number 97 means in a particular context.  We know that 97 always corresponds to the Latin lowercase a.  The Greek lowercase alpha α always corresponds to the number 945, and so forth.
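You can check these numbers yourself.  The sketch below uses Python's built-in ord and chr functions and the standard unicodedata module; it is offered only as an illustration, and the variable names are my own.

```python
import unicodedata

# ord() gives the Unicode number (code point) of a character;
# chr() goes the other way.
code_a = ord("a")
code_alpha = ord("\u03b1")   # Greek lowercase alpha
alpha = chr(945)

for code in (code_a, code_alpha):
    # Show the number in decimal and hexadecimal with its official name.
    print(code, hex(code), unicodedata.name(chr(code)))
# 97 0x61 LATIN SMALL LETTER A
# 945 0x3b1 GREEK SMALL LETTER ALPHA
```

The hexadecimal form is the one you will usually see in Unicode charts, written as U+0061 and U+03B1.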

The Consortium made two choices to ease adoption.  First, they reused existing standards.  The first 256 Unicode characters use the same numbers as the ISO-8859-1 character set, whose first 128 numbers in turn match those of the ASCII character set.  Second, they adopted UTF-8, a way of representing Unicode characters such that the regular Roman (ASCII) characters take up one byte, whereas the more exotic characters (such as Greek letters) take up two or more bytes.  This has the deliberate side effect of making every existing ASCII document a valid UTF-8 document.
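Both choices can be seen in a few lines of Python.  This is a rough sketch under my own choice of sample characters and variable names, not something taken from the Unicode standard itself: it measures how many bytes UTF-8 uses for three characters, confirms that a plain ASCII text reads the same when decoded as UTF-8, and confirms that ISO-8859-1 byte values 0 through 255 map to the Unicode characters with the same numbers.

```python
# Sample characters: Latin a, Greek alpha, and a polytonic Greek
# alpha with smooth breathing (U+1F00).
samples = ["a", "\u03b1", "\u1f00"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3]

# An existing ASCII document is already valid UTF-8, byte for byte.
ascii_bytes = "What is Unicode?".encode("ascii")
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# ISO-8859-1 values 0..255 match Unicode numbers 0..255.
latin1_match = (bytes(range(256)).decode("iso8859_1")
                == "".join(chr(n) for n in range(256)))
print(same_text, latin1_match)  # True True
```

Note that the polytonic character needs three bytes, one more than the plain alpha, which is why "two or more" is the right way to describe Greek text in UTF-8.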

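Strictly speaking, Unicode assigns the numbers and UTF-8 is one way of storing those numbers as bytes.  For everyday purposes, though, if you treat UTF-8 as a synonym for Unicode, you won't go wrong.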

There has been a big effort in the past decade to modernize operating systems and applications to handle the Unicode character set.  So while relatively new programs such as browsers are pretty much all Unicode-capable, a few older programs such as the email reader Eudora have not yet made that transition.

Software Details

For specific details on Unicode fonts and Unicode-capable software, consult the page on Unicode applications.