The Unicode Standard 5.0: An Appreciation
Tuesday, October 17 2006 @ 04:38 PM CDT
Contributed by: Andy Updegrove
Unicode marks the most significant advance in writing systems since the Phoenicians
James J. O'Donnell, Provost, Georgetown University
There are fundamental standards that are constantly in the news, such as XML (and its many offspring). And there are standards development organizations, like the W3C, that enjoy a high profile in part because of the importance of the technical domains that they serve. Some standards have even taken on socio-political significance, becoming pawns in international diplomacy, such as the root domains of the Internet, despite the fact that they are insignificant in size and design.
But there are other standards that go largely unheralded, and are developed by consortia that are virtually never in the news, despite the vast social and technical significance of the standard in question. Perhaps chief among them is the Unicode, created and constantly extended by the Unicode Consortium, whose loyal and widely distributed team of contributors for the most part labor quietly in the background of information technology.
Notwithstanding the low profile of the Unicode and its creators, it is this standard that enables nearly all those living in the world today to communicate with each other in their native language character sets. It even permits the words of many who lived in the past to become accessible to those alive today in electronic form, and in their original character sets as well.
The occasion for my choosing to write about the Unicode today is that its newest version - version 5.0 - has now been published in book form and will be shipped in the next few weeks (you can preorder it from Amazon here).
What exactly is the Unicode? When I wrote in October of 2003 about the publication of version 4.0, I went about asking and answering that question as follows:
What is 11 1/8" x 8 3/4" x 2 1/4" and weighs 7.89 pounds? Among other things, the hardbound copy of the Unicode Standard 4.0, the Oxford English Dictionary of computerized language characters, numbers and symbols, contemporary and archaic, mainstream and obscure. The home of Khmer Lunar codes, Ogham alphabets and Cyrillic supplements. An alphanumeric expression of the means of human communication.
Less lyrically, the Unicode is described in its new, more accessible packaging (my reviewer's pre-release manuscript copy weighs in at a trim 5.2 pounds; the hardbound commercial copy will also be smaller, cheaper and lighter than its predecessor) like this:
The Unicode Standard is the universal character encoding standard for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software. As the default encoding of HTML and XML, the Unicode Standard provides a sound underpinning for the World Wide Web and new methods of business in a networked world. Required in new Internet protocols and implemented in all modern operating systems and computer languages such as Java and C#, Unicode is the basis for software that must function all around the world.
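What "a consistent way of encoding multilingual text" means in practice can be seen in a few lines of code. The sketch below (in Python, my own illustration rather than anything from the standard itself) shows that every character, from any script, maps to a single Unicode code point, and that UTF-8 then serializes those code points into bytes for exchange:

```python
# Each character, whatever its script, has exactly one Unicode code
# point; UTF-8 turns that code point into one or more bytes.
for ch in "Aé中":
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8')}")
# A Latin letter, an accented European letter, and a Han character
# all travel through the same encoding machinery.
```

The same program logic handles English, French and Chinese text alike, which is precisely the "foundation for global software" the standard's introduction claims.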
As this introduction suggests, the Unicode was conceived to be one of the essential building blocks of broadly useful information technology. With the advent of the Internet and the Web and adoption of computer technology by the masses, it has become much more: a modern bit-based Rosetta Stone, providing the ultimate, real-time character-by-character translator for computer users everywhere around the world.
Unicode Consortium President Mark Davis defines the role of the Unicode in a globally-linked world as follows, in the foreword to the new edition:
With the rise of the Web, a single representation for text became absolutely vital for seamless global communication. Thus the textual content of HTML and XML is defined in terms of Unicode; every program handling XML must use Unicode internally. The search engines all use Unicode for good reason; even if a Web page is in a legacy character encoding, the only effective way to index that page for searching is to translate it into the lingua franca, Unicode. All of the text on the Web thus can be stored, searched, and matched with the same program code. Since all of the search engines translate Web pages into Unicode, the most reliable way to have pages searched is to have them be in Unicode in the first place.
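The translation step Davis describes can be sketched in a few lines. This is my own minimal illustration (assuming, for the example, a page served in Latin-1, one common legacy encoding), not how any particular search engine is implemented:

```python
# Bytes as a legacy web page might serve them, in Latin-1 encoding.
legacy_bytes = "café".encode("latin-1")

# The indexer's first move: decode the legacy bytes into Unicode text.
text = legacy_bytes.decode("latin-1")

# From here on, the same indexing and matching code works regardless
# of which legacy encoding the page originally used.
print(text)
```

Once decoded, pages that arrived in dozens of incompatible encodings can all be stored, searched and matched by the same code, which is exactly the point Davis is making.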
Although the importance of the Unicode continues to increase, its creation remains largely unnoticed by the press. One reason is that while its existence is essential, its exact design does not have great significance to the proprietary plans of powerful technology vendors. The result is that while the Unicode Consortium has a long list of corporate (and other) supporters that recognize its importance and therefore subsidize its ongoing support and extension, no standards wars surround its creation, and no vendors compete aggressively to control its working groups.
Instead, the Unicode Standard is the work of a dedicated group of (mostly) volunteers committed to achieving social as well as IT goals. Increasingly marginal (from a commercial perspective) languages are constantly being added, in each case enabling another step to be taken in providing equal access to the Internet and the Web to all. In the words of Tim Berners-Lee, "The path W3C follows to making text on the Web truly global is Unicode."
Even less commercial is the work of the Unicode Consortium in encoding the ancient scripts of languages no longer spoken, such as Old Persian, Sumero-Akkadian, Runic, Ogham and Phoenician, and in allowing archaic versions of still-spoken languages to be made accessible to maintain continuity of access over time. An example is Aramaic, now in use for over 3,000 years. The Unicode adopted the Syriac form of Aramaic, thereby, in the words of the President and Director of The Syriac Institute, "linking ancient tablets and parchment to today's digital memory cells and even to the unknown media of tomorrow. The inheritors of the Syriac heritage today and academia are most indebted to the Unicode Standard."
The end result is that the Unicode 5.0 in published form, complete with notes and explanations, is a unique hybrid of linguistic history, cultural continuity and technical data. As such, it is interesting for even a non-IT professional (such as me) to flip through and appreciate.
Still, the summary just provided doesn't quite do justice to what the Unicode represents. When the first electromechanical marvels were launched in the 19th century (e.g., the telephone, the gramophone and the movie projector), all the science remained invisible, hidden behind enigmatic mechanical parts like metal diaphragms, electromagnets and steel needles. Only sights, sounds and physical objects were apparent to normal mortals, just as the human voice, sight and hearing were mysteries until modern anatomists and optical physicists began to divine their secrets.
But with the advent of the personal computer, anyone can (and, increasingly, many more do) dive down to the code level of program creation, where all mystery dissolves into letters, colons, <> marks and the like. In doing so, the link between the past and the present, between linguistics and geography, between literary art and the reader, dissolves irretrievably into arbitrary and elemental characters.
And that is where the Unicode reveals a significance that goes beyond its technical utility and its social value, by providing the link between the sterility of code and the richness of history.
I tried to convey some of that significance in my prior essay as follows:
While all 1500 pages (including prefatory material) [of Unicode 4.0] are dedicated to the purely technical task of converting the essence of human communication into 1s and 0s, its explanatory sections are also a lyrical expression of the richness of evolved language. In the ongoing Unicode project, the arcana of linguistic structure and the logic of technology find their interface.
Witness, for example, the following Jabberwockian discoveries (just) from the Table of Contents, and take time to savor the sounds of the syllables as you mentally pronounce them:
- Living Scripts: Thaana, Limbu and Malayalam; Tai Le, Tagalog and Tagbanwa; Devanagari, Bengali and Gurmukhi; Han, Hangul and Yi. Bopomofo. Canadian Aboriginal Syllabics. Cherokee.
- Archaic Scripts: Ogham; Ugaritic; Cypriot Syllabary
- Symbols: Byzantine Musical Symbols, Yijing Hexagrams and Dingbats
- General Structural Elements: Ligated Multiple Base Characters, Spacing Clones of European Diacritical Marks, and Grapheme Clusters
- Conformance: Corrigenda, Canonical Decomposition and Collation. Conjoining Jamo Behavior.
Is the Unicode Standard 4.0 a computer technical book? Only up to a point. In a much more real sense, it is the latest (and not the last) step in the human ordering of expression. The manner of its coming into being, involving many individuals passionate about their tasks, is but the latest leap in the long road that began when words were first converted into hash marks on a clay tablet.
If it would be possible to visualize a standards manual as a coffee table book without imagining an oxymoron, it would be The Unicode Standard 5.0. I highly recommend it.