Words & Stuff

ddd: Di-Dah, Dah, Dit

(14 February 1999)

The characters of a writing system can be represented in many ways -- not only in a variety of fonts and lettering styles, but in any agreed-upon symbolic form: as Braille letters, as flags in Semaphore, as hand-signs, or as strings of dots and dashes...

Two weeks ago, Morse Code was officially retired as the language of seagoing emergency distress signals. In honor of the code's century and a half of service, this week's column is about binary character encodings.

It occurred to me shortly before the news about Morse Code's demise that the code was a fairly efficient binary encoding. Morse Code, invented in the 1830s, uses two symbols, the dot and the dash, in sequences, to indicate letters. (The dot and dash are really just written symbols representing "short" and "long" pulses of electricity, which more or less correspond to the 0s and 1s of today's binary digital world.) An E, for instance, as the most common letter in English, is represented in Morse as a single dot (pronounced "dit"); a T, the next most common, as a single dash (pronounced "dah"); and an A as dot-dash ("di-dah"). Less common letters use longer strings of dots and dashes; no more than four symbols are needed to represent any letter. Digits use five symbols; punctuation marks mostly use six.
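To make the variable-length idea concrete, here is a minimal sketch in Python. The table holds the International Morse codes for the 26 basic letters; the function name `to_morse` and the choice to separate letters with a single space are my own illustrative assumptions, not part of any standard.

```python
# International Morse codes for the 26 basic Latin letters.
# "." stands for a dot (dit) and "-" for a dash (dah).
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def to_morse(text):
    """Encode the letters of text, one space between letters.

    Hypothetical helper: characters without an entry in MORSE
    (digits, punctuation, spaces) are silently skipped.
    """
    return " ".join(MORSE[ch] for ch in text.upper() if ch in MORSE)


if __name__ == "__main__":
    print(to_morse("eat"))                         # the three codes from the text
    print(max(len(code) for code in MORSE.values()))  # longest letter code
```

The space between letters matters: because the codes have different lengths, a receiver needs some kind of pause to tell where one letter ends and the next begins.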

In 1851, Europeans developed an "International" variant of Morse which provided for letters with diacritic marks. That International Morse Code (with slight modifications) is the version in use all over the world today.

In 1874, a French engineer named Baudot patented a new telegraphic code which replaced Morse in many applications. In Baudot Code, every letter is represented by a string of five on-or-off signals, which correspond exactly to the binary digits (or "bits") 1 and 0. There were thus only 32 possible signal combinations -- just enough for the letters of the uppercase English alphabet and a few punctuation marks.
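The arithmetic behind that limit is easy to check. The sketch below enumerates every five-signal pattern and then builds a toy fixed-length code from them. The letter assignment here (A gets the first pattern, B the second, and so on) is purely hypothetical and is NOT the real Baudot table; the names `TOY_CODE`, `encode_fixed`, and `decode_fixed` are likewise invented for illustration.

```python
from itertools import product
import string

BITS = 5

# Every possible string of five on/off signals, written as 0s and 1s.
patterns = ["".join(p) for p in product("01", repeat=BITS)]

# Hypothetical assignment for illustration only: NOT the real Baudot table.
TOY_CODE = dict(zip(string.ascii_uppercase, patterns))
TOY_DECODE = {bits: letter for letter, bits in TOY_CODE.items()}


def encode_fixed(text):
    """Encode letters as back-to-back 5-bit groups, with no separators."""
    return "".join(TOY_CODE[ch] for ch in text.upper() if ch in TOY_CODE)


def decode_fixed(bits):
    """Decode by slicing the stream into groups of exactly BITS symbols."""
    return "".join(
        TOY_DECODE[bits[i:i + BITS]] for i in range(0, len(bits), BITS)
    )


if __name__ == "__main__":
    print(len(patterns))            # number of five-signal combinations
    stream = encode_fixed("STOP")
    print(stream, decode_fixed(stream))
```

Because every letter occupies exactly five signals, the decoder needs no pauses between letters; it simply counts.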

There weren't enough codes for lowercase letters, so telegraphy was always limited to uppercase -- which added to the urgent feel (and perhaps to the charm) of telegraphed messages. Adding further to the charm was the "telegraphic" language use that developed around the economic fact that telegraph companies charged for transmission by the word. Since "word" was not defined in any meaningful way, people sending telegrams often created words by stringing together other words and pieces of words.

For instance, I recently heard an account of a foreign correspondent for the BBC in the heyday of telegraphy. After long silence from the reporter, the BBC wired him to ask: NEWS? The reporter wired back: UNNEWS. The BBC, seeing no point in paying him if he wasn't working, retorted: UNNEWS, UNJOB. To which the reporter replied: UPSTICK JOB ASSWARD. (I've also heard that last line attributed to a telegram from Hemingway; I assume it's apocryphal. But it makes a good story.)

(Of course, those telegrams leave out one of the most prominent features of most telegrams: in transcribing them, telegraph operators used words rather than punctuation marks. STOP (also known as a "full stop") is still (I'm told) the British word for what we Yanks call a period; INTERROGATION MARK was written out in place of a '?'; and so on.)

Baudot Code underwent modifications over time, but remained conceptually the same until ASCII came along. This code, introduced in the 1960s as a 7-bit binary encoding, came into its own with the advent of the now-familiar 8-bit byte in personal computers. Since ASCII provided 128 possible characters (and 8-bit extensions of it later provided 256), computers could use lowercase letters.

But ASCII, like Baudot Code before it, was designed only for representing the Latin alphabet. It makes no provision for the diacritical marks used in European languages, much less for the huge variety of characters used by languages in the rest of the world. In the past decade a new character encoding has risen to prominence: Unicode. Unicode (an international standard) uses 16 bits instead of 8; it can therefore encode over 65,000 characters, which is more than enough to represent all the characters currently in use in all the world's major languages. Unicode also has an extension capability that allows for millions more characters, intended for use in representing historical writing systems. Unicode defines codes for dozens of languages and scripts: English, Cyrillic, Hebrew, Arabic, Thai, Tamil, and so on. Chinese, Japanese, and Korean ideographs (such as the Japanese kanji) are included as well. Future developments (to fill some of the 18,000 currently unused character codes) will include Cherokee, Burmese, and Braille. The character encoding also covers punctuation marks and other symbols, diacritics among them.
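The character counts quoted for these codes all come from the same rule: n on-or-off signals give 2 to the power n distinct combinations. A small sketch (the function name `code_count` and the labels in the list are my own):

```python
def code_count(bits):
    """Number of distinct values that a fixed width of `bits` bits can hold."""
    return 2 ** bits


if __name__ == "__main__":
    widths = [
        ("Baudot Code", 5),
        ("ASCII", 7),
        ("8-bit byte", 8),
        ("16-bit Unicode", 16),
    ]
    for name, bits in widths:
        print(f"{name}: {bits} bits, {code_count(bits)} combinations")
```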

(There's an important distinction here, by the way, between a character set, which is an abstract collection of characters like the English alphabet, and an encoding, which is a mapping between characters and numeric representations of those characters. Unicode doesn't provide a set of glyphs, the character shapes themselves; it merely provides a set of numbers corresponding to abstract names for characters (like "LATIN CAPITAL LETTER A"). The appearance of the glyphs used to display the characters represented by Unicode depends on the font being used and on the display system.)
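The number-to-name pairing can be inspected directly with Python's standard `unicodedata` module. In this sketch the helper name `describe` and the choice of sample characters are my own; the characters are written as escape sequences so the listing doesn't depend on any particular font or file encoding.

```python
import unicodedata


def describe(ch):
    """Return the code number (in hex) and the standard name for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # A Latin capital, an accented Latin letter, a Cyrillic letter, a Hebrew letter.
    for ch in ["A", "\u00e9", "\u0434", "\u05d0"]:
        print(describe(ch))
```

Note that nothing here says what the characters look like; that is left to whatever font ends up drawing them.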

Unicode characters can be represented either as 16-bit characters (a representation known as UTF-16) or as variable-length strings of 8-bit bytes (known as UTF-8). UTF-8 appears to exist largely for ease of use by software that was written assuming all text would use 8-bit ASCII bytes. This development is a new twist on the old distinction between the variable-length strings of symbols used by Morse and the fixed-length strings of Baudot Code, which waste more space but make the boundaries between characters easier to find.
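The difference between the two representations shows up as soon as you count bytes. This sketch uses Python's built-in `str.encode`; the helper name `encoded_sizes` and the sample characters are my own, and I've picked the big-endian UTF-16 variant only so that no byte-order mark is added to the count.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return len(utf8), len(utf16)


if __name__ == "__main__":
    samples = [
        ("Latin capital A", "A"),
        ("e with acute accent", "\u00e9"),
        ("a Japanese kanji", "\u65e5"),
    ]
    for label, ch in samples:
        utf8_len, utf16_len = encoded_sizes(ch)
        print(f"{label}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")

    # Plain ASCII text comes out byte-for-byte identical in UTF-8.
    print("STOP".encode("utf-8") == "STOP".encode("ascii"))
```

For these samples, UTF-16 stays at a fixed two bytes per character, while UTF-8 grows from one byte to three as the characters get further from ASCII.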

And so history repeats itself. Perhaps Unicode will prove sufficient for all our future character-encoding needs; perhaps one day, ASCII too will be officially retired, to join its ancestor Morse Code on the shelf of abandoned technologies.

Most of the historical facts in this column come from the articles "Morse Code," "Telecommunications Systems: Telegraph," and "Baudot, Jean-Maurice-Émile" on the Britannica CD, Version 99 © 1994-1999. Encyclopædia Britannica, Inc.

Thanks to Aaron Hertzmann for suggesting a connection between the letter D and Morse Code.

Jed Hartman <logophilia@kith.org>