Why do we have so many character sets (or encodings)? What is a character set anyway?

The problem is that, internally, computers store numbers. At the lowest level a single 'bit' of memory can store a zero or a one. It's easy to transmit individual bits through electrical wires or fibre optics, or to store them in memory or on a disk. A single bit on its own isn't very useful, so bits are normally grouped into 'bytes'. Each byte contains 8 bits and is capable of storing a number between zero and 255.
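(As a quick sanity check of that arithmetic, here's a short sketch in Python - used purely for illustration throughout this page - showing that 8 bits give exactly 256 possible values.)

    # Eight bits can hold 2**8 = 256 distinct patterns, i.e. the numbers 0 to 255.
    BITS_PER_BYTE = 8
    print(2 ** BITS_PER_BYTE)        # 256
    print(int("11111111", 2))        # all eight bits set -> 255, the largest value
    print(int("00000000", 2))        # no bits set -> 0, the smallest value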

In order to work with text (such as this) computers must have a scheme for storing it as numbers. The normal approach is to split the text into a series of characters (a character is a single letter, numerical digit, punctuation mark or similar). All we need then is a system for storing each character as a number. We've all seen such schemes in various places, for example using the number 1 to mean 'A', the number 2 to mean 'B', etc. This simple scheme quickly runs into limitations: how do you tell the difference between upper and lower case letters and how do you represent numerical digits or punctuation?
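To make that concrete, here is a tiny Python sketch of the hypothetical 1 = 'A', 2 = 'B' scheme described above (it isn't any real standard, just an illustration of the idea and its limitation):

    # A made-up scheme: 1 means 'A', 2 means 'B', ... 26 means 'Z'.
    naive_scheme = {number: letter
                    for number, letter in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ", start=1)}

    print(naive_scheme[8], naive_scheme[9])   # H I - fine for upper case letters...
    print(naive_scheme.get(27))               # None - no numbers left for 'a', '7' or '?'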

A character set (also referred to as an encoding or, mainly by Microsoft, a code page) is simply a list of possible characters and the numbers they should be stored as. When a computer reads a series of numbers (from a disk, network or memory) that it believes should be interpreted as text, it can refer to the character set to see which characters should be displayed. If different computers use different character sets then it becomes almost impossible to transfer textual data between them, so let's start with the good news.

ASCII (and what's wrong with it)

ASCII, the American Standard Code for Information Interchange, is a basic character set published in 1963 and now used by nearly every computer in existence.

Number Character    Number Character    Number Character    Number Character
 32    (space)       56    8             80    P            104    h
 33    !             57    9             81    Q            105    i
 34    "             58    :             82    R            106    j
 35    #             59    ;             83    S            107    k
 36    $             60    <             84    T            108    l
 37    %             61    =             85    U            109    m
 38    &             62    >             86    V            110    n
 39    '             63    ?             87    W            111    o
 40    (             64    @             88    X            112    p
 41    )             65    A             89    Y            113    q
 42    *             66    B             90    Z            114    r
 43    +             67    C             91    [            115    s
 44    ,             68    D             92    \            116    t
 45    -             69    E             93    ]            117    u
 46    .             70    F             94    ^            118    v
 47    /             71    G             95    _            119    w
 48    0             72    H             96    `            120    x
 49    1             73    I             97    a            121    y
 50    2             74    J             98    b            122    z
 51    3             75    K             99    c            123    {
 52    4             76    L            100    d            124    |
 53    5             77    M            101    e            125    }
 54    6             78    N            102    f            126    ~
 55    7             79    O            103    g

ASCII only defines meanings for the numbers 0 to 127, and numbers of this size only require 7 bits to store. This leaves one bit per byte unused, which can be used for error detection when sending data over a network. The numbers 0 to 31 and 127 aren't shown in the table because they are assigned special meanings and are often referred to as 'non-printing' or 'control' characters. Some of the most common control characters are 'new line' (10) and 'carriage return' (13), which are used by computers to mark the end of a line of text. The latter gives you a hint that ASCII was designed to be able to control teletype (typewriter-like) printers. There are also control characters for 'backspace', 'tab', 'bell' and a few others relating to sending data over a network. You can find more information on ASCII on wikipedia.org.
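Here's a quick Python sketch that reads off a few of the values from the table above (Python's built-in ord and chr follow ASCII for this range):

    # ASCII stores 'A' as 65, 'a' as 97 and '0' as 48 (see the table above).
    for ch in "Aa0~":
        print(ch, ord(ch))

    # 'Carriage return' (13) and 'new line' (10) are control characters that
    # mark the end of a line of text.
    line = "first line\r\nsecond line"
    print([ord(c) for c in line if ord(c) < 32])    # [13, 10]

    # Everything ASCII defines fits into 7 bits, i.e. values below 128.
    print(all(ord(c) < 128 for c in line))          # True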

As you can see, although it defines the most common characters you would normally need for English text, ASCII is severely limited. There are no accented characters, no pounds sterling, yen or Euro symbols, no Cyrillic characters, no Greek characters and no characters from Oriental or Middle Eastern alphabets. The omission of even a few of these characters makes ASCII inadequate, even for the English-speaking world in which most major computing companies currently reside. A real solution to the problem requires that all of these characters be included in one truly global character set.

Beyond ASCII

Unfortunately there is an obvious partial solution to the problem of the characters missing from ASCII. ASCII only uses the numbers 0-127, yet a byte can store values up to 255 (and with modern communication protocols the 8th bit in each byte is no longer needed for error detection). We can use the numbers 128-255 to represent some of the characters missing from ASCII.
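A quick Python sketch shows that ASCII itself attaches no meaning to these spare numbers - a strict ASCII decoder simply rejects them:

    # ASCII only covers 0-127, so the byte value 200 has no ASCII meaning at all.
    try:
        bytes([200]).decode("ascii")
    except UnicodeDecodeError as err:
        print(err)    # ...ordinal not in range(128)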

Most computer vendors haven't let the numbers 128-255 go to waste: they've defined their own character sets that use them. The IBM PC (and compatibles) by default includes a few accented characters, the British currency symbol, and various line-drawing characters in this space. This is the character set used by most MS-DOS applications, and is now known as code page 437 (cp437), although older documentation may simply refer to it as the 'IBM Extended character set'. Other types of computers have their own sets of characters in the range 128-255, although most of these character sets (like the computers that used them) have fallen into obscurity and you're unlikely to find files using them. The other notable vendor character set is that used by the Apple Macintosh (see the MacRoman character set, although note that as of Mac OS 8.5 character 219 has been changed from the rarely used generic currency symbol to the Euro symbol).
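Python happens to ship tables for both of these vendor character sets (as the codecs 'cp437' and 'mac_roman'), so a short sketch can show how the same spare number means something different on each platform:

    # The single byte value 163 is 'ú' in the IBM PC's cp437,
    # but the pound sign '£' in MacRoman.
    b = bytes([163])
    print(b.decode("cp437"))        # ú
    print(b.decode("mac_roman"))    # £

    # Python's MacRoman table reflects the Mac OS 8.5 change: 219 is now the Euro.
    print(bytes([219]).decode("mac_roman"))    # €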

Even if all the vendors could agree, the 128 numbers in the range 128-255 aren't enough to provide for all the alphabets and other characters in use around the world. Undeterred, the standards bodies published ISO 8859, which specifies a series of 8-bit character sets. Each one is based on ASCII, but includes a different set of characters in the range 128-255, tailored to the needs of a specific region: ISO 8859-1 (Latin1) is tailored to western Europe, ISO 8859-5 is Cyrillic, and so the list goes on. Of course the standard wasn't perfect, and failed to predict the creation of the Euro, so ISO 8859-15 was created to replace ISO 8859-1.
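Another short Python sketch shows the Euro problem directly: ISO 8859-1 simply has no number for '€', while ISO 8859-15 reassigns the number 164 to it.

    # ISO 8859-1 (latin-1) predates the Euro, so it cannot store '€' at all.
    try:
        "€".encode("latin-1")
    except UnicodeEncodeError as err:
        print(err)

    # ISO 8859-15 swaps the rarely used generic currency symbol (164) for '€'.
    print(list("€".encode("iso8859-15")))    # [164]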

To quote from unicode.org: "These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption [as the characters may get confused]." In other words, if you forget which encoding was used to store your text, and attempt to read it using a different encoding, then it may become corrupted. Any characters stored as numbers in the range 128-255 may be interpreted as a totally different character to the one that was originally intended.
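This kind of corruption is easy to reproduce. In the Python sketch below, some text is stored using ISO 8859-1 and then read back as if it were cp437: every number in the 128-255 range silently turns into the wrong character, with no error to warn you.

    # Store text using one 8-bit character set...
    stored = "déjà vu".encode("latin-1")    # é -> 233, à -> 224
    # ...and read it back assuming a different one.
    print(stored.decode("cp437"))           # dΘjα vu  (233 is 'Θ' and 224 is 'α' in cp437)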

Unicode

Unicode is designed to be the 'one true' character set: big enough to contain every character used by every language on the planet, and hopefully big enough to cope with the introduction of the odd new currency or two. In this it looks like it may be successful, although complete adoption across the entire computing industry will inevitably take time. If everyone can agree to use this one character set then it will solve the problem of having to know which encoding was used to store a piece of text. Unicode is, unfortunately, not without its drawbacks.

Unicode defines characters using numbers not in the range 0 to 255, but in the range from 0 to well over 1 million. This means that you cannot store a Unicode character in 1 byte, or even 2 bytes. Since computers work well with powers of two, this means that Unicode requires 4 bytes (32 bits) per character, an encoding normally called 'UTF-32'. While this is reasonably convenient, since most computers now have (at least) 32 bit processors and buses and so can deal with 32 bit characters without a problem, it means that most textual data would take 4 times the memory or disk space to store, and 4 times as long to transfer over a network, as the alternative (8-bit) encodings. Compression would solve this problem in some circumstances, but is not ideal - particularly for strings stored in memory and in active use, as the process of compressing and decompressing the data would be quite slow.
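A short Python sketch makes the cost concrete: in UTF-32 even plain English text spends four bytes on every character.

    # UTF-32 uses a fixed four bytes per character ('le' = little endian, no byte-order mark).
    text = "hello"
    encoded = text.encode("utf-32-le")
    print(len(text), len(encoded))    # 5 characters -> 20 bytes
    print(list(encoded[:4]))          # [104, 0, 0, 0] - 'h' (ASCII 104) padded out to 32 bits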

To make the use of the Unicode character set more appealing, two Unicode transformation formats are defined. These allow common characters to be stored using shorter sequences of bytes, with longer sequences used for other characters. This means that text will normally take up much less room than it would using UTF-32 - at the expense of being more complicated to process.

UTF-8 is designed to ease compatibility with existing code that is written to expect 8 bit characters. Like most 8 bit character sets, the numbers 0 to 127 are used for exactly the same characters as in ASCII. Numbers in the range 128-255 have a special meaning that is more complicated than in a normal character set: in order to convert these numbers into a character you must also read the next few bytes, and consider them together as a single (multi-byte) character. The encoding is carefully designed so that it will not confuse older programs - the numbers 0 to 127 are only ever used to represent the equivalent ASCII meaning, and are never used as part of a multi-byte character. The encoding is also designed so that even if you only have part of a piece of UTF-8 text it can still be decoded: for example, it is always possible to find the beginning and end bytes of a multi-byte character without having to decode the entire string up to that point. This system is efficient at storing certain types of text, in particular English; however, it is fairly inefficient at storing characters that are not part of ASCII, so it is less appealing for many foreign languages which require the frequent use of such characters.
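The variable-length behaviour is easy to see in a Python sketch: ASCII characters keep their single-byte values, while everything else grows to two or more bytes made up entirely of values of 128 and above.

    # ASCII characters stay as one byte; other characters need two or more.
    for ch in ["a", "£", "€", "中"]:
        encoded = ch.encode("utf-8")
        print(ch, len(encoded), list(encoded))
    # a 1 [97]            - identical to its ASCII value
    # £ 2 [194, 163]
    # € 3 [226, 130, 172]
    # 中 3 [228, 184, 173]

    # The bytes of a multi-byte character are all 128 or above, so they can
    # never be mistaken for ASCII characters by an older program.
    print(all(b >= 128 for b in "€中".encode("utf-8")))    # True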

The second transformation encoding for Unicode is UTF-16. The original intention was to be able to represent all Unicode characters in 2 bytes (16 bits); however, the 65536 characters this allows aren't enough to cover all the languages currently in use. In UTF-16 almost all of the commonly used Unicode characters are represented using exactly two bytes (16 bits), including all the common Chinese and Japanese characters. Some (fairly rare) characters need two sets of 16 bits (i.e. 32 bits), known as a 'surrogate pair'. This means that UTF-16 data normally takes up about half the room that the same data encoded using UTF-32 would, and never takes up more - and at the same time it avoids much of the complexity and inefficiency associated with UTF-8. This is at the expense of not working at all with programs designed to work with 8 bit characters.
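In a Python sketch the difference from UTF-8 and UTF-32 is clear: common characters, including Chinese ones, take two bytes, and only the rarer characters fall back to a four-byte surrogate pair.

    # Two bytes for almost everything in common use...
    for ch in ["A", "€", "中"]:
        print(ch, len(ch.encode("utf-16-le")), "bytes")    # 2 bytes each

    # ...but four bytes (a surrogate pair) for rarer characters such as the
    # musical G clef, which lies outside the 16-bit range.
    print(len("𝄞".encode("utf-16-le")), "bytes")           # 4 bytes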

There is a final complexity associated with both UTF-16 and UTF-32: 'endianness', or 'byte ordering'. To explain the problem it's probably easiest to think about the numbering system we, as humans, are most used to: decimal. Take, for example, the number 27. That (as I'm sure you'll remember from school days) is 2 'tens' and 7 'units'. We, by convention, always write the tens before the units - but it would make equal sense to write the same number the other way round, as 7 then 2, providing that everyone agrees that this is the way round to do things. When a computer deals with a number that cannot be stored in a single byte it must decide which order to put the bytes in. Unfortunately not all computers agree on which order this should be: in general, computers designed around Intel or compatible CPUs (including PCs) will put the 'low-order' byte (sort of like the 'units') first and the 'high-order' byte last (called 'little endian'), while other computers will generally put the bytes the other way round ('big endian'). UTF-16 and UTF-32 define 'byte-order marks' (BOMs) which can be placed inside data to allow the computer to detect the endianness of the data. Transferring UTF-16 or UTF-32 data between computers relies on the inclusion of byte-order marks, and on the receiving computer correctly processing them. Clearly this is another complexity of Unicode that the world could do without.
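A final Python sketch shows the byte-ordering problem, and the byte-order mark that is used to solve it.

    # The same character, with its two bytes in a different order on each convention.
    print(list("€".encode("utf-16-le")))   # [172, 32] - low-order byte first (little endian)
    print(list("€".encode("utf-16-be")))   # [32, 172] - high-order byte first (big endian)

    # Plain 'utf-16' prepends a byte-order mark so the reader can tell which is which
    # (on a little-endian machine the BOM comes out as 255, 254).
    print(list("€".encode("utf-16")))      # e.g. [255, 254, 172, 32]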

Conclusion

While, in theory, Unicode is the only popular character set with the potential to become 'the one global character set' it is unlikely to completely replace all other character sets for a long time yet. The extra storage space associated with the 'more pure' UTF-32 (or even UTF-16) encodings when storing English text is more of a perceived than a real problem on most modern computers, but has the potential to be an issue for small devices such as PDAs, embedded devices or over low bandwidth networks. Breaking the conceptual link between a byte and a character is also an issue for many developers, as well as for legacy systems with which compatibility is a necessity.

UTF-8 has an important role to play as a bridge: it can generally be processed by legacy systems, while providing access to the full range of characters supported by Unicode on systems that understand it properly. Its inefficient representation of most non-ASCII characters and the complexity of what should be simple operations (such as counting the number of characters in a string) make it unlikely to be adopted as the preferred encoding that everyone should strive for.

What we are likely to see is a gradually increasing number of applications that work with Unicode characters internally, but re-code the data using one of the traditional 8-bit character sets when communicating with other systems. As the number of Unicode-aware applications increases it will become increasingly common for such communication to take place using UTF-8, allowing the applications to communicate to each other the full range of characters they are able to represent internally (even across communication systems that do not themselves support Unicode). I suspect that only when the vast majority of applications support Unicode will we see a real migration away from 8-bit character sets and towards the adoption of either UTF-16 or UTF-32 as the default encoding for transmitting and storing data. Any migration will be painful for users, and will involve the complexity of byte-order detection - so it's my suspicion that UTF-8 will slowly become a de facto standard for communication that will stick for a long time yet.

© Copyright 2011, Stephen White