From CrossWire Bible Society
Revision as of 10:05, 18 January 2008 by Skc (talk | contribs)


Since many people make mistakes with encodings, here is a short text in which I try to explain what exists.


Computers work on words of bits, usually 8, 16, 32 or 64 bits wide today. These words are meaningless to the computer; they can be interpreted in many ways: as numbers, as instructions, as characters, as true/false values...
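
As a small illustration of this point, the Python sketch below reads the same two bytes in three different ways. The byte values and variable names are chosen only for this example.

```python
# The same two bytes, read three different ways.
raw = bytes([0x48, 0x69])

as_number = int.from_bytes(raw, byteorder="big")  # one 16-bit unsigned integer
as_text = raw.decode("ascii")                     # two ASCII characters
as_flags = [bool(b & 0x01) for b in raw]          # lowest bit of each byte as true/false

print(as_number)  # 18537
print(as_text)    # Hi
print(as_flags)   # [False, True]
```

Nothing in the bytes themselves says which reading is the right one; that is exactly what an encoding has to specify.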

An encoding is a set of rules defining how to translate a sequence of words into a sequence of characters.

Single byte encoding

As computers can easily work with groups of bits called bytes, one of the first solutions was to use a different byte value for every character. It's easy to understand and to program: every byte encodes a single character. It's space efficient (memory was really expensive), and you can measure the length of a string by counting bytes. You can have up to 256 characters, as a byte can have 256 different values.

When computing was mostly a Western affair, ASCII was one of the first and most successful encodings. It defines 128 characters, including the upper-case and lower-case letters of the Latin alphabet, digits, punctuation and control characters. Values above 127 are not defined (leaving room for extensions or extra control values). It is a good set of characters for writing English.
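
The properties of ASCII described above can be checked with a few lines of Python. This is only a sketch; the sample strings are arbitrary.

```python
text = "In the beginning"
data = text.encode("ascii")

# One byte per character, so the byte count equals the character count.
print(len(text), len(data))  # 16 16

# Every ASCII byte value is below 128.
all_below_128 = max(data) < 128
print(all_below_128)  # True

# A character outside the ASCII set (here e with grave accent) cannot be encoded.
try:
    "lumi\u00e8re".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```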

EBCDIC is another byte encoding, defined by IBM around the same time. It includes mostly the same characters, but values above 127 are not free.

As many languages use characters not included in the original ASCII set, many countries, or groups of countries, defined new encodings based on ASCII. The missing characters fill the 128 unused values (above 127). As 128 values are not enough for every alphabet on Earth, there are many of these encodings. Among the most successful is the ISO-8859 family, and in particular ISO-8859-1, which includes characters for 29 Western European and African languages.
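
To see why there are so many of these encodings, the sketch below decodes one and the same byte value with three parts of ISO-8859. The choice of byte (0xE9) and of the three parts is arbitrary; the codec names are the ones shipped with Python.

```python
byte = bytes([0xE9])

western = byte.decode("iso-8859-1")   # Western European part
cyrillic = byte.decode("iso-8859-5")  # Cyrillic part
greek = byte.decode("iso-8859-7")     # Greek part

# Print the code point of each result rather than the glyph itself.
for name, char in [("iso-8859-1", western),
                   ("iso-8859-5", cyrillic),
                   ("iso-8859-7", greek)]:
    print(name, "U+%04X" % ord(char))
```

Each line shows a different character for the same byte, which is why a text cannot be read correctly without knowing which encoding was used to write it.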


These encodings, where 1 byte corresponds to 1 character, are really great for English and work well for languages based on the Latin alphabet, but are really bad for other writing systems. Some of them use more than 256 signs.

You cannot write a text with characters from different encodings, and when sending a text, you always have to agree on a common encoding.

Here is Unicode. But first, learn this: Unicode is not an encoding

Unicode is an industry standard in which every character of any real language, past or present (about 100,000 characters), has its own numeric value, called a code point. A code point can be stored in many ways. The first 128 code points have the same values as in the ASCII encoding.
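
In Python, the built-in functions ord() and chr() convert between a character and its code point, which makes the idea easy to try out. The sample characters below are arbitrary.

```python
# A, e with acute accent, euro sign
samples = ["A", "\u00e9", "\u20ac"]
code_points = [ord(char) for char in samples]

for value in code_points:
    print("U+%04X" % value)  # U+0041, then U+00E9, then U+20AC

# The first 128 code points have the same values as the ASCII bytes.
ascii_matches = all(chr(i).encode("ascii") == bytes([i]) for i in range(128))
print(ascii_matches)  # True
```

Note that a code point is just a number; nothing here says yet how it is stored in bytes.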

Do you remember that Unicode is not an encoding? OK, in fact, Unicode also defines several encodings for storing code points.

Multi-byte encoding

One word = one character was a great rule, so let's keep it. When Unicode was young, code points were all under 65536, so every character could be represented with a 2-byte word (a 16-bit word). This is called UCS-2. UCS-2 is deprecated.

Now code points go up to U+10FFFF, which is beyond 16 bits but well under 2^32, so every character can be represented with a 4-byte word (a 32-bit word). This is called UCS-4, or UTF-32.

For an English text, because every character is encoded with 4 bytes, the size is 4 times bigger than the same text in ASCII encoding.
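
The factor of 4 is easy to verify. In this sketch the "utf-32-be" codec is used rather than plain "utf-32" so that Python does not prepend a byte order mark, which would add 4 extra bytes.

```python
text = "In the beginning"

ascii_size = len(text.encode("ascii"))
utf32_size = len(text.encode("utf-32-be"))  # big-endian, no byte order mark

print(ascii_size, utf32_size)  # 16 64
```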

Variable byte length encoding

UCS-2 and UCS-4 are great, but they take too much space, and an old program that assumes 1 byte = 1 character will not work with them. We need an encoding in which ASCII characters stay the same, but which can still encode any Unicode character.

So UTF-8 was created. It's a variable-length encoding. ASCII characters are encoded exactly as in an ASCII text, and other characters are encoded using 2, 3 or 4 bytes. You should use this for all your SWORD files now.
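
The variable length can be observed directly. The sketch below encodes four arbitrary sample characters, one for each possible length, and checks that pure ASCII text is byte-for-byte identical in both encodings.

```python
# A, e with acute accent, euro sign, musical G clef (U+1D11E)
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

lengths = [len(char.encode("utf-8")) for char in samples]
print(lengths)  # [1, 2, 3, 4]

# Pure ASCII text gives exactly the same bytes in ASCII and in UTF-8.
text = "In the beginning"
same_bytes = text.encode("utf-8") == text.encode("ascii")
print(same_bytes)  # True
```

One consequence is that counting bytes no longer gives the number of characters as soon as the text contains anything outside ASCII.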

For programs written to use UCS-2, there is also an extension: UTF-16 is designed to use a single 2-byte word for characters under 65536, and a pair of words for those above.
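
A short sketch of the two cases, using one character below 65536 and one above. The big-endian codec "utf-16-be" is chosen here only to avoid the byte order mark and make the hex output easier to read.

```python
below = "\u00e9".encode("utf-16-be")      # e with acute accent, code point 0xE9
above = "\U0001D11E".encode("utf-16-be")  # musical G clef, code point 0x1D11E

print(below.hex())  # 00e9      (one 2-byte word)
print(above.hex())  # d834dd1e  (a pair of 2-byte words)
```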

Common mistakes

If you use the wrong encoding, some characters will not be displayed correctly, because the computer decoded them using the wrong rules.

The two most common encodings are ISO-8859-1 and UTF-8.

A UTF-8 text decoded as ISO-8859-1 (or ISO-8859-*)

You will see two or three characters in place of each non-ASCII one, usually Ã and ©.

Et Dieu vit la lumière, qu'elle était bonne; et Dieu sépara la lumière d'avec les ténèbres.

will show as

Et Dieu vit la lumiÃ¨re, qu'elle Ã©tait bonne; et Dieu sÃ©para la lumiÃ¨re d'avec les tÃ©nÃ¨bres.
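
This kind of garbling can be reproduced, and in this particular case undone, with a short Python sketch. The word used is taken from the verse above; the variable names are only illustrative.

```python
original = "lumi\u00e8re"  # "lumiere" with a grave accent on the first e

# Encode as UTF-8, then decode the bytes with the wrong rules.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(len(original), len(garbled))  # 7 8

# Reversing the two steps recovers the text, because ISO-8859-1
# maps every byte value to a character and loses nothing.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

The repair only works while the garbled text has not been altered or saved again through a further wrong conversion.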

An ISO-8859-1 text decoded as UTF-8

The UTF-8 decoder will find an invalid multi-byte sequence; it can report an error, or put question marks or rectangles in its place:

Et Dieu vit la lumière, qu'elle était bonne; et Dieu sépara la lumière d'avec les ténèbres.

will show as

Et Dieu vit la lumi?re, qu'elle ?tait bonne; et Dieu s?para la lumi?re d'avec les t?n?bres.
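
Both behaviours can be seen in Python, where the decoder is strict by default and substitutes a replacement character (U+FFFD) only when asked to. This is a sketch using one word from the verse above.

```python
data = "lumi\u00e8re".encode("iso-8859-1")  # the accented e becomes the single byte 0xE8

# Strict decoding (the default) reports an error.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# With errors="replace", the invalid byte becomes U+FFFD,
# which most programs draw as a question mark or a rectangle.
replaced = data.decode("utf-8", errors="replace")
print(ascii(replaced))  # 'lumi\ufffdre'
```

Unlike the previous case, the replaced text cannot be repaired afterwards: the original byte is gone.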