Encoding

From CrossWire Bible Society
Revision as of 12:56, 31 August 2011 by David Haslam (talk | contribs) (Normalization: [http://en.wikipedia.org/wiki/Combining_character Combining characters] are permitted in source text, e.g. for diacriticals where there is no precomposed character in the Unicode S)

Because many people make mistakes with encodings, here is a short text in which I try to explain what exists.

Introduction

Computers work on words of bits, usually 8, 16, 32 or 64 bits wide today. These words are meaningless to the computer by themselves; they can be interpreted in many ways: as numbers, as instructions, as characters, as true/false values...

An encoding is a set of rules defining how to translate a sequence of words into a sequence of characters.
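
As a minimal Python sketch of this idea (the byte values are chosen purely for illustration), the same two bytes give a number under one rule and text under another:

```python
# The same two bytes, read under two different sets of rules.
raw = bytes([0x48, 0x69])

# Rule 1: treat the bytes as one 16-bit big-endian integer.
as_number = int.from_bytes(raw, "big")

# Rule 2: treat each byte as one ASCII character.
as_text = raw.decode("ascii")

print(as_number)  # 18537
print(as_text)    # Hi
```

Neither reading is more "correct" than the other; only the agreed rule decides what the bytes mean.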

Single byte encoding

Because computers can easily work with groups of bits called bytes, one of the first solutions was to use a different byte value for every character. It's easy to understand and to program: every byte encodes a single character; it's space efficient (memory was really expensive); you can measure the length of a string by counting bytes... You can have up to 256 characters, as a byte can have 256 different values.

When computing was mostly a Western affair, ASCII was one of the first and most successful encodings. It defines 128 characters, including the upper-case and lower-case letters of the Latin alphabet, digits, punctuation and control characters. Values above 127 are not defined (leaving room for extensions or extra control values). It is a good set of characters for writing English.
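
The properties above can be checked with a short Python sketch (the sample strings are only illustrations): one byte per character, all values below 128, and no way to encode anything outside the set.

```python
text = "In the beginning"
data = text.encode("ascii")

# One byte per character, and every byte value is below 128.
print(len(text), len(data))  # 16 16
print(max(data) < 128)       # True

# A character outside the ASCII set cannot be encoded at all.
try:
    "lumière".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)             # False
```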

EBCDIC is another byte encoding, defined by IBM around the same time, which includes mostly the same characters, but its values above 127 are not free.

Because many languages use characters not included in the original ASCII set, many countries, or groups of countries, defined new encodings based on ASCII. The missing characters fill the 128 unused values (above 127). As 128 values are not enough for every alphabet on earth, there are many of these encodings. Among the most successful is the ISO-8859 family, in particular ISO-8859-1, which includes characters for 29 Western European and African languages.
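
To see why these extensions are mutually incompatible, here is a small Python sketch (the byte 0xE9 is an arbitrary choice) that decodes one and the same byte with three members of the ISO-8859 family:

```python
# One byte above 127, decoded with three different ISO-8859 parts.
byte = bytes([0xE9])

western = byte.decode("iso-8859-1")   # Latin small letter e with acute
cyrillic = byte.decode("iso-8859-5")  # Cyrillic small letter shcha
greek = byte.decode("iso-8859-7")     # Greek small letter iota

for name, ch in [("iso-8859-1", western), ("iso-8859-5", cyrillic), ("iso-8859-7", greek)]:
    print(name, f"U+{ord(ch):04X}")
# iso-8859-1 U+00E9
# iso-8859-5 U+0449
# iso-8859-7 U+03B9
```

The byte is identical in all three cases; only the table used to read it changes.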

Unicode

These encodings where 1 byte corresponds to 1 character are really great for English and work well for languages based on the Latin alphabet, but they are really bad for other writing systems. Some of them use more than 256 signs.

You cannot write a text with characters from different encodings, and when sending a text, you always have to agree on a common encoding.

Enter Unicode. But first, learn this: Unicode is not an encoding.

Unicode is an industry standard in which every character of any real language, past or present (now over 100,000 characters), has its own numeric value called a code point. A code point can be recorded in many ways. The first 128 code points have the same values as in the ASCII encoding.
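
In Python, a string is a sequence of code points, so the numbers can be inspected directly. The following sketch uses a few sample characters chosen for illustration:

```python
# A, é, Greek alpha, Hebrew alef
samples = ["A", "\u00e9", "\u03b1", "\u05d0"]

for ch in samples:
    # ord() gives the code point; U+XXXX is the usual way to write it.
    print(f"U+{ord(ch):04X}", ord(ch))
# U+0041 65
# U+00E9 233
# U+03B1 945
# U+05D0 1488

# For the first 128 code points, the value equals the ASCII byte value.
print(ord("A") == "A".encode("ascii")[0])  # True
```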

Do you remember that Unicode is not an encoding? OK, in fact, Unicode also defines several different encodings for storing code points.

Multi-byte encoding

One word = one character was a great rule, so let's keep it. When Unicode was young, all code points were below 65536, so every character could be represented with one 2-byte (16-bit) word. This is called UCS2. UCS2 is deprecated.

Today, code points go up to 0x10FFFF, which fits comfortably below 2^32, so we can represent every character with one 4-byte (32-bit) word. This is called UCS4, or UTF-32.

For an English text, because every character is encoded with 4 bytes, the size is 4 times bigger than the same text in ASCII encoding.
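
The factor of four is easy to verify. In this Python sketch the codec name "utf-32-be" is used on purpose: it fixes the byte order and writes no byte order mark, so only the characters themselves are counted.

```python
text = "In the beginning"

ascii_size = len(text.encode("ascii"))
utf32_size = len(text.encode("utf-32-be"))  # fixed byte order, no byte order mark

print(ascii_size, utf32_size)        # 16 64
print(utf32_size == 4 * ascii_size)  # True
```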

Variable byte length encoding

UCS2 and UCS4 are great, but they take too much space, and an old program that assumes 1 byte = 1 character will not work with them. We need an encoding in which ASCII characters stay the same, but which can encode any Unicode character.

So UTF-8 was created. It is a variable-length encoding: ASCII characters are encoded exactly as in an ASCII text, and all other characters are encoded using 2, 3 or 4 bytes. You should use this for all your SWORD files now.
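
A short Python sketch makes the variable length visible. The four sample characters are arbitrary picks, one for each possible length:

```python
# A, é, euro sign, Gothic letter ahsa (a code point above 65535)
samples = ["A", "\u00e9", "\u20ac", "\U00010330"]

sizes = [len(ch.encode("utf-8")) for ch in samples]
print(sizes)  # [1, 2, 3, 4]

# Pure ASCII text produces exactly the same bytes in both encodings.
ascii_text = "In the beginning"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)  # True
```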

For programs written to use UCS2, there is also an extension: UTF-16 is designed to use a single 2-byte word for code points below 65536, and a pair of words for those above.
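
The single word and the pair of words can be seen in the following Python sketch. As an assumption for readability, it uses the "utf-16-be" codec so that no byte order mark is added; the sample characters are illustrative.

```python
# A code point below 65536 takes one 16-bit word.
bmp = "\u00e9".encode("utf-16-be")                # é
# A code point above 65535 takes a pair of 16-bit words.
supplementary = "\U00010330".encode("utf-16-be")  # Gothic letter ahsa

print(len(bmp), bmp.hex())                      # 2 00e9
print(len(supplementary), supplementary.hex())  # 4 d800df30
```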

Common mistakes

If you use the wrong encoding, some characters will not be displayed correctly, because the computer decoded them using the wrong rules.

The two most common encodings are ISO-8859-1 and UTF-8.

1. A UTF-8 text decoded as ISO-8859-1 (or ISO-8859-*)

You will see two or three characters in place of each non-ASCII one, usually including Ã and ©.

Et Dieu vit la lumière, qu'elle était bonne; et Dieu sépara la lumière d'avec les ténèbres.

will show as

Et Dieu vit la lumiÃ¨re, qu'elle Ã©tait bonne; et Dieu sÃ©para la lumiÃ¨re d'avec les tÃ©nÃ¨bres.
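
This first mistake can be reproduced on purpose. The Python sketch below uses a single word from the verse: it encodes the word as UTF-8, then decodes the bytes with the wrong rules.

```python
original = "lumière"

# Encode correctly as UTF-8, then decode with the wrong rules.
garbled = original.encode("utf-8").decode("iso-8859-1")
# garbled is now "lumiÃ¨re": the two UTF-8 bytes of "è" became two characters.
print(len(original), len(garbled))  # 7 8

# ISO-8859-1 maps every byte to a character, so nothing was lost
# and the damage can be undone by reversing the two steps.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

The repair only works while the garbled text has not been edited or re-saved in yet another encoding.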

2. An ISO-8859-1 text decoded as UTF-8

The UTF-8 decoder will find an invalid multi-byte sequence; it can report an error, or put question marks or squares in its place:

Et Dieu vit la lumière, qu'elle était bonne; et Dieu sépara la lumière d'avec les ténèbres.

will show as

Et Dieu vit la lumi?re, qu'elle ?tait bonne; et Dieu s?para la lumi?re d'avec les t?n?bres.
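
Both behaviours of the decoder can be sketched in Python with one word from the verse. The strict mode reports an error; the "replace" mode substitutes U+FFFD, the replacement character, which is what many programs show as a question mark or a square.

```python
data = "lumière".encode("iso-8859-1")  # b'lumi\xe8re'

# Strict decoding (the default) reports the invalid sequence as an error.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# Lenient decoding puts U+FFFD where the invalid byte was.
lenient = data.decode("utf-8", errors="replace")
print(lenient == "lumi\ufffdre")  # True
```

Unlike the first mistake, this one loses information: once the byte has been replaced, the original letter cannot be recovered from the decoded text.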

Warning: If you see squares, it can also be a font problem! This happens if you're using a font which does not define some of the characters you are using.

3. Using a font that does not include the characters you want

If you use accented Greek characters, for example, but do not use a Unicode font that supports precomposed Greek characters (like @Arial Unicode MS), the accented characters will appear as empty rectangles in your text. In BibleCS you can change the font in Options/Preferences/Display.

Normalization

For module making, it is strongly recommended that Unicode source files encoded as UTF-8 are normalized to NFC. Combining characters are permitted in source text, e.g. for diacriticals where there is no precomposed character in the Unicode Standard.
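
What NFC does can be sketched with Python's standard unicodedata module. The sample strings are illustrations only, and this is not a description of how any SWORD tool performs normalization.

```python
import unicodedata

# "e" followed by a combining grave accent: two code points.
decomposed = "e\u0300"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e8")            # True: the precomposed e with grave

# "q" with a combining tilde has no precomposed form,
# so NFC leaves the combining character in place.
no_precomposed = "q\u0303"
unchanged = unicodedata.normalize("NFC", no_precomposed) == no_precomposed
print(unchanged)  # True
```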

External links

  • TECkit – a Text Encoding Conversion toolkit.
As well as being able to convert between encodings, it can also be used to convert text from one script to another, for any language that can be represented by more than one script. Example: Some languages in Central Asia can be represented in either Latin or Cyrillic script.