Difference between revisions of "Encoding"
Revision as of 13:57, 18 January 2008
Because many people make mistakes with encodings, here is a short text in which I try to explain what exists.
Introduction
Computers work on words of bits, usually 8, 16, 32 or 64 bits wide today. These words are meaningless to the computer by themselves; they can be interpreted in many ways: as numbers, as instructions, as characters, as true/false values...
An encoding is a set of rules defining how to translate a sequence of words into a sequence of characters.
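As a small illustration of the point above, here is a Python sketch in which the same two bytes are read once as a number and once as text. The byte values and variable names are only examples chosen for this page.

```python
# The same two bytes, interpreted in two different ways.
raw = bytes([0x48, 0x69])

# Interpreted as one big-endian unsigned integer.
as_number = int.from_bytes(raw, byteorder="big")

# Interpreted as characters, using the ASCII encoding rules.
as_text = raw.decode("ascii")

print(as_number)  # 18537
print(as_text)    # Hi
```

The bytes themselves never change; only the rules used to interpret them do.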
Single byte encoding
Because computers can easily work with groups of bits called bytes, one of the first solutions was to use a different byte value for every character. It's easy to understand and to program: every byte encodes a single character; it's space efficient (memory was really expensive); you can measure the length of a string by counting bytes... You can have up to 256 characters, as a byte can have 256 different values.
When computing was mostly a Western story, ASCII was one of the first and most successful encodings. It defines 128 characters, including the upper-case and lower-case letters of the Latin alphabet, digits, punctuation and control characters. Values above 127 are not defined (leaving room for extensions or control values). It is a good set of characters for writing English.
EBCDIC is another byte encoding, defined by IBM around the same time, which includes mostly the same characters, but its values above 127 are not free.
Because many languages use characters not included in the original ASCII set, many countries, or groups of countries, defined new encodings based on ASCII. The missing characters are placed in the 128 unused values (above 127). As 128 values are not enough for every alphabet on Earth, there are many of these encodings. Among the most successful is the ISO-8859 family, and in particular ISO-8859-1, which includes characters for 29 Western European and African languages.
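To see why having many single-byte encodings is a problem, here is a minimal Python sketch that decodes one byte above 127 with three members of the ISO-8859 family. The byte 0xE9 is just an example; the codec names are the ones shipped with Python.

```python
# One byte above 127 means a different character in each ISO-8859 variant.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso-8859-1")    # Western European
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic
as_greek = raw.decode("iso-8859-7")     # Greek

print(as_latin1, as_cyrillic, as_greek)  # é щ ι

# Bytes under 128 are plain ASCII and decode identically everywhere.
ascii_byte = b"A"
same_everywhere = {
    ascii_byte.decode(name)
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}
print(same_everywhere)  # {'A'}
```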
Unicode
These encodings, where 1 byte corresponds to 1 character, are really great for English and work well for languages based on the Latin alphabet, but they are really bad for other writing systems. Some of those use more than 256 signs.
You cannot write a text with characters from different encodings, and when sending a text you always have to agree on a common encoding.
This is where Unicode comes in. But first, learn this: Unicode is not an encoding.
Unicode is an industry standard in which every character of any real language, past or present (about 100,000 characters), has a different numeric value called a code point. This code point can be recorded in many ways. The first 128 code points have the same values as in the ASCII encoding.
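In Python, a string is a sequence of code points, so the built-in ord() and chr() functions give a quick way to look at them. This is only a sketch with three sample characters picked for illustration.

```python
# Look up the code point of a few characters.
sample = "A\u00e9\u20ac"  # 'A', 'é', '€'
code_points = [ord(ch) for ch in sample]

for ch, cp in zip(sample, code_points):
    print(ch, cp, f"U+{cp:04X}")
# A 65 U+0041
# é 233 U+00E9
# € 8364 U+20AC

# chr() goes the other way, from code point to character.
euro = chr(0x20AC)

# For the first 128 code points, the value equals the ASCII byte value.
ascii_matches = "A".encode("ascii")[0] == ord("A")
```

Note that nothing here says how the code points are stored as bytes; that is the job of an encoding.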
Do you remember that Unicode is not an encoding? OK, in fact, Unicode also defines several different encodings to store code points.
Multi-bytes encoding
One word = one character was a great rule, so let's keep it. When Unicode was young, code points were all under 65536, so every character could be represented with a 2-byte word (a 16-bit word). This is called UCS2. UCS2 is deprecated.
Now code points go up to U+10FFFF, so every character can be represented with a 4-byte word (a 32-bit word). This is called UCS4, or UTF-32.
For an English text, because every character is encoded with 4 bytes, the size is 4 times bigger than the same text in ASCII encoding.
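The size difference is easy to check. The sketch below uses Python's "utf-32-be" codec, which writes exactly 4 bytes per character without a byte-order mark; the sample sentence is arbitrary.

```python
# Compare the size of the same English text in ASCII and in UTF-32.
text = "In the beginning"

ascii_size = len(text.encode("ascii"))
utf32_size = len(text.encode("utf-32-be"))  # big-endian, no byte-order mark

print(len(text), ascii_size, utf32_size)  # 16 16 64
```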
Variable byte length encoding
UCS2 and UCS4 are great, but they take too much space, and if your old program assumes 1 byte = 1 character, it will not work with them. We need an encoding in which ASCII characters stay the same but which can still encode any Unicode character.
So UTF-8 was created. It's a variable-length encoding. ASCII characters are encoded exactly as in an ASCII-encoded text, and other characters are encoded using 2, 3 or 4 bytes. You should use this for all your Sword files now.
For programs written to use UCS2, there is also an extension: UTF-16 uses a single 2-byte word for characters under 65536, and a pair of words for those above.
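The variable lengths can be observed directly. This Python sketch encodes four sample characters (chosen for illustration, one per length class) in UTF-8 and UTF-16 and counts the bytes.

```python
# Four characters with increasingly large code points.
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, 'é'
    "\u20ac",      # U+20AC, '€'
    "\U0001d11e",  # U+1D11E, musical G clef, above 65535
]

# UTF-8 uses 1 to 4 bytes per character.
utf8_lengths = [len(s.encode("utf-8")) for s in samples]

# UTF-16 uses one 2-byte word, or a pair of words above 65535.
utf16_lengths = [len(s.encode("utf-16-be")) for s in samples]

print(utf8_lengths)   # [1, 2, 3, 4]
print(utf16_lengths)  # [2, 2, 2, 4]

# An ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
ascii_compatible = "Sword".encode("utf-8") == "Sword".encode("ascii")
```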
Common mistakes
If you use the wrong encoding, some characters will not be displayed correctly, because the computer decodes them using incorrect rules.
The two most common encodings are ISO-8859-1 and UTF-8.
A UTF-8 text decoded as ISO-8859-1 (or ISO-8859-*)
You will see two or three characters in place of each non-ASCII character, usually including Ã and ©.
Et Dieu vit la lumière, qu'elle était bonne; et Dieu sépara la lumière d'avec les ténèbres.
will show as
Et Dieu vit la lumière, qu'elle était bonne; et Dieu sépara la lumière d'avec les ténèbres.
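This mistake can be reproduced in a few lines of Python. The sketch below takes one word from the verse, encodes it as UTF-8 and then decodes the bytes with the wrong rules; as long as nothing else has touched the text, the damage can be undone by reversing the two steps.

```python
original = "lumi\u00e8re"  # 'lumière'

# Encode as UTF-8, then decode with the wrong rules (ISO-8859-1).
garbled = original.encode("utf-8").decode("iso-8859-1")
print(garbled)  # lumiÃ¨re

# Reversing the two steps recovers the original text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # lumière
```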
An ISO-8859-1 text decoded as UTF-8
The UTF-8 decoder will find an invalid multi-byte sequence; it can report an error, or put question marks or squares in its place:
Et Dieu vit la lumière, qu'elle était bonne; et Dieu sépara la lumière d'avec les ténèbres.
will show as
Et Dieu vit la lumi?re, qu'elle ?tait bonne; et Dieu s?para la lumi?re d'avec les t?n?bres.
Warning: If you see squares, it can also be a font problem! This happens if you're using a font which does not define some of the characters you are using.