Character Encoding

There are many character encodings, including 7-bit ASCII, ISO 8859-1 (Latin-1), and the UTF-8 encoding of Unicode.

For details, please see: UTF-8 and Unicode FAQ for Unix/Linux. It is imperative that you understand that Unicode defines codepoints, and that one codepoint can be represented as one or more bytes in UTF-8 (or as two or four bytes in UTF-16). Also, you should be aware that in rare cases one character may be represented by more than one codepoint: a base character and its diacritical marks stored separately (although this is rare if the text is normalized according to Normalization Form C, so you may ignore this for a basic level 1 implementation).
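To make the codepoint/byte distinction concrete, here is a minimal Python sketch (any Unicode-aware language would do):

  # One codepoint may occupy several bytes, depending on the encoding:
  "é".encode("utf-8")               # b'\xc3\xa9' (U+00E9: two bytes in UTF-8)
  "€".encode("utf-8")               # b'\xe2\x82\xac' (U+20AC: three bytes in UTF-8)
  "\U0001D11E".encode("utf-16-be")  # four bytes in UTF-16: a surrogate pair (D834 DD1E)
  # One character may consist of several codepoints:
  len("e\u0301")                    # 2: LATIN SMALL LETTER E + COMBINING ACUTE ACCENT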

Recommended Encodings

In short, you should use UTF-8 if you can. It is based on Unicode, so it allows you to represent virtually every character you can think of (perhaps with the exception of invented scripts like Klingon, and even that is semi-standardized in Unicode's Private Use Area). Best of all, UTF-8 is backward-compatible with ASCII.
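This compatibility is easy to verify; a minimal Python sketch:

  # Pure ASCII text encodes to identical bytes in ASCII and UTF-8,
  # so every ASCII file is already a valid UTF-8 file:
  "plain text".encode("ascii") == "plain text".encode("utf-8")   # True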

Different Types of UTF-8

Note that not all UTF-8 text is byte-for-byte alike. Even between two UTF-8 files, there can be other encoding differences:

Line Ending
The way a line ending is encoded: typically a Line Feed (UNIX), a Carriage Return (classic Mac OS), or both (Windows). Other line break characters (form feed, line tabulation, next line, line separator, and paragraph separator) are extremely uncommon.
Byte Order
The order in which the two bytes of a UTF-16 code unit are stored: big-endian (UTF-16BE, the recommended way) or little-endian (UTF-16LE). Not relevant for UTF-8, since its code units are single bytes.
Byte-order Mark
A character (0xFEFF) at the start of a file that signifies the byte order (big endian is encoded as 0xFE 0xFF, little endian as 0xFF 0xFE). Even though there is no byte order in UTF-8, a BOM is sometimes used at the start of a file to indicate that it is UTF-8 encoded. See also Byte-order mark on Wikipedia.
Normalization Form
Unicode can store ü as either u-umlaut (one codepoint: LATIN SMALL LETTER U WITH DIAERESIS (00FC)) or u + umlaut (two codepoints, a base character plus a combining diacritical mark: LATIN SMALL LETTER U (0075) + COMBINING DIAERESIS (0308)). The first is called composed or precomposed; the second is called decomposed. The translation to composed (Canonical Composition) is called Normalization Form C (NFC); the translation to decomposed (Canonical Decomposition) is called Normalization Form D (NFD). In addition, NFC and NFD each have a variant that first applies a Compatibility Decomposition, yielding the normalizations NFKC and NFKD respectively. Compatibility Decomposition translates ligatures (e.g. ĳ or ﬁ) and formatting characters (e.g. ƒ or ²) to their regular equivalents (e.g. ij, fi, f and 2). See also Unicode Standard Annex #15: Unicode Normalization Forms with its related FAQ. Normalization Form C is the generally recommended form. Be aware that not all character-diacritic combinations exist as a single codepoint (e.g. there is no codepoint for LATIN SMALL LETTER Y WITH BREVE), so even NFC-normalized text can contain separate diacritical codepoints (e.g. LATIN SMALL LETTER Y + COMBINING BREVE).
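To make the normalization forms concrete, here is a minimal sketch using Python's standard unicodedata module (ICU and java.text.Normalizer offer the same operations):

  import unicodedata

  # The same character in both canonical forms:
  nfc = unicodedata.normalize("NFC", "u\u0308")  # composed: U+00FC, one codepoint
  nfd = unicodedata.normalize("NFD", "\u00fc")   # decomposed: U+0075 + U+0308
  len(nfc), len(nfd)                             # (1, 2), yet both display as ü
  # The compatibility (K) forms also unfold ligatures and formatting:
  unicodedata.normalize("NFKC", "\ufb01\u00b2")  # 'fi2' (fi-ligature and superscript two)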

Recommended Encoding by Platform

While there is no universal "recommended encoding", you will see that most file systems use a specific encoding, and most applications follow the conventions of their platform. Thankfully, all modern platforms have converged on Unicode.

  Platform                        Encoding                     Line Ending   Byte Order Mark   Normalization

  File contents:
    POSIX (Linux)                 UTF-8                        Line Feed     No                Precomposed (Form C)
    Mac OS X                      UTF-8                        Line Feed     No                Precomposed (Form C)
    Windows                       UTF-8                        CRLF          Yes               Precomposed (Form C)

  File system (names of files):
    ext3, ReiserFS, ZFS (Linux)   Unicode                      Line Feed     No                Precomposed (Form C)
    HFS (Mac OS X)                Unicode                      Line Feed     No                Decomposed (Form D)
    APFS (Mac OS X)               Unicode                      Line Feed     No                Normalization-insensitive
    FAT (Windows)                 Unicode                      CRLF          Yes               Precomposed (Form C)

  Memory:
    (in-memory strings)           UTF-16 or UTF-8              N/A           No                N/A

  Protocols:
    HTTP                          Latin-1 (default) or UTF-8   Line Feed     No                Mixed

POSIX Recommended Encoding

On Linux and Mac OS X, the recommended encoding is:

  • Unicode character set
  • UTF-8 encoding. This produces ASCII-compatible files, and has no big-endian/little-endian problem.
  • No BOM (byte order mark) at the start of the file. A BOM would break special file starts such as the "#!" shebang, and makes concatenation of two text files more complex than needed.
  • POSIX prefers composed characters over combining sequences (Normalization Form C). This allows simple level 1 implementations, such as the cat and grep command-line tools, to still process the file. The HFS file system on Mac OS X used decomposed sequences (Normalization Form D), so file names there are NFD, even though file contents are typically NFC-normalized. The newer APFS is normalization-insensitive, so it is recommended to stick with NFC, for compatibility with other file systems.
  • Line Feed for line endings

While this is not an official recommendation, it simply makes the most sense to do it this way.
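Put together, a minimal Python sketch of writing a text file that follows these conventions (the file name is just an example):

  import unicodedata

  text = "Zürich\nGenève\n"           # input may arrive in any normalization form
  with open("cities.txt", "w", encoding="utf-8", newline="\n") as f:
      f.write(unicodedata.normalize("NFC", text))  # UTF-8, LF endings, no BOM, NFC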

The advantage of Normalization Form D (NFD) is that it allows a fast byte-wise sort, which is why it is used in the HFS file system. The advantage of Normalization Form C (NFC) is that it is easier to parse for character-processing tools, and so most POSIX tools expect NFC. So you should be aware that most POSIX tools fail to match accented characters in file names on an HFS file system, although they can process those files' contents.
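For example (a Python sketch with a hypothetical file name), a pattern typed in NFC does not byte-match an NFD file name, even though both render identically:

  import unicodedata

  typed = "Zürich.txt"                            # as typed: NFC, ü is one codepoint
  on_disk = unicodedata.normalize("NFD", typed)   # as HFS stores it: u + combining mark
  typed == on_disk                                # False: different byte sequences
  unicodedata.normalize("NFC", on_disk) == typed  # True once normalized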

Windows Default Encoding

On Windows, the most commonly used encoding today is:

  • Unicode character set
  • UTF-8 encoding
  • BOM (byte order mark) at the start of the file (even though a byte order is meaningless in byte-oriented UTF-8!)
  • The file system in Windows uses composed sequences (Normalization Form C)
  • Carriage Return and Line Feed (CRLF) for line endings

While this is an official recommendation, it does not make sense.

Besides the well-known CRLF versus LF difference, Windows recommends using a byte order mark on all UTF encodings, even though it is completely useless in UTF-8. The reason is the (apparently) very poor encoding detection in Windows, where any non-ASCII text file without a BOM is assumed to have the Windows-1252 encoding (this despite the fact that it is relatively straightforward to detect UTF-8 in non-ASCII files).
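When reading files that may or may not start with a UTF-8 BOM, Python's built-in "utf-8-sig" codec accepts both (a minimal sketch; the file name is hypothetical):

  # "utf-8-sig" decodes UTF-8 and silently strips a leading BOM, if present:
  with open("notes.txt", encoding="utf-8-sig") as f:
      text = f.read()                 # text never starts with U+FEFF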

Java

Java uses UTF-16 as its internal encoding. However, I'm told that this internal structure should be irrelevant if you use the proper API.

RDF

While RDF files do not prescribe a particular encoding or line ending, if you use a Unicode encoding such as UTF-8 or UTF-16, you are supposed to use Unicode Normalization Form C. Pure XML does not have this requirement (it may also use Normalization Form D).

Conversion tools

The best-known conversion tool is libiconv. iconv can convert between most encodings, but does not (yet) understand the difference between the normalization forms.
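The same byte-level conversion is available in most languages; a minimal Python sketch, roughly equivalent to iconv -f LATIN1 -t UTF-8:

  latin1_bytes = "café".encode("latin-1")                      # b'caf\xe9'
  utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")  # b'caf\xc3\xa9'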

Most POSIX tools cannot match accented characters in file names on an HFS file system (they expect NFC, while Mac OS X stores NFD). convmv is a tool that performs exactly this conversion.
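A rough Python sketch of the rename that convmv performs (an illustration only, assuming a flat listing of the current directory; use convmv itself for real data, and note that HFS converts names back to NFD, so this is mainly useful on other file systems):

  import os
  import unicodedata

  # Rename decomposed (NFD) file names to their composed (NFC) equivalent:
  for name in os.listdir("."):
      nfc = unicodedata.normalize("NFC", name)
      if nfc != name:
          os.rename(name, nfc)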

If you really want to dig deeper into the different normalization forms, and don't mind a steep learning curve, have a look at the International Components for Unicode (ICU) library (C and Java).