For details, please see: UTF-8 and Unicode FAQ for Unix/Linux. It is imperative that you understand that Unicode defines codepoints, and that one codepoint can be represented as one to four bytes in UTF-8 (or as two or four bytes in UTF-16). Also, you should be aware that one character may be represented by more than one codepoint: the base character and its diacritical marks separately (although this is rare if the text is normalized according to Normalization Form C, so you may ignore this for a basic level 1 implementation).
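As a minimal sketch of the difference between codepoints and bytes, the following Python snippet counts both for a few sample strings. The sample characters and the helper name `encoded_sizes` are chosen here for illustration only.

```python
def encoded_sizes(text):
    """Return (codepoints, UTF-8 bytes, UTF-16 bytes) for text."""
    return (
        len(text),                      # Python strings are sequences of codepoints
        len(text.encode("utf-8")),
        len(text.encode("utf-16-be")),  # explicit byte order, so no BOM is added
    )

samples = {
    "A": "A",                   # U+0041, plain ASCII
    "u-umlaut": "\u00fc",       # U+00FC, one precomposed codepoint
    "euro": "\u20ac",           # U+20AC
    "g-clef": "\U0001d11e",     # U+1D11E, outside the Basic Multilingual Plane
    "u + diaeresis": "u\u0308", # one visible character, two codepoints
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

The last sample is the case mentioned above: a single visible character that consists of two codepoints.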
In short, you should use UTF-8 if you can. It is based on Unicode, so it allows you to represent virtually all characters you can think of (perhaps with the exception of made-up scripts like Klingon, and even that has an unofficial assignment in the Unicode Private Use Area). Best of all, UTF-8 is compatible with ASCII: every ASCII file is also a valid UTF-8 file.
Different Types of UTF-8
Note that two UTF-8 files are not necessarily encoded in the same way. They can still differ in the following respects:
- Line Ending
- The way a line ending is encoded. Typically with a Line Feed (UNIX), Carriage Return (old versions of Mac OS) or both (Windows). Other line break characters (form feed, line tabulation, next line, line separator, and paragraph separator) are extremely uncommon.
- Byte Order
- The order in which the two bytes of each UTF-16 code unit are stored: big endian (UTF-16BE, the recommended way) or little endian (UTF-16LE). Not relevant for UTF-8, since its code unit is a single byte.
- Byte-order Mark
- A character (U+FEFF) at the start of a file that signifies the byte order (in UTF-16 Big Endian it is encoded as 0xFE 0xFF, in Little Endian as 0xFF 0xFE). Even though there is no byte order in UTF-8, a BOM (encoded as 0xEF 0xBB 0xBF) is sometimes used at the start of a file to indicate that it has UTF-8 encoding. See also Byte-order mark on Wikipedia.
- Normalization Form
- Unicode can store ü as either u-umlaut (one codepoint: LATIN SMALL LETTER U WITH DIAERESIS (00FC)) or u + umlaut (two codepoints, character and combining diacritical mark: LATIN SMALL LETTER U (0075) + COMBINING DIAERESIS (0308)). The first is called composed or precomposed; the second is called decomposed. The translation to composed (Canonical Composition) is called Normalization Form C (NFC); the translation to decomposed (Canonical Decomposition) is called Normalization Form D (NFD). In addition, NFC and NFD may be preceded by a Compatibility Decomposition, forming the normalizations NFKC and NFKD respectively. Compatibility Decomposition translates ligatures (e.g. ĳ or ﬁ) and formatting variants (e.g. ²) to their regular equivalents (e.g. ij, fi and 2). See also Unicode Standard Annex #15: Unicode Normalization Forms with related FAQ. Normalization Form C is the more-or-less recommended form. Be aware that not all character-diacritical combinations exist as a single codepoint (e.g. there is no codepoint for Latin small letter y with breve), so even NFC-normalized text can contain separate diacritical codepoints (e.g. Latin small letter y + Combining breve).
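The byte order, byte-order mark and normalization differences listed above can be observed directly. The sketch below uses Python's built-in codecs and the standard `unicodedata` module; the variable names and sample characters are illustrative choices, not part of any recommendation.

```python
import unicodedata

# Byte order: the same codepoint (U+20AC) stored both ways in UTF-16.
euro_be = "\u20ac".encode("utf-16-be")  # bytes 0x20 0xAC
euro_le = "\u20ac".encode("utf-16-le")  # bytes 0xAC 0x20

# Byte-order mark: U+FEFF as it appears in each encoding.
bom_be = "\ufeff".encode("utf-16-be")   # 0xFE 0xFF
bom_le = "\ufeff".encode("utf-16-le")   # 0xFF 0xFE
bom_utf8 = "\ufeff".encode("utf-8")     # 0xEF 0xBB 0xBF

# Normalization: composed versus decomposed u with diaeresis.
composed = "\u00fc"     # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"  # LATIN SMALL LETTER U + COMBINING DIAERESIS
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Compatibility normalization also replaces the fi ligature (U+FB01)
# and superscript two (U+00B2) by their regular equivalents.
nfkc = unicodedata.normalize("NFKC", "\ufb01\u00b2")

# There is no precomposed y with breve, so NFC keeps the combining mark.
y_breve = unicodedata.normalize("NFC", "y\u0306")

print(euro_be.hex(), euro_le.hex())
print(bom_be.hex(), bom_le.hex(), bom_utf8.hex())
print(nfc == composed, nfd == decomposed, nfkc, len(y_breve))
```

Note that `composed == decomposed` is false in Python: the two strings look identical when rendered but are different codepoint sequences until one of them is normalized.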
Recommended Encoding by Platform
While there is no "recommended encoding", you will see that most file systems use a specific encoding, and most applications use that same encoding. Thankfully, all decent file systems use Unicode for file names.
|Platform||Encoding||Line Ending||Byte Order Mark||Normalization|
|POSIX (Linux)||UTF-8||Line Feed||No Byte Order Mark||Precomposed (Form C)|
|Mac OS X||UTF-8||Line Feed||No Byte Order Mark||Precomposed (Form C)|
|Windows||UTF-8||Carriage Return and Line Feed (CRLF)||Byte Order Mark||Precomposed (Form C)|
|File system (names of files)|
|ext3, ReiserFS, ZFS (Linux)||Unicode||Line Feed||No Byte Order Mark||Precomposed (Form C)|
|HFS (Mac OS X)||Unicode||Line Feed||No Byte Order Mark||Decomposed (Form D)|
|APFS (Mac OS X)||Unicode||Line Feed||No Byte Order Mark||Normalization insensitive|
|FAT (Windows)||Unicode||Carriage Return and Line Feed (CRLF)||Byte Order Mark||Precomposed (Form C)|
|UTF-16 or UTF-8||N/A||No Byte Order Mark||N/A|
|HTTP||Latin-1 (default) or UTF-8||Line Feed||No Byte Order Mark||Mixed|
POSIX Recommended Encoding
On Linux and Mac (OS X), the recommended encoding is:
- Unicode character set
- UTF-8 encoding. This produces ASCII-compatible files, and has no big-endian/little endian problem.
- No BOM (byte order mark) at the start of the file. A BOM would break special file starts such as the "#!" shebang, and makes concatenation of two texts more complex than needed.
- POSIX prefers composed characters instead of combining sequences (Normalization Form C). This allows simple level 1 implementations such as the cat and grep command line tools to still process the file. The HFS filesystem on Mac OS X uses decomposed sequences (Normalization Form D), so file names use NFD, even though the contents of files are typically NFC normalized. The newer APFS accepts both normalizations, so it is recommended to stick with NFC for compatibility with other file systems.
- Line Feed for line endings
While this is not an official recommendation, it simply makes the most sense to do it this way.
The advantage of Normalization Form D (NFD) is that it is possible to do a (fast) byte-wise sort. This is why it is used in the HFS filesystem. The advantage of Normalization Form C (NFC) is that it is easier to parse by character-processing tools, and so most POSIX tools expect NFC. So you should be aware that most POSIX tools cannot process files with accented characters in their file name (on an HFS file system), although they can process their contents.
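The recommendations listed above can be sketched as a small conversion routine. This is a minimal illustration, assuming the input is already valid UTF-8 (with or without a BOM); the function name `to_posix_text` is hypothetical.

```python
import unicodedata

UTF8_BOM = b"\xef\xbb\xbf"

def to_posix_text(raw: bytes) -> bytes:
    """Return raw as UTF-8 without BOM, in NFC, with LF line endings.

    Assumes raw is valid UTF-8; decoding raises UnicodeDecodeError otherwise.
    """
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]           # drop the byte order mark
    text = raw.decode("utf-8")
    # Convert CRLF first, then any remaining lone CR, to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")

# A BOM, a decomposed u with diaeresis, a CRLF and a lone CR:
print(to_posix_text(b"\xef\xbb\xbfu\xcc\x88\r\nok\r"))
```

Input that is already in the recommended form passes through unchanged, so the routine can be applied repeatedly without harm.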
Windows Default Encoding
On Windows, the most commonly used encoding is currently:
- Unicode character set
- UTF-8 encoding
- BOM (byte order mark) at the start of the file (even though it is useless in UTF-8, which has no byte order)
- The file system in Windows uses composed sequences (Normalization Form C)
- Carriage Return and Line Feed (CRLF) for line endings
While this is an official recommendation, it does not make sense.
Besides the well-known CRLF versus LF difference, Windows recommends using a byte order mark with all UTF encodings, even though it is completely useless in UTF-8. The reason is the (apparently) very poor encoding detection in Windows systems, where any non-ASCII text file is assumed to have a Windows-1252 encoding (despite the fact that it is relatively straightforward to detect UTF-8 in non-ASCII files).
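To illustrate why UTF-8 detection is fairly straightforward, here is a rough sketch in Python. It relies on the fact that text in a legacy 8-bit encoding rarely happens to form valid UTF-8 multi-byte sequences. The function name `guess_encoding` and the choice of Windows-1252 as the fallback are assumptions for this example; this is a heuristic, not how Windows itself detects encodings.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def guess_encoding(raw: bytes) -> str:
    """Guess a Python codec name for raw: a heuristic, not a guarantee."""
    if raw.startswith(UTF8_BOM):
        return "utf-8-sig"          # UTF-8 with BOM; this codec strips it
    try:
        raw.decode("ascii")
        return "ascii"              # pure ASCII is valid in all candidates
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")         # strict: fails on malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"             # assumed legacy fallback

print(guess_encoding("\u00fc".encode("utf-8")))   # valid UTF-8 sequence
print(guess_encoding("\u00fc".encode("cp1252")))  # lone 0xFC is not valid UTF-8
```

No BOM is needed for the second and third branches to work, which is the point made above.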
Java uses UTF-16 as internal encoding. However, I'm told that this internal structure should be irrelevant if you use the proper API.
While RDF files do not specify the encoding or line ending, if you use a Unicode encoding such as UTF-8 or UTF-16, you are supposed to use Unicode Normalization Form C. Pure XML does not have this requirement (it can also use Normalization Form D).
The most well-known conversion tool is libiconv. Iconv can convert between most encodings, but does not (yet) understand the difference between the different normalizations.
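The distinction between converting an encoding and converting a normalization can be shown with a short sketch. It uses Python's built-in codecs as a stand-in, not libiconv itself: re-encoding changes the bytes but leaves the sequence of codepoints untouched, so a decomposed string stays decomposed until it is normalized explicitly.

```python
import unicodedata

decomposed = "u\u0308"  # u + COMBINING DIAERESIS (NFD)
utf16_bytes = decomposed.encode("utf-16-le")

# Re-encode from UTF-16LE to UTF-8: only the byte representation changes.
utf8_bytes = utf16_bytes.decode("utf-16-le").encode("utf-8")
roundtrip = utf8_bytes.decode("utf-8")

# Normalization is a separate, explicit step.
normalized = unicodedata.normalize("NFC", roundtrip)

print(len(roundtrip), len(normalized))
```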
Most POSIX tools cannot process files with accented characters in their file name on an HFS file system (since they expect NFC, while Mac OS X uses NFD). convmv is a tool that does exactly this conversion.
If you would really like to dig deeper into the different normalization forms, and don't mind a steep learning curve, have a look at the International Components for Unicode (ICU) library (C and Java).
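As a rough illustration of the NFD versus NFC file name problem, the sketch below compares names after normalization and lists which names would change under NFC. It works on plain strings and does not touch the file system; the helper names `same_name` and `nfc_renames` are hypothetical, and this is not how convmv is implemented.

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Compare two file names regardless of their normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def nfc_renames(names):
    """Return (old, new) pairs for every name that is not already in NFC."""
    pairs = []
    for name in names:
        target = unicodedata.normalize("NFC", name)
        if target != name:
            pairs.append((name, target))
    return pairs

nfd_name = "u\u0308ber.txt"  # as an NFD file system would store it
nfc_name = "\u00fcber.txt"   # as most POSIX tools expect it

print(nfd_name == nfc_name)           # plain comparison
print(same_name(nfd_name, nfc_name))  # normalization-aware comparison
print(nfc_renames([nfd_name, "plain.txt"]))
```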