Encoding Normalization

What to do if you are stuck with a file containing non-printable (control) characters? You normalize the content, by either removing or converting the unwanted characters.

What to do exactly depends on the input source (what's wrong with the file, if that is known) and the desired output.

See also: Character Encoding

File contains control characters
Example: a file with NULL characters:

TLV Name            Code Len Value
--------            ---- --- -----
Product Name        0x21  64 ABC2100
Part Number         0x22  20 ABC2100-DE3FG

Resolution:

tr -cd '\11\12\15\40-\176' < file-with-control-characters.txt > file-without-control-characters.txt

strings file-with-control-characters.txt

The tr example simply removes all control characters: it keeps only tab, line feed, carriage return and the printable ASCII range (octal 40-176). strings only prints sequences of four or more consecutive printable characters.
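The effect of that whitelist can be sketched in Python (the byte values mirror the tr command above; the function name is my own):

```python
# Keep tab (0x09), LF (0x0A), CR (0x0D) and printable ASCII (0x20-0x7E);
# drop every other byte -- the same whitelist as tr -cd '\11\12\15\40-\176'.
KEEP = set(b"\t\n\r") | set(range(0x20, 0x7F))

def strip_control_bytes(data: bytes) -> bytes:
    return bytes(b for b in data if b in KEEP)

print(strip_control_bytes(b"Product Name\x00\x00\x01 0x21\n"))
```

Note that, like tr, this also strips legitimate multibyte UTF-8 characters, so it is only suitable for files that should be plain ASCII.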

File contains a Byte Order Mark (BOM)
A Byte Order Mark is an invisible code point (U+FEFF) at the start of a file. Most editors do not display it, although some show it as a placeholder or as its raw bytes.

In UTF-8 files, U+FEFF is encoded as EF BB BF. For example, a hex editor may show:

EF BB BF 23 21 2F 62 69 6E 2F 62 61 73 68 0A 65 63 68 6F 20 22 68 65 6C 6C 6F 20 77 6F 72 6C 64 22 0A

for a file with this content:

#!/bin/bash
echo "hello world"

When executing this, it may give the following error:

./bash.sh
./bash.sh: line 1: #!/bin/bash: No such file or directory

A byte order mark is not recommended in UTF-8 files, in particular on Unix and macOS systems. It may only be useful on some Windows systems: due to poor encoding detection, Windows may otherwise assume Windows-1252 encoding for non-ASCII UTF-8 files without a BOM.

Resolution:

tr -cd '\11\12\15\40-\176' < file-with-control-characters.txt > file-without-control-characters.txt

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' < file-with-bom.txt > file-without-bom.txt

dos2unix file-with-bom.txt

The tr example removes all bytes outside the printable ASCII whitelist, which includes the three BOM bytes, but note that it also strips any other (legitimate) multibyte UTF-8 characters. The awk example only removes a BOM at the beginning of the file. dos2unix converts UTF-8 and UTF-16 files with BOM to UTF-8 without BOM.
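In a script, the same BOM strip can be done with Python's codecs module (a sketch; the function name is my own, and utf-8-sig is the standard codec that drops a leading BOM while decoding):

```python
import codecs

def strip_bom(data: bytes) -> bytes:
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Alternatively, decoding with 'utf-8-sig' drops a leading BOM automatically:
text = b"\xef\xbb\xbf#!/bin/bash\n".decode("utf-8-sig")
```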

File has unknown (non-UTF-8) encoding
The following encodings can be detected fairly reliably, because each has byte sequences it can never contain: ASCII, UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE and UTF-32-BE. If the encoding is none of these, it might be ISO-8859-1 (Latin-1, similar to Windows-1252), Windows-1251 (Cyrillic), Shift JIS (Japanese), GBK (Chinese) or any other encoding.
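This detection logic can be sketched as a trial decode in Python (the function name is my own; note that UTF-16 without a BOM successfully decodes many even-length byte strings, so a match there is only a hint):

```python
def guess_encoding(data: bytes) -> str:
    # Each decode raises UnicodeDecodeError on byte sequences the encoding
    # can never contain; try the strictest candidates first.
    for enc in ("ascii", "utf-8", "utf-32", "utf-16"):
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown legacy encoding (try enca or chardet)"

print(guess_encoding(b"hello"))                     # ascii
print(guess_encoding("h\u00e9llo".encode("utf-8"))) # utf-8
```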

Resolution:

Use enca to determine the encoding, then iconv to convert to UTF-8.

enca -L russian encoding-cyrillic-*.txt
encoding-cyrillic-cp154.txt: MS-Windows code page 1251
encoding-cyrillic-koi8.txt: KOI8-R Cyrillic
encoding-cyrillic-mac.txt: Macintosh Cyrillic
encoding-cyrillic-win.txt: MS-Windows code page 1251

iconv -f cp154 -t utf-8 encoding-cyrillic-cp154.txt > encoding-cyrillic-utf8.txt
iconv -f KOI8-R -t utf-8 encoding-cyrillic-koi8.txt > encoding-cyrillic-utf8.txt
iconv -f maccyrillic -t utf-8 encoding-cyrillic-mac.txt > encoding-cyrillic-utf8.txt
iconv -f cp1251 -t utf-8 encoding-cyrillic-win.txt > encoding-cyrillic-utf8.txt

Note that in this case, the CP154-encoded text was recognized as CP1251, because the two encodings are identical for the characters used in this file (CP154 is based on CP1251).
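The same conversions can be done without iconv, for example in Python, where the corresponding codecs are named cp1251, koi8_r, mac_cyrillic and ptcp154 (the byte string below is my own sample):

```python
# Decode from the detected legacy encoding, then re-encode as UTF-8 --
# the Python equivalent of iconv -f cp1251 -t utf-8.
legacy = b"\xcf\xf0\xe8\xe2\xe5\xf2"   # "Привет" in CP1251
text = legacy.decode("cp1251")
utf8_bytes = text.encode("utf-8")
print(text)  # Привет
```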

Unwanted line endings
Example:

cat -v textfile-windows.txt
Line 1^M
Line 2^M

To convert line endings to line feed only (UNIX): flip -u *.txt or textconv -u *.txt

Flip can also display line endings: flip -t *.txt

Other utilities and approaches exist as well (for example, stripping carriage returns with tr -d '\r'), but these require you to know in advance what the current line ending is.
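In a script, all three conventions can be handled without knowing the current line ending in advance (a sketch; the function name is my own):

```python
def to_unix(data: bytes) -> bytes:
    # CRLF (Windows) first, then any remaining lone CR (classic Mac OS);
    # both become a single LF (Unix).
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

print(to_unix(b"Line 1\r\nLine 2\r\n"))
```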

Unwanted normalization
Even UTF-8 files can differ in Unicode normalization form: accented characters (diacritical marks) can be stored precomposed or decomposed.

As mentioned on the Character Encoding page, the defaults somewhat differ per platform, but NFC-normalized is most commonly used.

Resolution:

uconv -x any-NFC < utf8file.txt > normalized-utf8.txt

uconv is part of the ICU library.

On macOS, the iconv library has the "UTF-8-MAC" encoding, which is UTF-8 NFD normalized.

iconv -f UTF-8-MAC -t UTF-8 utf8-nfd.txt > utf8-nfc.txt
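In Python, unicodedata.normalize performs the same NFC conversion:

```python
from unicodedata import normalize

nfd = "e\u0301"              # 'e' + combining acute accent (decomposed)
nfc = normalize("NFC", nfd)  # precomposed 'é' (U+00E9)
print(len(nfd), len(nfc))    # 2 1
```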

It is also possible to create your own script. For example, the following snippet removes formatting and diacritical marks:

import codecs
from unicodedata import category, normalize

class NoAccentWrapper(codecs.StreamReader):
    def readline(self):
        line = self.stream.readline()
        # First decompose accented characters (NFD normalization),
        # and remove the Combining Diacritical Marks (those in the Mn category).
        # Compatibility normalization (NFKD instead of NFD) also converts
        # e.g. '¼' to '1/4'.
        return ''.join(ch for ch in normalize('NFKD', line)
                       if category(ch) not in ('Mn',))

Be aware that the output is not necessarily ASCII: some, but not all, characters are converted to an ASCII equivalent.
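A concrete illustration of that caveat: NFKD decomposes the fraction character '¼' (U+00BC) into '1⁄4' using the non-ASCII fraction slash (U+2044), while 'é' does reduce to plain 'e' (the function name is my own):

```python
from unicodedata import category, normalize

def strip_marks(text: str) -> str:
    # Decompose with compatibility mappings, then drop combining marks (Mn).
    return "".join(ch for ch in normalize("NFKD", text)
                   if category(ch) != "Mn")

print(strip_marks("café"))    # cafe
print(strip_marks("\u00bc"))  # 1⁄4  (contains U+2044, not an ASCII '/')
```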

Converting file names
On macOS, file names on HFS+ are NFD normalized (in Apple's variant of NFD), while APFS does not normalize file names. On most other platforms, file names are NFC normalized.
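This matters when syncing: the same visible name is a different byte sequence under each normalization form, so a naive sync tool would treat them as two files (a Python sketch with a sample name of my own):

```python
import unicodedata

name = "café.txt"
nfd = unicodedata.normalize("NFD", name).encode("utf-8")  # as stored by HFS+
nfc = unicodedata.normalize("NFC", name).encode("utf-8")  # as on most others
print(nfd)  # b'cafe\xcc\x81.txt'
print(nfc)  # b'caf\xc3\xa9.txt'
```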

You can use the iconv hook with rsync when synchronizing HFS to a non-HFS-based platform:

rsync -a --iconv=utf-8-mac,utf-8 localdir/ mynas:remotedir/

To explicitly convert the encoding of (one or more) file names, use convmv:

convmv -r -f utf8 -t utf8 --nfc --notest directory/