Encoding Normalization

What to do if you are stuck with a file containing non-printable ASCII characters? You normalize the content by either removing or converting the unwanted characters.

What to do exactly depends on the input source (what's wrong with the file, if that is known) and the desired output.

See also: Character Encoding

File contains control characters

Example: a file with NULL characters:

TLV Name             Code Len Value
-------------------- ---- --- -----
Product Name         0x21  64 ABC2100<NUL><NUL><NUL>
Part Number          0x22  20 ABC2100-DE3FG<NUL><NUL><NUL>

Resolution:

tr -cd '\11\12\15\40-\176' <  file-with-control-characters.txt > file-without-control-characters.txt
strings file-with-control-characters.txt

The tr example deletes everything except tab, line feed, carriage return and the printable ASCII characters. strings only prints sequences of four or more consecutive printable characters.
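
If no suitable tool is at hand, the same filter is easy to script. A minimal Python sketch of the tr example above (file names taken from that example):

import re

with open('file-with-control-characters.txt', 'rb') as f:
    data = f.read()
# Delete every byte that is not tab, LF, CR or printable ASCII
# (the same set the tr invocation above keeps).
clean = re.sub(rb'[^\t\n\r\x20-\x7e]', b'', data)
with open('file-without-control-characters.txt', 'wb') as f:
    f.write(clean)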

File contains a Byte Order Mark (BOM)

A Byte Order Mark is the Unicode character U+FEFF at the very start of a file. Most editors do not display it, although some will show it as "" (the result of interpreting its UTF-8 bytes as Windows-1252).

In UTF-8 files, U+FEFF is encoded as EF BB BF. For example, a hex editor may show:

EF BB BF 23 21 2F 62 69 6E 2F 62 61 73 68 0A 65 63 68 6F 20 22 68 65 6C 6C 6F 20 77 6F 72 6C 64 22 0A

for a file with these contents:

#!/bin/bash
echo "hello world"

When executing this, it may give the following error:

./bash.sh
#!/bin/bash: No such file or directory

A byte order mark is not recommended in UTF-8 files, in particular on Unix and macOS systems. It may only be useful on some Windows systems: due to poor encoding detection, Windows may assume Windows-1252 encoding for non-ASCII UTF-8 files without a BOM.

Resolution:

tr -cd '\11\12\15\40-\176' <  file-with-control-characters.txt > file-without-control-characters.txt
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' < file-with-bom.txt > file-without-bom.txt
dos2unix file-with-bom.txt

The tr example removes the BOM along with all other non-ASCII bytes, so it will also mangle any other UTF-8 characters in the file. The awk example only removes a BOM at the beginning of the file. dos2unix converts UTF-8 and UTF-16 files with a BOM to UTF-8 without a BOM.
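
In Python, the built-in utf-8-sig codec achieves the same as the awk example; a minimal sketch, reusing the file names from the commands above:

with open('file-with-bom.txt', encoding='utf-8-sig') as f:
    text = f.read()  # utf-8-sig silently strips a leading BOM, if present
with open('file-without-bom.txt', 'w', encoding='utf-8') as f:
    f.write(text)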

File has unknown (non-UTF-8) encoding

The following encodings have byte sequences that can never occur in valid text, and are therefore easy to recognize: ASCII, UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE and UTF-32-BE. If the encoding is none of these, it might be ISO-8859-1 (Latin-1, similar to Windows-1252), Windows-1251 (Cyrillic), Shift JIS (Japanese), GBK (Chinese) or any other encoding.

Resolution:

Use enca to determine the encoding, then iconv to convert to UTF-8.

enca -L russian encoding-cyrillic-*.txt
encoding-cyrillic-cp154.txt: MS-Windows code page 1251
encoding-cyrillic-koi8.txt: KOI8-R Cyrillic
encoding-cyrillic-mac.txt: Macintosh Cyrillic
encoding-cyrillic-win.txt: MS-Windows code page 1251
iconv -f cp154 -t utf-8 encoding-cyrillic-cp154.txt > encoding-cyrillic-utf8.txt
iconv -f KOI8-R -t utf-8 encoding-cyrillic-koi8.txt > encoding-cyrillic-utf8.txt
iconv -f maccyrillic -t utf-8 encoding-cyrillic-mac.txt > encoding-cyrillic-utf8.txt
iconv -f cp1251 -t utf-8 encoding-cyrillic-win.txt > encoding-cyrillic-utf8.txt

Note that in this case, the CP154-encoded text was recognized as CP1251, because the two encodings are identical for the characters used in this file (CP154 is based on CP1251).
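
Encoding detection can also be scripted. A minimal sketch using the third-party Python package chardet instead of enca (an assumption; any detection library works similarly):

import chardet

with open('encoding-cyrillic-koi8.txt', 'rb') as f:
    raw = f.read()

# detect() returns a guess, e.g. {'encoding': 'KOI8-R', 'confidence': 0.99, ...}
guess = chardet.detect(raw)
text = raw.decode(guess['encoding'])
with open('encoding-cyrillic-utf8.txt', 'w', encoding='utf-8') as f:
    f.write(text)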

Unwanted line endings

Main article: Line Endings

Example:

cat textfile-windows.txt
Line 1^M
Line 2^M

To convert line endings to line feed only (UNIX):

flip -u *.txt
   or
textconv -u *.txt

Flip can also display line endings:

flip -t *.txt

Other utilities include dos2unix, tofrodos, tr, etc., but these require you to know in advance what the current line ending is.
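
The conversion itself can also be scripted without knowing the current line ending. A minimal Python sketch (the output file name is an assumption):

with open('textfile-windows.txt', 'rb') as f:
    data = f.read()
# Replace CRLF first, then any remaining lone CR, so that all three
# conventions (CRLF, CR and LF) end up as LF.
data = data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')
with open('textfile-unix.txt', 'wb') as f:
    f.write(data)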

Unwanted normalization

Even UTF-8 files can store the same text in different forms (normalizations): accented characters (those with diacritical marks) may be precomposed or decomposed.

As mentioned on the Character Encoding page, the defaults differ somewhat per platform, but NFC normalization is the most commonly used.

Resolution:

uconv -x any-NFC < utf8file.txt > normalized-utf8.txt

uconv is part of the ICU library.

On macOS, the iconv library has the "UTF-8-MAC" encoding, which is UTF-8 NFD normalized.

iconv -f UTF-8-MAC -t UTF-8 utf8-nfd.txt > utf8-nfc.txt
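
The same normalization can be done with Python's standard unicodedata module; a minimal sketch, reusing the file names from the iconv example:

import unicodedata

with open('utf8-nfd.txt', encoding='utf-8') as f:
    text = f.read()
with open('utf8-nfc.txt', 'w', encoding='utf-8') as f:
    f.write(unicodedata.normalize('NFC', text))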

It is also possible to write your own script for more elaborate transformations. For example, the following snippet removes formatting and diacritical marks:

import codecs
from unicodedata import category, normalize

class NoAccentWrapper(codecs.StreamReader):
    def readline(self, size=None, keepends=True):
        line = self.stream.readline()
        # First decompose accented characters and remove the Combining
        # Diacritical Marks (those in the Mn category). Compatibility
        # normalization (NFKD instead of NFD) also converts e.g. '¼' to '1/4'.
        return ''.join(ch for ch in normalize('NFKD', line) if category(ch) != 'Mn')

Be aware that the output is not necessarily ASCII. For example, åß©∂éƒgʰijǣ¼ will be converted to aß©∂eƒghijæ1⁄4, with some, but not all, characters converted to their ASCII equivalents.
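
A hypothetical usage of the wrapper above (the file name is an assumption):

with open('utf8file.txt', encoding='utf-8') as f:
    reader = NoAccentWrapper(f)
    # readline() returns '' at end of file
    for line in iter(reader.readline, ''):
        print(line, end='')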

Converting file names

On macOS, file names on HFS+ are NFD normalized, while APFS does not normalize them. On most other platforms, file names are NFC normalized.

You can use rsync's --iconv option when synchronizing from HFS+ to a non-HFS platform:

rsync -a --iconv=utf-8-mac,utf-8 localdir/ mynas:remotedir/

To explicitly convert the encoding of (one or more) filenames, use convmv:

convmv -r -f utf8 -t utf8 --nfc --notest directory/
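
The same renaming can be scripted. A minimal Python sketch that NFC-normalizes file names, walking bottom-up so directories are renamed after their contents (the directory name is taken from the convmv example):

import os
import unicodedata

for dirpath, dirnames, filenames in os.walk('directory', topdown=False):
    for name in filenames + dirnames:
        nfc = unicodedata.normalize('NFC', name)
        if nfc != name:
            os.rename(os.path.join(dirpath, name),
                      os.path.join(dirpath, nfc))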