Multilingual Terminal

How to create a Unicode-aware terminal

To create a terminal which can deal with all characters:

  • the terminal must display the correct characters
  • the terminal must grok Unicode input
  • the tools must grok Unicode

UTF-8

First off, we standardize on UTF-8. It's backward compatible with ASCII and is the best-known Unicode encoding, so there is no reason to look any further.

Start encoding all your text-based files (text files, code, HTML, LaTeX, etc.) in UTF-8. There is no excuse not to. Mark them as such in the file itself. For example, start XML with <?xml version="1.0" encoding="UTF-8"?>, XHTML with <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" /> and Python with # -*- coding: utf-8 -*-
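Before marking a file as UTF-8, it helps to check that it really decodes as UTF-8. The following is a minimal sketch, assuming an iconv that exits with a non-zero status on invalid input (the glibc one does). The name is_utf8 is made up for this example.

```shell
# Hypothetical helper: succeeds if FILE decodes cleanly as UTF-8.
# Assumption: iconv returns a non-zero exit status at the first
# invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Usage:
#   is_utf8 notes.txt && echo "valid UTF-8" || echo "not UTF-8"
```

Plain ASCII files pass this check too, since ASCII is a subset of UTF-8.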

Note that you should not use a byte-order mark (BOM), the U+FEFF code point at the start of a Unicode file, to identify Unicode. While it works well to identify UTF-8, not all applications understand it. In particular, a BOM in front of a shebang (#!/bin/sh) does not work, because the file then no longer starts with the bytes #!. As a rule of thumb, don't use it on UNIX-based systems.
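To find files that carry a BOM, you can look at the first three bytes: in UTF-8 the BOM is encoded as EF BB BF. A rough sketch follows. The function names has_bom and strip_bom are invented here, and head -c is assumed to be available (it is in the GNU and BSD versions).

```shell
# Hypothetical helper: succeeds if FILE starts with the UTF-8
# encoded BOM (bytes EF BB BF).
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Hypothetical helper: writes FILE to stdout without a leading BOM.
strip_bom() {
    if has_bom "$1"; then
        tail -c +4 "$1"    # start output at the 4th byte
    else
        cat "$1"
    fi
}

# Usage:
#   has_bom script.sh && strip_bom script.sh > script.nobom.sh
```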

While we're at it: if you understand the difference, try to use Normalization Form C (NFC, or precomposed), not Normalization Form D (NFD, or decomposed). If you don't understand, never mind. One caveat if you use HFS+ (the Mac OS X file system): it stores file names in NFD, while most other applications expect NFC, so you may run into problems if you use accented characters in your file names.
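To illustrate the difference, here is a small sketch that builds "é" in both forms from raw bytes and shows that the two are different byte strings. The variable names nfc and nfd are just for this example.

```shell
# "é" in NFC: one code point, U+00E9, encoded in UTF-8 as C3 A9.
nfc=$(printf '\303\251')
# "é" in NFD: "e" (U+0065) followed by the combining acute accent
# U+0301, encoded in UTF-8 as 65 CC 81.
nfd=$(printf 'e\314\201')

# Dump the bytes of each form in hex.
printf '%s' "$nfc" | od -An -tx1
printf '%s' "$nfd" | od -An -tx1

# Both render as the same character, but a byte-wise comparison
# (which is what file name lookups usually do) sees two strings.
if [ "$nfc" = "$nfd" ]; then echo same; else echo different; fi
```

This is why a file name typed in NFC may fail to match a name stored in NFD, even though both look identical on screen.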

Display correct characters

The first step is simple: set your locale to a UTF-8 variant.

export LC_ALL=en_GB.UTF-8
export LANG=en_GB.UTF-8

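Set this in your .zprofile, .profile or .login file (for zsh, bash and tcsh, respectively). Note that tcsh has no export command; in .login, use setenv LC_ALL en_GB.UTF-8 and setenv LANG en_GB.UTF-8 instead.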

From now on, ls and cat show the correct characters.

% cat UTF-8-demo.txt
[...]
  ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β)
[...]
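To verify that the setting took effect, you can inspect the locale variables. The sketch below is one possible check. The helper names are invented, and the precedence shown (LC_ALL, then LC_CTYPE, then LANG) is the usual POSIX order for the character type category.

```shell
# Hypothetical helper: succeeds if a locale name such as
# en_GB.UTF-8 or en_GB.utf8 selects the UTF-8 character set.
is_utf8_locale_name() {
    case "$1" in
        *.UTF-8|*.utf8|*.UTF-8@*|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: prints the locale name that decides the
# character set: LC_ALL wins over LC_CTYPE, which wins over LANG.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
}

# Usage:
#   is_utf8_locale_name "$(effective_ctype)" || echo "not a UTF-8 locale"
```

The standard command locale charmap offers a similar check: it prints UTF-8 when a UTF-8 locale is in effect.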

Generate Locale

If you see the following error:

setlocale: LC_ALL: cannot change locale (en_GB.UTF-8)

you need to generate your locale first.

On Debian: uncomment the en_GB.UTF-8 UTF-8 line in /etc/locale.gen and run locale-gen as root

On CentOS: localedef -v -c -i en_GB -f UTF-8 en_GB.UTF-8
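To see in advance whether a locale has been generated, you can search the output of locale -a. A sketch follows; the function name locale_listed is made up. It assumes the glibc habit of listing en_GB.UTF-8 as en_GB.utf8, so it compares case-insensitively and ignores hyphens.

```shell
# Hypothetical helper: reads a "locale -a" style list on stdin and
# succeeds if the wanted locale is in it. Names are lowercased and
# hyphens are removed on both sides, so that "en_GB.UTF-8" matches
# the "en_GB.utf8" spelling printed by glibc.
locale_listed() {
    want=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d '-')
    tr 'A-Z' 'a-z' | tr -d '-' | grep -qxF "$want"
}

# Usage:
#   locale -a | locale_listed en_GB.UTF-8 || echo "generate it first"
```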

Unicode support in terminal

The terminal actually consists of two applications:

  • The terminal client program
  • The shell on the remote computer

The client can be a GUI application, like Terminal or iTerm on Mac OS X, or xterm or mlterm under X11. If you use X11, I recommend mlterm instead of the default xterm.

As the shell, I personally use zsh; for Unicode support, you need version 4.3 or higher.
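A version requirement like this one can be checked from a script. The sketch below assumes a sort that supports the -V (version sort) option, as GNU coreutils does. The name version_ge is invented for this example.

```shell
# Hypothetical helper: succeeds if version A is at least version B.
# Assumption: sort understands -V (version sort).
# Usage: version_ge A B
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage, taking the second word of the "zsh --version" output:
#   version_ge "$(zsh --version | awk '{print $2}')" 4.3 || echo "zsh too old"
```

Inside zsh itself, the same number is available as $ZSH_VERSION.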

Unicode support for applications

Finally, you should use applications with proper UTF-8 support. This may be the hardest part, due to the sheer number of applications involved. For starters, make sure you use the GNU coreutils version of ls (version 5.3 and up) or a recent version of the BSD utilities. (Most file utilities have supported UTF-8 since the early 2000s, but BSD ls was relatively late with it.)
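A quick smoke test for a tool is to feed it a multi-byte character and see whether it counts characters or bytes. The sketch below does this for wc. The function name is made up, and the result depends on the locale in effect when it runs.

```shell
# Hypothetical smoke test: "é" is two bytes in UTF-8 (C3 A9).
# A UTF-8 aware wc running in a UTF-8 locale reports it as one
# character with -m; otherwise it is typically counted as two.
tool_counts_characters() {
    [ "$(printf '\303\251' | wc -m | tr -d ' ')" = 1 ]
}

# Usage:
#   tool_counts_characters || echo "wc is not counting UTF-8 characters"
```

The same idea works for other tools: give them input whose byte count and character count differ, and compare the result.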