Encoding of Python stdout

Default Behaviour
Python 2 determines the encoding of stdout and stderr from the value of the LC_CTYPE variable, but only if the stream is a tty. So if I just output to the terminal, LC_CTYPE (or LC_ALL, which overrides it) defines the encoding. However, when the output is piped to a file or to another process, the encoding is not defined, and Python falls back to 7-bit ASCII as soon as it has to encode Unicode text.
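To check which of the two cases applies, a script can ask each stream whether it is a tty and which encoding Python picked for it. The following is only a minimal sketch; the helper name describe_stream is made up for illustration and is not part of any library.

```python
import io
import sys


def describe_stream(stream):
    """Return (is_tty, encoding) for a file-like object."""
    # Not every file-like object implements isatty(), so check first.
    is_tty = bool(hasattr(stream, "isatty") and stream.isatty())
    # Python 2 reports None here when the stream is not a terminal.
    encoding = getattr(stream, "encoding", None)
    return is_tty, encoding


if __name__ == "__main__":
    # Report on stderr so the result stays visible when stdout is redirected.
    for name in ("stdout", "stderr"):
        is_tty, encoding = describe_stream(getattr(sys, name))
        sys.stderr.write("%s: tty=%s encoding=%s\n" % (name, is_tty, encoding))
    # An in-memory byte stream is never a tty and has no encoding.
    sys.stderr.write("BytesIO: %r\n" % (describe_stream(io.BytesIO()),))
```

Running it once directly and once with stdout redirected to a file should show the tty flag change for stdout.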

Example
encoding.py:

    #!/usr/bin/env python2
    import sys
    print(sys.stdout.encoding)

If the output is a terminal, the output is self-explanatory:

    % LC_ALL=C
    % ./encoding.py
    US-ASCII

    % LC_ALL=en_GB.UTF-8
    % ./encoding.py
    UTF-8

However, as soon as the output is redirected to a log file, you get a different behaviour in Python 2:

    % ./encoding.py > encoding.out ; cat encoding.out
    None

This poses a problem when writing Unicode to log files:

writelog.py:

    #!/usr/bin/env python
    # encoding: utf-8

    import sys
    sys.stderr.write(u"Convert currency ($1.00 = €0.75)\n")

    % ./writelog.py
    Convert currency ($1.00 = €0.75)

    % ./writelog.py 2> error.log ; cat error.log
    Traceback (most recent call last):
      File "./writelog.py", line 5, in <module>
        sys.stderr.write(u"Convert currency ($1.00 = €0.75)\n")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 26: ordinal not in range(128)
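What happens here is that Python 2 implicitly encodes the Unicode string as ASCII before writing it. The same failure can be reproduced without any redirection by doing that encoding step by hand. This is only an illustrative sketch; the helper name try_ascii is made up.

```python
# -*- coding: utf-8 -*-
# The euro sign is written as an escape so the file itself stays ASCII.
message = u"Convert currency ($1.00 = \u20ac0.75)\n"


def try_ascii(text):
    """Return None if text is pure ASCII, else the UnicodeEncodeError raised."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as error:
        return error
    return None


error = try_ascii(message)
if error is not None:
    # error.start is the index of the first character that could not be encoded.
    print("cannot encode %r at position %d" % (error.object[error.start], error.start))
```

The reported position is the index of the euro sign in the string, which matches the position in the traceback.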

Overriding the Encoding of stdout or stderr
Basically, Python makes an educated guess about the encoding of stdout and stderr. Unfortunately, it is not possible to simply set the encoding of stdout if Python's guess is wrong, like this:

    sys.stdout.encoding = 'utf-8'

This results in a TypeError: readonly attribute

My personal opinion is that Python fails its own guidelines, and I'm not the first to complain.

The good news is that Python 3 no longer exhibits this behaviour.

The general solution is to explicitly feed binary data to stdout and stderr. This can be done by doing the encoding manually:

    #!/usr/bin/env python
    # encoding: utf-8

    import sys
    sys.stderr.write(u"Convert currency ($1.00 = €0.75)\n".encode('utf-8'))

Fixing the output to a given encoding may seem ugly if the output is always sent to a terminal (this would even send UTF-8 to a non-UTF-8 terminal), but may be desirable if the output is often piped to other programs or to a file. It makes sure that the output is always consistent, regardless of the environment. This is a tremendous benefit when debugging.

I recommend not calling sys.stdout.write, print, or sys.stderr.write directly, but using a wrapper function and calling that instead:

    def write(line):
        print(line.encode('utf-8'))

    write("hello world")
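The wrapper above relies on Python 2, where print accepts byte strings. A variant that also runs on Python 3 could write the encoded bytes to the underlying byte stream instead. The following is a sketch under that assumption; the optional stream parameter is an addition for illustration and testing, not something the original wrapper has.

```python
import io
import sys


def write(line, stream=None):
    """Encode a line as UTF-8 and write the bytes, followed by a newline."""
    if stream is None:
        stream = sys.stdout
    data = (line + u"\n").encode("utf-8")
    # Python 3 text streams expose their byte stream as .buffer;
    # Python 2 file objects and in-memory byte streams take bytes directly.
    target = getattr(stream, "buffer", stream)
    target.write(data)


if __name__ == "__main__":
    write(u"hello world")
    # Writing to an in-memory byte stream shows the exact bytes produced.
    demo = io.BytesIO()
    write(u"\u20ac0.75", demo)
```

Because all output goes through one function, changing the encoding later means touching a single line.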

StreamWriter Wrapper around Stdout
Instead of encoding every string by hand before it is written, or using a wrapper function, it is also possible to redirect stdout and/or stderr through a StreamWriter at the start of the program:

In Python 2:

    import sys, codecs

    if sys.stdout.encoding != 'UTF-8':
        sys.stdout = codecs.getwriter('utf-8')(sys.stdout, 'strict')
    if sys.stderr.encoding != 'UTF-8':
        sys.stderr = codecs.getwriter('utf-8')(sys.stderr, 'strict')

In Python 3:

    import sys, codecs

    # In Python 3 the StreamWriter must wrap the underlying byte stream.
    if sys.stdout.encoding != 'UTF-8':
        sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer, 'strict')
    if sys.stderr.encoding != 'UTF-8':
        sys.stderr = codecs.getwriter('utf-8')(sys.stderr.buffer, 'strict')
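To see what such a StreamWriter actually emits, it can be pointed at an in-memory byte stream instead of the real stdout. This is just a small sketch for experimenting; the variable names are arbitrary.

```python
import codecs
import io

# Stand-in for sys.stdout (Python 2) or sys.stdout.buffer (Python 3).
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw, "strict")

# The writer accepts Unicode text and passes UTF-8 bytes to the wrapped stream.
writer.write(u"Convert currency ($1.00 = \u20ac0.75)\n")
encoded = raw.getvalue()
```

The euro sign ends up as the three bytes E2 82 AC, regardless of the locale the script runs under.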

StreamWriter Wrapper with Dynamic Encoding
The above examples use a fixed encoding, which is highly recommended if the output is written to a file, piped to another program or otherwise logged or processed.

If the output is never processed or logged, but only used for interactive debugging, it is also possible to let the output encoding depend on the environment. For example, it can follow the locale variables (such as LC_ALL or LC_CTYPE) or, if none of these are known, fall back to US-ASCII. This is roughly what Python's own locale lookup does, only slightly more robust, because it uses XML character references if a character can't be displayed instead of raising a UnicodeEncodeError.

The following code snippet sets the encoding according to these variables. It also makes sure to replace non-ASCII characters with their XML-encoded counterpart if the desired output is ASCII.

    #!/usr/bin/env python
    import sys, codecs, locale

    print(sys.stdout.encoding)
    print(sys.stderr.encoding)
    print(locale.getpreferredencoding())

    # sys.stdout.encoding can be None in Python 2, hence the "or ''".
    if (sys.stdout.encoding or '').upper() != 'UTF-8':
        encoding = sys.stdout.encoding or locale.getpreferredencoding()
        try:
            encoder = codecs.getwriter(encoding)
        except LookupError:
            sys.stdout.write("Warning: unknown encoding %s specified in locale.\n" % encoding)
            encoder = codecs.getwriter('UTF-8')
        if encoding.upper() != 'UTF-8':
            sys.stdout.write("Warning: stdout in %s format. Diacritical signs are represented in XML-coded format.\n" % encoding)
        try:
            # Python 3: wrap the underlying byte stream
            sys.stdout = encoder(sys.stdout.buffer, 'xmlcharrefreplace')
        except AttributeError:
            # Python 2: file objects have no .buffer attribute
            sys.stdout = encoder(sys.stdout, 'xmlcharrefreplace')

    if (sys.stderr.encoding or '').upper() != 'UTF-8':
        encoding = sys.stderr.encoding or locale.getpreferredencoding()
        try:
            encoder = codecs.getwriter(encoding)
        except LookupError:
            sys.stderr.write("Warning: unknown encoding %s specified in locale.\n" % encoding)
            encoder = codecs.getwriter('UTF-8')
        if encoding.upper() != 'UTF-8':
            sys.stderr.write("Warning: stderr in %s format. Diacritical signs are represented in XML-coded format.\n" % encoding)
        try:
            sys.stderr = encoder(sys.stderr.buffer, 'xmlcharrefreplace')
        except AttributeError:
            sys.stderr = encoder(sys.stderr, 'xmlcharrefreplace')
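The effect of the 'xmlcharrefreplace' error handler is easiest to see in isolation. The sketch below writes a euro sign through an ASCII StreamWriter into an in-memory byte stream (a stand-in for the real stdout, chosen here only for illustration) and keeps the resulting bytes.

```python
import codecs
import io

# Stand-in for a stdout that can only display ASCII.
raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw, "xmlcharrefreplace")

# U+20AC is not representable in ASCII, so it is written as a
# numeric XML character reference instead of raising an error.
writer.write(u"\u20ac0.75")
result = raw.getvalue()
```

With the 'strict' handler the same write would raise a UnicodeEncodeError, which is exactly the failure shown earlier for the redirected stderr.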