Encoding of Python stdout

From Exterior Memory

Default Behaviour

Python 2 determines the encoding of stdout and stderr from the LC_CTYPE locale setting, but only if the stream is a tty. So when output goes straight to the terminal, LC_CTYPE (or LC_ALL, which overrides it) defines the encoding. However, when the output is piped to a file or to a different process, the encoding is not defined (sys.stdout.encoding is None), and Python falls back to 7-bit ASCII whenever it has to encode Unicode.

Example

encoding.py:

#!/usr/bin/env python2
import sys
print(sys.stdout.encoding)

If the output is a terminal, the output is self-explanatory:

% LC_ALL=C
% ./encoding.py
US-ASCII
% LC_ALL=en_GB.UTF-8
% ./encoding.py     
UTF-8

However, as soon as the output is piped to a logfile, you get a different behaviour in Python 2:

% ./encoding.py > encoding.out ; cat encoding.out
None

This poses a problem when writing Unicode to log files:

writelog.py:

#!/usr/bin/env python
# encoding: utf-8
import sys
sys.stderr.write(u"Convert currency ($1.00 = €0.75)\n")

% ./writelog.py
Convert currency ($1.00 = €0.75)
% ./writelog.py 2> error.log ; cat error.log
Traceback (most recent call last):
  File "./writelog.py", line 4, in <module>
    sys.stderr.write(u"Convert currency ($1.00 = €0.75)\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 26: ordinal not in range(128)

Overriding the Encoding of stdout or stderr

Basically, Python makes an educated guess about the encoding of stdout and stderr. Unfortunately, it is not possible to simply set the encoding of stdout if Python's guess is wrong, like this:

sys.stdout.encoding = 'utf-8'

This results in a

TypeError: readonly attribute

My personal opinion is that Python fails its own guidelines, and I'm not the first to complain.

The good news is that Python 3 no longer exhibits this behaviour.
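
Python 3.7 and later also offer a direct fix: text streams have a reconfigure() method that changes the encoding in place. The sketch below uses an in-memory stream as a hypothetical stand-in for sys.stdout, so it can be run without touching the real stdout. In a real script the call would be sys.stdout.reconfigure(encoding='utf-8'), assuming stdout is still the regular text stream that Python created.

```python
import io

# In-memory stand-in for a stdout that Python opened as ASCII.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding='ascii')

# Python 3.7+: change the encoding of an existing text stream in place.
stream.reconfigure(encoding='utf-8')

# Writing a non-ASCII character now succeeds instead of raising
# UnicodeEncodeError.
stream.write(u"Convert currency ($1.00 = \u20ac0.75)\n")
stream.flush()
encoded = raw.getvalue()
```

This does not help on Python 2, where the approaches described in the rest of this page remain necessary.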

The general solution is to explicitly feed binary data to stdout and stderr. This can be done by doing the encoding manually:

#!/usr/bin/env python
# encoding: utf-8
import sys
sys.stderr.write(u"Convert currency ($1.00 = €0.75)\n".encode('utf-8'))

Fixing the output to a given encoding may seem ugly if the output is always sent to a terminal (it would even send UTF-8 to a non-UTF-8 terminal), but it may be desirable if the output is often piped to other programs or to a file. It makes sure that the output is always consistent, regardless of the environment. This is a tremendous benefit when debugging.

I recommend not calling sys.stdout.write, print, or sys.stderr.write directly, but defining a wrapper function and calling that instead:

def write(line):
    # Python 2: encode the unicode string explicitly before printing
    print(line.encode('utf-8'))

write(u"hello world")
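
The wrapper above relies on Python 2 semantics: in Python 3, print() would show the bytes object as b'...'. As a sketch of a variant that runs on both versions, the hypothetical helper write_utf8() below (the name and signature are my own, not a library function) writes the encoded bytes to the underlying byte stream. An in-memory buffer stands in for stderr so the result can be inspected.

```python
import io


def write_utf8(stream, text):
    """Write a unicode string to stream, always encoded as UTF-8."""
    # Python 3 text streams expose their byte stream as .buffer;
    # Python 2 file objects (and BytesIO) accept bytes directly.
    target = getattr(stream, "buffer", stream)
    target.write(text.encode("utf-8"))


# In-memory stand-in for sys.stderr, so the bytes can be checked.
fake_stderr = io.BytesIO()
write_utf8(fake_stderr, u"Convert currency ($1.00 = \u20ac0.75)\n")
written = fake_stderr.getvalue()
```

In a real script the call would be write_utf8(sys.stderr, ...). The euro sign ends up as the three bytes E2 82 AC, regardless of the locale.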

StreamWriter Wrapper around Stdout

Instead of calling encode() in each sys.stderr.write(), or using a wrapper function, it is also possible to redirect stdout and/or stderr through a StreamWriter at the start of the program:

In Python 2:

import codecs
import sys

# sys.stdout.encoding is None when the output is piped
if sys.stdout.encoding != 'UTF-8':
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout, 'strict')
if sys.stderr.encoding != 'UTF-8':
    sys.stderr = codecs.getwriter('utf-8')(sys.stderr, 'strict')

In Python 3:

import codecs
import sys

# Python 3 may report the encoding in lower case ('utf-8')
if sys.stdout.encoding.lower() != 'utf-8':
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding.lower() != 'utf-8':
    sys.stderr = codecs.getwriter('utf-8')(sys.stderr.buffer, 'strict')
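
To see what such a StreamWriter does without replacing the real stdout, here is a small sketch that wraps an in-memory byte stream instead (the io.BytesIO objects are only stand-ins for sys.stdout or sys.stdout.buffer). It also shows that an ASCII writer with the 'strict' error handler raises the same UnicodeEncodeError as in the log file example.

```python
import codecs
import io

# In-memory stand-in for sys.stdout (Python 2) or sys.stdout.buffer (Python 3).
raw = io.BytesIO()
writer = codecs.getwriter('utf-8')(raw, 'strict')

# The writer accepts unicode strings and passes encoded bytes downstream.
writer.write(u"\u20ac0.75\n")
result = raw.getvalue()

# With an ASCII writer and 'strict', the euro sign cannot be encoded.
ascii_writer = codecs.getwriter('ascii')(io.BytesIO(), 'strict')
try:
    ascii_writer.write(u"\u20ac0.75\n")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```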

StreamWriter Wrapper with Dynamic Encoding

The above examples use a fixed encoding, which is highly recommended if the output is written to file, piped to another program or otherwise logged or processed.

If the output is never processed or logged, but only used for interactive debugging, it is also possible to let the output encoding depend on the environment. Candidates are, for example, sys.stdout.encoding, sys.stderr.encoding, locale.getpreferredencoding(), sys.getdefaultencoding() or, if none of these is known, US-ASCII. This is roughly what the print() function does, only slightly more robust, because it uses XML character references if a character can't be displayed instead of raising a UnicodeEncodeError.
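
As a small illustration of the 'xmlcharrefreplace' error handler mentioned here: characters that the target encoding cannot represent are replaced by a numeric XML character reference, while an encoding that can represent them is unaffected. The variable names are only for this example.

```python
text = u"\u20ac0.75"  # euro sign followed by 0.75

# ASCII cannot represent the euro sign (U+20AC, decimal 8364),
# so it is replaced by a numeric character reference.
ascii_bytes = text.encode('ascii', 'xmlcharrefreplace')

# UTF-8 can represent it, so the error handler is never invoked.
utf8_bytes = text.encode('utf-8', 'xmlcharrefreplace')
```

Here ascii_bytes holds &#8364;0.75, and utf8_bytes holds the regular three-byte UTF-8 sequence for the euro sign followed by 0.75.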

The following code snippet sets the encoding according to these values. It also replaces characters that the output encoding cannot represent with their XML character references, for example non-ASCII characters when the desired output is ASCII.

#!/usr/bin/env python
import codecs
import locale
import sys

print(sys.stdout.encoding)
print(sys.stderr.encoding)
print(locale.getpreferredencoding())


def wrap_stream(stream, name):
    """Wrap stream in a StreamWriter, unless it already uses UTF-8."""
    # stream.encoding is None in Python 2 when the output is piped
    if (stream.encoding or '').upper() == 'UTF-8':
        return stream
    encoding = stream.encoding or locale.getpreferredencoding()
    try:
        encoder = codecs.getwriter(encoding)
    except LookupError:
        stream.write("Warning: unknown encoding %s specified in locale().\n" % encoding)
        encoding = 'UTF-8'
        encoder = codecs.getwriter(encoding)
    if encoding.upper() != 'UTF-8':
        stream.write("Warning: %s in %s format. Diacritical signs are represented in XML-coded format.\n" % (name, encoding))
    # Flush the warnings before writing through the underlying byte stream
    stream.flush()
    # Python 3 text streams expose the byte stream as .buffer; Python 2 has none
    return encoder(getattr(stream, 'buffer', stream), 'xmlcharrefreplace')


sys.stdout = wrap_stream(sys.stdout, 'stdout')
sys.stderr = wrap_stream(sys.stderr, 'stderr')

Further Reading