Difference between revisions of "Encoding of Python stdout"

From Exterior Memory

Revision as of 18:57, 4 March 2012

Default Behaviour

Python determines the encoding of stdout and stderr based on the value of the LC_CTYPE variable, but only if stdout is a tty. So if the output goes straight to the terminal, LC_CTYPE (or LC_ALL) defines the encoding. However, when the output is piped to a file or to a different process, the encoding is not defined (sys.stdout.encoding is None), and the output defaults to 7-bit ASCII.

Example

encoding.py:

#!/usr/bin/env python
import sys
print(sys.stdout.encoding)

If the output is a terminal, there should be few surprises:

% LC_ALL=C
% ./encoding.py
US-ASCII
% LC_ALL=en_GB.UTF-8
% ./encoding.py     
UTF-8

However, as soon as the output is piped to a logfile, you get a different behaviour:

% ./encoding.py > encoding.out ; cat encoding.out
None

This poses a problem when writing UTF-8 to log files:

writelog.py:

#!/usr/bin/env python
# encoding: utf-8
import sys
sys.stderr.write(u"Convert currency ($1.00 = €0.75)\n")

% ./writelog.py
Convert currency ($1.00 = €0.75)
% ./writelog.py 2> error.log ; cat error.log
Traceback (most recent call last):
  File "./encoding.py", line 5, in <module>
    sys.stderr.write(u"Convert currency ($1.00 = €0.75)\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 26: ordinal not in range(128)
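
The failure does not depend on the pipe itself: when no encoding is known, the text is encoded as 7-bit ASCII, and the euro sign has no ASCII representation. A minimal sketch that reproduces the same error without any redirection, by calling the ASCII codec explicitly (the variable names are chosen here for illustration and are not part of writelog.py):

```python
# Reproduce the error from the traceback by encoding to ASCII explicitly.
message = u"Convert currency ($1.00 = \u20ac0.75)\n"

try:
    message.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception records which slice of the text could not be encoded.
offending = message[error.start:error.end]

# Encoding the same text as UTF-8 succeeds.
utf8_bytes = message.encode("utf-8")
```

The position stored in the exception is the index of the euro sign, which is the same position the traceback reports.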

Overriding the Encoding of stdout or stderr

Basically, Python makes an educated guess about the encoding of stdout and stderr. Unfortunately, it is not possible to simply set the encoding of stdout if Python's guess is wrong, like this:

sys.stdout.encoding = 'utf-8'

This results in a

TypeError: readonly attribute

My personal opinion is that Python fails its own guidelines, and I'm not the first to complain.

The general solution is to explicitly feed binary data to stdout and stderr. This can be done by doing the encoding manually:

#!/usr/bin/env python
# encoding: utf-8
import sys
sys.stderr.write(u"Convert currency ($1.00 = €0.75)\n".encode('utf-8'))

Fixing the output to a given encoding may seem ugly if the output is always sent to a terminal (this would even send UTF-8 to a non-UTF-8 terminal), but may be desirable if the output is often piped to other programs or to a file. It makes sure that the output is always consistent, regardless of the environment. This is a tremendous benefit when debugging.

Instead of calling encode() in each sys.stderr.write(), it is also possible to redirect stdout and/or stderr through a StreamWriter at the start of the program:

import sys, codecs

# Wrap each stream in a UTF-8 StreamWriter unless it is already UTF-8.
if sys.stdout.encoding != 'UTF-8':
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout, 'strict')
if sys.stderr.encoding != 'UTF-8':
    sys.stderr = codecs.getwriter('utf-8')(sys.stderr, 'strict')
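
To see what such a StreamWriter does, it can be wrapped around an in-memory byte buffer instead of the real stdout. This is only a sketch: io.BytesIO stands in for the underlying stream, and the names sink, writer and raw are chosen here for illustration.

```python
import codecs
import io

# An in-memory byte stream stands in for the real stdout or stderr.
sink = io.BytesIO()

# codecs.getwriter() returns a StreamWriter class; instantiating it wraps
# the byte stream so that text written to it is encoded as UTF-8 first.
writer = codecs.getwriter("utf-8")(sink, "strict")
writer.write(u"Convert currency ($1.00 = \u20ac0.75)\n")

# These are the bytes that would reach the file or pipe.
raw = sink.getvalue()
```

The euro sign arrives as the three bytes E2 82 AC, regardless of the locale settings of the environment.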

An alternative is to let the output encoding depend on the environment: for example sys.stdout.encoding, sys.stderr.encoding, locale.getpreferredencoding() or, if none of these is known, UTF-8 or US-ASCII. The following code snippet sets the encoding according to these variables. It also replaces characters that the target encoding cannot represent with their XML character reference, for example when the desired output is ASCII.

#!/usr/bin/env python
import sys,codecs,locale
print(sys.stdout.encoding)
print(sys.stderr.encoding)
print(locale.getpreferredencoding())

# The encoding attribute is None when the output is piped, so guard the
# upper() call against None.
if (sys.stdout.encoding or '').upper() != 'UTF-8':
    encoding = sys.stdout.encoding or locale.getpreferredencoding()
    try:
        encoder = codecs.getwriter(encoding)
    except LookupError:
        sys.stdout.write("Warning: unknown encoding %s specified in locale().\n" % encoding)
        encoding = 'UTF-8'
        encoder = codecs.getwriter(encoding)
    if encoding.upper() != 'UTF-8':
        sys.stdout.write("Warning: stdout in %s format. Diacritical signs are represented in XML-coded format.\n" % encoding)
    sys.stdout = encoder(sys.stdout, 'xmlcharrefreplace')
if (sys.stderr.encoding or '').upper() != 'UTF-8':
    encoding = sys.stderr.encoding or locale.getpreferredencoding()
    try:
        encoder = codecs.getwriter(encoding)
    except LookupError:
        sys.stderr.write("Warning: unknown encoding %s specified in locale().\n" % encoding)
        encoding = 'UTF-8'
        encoder = codecs.getwriter(encoding)
    if encoding.upper() != 'UTF-8':
        sys.stderr.write("Warning: stderr in %s format. Diacritical signs are represented in XML-coded format.\n" % encoding)
    sys.stderr = encoder(sys.stderr, 'xmlcharrefreplace')
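
The two ideas in the snippet above, the fallback chain and the 'xmlcharrefreplace' error handler, can be tried in isolation. A minimal sketch, assuming a hypothetical helper pick_encoding (not part of the original script) and an in-memory buffer in place of the real stream:

```python
import codecs
import io
import locale

def pick_encoding(stream_encoding, default="UTF-8"):
    """Return the first usable encoding name (hypothetical helper)."""
    for candidate in (stream_encoding, locale.getpreferredencoding(), default):
        if not candidate:
            continue  # e.g. None when the output is piped
        try:
            codecs.lookup(candidate)
            return candidate
        except LookupError:
            continue  # unknown codec name, try the next candidate
    return default

# Write a euro sign through an ASCII StreamWriter. The error handler
# replaces the character with its XML character reference instead of
# raising UnicodeEncodeError.
sink = io.BytesIO()
writer = codecs.getwriter("ascii")(sink, "xmlcharrefreplace")
writer.write(u"\u20ac0.75")
ascii_bytes = sink.getvalue()
```

With this handler the euro sign (U+20AC, decimal 8364) is written as &#8364;, so the output stays pure ASCII while the information is preserved.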