Unicode Strings in Python

From Exterior Memory
Revision as of 01:19, 29 March 2012 by MacFreek

Both Python 2 and Python 3 have good Unicode support. The difference is that Python 3 no longer silently translates between a decoded codepoint sequence (aka a Unicode string) and an encoded byte string.

It is possible to write code that works in both Python 2 and Python 3 at the same time, although you have to be careful of a few pitfalls, in particular around the use of the str() and unicode() functions.

Encoding of Stdout

The heart of any Unicode problem is that the input must be decoded and the output must be encoded.

Consider the following Python program:

#!/usr/bin/env python
# encoding: utf-8
print("Caractères Unicode")

The second line makes it obvious that the input encoding (of the source file) is UTF-8. But it is never obvious what the output encoding (to stdout) is.

The print() function takes care of the output encoding by encoding the given string with the encoding listed in sys.stdout.encoding.

sys.stdout.encoding is usually the encoding of the terminal, but may be unset if the output of Python is piped to a file or another program. In that case, Python 2 falls back to ASCII, and print raises a UnicodeEncodeError if the text is not plain ASCII. See Encoding of Python stdout for details.
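
As a minimal sketch of encoding explicitly instead of relying on print(), the helper below picks the stream's encoding and falls back to another one when it is unset. The names encode_for_stream and PipedStream are hypothetical, and the UTF-8 fallback is an assumption that may not suit every consumer of the output.

```python
import sys


def encode_for_stream(text, stream=None, fallback="utf-8"):
    """Encode text with the stream's encoding, or with fallback if it has none."""
    if stream is None:
        stream = sys.stdout
    # The encoding attribute can be None or missing, e.g. when output is piped.
    encoding = getattr(stream, "encoding", None) or fallback
    # "replace" substitutes "?" for characters the encoding cannot represent.
    return text.encode(encoding, "replace")


class PipedStream(object):
    """Stand-in for a stream without a known encoding (illustration only)."""
    encoding = None


class AsciiStream(object):
    """Stand-in for a stream limited to ASCII (illustration only)."""
    encoding = "ascii"


piped_bytes = encode_for_stream(u"Caract\u00e8res Unicode", PipedStream())
ascii_bytes = encode_for_stream(u"Caract\u00e8res Unicode", AsciiStream())
```

The resulting bytes can then be written to a binary stream rather than passed to print().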

General Principle

The general principle to deal with output in Python is:

  • Use Unicode strings in the body of your program. Decode input and encode output explicitly. In particular, do not encode anywhere else in your program.
  • Be careful of functions that implicitly translate from Unicode to byte strings, including print(), str() (in Python 2) and repr().
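
The principle above can be sketched as a three-step pipeline: decode at the input boundary, work on Unicode strings, encode at the output boundary. The function names are hypothetical, and UTF-8 is assumed for both input and output purely for illustration.

```python
def decode_input(raw, encoding="utf-8"):
    """Turn bytes from the outside world into a Unicode string."""
    return raw.decode(encoding)


def shout(text):
    """Body of the program: operates on Unicode strings only."""
    return text.upper()


def encode_output(text, encoding="utf-8"):
    """Turn a Unicode string back into bytes, just before writing it out."""
    return text.encode(encoding)


raw_in = b"caract\xc3\xa8res"   # UTF-8 bytes, as read from a file or socket
raw_out = encode_output(shout(decode_input(raw_in)))
```

Because shout() never sees bytes, it handles the accented character correctly regardless of how the input was encoded.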

Caveats

There are still some pitfalls, even with proper decoding of input and encoding of output.

Functions to Avoid

Functions that translate from Unicode to byte strings should be avoided. In particular, avoid str() in Python 2: it implicitly encodes to ASCII and fails for codepoints above U+007F.

str() in Python 3 is the equivalent of unicode() in Python 2 and can be used without problems. Be aware that unicode() does not exist in Python 3. The following preamble allows you to use the unicode() function in both Python 2 and Python 3.

try:
    unicode
except NameError:
    unicode = str   # Python 3 compatibility
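
A short usage sketch of this preamble follows. The helper to_text is hypothetical, and it is only meant for numbers and for objects that already produce Unicode or ASCII text. In Python 2, unicode() would still fail on byte strings containing non-ASCII data.

```python
try:
    unicode
except NameError:
    unicode = str   # Python 3 compatibility


def to_text(value):
    """Convert a number or Unicode-aware object to a Unicode string."""
    return unicode(value)


label = to_text(42) + u" caract\u00e8res"
```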

The repr() function also encodes from Unicode strings to byte strings in Python 2, but is generally safe, since the output of repr() should be valid Python, which is generally ASCII-encoded in Python 2.

print() uses the encoding of the underlying TTY (terminal) to encode its output. This usually gives the result you want, but may still fail if the output of a script is piped to a file or another program.

I recommend against using code that maps __str__ to __unicode__, such as the UnicodeMixin class defined on http://docs.python.org/howto/pyporting.html#str-unicode. The reason is that it still encodes your Unicode strings, which you should only do just before calling print() or sys.stdout.write(); str() is not the right place. In addition, print() determines the output encoding of the TTY fairly well, whereas UnicodeMixin does not and assumes the output is always UTF-8, which may be wrong.
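
One possible alternative, sketched here under assumptions, is to give a class a single method that returns Unicode text and to encode only at the output boundary. The City class and its to_text method are hypothetical names chosen for illustration, and this is not presented as the only way to structure such a class.

```python
import sys

PY2 = sys.version_info[0] == 2

try:
    unicode
except NameError:
    unicode = str   # Python 3 compatibility


class City(object):
    """Hypothetical class whose text form contains non-ASCII characters."""

    def __init__(self, name):
        self.name = name

    def to_text(self):
        # Always returns a Unicode string; no encoding happens here.
        return u"City: " + self.name

    if PY2:
        __unicode__ = to_text   # unicode(city) works in Python 2
    else:
        __str__ = to_text       # str(city) is already Unicode in Python 3

    def __repr__(self):
        # %r keeps the output ASCII-only in Python 2.
        return "City(%r)" % (self.name,)


text = unicode(City(u"Gen\u00e8ve"))
encoded = text.encode("utf-8")   # encode only at the output boundary
```

In Python 2, str(city) then falls back to __repr__, which stays ASCII-only, so no implicit encoding error can occur there.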

Replacing str() with unicode()

Here is a list of functions that call __str__, __repr__, and __unicode__:

Code       | Python 2                           | Python 2 with Unicode literals     | Python 3
print(s)   | __str__ or __repr__                | __str__ or __repr__                | __str__ or __repr__
'%s' % s   | __str__ or __repr__                | __unicode__ or __str__ or __repr__ | __str__ or __repr__
str(s)     | __str__ or __repr__                | __str__ or __repr__                | __str__ or __repr__
unicode(s) | __unicode__ or __str__ or __repr__ | __unicode__ or __str__ or __repr__ | N/A (raises NameError)
repr(s)    | __repr__                           | __repr__                           | __repr__
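
The fallback order in the Python 3 column can be checked with a small sketch. The classes Both and ReprOnly are hypothetical and exist only to make the dispatch visible.

```python
class Both(object):
    """Defines both methods, so __str__ wins wherever it is consulted."""

    def __str__(self):
        return "from __str__"

    def __repr__(self):
        return "from __repr__"


class ReprOnly(object):
    """Defines only __repr__, so str() and '%s' fall back to it."""

    def __repr__(self):
        return "from __repr__"


results = {
    "str": str(Both()),
    "format": "%s" % Both(),
    "repr": repr(Both()),
    "str_fallback": str(ReprOnly()),
    "format_fallback": "%s" % ReprOnly(),
}
```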

You can specify Unicode literals with:

from __future__ import unicode_literals

I have found that this changes the behaviour of the '%s' code, regardless of whether the import statement is in the module containing the '%s' % s line or in the module that defines the __str__() method (the two may reside in different modules).

While the table does not contain any surprises, it reinforces the advice to avoid using the str() function.

TODO: Fails with .... in terminal Fails with .... in TextMate call

Multilingual Plane

Python 2.2 through 3.2 distinguish between narrow (regular) and wide builds. A wide build is required to handle codepoints above U+FFFF as single characters; a narrow build represents them as surrogate pairs.

See: http://www.python.org/dev/peps/pep-0261/

Since Python 3.3, the distinction between narrow and wide builds is gone, and codepoints above U+FFFF are supported by default.
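
A minimal sketch for checking at runtime which kind of build is in use, based on sys.maxunicode. The variable names are illustrative only.

```python
import sys

# 0xFFFF on a narrow build; 0x10FFFF on a wide build and on Python 3.3+.
is_wide = sys.maxunicode > 0xFFFF

astral = u"\U0001F600"   # a codepoint above U+FFFF

# One character on wide builds; a surrogate pair (length 2) on narrow builds.
expected_length = 1 if is_wide else 2
```

Code that slices or counts characters in such strings may want to check is_wide first when it has to run on Python 3.2 or older.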