Unicode Strings in Python

Both Python 2 and Python 3 have good Unicode support. The difference is that Python 3 no longer silently translates between a decoded codepoint sequence (a Unicode string) and an encoded byte string.

It is possible to write code that works in both Python 2 and Python 3, although you have to be careful of a few pitfalls, in particular the use of the str and unicode functions.

Encoding of Stdout
The heart of any Unicode problem is that the input must be decoded and the output must be encoded.

Consider the following Python program:

#!/usr/bin/env python
# encoding: utf-8
print("Caractères Unicode")

The second line makes it obvious that the input encoding (of the source file) is UTF-8. But it is never obvious what the output encoding (to stdout) is.

The print function takes care of the output encoding by encoding the given string with the encoding listed in sys.stdout.encoding.

sys.stdout.encoding is usually the encoding of the terminal, but may be unset (None in Python 2) if the output of Python is piped to a file or other program. In those cases, print will raise a UnicodeEncodeError if the text is not plain ASCII. See Encoding of Python stdout for details.
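The failure described above can be reproduced without a terminal. The following is a minimal sketch, not the author's code: write_with_encoding is a hypothetical helper, and the in-memory stream only stands in for a stdout with a given encoding.

```python
import io

def write_with_encoding(text, encoding):
    """Write text to an in-memory stream that encodes like a stdout with
    the given encoding, and return the bytes that were produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # keep raw from being closed together with the wrapper
    return data

# A UTF-8 stream accepts any text.
utf8_bytes = write_with_encoding(u"Caract\u00e8res Unicode", "utf-8")

# An ASCII stream fails as soon as a non-ASCII code point is written.
try:
    write_with_encoding(u"Caract\u00e8res Unicode", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

The same text succeeds or fails depending only on the encoding of the stream, which is why a script can work in a terminal and break when its output is piped.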

General Principle
The general principle to deal with output in Python is:


 * Use Unicode strings in the body of your program. Decode input, and encode output explicitly. In particular, do not encode at any other place in your program.


 * Be careful of functions that implicitly translate from Unicode to byte strings, including print, str() (in Python 2) and repr().
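The principle can be sketched as a function that decodes at the input boundary, works on Unicode text only, and encodes at the output boundary. This is an illustration under assumptions: shout and its parameter names are hypothetical, and upper-casing stands in for whatever the body of the program does.

```python
def shout(raw_input, input_encoding="utf-8", output_encoding="utf-8"):
    """Decode bytes, process them as Unicode text, and encode the result."""
    # 1. Decode once, at the input boundary.
    text = raw_input.decode(input_encoding)
    # 2. Work on Unicode text only; no encoding happens here.
    result = text.upper()
    # 3. Encode once, at the output boundary.
    return result.encode(output_encoding)

# Latin-1 bytes in, UTF-8 bytes out.
latin1_bytes = u"caract\u00e8res".encode("latin-1")
utf8_result = shout(latin1_bytes, input_encoding="latin-1")
```

Because the body only ever sees text, changing the input or output encoding does not affect the processing step.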

Caveats
There are still some pitfalls, even with proper decoding of input and encoding of output.

Functions to Avoid
Functions that translate from Unicode to byte strings should be avoided. In particular, the str function should be avoided in Python 2, since it implicitly encodes to ASCII and will fail for codepoints above U+007F.

str in Python 3 is the equivalent of unicode in Python 2 and can be used without problems. Be aware that unicode does not exist in Python 3. The following preamble allows you to use the unicode function in both Python 2 and Python 3.

try:
    unicode
except NameError:
    unicode = str  # Python 3 compatibility

Another good alternative is to use old-style string formatting, along with Unicode literals:

from __future__ import unicode_literals
myunicodestring = "%s" % myobject
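The two approaches can be compared side by side. The sketch below uses hypothetical helper names (to_text_call, to_text_format) and a u"" literal in place of the unicode_literals import, so that the snippet does not depend on being the first statement in a file.

```python
try:
    unicode
except NameError:
    unicode = str  # Python 3 compatibility

def to_text_call(value):
    """Convert via the unicode() function (which is str() on Python 3)."""
    return unicode(value)

def to_text_format(value):
    """Convert via old-style formatting with a Unicode format string."""
    return u"%s" % value

samples = [42, 3.5, u"Caract\u00e8res"]
converted = [(to_text_call(v), to_text_format(v)) for v in samples]
```

For these sample values both helpers return the same Unicode string, and neither passes through a byte string on the way.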

I recommend against using code that maps __str__ to __unicode__, such as the UnicodeMixin class defined on http://docs.python.org/howto/pyporting.html#str-unicode. There are two reasons. First, it still encodes your Unicode strings, and you should only do so just before writing output; __str__ is not the right place. Second, the print function determines the output encoding of the TTY reasonably well. UnicodeMixin does not, and assumes it is always UTF-8, which may be a wrong assumption.

The repr function also encodes from Unicode strings to byte strings in Python 2, but is generally safe, since the output of repr() should be valid Python, which is generally ASCII-encoded in Python 2.

The print function uses the encoding of the underlying TTY (terminal) to determine the output encoding. This usually gives the result you want, but may still fail if the output of a script is piped to a file or other program. However, this is a different issue, and in general the print function is fine for debugging output. For more consistent output, an output path with an explicitly chosen encoding is recommended, as it gives the same result regardless of the environment.
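One way to make output independent of the terminal is to encode explicitly and write bytes. This is a minimal sketch and not necessarily the approach the author had in mind: write_encoded is a hypothetical helper, and UTF-8 is an assumed default.

```python
import io

def write_encoded(binary_stream, text, encoding="utf-8"):
    """Encode text explicitly and write the resulting bytes, plus a newline."""
    binary_stream.write((text + u"\n").encode(encoding))

# In a real script the target could be sys.stdout.buffer (Python 3).
# An in-memory buffer is used here so the sketch has no side effects.
buffer = io.BytesIO()
write_encoded(buffer, u"Caract\u00e8res Unicode")
```

The bytes written are the same whether the script runs in a terminal, in a pipe, or under a different locale.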

Replacing str with unicode
Here is a list of functions that call,  , and  :

In this table, green functions always return Unicode, and red functions always return byte strings.

You can specify Unicode literals with:

from __future__ import unicode_literals

I have found that this changes the behaviour of the  code, regardless if the import statement is present near the   line, or if it is near the   function. (these two may reside in a different module).

While the table does not contain any surprises, it reinforces the advice to avoid using the str function in Python 2, and points to the string formatting operator, %, as a viable alternative.

Multilingual Plane
Python 2.2 to 3.2 differentiate between narrow (regular) and wide builds. Wide builds are required to support codepoints above 0xFFFF as single characters.

See: http://www.python.org/dev/peps/pep-0261/

Since Python 3.3, every build supports the full range of codepoints, and the distinction between narrow and wide builds no longer exists.
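Whether a given interpreter handles codepoints above 0xFFFF as single characters can be checked at runtime through sys.maxunicode. A short sketch, where supports_full_unicode is a hypothetical helper name:

```python
import sys

def supports_full_unicode():
    """True when this build stores codepoints above 0xFFFF as single characters."""
    return sys.maxunicode > 0xFFFF

# A codepoint outside the Basic Multilingual Plane.
astral = u"\U0001F600"

# On a wide build, and on any Python 3.3 or later, this is one character.
# On a narrow build it is stored as a surrogate pair of length 2.
astral_length = len(astral)
```

Code that slices or counts characters in text with such codepoints behaves differently on narrow builds, so the check is mainly useful when supporting Python versions before 3.3.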