Unicode Strings in Python
Both Python 2 and Python 3 have good Unicode support. The difference is that Python 3 no longer silently translates between a decoded codepoint sequence (a Unicode string) and an encoded byte string.
It is possible to write code that works in both Python 2 and Python 3 at the same time, although you have to be careful of a few pitfalls, in particular around the use of the str() and unicode() functions.
Encoding of Stdout
The heart of any Unicode problem is that the input must be decoded and the output must be encoded.
Consider the following Python program:
#!/usr/bin/env python
# encoding: utf-8
print("Caractères Unicode")
The second line makes it obvious that the input encoding (of the source file) is UTF-8. But it is never obvious what the output encoding (to stdout) is.
The print() function takes care of the output encoding by encoding the given string with the encoding listed in sys.stdout.encoding. sys.stdout.encoding is usually the encoding of the terminal, but may be unset (None in Python 2) if the output of Python is piped to a file or other program. In those cases, print() will raise a UnicodeEncodeError if the text is not plain ASCII. See Encoding of Python stdout for details.
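One way to guard against an unset sys.stdout.encoding is to pick the encoding yourself before writing. The following is only a sketch: the helper names output_encoding and encode_for, and the UTF-8 fallback, are assumptions made for illustration and are not part of Python itself.

```python
import sys


def output_encoding(stream, fallback="utf-8"):
    """Return the stream's encoding, or `fallback` if it is unset.

    The UTF-8 fallback is an assumption; pick whatever the program
    that consumes your output expects.
    """
    return getattr(stream, "encoding", None) or fallback


def encode_for(stream, text):
    """Encode a Unicode string to bytes suitable for `stream`.

    Characters the target encoding cannot represent are replaced
    instead of raising UnicodeEncodeError.
    """
    return text.encode(output_encoding(stream), "replace")
```

The resulting bytes can be written with sys.stdout.write() in Python 2, or with sys.stdout.buffer.write() in Python 3.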
General Principle
The general principle to deal with output in Python is:
- Use Unicode strings in the body of your program. Decode input, and encode output explicitly. In particular, do not encode at any other place in your program.
- Be careful of functions that implicitly translate from Unicode to byte strings, including print(), str() (in Python 2) and repr().
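The principle above can be sketched as a small pipeline: decode at the boundary, work on Unicode in the middle, encode at the other boundary. The function names shout and process, and the UTF-8 defaults, are hypothetical and chosen only for this example.

```python
def shout(text):
    """Body of the program: works on Unicode strings only."""
    return text.upper()


def process(raw, input_encoding="utf-8", output_encoding="utf-8"):
    """Decode input bytes, transform the text, and encode the result."""
    text = raw.decode(input_encoding)        # decode input once, at the edge
    result = shout(text)                     # no encoding or decoding here
    return result.encode(output_encoding)    # encode output once, at the edge
```

Because shout() never sees byte strings, it behaves the same in Python 2 and Python 3.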
Caveats
There are still some pitfalls, even with proper decoding of input and encoding of output.
Functions to Avoid
Functions that translate from Unicode to byte strings should be avoided. In particular, str() should be avoided in Python 2, since it implicitly encodes to ASCII and will fail for codepoints above U+007F. str() in Python 3 is the equivalent of unicode() in Python 2 and can be used without problems. Be aware that unicode() does not exist in Python 3. The following preamble allows you to use the unicode() function in both Python 2 and Python 3.
try:
    unicode
except NameError:
    unicode = str  # Python 3 compatibility
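As a sketch of how that preamble might be used, the helper below converts arbitrary values to Unicode strings. The name to_text and the UTF-8 default are assumptions for illustration. Byte strings are decoded explicitly, because unicode() in Python 2 would decode them as ASCII and str() in Python 3 would return their repr().

```python
try:
    unicode
except NameError:
    unicode = str  # Python 3 compatibility


def to_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string.

    Byte strings are decoded with `encoding` (an assumed default);
    every other value is converted with unicode().
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return unicode(value)
```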
The repr() function also encodes from Unicode strings to byte strings in Python 2, but is generally safe, since the output of repr() should be valid Python, which is generally ASCII-encoded in Python 2.
The print() function uses the encoding of the underlying TTY (terminal) to determine the output encoding. This usually gives the result you want, but may still fail if the output of a script is piped to a file or other program.
I recommend against using code that maps __str__ to __unicode__, such as the UnicodeMixin class defined on http://docs.python.org/howto/pyporting.html#str-unicode. The reason is that it still encodes your Unicode strings, and you should only do so just before calling print() or sys.stdout.write(); str() is not the right place. In addition, the print() function determines the output encoding of the TTY quite well. UnicodeMixin does not, and assumes it is always UTF-8, which may be a wrong assumption.
Replacing str() with unicode()
Here is a list of functions that call __str__, __repr__, and __unicode__:
Code | Python 2 | Python 2 with Unicode literals | Python 3 |
---|---|---|---|
print(s) | __str__ or __repr__ | __str__ or __repr__ | __str__ or __repr__ |
'%s' % s | __str__ or __repr__ | __unicode__ or __str__ or __repr__ | __str__ or __repr__ |
str(s) | __str__ or __repr__ | __str__ or __repr__ | __str__ or __repr__ |
unicode(s) | __unicode__ or __str__ or __repr__ | __unicode__ or __str__ or __repr__ | N/A (raises NameError) |
repr(s) | __repr__ | __repr__ | __repr__ |
You can specify Unicode literals with:
from __future__ import unicode_literals
I have found that this changes the behaviour of the '%s' % s code, regardless of whether the import statement is present near the '%s' % s line or near the s.__str__() function (these two may reside in different modules).
While the table does not contain any surprises, it reinforces the advice to avoid using the str() function.
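The "__str__ or __repr__" entries in the table describe a fallback: __str__ is used when the class defines it, otherwise __repr__. A small sketch with two hypothetical classes, WithStr and OnlyRepr, makes the fallback visible. It uses ASCII-only return values so that it behaves the same in Python 2 and Python 3.

```python
class WithStr(object):
    """Defines both methods, so str() and '%s' use __str__."""

    def __str__(self):
        return "from __str__"

    def __repr__(self):
        return "from __repr__"


class OnlyRepr(object):
    """Defines only __repr__, so str() and '%s' fall back to it."""

    def __repr__(self):
        return "from __repr__"


def describe(obj):
    """Return the results of the three conversions listed in the table."""
    return (str(obj), "%s" % obj, repr(obj))
```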
TODO: Fails with .... in terminal Fails with .... in TextMate call
Multilingual Plane
Python 2.2 to 3.2 differentiate between narrow (regular) and wide builds. Wide builds are required to represent codepoints above U+FFFF as single characters; narrow builds store them as surrogate pairs.
See: http://www.python.org/dev/peps/pep-0261/
Since Python 3.3, the distinction between narrow and wide builds no longer exists, and codepoints above U+FFFF are supported by default.
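A quick way to see which kind of interpreter you are running is to inspect sys.maxunicode. The sketch below is illustrative only; the helper name is_wide_build and the sample codepoint U+1F600 are arbitrary choices for this example.

```python
import sys


def is_wide_build():
    """Return True if codepoints above U+FFFF are single characters.

    True on wide builds of Python 2.2 to 3.2 and on every Python 3.3+.
    """
    return sys.maxunicode > 0xFFFF


# A codepoint outside the Basic Multilingual Plane.
astral = u"\U0001F600"

# 1 on wide builds and Python 3.3+, 2 (a surrogate pair) on narrow builds.
astral_length = len(astral)
```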