Unicode Strings in Python
Both Python 2 and Python 3 have good Unicode support. The difference is that Python 3 no longer silently translates between a decoded codepoint sequence (aka a Unicode string) and an encoded byte string.
It is possible to write code that works in both Python 2 and Python 3 at the same time, although you have to be careful of a few pitfalls, in particular the use of the str and unicode functions.
Encoding of Stdout
The heart of any Unicode problem is that the input must be decoded and the output must be encoded.
Consider the following Python program:
    #!/usr/bin/env python
    # encoding: utf-8
    print("Caractères Unicode")
The second line makes it obvious that the input encoding (of the source file) is UTF-8. But it is never obvious what the output encoding (to stdout) is.
The print() function takes care of the output encoding by encoding the given string with the encoding listed in sys.stdout.encoding. This is usually the encoding of the terminal, but it may be unset if the output of Python is piped to a file or other program. In those cases, print() may raise a UnicodeEncodeError if the text is not plain ASCII. See Encoding of Python stdout for details.
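As a rough illustration of the situation described above, the following sketch inspects sys.stdout.encoding and falls back to a default when it is unset. The helper name stdout_encoding and the choice of UTF-8 as fallback are assumptions made for this example; they are not part of Python.

```python
import sys

def stdout_encoding(default="utf-8"):
    """Return the encoding of stdout, or `default` if it is unset (hypothetical helper)."""
    # sys.stdout.encoding may be None (for example Python 2 with piped output),
    # and a replaced sys.stdout may lack the attribute entirely.
    return getattr(sys.stdout, "encoding", None) or default

text = u"Caract\u00e8res Unicode"  # same text as above, written with an escape
# The "replace" error handler substitutes "?" for characters the target
# encoding cannot represent, instead of raising UnicodeEncodeError.
encoded = text.encode(stdout_encoding(), "replace")
```

Whether silently replacing characters is acceptable depends on the program; for anything other than debugging output, an explicit encoding is usually the safer choice.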
The general principle to deal with output in Python is:
- Use Unicode strings in the body of your program. Decode input, and encode output explicitly. In particular, do not encode at any other place in your program.
- Be careful of functions that implicitly translate from Unicode to byte strings, including str() in Python 2 (see Functions to Avoid).
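The principle above can be sketched as follows. The function name shout and the sample bytes are made up for illustration; the point is only that decoding and encoding each happen once, at the edges of the program.

```python
def shout(text):
    """Pure Unicode processing: no encoding or decoding happens here."""
    return text.upper()

raw_in = b"caract\xc3\xa8res"     # bytes as they might arrive from a file or socket
text = raw_in.decode("utf-8")     # decode once, at the input boundary
result = shout(text)              # the body of the program sees Unicode only
raw_out = result.encode("utf-8")  # encode once, at the output boundary
```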
There are still some pitfalls, even with proper decoding of input and encoding of output.
Functions to Avoid
Functions that translate from Unicode to byte strings should be avoided. In particular, str() should be avoided in Python 2, since it implicitly encodes to ASCII and fails for codepoints above U+007F.
str() in Python 3 is the equivalent of unicode() in Python 2 and can be used without problems. Be aware that unicode() does not exist in Python 3. The following preamble allows you to use the unicode() function in both Python 2 and Python 3.
    try:
        unicode
    except NameError:
        unicode = str  # Python 3 compatibility
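One possible use of this preamble is a small conversion helper that always returns a Unicode string. The name to_text and the UTF-8 default are assumptions for this sketch, not an established API.

```python
try:
    unicode
except NameError:
    unicode = str  # Python 3 compatibility

def to_text(obj, encoding="utf-8"):
    """Return obj as a Unicode string (hypothetical helper)."""
    # Byte strings are decoded explicitly, never via an implicit ASCII step.
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    return unicode(obj)
```

For example, to_text(b"caf\xc3\xa9") and to_text(42) both return Unicode strings on either Python version.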
Another good alternative is to use old-style string formatting, along with Unicode literals:
    from __future__ import unicode_literals
    myunicodestring = "%s" % myobject
I recommend against using code that maps __str__ to __unicode__, such as the UnicodeMixin class defined on http://docs.python.org/howto/pyporting.html#str-unicode. The reason is that it still encodes your Unicode strings, and you should only do so just before output; str() is not the right place. In addition, the print() function does a pretty good job of determining the encoding of the TTY. UnicodeMixin does not, and assumes it is always UTF-8, which may be a wrong assumption.
The repr() function also encodes from Unicode strings to byte strings in Python 2, but is generally safe, since the output of repr() should be valid Python, which is generally ASCII-encoded in Python 2.
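In Python 3, repr() returns a Unicode string and leaves printable non-ASCII characters as they are. If ASCII-only output similar to the Python 2 behaviour is wanted, the built-in ascii() function can be used. The snippet below is a Python 3 only sketch.

```python
text = u"Caract\u00e8res"

# ascii() behaves like repr(), but escapes every non-ASCII character,
# so the result can be written to any ASCII-compatible output.
escaped = ascii(text)
only_ascii = all(ord(c) < 128 for c in escaped)
```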
print() uses the encoding of the underlying TTY (terminal) to determine the output encoding. This usually gives the result you want, but may still fail if the output of a script is piped to a file or other program. However, this is a different issue (see Encoding of Python stdout), and in general the print function is fine for debugging output. For more consistent output, fileobject.write(myoutput.encode('utf-8')) is recommended, as it gives the same result regardless of the environment.
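A minimal sketch of such environment-independent output is shown below. The temporary file path is made up for the example, and the second variant using io.open() is an alternative not mentioned above; it is available in Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

myoutput = u"Caract\u00e8res Unicode"
path = os.path.join(tempfile.mkdtemp(), "output.txt")  # hypothetical output file

# Variant 1: open in binary mode and encode explicitly.
with open(path, "wb") as fileobject:
    fileobject.write(myoutput.encode("utf-8"))

# Variant 2: let io.open() encode on write; the file object accepts Unicode.
with io.open(path, "a", encoding="utf-8") as fileobject:
    fileobject.write(myoutput)

# Read the raw bytes back to see what actually ended up in the file.
with open(path, "rb") as fileobject:
    data = fileobject.read()
```

Both variants write the same UTF-8 bytes, whatever the terminal or locale settings are.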
Replacing str() with unicode()
Here is a list of functions that call
|Code||Python 2||Python 2 with Unicode literals||Python 3|
||N/A (raises NameError)|
You can specify Unicode literals with:
from __future__ import unicode_literals
I have found that this changes the behaviour of the '%s' code, regardless of whether the import statement is present near the '%s' % s line or near the s.__str__() function (these two may reside in different modules).
While the table does not contain any surprises, it reinforces the advice to avoid using the str() function.
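The Python 3 behaviour can be checked with a short experiment. This is only a sketch for Python 3; the class Greeting is hypothetical and exists purely to show which method gets called.

```python
class Greeting(object):
    """Hypothetical class used only to show which method is invoked."""
    def __str__(self):
        return "from __str__"

s = Greeting()
results = {
    "str(s)": str(s),      # calls s.__str__()
    "'%s' % s": "%s" % s,  # also calls s.__str__() in Python 3
}
try:
    results["unicode(s)"] = unicode(s)
except NameError:
    # unicode() does not exist in Python 3
    results["unicode(s)"] = "NameError"
```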
TODO: Fails with .... in terminal Fails with .... in TextMate call
Python 2.2 to 3.2 differentiate between regular (narrow) and wide builds. Wide builds are required to support codepoints above 0xFFFF.
Since Python 3.3, codepoints above 0xFFFF are supported by default in every build.
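Which kind of build is running can be checked through sys.maxunicode. The variable names below are arbitrary; the snippet is a small sketch of that check.

```python
import sys

# sys.maxunicode is 0x10FFFF on wide builds and on every Python 3.3+,
# and 0xFFFF on narrow builds of Python 2.2 to 3.2.
is_wide = sys.maxunicode > 0xFFFF

astral = u"\U0001F600"  # a codepoint above 0xFFFF
length = len(astral)    # 1 on wide builds, 2 (a surrogate pair) on narrow builds
```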