Python Unicode
Unicode scripts
In Python 3, all strings are Unicode. In Python 2.x, you can use Unicode in comments and strings inside a .py script if you make Python aware of it. The first or second line of the script file should be a special comment:
# -*- coding: utf-8 -*-
Unicode strings
In Python 2.x, strings are byte strings. A byte string stores each character as 1 byte, which allows for 256 possible values; ASCII defines only the first 128 of these. Unfortunately, that leaves no room for special characters (e.g., diacritics, Chinese, Hebrew). You can instead create a Unicode string with the u- prefix:
>>> string1 = 'unicode'
>>> string2 = u'ünîcødé'
>>> print string1.__class__
>>> print string2.__class__
<type 'str'>
<type 'unicode'>
Unicode strings can be encoded: converted to a byte string in which each special character is represented by a sequence of bytes. For example, in UTF-8: é → \xc3\xa9. Such a byte string can be decoded back into Unicode later on.
>>> string = string.encode('utf-8') # Unicode => bytes
>>> string = string.decode('utf-8') # bytes => Unicode
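For comparison, here is a minimal sketch of the same round trip in Python 3, where str is the Unicode type and bytes is the byte string type. The variable names are only illustrative.

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
text = 'é'

data = text.encode('utf-8')       # str => bytes
print(data)                       # b'\xc3\xa9'
print(len(text), len(data))       # 1 2 (one character, two bytes)

restored = data.decode('utf-8')   # bytes => str
print(restored == text)           # True
```

The character count and the byte count differ, which is why the two types should not be mixed.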
Brute force encoding
If the conversion is not possible, a UnicodeEncodeError or UnicodeDecodeError is raised. To brute-force encode() or decode() so that no error is raised, pass 'ignore' as the additional errors parameter; characters that cannot be converted are then silently dropped. Sometimes, not crashing is more important than preserving special characters (e.g., in a web crawler).
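The effect of the errors parameter can be sketched as follows, here in Python 3. The 'replace' handler is not mentioned above; it is shown as a standard alternative that substitutes a placeholder instead of dropping the character. The sample values are illustrative.

```python
# Python 3 sketch of the errors parameter of encode() and decode().
text = 'ünîcødé'

try:
    text.encode('ascii')                    # the default handler is 'strict'
except UnicodeEncodeError as error:
    print('strict failed:', error.reason)

ignored = text.encode('ascii', 'ignore')    # drops what ASCII cannot hold
replaced = text.encode('ascii', 'replace')  # substitutes '?' instead
print(ignored)                              # b'ncd'
print(replaced)                             # b'?n?c?d?'

data = b'caf\xe9'                           # Latin-1 bytes, not valid UTF-8
decoded_ignore = data.decode('utf-8', 'ignore')
decoded_replace = data.decode('utf-8', 'replace')
print(decoded_ignore)                       # caf
print(repr(decoded_replace))                # 'caf\ufffd'
```

With 'ignore' the data loss is invisible afterwards, so 'replace' may be preferable when you want to see where characters went missing.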
The following function uses a brute-force approach to convert a string to Unicode:
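def decode_utf8(string):
    # Returns the given byte string as a Unicode string (if possible).
    if isinstance(string, str):
        # Try strict UTF-8, then Windows-1252, then UTF-8 dropping invalid bytes.
        for encoding in (('utf-8',), ('windows-1252',), ('utf-8', 'ignore')):
            try:
                return string.decode(*encoding)
            except UnicodeDecodeError:
                pass
        return string # Don't know how to handle it...
    return unicode(string) # Already Unicode, or not a string at all.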
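A rough Python 3 counterpart of this function might look like the sketch below. The name decode_bytes is hypothetical, and the order of fallbacks simply mirrors the function above.

```python
def decode_bytes(data):
    """Return data as str, trying progressively more lenient decodings."""
    if isinstance(data, bytes):
        for args in (('utf-8',), ('windows-1252',), ('utf-8', 'ignore')):
            try:
                return data.decode(*args)
            except UnicodeDecodeError:
                continue
    return str(data)  # already text, or not a string at all


print(decode_bytes(b'caf\xc3\xa9'))  # valid UTF-8
print(decode_bytes(b'caf\xe9'))      # falls back to Windows-1252
print(decode_bytes(b'a\x81b'))       # invalid in both: the byte is dropped
```

Note that the Windows-1252 fallback accepts almost any byte sequence, so text in another single-byte encoding will decode without an error but with the wrong characters.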
Reading UTF-8 files
When you read a Unicode file as a byte string, special characters (such as diacritics) become garbled or an error is raised. Reading an ASCII file as a Unicode string is harmless. There is no magic function to detect the encoding of a file, so you need to know how it was stored in order to read it correctly.
To read a Unicode file as a Unicode string:
>>> from codecs import open
>>> string = open(path, encoding='utf-8').read()
If the file is in Latin-1 (ASCII + a few special characters), use encoding='latin-1' to read it.
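To illustrate why the encoding has to be known, the following Python 3 sketch writes a UTF-8 file and reads it back twice, once with the right encoding and once with the wrong one. It assumes the built-in open(), which accepts an encoding parameter in Python 3; the temporary file path is only for the demonstration.

```python
import os
import tempfile

text = 'ünîcødé'
path = os.path.join(tempfile.mkdtemp(), 'example.txt')  # throwaway file

with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with open(path, encoding='utf-8') as f:
    right = f.read()
with open(path, encoding='latin-1') as f:
    wrong = f.read()   # no error is raised, but the text is garbled

print(right == text)   # True
print(len(wrong))      # 11: each special character became two characters
```

Latin-1 maps every possible byte to a character, so reading with it never fails; the mistake only shows up as garbled text.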
Writing UTF-8 files
To write a Unicode string as a Unicode file:
>>> from codecs import open
>>> open(path, 'w', encoding='utf-8').write(u'ünîcødé')
That said, some applications such as Mac OS X TextEdit may not recognize the UTF-8 content.
In this case you can encode the string manually and include a byte order mark (BOM) at the start of the file:
>>> from codecs import open, BOM_UTF8
>>> s = u'ünîcødé'
>>> s = s.encode('utf-8')
>>> open(path, 'w').write(BOM_UTF8 + s)
Remember to strip the byte order mark when you open the file:
>>> from codecs import open, BOM_UTF8
>>> s = open(path).read()
>>> if s.startswith(BOM_UTF8): # lstrip() would also eat other leading \xef, \xbb, \xbf bytes.
>>>     s = s[len(BOM_UTF8):]
>>> s = s.decode('utf-8')
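As an alternative to handling the mark by hand, Python ships a 'utf-8-sig' codec that writes the mark when encoding and skips it when decoding. A minimal Python 3 sketch, with an illustrative temporary path:

```python
import codecs
import os
import tempfile

text = 'ünîcødé'
path = os.path.join(tempfile.mkdtemp(), 'bom.txt')  # throwaway file

# 'utf-8-sig' prepends the byte order mark on write...
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write(text)

with open(path, 'rb') as f:
    raw = f.read()
print(raw.startswith(codecs.BOM_UTF8))   # True

# ...and removes it again on read, if present.
with open(path, encoding='utf-8-sig') as f:
    clean = f.read()
print(clean == text)                     # True

# Plain 'utf-8' keeps the mark as a leading U+FEFF character.
with open(path, encoding='utf-8') as f:
    with_mark = f.read()
print(with_mark == '\ufeff' + text)      # True
```

Reading with 'utf-8-sig' also works for files that have no mark, so it is a reasonable default when the origin of a UTF-8 file is unknown.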
Exporting XML
An XML file is stored as bytes. So to store Unicode in XML you need to use encode() together with a matching encoding declaration in the XML header. Furthermore, markup characters such as &, < and > must be represented as entities (the XML would of course break otherwise):
def encode_xml(string, encoding='utf-8'):
    string = string.encode(encoding)
    string = string.replace( '&', '&amp;') # Must come first.
    string = string.replace( '<', '&lt;')
    string = string.replace( '>', '&gt;')
    string = string.replace('\"', '&quot;')
    return string
>>> persons = [u'Max Planck', u'Erwin Schrödinger']
>>>
>>> xml = ['<?xml version="1.0" encoding="UTF-8"?>']
>>> xml.append('<persons>')
>>> for s in persons:
>>>     xml.append('\t<person>%s</person>' % encode_xml(s))
>>> xml.append('</persons>')
>>> xml = '\n'.join(xml)
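The standard library already offers an escaping helper, xml.sax.saxutils.escape(), which could stand in for the hand-written replacements. The Python 3 sketch below builds the same document with it; the third name is made-up sample data added only to show the escaping.

```python
from xml.sax.saxutils import escape

# The third name is made-up sample data that contains markup characters.
persons = ['Max Planck', 'Erwin Schrödinger', 'Tom & Jerry <tj>']

lines = ['<?xml version="1.0" encoding="UTF-8"?>', '<persons>']
for name in persons:
    # escape() replaces &, < and > with &amp;, &lt; and &gt;.
    lines.append('\t<person>%s</person>' % escape(name))
lines.append('</persons>')

xml_text = '\n'.join(lines)
xml_bytes = xml_text.encode('utf-8')  # must match the declared encoding
print(xml_text)
```

By default escape() leaves double quotes alone, which is fine for element content; for attribute values, extra entities can be passed as a second argument, e.g. escape(name, {'"': '&quot;'}).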
Writing the XML file
We can simply write the encoded byte string to a file as-is:
>>> open('test.xml', 'w').write(xml)
Reading the XML file
We can read it as a byte string; the minidom parser will decode it using the encoding declared in the XML header:
>>> from xml.dom import minidom
>>> xml = open("test.xml").read()
>>> xml = minidom.parseString(xml)
>>>
>>> n = xml.childNodes[0] # <persons>...</persons>
>>> n = n.getElementsByTagName('person')[1] # <person>...</person>
>>> v = n.childNodes[0].nodeValue # Erwin Schrödinger
>>> print v
>>> print v.__class__
Erwin Schrödinger
<type 'unicode'>
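In Python 3 the same idea applies: hand the parser the raw bytes and let it apply the declared encoding. A minimal sketch, with the document inlined as bytes rather than read from disk so that it runs on its own:

```python
from xml.dom import minidom

# Bytes as they would come from open('test.xml', 'rb').read().
data = ('<?xml version="1.0" encoding="UTF-8"?>\n'
        '<persons>\n'
        '\t<person>Max Planck</person>\n'
        '\t<person>Erwin Schrödinger</person>\n'
        '</persons>').encode('utf-8')

doc = minidom.parseString(data)
names = [node.firstChild.nodeValue
         for node in doc.getElementsByTagName('person')]
print(names)
```

The node values come back as str, the Python 3 Unicode type, so no manual decode() step is needed after parsing.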
http://www.evanjones.ca/python-utf8.html
http://docs.python.org/howto/unicode.html
