In Python 3, all strings are Unicode. In Python 2.x, you can use Unicode in comments and strings inside a .py script if you make Python aware of it. The first line of the script file should be a special comment:
# -*- coding: utf-8 -*-
In Python 2.x, strings are byte strings. A byte string stores each character as 1 byte, which allows for 256 possible values, of which ASCII defines only the first 128. Unfortunately, that leaves no room for most special characters (e.g., diacritics, Chinese, Hebrew). You can instead create a Unicode string with the u- prefix:
>>> string1 = 'unicode'
>>> string2 = u'ünîcødé'
>>> print string1.__class__
<type 'str'>
>>> print string2.__class__
<type 'unicode'>
Unicode strings can be encoded: converted to a byte string in which each special character is represented by a code of one or more bytes. For example, in UTF-8: é → \xc3\xa9. Such a byte string can be decoded into Unicode later on.
>>> string = string.encode('utf-8') # Unicode => byte string
>>> string = string.decode('utf-8') # byte string => Unicode
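For comparison, here is a minimal sketch of the same round trip written for Python 3, where the two types are called str (Unicode text) and bytes (encoded data). The variable names are made up for illustration.

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
text = 'é'
data = text.encode('utf-8')      # Unicode => bytes

print(type(text), len(text))     # one character
print(type(data), len(data))     # two bytes: b'\xc3\xa9'

restored = data.decode('utf-8')  # bytes => Unicode
print(restored == text)
```

The length differs because a single character outside ASCII takes more than one byte in UTF-8.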
Brute force encoding
If the conversion is not possible, a UnicodeEncodeError or UnicodeDecodeError will be raised. To brute-force encode() or decode() so that no error is raised, an additional 'ignore' parameter can be given, which silently drops whatever cannot be converted. Sometimes, not crashing is more important than special characters (e.g., in a web crawler).
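The effect of the extra parameter can be sketched as follows. The snippet is written for Python 3 (an assumption; in Python 2 the same calls apply to unicode objects). It also shows 'replace', a standard error handler not mentioned above, which substitutes a placeholder instead of dropping the character.

```python
text = 'ünîcødé'

# Strict encoding to ASCII fails on the first special character.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# 'ignore' silently drops what cannot be encoded.
ignored = text.encode('ascii', 'ignore')
# 'replace' substitutes a question mark instead.
replaced = text.encode('ascii', 'replace')

print(failed)    # True
print(ignored)   # b'ncd'
print(replaced)  # b'?n?c?d?'
```

With 'ignore' the loss is invisible in the output, so 'replace' may be preferable when you want to see afterwards where data went missing.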
The following function uses a brute-force approach to convert a string to Unicode:
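def decode_utf8(string):
    if isinstance(string, str):
        # Try the strict encodings first, discard invalid bytes as a last resort.
        for encoding in (('utf-8',), ('windows-1252',), ('utf-8', 'ignore')):
            try:
                return string.decode(*encoding)
            except UnicodeDecodeError:
                pass
        return string # Don't know how to handle it...
    return unicode(string) # Not a byte string (e.g., a number or a Unicode string).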
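The same fallback chain can be sketched for Python 3, where the input to decode is a bytes object. The name decode_bytes is hypothetical and the sample inputs are made up for illustration.

```python
def decode_bytes(data):
    """Return data as text, trying the strict encodings first."""
    if isinstance(data, bytes):
        for args in (('utf-8',), ('windows-1252',), ('utf-8', 'ignore')):
            try:
                return data.decode(*args)
            except UnicodeDecodeError:
                pass
    # Not bytes (e.g., a number or text that is already Unicode).
    return str(data)

utf8_bytes = 'é'.encode('utf-8')            # b'\xc3\xa9'
cp1252_bytes = 'é'.encode('windows-1252')   # b'\xe9', not valid UTF-8
broken_bytes = b'abc\x81'                   # valid in neither encoding

print(decode_bytes(utf8_bytes) == 'é')
print(decode_bytes(cp1252_bytes) == 'é')
print(decode_bytes(broken_bytes))           # abc
```

The order of the attempts matters: windows-1252 accepts nearly every byte sequence, so trying it before UTF-8 would turn valid UTF-8 input into garbled text without raising an error.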
Reading UTF-8 files
When you read a Unicode file as a byte string, special characters (such as diacritics) become garbled or an error is raised. Reading an ASCII file as a Unicode string is harmless. There is no magic function to detect the encoding of a file, so you need to know how it was stored in order to read it correctly.
To read a Unicode file as a Unicode string:
>>> from codecs import open
>>> string = open(path, encoding='utf-8').read()
If the file is in Latin-1 (ASCII + a few special characters), use encoding='latin-1' to read it.
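To see why the stored encoding matters, the following sketch writes a small Latin-1 file to a temporary location and reads it back twice. It is written for Python 3 and uses io.open(), which takes the same encoding parameter. The file and its content are made up for illustration.

```python
import io
import os
import tempfile

text = 'Schrödinger'

# Create a scratch file stored as Latin-1 (hypothetical test data).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='latin-1') as f:
    f.write(text)

# Reading with the encoding it was stored in gives back the text.
with io.open(path, encoding='latin-1') as f:
    as_latin1 = f.read()

# Reading with the wrong encoding raises an error for this content.
try:
    with io.open(path, encoding='utf-8') as f:
        f.read()
    wrong_ok = True
except UnicodeDecodeError:
    wrong_ok = False

os.remove(path)
print(as_latin1 == text, wrong_ok)   # True False
```

The opposite mistake is harder to spot: reading a UTF-8 file as Latin-1 raises no error, since every byte is valid Latin-1, and quietly produces garbled characters.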
Writing UTF-8 files
To write a Unicode string as a Unicode file:
>>> from codecs import open
>>> open(path, 'w', encoding='utf-8').write(u'ünîcødé')
That said, some applications such as Mac OS X TextEdit may not recognize the UTF-8 content.
In this case you can encode the string manually and include a byte order marker at the start of the file:
>>> from codecs import open, BOM_UTF8
>>> s = u'ünîcødé'
>>> s = s.encode('utf-8')
>>> open(path, 'w').write(BOM_UTF8 + s)
Remember to strip the byte order marker when you open the file:
>>> from codecs import open, BOM_UTF8
>>> s = open(path).read()
>>> if s.startswith(BOM_UTF8): # Remove the marker only if it is present.
...     s = s[len(BOM_UTF8):]
...
>>> s = s.decode('utf-8')
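As an alternative to handling the marker by hand, Python ships a 'utf-8-sig' codec that writes the byte order marker when encoding and skips it when decoding. A minimal sketch for Python 3, using a temporary file made up for illustration:

```python
import io
import os
import tempfile
from codecs import BOM_UTF8

text = 'ünîcødé'
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

# 'utf-8-sig' writes the byte order marker before the encoded text.
with io.open(path, 'w', encoding='utf-8-sig') as f:
    f.write(text)

# The raw bytes on disk start with the marker.
with io.open(path, 'rb') as f:
    raw = f.read()

# Plain 'utf-8' keeps the marker as an invisible first character.
with io.open(path, encoding='utf-8') as f:
    plain = f.read()

# 'utf-8-sig' skips it again when reading.
with io.open(path, encoding='utf-8-sig') as f:
    restored = f.read()

os.remove(path)
print(raw.startswith(BOM_UTF8))      # True
print(plain.startswith('\ufeff'))    # True
print(restored == text)              # True
```

Reading with 'utf-8-sig' also works for files that have no marker, so it is a reasonable default when you do not know whether one is present.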
An XML file is stored as bytes, and a parser relies on the XML header to know how those bytes were encoded. So to store Unicode in XML you need to use encode() together with the right XML header. Furthermore, markup characters like < and > must be represented as entities (the XML would of course break otherwise):
def encode_xml(string, encoding='utf-8'):
    string = string.encode(encoding)
    string = string.replace('&', '&amp;') # Must come first.
    string = string.replace('<', '&lt;')
    string = string.replace('>', '&gt;')
    string = string.replace('"', '&quot;')
    return string
>>> persons = [u'Max Planck', u'Erwin Schrödinger']
>>>
>>> xml = ['<?xml version="1.0" encoding="UTF-8"?>']
>>> xml.append('<persons>')
>>> for s in persons:
...     xml.append('\t<person>%s</person>' % encode_xml(s))
...
>>> xml.append('</persons>')
>>> xml = '\n'.join(xml)
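The standard library offers xml.sax.saxutils.escape() for the entity replacement, which handles &, < and > and accepts a dictionary of extra replacements. The sketch below is written for Python 3. The function name encode_xml_text and the sample string are made up for illustration.

```python
from xml.sax.saxutils import escape

def encode_xml_text(string, encoding='utf-8'):
    # Escape the markup characters first, then encode the result to bytes.
    escaped = escape(string, {'"': '&quot;'})
    return escaped.encode(encoding)

sample = encode_xml_text('AT&T <"quoted">')
print(sample)   # b'AT&amp;T &lt;&quot;quoted&quot;&gt;'
```

escape() replaces the ampersand before anything else, so the entities it produces are not escaped a second time.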
Writing the XML file
We can simply write the encoded byte string to a file:
>>> open('test.xml', 'w').write(xml)
Reading the XML file
We can read it as a byte string; the minidom parser will decode it correctly:
>>> from xml.dom import minidom
>>> xml = open("test.xml").read()
>>> xml = minidom.parseString(xml)
>>>
>>> n = xml.childNodes[0]                    # <persons>...</persons>
>>> n = n.getElementsByTagName('person')[1]  # <person>...</person>
>>> v = n.childNodes[0].nodeValue            # Erwin Schrödinger
>>> print v
Erwin Schrödinger
>>> print v.__class__
<type 'unicode'>
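The whole round trip can be checked in memory, without a file. The following is a minimal sketch for Python 3, where the encoded document is a bytes object. It skips entity escaping because the sample names contain no markup characters.

```python
from xml.dom import minidom

persons = ['Max Planck', 'Erwin Schrödinger']

# Build the document as text, then encode it as it would be stored on disk.
xml = ['<?xml version="1.0" encoding="UTF-8"?>', '<persons>']
for s in persons:
    xml.append('\t<person>%s</person>' % s)
xml.append('</persons>')
data = '\n'.join(xml).encode('utf-8')

# The parser reads the header and decodes the bytes back to Unicode.
doc = minidom.parseString(data)
names = [n.firstChild.nodeValue for n in doc.getElementsByTagName('person')]
print(names == persons)   # True
```

The header and the encoding passed to encode() have to agree. If they do not, the parser may raise an error or return garbled names.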