Unicode character encodings

Series: Files
Trey Hunner
4 min. read · Python 3.8–3.12

All text that comes from outside of your Python process starts as binary data.

All input starts as raw bytes

When you open a file in Python, the default mode is r or rt, for read text mode:

>>> with open("my_file.txt") as f:
...     contents = f.read()
...
>>> f.mode
'r'

This means that when we read our file, we get back strings that represent text:

>>> contents
'This is a file ✨\n'

But that's not what Python actually reads from disk.

If we open the file in rb (read binary) mode and read from it, we'll see what Python sees, which is bytes:

>>> with open("my_file.txt", mode="rb") as f:
...     contents = f.read()
...
>>> contents
b'This is a file \xe2\x9c\xa8\n'
>>> type(contents)
<class 'bytes'>

Bytes are what Python decodes to make strings.
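
To make that connection concrete, here is a small sketch that reads the same file in both modes and checks that decoding the raw bytes by hand gives the same string that text mode returns. The temporary directory and file name are made up for this illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical file in a temporary directory, just for this demonstration.
with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "my_file.txt"
    path.write_bytes(b"This is a file \xe2\x9c\xa8\n")

    with open(path, mode="rb") as f:
        raw = f.read()  # bytes, exactly as stored on disk

    with open(path, mode="rt", encoding="utf-8") as f:
        text = f.read()  # str, decoded from those same bytes

# Decoding the bytes ourselves matches what text mode gave us.
print(raw.decode("utf-8") == text)  # True
```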

Encoding strings into bytes

If you have a string in Python and you'd like to convert it into bytes, you can call its encode method:

>>> text = "Hello there! \u2728"
>>> text.encode()
b'Hello there! \xe2\x9c\xa8'

The encode method uses the character encoding utf-8 by default:

>>> text.encode("utf-8")
b'Hello there! \xe2\x9c\xa8'

But you can specify a different character encoding if you'd like:

>>> text.encode("utf-16-le")
b"H\x00e\x00l\x00l\x00o\x00 \x00t\x00h\x00e\x00r\x00e\x00!\x00 \x00('"
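
The same string can take up a different number of bytes depending on the character encoding. As a quick illustration (the choice of utf-32-le as a third encoding is mine, not something used elsewhere in this article), this loop encodes our string three ways and prints the length of each result:

```python
text = "Hello there! \u2728"

# Encode the same 14-character string with three different encodings.
for encoding in ["utf-8", "utf-16-le", "utf-32-le"]:
    encoded = text.encode(encoding)
    print(encoding, len(encoded))

# utf-8 16
# utf-16-le 28
# utf-32-le 56
```

In utf-8 each ASCII character takes one byte and the sparkles character takes three, while utf-16-le uses two bytes for every character in this string and utf-32-le uses four.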

Decoding bytes into strings

If you have a bytes object and you'd like to convert it into a string, you need to decode it by calling its decode method:

>>> data = b"Hello there! \xe2\x9c\xa8"
>>> data.decode()
'Hello there! ✨'

Like the string encode method, the bytes decode method uses the character encoding utf-8 by default:

>>> data.decode("utf-8")
'Hello there! ✨'

But if you have bytes that represent data in a different character encoding, you'll need to specify that character encoding instead:

>>> data = b"H\x00e\x00l\x00l\x00o\x00 \x00t\x00h\x00e\x00r\x00e\x00!\x00 \x00('"
>>> data.decode("utf-16le")
'Hello there! ✨'
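
Here is a short sketch of what happens if we forget to specify the encoding for those utf-16-le bytes. In this particular case every byte happens to be valid utf-8 on its own, so Python doesn't raise an error; it just hands back the wrong text.

```python
original = "Hello there! \u2728"
data = original.encode("utf-16-le")

# Decoding with the encoding that produced the bytes round-trips cleanly.
print(data.decode("utf-16-le") == original)  # True

# Decoding the same bytes as utf-8 succeeds, but the result is different
# text: each byte became its own character, including the NUL bytes.
wrong = data.decode("utf-8")
print(wrong == original)  # False
print(len(original), len(wrong))  # 14 28
```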

Specifying a character encoding when opening files

When you open a file in Python, whether for writing or for reading, it's considered a best practice to specify the character encoding that you're working with:

>>> with open("message.txt", mode="wt", encoding="utf-8") as f:
...     f.write("In Jan 2020 I said \u201cI'm glad I upgraded to Python 3\u201d.")
...
53
>>> with open("message.txt", mode="rt", encoding="utf-8") as f:
...     contents = f.read()
...
>>> contents
"In Jan 2020 I said “I'm glad I upgraded to Python 3”."

This is because on different operating systems, Python will use a different character encoding by default when it's working with text files.

On my machine, the default character encoding is utf-8. But on Windows, the default character encoding is usually cp1252.
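
If you're curious what the default is on your own machine, one way to check (a minimal sketch using the standard library's locale and sys modules) is:

```python
import locale
import sys

# The encoding open() falls back to when no encoding argument is given.
# Outside of UTF-8 Mode this comes from the operating system's locale.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Whether Python's UTF-8 Mode is switched on (1) or off (0).
print(sys.flags.utf8_mode)
```

The output depends on your operating system and settings, which is exactly why relying on the default is risky.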

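Note: In Python 3.8 through 3.12, the built-in open function does not use utf-8 by default on every platform. It uses the locale's preferred encoding unless Python's UTF-8 Mode is enabled (with the -X utf8 command-line option or the PYTHONUTF8=1 environment variable, available since Python 3.7). Encodings can also be a problem for any file that wasn't generated by Python.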

Be careful with your character encodings

So if we read this UTF-8 file on a Windows machine without specifying an encoding, we would get a UnicodeDecodeError:

>>> with open("message.txt", mode="rt") as f:
...     contents = f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.10/encodings/cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 55: character maps to <undefined>
>>>

The traceback for this UnicodeDecodeError is trying to tell us that there's a mismatch between the character encoding of the bytes that we're reading and the character encoding that Python is trying to use to read them.

But you can't rely on UnicodeDecodeErrors always being raised when there's a character encoding mismatch. Sometimes two different encodings may use the same bytes to represent different text.
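
Whether a mismatch raises an error depends on which bytes are involved. As a rough illustration, this sketch tries every possible single-byte value with Python's cp1252 codec and lists the ones that can't be decoded:

```python
# Try decoding each possible byte value with cp1252 and collect the failures.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(hex(value))

print(undefined)  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

Only data that happens to contain one of these five bytes (like the 0x9d from the closing curly quote in the traceback above) will cause cp1252 to raise a UnicodeDecodeError. Any other bytes will be decoded without complaint, whether or not the result is the text that was originally written.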

Here we've saved a file using the UTF-8 character encoding:

>>> text = "Yay unicode! \N{SPARKLES}"
>>> print(text)
Yay unicode! ✨
>>> with open("sparkles.txt", mode="wt", encoding="utf-8") as f:
...     f.write(text)
...
14

If we read this file using the cp1252 character encoding, we'll see different text than what we started with:

>>> with open("sparkles.txt", encoding="cp1252") as f:
...     contents = f.read()
...
>>> contents
'Yay unicode! âœ¨'
>>>

We used cp1252 to decode bytes that were encoded using utf-8 and ended up with mojibake.
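
We can reproduce the same mismatch without any files by encoding and decoding in memory. This sketch also shows that, in this particular case, reversing the two steps restores the original text. That only works when no bytes were lost or replaced along the way, so treat it as a demonstration rather than a general repair technique.

```python
text = "Yay unicode! \N{SPARKLES}"

# Encode with utf-8, then decode with the wrong encoding (cp1252).
garbled = text.encode("utf-8").decode("cp1252")
print(ascii(garbled))  # 'Yay unicode! \xe2\u0153\xa8'

# Undo the mistake: re-encode with cp1252, then decode as utf-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == text)  # True
```

The three utf-8 bytes for the sparkles character (e2, 9c, and a8) each map to a separate cp1252 character, which is why one character turned into three.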

This is actually a really common problem between utf-8 (the default encoding on Linux and macOS) and cp1252 (the usual default encoding on Windows) in particular, because these two character encodings represent ASCII characters with the same bytes but disagree on nearly everything else.

Summary

When you read a file, Python will read bytes from disk and then decode those bytes to make them into strings.

When you write to a file, Python will take your strings and encode those strings into bytes to write them to disk.

It's considered a best practice to specify the character encoding that you're working with whenever you're reading or writing text from outside of your Python process, especially if you're working with non-ASCII text.
