Reading binary files in Python

Series: Files

Trey Hunner

4 min. read • Python 3.8–3.12 • May 16, 2022

What if you wanted to read from a file that isn't a text file?

How to read a binary file in Python

If we try to read a zip file using the built-in open function in Python using the default read mode, we'll get an error:

>>> with open("exercises.zip") as zip_file:
...     contents = zip_file.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 11: invalid start byte

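We get an error because zip files aren't text files; they're binary files. In the default text mode, Python tries to decode the file's bytes as text, and these bytes aren't valid UTF-8.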

To read from a binary file, we need to open it with the mode rb instead of the default mode of rt:

>>> with open("exercises.zip", mode="rb") as zip_file:
...     contents = zip_file.read()
...

When you read from a binary file, you won't get back strings. You'll get back a bytes object, also known as a byte string:

>>> with open("exercises.zip", mode="rb") as zip_file:
...     contents = zip_file.read()
...
>>> type(contents)
<class 'bytes'>
>>> contents[:20]
b'PK\x03\x04\n\x00\x00\x00\x00\x00Y\x8e\x84T\x00\x00\x00\x00\x00\x00'

Byte strings don't have characters in them: they have bytes in them.
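To make that difference concrete, here's a small sketch using a made-up four-byte value (the same first four bytes shown above) instead of a real file. Indexing a bytes object gives back an integer, and we only get text once we decode the bytes ourselves:

```python
# Illustrative value only: the first four bytes from the output above.
data = b"PK\x03\x04"

first = data[0]       # indexing a bytes object gives an integer (0-255)
prefix = data[:2]     # slicing gives another bytes object
as_list = list(data)  # every byte as an integer

# Bytes only become text once we decode them with a specific encoding.
text = prefix.decode("ascii")

print(first)    # 80
print(prefix)   # b'PK'
print(as_list)  # [80, 75, 3, 4]
print(text)     # PK
```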

The bytes in a file won't help us very much unless we understand what they mean.

Use a library to read your binary file

You probably won't read a binary file yourself very often.

When working with binary files you'll typically use a library (either a built-in Python library or a third-party library) that knows how to process the specific type of file you're working with. That library will do the work of decoding the bytes from your file into something that's easier to work with.

For example, the ZipFile class in Python's zipfile module can help us read data that's within a zip file:

>>> from zipfile import ZipFile
>>>
>>> with ZipFile("exercises.zip") as zip_file:
...     test_file = zip_file.read("exercises/test.py").decode("utf-8")
...
>>> test_file[:30]
'#!/usr/bin/env python3\nfrom __'

It's best to avoid implementing your own byte-checking or byte manipulation logic if someone has already done that work for you.
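Here's a self-contained sketch of the same idea that doesn't need an exercises.zip file on disk. It builds a tiny archive in memory (the member name and contents are made up for illustration) and then lets ZipFile do all of the byte-level work:

```python
import io
from zipfile import ZipFile

# Build a tiny archive in memory so this example is self-contained.
# The member name and its contents are invented for illustration.
buffer = io.BytesIO()
with ZipFile(buffer, mode="w") as zip_file:
    zip_file.writestr("exercises/test.py", "#!/usr/bin/env python3\n")

# Read it back: ZipFile parses the zip format for us.
buffer.seek(0)
with ZipFile(buffer) as zip_file:
    names = zip_file.namelist()    # member names, as strings
    raw = zip_file.read(names[0])  # member contents, as bytes
    text = raw.decode("utf-8")     # decoding to a string is up to us

print(names)
print(type(raw).__name__)
print(text)
```

Note that even with a library, the contents of each member still come back as bytes: the library understands the zip format, but it can't know what's inside each file.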

Working at byte level in Python

Sometimes you'll work with a library or an API that requires you to work directly at the byte level. In that case, you'll want to have at least a little bit of familiarity with binary files and byte strings.

For example, let's say we'd like to calculate the sha256 checksum of a given file.

Here we have a function called get_sha256_hash that does that:

import hashlib


def get_sha256_hash(filename):
    with open(filename, mode="rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

This function reads all of the binary data within this file. We're reading bytes because Python's hashlib module requires us to work with bytes. The hashlib module works at a low level: it works with bytes instead of with strings.

So we're passing in all the bytes in our file to get a hash object and then calling the hexdigest method on that hash object to get a string of hexadecimal characters that represent the SHA-256 checksum of this file:

>>> get_sha256_hash("exercises.zip")
'9e98242a21760945ec815668fc79d8621fa15dd23659ea29be2c5949153fe96d'
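As a quick illustration of why we opened the file in binary mode, here's a sketch showing what hashlib does with a string versus bytes. The message value is just an example:

```python
import hashlib

message = "hello"  # an example string, not file contents

# Passing a str to hashlib raises a TypeError.
try:
    hashlib.sha256(message)
except TypeError as error:
    str_error = error
else:
    str_error = None

# Encoding the string to bytes first works fine.
digest = hashlib.sha256(message.encode("utf-8")).hexdigest()

print(type(str_error).__name__)  # TypeError
print(len(digest))               # 64
```

A file opened with mode="rb" already gives us bytes, so no encoding step is needed in get_sha256_hash.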

This function works well, but reading very big files with this function might be a problem.

Reading binary files in chunks

Our get_sha256_hash function reads the whole file into memory all at once. With a really big file that might take up a lot of memory.

With a text file, the usual way to solve this problem would be to read the file line-by-line. But binary files don't necessarily have lines! Instead, we could try to read chunk by chunk.

First we'll read an eight kilobyte chunk from our file:

import hashlib


def get_sha256_hash(filename, buffer_size=2**10*8):
    file_hash = hashlib.sha256()
    with open(filename, mode="rb") as f:
        chunk = f.read(buffer_size)

We make a new hash object first and then read one eight kilobyte chunk (by passing the number of bytes to our file object's read method).

Now we need the rest of our file's chunks. So we'll loop:

import hashlib


def get_sha256_hash(filename, buffer_size=2**10*8):
    file_hash = hashlib.sha256()
    with open(filename, mode="rb") as f:
        chunk = f.read(buffer_size)
        while chunk:
            file_hash.update(chunk)
            chunk = f.read(buffer_size)
    return file_hash.hexdigest()

We're repeatedly reading a chunk, updating our hash object, and then reading another chunk.

As long as we're not at the end of our file, we'll get back a truthy chunk when we read.

But when we read at the very end of our file we'll get back an empty byte string. Empty byte strings (like empty strings) are falsey, so at the end of our file we'll break out of our loop. Then we'll return the hexdigest just like we did before.
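We can watch that end-of-file behavior with a small sketch. This uses io.BytesIO as an in-memory stand-in for a binary file, with 20 bytes of made-up data and a tiny chunk size so the output is easy to follow:

```python
import io

# An in-memory stand-in for a binary file: 20 bytes of made-up data.
f = io.BytesIO(b"x" * 20)

sizes = []
chunk = f.read(8)
while chunk:
    sizes.append(len(chunk))  # record how many bytes each read returned
    chunk = f.read(8)

print(sizes)        # [8, 8, 4]
print(chunk)        # b''
print(bool(chunk))  # False
```

The last non-empty chunk can be shorter than the requested size, and every read after that returns an empty byte string.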

This modified get_sha256_hash function works just like before:

>>> get_sha256_hash("exercises.zip")
'9e98242a21760945ec815668fc79d8621fa15dd23659ea29be2c5949153fe96d'

But instead of reading our entire file into memory, we're now reading our file chunk-by-chunk.
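Both versions give the same checksum because calling update on a hash object several times is equivalent to hashing all of those bytes in one go. Here's a minimal sketch of that, using made-up in-memory data rather than a real file:

```python
import hashlib

data = b"some example bytes " * 1000  # made-up data for illustration
chunk_size = 2**10 * 8

# Hash everything at once.
whole = hashlib.sha256(data).hexdigest()

# Hash the same bytes one slice at a time.
piecewise = hashlib.sha256()
for start in range(0, len(data), chunk_size):
    piecewise.update(data[start:start + chunk_size])

print(whole == piecewise.hexdigest())  # True
```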

Aside: using an assignment expression

It's common to see an assignment expression used (via Python's walrus operator) when reading files chunk-by-chunk:

import hashlib


def get_sha256_hash(filename, buffer_size=2**10*8):
    file_hash = hashlib.sha256()
    with open(filename, mode="rb") as f:
        while chunk := f.read(buffer_size):
            file_hash.update(chunk)
    return file_hash.hexdigest()

Repeatedly reading data within a while loop is a pretty good use case for an assignment expression. It may look a little bit weird, but it does save us a few lines of code.

The walrus operator was added in Python 3.8.
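If you need to support a Python version older than 3.8, one alternative is the two-argument form of the built-in iter function, which keeps calling a function until it returns a given sentinel value. This is a sketch rather than the approach used above, and the function name get_sha256_hash_iter is made up for this example:

```python
import hashlib
from functools import partial


def get_sha256_hash_iter(filename, buffer_size=2**10*8):
    """Hypothetical variant of get_sha256_hash without the walrus operator."""
    file_hash = hashlib.sha256()
    with open(filename, mode="rb") as f:
        # iter(callable, sentinel) calls f.read(buffer_size) repeatedly
        # and stops as soon as a call returns the sentinel b"".
        for chunk in iter(partial(f.read, buffer_size), b""):
            file_hash.update(chunk)
    return file_hash.hexdigest()
```

This should return the same checksum as the versions above for the same file.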

Avoid reading binary files if you can

When you read a binary file in Python, you'll get back bytes.

When you're reading a large binary file, you'll probably want to read it chunk-by-chunk.

But it's best to avoid reading binary files yourself if you can. If there's a library (built-in or third-party) that can help you process your binary file, you should probably use that library to do the byte-based processing for you.