Reading files line-by-line

Transcript

Let's talk about reading files line-by-line in Python.

Looping over file objects to read line-by-line

Here we're calling the read method on a file object (for a file called diary980.md):

>>> filename = "diary980.md"
>>> with open(filename) as diary_file:
...     contents = diary_file.read()
...
>>> contents
'Python Log -- Day 980\n\nToday I learned about metaclasses.\nMetaclasses are a class\'s class.\nMeaning every class is an instance of a metaclass.\nThe default metaclass is "type".\n\nClasses control features (like string representations) of all their instances.\nMetaclasses can control similar features for their classes.\n\nI doubt I\'ll ever need to make a metaclass, at least not for production code.\n'

When you call the read method on a file object, Python will read the entire file into memory all at once. But that could be a bad idea if you're working with a really big file.

There's another common way to process files in Python: you can loop over a file object to read it line-by-line:

>>> filename = "diary980.md"
>>> with open(filename) as diary_file:
...     n = 1
...     for line in diary_file:
...         print(n, line)
...         n += 1
...

Here, we're printing out a number (counting upward) along with each line in our file:

1 Python Log -- Day 980

2

3 Today I learned about metaclasses.

4 Metaclasses are a class's class.

5 Meaning every class is an instance of a metaclass.

6 The default metaclass is "type".

7

8 Classes control features (like string representations) of all their instances.

9 Metaclasses can control similar features for their classes.

10

11 I doubt I'll ever need to make a metaclass, at least not for production code.

Notice that Python isn't just printing out each line: there's also an extra blank line after each line in our file. By default, Python's print function prints a newline character (\n) after everything else that it prints (see the print function's end argument). But each of our lines also ends in a newline character, because newline characters are what separate lines in a file:

>>> line
"I doubt I'll ever need to make a metaclass, at least not for production code.\n"

Getting rid of the newline character when reading line-by-line

So we either need to suppress the newline character that the print function prints out, or we need to remove the newline character from each line in our file as we print it out:

>>> filename = "diary980.md"
>>> with open(filename) as diary_file:
...     n = 1
...     for line in diary_file:
...         print(n, line.rstrip("\n"))
...         n += 1
...
1 Python Log -- Day 980
2
3 Today I learned about metaclasses.
4 Metaclasses are a class's class.
5 Meaning every class is an instance of a metaclass.
6 The default metaclass is "type".
7
8 Classes control features (like string representations) of all their instances.
9 Metaclasses can control similar features for their classes.
10
11 I doubt I'll ever need to make a metaclass, at least not for production code.

We're using the string rstrip method here to "strip" newline characters from the right-hand side (the end) of each of our line strings just before we print each line.
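The other option mentioned above, suppressing the newline that print adds, could look something like this. It's a minimal sketch: print_numbered is a hypothetical helper name, and io.StringIO with made-up contents stands in for an open file so the example runs on its own.

```python
import io

def print_numbered(file_obj, out=None):
    """Print each line with a counter, without adding a second newline."""
    n = 1
    for line in file_obj:
        # end="" tells print not to add its own newline, so the newline
        # already at the end of each line is the only one printed
        print(n, line, end="", file=out)
        n += 1

# io.StringIO stands in for an open text file here (made-up contents)
fake_file = io.StringIO("first line\nsecond line\n")
print_numbered(fake_file)
```

One difference from the rstrip approach: if the last line of a file doesn't end in a newline character, this version won't print one after it either.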

File objects are lazy iterables

File objects in Python are lazy iterables, which means we can treat them pretty much the same way as any other iterable.

So instead of manually counting upward, we could pass our file object to the built-in enumerate function. The enumerate function could then do the counting for us as we loop:

>>> filename = "diary980.md"
>>> with open(filename) as diary_file:
...     for n, line in enumerate(diary_file, start=1):
...         print(n, line.rstrip("\n"))
...

We've removed two lines of code, but we get the same output as before:

1 Python Log -- Day 980
2
3 Today I learned about metaclasses.
4 Metaclasses are a class's class.
5 Meaning every class is an instance of a metaclass.
6 The default metaclass is "type".
7
8 Classes control features (like string representations) of all their instances.
9 Metaclasses can control similar features for their classes.
10
11 I doubt I'll ever need to make a metaclass, at least not for production code.
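To see the "lazy" part a bit more directly, here's a small sketch. It assumes a throwaway file (the example.txt name and its contents are made up for illustration) and uses the built-in next function to pull lines out one at a time:

```python
import os
import tempfile

# Write a small throwaway file so this sketch is self-contained
# (the file name and contents are made up for this example)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, mode="w") as f:
        f.write("one\ntwo\nthree\n")

    with open(path) as f:
        first = next(f)     # asks the file for just its next line
        rest = list(f)      # picks up where next() left off
        leftover = list(f)  # the file has already been consumed

print(repr(first))
print(rest)
print(leftover)
```

The file object hands over one line each time it's asked, and once we've reached the end there's nothing left to loop over. To loop over the same file again, we'd need to reopen it (or seek back to the beginning).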

Summary

Files are lazy iterables, and as we loop over a file object, we'll get lines from that file.

When Python reads a file line-by-line, it doesn't store the whole file in memory all at once. Instead, it keeps just a small buffer of upcoming data from that file in memory, so it's more memory-efficient.

That means looping over files line-by-line is especially important if you're working with really big files.
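As a rough illustration of that, here's a sketch that counts matching lines in a file without ever calling the read method. The count_lines_containing helper and the big_log.txt file are hypothetical names invented for this example; the file is generated on the spot so the code can run as-is:

```python
import os
import tempfile

def count_lines_containing(path, text):
    """Count the lines that contain text, looping one line at a time."""
    count = 0
    with open(path) as file:
        for line in file:  # each line is discarded before the next is read
            if text in line:
                count += 1
    return count

# Generate a made-up log file to search through (10,000 lines,
# with the word "metaclass" on every 100th line)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "big_log.txt")
    with open(path, mode="w") as file:
        for n in range(10_000):
            if n % 100 == 0:
                file.write(f"entry {n}: metaclass\n")
            else:
                file.write(f"entry {n}\n")
    total = count_lines_containing(path, "metaclass")

print(total)
```

Since only the current line (plus Python's internal buffer) is held in memory, the same function should work just as well on a file far larger than this one.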