To ‘b’ or not to ‘b’ – opening files in Python 3

Recently we’ve started moving our code from Python 2 to Python 3. The process has largely been smooth, as most code works very similarly in both versions. However, there are a couple of major changes that stop Python 2 code from working in Python 3. This post is about one specific instance: reading and writing text files.

Admittedly we’re not very proficient in coding and don’t always adopt the right practices. I learnt of this difference in reading and writing files not through research before the migration, but by simply trying to run the old Python 2 code in Python 3.

For reading and writing strings we’d use ‘rb’ and ‘wb’ in Python 2, but doing the same in Python 3 raises a TypeError on write and returns bytes instead of strings on read. Using ‘r’ and ‘w’ instead solves these problems, but I wanted to dig deeper to understand what we had been doing wrong.

The Problem

This:

with open('test.txt', 'w') as f:
    f.write('this is a test string')

works.

This, however:

with open('test.txt', 'wb') as f:
    f.write('this is a test string')

throws this error:

TypeError: a bytes-like object is required, not 'str'

Both, however, work on Python 2.

Likewise, this:

with open('test.txt', 'r') as f:
    text = f.read()
print(text)
print(text == 'this is a test string')

returns:

this is a test string
True

This, however:

with open('test.txt', 'rb') as f:
    text = f.read()
print(text)
print(text == 'this is a test string')

returns:

b'this is a test string'
False

On Python 2, again, both ‘r’ and ‘rb’ return a plain string and the comparison prints True.

From the output you can see that Python 3 treats the two modes differently. When ‘b’ is appended to ‘r’ or ‘w’ the file is opened in binary mode, which expects bytes on write and returns bytes on read; without ‘b’ the file is opened in text mode, which works with strings. Python 2, however, is somehow able to accept a string as input and return a string as output in either mode.
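If binary mode really is needed in Python 3, the string has to be converted to bytes on the way in and back to a string on the way out. Here is a minimal sketch of that round trip. The scratch file path and the choice of UTF-8 are assumptions made for illustration, not something taken from our original code.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Binary mode wants bytes, so encode the str before writing.
with open(path, 'wb') as f:
    f.write('this is a test string'.encode('utf-8'))

# Binary mode hands back bytes.
with open(path, 'rb') as f:
    raw = f.read()

# Decode the bytes to get a str again.
text = raw.decode('utf-8')

print(raw == 'this is a test string')    # False: bytes never compare equal to str
print(raw == b'this is a test string')   # True
print(text == 'this is a test string')   # True
```

For plain text files it is simpler to drop the ‘b’ and let text mode do the encoding and decoding.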

 

What’s Happening?

The official Python 2 documentation on reading and writing files has this to say about whether or not to add a ‘b’:

On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.

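So apparently there was some behind-the-scenes magic going on in Python 2: the ‘b’ only affected the end-of-line handling on Windows, and on Unix it made no difference. For text files, binary mode would therefore be a sort of catch-all that let the same code run on any platform.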
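The end-of-line alteration the documentation mentions can be reproduced on any platform in Python 3 by asking for Windows-style line endings explicitly. This is only a sketch: the scratch file is hypothetical, and newline='\r\n' is used here as a stand-in for what text mode does by default on Windows.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'newlines.txt')

# Text mode with newline='\r\n': every '\n' written becomes '\r\n' on disk.
with open(path, 'w', newline='\r\n') as f:
    f.write('line one\nline two\n')

# Binary mode shows the bytes exactly as stored, with no translation.
with open(path, 'rb') as f:
    raw = f.read()

# Text mode with the default newline=None translates '\r\n' back to '\n'.
with open(path, 'r') as f:
    text = f.read()

print(raw)         # b'line one\r\nline two\r\n'
print(repr(text))  # 'line one\nline two\n'
```

The same translation applied to a JPEG or EXE would change bytes that only happen to look like line endings, which is the corruption the documentation warns about.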

 

The official Python 3 documentation on reading and writing files, though, says this:

Normally, files are opened in text mode, that means, you read and write strings from and to the file, which are encoded in a specific encoding. If encoding is not specified, the default is platform dependent (see open()). 'b' appended to the mode opens the file in binary mode: now the data is read and written in the form of bytes objects. This mode should be used for all files that don’t contain text.

So text mode (without ‘b’) should be used for reading and writing text files.
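Since the documentation says the default encoding is platform dependent, it seems safer to name the encoding when opening a text file. A minimal sketch, assuming UTF-8 and a hypothetical scratch file:

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'greeting.txt')

word = 'caf\u00e9'  # 'café', written with an escape so the example is ASCII-only

# Text mode with an explicit encoding, so the result does not
# depend on the platform default.
with open(path, 'w', encoding='utf-8') as f:
    f.write(word)

with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Binary mode reveals how the text was actually stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(text == word)         # True
print(raw)                  # b'caf\xc3\xa9'
print(len(text), len(raw))  # 4 5
```

The four characters take five bytes on disk, which is a reminder that text and bytes are different things in Python 3 even when the content is "just a string".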

 

Conclusion

I think I understand what is going on in Python 3, and it is intuitive: binary mode reads and writes bytes, text mode reads and writes text. But having learnt that, I’m now a little confused as to what Python 2 was doing to handle it. Either way, lesson learnt: the same code may not work across different Python versions, and more care has to be taken during migration before errors with more serious consequences happen. I may not be right in the above analysis, so feel free to chip in with a comment below! I would love to learn more about it. Cheers!

2 thoughts on “To ‘b’ or not to ‘b’ – opening files in Python 3”

  1. Joshua Proffitt

    Stumbled upon this error while doing some conversion of projects from 2->3 and found this article. Good overview of the problem space!

    Regarding this line from your conclusion: “I’m a little confused as to what Python 2 was doing to handle it.” I think this section of the Python 3.0 “What’s New” document states it pretty plainly.

    “As the str and bytes types cannot be mixed, you must always explicitly convert between them.” Source: https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

    There was an implicit conversion between the two types occurring for you in Python 2 that has since been removed in Python 3. So they just did a bit of backpedaling on implicit conversions, since those can get folks into trouble. 🙂

    Keep up the good work. 🙂
