Python File Loops

January 6, 2015

Recently I was looking to improve the performance of a Python 2.5 script that compares two files line by line. What I found interesting was how much the performance improved with just a few tweaks. To keep things simple, I’ve reduced the problem to reading a file and updating a counter.

Each of these scripts was run against a 75-million-line text file, and the results were averaged over 3 runs. This was a quick and unscientific test, so no guarantees of accuracy.
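For reference, averaging over a few runs can be done with a tiny harness like the one below. This is only a sketch of the idea, not the exact timing setup used here: `average_seconds` is a made-up helper name, and `func` stands for any function that takes a file path and reads the file.

```python
import time


def average_seconds(func, path, runs=3):
    """Call func(path) `runs` times and return the mean wall-clock seconds.

    Hypothetical helper for illustration; func and path are placeholders.
    """
    total = 0.0
    for _ in range(runs):
        start = time.time()
        func(path)
        total += time.time() - start
    return total / runs
```

Wall-clock timing like this is sensitive to disk caching and other load on the machine, which is part of why the numbers should be treated as rough.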

Caveat: I’m not an experienced Python or Perl developer. I can’t guarantee any of this code is best practice or pretty :)

##Perl

This is the baseline we’re trying to hit. This also outperformed our original bash implementation.

##readline()

This is the code I was given to start with. It looks like it should perform well, but I have very little Python background.
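A minimal sketch of what a `readline()`-based counter can look like follows. It is a reconstruction of the general pattern rather than the original script; the function name `count_lines_readline` and the `path` argument are assumptions made for the example. It avoids the `with` statement so it stays friendly to older Python versions.

```python
def count_lines_readline(path):
    """Count lines by calling readline() until it returns an empty string."""
    count = 0
    f = open(path)
    try:
        line = f.readline()
        while line:  # readline() returns '' only at end of file
            count += 1
            line = f.readline()
    finally:
        f.close()
    return count
```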

Performance isn’t good enough. Unfortunately, reading line by line from disk isn’t helping. Let’s try to improve that.

##for line in file

Iterating over the file object with a for loop uses an internal read-ahead buffer, so we shouldn’t be so dependent on disk I/O.
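The same counter written as a loop over the file object might look like this. Again, it is a sketch under assumptions: `count_lines_iter` and `path` are names chosen for the example, not taken from the original script.

```python
def count_lines_iter(path):
    """Count lines by iterating over the file object directly."""
    count = 0
    f = open(path)
    try:
        for line in f:  # the file object yields one line per iteration
            count += 1
    finally:
        f.close()
    return count
```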

A 50% performance improvement is good, but not good enough. We’re still a bit off the Perl baseline. Let’s try some other tweaks.

##for line in file + mutable int

Since Python treats all strings and ints as immutable, we’ll throw our counter into a mutable list, which lets us reuse the same memory space when incrementing.
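Here is a sketch of the list-as-counter variant, assuming the hypothetical names `count_lines_list_counter` and `path`. Note that the ints themselves are still immutable: what gets updated in place is the list’s single slot, which is pointed at a new int on each increment.

```python
def count_lines_list_counter(path):
    """Count lines, keeping the running total in a one-element list."""
    counter = [0]  # the list is mutable; the int stored in it is not
    f = open(path)
    try:
        for line in f:
            counter[0] += 1  # reassigns slot 0 of the same list object
    finally:
        f.close()
    return counter[0]
```

Whether this beats a plain integer counter may well depend on the Python version and on whether the loop runs at module level or inside a function, so it is worth timing on your own setup.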

Bingo. Right on par with our Perl baseline. Good enough for me.