Note: Long lines may reward those reading on screens larger than a phone's.
There was an interesting problem posted to reddit:
- A large text file has one word per line
- Words have many duplications on other lines
- Sort the file in order of lines with most duplications first.
The size of the file is large in comparison with the amount of memory available to sort it, so no in-memory sort of the file can be assumed to work.
Many commenters ignored the request to produce a sort of the input file and instead considered generating only the unique words of the input in order of decreasing occurrence. That would not be a sort, as duplicate entries would be removed.
The original poster posted to the r/Python reddit group but said he would prefer a Linux command line solution.
sort, uniq, awk
I remembered that the original sort was written when memory was measured in kilobytes and data larger than memory was much more common. The original sort could quite seamlessly use disk to sort files that did not fit in memory, and GNU sort has that capability too.
I was going to:
- Sort the file
- Use uniq to count the occurrences
- Sort the word-counts numerically in order of the counts
- Expand the ordered words by their counts to create the answer, using awk
Here's the one-liner working on some test input:
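A minimal sketch of that pipeline, assuming the one-word-per-line input is in a file named words.txt (the file names are illustrative, not necessarily the original command):

```sh
# Count duplicate lines, order by count descending, then expand each word
# back out count times to give a sorted version of the whole file.
sort words.txt \
  | uniq -c \
  | sort -k1,1nr \
  | awk '{ for (i = 0; i < $1; i++) print $2 }' \
  > sorted_by_frequency.txt
```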
Test data
What makes this problem interesting is the size of the data: the input text needs to be individual words, one per line, with words occurring at differing frequencies.
I decided to download the text of Huckleberry Finn and keep doubling it up until I had a file larger than my own 8 gigs of memory.
The resultant file may have "junk" lines of just whitespace, numbers, or punctuation but it should still serve its purpose.
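A sketch of building such a file (the Project Gutenberg URL and file names are assumptions, not the exact commands used):

```sh
# Download Huckleberry Finn, split it into one word per line,
# then repeatedly double the file until it is bigger than memory.
wget -O huck.txt https://www.gutenberg.org/files/76/76-0.txt
tr -s '[:space:]' '\n' < huck.txt > words.txt
while [ "$(stat -c %s words.txt)" -lt $((9 * 1024 * 1024 * 1024)) ]; do
    cat words.txt words.txt > doubled.txt && mv doubled.txt words.txt
done
```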
sort, uniq, awk: timings
I split the command line above into two and timed each half:
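The split, in terms of the sketch above, was along these lines (the measured timings themselves aren't shown here):

```sh
# First half: sort the big file and count duplicate lines.
time ( sort words.txt | uniq -c > counts.txt )

# Second half: order by count and expand back to one word per line.
time ( sort -k1,1nr counts.txt \
       | awk '{ for (i = 0; i < $1; i++) print $2 }' > sorted_by_frequency.txt )
```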
That is 28 minutes in total.
Pure Python
When thinking of my purely Python solution, I reread the original post, which emphasised the amount of duplication in the large file. I took a gamble in assuming that I would always have enough memory to hold the frequency count of each individual word.
The Python script streams the file in, counting the frequency of each line, then uses Counter.most_common to order the output.
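A minimal sketch of that script, with assumed file names:

```python
# Stream the file line by line, counting duplicates with a Counter,
# then write each line back out most-frequent first.
from collections import Counter

counts = Counter()
with open("words.txt") as infile:
    for line in infile:          # one word (with its newline) per iteration
        counts[line] += 1

with open("sorted_by_frequency.txt", "w") as outfile:
    for word, n in counts.most_common():
        outfile.write(word * n)  # re-expand the word to n lines
```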
Pure Python: timing
(For the same 9.5G input file)
Python and sqlite3
User bOne123 had a neat solution that used sqlite3 from Python; his code is reproduced here:
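As a rough sketch of that kind of approach (this is not bOne123's actual code, and the file names are assumptions), the words can be inserted into a table and then grouped and ordered by count:

```python
# Insert every word into a sqlite3 table, then group by word,
# order by count descending, and re-expand each word on output.
import sqlite3

con = sqlite3.connect("words.db")
con.execute("CREATE TABLE IF NOT EXISTS words (word TEXT)")

with open("words.txt") as infile:
    con.executemany("INSERT INTO words VALUES (?)",
                    ((line.rstrip("\n"),) for line in infile))
con.commit()

query = """SELECT word, COUNT(*) AS n FROM words
           GROUP BY word ORDER BY n DESC"""
with open("sorted_by_frequency.txt", "w") as outfile:
    for word, n in con.execute(query):
        outfile.write((word + "\n") * n)
```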
Python and sqlite3: timing
Those timings show that it takes over 50 minutes just to insert into the DB.
SQLite3
User mikeblas has another solution, this time using just sqlite3. Unfortunately, I could not get it to work as he did. I tried his code to read my 9.5G file into an sqlite3 DB and had to kill it after it ran for nearly two hours and seemed to be doing very little after growing its DB to 27 gigs.
Timings so far
| id | Algorithm          | Minutes |
|----|--------------------|---------|
| 1  | sort, uniq, awk    | 28      |
| 2  | Pure Python        | 22      |
| 3  | Python and sqlite3 | 60      |
The initial sort in the first algorithm is needed only so that the uniq command can be applied, and it takes a significant amount of time.
I ruminated on this for a couple of weeks, searched, and eventually found GNU datamash.
datamash
I've been using Unix and Linux for several decades, but have learnt to keep an eye out for major new command line utilities, you know, like moving from vi to vim. Well, this was one of those times: I thought that counting word or line frequencies might be useful enough for people like GNU to write some utility, and I found one: datamash!
Some reading and tinkering led me to the following shell one-liner with its timing:
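A sketch of a datamash pipeline of that sort, reusing the assumed words.txt input (not necessarily the exact one-liner, and its measured timing isn't shown here):

```sh
# Group identical lines and count them with datamash, then order by count
# descending and expand each word back out count times with awk.
datamash --sort --group=1 count 1 < words.txt \
  | sort -k2,2nr \
  | awk '{ for (i = 0; i < $2; i++) print $1 }' \
  > sorted_by_frequency.txt
```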
All timings
| id | Algorithm          | Minutes |
|----|--------------------|---------|
| 1  | sort, uniq, awk    | 28      |
| 2  | Pure Python        | 22      |
| 3  | Python and sqlite3 | 60      |
| 4  | datamash           | 19      |
The datamash solution was the fastest, but it is new to me and I don't yet have a true feel for its internals.
I am most familiar with the pure Python and the sort-uniq-awk solutions.
I take away from this exercise the following:
- When reading solutions to problems, actual code trumps the number of replies and upvotes on a mere suggestion of a tool to use (AWS, DBs, ...).
- Try to give a sample of test data and expected output with the question; this helps tie down what is needed in a solution, and in this case would have helped resolve the split between those who gave a sort of the input and those who stopped at just ordering the unique words.
- The Unix command line is still a powerful tool. Those who only learn cloud tools capable of handling terabytes of data, but then use them on much smaller datasets ('cos they'll scale), put a smile on those cloud vendors' faces.
END.