Sunday, July 10, 2022

Raw Python vs Python & SQLite vs GNU Linux command line utilities!

Note: the long lines in this post read best on a screen larger than a phone's.

 

There was an interesting problem posted to reddit:

  1. A large text file has one word per line
  2. Words have many duplications on other lines
  3. Sort the file in order of lines with most duplications first.

The size of the file is large in comparison with the amount of memory available to sort it, so no in-memory sort of the file can be assumed to work.

Many commenters ignored the request to produce a sort of the input file and instead considered generating only the unique words of the input in order of their decreasing occurrence. That is not a sort, as duplicate entries are removed.
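The difference matters. A small sketch in Python (using the sample words that appear later in this post) shows the two outputs side by side:

```python
from collections import Counter

words = ["w1", "w4", "w3", "w1", "w2", "w1", "w3", "w4", "w3", "w2"]
freq = Counter(words)

# What was asked for: every input line kept, ordered by decreasing frequency
full_sort = [w for w, n in freq.most_common() for _ in range(n)]

# What many commenters produced instead: unique words only
unique_only = [w for w, _ in freq.most_common()]

print(len(words), len(full_sort), len(unique_only))  # 10 lines in; 10 vs 4 out
```

The sorted file keeps all ten lines; the unique-words version throws six of them away.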

The original poster posted to the r/Python reddit group but said he would prefer a Linux command line solution.

sort, uniq, awk

 I remembered that the original sort was written when memory was measured in kilobytes and data larger than memory was much more common. The original sort could use disk to sort files that did not fit in memory quite seamlessly, and GNU sort has that capability too.

I was going to:

  1. Sort the file.
  2. Use uniq to count the occurrences of each word.
  3. Sort the word counts numerically, largest count first.
  4. Expand each ordered word by its count, using awk, to create the answer.

Here's the one-liner working on some test input:

$ printf 'w1\nw4\nw3\nw1\nw2\nw1\nw3\nw4\nw3\nw2\n' > /tmp/word.lst
$ cat /tmp/word.lst
w1
w4
w3
w1
w2
w1
w3
w4
w3
w2
$ sort /tmp/word.lst | uniq -c | sort -n -r
      3 w3
      3 w1
      2 w4
      2 w2
$ sort /tmp/word.lst | uniq -c | sort -n -r| awk '{for(i=0;i<$1;i++){print$2}}'
w3
w3
w3
w1
w1
w1
w4
w4
w2
w2
$

Test data

What makes this problem interesting is the size of the data: the input text needs to be individual words, one per line, with words occurring at differing frequencies.

I decided to download the text of Huckleberry Finn and repeatedly double it up until I had a file larger than my 8 gigs of memory.

$ # Download Huck_finn:
$ curl https://gutenberg.org/files/76/76-0.txt -o huck_finn.txt
$ # One word per line using fmt:
$ fmt -w1 huck_finn.txt >huck_word_per_line.txt
$ # Double up to increase the file size:
$ cp huck_word_per_line.txt words.lst
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc  words.lst && du -h words.lst
 253464  228666 1239542 words.lst
1.2M    words.lst
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc  words.lst && du -h words.lst
 506928  457332 2479084 words.lst
2.4M    words.lst

...

$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc  words.lst && du -h words.lst
 2076377088  1873231872 10154328064 words.lst
9.5G    words.lst


The resultant file may have "junk" lines of just whitespace, numbers, or punctuation but it should still serve its purpose.
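The repeated doubling above can be automated with a small loop. A sketch in Python (`double_until` is my own name for it, not a standard tool) with the same effect as repeating the `cat`/`mv` pair until the file is big enough:

```python
import os
import shutil

def double_until(path, target_bytes):
    """Double the file in place until it is at least target_bytes long.

    Equivalent to repeating:
        cat words.lst words.lst >> words2.lst && mv words2.lst words.lst
    """
    while os.path.getsize(path) < target_bytes:
        tmp = path + ".tmp"
        with open(path, "rb") as src, open(tmp, "wb") as dst:
            shutil.copyfileobj(src, dst)  # first copy of the file
            src.seek(0)
            shutil.copyfileobj(src, dst)  # second copy appended
        os.replace(tmp, path)
```

For example, `double_until("words.lst", 9 * 1024**3)` would grow the word list past 9 gigs.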

 

sort, uniq, awk: timings

I split the command line above into two and timed each half:

$ # Find the frequency of each word
$ time sort words.lst |uniq -c > words_uniq_count.txt

real    20m5.247s
user    83m33.507s
sys     1m38.863s
$ # Now to order by most frequent first and expand the counts into that many lines of words:
$ time sort -n -r words_uniq_count.txt | awk '$2~/[a-zA-Z]/{for(i=0;i<$1;i++){print$2}}' > words_ordered.txt

real    7m56.900s
user    3m34.341s
sys     0m33.422s


That is 28 minutes in total.

Pure Python

When thinking of my purely Python solution, I reread the original post, which emphasised the amount of duplication in the large file. I took a gamble in assuming that I would always have enough memory to hold the frequency count of each individual word.

The Python script streams the file in, counting the frequency of each line, then uses Counter.most_common to order the output.

import fileinput
from collections import Counter

word_freq = Counter(line for ln in fileinput.input()
                    if (line := ln.strip()))  # No blank lines
for word, count in word_freq.most_common():
    print((word + '\n') * count, end='')


Pure Python: timing

(For the same 9.5G input file)

real    21m53.181s
user    16m26.880s
sys     0m22.188s


Python and sqlite3

User bOne123 had a neat solution that used sqlite3 from Python; his code is reproduced here:

# -*- coding: utf-8 -*-
"""
@author: bOne123
"""

import sqlite3
import time
con = sqlite3.connect('/tmp/file.db') # or 'file.db' if not enough ram
cur = con.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS data (word text, count int)''')
cur.execute('''CREATE UNIQUE INDEX IF NOT EXISTS dataidx ON data(word)''')
now = time.time()
with open("words.lst", "r") as f:
    i = 0
    words = []
    for line in f:
        words.append((line,))
        i += 1
        if i % 50_000_000 == 0:
            print(time.time() - now, i)
            cur.executemany("INSERT INTO data VALUES (?, 1) ON CONFLICT(word) DO UPDATE SET count=count+1", words)
            words=[]
    if words:
        # Flush the rows left over after the final full batch
        cur.executemany("INSERT INTO data VALUES (?, 1) ON CONFLICT(word) DO UPDATE SET count=count+1", words)
print(time.time() - now, 'index counts')
cur.execute('''CREATE INDEX IF NOT EXISTS countidx ON data(count)''')
with open("words_ordered_by_bOne123.txt", "w", newline='\n') as f:
    for row in cur.execute('SELECT word,count FROM data ORDER BY count DESC'):
        for i in range(row[1]):
            f.write(row[0])
print(time.time() - now, 'done')
con.close()

Python and sqlite3: timing

$ time python sort_words_by_bOne123.py
22.579142570495605 50000000
102.96021580696106 100000000
...
3125.1875145435333 2050000000
3189.4789474010468 index counts
3583.4524500370026 done

real    59m44.886s
user    50m36.011s
sys     2m11.008s

Those timings show over 50 minutes spent just inserting into the DB.
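Bulk-insert time in SQLite can usually be reduced by turning off journalling and synchronous writes, and by wrapping each batch in a single explicit transaction. A sketch of that variation (my own, not bOne123's code; the in-memory DB is just for the demo, use a file path like '/tmp/file.db' for data larger than RAM):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a real file path for big data
cur = con.cursor()
# Trade durability for insert speed - fine for a throwaway database
cur.execute("PRAGMA journal_mode = OFF")
cur.execute("PRAGMA synchronous = OFF")
cur.execute("CREATE TABLE IF NOT EXISTS data (word TEXT PRIMARY KEY, count INT)")

def insert_batch(words):
    # One explicit transaction per batch instead of one per statement
    with con:
        con.executemany(
            "INSERT INTO data VALUES (?, 1) "
            "ON CONFLICT(word) DO UPDATE SET count = count + 1",
            ((w,) for w in words),
        )
```

Whether this closes the gap on the 9.5G file I have not measured; it is a standard pair of SQLite tuning knobs, not a tested rerun of the benchmark.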

SQLite3

User mikeblas had another solution, this time using just sqlite3. Unfortunately I could not get it to work as he did: I tried his code to read my 9.5G file into an sqlite3 DB and had to kill it after it ran for nearly two hours and seemed to be doing very little after growing its DB to 27 gigs.

Timings so far

id  Algorithm           Minutes
 1  sort, uniq, awk          28
 2  Pure Python              22
 3  Python and sqlite3       60

The initial sort in the first algorithm is needed only so that uniq can count adjacent duplicate lines, and it takes a significant amount of time.

I ruminated on this for a couple of weeks, searched, and eventually found GNU datamash.

datamash

I've been using Unix and Linux for several decades, but have learnt to keep an eye out for major new command line utilities - you know, like moving from vi to vim. Well, this was one of those times: I thought that counting word or line frequencies might be useful enough for people like GNU to write a utility, and I found one: datamash!

Some reading and tinkering led me to the following shell one-liner with its timing:

$ # Test on small file
$ datamash --sort --group 1 count 1 < /tmp/word.lst | sort -n -r -k 2 | awk 'NF==2{for(i=0;i<$2;i++){print$1}}'
w3
w3
w3
w1
w1
w1
w4
w4
w2
w2
$ # That worked, so run on 9.5G file
$ time datamash --sort --group 1 count 1 < words.lst | sort -n -r -k 2 | awk 'NF==2{for(i=0;i<$2;i++){print$1}}' > words_ordered_by_datamash.txt

real    19m15.608s
user    39m11.182s
sys     1m39.364s


All timings

id  Algorithm           Minutes
 1  sort, uniq, awk          28
 2  Pure Python              22
 3  Python and sqlite3       60
 4  datamash                 19

The datamash solution was fastest, but it is new to me and I don't yet have a true feel for its internals.
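For what it's worth, my mental model of `datamash --sort --group 1 count 1` (an assumption about its behaviour, not a reading of its source) is: sort the input, then count each run of equal adjacent lines - much like Python's itertools.groupby:

```python
from itertools import groupby

lines = sorted(["w1", "w4", "w3", "w1", "w2", "w1", "w3", "w4", "w3", "w2"])

# One (word, count) pair per run of equal adjacent lines in the sorted input
counts = [(word, sum(1 for _ in grp)) for word, grp in groupby(lines)]
print(counts)  # [('w1', 3), ('w2', 2), ('w3', 3), ('w4', 2)]
```

The later `sort -n -r -k 2` then orders those pairs by count, as in the one-liner above.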

The pure Python and the sort-uniq-awk solutions are the ones I am most familiar with.

I take away from this exercise the following:

  1. When reading solutions to problems, actual code trumps the number of replies and upvotes on a mere suggestion of a tool to use (AWS, DBs, ...).
  2. Try to give a sample of test data and expected output with the question; this helps tie down what is needed in a solution, and in this case would have distinguished those who gave a sort of the input from those who stopped at ordering the unique words.
  3. The Unix command line is still a powerful tool. Those who only learn cloud tools capable of handling terabytes of data but use them on much smaller datasets ('cos they'll scale), put a smile on those cloud vendors' faces.

 END.