Go deh!: Answering a Reddit Question with timings.

Best viewed on larger than a phone screen.

Someone had a problem on Reddit r/python:

Hello guys, I want to find a string in a list and this list has 350K elements all they are strings . I want to find out a good algorithm that can find the string very quick . I know linear search but want to figure out other ways if possible.

The problem had already had over seventy answers when I came across it, but I wanted to check:

Had people teased more information out of the original poster?
Was there a pocket of timed goodness in there?

I read on.

The answers given were polite. Yay, here's to the python community 🍻👌🏾👍🏾!

The answers did give alternatives to linear search, but so many posts had no timings, and the OP was still pretty vague in the data he gave, considering he was wanting "faster".

I was late to this party, but wanted to write some Python, as I can find coding relaxing if I write in the spirit of helping, (and remember I too might learn something). It's not the first time I've tried to help out and decided on three points:

Get more realistic timings, or a timing framework were they can go in and alter things to get timings that more represent their data.
The answers mentioned linear search of the text, sets, and tries. I would skip tries as I have already found that its hard to get over their speed of interpretation limitations.
I'm gonna need several files to do the timings - lets try creating them from one initial python file. (Why not, keeps me interested).

Coding.

I chose to code in the Spyder IDE for this, rather than Vscode. YMMVm but I still find Spyder to support a more "dynamic" development style, great for scripting and I am not writing hundreds of lines of code for this.

I thought of the steps I would need, seemed simple enough so just started coding the first, string_in_long_list_1.py.

Create a text file of 350K sorted "strings"

What is meant by strings? I don't know, don't sweat it - I'll make each line several words with one word having an increasing count appended:

# -*- coding: utf-8 -*-
"""
Created on Sat Sep 24 10:00:51 2022

@author: paddy3118
"""

# %% Create txt file of 350_000 strings

from pathlib import Path

lines = '\n'.join(f"Fee fi word{i:06} fum" for i in range(350_000))

txtfile = Path('string_in_long_list_2.txt')
txtfile.write_text(lines, 'utf8')

Notice the # %% cell comment. It allows me to quickly run and rerun the code in the cell and check the values of variables in Spyder until I get it right. (Vscode will do this too, but I'm more comfortable with Spyder's implementation).

Sets; On your marks...

If the OP was going to test for strings in the text file multiple times for the one text file, which I think I read in the original posters replies to others, then I would like to time set inclusion, but reduce the time to create or read the set.

I decided on creating a module file that assigns the set of lines in the text file to a variable data. Another prog would load this module and check for string inclusion and compile the module so subsequent loadings would be faster.

# %% Create a module of the set of those lines assigned to name 'data'

import pprint

set_of_strings = set(txtfile.read_text('utf8').split('\n'))
moduletxt = f"""\
# -*- coding: utf-8 -*-
"Set of each line from the text file."
data = {pprint.pformat(set_of_strings, indent=2)}
"""

modulefile = Path('string_in_long_list_3.py')
modulefile.write_text(moduletxt, 'utf8')

I ran the cell and checked the generated script until it looked right.

Linear search.

Although the OP said the text file was sorted, others had already noted that trying binary search would be slowed as you would have to read every line first into a list... you might as well do a comparison as you read each line, so I needed to generate a script to do just that:

# %% Create a script to just search for a line containing its arg whilst reading.

scripttxt = """\
# -*- coding: utf-8 -*-
"Search for first line in text file matching argument."

import sys

to_find = sys.argv[1] if len(sys.argv) > 1 else ''

with open('string_in_long_list_2.txt', encoding='utf8') as txtfile:
    for line in txtfile:
        if to_find == line.rstrip('\\n'):
            print("True")
            sys.exit(0)
            break
    else:
        print("False")
        sys.exit(1)
"""

linesearchfile = Path('string_in_long_list_4.py')
linesearchfile.write_text(scripttxt, 'utf8')

When writing this, I first had no tripple quotes around what became scripttxt =, and could rerun the contained code as I debugged, then encapsulate in scripttxt = """\ and add the the last couple of lines to generate the file.

Match against set

This generates the python script that when given a string as its argument, will check against data loaded as a set from its module.

# %% Create a script to look for the arg in the modules data set

scripttxt = """\
# -*- coding: utf-8 -*-
"test if argument is in data loaded from module as a set."

from string_in_long_list_3 import data
import sys

to_find = sys.argv[1] if len(sys.argv) > 1 else ''

if to_find in data:
    print("True")
    sys.exit(0)

print("False")
sys.exit(1)
"""

modulesearchfile = Path('string_in_long_list_5.py')
modulesearchfile.write_text(scripttxt, 'utf8')

The end of the Python script!

Timing

I was always going to time runs of the scripts out in the shell of the OS, for which I, and many others are familiar with the GNU/Linux time command. I would add comments to tell a story and show I had "nothing up my sleeves".

(py310) $
(py310) $ ls
string_in_long_list_1.py
(py310) $
(py310) $ # New dir
(py310) $ mkdir work
(py310) $ cd work
(py310) $ ls -a
.  ..
(py310) $ cp ../string_in_long_list_1.py .
(py310) $ ls -a
.  ..  string_in_long_list_1.py
(py310) $
(py310) $ # Create files
(py310) $ python3 string_in_long_list_1.py
(py310) $
(py310) $ # (How many lines)
(py310) $ wc -l *
      75 string_in_long_list_1.py
  349999 string_in_long_list_2.txt
  350002 string_in_long_list_3.py
      16 string_in_long_list_4.py
      14 string_in_long_list_5.py
  700106 total
(py310) $
(py310) $ # time linear search of text file
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'
True

real    0m0.102s
user    0m0.058s
sys     0m0.000s
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'
True

real    0m0.102s
user    0m0.045s
sys     0m0.015s
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'
True

real    0m0.094s
user    0m0.043s
sys     0m0.009s
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'
True

real    0m0.100s
user    0m0.056s
sys     0m0.000s
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'
True

real    0m0.106s
user    0m0.030s
sys     0m0.030s
(py310) $
(py310) $ ls
string_in_long_list_1.py   string_in_long_list_3.py  string_in_long_list_5.py
string_in_long_list_2.txt  string_in_long_list_4.py
(py310) $
(py310) $ # time creation of module .pyc file and search of data in set
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'
True

real    0m1.578s
user    0m0.869s
sys     0m0.667s
(py310) $
(py310) $ # time search of data in set from compiled module file
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'
True

real    0m0.186s
user    0m0.120s
sys     0m0.047s
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'
True

real    0m0.220s
user    0m0.178s
sys     0m0.021s
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'
True

real    0m0.199s
user    0m0.155s
sys     0m0.023s
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'
True

real    0m0.193s
user    0m0.136s
sys     0m0.033s
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'
True

real    0m0.187s
user    0m0.127s
sys     0m0.040s
(py310) $

My linear search is faster than the set lookup, even after the set data is precompiled into a .pyc module.

The OP could adapt to use their text file I guess, and use more typical string searches, but only they have that data.

Big O

The thing about big O comparisons that cloud their use when people have real data is:

People forget that actual timings put values on the constants that are abstracted away when you compare only BigO. If the actual timings of two algorithms are 100x + 100 and 3*x**2 + 2*x + 1; then although the first algorithm is proportional to x and the second is proportional to x**2, there is a range for x, the data, where the second x**2 algorithm is faster.
Python is interpreted. It makes it harder to get the true BigO dependency for an interpreted algorithm when things like variable name accesses and repeated code interpretation speeds may be significant.

If people want faster, then they should ask with an expectation of needing to give data.

END.

Go deh!

Sunday, September 25, 2022

Answering a Reddit Question with timings.