Best viewed on larger than a phone screen.
Someone had a problem on Reddit r/python:
Hello guys, I want to find a string in a list and this list has 350K elements all they are strings . I want to find out a good algorithm that can find the string very quick . I know linear search but want to figure out other ways if possible.
The problem had already had over seventy answers when I came across it, but I wanted to check:
- Had people teased more information out of the original poster?
- Was there a pocket of timed goodness in there?
I read on.
The answers given were polite. Yay, here's to the python community π»ππΎππΎ!
The answers did give alternatives to linear search, but so many posts had no timings, and the OP was still pretty vague in the data he gave, considering he was wanting "faster".
I was late to this party, but wanted to write some Python, as I can find coding relaxing if I write in the spirit of helping, (and remember I too might learn something). It's not the first time I've tried to help out and decided on three points:
- Get more realistic timings, or a timing framework were they can go in and alter things to get timings that more represent their data.
- The answers mentioned linear search of the text, sets, and tries. I would skip tries as I have already found that its hard to get over their speed of interpretation limitations.
- I'm gonna need several files to do the timings - lets try creating them from one initial python file. (Why not, keeps me interested).
Coding.
I chose to code in the Spyder IDE for this, rather than Vscode. YMMVm but I still find Spyder to support a more "dynamic" development style, great for scripting and I am not writing hundreds of lines of code for this.
I thought of the steps I would need, seemed simple enough so just started coding the first, string_in_long_list_1.py.
Create a text file of 350K sorted "strings"
What is meant by strings? I don't know, don't sweat it - I'll make each line several words with one word having an increasing count appended:
Notice the # %% cell comment. It allows me to quickly run and rerun the code in the cell and check the values of variables in Spyder until I get it right. (Vscode will do this too, but I'm more comfortable with Spyder's implementation).
Sets; On your marks...
If the OP was going to test for strings in the text file multiple times for the one text file, which I think I read in the original posters replies to others, then I would like to time set inclusion, but reduce the time to create or read the set.
I decided on creating a module file that assigns the set of lines in the text file to a variable data. Another prog would load this module and check for string inclusion and compile the module so subsequent loadings would be faster.
I ran the cell and checked the generated script until it looked right.
Linear search.
Although the OP said the text file was sorted, others had already noted that trying binary search would be slowed as you would have to read every line first into a list... you might as well do a comparison as you read each line, so I needed to generate a script to do just that:
When writing this, I first had no tripple quotes around what became scripttxt =, and could rerun the contained code as I debugged, then encapsulate in scripttxt = """\ and add the the last couple of lines to generate the file.
Match against set
This generates the python script that when given a string as its argument, will check against data loaded as a set from its module.
The end of the Python script!
Timing
I was always going to time runs of the scripts out in the shell of the OS, for which I, and many others are familiar with the GNU/Linux time command. I would add comments to tell a story and show I had "nothing up my sleeves".
My linear search is faster than the set lookup, even after the set data is precompiled into a .pyc module.
The OP could adapt to use their text file I guess, and use more typical string searches, but only they have that data.
Big O
The thing about big O comparisons that cloud their use when people have real data is:
- People forget that actual timings put values on the constants that are abstracted away when you compare only BigO. If the actual timings of two algorithms are 100x + 100 and 3*x**2 + 2*x + 1; then although the first algorithm is proportional to x and the second is proportional to x**2, there is a range for x, the data, where the second x**2 algorithm is faster.
- Python is interpreted. It makes it harder to get the true BigO dependency for an interpreted algorithm when things like variable name accesses and repeated code interpretation speeds may be significant.
If people want faster, then they should ask with an expectation of needing to give data.
END.