Go deh!

Mainly Tech projects on Python and Electronic Design Automation.

Sunday, September 25, 2022

Answering a Reddit Question with timings.

Best viewed on larger than a phone screen.


Someone had a problem on Reddit r/python:

Hello guys, I want to find a string in a list and this list has 350K elements all they are strings . I want to find out a good algorithm that can find the string very quick . I know linear search but want to figure out other ways if possible.

The problem had already had over seventy answers when I came across it, but I wanted to check:

  1. Had people teased more information out of the original poster?
  2. Was there a pocket of timed goodness in there?

I read on.

The answers given were polite. Yay, here's to the python community πŸ»πŸ‘ŒπŸΎπŸ‘πŸΎ! 

The answers did give alternatives to linear search, but so many posts had no timings, and the OP was still pretty vague in the data he gave, considering he was wanting "faster".

I was late to this party, but wanted to write some Python, as I can find coding relaxing when I write in the spirit of helping (and I too might learn something). It's not the first time I've tried to help out, and I decided on three points:

  1. Get more realistic timings, or a timing framework where they can go in and alter things to get timings that better represent their data.
  2. The answers mentioned linear search of the text, sets, and tries. I would skip tries, as I have already found that it's hard to get over their speed-of-interpretation limitations.
  3. I'm gonna need several files to do the timings - let's try creating them from one initial Python file. (Why not? It keeps me interested.)


I chose to code in the Spyder IDE for this, rather than VSCode. YMMV, but I still find Spyder supports a more "dynamic" development style, great for scripting, and I am not writing hundreds of lines of code for this.

I thought of the steps I would need; it seemed simple enough, so I just started coding the first, string_in_long_list_1.py.

Create a text file of 350K sorted "strings"

What is meant by strings? I don't know, don't sweat it - I'll make each line several words with one word having an increasing count appended:

# -*- coding: utf-8 -*-
"""
Created on Sat Sep 24 10:00:51 2022

@author: paddy3118
"""

# %% Create txt file of 350_000 strings

from pathlib import Path

lines = '\n'.join(f"Fee fi word{i:06} fum" for i in range(350_000))

txtfile = Path('string_in_long_list_2.txt')
txtfile.write_text(lines, 'utf8')

Notice the # %% cell comment. It allows me to quickly run and rerun the code in the cell and check the values of variables in Spyder until I get it right. (VSCode will do this too, but I'm more comfortable with Spyder's implementation).

Sets; On your marks...

If the OP was going to test for strings multiple times against the one text file - which I think I read in the original poster's replies to others - then I would like to time set inclusion, but reduce the time to create or read the set.

I decided on creating a module file that assigns the set of lines in the text file to a variable, data. Another program would load this module and check for string inclusion; Python would compile the module so subsequent loadings would be faster.

# %% Create a module of the set of those lines assigned to name 'data'

import pprint

set_of_strings = set(txtfile.read_text('utf8').split('\n'))
moduletxt = f"""\
# -*- coding: utf-8 -*-
"Set of each line from the text file."

data = {pprint.pformat(set_of_strings, indent=2)}
"""

modulefile = Path('string_in_long_list_3.py')
modulefile.write_text(moduletxt, 'utf8')

I ran the cell and checked the generated script until it looked right.

Linear search.

Although the OP said the text file was sorted, others had already noted that binary search would be slowed because you would first have to read every line into a list... you might as well do a comparison as you read each line, so I needed to generate a script to do just that:

# %% Create a script to just search for a line containing its arg whilst reading.

scripttxt = """\
# -*- coding: utf-8 -*-
"Search for first line in text file matching argument."

import sys

to_find = sys.argv[1] if len(sys.argv) > 1 else ''

with open('string_in_long_list_2.txt', encoding='utf8') as txtfile:
    for line in txtfile:
        if to_find == line.rstrip('\\n'):
            print('Found')  # Report and stop at the first match
            break
"""

linesearchfile = Path('string_in_long_list_4.py')
linesearchfile.write_text(scripttxt, 'utf8')

When writing this, I first had no triple quotes around what became scripttxt =, so I could rerun the contained code as I debugged, then encapsulated it in scripttxt = """\ and added the last couple of lines to generate the file.

Match against set

This generates the Python script that, when given a string as its argument, will check it against data loaded as a set from its module.

# %% Create a script to look for the arg in the modules data set

scripttxt = """\
# -*- coding: utf-8 -*-
"test if argument is in data loaded from module as a set."

from string_in_long_list_3 import data
import sys

to_find = sys.argv[1] if len(sys.argv) > 1 else ''

if to_find in data:
    print('Found')
"""

modulesearchfile = Path('string_in_long_list_5.py')
modulesearchfile.write_text(scripttxt, 'utf8')

The end of the Python script!


I was always going to time runs of the scripts out in the shell of the OS, using the GNU/Linux time command with which I, and many others, are familiar. I would add comments to tell a story and show I had "nothing up my sleeves".

(py310) $
(py310) $ ls
(py310) $
(py310) $ # New dir
(py310) $ mkdir work
(py310) $ cd work
(py310) $ ls -a
.  ..
(py310) $ cp ../string_in_long_list_1.py .
(py310) $ ls -a
.  ..  string_in_long_list_1.py
(py310) $
(py310) $ # Create files
(py310) $ python3 string_in_long_list_1.py
(py310) $
(py310) $ # (How many lines)
(py310) $ wc -l *
      75 string_in_long_list_1.py
  349999 string_in_long_list_2.txt
  350002 string_in_long_list_3.py
      16 string_in_long_list_4.py
      14 string_in_long_list_5.py
  700106 total
(py310) $
(py310) $ # time linear search of text file
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'

real    0m0.102s
user    0m0.058s
sys     0m0.000s
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'

real    0m0.102s
user    0m0.045s
sys     0m0.015s
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'

real    0m0.094s
user    0m0.043s
sys     0m0.009s
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'

real    0m0.100s
user    0m0.056s
sys     0m0.000s
(py310) $ time python3 string_in_long_list_4.py 'Fee fi word175000 fum'

real    0m0.106s
user    0m0.030s
sys     0m0.030s
(py310) $
(py310) $ ls
string_in_long_list_1.py   string_in_long_list_3.py  string_in_long_list_5.py
string_in_long_list_2.txt  string_in_long_list_4.py
(py310) $
(py310) $ # time creation of module .pyc file and search of data in set
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'

real    0m1.578s
user    0m0.869s
sys     0m0.667s
(py310) $
(py310) $ # time search of data in set from compiled module file
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'

real    0m0.186s
user    0m0.120s
sys     0m0.047s
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'

real    0m0.220s
user    0m0.178s
sys     0m0.021s
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'

real    0m0.199s
user    0m0.155s
sys     0m0.023s
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'

real    0m0.193s
user    0m0.136s
sys     0m0.033s
(py310) $ time python3 string_in_long_list_5.py 'Fee fi word175000 fum'

real    0m0.187s
user    0m0.127s
sys     0m0.040s
(py310) $

My linear search is faster than the set lookup, even after the set data is precompiled into a .pyc module.
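To see that the set membership test itself is not the bottleneck - it is the startup and module-loading cost that dominates - here is a quick in-process timing sketch (my addition, reusing the same generated lines; not one of the timed scripts above):

```python
from timeit import timeit

# Recreate the 350K generated lines in memory.
lines = [f"Fee fi word{i:06} fum" for i in range(350_000)]
data = set(lines)
target = 'Fee fi word175000 fum'

# Per-lookup times once the data structures are already in memory.
t_set = timeit(lambda: target in data, number=1000) / 1000
t_scan = timeit(lambda: any(target == line for line in lines), number=10) / 10
print(f"set lookup: {t_set:.2e}s  linear scan: {t_scan:.2e}s")
```

The set lookup should come out orders of magnitude faster than the scan, so the slower wall-clock time of the set script is down to loading 350K strings from the .pyc, not the membership test.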

The OP could adapt this to use their text file, I guess, and use more typical string searches, but only they have that data.

Big O

The thing about big-O comparisons that clouds their use when people have real data is:

  • People forget that actual timings put values on the constants that are abstracted away when you compare only big-O. If the actual timings of two algorithms are 100*x + 100 and 3*x**2 + 2*x + 1, then although the first algorithm is proportional to x and the second to x**2, there is a range of x, the data size, where the second, x**2, algorithm is faster.
  • Python is interpreted. It is harder to get the true big-O dependency for an interpreted algorithm when things like variable-name accesses and repeated code-interpretation speeds may be significant.
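The first bullet can be made concrete with the two hypothetical cost functions mentioned above:

```python
def linear_cost(x):
    return 100 * x + 100         # O(x), but large constants

def quadratic_cost(x):
    return 3 * x**2 + 2 * x + 1  # O(x**2), small constants

# Find the first data size at which the O(x**2) algorithm becomes slower.
crossover = next(x for x in range(1, 1000)
                 if quadratic_cost(x) > linear_cost(x))
print(crossover)  # → 34; for all smaller x the "worse" big-O algorithm wins
```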

If people want faster, then they should ask with an expectation of needing to give data.


Sunday, July 10, 2022

Raw Python vs Python & SQLite vs GNU Linux command line utilities!

Note: Long lines may reward those reading on larger screens than phones.


There was an interesting problem posted to reddit:

  1. A large text file has one word per line
  2. Words have many duplications on other lines
  3. Sort the file in order of lines with most duplications first.

The size of the file is large in comparison with the amount of memory available to sort it - no in-memory sort of the file can be assumed to work.

Many commenters ignored the request to produce a sort of the input file and instead considered generating only the unique words of the input in order of their decreasing occurrence. That would not be a sort, as duplicate entries would be removed.
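The distinction can be shown with a few words and collections.Counter (a sketch of the two interpretations, not code from the thread):

```python
from collections import Counter

words = ['w1', 'w4', 'w3', 'w1', 'w2', 'w1', 'w3', 'w4', 'w3', 'w2']
freq = Counter(words)

# Unique words by decreasing occurrence - duplicates removed, not a sort:
unique_by_freq = [w for w, _ in freq.most_common()]
# A true reordering of the input - every line kept:
full_sort = [w for w, n in freq.most_common() for _ in range(n)]

print(unique_by_freq)  # ['w1', 'w3', 'w4', 'w2']
print(full_sort)       # ['w1', 'w1', 'w1', 'w3', 'w3', 'w3', 'w4', 'w4', 'w2', 'w2']
```

Only the second output is a permutation of the input lines, which is what was asked for.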

The original poster posted to the r/Python reddit group but said he would prefer a Linux command line solution.

sort, uniq, awk

I remembered that the original sort was written when memory was measured in kilobytes and data larger than memory was much more common. The original sort could use disk, quite seamlessly, to sort files that could not fit in memory, and GNU sort has that capability too.

I was going to:

  1. Sort the file.
  2. Use uniq to count the occurrences.
  3. Sort the word-counts numerically by order of the counts.
  4. Expand the ordered words by their counts to create the answer, using awk.

Here's the one-liner working on some test input:

$ printf 'w1\nw4\nw3\nw1\nw2\nw1\nw3\nw4\nw3\nw2\n' > /tmp/word.lst
$ cat /tmp/word.lst
w1
w4
w3
w1
w2
w1
w3
w4
w3
w2
$ sort /tmp/word.lst | uniq -c | sort -n -r
      3 w3
      3 w1
      2 w4
      2 w2
$ sort /tmp/word.lst | uniq -c | sort -n -r| awk '{for(i=0;i<$1;i++){print$2}}'
w3
w3
w3
w1
w1
w1
w4
w4
w2
w2

Test data

What makes this problem hard is the size of the data - the input text needs to be individual words per line, with words occurring at differing frequencies.

I decided to download the text of Huckleberry Finn and keep doubling it up until I had a file larger than my own 8 gigs of memory.

$ # Download Huck_finn:
$ curl https://gutenberg.org/files/76/76-0.txt -o huck_finn.txt
$ # One word per line using fmt:
$ fmt -w1 huck_finn.txt >huck_word_per_line.txt
$ # Double up to increase the file size:
$ cp huck_word_per_line.txt words.lst
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc  words.lst && du -h words.lst
 253464  228666 1239542 words.lst
1.2M    words.lst
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc  words.lst && du -h words.lst
 506928  457332 2479084 words.lst
2.4M    words.lst

$ # ...doubling repeated several more times...

$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc  words.lst && du -h words.lst
 2076377088  1873231872 10154328064 words.lst
9.5G    words.lst

The resultant file may have "junk" lines of just whitespace, numbers, or punctuation but it should still serve its purpose.


sort, uniq, awk: timings

I split the command line above into two and timed each half:

$ # Find the frequency of each word
$ time sort words.lst |uniq -c > words_uniq_count.txt

real    20m5.247s
user    83m33.507s
sys     1m38.863s
$ # Now to order by most frequent first and expand the counts into that many lines of words:
$ time sort -n -r words_uniq_count.txt | awk '$2~/[a-zA-Z]/{for(i=0;i<$1;i++){print$2}}' > words_ordered.txt

real    7m56.900s
user    3m34.341s
sys     0m33.422s

That is 28 minutes in total.

Pure Python

When thinking of my purely Python solution, I reread the original post, which emphasised the amount of duplication in the large file. I took a gamble in assuming that I would always have enough memory to process the frequency of each individual word.

The Python script streams the file in, counting the frequency of each line, then uses Counter.most_common to order the output.

import fileinput
from collections import Counter

word_freq = Counter(line for ln in fileinput.input()
                    if (line := ln.strip()))  # No blank lines
for word, count in word_freq.most_common():
    print((word + '\n') * count, end='')

Pure Python: timing

(For the same 9.5G input file)

real    21m53.181s
user    16m26.880s
sys     0m22.188s

Python and sqlite3

User bOne123 had a neat solution that used sqlite3 from Python; his code is reproduced here:

# -*- coding: utf-8 -*-
"""
@author: bOne123
"""

import sqlite3
import time

con = sqlite3.connect('/tmp/file.db') # or 'file.db' if not enough ram
cur = con.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS data (word text, count int)''')
cur.execute('''CREATE UNIQUE INDEX IF NOT EXISTS dataidx ON data(word)''')
now = time.time()
with open("words.lst", "r") as f:
    i = 0
    words = []
    for line in f:
        i += 1
        words.append((line,))
        if i % 50_000_000 == 0:
            print(time.time() - now, i)
            cur.executemany("INSERT INTO data VALUES (?, 1) ON CONFLICT(word) DO UPDATE SET count=count+1", words)
            words = []
    if words:  # Insert any remaining batch
        cur.executemany("INSERT INTO data VALUES (?, 1) ON CONFLICT(word) DO UPDATE SET count=count+1", words)
print(time.time() - now, 'index counts')
cur.execute('''CREATE INDEX IF NOT EXISTS countidx ON data(count)''')
with open("words_ordered_by_bOne123.txt", "w", newline='\n') as f:
    for row in cur.execute('SELECT word,count FROM data ORDER BY count DESC'):
        for i in range(row[1]):
            f.write(row[0])
print(time.time() - now, 'done')

Python and sqlite3: timing

$ time python sort_words_by_bOne123.py
22.579142570495605 50000000
102.96021580696106 100000000
3125.1875145435333 2050000000
3189.4789474010468 index counts
3583.4524500370026 done

real    59m44.886s
user    50m36.011s
sys     2m11.008s

Those timings take over 50 minutes just to insert into the DB.


User mikeblas had another solution, this time using just sqlite3. Unfortunately I could not get it to work (as he did). I tried his code to read my 9.5G file into an sqlite3 DB and had to kill it after it ran for nearly two hours and seemed to be doing very little after growing its DB to 27 gigs.

Timings so far

id  Algorithm           Minutes
 1  sort, uniq, awk       28
 2  Pure Python           22
 3  Python and sqlite3    60

The first algorithm needs that initial sort of the whole file just so the uniq command can be applied, and the sort takes a significant amount of time.

I ruminated on this for a couple of weeks, searched, and eventually found GNU datamash.


I've been using Unix and Linux for several decades, but have learnt to keep an eye out for new, major command line utilities - you know, like moving from vi to vim. Well, this was one of those times: I thought that counting word or line frequencies might be useful enough for people like GNU to write a utility, and I found one - datamash!

Some reading and tinkering led me to the following shell one-liner with its timing:

$ # Test on small file
$ datamash --sort --group 1 count 1 < /tmp/word.lst | sort -n -r -k 2 | awk 'NF==2{for(i=0;i<$2;i++){print$1}}'
$ # That worked, so run on 9.5G file
$ time datamash --sort --group 1 count 1 < words.lst | sort -n -r -k 2 | awk 'NF==2{for(i=0;i<$2;i++){print$1}}' > words_ordered_by_datamash.txt

real    19m15.608s
user    39m11.182s
sys     1m39.364s

All timings

id  Algorithm           Minutes
 1  sort, uniq, awk       28
 2  Pure Python           22
 3  Python and sqlite3    60
 4  datamash              19

The datamash solution was fastest, but it is new to me and I don't yet have a true feel for its internals.

I am most familiar with the pure Python and the sort-uniq-awk solutions.

I take away from this exercise the following:

  1. When reading solutions to problems, actual code trumps the number of replies and upvotes on a mere suggestion of a tool to use (AWS, DBs, ...).
  2. Try and give a sample of test data and expected output with the question; this helps tie down what is needed in a solution, and in this case would have helped resolve the split between those who gave a sort of the input and those who stopped at just ordering unique words.
  3. The Unix command line is still a powerful tool. Those who only learn cloud tools capable of handling terabytes of data, but use them on much smaller datasets ('cos they'll scale), put a smile on those cloud vendors' faces.



Friday, June 03, 2022

Python: Yield the first item of consecutive series of equal items, from an iterable.

I was reading the source to the more_itertools.unique_justseen function. The function takes an iterable that may have runs of successive equal items, and yields an item only if it is not the same as the item before it.


Since the problem involves grouping, my first thought was to apply itertools.groupby; and that is indeed what is used in the more_itertools function reproduced below:

def unique_justseen(iterable, key=None):
    """Yields elements in order, ignoring serial duplicates

    >>> list(unique_justseen('AAAABBBCCDAABBB'))
    ['A', 'B', 'C', 'D', 'A', 'B']
    >>> list(unique_justseen('ABBCcAD', str.lower))
    ['A', 'B', 'C', 'A', 'D']
    """
    return map(next, map(operator.itemgetter(1), groupby(iterable, key)))

It groups, then extracts the first of each group, and looks to be a good, tight implementation. 

I decided to find another way.

Successive items method

I thought a bit then decided:

  • The first item of the iterable is always returned.
  • Other items are returned if they are not equal to the item before it.

You need the item and the item next to it to determine if that item is yielded. I quickly toyed with thoughts of teeing, dropping one item from a tee'd branch, then comparing items from the zip of the tees, but discarded that. I thought that I could use the walrus operator in a comprehension to keep track of the previous item (called this below), and doodled this code:

it = iter('AAAABBBCCDAABBB')
print(this := next(it), end=' ')
print([(this := item) for item in it if item != this])
# Outputs: A ['B', 'C', 'D', 'A', 'B']

It seemed to work, and I set about creating my function.

My Function first_of_series

Looking at the more_itertools.unique_justseen function above, you should note that it has a key argument. The key argument is a function used to transmute the items of the iterable prior to comparison, much like its use in the sorted function. I decided to add that functionality too.

My final function is as follows:

def first_of_series(iterable, key=None):
    """
    Yield the first item of consecutive series of equal items from an iterable.

    Items have the key function applied for the comparison, if given.

    >>> list(first_of_series('AAAABBBCCDAABBB'))
    ['A', 'B', 'C', 'D', 'A', 'B']
    >>> list(first_of_series('ABBCcAD'))
    ['A', 'B', 'C', 'c', 'A', 'D']
    >>> list(first_of_series('ABBCcAD', key=str.lower))
    ['A', 'B', 'C', 'A', 'D']
    >>> list(first_of_series('ABBcCAD', key=str.lower))
    ['A', 'B', 'c', 'A', 'D']
    """
    it = iter(iterable) # Change iterable into an iterator!
    yield (this := next(it))
    if key is None:
        yield from (item for item in it if this != (this := item))
    else:
        this = key(this)
        yield from (item for item in it if this != (this := key(item)))

It works like this:

  1. Form an iterator from any input iterable so you can call next() on it.
  2. Set this to the first item from the iterable and also yield it.
  3. If there is no key:
    1. yield from a comprehension yielding items from the iterable, but only if the last item, i.e. this, is not equal to the current item.
    2. The walrus assignment updates this.
  4. Else if there is a key function:
    1. this is now always the value of the key function applied to items from the iterable.
    2. The comparison in the comprehension is always of values of the key function applied to items from the iterable; although the plain items are yielded.

I have to rely on the left-to-right order of evaluation for the if condition in the comprehensions to get the updating of variable this correct; and it's more verbose than the groupby version.
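That reliance on evaluation order can be seen in isolation: in this != (this := item), the left-hand this is read before the walrus rebinds it. A minimal standalone sketch (my addition):

```python
it = iter('ABBCcAD')
this = next(it)
# The left `this` is the previous item; the walrus then updates it to the
# current item, so each run of equal items contributes only its first member.
firsts = [this] + [item for item in it if this != (this := item)]
print(firsts)  # ['A', 'B', 'C', 'c', 'A', 'D']
```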

I did have fun creating it though, (and it reinforces my rule of using groupby when grouping).

Thursday, June 02, 2022

Python: My nwise compared to more_itertools.sliding_window

In my last post I evolved two n-wise functions from the itertools.pairwise code to create functions that will generate a sliding window of arbitrary width from an input iterator.

Someone commented on Reddit about my blog, stating:

"Is there a reason why not just do it as they do it in more-itertools?"

They then gave the source from the more-itertools site, which I extract below:

from collections import deque, Counter
from itertools import islice, tee
from timeit import Timer
import pandas as pd

def sliding_window(iterable, n):
    """Return a sliding window of width *n* over *iterable*.

        >>> list(sliding_window(range(6), 4))
        [(0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5)]

    If *iterable* has fewer than *n* items, then nothing is yielded:

        >>> list(sliding_window(range(3), 4))
        []

    For a variant with more features, see :func:`windowed`.
    """
    it = iter(iterable)
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for x in it:
        window.append(x)
        yield tuple(window)

Visual Comparison

I'll show my nwise function for comparison:

def nwise(iterable, width):
    tees = tee(iterable, width)
    skewed = [([next(t) for _ in range(position)], t)[1]
              for position, t in enumerate(tees)]
    return zip(*skewed)
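As a quick cross-check (my addition, repeating both functions with the imports they need so the snippet stands alone), the two produce the same windows:

```python
from collections import deque
from itertools import islice, tee

def sliding_window(iterable, n):
    # more_itertools version: a deque holds the current window.
    it = iter(iterable)
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for x in it:
        window.append(x)
        yield tuple(window)

def nwise(iterable, width):
    # Blog version: width tee'd iterators, each skewed one further along.
    tees = tee(iterable, width)
    skewed = [([next(t) for _ in range(position)], t)[1]
              for position, t in enumerate(tees)]
    return zip(*skewed)

print(list(nwise(range(6), 4)))  # [(0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5)]
assert list(nwise(range(6), 4)) == list(sliding_window(range(6), 4))
```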


sliding_window is nicely commented as it is in a public library; nwise is written to be read as part of its original blog post.

It's up to the reader to judge which code is more understandable. more_itertools does not have a blog post to help understand the code; nwise isn't packaged as a module - it's a coding example.

Timing Comparison

Yes, well. The timing methodology is set to compare run times over a range of input iterable sizes and output window widths. A particular application might benefit from more specific timings.

I added the following timing code:

data = dict(items=[], window=[], func=[], time=[])
fastest = []
for n in range(1, 7, 2):
    items = range(10**n)
    reps = None
    for w in range(9):
        wdth = 2**w
        if wdth <= 10**n:
            func_speeds = []
            for func in (nwise, nwise_squashed, sliding_window):
                cmd = f"list({func.__name__:<15}({str(items):<12}, {wdth:10_}))"
                print(cmd, end=' # ')
                timer = Timer(cmd, globals=globals())
                if reps is None:
                    reps, secs = timer.autorange()
                    if reps < 3:
                        reps = 3
                secs = min(timer.repeat(3, reps)) # min of 3 runs of timeit(number=reps)
                print("%1.2fs, %i reps" % (secs, reps))
                func_speeds.append((secs, func.__name__))
                data['items'].append(10**n)
                data['window'].append(wdth)
                data['func'].append(func.__name__)
                data['time'].append(secs)
            fastest.append(min(func_speeds)[1])
            print()

print("\nFunction most fast for a particular set of arguments:")
print("   ", Counter(fastest).most_common())

Generated Timings:

list(nwise          (range(0, 10),          1)) # 0.20s, 100000 reps
list(nwise_squashed (range(0, 10),          1)) # 0.22s, 100000 reps
list(sliding_window (range(0, 10),          1)) # 0.30s, 100000 reps

list(nwise          (range(0, 10),          2)) # 0.27s, 100000 reps
list(nwise_squashed (range(0, 10),          2)) # 0.29s, 100000 reps
list(sliding_window (range(0, 10),          2)) # 0.29s, 100000 reps

list(nwise          (range(0, 10),          4)) # 0.40s, 100000 reps
list(nwise_squashed (range(0, 10),          4)) # 0.43s, 100000 reps
list(sliding_window (range(0, 10),          4)) # 0.25s, 100000 reps

list(nwise          (range(0, 10),          8)) # 0.76s, 100000 reps
list(nwise_squashed (range(0, 10),          8)) # 0.78s, 100000 reps
list(sliding_window (range(0, 10),          8)) # 0.20s, 100000 reps

list(nwise          (range(0, 1000),          1)) # 0.21s, 5000 reps
list(nwise_squashed (range(0, 1000),          1)) # 0.21s, 5000 reps
list(sliding_window (range(0, 1000),          1)) # 0.90s, 5000 reps

list(nwise          (range(0, 1000),          2)) # 0.23s, 5000 reps
list(nwise_squashed (range(0, 1000),          2)) # 0.23s, 5000 reps
list(sliding_window (range(0, 1000),          2)) # 0.94s, 5000 reps

list(nwise          (range(0, 1000),          4)) # 0.26s, 5000 reps
list(nwise_squashed (range(0, 1000),          4)) # 0.28s, 5000 reps
list(sliding_window (range(0, 1000),          4)) # 0.96s, 5000 reps

list(nwise          (range(0, 1000),          8)) # 0.35s, 5000 reps
list(nwise_squashed (range(0, 1000),          8)) # 0.36s, 5000 reps
list(sliding_window (range(0, 1000),          8)) # 1.03s, 5000 reps

list(nwise          (range(0, 1000),         16)) # 0.56s, 5000 reps
list(nwise_squashed (range(0, 1000),         16)) # 0.56s, 5000 reps
list(sliding_window (range(0, 1000),         16)) # 1.23s, 5000 reps

list(nwise          (range(0, 1000),         32)) # 1.19s, 5000 reps
list(nwise_squashed (range(0, 1000),         32)) # 1.20s, 5000 reps
list(sliding_window (range(0, 1000),         32)) # 1.57s, 5000 reps

list(nwise          (range(0, 1000),         64)) # 2.69s, 5000 reps
list(nwise_squashed (range(0, 1000),         64)) # 2.85s, 5000 reps
list(sliding_window (range(0, 1000),         64)) # 2.41s, 5000 reps

list(nwise          (range(0, 1000),        128)) # 5.74s, 5000 reps
list(nwise_squashed (range(0, 1000),        128)) # 5.80s, 5000 reps
list(sliding_window (range(0, 1000),        128)) # 3.35s, 5000 reps

list(nwise          (range(0, 1000),        256)) # 15.83s, 5000 reps
list(nwise_squashed (range(0, 1000),        256)) # 15.87s, 5000 reps
list(sliding_window (range(0, 1000),        256)) # 4.58s, 5000 reps

list(nwise          (range(0, 100000),          1)) # 0.30s, 50 reps
list(nwise_squashed (range(0, 100000),          1)) # 0.30s, 50 reps
list(sliding_window (range(0, 100000),          1)) # 1.04s, 50 reps

list(nwise          (range(0, 100000),          2)) # 0.34s, 50 reps
list(nwise_squashed (range(0, 100000),          2)) # 0.33s, 50 reps
list(sliding_window (range(0, 100000),          2)) # 1.05s, 50 reps

list(nwise          (range(0, 100000),          4)) # 0.36s, 50 reps
list(nwise_squashed (range(0, 100000),          4)) # 0.37s, 50 reps
list(sliding_window (range(0, 100000),          4)) # 1.09s, 50 reps

list(nwise          (range(0, 100000),          8)) # 0.44s, 50 reps
list(nwise_squashed (range(0, 100000),          8)) # 0.46s, 50 reps
list(sliding_window (range(0, 100000),          8)) # 1.22s, 50 reps

list(nwise          (range(0, 100000),         16)) # 0.64s, 50 reps
list(nwise_squashed (range(0, 100000),         16)) # 0.64s, 50 reps
list(sliding_window (range(0, 100000),         16)) # 1.39s, 50 reps

list(nwise          (range(0, 100000),         32)) # 1.13s, 50 reps
list(nwise_squashed (range(0, 100000),         32)) # 1.13s, 50 reps
list(sliding_window (range(0, 100000),         32)) # 1.83s, 50 reps

list(nwise          (range(0, 100000),         64)) # 2.67s, 50 reps
list(nwise_squashed (range(0, 100000),         64)) # 2.64s, 50 reps
list(sliding_window (range(0, 100000),         64)) # 3.26s, 50 reps

list(nwise          (range(0, 100000),        128)) # 4.47s, 50 reps
list(nwise_squashed (range(0, 100000),        128)) # 4.38s, 50 reps
list(sliding_window (range(0, 100000),        128)) # 4.88s, 50 reps

list(nwise          (range(0, 100000),        256)) # 8.70s, 50 reps
list(nwise_squashed (range(0, 100000),        256)) # 8.57s, 50 reps
list(sliding_window (range(0, 100000),        256)) # 8.55s, 50 reps
Function most fast for a particular set of arguments:
    [('nwise', 11), ('sliding_window', 6), ('nwise_squashed', 5)]

  • sliding_window is only fastest 6 out of the 22 runs of the groups of three functions running with the same arguments. 
  • Either the nwise, or nwise_squashed algorithms are fastest most often, and with similar times between these two.


