In [1]:
dat1 = {'a': ['1', '2', '3', '4', '5'],
        'b': [['a', 'b', 'e'], ['a', 'g'],
              ['c', 'f'], ['d'], ['b']]}
In [2]:
ans1 = {'a': [['1', '2', '5'], ['3'], ['4']],
        'b': [['a', 'b', 'e', 'g'], ['c', 'f'], ['d']]}
In [3]:
dat2 = {'a': ['1', '3', '4', '6', '9'],
        'b': [['a', 'b', 'e'], ['a', 'g', 'f'],
              ['c', 'f'], ['d', 'h'], ['b', 'g', 'h']]}
In [4]:
ans2 = {'a': [['1', '3', '4', '6', '9']],
        'b': [['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']]}
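The dat1/dat2 dicts above are test fixtures: key 'a' holds one label per row and key 'b' holds the sub-lists to consolidate. The ans1/ans2 dicts give the expected consolidations: sub-lists sharing any item are merged, repeatedly, until every remaining group is disjoint (the 'a' labels are grouped to match).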
In [5]:
def consolidate(sets):
    "Python sets orig."
    setlist = [set(s) for s in sets if s]   # Remove empty containers and convert to sets
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:            # Merge overlapping sets into the later one
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]        # Drop the emptied-out sets
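Note that consolidate compares each set against every later one, so in the worst case the work grows quadratically with the number of groups; the timings further down show where that begins to bite.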
In [6]:
blist1 = dat1['b']
consolidate(blist1)
Out[6]:
In [7]:
blist2 = dat2['b']
consolidate(blist2)
Out[7]:
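Since set ordering is arbitrary, a quick sanity check (a minimal sketch, not a cell from the original notebook; it mirrors the same_format helper defined later) normalises the results before comparing them against ans1/ans2:

def normalise(groups):
    "Sort items within each group, then sort the groups, for comparison."
    return sorted(sorted(g) for g in groups)

assert normalise(consolidate(dat1['b'])) == normalise(ans1['b'])
assert normalise(consolidate(dat2['b'])) == normalise(ans2['b'])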
In [8]:
import numpy as np
NP_STRINGS = 'U12'   # Max size of numpy strings

def consolidate_np(lists):
    "Numpy sets"
    global NP_STRINGS
    np0 = np.array([], dtype=NP_STRINGS)
    setlist = [np.array(lst, dtype=NP_STRINGS) for lst in lists]
    for i, s1 in enumerate(setlist):
        if s1.size:
            for j, s2 in enumerate(setlist[i+1:], i+1):
                if np.intersect1d(s1, s2, assume_unique=True).size:
                    # Merge into the later array and empty the earlier one
                    s1 = setlist[j] = np.union1d(s2, s1)
                    setlist[i] = np0
    return [s.tolist() for s in setlist if s.size]
In [9]:
consolidate_np(blist1)
Out[9]:
[['c', 'f'], ['d'], ['a', 'b', 'e', 'g']]
In [10]:
consolidate_np(blist2)
Out[10]:
[['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']]
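consolidate_np returns plain lists, already sorted by np.union1d, so a hedged cross-check against the pure-Python version can reuse the normalise sketch from above:

assert normalise(consolidate_np(dat1['b'])) == normalise(consolidate(dat1['b']))
assert normalise(consolidate_np(dat2['b'])) == normalise(consolidate(dat2['b']))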
In [11]:
grouplist = [['a', 'b', 'e'],    # 0
             ['a', 'g', 'f'],    # 1
             ['c', 'f'],         # 2
             ['d', 'h'],         # 3
             ['b', 'g', 'h']]    # 4
In [12]:
root = {}
for group in grouplist:
    name = group[0]
    for item in group:
        root[item] = name
root
Out[12]:
{'a': 'a', 'b': 'b', 'e': 'a', 'g': 'b', 'f': 'c', 'c': 'c', 'd': 'd', 'h': 'b'}
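The cell above is the kernel of the dict-based method: root maps each item to a representative name (the first item of the last group that mentioned it), so a representative can be found by chasing entries until one maps to itself, much like the parent pointers of a union-find structure. A minimal illustrative sketch (find is hypothetical and not used below; it assumes the chains terminate, which consolidate_ramr ensures by re-rooting as it merges):

def find(root, item):
    "Chase representatives until an item is its own root."
    while root[item] != item:
        item = root[item]
    return item

find(root, 'h')   # 'h' -> 'b', and 'b' is its own root

Plain overwriting has lost the link between the 'a'-rooted and 'b'-rooted groups (they share item 'b'); recording and merging such indirections is exactly what consolidate_ramr adds.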
In [13]:
def consolidate_ramr(data):
    "root-and-merge-roots"
    root = {}
    grouplist = (d for d in data if d)              # Remove null sublists
    for group in grouplist:
        name = group[0]
        for item in group:
            prevroot = root.setdefault(item, name)  # Returns name if item isn't in root
            indirections = [name]
            while prevroot != name:                 # Do we need to look back through root?
                indirections.append(prevroot)
                name, prevroot = prevroot, root[prevroot]
            for ind in indirections:
                root[ind] = prevroot                # Root/re-root as necessary
    # This flattens any remaining indirections.
    for item in root.keys():
        parent = root[item]
        indirections = []
        while item != parent:
            indirections.append(item)
            item, parent = parent, root[parent]
        for ind in indirections:
            root[ind] = parent
    # Re-group on root names
    regroup = {}
    for item, name in root.items():
        regroup.setdefault(name, []).append(item)
    return list(regroup.values())
In [14]:
consolidate_ramr(blist1)
Out[14]:
[['a', 'b', 'e', 'g'], ['c', 'f'], ['d']]
In [15]:
consolidate_ramr(blist2)
Out[15]:
[['a', 'b', 'e', 'g', 'f', 'c', 'd', 'h']]
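Thanks to the re-rooting and the final flattening pass, consolidate_ramr touches each item only a near-constant number of times, which is why it is the sole method run at the largest sizes below. The same hedged fixture check as before, reusing the normalise sketch:

assert normalise(consolidate_ramr(dat1['b'])) == normalise(ans1['b'])
assert normalise(consolidate_ramr(dat2['b'])) == normalise(ans2['b'])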
In [16]:
import random
from string import ascii_lowercase
import copy

def weighted_choice(item_weights, bincount=1_000):
    '''\
    Weighted choice of item from (item, weight), ... pairs.
    Puts items in bins in proportion to their weights,
    then uses random.choice() to select items.
    A larger bincount uses more memory but gives
    higher accuracy (on average).
    '''
    bins = []
    weights = sum(iw[1] for iw in item_weights)
    bincount = min([weights, bincount])
    for item, wt in item_weights:
        bins += [item] * int(bincount * wt / weights)
    while True:
        yield random.choice(bins)

def rand_data_gen(n=5):
    "Generate n sublists of randomised data"
    # Need a pool of items to randomly select from for each group
    n_pool_items = int(2.5 * n)
    pool_strings = [f'I{x}' for x in range(n_pool_items)]
    # Groups will be of random but weighted size
    group_len_weights = ([(n, 5) for n in range(1, 6)] +     # Weights for lengths 1..5
                         [(n, 2) for n in range(6, 10)] +    # Weights for lengths 6..9
                         [(n, 1) for n in range(10, 16)])    # Weights for lengths 10..15
    group_len_weights = [(ln, wt) for ln, wt in group_len_weights
                         if ln <= n_pool_items]
    group_len_gen = weighted_choice(group_len_weights)
    data = [random.sample(pool_strings, next(group_len_gen))
            for _ in range(n)]
    # Stats
    avg = sum(len(x) for x in data) / n
    print(f"""
GENERATED DATA STATS:
    Row count: {n}
    Pool of random items: {n_pool_items}
    Average count of items per group: {avg}""")
    return data
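To see the length weighting in action, a small hedged sketch (collections.Counter is an addition, not part of the original notebook) draws a batch of group lengths; the 1..5, 6..9 and 10..15 bands should turn up in roughly 25:8:6 proportion:

from collections import Counter

length_gen = weighted_choice([(n, 5) for n in range(1, 6)] +
                             [(n, 2) for n in range(6, 10)] +
                             [(n, 1) for n in range(10, 16)])
print(sorted(Counter(next(length_gen) for _ in range(10_000)).items()))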
In [17]:
# This is just for repeatability and can be deleted
random.seed('RANDOM TEST DATA GENERATION')
In [18]:
rand_data_gen(5)
Out[18]:
In [19]:
rand_data_gen(10)
Out[19]:
In [20]:
import timeit
import datetime

def same_format(d):
    "Make comparable"
    return sorted(sorted(lst) for lst in d)

def runtime(methods, random_data):
    print("\n## \n## TIMING COMPARISONS WITH SAME RANDOM TEST DATA\n##")
    rep = 2
    results, times = [], []
    # Stash the data on the function object so the timeit setup
    # string can reach it via __main__.
    runtime.random_data = random_data
    for method in methods:
        print()
        name = method.__name__
        print(name, method.__doc__)
        runtime.result = result = []
        cmd = f"runtime.result[:] = [{name}(ddd)]"
        # The setup deepcopies the data so the timed code never
        # works on input altered by another method.
        setup = (f"from __main__ import copy, {name}, runtime\n"
                 "ddd = copy.deepcopy(runtime.random_data)")
        t = timeit.Timer(cmd, setup).timeit(number=rep) / rep
        print(datetime.timedelta(seconds=t))
        results.append(same_format(result[0]))
        times.append(t)   # seconds
    return results, times
In [21]:
from itertools import combinations

def results_compare(methods, ans):
    print("\n## \n## METHOD RESULT COMPARISON WITH SAME RANDOM TEST DATA\n##")
    for data, method in zip(ans, methods):
        name = method.__name__
        avg = sum(len(x) for x in data) / len(data)
        print(f"""\
    {name} Result stats:
        Consolidated group count: {len(data)}
        Average count of items per group: {avg}""")
    print()
    for c1, c2 in combinations(range(len(ans)), 2):
        n1, n2 = methods[c1].__name__, methods[c2].__name__
        print('    Data from', n1, '== Data from', n2, '?',
              'YES' if ans[c1] == ans[c2] else 'NO!')
In [32]:
if __name__ == '__main__':
    methods = [consolidate, consolidate_np, consolidate_ramr]
    n = 100
    data = rand_data_gen(n)
    results, times = runtime(methods, data)
    results_compare(methods, results)
In [33]:
if __name__ == '__main__':
    methods = [consolidate, consolidate_np, consolidate_ramr]
    n = 200
    data = rand_data_gen(n)
    results, times = runtime(methods, data)
    results_compare(methods, results)
In [34]:
if __name__ == '__main__':
    methods = [consolidate, consolidate_ramr]
    n = 1000
    data = rand_data_gen(n)
    results, times = runtime(methods, data)
    results_compare(methods, results)
In [35]:
if __name__ == '__main__':
    methods = [consolidate, consolidate_ramr]
    n = 10_000
    data = rand_data_gen(n)
    results, times = runtime(methods, data)
    results_compare(methods, results)
In [36]:
if __name__ == '__main__':
    methods = [consolidate, consolidate_ramr]
    n = 20_000
    data = rand_data_gen(n)
    results, times = runtime(methods, data)
    results_compare(methods, results)
In [37]:
if __name__ == '__main__':
    methods = [consolidate_ramr]
    n = 100_000
    data = rand_data_gen(n)
    results, times = runtime(methods, data)
    results_compare(methods, results)
In [38]:
if __name__ == '__main__':
    methods = [consolidate_ramr]
    n = 1_000_000
    data = rand_data_gen(n)
    results, times = runtime(methods, data)
    results_compare(methods, results)