In [1]:
#!pip install import_ipynb
In [2]:
import import_ipynb
import set_consolidation_speed
from set_consolidation_speed import consolidate, consolidate_np, consolidate_ramr, \
weighted_choice
In [3]:
import random
from pprint import pprint as pp
def rand_data_gen2(n=5, aspect=1, silent=False):
    """Gen n//aspect sublists of randomised data with average set size multiplied by aspect.

    Returns (data, avg) where data is a list of lists of item-name strings
    and avg is the mean sublist length.  Prints summary stats unless silent.

    Fix: the one-line description used to sit as a bare string inside a dead
    `if not silent:` block, so the function had no docstring and the `if`
    did nothing — it is now a real docstring.
    """
    # Need a pool of items to randomly select for each group
    n_pool_items = int(2.5 * n)
    pool_strings = [f'I{x}' for x in range(n_pool_items)]
    # groups will be of random but weighted size,
    group_len_weights = ([(x, 5) for x in range(1, 6)] +    # lengths of 1..5 weights
                         [(x, 2) for x in range(6, 10)] +   # lengths of 6..9 weights
                         [(x, 1) for x in range(10, 16)])   # lengths of 10..15 weights
    # Each draw takes `length * aspect` items, so filter on that product —
    # filtering on the bare length (as before) let random.sample() ask for
    # more items than the pool holds and raise ValueError for large aspects.
    group_len_weights = [(ln, wt) for ln, wt in group_len_weights
                         if ln * aspect <= n_pool_items]
    group_len_gen = weighted_choice(group_len_weights)
    data = [random.sample(pool_strings, next(group_len_gen) * aspect)
            for _ in range(n // aspect)]
    # stats — guard the mean: n < aspect yields zero rows, which used to
    # divide by zero here.
    avg = (sum(len(x) for x in data) / len(data)) if data else 0
    if not silent:
        print(f"""
GENERATED DATA STATS:
    n: {n}
    Aspect ratio: {aspect}
    Row count: {len(data)}
    Pool of random items: {n_pool_items}
    Average count of items per group: {avg}
""")
    return data, avg
In [4]:
# This is just for repeatability and can be deleted
# (seeds the global `random` module PRNG so rand_data_gen2 produces the
# same data on every full run of the notebook).
random.seed('RANDOM TEST DATA GENERATION')
In [5]:
if __name__ == '__main__':
    # Smoke-test the generator: print stats for 2**15 items at each of
    # three aspect ratios (the returned data is deliberately discarded).
    for ratio in (1, 2, 4):
        rand_data_gen2(2**15, ratio)
In [6]:
import timeit
import datetime
import pandas as pd
import copy
def runtime2(methods, random_data):
    "Time the methods acting on the data"
    rep = 2
    # Stash the data on the function object so the timeit setup string can
    # reach it via `from __main__ import ... runtime2` — keeps each timed
    # run working on a fresh deep copy rather than mutated leftovers.
    runtime2.random_data = random_data
    timings = []
    for fn in methods:
        name = fn.__name__
        cmd = f"{name}(ddd)"
        setup = (f"from __main__ import copy, {name}, runtime2\n"
                 "ddd = copy.deepcopy(runtime2.random_data)")
        # Average of `rep` runs, reported as a timedelta for readability.
        per_run = timeit.Timer(cmd, setup).timeit(number=rep) / rep
        timings.append(datetime.timedelta(seconds=per_run))
    return timings
def method_compare(methods, rows=[128, 256], aspects=[1, 2, 4]):
    "Compare timings of methods with data generated from specified row counts and aspect ratios"
    # Column names for the stats part of the table.  NOTE: split() keeps the
    # punctuation, so the columns are literally 'n,', 'aspect,', 'rows,',
    # 'avg_set_len:' — later plotting cells index by 'avg_set_len:' and
    # depend on these exact strings.
    stat_keys = 'n, aspect, rows, avg_set_len:'.split()
    method_names = [m.__name__ for m in methods]
    results = {key: [] for key in stat_keys + method_names}
    for n in rows:
        for aspect in aspects:
            # One timing row per (n, aspect) combination.
            data, avg_set_len = rand_data_gen2(n, aspect, silent=True)
            timings = runtime2(methods, data)
            stats = (n, aspect, n // aspect, avg_set_len)
            for key, val in zip(stat_keys, stats):
                results[key].append(val)
            for key, val in zip(method_names, timings):
                results[key].append(val)
    return pd.DataFrame(results)
In [7]:
# Head-to-head timing of all three consolidation implementations on small
# inputs: 16 and 256 rows, aspect ratios 1 and 2.  Bare `results` on the
# last line displays the DataFrame as rich cell output.
methods = [consolidate, consolidate_ramr, consolidate_np]
rows = [2**4, 2**8]
aspects = [1, 2]
results = method_compare(methods, rows, aspects)
results
Out[7]:
In [8]:
# Larger inputs (2**10 and 2**12 rows) over a wide sweep of aspect ratios.
# NOTE(review): consolidate_np is dropped from this comparison — presumably
# too slow at this scale; confirm against the previous cell's timings.
methods = [consolidate, consolidate_ramr]
rows = [2**10, 2**12]
aspects = [2**x for x in range(7)]
results2 = method_compare(methods, rows, aspects)
results2
Out[8]:
In [9]:
# For each row count, time both methods across the aspect ratios and plot
# runtime against average set length.
# NOTE: `res` leaks out of the loop holding only the FINAL n's table
# (n=2**13); the following cells read it.
methods = [consolidate, consolidate_ramr]
rows = [2**x for x in range(10, 14)]
aspects = [1, 4, 16, 64, 256]
for n in rows:
    res = method_compare(methods, [n], aspects)
    res.plot.line(y=['consolidate', 'consolidate_ramr'], x='avg_set_len:', title=f'RUNTIME. n={n} for all aspect changes')
In [10]:
# Show the timing table for the last n from the loop above.
display(res)
In [11]:
# consolidate RUNTIME CHANGES OVER 1-to-256 ASPECT CHANGE
# `res` has one row per aspect in (1, 4, 16, 64, 256), so the 1-to-256
# ratio compares row 0 with row 4.  The original divided by row 1, which
# is the aspect-4 row, contradicting the stated 1-to-256 comparison.
res['consolidate'][0].value / res['consolidate'][4].value
Out[11]:
In [12]:
# consolidate_ramr RUNTIME CHANGES OVER 1-to-256 ASPECT CHANGE
# Row 4 is the aspect-256 row (aspects are 1, 4, 16, 64, 256); the
# original used row 1 (aspect 4), contradicting the stated comparison.
res['consolidate_ramr'][0].value / res['consolidate_ramr'][4].value
Out[12]:
In [13]:
# Scale consolidate_ramr alone up to very large inputs (2**16 and 2**20
# rows), plotting and displaying the table for each n.
# NOTE: as above, `res` leaks from the loop holding only the final n's
# (n=2**20) table, which the next cell reads.
methods = [consolidate_ramr]
rows = [2**x for x in [16, 20]]
aspects = [1, 4, 16, 64, 256]
for n in rows:
    res = method_compare(methods, [n], aspects)
    res.plot.line(y=['consolidate_ramr'], x='avg_set_len:', title=f'RUNTIME. n={n} for all aspect changes')
    display(res)
In [14]:
# consolidate_ramr RUNTIME CHANGES OVER 1-to-256 ASPECT CHANGE
# Row 4 is the aspect-256 row (aspects are 1, 4, 16, 64, 256); the
# original used row 1 (aspect 4), contradicting the stated comparison.
res['consolidate_ramr'][0].value / res['consolidate_ramr'][4].value
Out[14]:
In [ ]: