Constrained Random Test-data Generation
When problem solving, I have often been left with program solutions (one or more), but without adequate data to test them with. In these circumstances you need to understand the data you have: its form, what changes, what stays the same, and the relationships between fields.
- If you have too much data, then knowledge of its form can often give ways to extract shorter data subsets from what you have.
- If you have bland data, then it's a slog to create interesting data for testing.
- If you have too little data then you need to create more.
I was looking at this Stack Overflow problem where they wanted a faster routine to process their data. The user and one other had a routine, and I created a routine, but the only data available were the two very small examples, which were useful in creating the routines but not in assessing their speed, as they were way too small.
The two data samples:
dat1 = {'a': ['1', '2', '3', '4', '5'],
'b': [['a', 'b', 'e'], ['a', 'g'],
['c', 'f'], ['d'], ['b']]}
dat2 = {'a': ['1', '3', '4', '6', '9'],
'b': [['a', 'b', 'e'], ['a', 'g', 'f'],
['c', 'f'], ['d', 'h'], ['b', 'g', 'h']]}
- Data is held in a dict with only two str keys, 'a' and 'b'.
- The 'a' key's value is a list of distinct str representations of integers.
- The length of the 'a' values list is equal to the length of the (outer) 'b' values list.
- The 'b' key's value is a list of lists of str.
- The inner-most lists of the 'b' value are of varying lengths.
- Strings within an inner-most member of a 'b' value list are unique.
- There is some repetition of 'b' value strings between inner-most lists. (The task is about coalescing inner-most lists that share one or more string members; these properties are checked in the sketch below.)
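A small checker for those properties (my addition, not part of the original question) can guard against malformed generated data later on:
def check_data_shape(data):
    '''Assert the structural properties listed above.'''
    assert set(data) == {'a', 'b'}            # Only the two str keys.
    a, b = data['a'], data['b']
    assert len(a) == len(b)                   # Equal outer lengths.
    assert len(set(a)) == len(a)              # 'a' values are distinct...
    assert all(s == str(int(s)) for s in a)   # ...str representations of ints.
    for inner in b:
        assert len(set(inner)) == len(inner)  # Inner strings are unique.

check_data_shape(dat1)
check_data_shape(dat2)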
Constrained (weighted) generation.
This will give me the ability to vary the lengths of the inner 'b' lists randomly, but with some control. Here's the function:
import random
def weighted_choice(item_weights, bincount=None):
    '''\
    Weighted choice of item from (item, weight), ... pairs.

    Weights and bincount assumed to be positive integers.
    bincount assumed > 0 (or None).

    Puts items in bins in proportion to their weights
    then uses random.choice() to select items.

    Larger bincount for more memory use but
    higher accuracy (on average).
    bincount of None will set the bincount to sum(weights).
    If sum(weights) > bincount then every item with a weight > 0
    is given at least one bin. (Which may adjust probabilities)
    '''
    weights = sum(iw[1] for iw in item_weights)
    assert weights, "Zero weight sum?"
    if bincount is None or weights < bincount:
        bincount = weights
    bins = []
    for item, wt in item_weights:
        if wt:
            item_bin_count = int(bincount * wt / weights)
            bins += [item] * (item_bin_count if item_bin_count else 1)
    while True:
        yield random.choice(bins)
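As a design aside: since Python 3.6 the standard library's random.choices accepts a weights argument directly, so an equivalent generator can be much shorter. This sketch is my own alternative for comparison, not part of the original solution:
def weighted_choice_stdlib(item_weights):
    '''Equivalent weighted generator built on random.choices.'''
    items = [item for item, _wt in item_weights]
    weights = [wt for _item, wt in item_weights]
    while True:
        yield random.choices(items, weights=weights)[0]
The bin-filling version above trades memory for per-call speed: random.choice on a prebuilt list is cheap, whereas random.choices recomputes cumulative weights on every call unless cum_weights is supplied.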
Let's say you want to generate 10_000 sweets randomly, but constrain the generation so that:
- nuts have a weight of 1
- fruit have a weight of 2
- chocolate has a weight of 3
- and candyfloss has a weight of 4
We can generate and count the actual number of each sweet generated.
# This should be omitted but is there to make it easier to get reproducible random data for the document.
random.seed('RANDOM TEST DATA GENERATION')
from collections import Counter
sweet_weights = [('nuts', 1), ('fruit', 2), ('chocolate', 3), ('candyfloss', 4)]
sweet_gen = weighted_choice(sweet_weights)
sweets = [next(sweet_gen) for _ in range(10_000)]  # Random sweets according to the given weighting
Counter(sweets).most_common() # Show counts of sweet types randomly generated
The original weights add up to 10, which makes it easier to see that the random generation takes note of the weights. They are generated randomly though:
sweets[:12]
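As an extra check (my addition, not in the original post), the observed proportions should land near each weight divided by the weight sum of 10, i.e. roughly 0.1, 0.2, 0.3 and 0.4:
total = len(sweets)
{sweet: round(count / total, 3)
 for sweet, count in Counter(sweets).most_common()}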
Pool of 'b' strings
I will generate a pool of N_STRINGS arbitrary strings. If the length of the 'a' and 'b' lists is N_ROWS, and the string members of the 'b' values' inner lists are taken from that pool, then the degree of duplication of strings between 'b's inner lists can be controlled by the ratio of N_STRINGS to N_ROWS.
from string import ascii_lowercase
N_ROWS, N_STRINGS = 1_000, 5_000
rand_string_pool = set()
while len(rand_string_pool) < N_STRINGS:
    charcount = random.randint(1, 7)
    rstring = ''.join(random.sample(ascii_lowercase, charcount))
    rand_string_pool.add(rstring)
list(rand_string_pool)[:5] # Sample
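One thing to note (my observation, not from the original post): random.sample draws distinct characters, so there are only 26 possible length-1 strings, 650 of length 2, and so on; short strings collide in the set more often, leaving longer lengths over-represented in the pool. A quick look at the spread:
sorted(Counter(map(len, rand_string_pool)).items())  # (string length, count) pairs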
Length of the 'b' list's sub-lists.
These inner-most lists of strings from the pool will have their lengths controlled randomly, drawn from a set of sizes that I weight so there is a greater chance of shorter sub-lists, using B_SUBLIST_LEN_WT.
#                    ln, Wt
B_SUBLIST_LEN_WT = ([(n, 5) for n in range(1, 6)] +   # Lengths of 1..5 most common
                    [(n, 2) for n in range(6, 10)] +
                    [(n, 1) for n in range(10, 15)])  # Longer lengths less common
b_list_len_gen = weighted_choice(B_SUBLIST_LEN_WT)
string_pool = list(rand_string_pool)           # random.sample needs a sequence, not a set, in Python 3.11+
b_list = [random.sample(string_pool,           # Sample from pool
                        next(b_list_len_gen))  # Number of strings to take
          for _ in range(N_ROWS)]              # Total number of 'b' sub-lists.
b_list[:3] # Sample
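To confirm the length weighting took effect (my addition), count the generated sub-list lengths; 1..5 should dominate, tailing off through 14:
sorted(Counter(map(len, b_list)).items())  # (sub-list length, count) pairs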
'a' list is just a count
a_list = [str(i + 1) for i in range(N_ROWS)]
a_list[:3] # Sample
Data generator function
Let's put it all together as a parameterisable function:
def data_gen(n_rows, n_strings=None):
    if n_strings is None:
        n_strings = n_rows * 4
    rand_string_pool = set()
    while len(rand_string_pool) < n_strings:
        charcount = random.randint(1, 7)
        rstring = ''.join(random.sample(ascii_lowercase, charcount))
        rand_string_pool.add(rstring)
    string_pool = list(rand_string_pool)  # random.sample needs a sequence in Python 3.11+
    #                    ln, Wt
    B_SUBLIST_LEN_WT = ([(n, 5) for n in range(1, 6)] +   # Lengths of 1..5 most common
                        [(n, 2) for n in range(6, 10)] +
                        [(n, 1) for n in range(10, 15)])  # Longer lengths less common
    b_list_len_gen = weighted_choice(B_SUBLIST_LEN_WT)
    b_list = [random.sample(string_pool,           # Sample from pool
                            next(b_list_len_gen))  # Number of strings to take
              for _ in range(n_rows)]              # Total number of 'b' sub-lists.
    a_list = [str(i + 1) for i in range(n_rows)]
    return dict(a=a_list, b=b_list)
data_gen(5)
data_gen(4)
data_gen(3)
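A usage sketch of my own, not from the original post: seed before generating so that repeated timing runs of competing routines all see identical data.
random.seed('RANDOM TEST DATA GENERATION')  # Any fixed seed gives reproducible data
big_data = data_gen(10_000)
len(big_data['a']), len(big_data['b'])      # Both 10_000
check_data_shape(big_data)                  # Properties from the bullet list still hold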
Summary.
The data_gen function generates larger data with which I was then able to make more meaningful timings. With more information from the question poser on the typical lengths of the 'b' sub-lists, and some measure of the reuse of strings between the sub-lists of 'b', B_SUBLIST_LEN_WT could be tuned to give even more realistic data for the timing, tuning, and comparison of functions acting on that data.