Constrained Random Test-data Generation
When problem solving, I have often been left with program solutions (one or more), but without adequate data to test them with. In these circumstances you need to understand the data you have: its form, what changes, what stays the same, and the relationships between fields.
- If you have too much data, then knowledge of its form can often give ways to extract shorter data subsets from what you have.
- If you have bland data, then it's a slog to create interesting data for testing.
- If you have too little data then you need to create more.
I was looking at this Stack Overflow problem where they wanted a faster routine to process their data. The user and one other had a routine, and I created a routine, but the only data available were the two very small examples, which were useful in creating the routines but not in assessing their speed, as they were way too small.
The two data samples:
dat1 = {'a': ['1', '2', '3', '4', '5'],
'b': [['a', 'b', 'e'], ['a', 'g'],
['c', 'f'], ['d'], ['b']]}
dat2 = {'a': ['1', '3', '4', '6', '9'],
'b': [['a', 'b', 'e'], ['a', 'g', 'f'],
['c', 'f'], ['d', 'h'], ['b', 'g', 'h']]}
- Data is held in a dict with only two str keys, 'a' and 'b'.
- The 'a' key's value is a list of distinct str representations of integers.
- The length of the 'a' values list is equal to the length of the (outer) 'b' values list.
- The 'b' key's value is a list of lists of str.
- The inner-most lists of the 'b' value are of varying lengths.
- Strings within an inner-most member of a 'b' value list are unique.
- There is some repetition of 'b' value strings between inner-most lists. (The task is about coalescing inner-most lists that share one or more string members; these properties are checked in the sketch below.)
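A small checker for those properties (my addition, not part of the original question) can guard against malformed generated data later on:
def check_data_shape(data):
    '''Assert the structural properties listed above.'''
    assert set(data) == {'a', 'b'}            # Only the two str keys.
    a, b = data['a'], data['b']
    assert len(a) == len(b)                   # Equal outer lengths.
    assert len(set(a)) == len(a)              # 'a' values are distinct...
    assert all(s == str(int(s)) for s in a)   # ...str representations of ints.
    for inner in b:
        assert len(set(inner)) == len(inner)  # Inner strings are unique.

check_data_shape(dat1)
check_data_shape(dat2)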
Constrained (weighted) generation.
This will give me the ability to vary the lengths of the inner 'b' lists randomly, but with some control. Here's the function:
import random
def weighted_choice(item_weights, bincount=None):
    '''\
    Weighted choice of item from (item, weight), ... pairs.

    Weights and bincount assumed to be positive integers.
    bincount assumed > 0 (or None).

    Puts items in bins in proportion to their weights
    then uses random.choice() to select items.

    Larger bincount for more memory use but
    higher accuracy (on average).
    bincount of None will set the bincount to sum(weights).
    If sum(weights) > bincount then every item with a weight > 0
    is given at least one bin. (Which may adjust probabilities)
    '''
    weights = sum(iw[1] for iw in item_weights)
    assert weights, "Zero weight sum?"
    if bincount is None or weights < bincount:
        bincount = weights
    bins = []
    for item, wt in item_weights:
        if wt:
            item_bin_count = int(bincount * wt / weights)
            bins += [item] * (item_bin_count if item_bin_count else 1)
    while True:
        yield random.choice(bins)
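As a design aside: since Python 3.6 the standard library's random.choices accepts a weights argument directly, so an equivalent generator can be much shorter. This sketch is my own alternative for comparison, not part of the original solution:
def weighted_choice_stdlib(item_weights):
    '''Equivalent weighted generator built on random.choices.'''
    items = [item for item, _wt in item_weights]
    weights = [wt for _item, wt in item_weights]
    while True:
        yield random.choices(items, weights=weights)[0]
The bin-filling version above trades memory for per-call speed: random.choice on a prebuilt list is cheap, whereas random.choices recomputes cumulative weights on every call unless cum_weights is supplied.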
Let's say you want to generate 10_000 sweets randomly, but constrain the generation so that:
- nuts have a weight of 1
- fruit have a weight of 2
- chocolate has a weight of 3
- and candyfloss has a weight of 4
We can generate and count the actual number of each sweet generated.
# This should be omitted but is there to make it easier to get reproducible random data for the document.
random.seed('RANDOM TEST DATA GENERATION')
from collections import Counter
sweet_weights = [('nuts', 1), ('fruit', 2), ('chocolate', 3), ('candyfloss', 4)]
sweet_gen = weighted_choice(sweet_weights)
sweets = [next(sweet_gen) for _ in range(10_000)]  # Random sweets according to the given weighting
Counter(sweets).most_common() # Show counts of sweet types randomly generated
The original weights add up to 10, which makes it easier to see that the random generation takes note of the weights. They are generated randomly though:
sweets[:12]
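As an extra check (my addition, not in the original post), the observed proportions should land near each weight divided by the weight sum of 10, i.e. roughly 0.1, 0.2, 0.3 and 0.4:
total = len(sweets)
{sweet: round(count / total, 3)
 for sweet, count in Counter(sweets).most_common()}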
Pool of 'b' strings
I will generate a pool of N_STRINGS arbitrary strings. If the length of the 'a' and 'b' lists is N_ROWS, and the string members of the 'b' values' inner lists are taken from that pool, then the degree of duplication of strings between 'b's inner lists can be controlled by the ratio of N_STRINGS to N_ROWS.
from string import ascii_lowercase
N_ROWS, N_STRINGS = 1_000, 5_000
rand_string_pool = set()
while len(rand_string_pool) < N_STRINGS:
    charcount = random.randint(1, 7)
    rstring = ''.join(random.sample(ascii_lowercase, charcount))
    rand_string_pool.add(rstring)
list(rand_string_pool)[:5] # Sample
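One thing to note (my observation, not from the original post): random.sample draws distinct characters, so there are only 26 possible length-1 strings, 650 of length 2, and so on; short strings collide in the set more often, leaving longer lengths over-represented in the pool. A quick look at the spread:
sorted(Counter(map(len, rand_string_pool)).items())  # (string length, count) pairs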
Length of the 'b' list's sub-lists.
These inner-most lists of strings from the pool will have their lengths controlled randomly, drawn from a set of sizes that I weight so there is a greater chance of shorter sub-lists, using B_SUBLIST_LEN_WT.
#                    ln, Wt
B_SUBLIST_LEN_WT = ([(n, 5) for n in range(1, 6)] +   # Lengths of 1..5 most common
                    [(n, 2) for n in range(6, 10)] +
                    [(n, 1) for n in range(10, 15)])  # Longer lengths less common
b_list_len_gen = weighted_choice(B_SUBLIST_LEN_WT)
string_pool = list(rand_string_pool)           # random.sample needs a sequence, not a set, in Python 3.11+
b_list = [random.sample(string_pool,           # Sample from pool
                        next(b_list_len_gen))  # Number of strings to take
          for _ in range(N_ROWS)]              # Total number of 'b' sub-lists.
b_list[:3] # Sample
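To confirm the length weighting took effect (my addition), count the generated sub-list lengths; 1..5 should dominate, tailing off through 14:
sorted(Counter(map(len, b_list)).items())  # (sub-list length, count) pairs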
'a' list is just a count
a_list = [str(i + 1) for i in range(N_ROWS)]
a_list[:3] # Sample
Data generator function
Let's put it all together as a parameterisable function:
def data_gen(n_rows, n_strings=None):
    if n_strings is None:
        n_strings = n_rows * 4
    rand_string_pool = set()
    while len(rand_string_pool) < n_strings:
        charcount = random.randint(1, 7)
        rstring = ''.join(random.sample(ascii_lowercase, charcount))
        rand_string_pool.add(rstring)
    string_pool = list(rand_string_pool)  # random.sample needs a sequence in Python 3.11+
    #                    ln, Wt
    B_SUBLIST_LEN_WT = ([(n, 5) for n in range(1, 6)] +   # Lengths of 1..5 most common
                        [(n, 2) for n in range(6, 10)] +
                        [(n, 1) for n in range(10, 15)])  # Longer lengths less common
    b_list_len_gen = weighted_choice(B_SUBLIST_LEN_WT)
    b_list = [random.sample(string_pool,           # Sample from pool
                            next(b_list_len_gen))  # Number of strings to take
              for _ in range(n_rows)]              # Total number of 'b' sub-lists.
    a_list = [str(i + 1) for i in range(n_rows)]
    return dict(a=a_list, b=b_list)
data_gen(5)
data_gen(4)
data_gen(3)
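A usage sketch of my own, not from the original post: seed before generating so that repeated timing runs of competing routines all see identical data.
random.seed('RANDOM TEST DATA GENERATION')  # Any fixed seed gives reproducible data
big_data = data_gen(10_000)
len(big_data['a']), len(big_data['b'])      # Both 10_000
check_data_shape(big_data)                  # Properties from the bullet list still hold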
Summary.
The data_gen function generates larger data with which I was then able to make more meaningful timings. With more information from the question poser on the typical lengths of the 'b' sub-lists, and some measure of the reuse of strings between the sub-lists of 'b', B_SUBLIST_LEN_WT could be tuned to give even more realistic data for the timing, tuning, and comparison of functions acting on that data.