# Go deh!

Mainly Tech projects on Python and Electronic Design Automation.

## Friday, May 10, 2024

### Predicting results from small samples.

I've run simulations, tens of thousands of them at a time, over and over as we developed chips. In one project I noticed that I could predict the final result after only a small number of results were in which allowed me to halt the rest of the simulations, or make advance preparations for the final result.

I looked it up at the time and, indeed, there is an equation where if you want the pass rate of a "large" population to within a given accuracy, it will give you the minimum sample size to use.

To some, that's all gobledygook so I'll try to explain with some code.

## Explanation

Lets say you have a large randomised population of pass/fails or ones and zeroes:

from random import sample

population_size = 100_000    # must be large, > 65K?
sample_size = 123           # Actual
p_hat = 0.5                  # Population pass rate, 0.5 == 50%

_ones = int(population_size*p_hat)
_zeroes = population_size - _ones
population = [1] * _ones + [0] * _zeroes

And that we take a sample from it and compute the pass rate of that single, smaller sample

def random_sample() -> list[int]:
return sample(population, k=sample_size)

pass_rate = (sum(random_sample())  # how many ones
/ sample_size)      # convert to a pass rate

print(pass_rate)  # e.g. 0.59027552674230146

Every time we run that pass_rate expression we get a different value. It is random. We need to run that pass_rate calculation many times to get an idea of what the likely pass_rate would be when using the given sample size:

runs_per_sample = 10_000     # each sample size is run this many times ~2500

pass_rates = [sum(random_sample()) / sample_size
for _ in range(runs_per_sample)]

I have learnt that the question to ask is not how many times the sample pass rate was the population pass rate, but to define an acceptable margin of error, (say 5%) and say we want to be confident that pass rates be within that margin a certain percentage of the time

epsilon = 0.05               # target margin of error in sample pass rate. 0.05 == +/-5%

p_hat_min, p_hat_max = p_hat * (1 - epsilon), p_hat * (1 + epsilon)
in_range_count = sum(p_hat_min <= pass_rate < p_hat_max
for pass_rate in pass_rates)
sample_confidence_level = in_range_count / runs_per_sample
print(f"{sample_confidence_level = }")  # = 0.4054

So for a sample size of 123 we could expect the pass rate of the sample to be within 5% of the actual pass rate of the population, 0.5 or 50%, onlr o.4 or 40% of the time!

## We need more!

What is actually done is we state what we think the population pass rate is, p_hat, (choose closer to 50% if you are unsure); the margin of error around p_hat we want, epsilon, usually +/-5% or +/-3%; and the confidence_level in hitting within the pass_rates margin of error.

There are calculators that will then give you n, the size of the sample needed to satisfy those condition.

## Doing it myself

I calculated for one specific sample size, above. Obviousely, if I calculated pass_rates over a range of sample_sizes, and increqasing runs_per_sample, I could search out the sample size needed.

That is done in my next program. I have to switch to using the numpy library for its speed and sample_size becomes a range.

When the pass rate confidence levels are calculated I end up with a list of confidence levels for increasing sample sizes that are usually not increasing due to the randomness, e.g.

range_hits = [...0.94, 0.95, 0.93, 0.954, ... 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, ...]  # confidence levels

The range of sample_size corresponding to the first occurrence>= the requested confidence level, and the last occorrence of a confidence level < the requested confidence level in then slightly widened and the runs_per_sample increased on another iteration to get a better result.

Here's a sample of the output I get when searching

### Sample output

\$ time python3 sample_search.py
2098 <= n <= 2610
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(50, 5000, 512), 250)

2013 <= n <= 2525
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(2013, 2694, 256), 500)

1501 <= n <= 2781
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1501, 3037, 256), 500)

1757 <= n <= 2013
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(221, 4061, 256), 500)

1714 <= n <= 1970
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1714, 2055, 128), 1000)

1458 <= n <= 2098
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1458, 2226, 128), 1000)

1586 <= n <= 1714
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(818, 2738, 128), 1000)

1564 <= n <= 1692
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1564, 1735, 64), 2000)

1500 <= n <= 1564
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1436, 1820, 64), 2000)

1553 <= n <= 1585
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1489, 1575, 32), 4000)

1547 <= n <= 1579
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1547, 1590, 16), 8000)

1547 <= n <= 1579
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1515, 1611, 16), 8000)

1541 <= n <= 1581
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1541, 1584, 8), 16000)

1501 <= n <= 1533
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1501, 1621, 8), 16000)

1501 <= n <= 1533
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1389, 1538, 8), 16000)

1503 <= n <= 1575
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1495, 1677, 8), 16000)

1503 <= n <= 1535
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1491, 1587, 4), 32000)

Use a sample n = 1535 to predict population pass rate of 50.0% +/-5% with a confidence level of 95%.

real    3m49.023s
user    3m49.022s
sys     0m1.361s

### My Code

# -*- coding: utf-8 -*-
"""
Created on Wed May  8 14:04:17 2024

"""
# %%
from random import sample
import numpy as np

def sample_search(population_size, sample_size, p_hat, epsilon, confidence_level, runs_per_sample) -> range:
"""
Arguments with example values:
population_size = 100_000           # must be large, > 65K?
sample_size = range(1400, 1750, 16) # (+min, +max, +step)
p_hat = 0.5                         # Population pass rate, 0.5 == 50%
epsilon = 0.05                      # target margin of error in sample pass rate. 0.05 == +/-5%
confidence_level = 0.95             # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
runs_per_sample = 10_000            # each sample size is run this many times ~2500

Return:
min,max range for the sample size, n, satisfying inputs.

"""

def create_1_0_array(population_size=100_000, p_hat=0.5) -> np.ndarray:
"Create numpy array of ones and zeroes with p_hat% as ones"
ones = int(population_size*p_hat + 0.5)
array10 = np.zeros(population_size, dtype=np.uint8)
array10[:ones] = 1

return array10

def rates_of_samples(population: np.ndarray, sample_size_range: range, runs_per_sample: int
) -> list[list[float]]:
"Pass rates for range of sample sizes repeated runs_per_sample times."
# many_samples_many_rates = [( np.random.shuffle(population),         # shuffle every *run*
#                             [population[:s_count].sum() / s_count  # The pass rate for samples
#                             for s_count in sample_size_range]
#                         )[1]                                     # drop the shuffle
#                         for _ in range(runs_per_sample)]         # Every run

many_samples_many_rates = [[ np.random.shuffle(population),         # shuffle every *sample*
[population[:s_count].sum() / s_count  # The pass rate for samples
for s_count in sample_size_range]
][1]                                     # drop the shuffle
for _ in range(runs_per_sample)]         # Every run

return list(zip(*many_samples_many_rates))      # Transpose to by_sample_size_then_runs

population = create_1_0_array(population_size, p_hat)
by_sample_size_then_runs = rates_of_samples(population, sample_size, runs_per_sample)

# Pass rates within target

target_pass_range = tmin, tmax = p_hat * (1 - epsilon), p_hat * (1 + epsilon)  # Looking for rates within the range

range_hits = [sum(tmin <= sample_pass_rate < tmax for sample_pass_rate in single_sample_size)
for single_sample_size in by_sample_size_then_runs]
runs_for_confidence_level = confidence_level * runs_per_sample

for n_min, conf in zip(sample_size, range_hits):
if conf >= runs_for_confidence_level:
break
else:
n_min = sample_size.start

for n_max, conf in list(zip(sample_size, range_hits))[::-1]:
if conf <= runs_for_confidence_level:
n_max += sample_size.step  # back a step
break
else:
n_max = sample_size.stop

if (n_min + sample_size.step) >= n_max and sample_size.step > 1:
# Widen
n_max = n_max + sample_size.step + 1

return range(n_min, n_max, sample_size.step)

def min_max_mid_step(from_range: range) -> tuple[int, int, float, int]:
"Extract from **increasing** from_range the min, max, middle, step"
mn, st = from_range.start, from_range.step

# Handle range where start == stop
mx = from_range.stop
for mx in from_range:
pass

md = (mn + mx) / 2

return mn, mx, md, st

def next_sample_size(new_samples, last_samples,
runs_per_sample,
widener=1.33  # Widen range by
):
n_min, n_max, n_mid, n_step = min_max_mid_step(new_samples)
l_min, l_max, l_mid, l_step = min_max_mid_step(last_samples)

# Next range of samples computes in names with prefix s_

increase_runs = True
if n_max == l_max:
# Major expand of high end
s_max = l_max + (l_max - l_min)
increase_runs = False
else:
s_max = (n_mid +( n_max - n_mid)* widener)

if n_min == l_min:
# Major expand of low end
s_min = max(1, l_min + (l_min - l_max))
increase_runs = False
else:
s_min = (n_mid +( n_min - n_mid)* widener)

s_min, s_max = (max(1, int(s_min)), int(s_max + 0.5))
s_step = n_step
if s_min == s_max:
if s_min > 2:
s_min -= 1
s_max += 1

if increase_runs or n_max == n_min:
runs_per_sample *= 2
if n_max == n_min:
s_step = 1
else:
s_step = max(1, (s_step + 1) // 2)  # Go finer

next_sample_range = range(max(1, int(s_min)), int(s_max + 0.5), s_step)

return next_sample_range, runs_per_sample

# %%

if __name__ == '__main__':

population_size = 100_000           # must be large, > 65K?
sample_size = range(50, 5_000, 512) # Increasing!
p_hat = 0.50                        # Population pass rate, 0.5 == 50%
epsilon = 0.05                      # target margin of error in sample pass rate. 0.05 == +/-5%
confidence_level = 0.95             # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
runs_per_sample = 250               # each sample size is run this many time at start, ~250
max_runs_per_sample = 35_000

while runs_per_sample < max_runs_per_sample:
new_range = sample_search(population_size, sample_size, p_hat, epsilon, confidence_level, runs_per_sample)
n_min, n_max, n_mid, n_step = min_max_mid_step(new_range)
print(f"{n_min} <= n <= {n_max}")
print(f"  For {p_hat, epsilon, confidence_level =}\n"
f"  Using {population_size, sample_size, runs_per_sample =}\n")

sample_size, runs_per_sample = next_sample_size(new_range, sample_size, runs_per_sample)

print(f" Use a sample n = {n_max} to predict population pass rate of {p_hat*100.:.1f}% +/-{epsilon*100.:.0f}% "
f"with a confidence level of {confidence_level*100.:.0f}%.")

END.