I've run simulations, tens of thousands of them at a time, over and over as we developed chips. On one project I noticed that I could predict the final result after only a small number of results were in, which let me halt the remaining simulations, or make advance preparations for the final result.
I looked it up at the time and, indeed, there is an equation: if you want to know the pass rate of a "large" population to within a given accuracy, it will tell you the minimum sample size to use.
To some, that's all gobbledygook, so I'll try to explain with some code.
Explanation
Let's say you have a large randomised population of pass/fails, or ones and zeroes:
from random import sample
population_size = 100_000 # must be large, > 65K?
sample_size = 123 # Actual
p_hat = 0.5 # Population pass rate, 0.5 == 50%
_ones = int(population_size*p_hat)
_zeroes = population_size - _ones
population = [1] * _ones + [0] * _zeroes
And say that we take a sample from it and compute the pass rate of that single, smaller sample:
def random_sample() -> list[int]:
    return sample(population, k=sample_size)

pass_rate = (sum(random_sample())  # how many ones
             / sample_size)        # convert to a pass rate
print(pass_rate)  # e.g. 0.5934959349593496
Every time we evaluate that pass_rate expression we get a different value: it is random. We need to run the pass_rate calculation many times to get an idea of the likely pass_rate when using the given sample size:
runs_per_sample = 10_000  # each sample size is run this many times

pass_rates = [sum(random_sample()) / sample_size
              for _ in range(runs_per_sample)]
I have learnt that the question to ask is not how often the sample pass rate exactly equals the population pass rate, but instead to define an acceptable margin of error (say ±5%) and ask how confident we can be that the sample pass rate falls within that margin:
epsilon = 0.05  # target margin of error in sample pass rate. 0.05 == +/-5%

p_hat_min, p_hat_max = p_hat * (1 - epsilon), p_hat * (1 + epsilon)
in_range_count = sum(p_hat_min <= pass_rate < p_hat_max
                     for pass_rate in pass_rates)
sample_confidence_level = in_range_count / runs_per_sample
print(f"{sample_confidence_level = }")  # = 0.4054
So for a sample size of 123 we could expect the pass rate of the sample to be within 5% of the actual pass rate of the population (0.5, or 50%) only 0.4, or 40%, of the time!
We need more!
What is actually done is to state what we think the population pass rate is, p_hat (choose closer to 50% if you are unsure); the margin of error around p_hat we want, epsilon, usually ±5% or ±3%; and the confidence_level we want of the sample pass rate falling within that margin of error.
There are calculators that will then give you n, the size of the sample needed to satisfy those conditions.
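For a rough cross-check, here is a minimal sketch of the standard normal-approximation formula that such calculators are usually based on (note that epsilon here is relative to p_hat, so the absolute margin of error is p_hat * epsilon):

from math import ceil

z = 1.96                  # z-score for a 95% confidence level
p_hat = 0.5               # assumed population pass rate
epsilon = 0.05            # relative margin, so the absolute margin E = p_hat * epsilon
population_size = 100_000

E = p_hat * epsilon                        # absolute margin of error, here 0.025
n0 = z**2 * p_hat * (1 - p_hat) / E**2     # basic sample size estimate
n = n0 / (1 + (n0 - 1) / population_size)  # finite-population correction
print(ceil(n0), ceil(n))                   # roughly 1537 and 1514 for these inputs

That closed form lands in the same region as the simulated search further down, which is reassuring.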
Doing it myself
I calculated for one specific sample size, above. Obviously, if I calculated pass_rates over a range of sample_sizes, with increasing runs_per_sample, I could search out the sample size needed.
That is done in my next program. I switch to using the numpy library for its speed, and sample_size becomes a range.
When the pass rate confidence levels are calculated I end up with a list of confidence levels for increasing sample sizes that, due to the randomness, is usually not monotonically increasing, e.g.
range_hits = [...0.94, 0.95, 0.93, 0.954, ... 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, ...] # confidence levels
The range of sample_size between the first occurrence of a confidence level >= the requested confidence level and the last occurrence of a confidence level < the requested confidence level is then slightly widened, and runs_per_sample increased, on the next iteration to get a better result.
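As a rough illustration of that bracketing step, here is a small sketch with made-up confidence values (the numbers are illustrative only, not output from the program below):

confidence_level = 0.95
sample_sizes = range(50, 5_000, 512)  # increasing sample sizes tried in one iteration
confidences = [0.62, 0.81, 0.90, 0.96, 0.93, 0.97, 0.98, 0.99, 0.99, 0.99]  # made up, noisy

# First sample size whose confidence level reaches the target...
n_min = next(n for n, c in zip(sample_sizes, confidences) if c >= confidence_level)
# ...and one step beyond the last sample size whose confidence level still falls short.
n_max = max(n for n, c in zip(sample_sizes, confidences) if c < confidence_level) + sample_sizes.step
print(n_min, n_max)  # 1586 2610; the next iteration searches a slightly widened range around these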
Here's a sample of the output I get when searching
Sample output
$ time python3 sample_search.py
2098 <= n <= 2610
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(50, 5000, 512), 250)
2013 <= n <= 2525
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(2013, 2694, 256), 500)
1501 <= n <= 2781
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1501, 3037, 256), 500)
1757 <= n <= 2013
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(221, 4061, 256), 500)
1714 <= n <= 1970
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1714, 2055, 128), 1000)
1458 <= n <= 2098
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1458, 2226, 128), 1000)
1586 <= n <= 1714
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(818, 2738, 128), 1000)
1564 <= n <= 1692
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1564, 1735, 64), 2000)
1500 <= n <= 1564
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1436, 1820, 64), 2000)
1553 <= n <= 1585
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1489, 1575, 32), 4000)
1547 <= n <= 1579
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1547, 1590, 16), 8000)
1547 <= n <= 1579
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1515, 1611, 16), 8000)
1541 <= n <= 1581
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1541, 1584, 8), 16000)
1501 <= n <= 1533
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1501, 1621, 8), 16000)
1501 <= n <= 1533
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1389, 1538, 8), 16000)
1503 <= n <= 1575
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1495, 1677, 8), 16000)
1503 <= n <= 1535
For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
Using population_size, sample_size, runs_per_sample =(100000, range(1491, 1587, 4), 32000)
Use a sample n = 1535 to predict population pass rate of 50.0% +/-5% with a confidence level of 95%.
real 3m49.023s
user 3m49.022s
sys 0m1.361s
My Code
# -*- coding: utf-8 -*-
"""
Created on Wed May 8 14:04:17 2024

@author: paddy
"""
# %%
from random import sample

import numpy as np


def sample_search(population_size, sample_size, p_hat, epsilon,
                  confidence_level, runs_per_sample) -> range:
    """
    Arguments with example values:
        population_size = 100_000            # must be large, > 65K?
        sample_size = range(1400, 1750, 16)  # (+min, +max, +step)
        p_hat = 0.5               # Population pass rate, 0.5 == 50%
        epsilon = 0.05            # target margin of error in sample pass rate. 0.05 == +/-5%
        confidence_level = 0.95   # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
        runs_per_sample = 10_000  # each sample size is run this many times

    Return:
        min,max range for the sample size, n, satisfying inputs.
    """

    def create_1_0_array(population_size=100_000, p_hat=0.5) -> np.ndarray:
        "Create numpy array of ones and zeroes with p_hat% as ones"
        ones = int(population_size * p_hat + 0.5)
        array10 = np.zeros(population_size, dtype=np.uint8)
        array10[:ones] = 1

        return array10

    def rates_of_samples(population: np.ndarray, sample_size_range: range,
                         runs_per_sample: int) -> list[list[float]]:
        "Pass rates for range of sample sizes repeated runs_per_sample times."
        # many_samples_many_rates = [(np.random.shuffle(population),          # shuffle every *run*
        #                             [population[:s_count].sum() / s_count   # The pass rate for samples
        #                              for s_count in sample_size_range]
        #                             )[1]                                    # drop the shuffle
        #                            for _ in range(runs_per_sample)]         # Every run
        many_samples_many_rates = [[np.random.shuffle(population),            # shuffle every *sample*
                                    [population[:s_count].sum() / s_count     # The pass rate for samples
                                     for s_count in sample_size_range]
                                    ][1]                                      # drop the shuffle
                                   for _ in range(runs_per_sample)]           # Every run

        return list(zip(*many_samples_many_rates))  # Transpose to by_sample_size_then_runs

    population = create_1_0_array(population_size, p_hat)
    by_sample_size_then_runs = rates_of_samples(population, sample_size, runs_per_sample)

    # Pass rates within target
    target_pass_range = tmin, tmax = p_hat * (1 - epsilon), p_hat * (1 + epsilon)  # Looking for rates within the range

    range_hits = [sum(tmin <= sample_pass_rate < tmax for sample_pass_rate in single_sample_size)
                  for single_sample_size in by_sample_size_then_runs]
    runs_for_confidence_level = confidence_level * runs_per_sample

    for n_min, conf in zip(sample_size, range_hits):
        if conf >= runs_for_confidence_level:
            break
    else:
        n_min = sample_size.start
    for n_max, conf in list(zip(sample_size, range_hits))[::-1]:
        if conf <= runs_for_confidence_level:
            n_max += sample_size.step  # back a step
            break
    else:
        n_max = sample_size.stop

    if (n_min + sample_size.step) >= n_max and sample_size.step > 1:
        # Widen
        n_max = n_max + sample_size.step + 1

    return range(n_min, n_max, sample_size.step)


def min_max_mid_step(from_range: range) -> tuple[int, int, float, int]:
    "Extract from **increasing** from_range the min, max, middle, step"
    mn, st = from_range.start, from_range.step
    # Handle range where start == stop
    mx = from_range.stop
    for mx in from_range:
        pass
    md = (mn + mx) / 2

    return mn, mx, md, st


def next_sample_size(new_samples, last_samples,
                     runs_per_sample,
                     widener=1.33  # Widen range by
                     ):
    n_min, n_max, n_mid, n_step = min_max_mid_step(new_samples)
    l_min, l_max, l_mid, l_step = min_max_mid_step(last_samples)

    # Next range of samples computed in names with prefix s_
    increase_runs = True
    if n_max == l_max:
        # Major expand of high end
        s_max = l_max + (l_max - l_min)
        increase_runs = False
    else:
        s_max = n_mid + (n_max - n_mid) * widener

    if n_min == l_min:
        # Major expand of low end
        s_min = max(1, l_min + (l_min - l_max))
        increase_runs = False
    else:
        s_min = n_mid + (n_min - n_mid) * widener

    s_min, s_max = max(1, int(s_min)), int(s_max + 0.5)
    s_step = n_step
    if s_min == s_max:
        if s_min > 2:
            s_min -= 1
        s_max += 1

    if increase_runs or n_max == n_min:
        runs_per_sample *= 2
        if n_max == n_min:
            s_step = 1
        else:
            s_step = max(1, (s_step + 1) // 2)  # Go finer

    next_sample_range = range(max(1, int(s_min)), int(s_max + 0.5), s_step)

    return next_sample_range, runs_per_sample


# %%
if __name__ == '__main__':
    population_size = 100_000            # must be large, > 65K?
    sample_size = range(50, 5_000, 512)  # Increasing!
    p_hat = 0.50              # Population pass rate, 0.5 == 50%
    epsilon = 0.05            # target margin of error in sample pass rate. 0.05 == +/-5%
    confidence_level = 0.95   # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
    runs_per_sample = 250     # each sample size is run this many times at the start, ~250
    max_runs_per_sample = 35_000

    while runs_per_sample < max_runs_per_sample:
        new_range = sample_search(population_size, sample_size, p_hat, epsilon,
                                  confidence_level, runs_per_sample)
        n_min, n_max, n_mid, n_step = min_max_mid_step(new_range)
        print(f"{n_min} <= n <= {n_max}")
        print(f" For {p_hat, epsilon, confidence_level =}\n"
              f" Using {population_size, sample_size, runs_per_sample =}\n")
        sample_size, runs_per_sample = next_sample_size(new_range, sample_size, runs_per_sample)

    print(f"Use a sample n = {n_max} to predict population pass rate of {p_hat*100.:.1f}% +/-{epsilon*100.:.0f}% "
          f"with a confidence level of {confidence_level*100.:.0f}%.")
END.