Friday, May 10, 2024

Predicting results from small samples.

  [Image: A black Englishman doing statistics on a blackboard.]

I've run simulations, tens of thousands of them at a time, over and over as we developed chips. In one project I noticed that I could predict the final result after only a small number of results were in, which allowed me to halt the rest of the simulations, or make advance preparations for the final result.

I looked it up at the time and, indeed, there is an equation which, if you want to estimate the pass rate of a "large" population to within a given accuracy, gives you the minimum sample size to use.
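For the record, that equation is the standard normal-approximation sample-size formula for estimating a proportion. Stating it from memory (the code below is the real explanation):

    n = z² × p̂(1 − p̂) / E²

where p̂ is the assumed population pass rate, E is the absolute margin of error, and z is the z-score for the chosen confidence level (z ≈ 1.96 for 95% confidence).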

To some, that's all gobbledygook, so I'll try to explain with some code.

Explanation

Let's say you have a large randomised population of pass/fails, or ones and zeroes:

from random import sample

population_size = 100_000    # must be large, > 65K?
sample_size = 123            # the single sample size under test
p_hat = 0.5                  # Population pass rate, 0.5 == 50%

_ones = int(population_size*p_hat)
_zeroes = population_size - _ones
population = [1] * _ones + [0] * _zeroes
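
A quick sanity check confirms the construction (the equality is exact here; for a p_hat that doesn't divide the population evenly, the int() truncation would shift it slightly):

assert sum(population) / population_size == p_hat  # exactly 50% ones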

And say we take a sample from it and compute the pass rate of that single, smaller sample:

def random_sample() -> list[int]:
    return sample(population, k=sample_size)

pass_rate = (sum(random_sample())  # how many ones
             / sample_size)      # convert to a pass rate

print(pass_rate)  # e.g. 0.5934959349593496

Every time we run that pass_rate expression we get a different value. It is random. We need to run that pass_rate calculation many times to get an idea of what the likely pass_rate would be when using the given sample size:

runs_per_sample = 10_000     # each sample size is run this many times

pass_rates = [sum(random_sample()) / sample_size
              for _ in range(runs_per_sample)]

I have learnt that the question to ask is not how often the sample pass rate exactly equals the population pass rate, but rather to define an acceptable margin of error (say ±5%) and ask how confident we can be that sample pass rates fall within that margin a certain percentage of the time.

epsilon = 0.05               # target margin of error in sample pass rate. 0.05 == +/-5%

p_hat_min, p_hat_max = p_hat * (1 - epsilon), p_hat * (1 + epsilon)
in_range_count = sum(p_hat_min <= pass_rate < p_hat_max
                     for pass_rate in pass_rates)
sample_confidence_level = in_range_count / runs_per_sample
print(f"{sample_confidence_level = }")  # = 0.4054

So for a sample size of 123 we could expect the pass rate of the sample to be within ±5% of the actual pass rate of the population, 0.5 or 50%, only 0.4 or 40% of the time!

We need more!

What is actually done is we state what we think the population pass rate is, p_hat (choose closer to 50% if you are unsure); the margin of error around p_hat we want, epsilon, usually ±5% or ±3%; and the confidence_level for landing within that margin of error.

There are calculators that will then give you n, the size of the sample needed to satisfy those conditions.
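
Those calculators typically implement something like the formula above. Here is a minimal sketch of my own (not part of the search program below); note that because epsilon here is relative to p_hat, the absolute margin of error is p_hat * epsilon:

from statistics import NormalDist

def required_sample_size(p_hat: float, epsilon: float,
                         confidence_level: float) -> float:
    "Normal-approximation sample size for estimating a proportion."
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)  # two-sided z-score
    margin = p_hat * epsilon                              # absolute margin of error
    return z**2 * p_hat * (1 - p_hat) / margin**2

print(required_sample_size(0.5, 0.05, 0.95))  # ~1537, close to my search result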

Doing it myself

I calculated for one specific sample size, above. Obviously, if I calculated pass_rates over a range of sample_sizes, with increasing runs_per_sample, I could search out the sample size needed.

That is done in my next program. I switch to using the numpy library for its speed, and sample_size becomes a range.

When the pass rate confidence levels are calculated I end up with a list of confidence levels for increasing sample sizes that, because of the randomness, do not always increase monotonically, e.g.

range_hits = [...0.94, 0.95, 0.93, 0.954, ... 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, ...]  # confidence levels

The range of sample_size, from the first occurrence of a confidence level >= the requested confidence level to the last occurrence of a confidence level < the requested confidence level, is then slightly widened, and the runs_per_sample increased, on another iteration to get a better result.
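
In outline, that bracketing step looks something like this simplified sketch (the real code below also widens the range and adjusts the step size):

def bracket(sample_sizes: range, confidences: list[float],
            target: float) -> range:
    "Bracket the sample sizes whose confidence levels straddle the target."
    pairs = list(zip(sample_sizes, confidences))
    # First sample size whose confidence reaches the target
    n_min = next((n for n, c in pairs if c >= target), sample_sizes.start)
    # One step past the last sample size still below the target
    n_max = next((n + sample_sizes.step
                  for n, c in reversed(pairs) if c < target),
                 sample_sizes.stop)
    return range(n_min, n_max, sample_sizes.step)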

Here's a sample of the output I get when searching:

Sample output

$ time python3 sample_search.py
2098 <= n <= 2610
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(50, 5000, 512), 250)

2013 <= n <= 2525
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(2013, 2694, 256), 500)

1501 <= n <= 2781
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1501, 3037, 256), 500)

1757 <= n <= 2013
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(221, 4061, 256), 500)

1714 <= n <= 1970
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1714, 2055, 128), 1000)

1458 <= n <= 2098
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1458, 2226, 128), 1000)

1586 <= n <= 1714
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(818, 2738, 128), 1000)

1564 <= n <= 1692
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1564, 1735, 64), 2000)

1500 <= n <= 1564
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1436, 1820, 64), 2000)

1553 <= n <= 1585
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1489, 1575, 32), 4000)

1547 <= n <= 1579
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1547, 1590, 16), 8000)

1547 <= n <= 1579
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1515, 1611, 16), 8000)

1541 <= n <= 1581
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1541, 1584, 8), 16000)

1501 <= n <= 1533
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1501, 1621, 8), 16000)

1501 <= n <= 1533
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1389, 1538, 8), 16000)

1503 <= n <= 1575
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1495, 1677, 8), 16000)

1503 <= n <= 1535
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1491, 1587, 4), 32000)

 Use a sample n = 1535 to predict population pass rate of 50.0% +/-5% with a confidence level of 95%.

real    3m49.023s
user    3m49.022s
sys     0m1.361s


My Code

# -*- coding: utf-8 -*-
"""
Created on Wed May  8 14:04:17 2024

@author: paddy
"""
# %%
from random import sample
import numpy as np


def sample_search(population_size, sample_size, p_hat, epsilon, confidence_level, runs_per_sample) -> range:
    """
    Arguments with example values:
        population_size = 100_000           # must be large, > 65K?
        sample_size = range(1400, 1750, 16) # (+min, +max, +step)
        p_hat = 0.5                         # Population pass rate, 0.5 == 50%
        epsilon = 0.05                      # target margin of error in sample pass rate. 0.05 == +/-5%
        confidence_level = 0.95             # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
        runs_per_sample = 10_000            # each sample size is run this many times
   
    Return:
        min,max range for the sample size, n, satisfying inputs.
   
    """
   
    def create_1_0_array(population_size=100_000, p_hat=0.5) -> np.ndarray:
        "Create numpy array of ones and zeroes with p_hat% as ones"
        ones = int(population_size*p_hat + 0.5)
        array10 = np.zeros(population_size, dtype=np.uint8)
        array10[:ones] = 1
       
        return array10


    def rates_of_samples(population: np.ndarray, sample_size_range: range, runs_per_sample: int
                        ) -> list[list[float]]:
        "Pass rates for range of sample sizes repeated runs_per_sample times."
        many_samples_many_rates = [[np.random.shuffle(population),          # shuffle once per *run*
                                    [population[:s_count].sum() / s_count   # pass rate at each sample size
                                     for s_count in sample_size_range]
                                   ][1]                                     # drop the shuffle result
                                   for _ in range(runs_per_sample)]         # every run

        return list(zip(*many_samples_many_rates))      # Transpose to by_sample_size_then_runs


    population = create_1_0_array(population_size, p_hat)
    by_sample_size_then_runs = rates_of_samples(population, sample_size, runs_per_sample)

    # Pass rates within target

    target_pass_range = tmin, tmax = p_hat * (1 - epsilon), p_hat * (1 + epsilon)  # Looking for rates within the range

    range_hits = [sum(tmin <= sample_pass_rate < tmax for sample_pass_rate in single_sample_size)
                  for single_sample_size in by_sample_size_then_runs]
    runs_for_confidence_level = confidence_level * runs_per_sample

    for n_min, conf in zip(sample_size, range_hits):
        if conf >= runs_for_confidence_level:
            break
    else:
        n_min = sample_size.start

    for n_max, conf in list(zip(sample_size, range_hits))[::-1]:
        if conf <= runs_for_confidence_level:
            n_max += sample_size.step  # back a step
            break
    else:
        n_max = sample_size.stop
   
    if (n_min + sample_size.step) >= n_max and sample_size.step > 1:
        # Widen
        n_max = n_max + sample_size.step + 1

    return range(n_min, n_max, sample_size.step)


def min_max_mid_step(from_range: range) -> tuple[int, int, float, int]:
    "Extract from **increasing** from_range the min, max, middle, step"
    mn, st = from_range.start, from_range.step
   
    # Handle range where start == stop
    mx = from_range.stop
    for mx in from_range:
        pass
   
    md = (mn + mx) / 2
   
    return mn, mx, md, st
   
def next_sample_size(new_samples, last_samples,
                     runs_per_sample,
                     widener=1.33  # Widen range by
                    ):
    n_min, n_max, n_mid, n_step = min_max_mid_step(new_samples)
    l_min, l_max, l_mid, l_step = min_max_mid_step(last_samples)
   
    # Next range of samples computes in names with prefix s_
   
    increase_runs = True
    if n_max == l_max:
        # Major expand of high end
        s_max = l_max + (l_max - l_min)
        increase_runs = False
    else:
        s_max = n_mid + (n_max - n_mid) * widener

    if n_min == l_min:
        # Major expand of low end
        s_min = max(1, l_min + (l_min - l_max))
        increase_runs = False
    else:
        s_min = n_mid + (n_min - n_mid) * widener
   
    s_min, s_max = (max(1, int(s_min)), int(s_max + 0.5))
    s_step = n_step
    if s_min == s_max:
        if s_min > 2:
            s_min -= 1
        s_max += 1
       
    if increase_runs or n_max == n_min:
        runs_per_sample *= 2
        if n_max == n_min:
            s_step = 1
        else:
            s_step = max(1, (s_step + 1) // 2)  # Go finer
   
    next_sample_range = range(max(1, int(s_min)), int(s_max + 0.5), s_step)
   
    return next_sample_range, runs_per_sample

# %%

if __name__ == '__main__':

    population_size = 100_000           # must be large, > 65K?
    sample_size = range(50, 5_000, 512) # Increasing!
    p_hat = 0.50                        # Population pass rate, 0.5 == 50%
    epsilon = 0.05                      # target margin of error in sample pass rate. 0.05 == +/-5%
    confidence_level = 0.95             # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
    runs_per_sample = 250               # each sample size is run this many times at the start
    max_runs_per_sample = 35_000

    while runs_per_sample < max_runs_per_sample:
        new_range = sample_search(population_size, sample_size, p_hat, epsilon, confidence_level, runs_per_sample)
        n_min, n_max, n_mid, n_step = min_max_mid_step(new_range)
        print(f"{n_min} <= n <= {n_max}")
        print(f"  For {p_hat, epsilon, confidence_level =}\n"
            f"  Using {population_size, sample_size, runs_per_sample =}\n")
           
        sample_size, runs_per_sample = next_sample_size(new_range, sample_size, runs_per_sample)

print(f" Use a sample n = {n_max} to predict population pass rate of {p_hat*100.:.1f}% +/-{epsilon*100.:.0f}% "
      f"with a confidence level of {confidence_level*100.:.0f}%.")

END.


