Mainly Tech projects on Python and Electronic Design Automation.

Friday, May 10, 2024

Predicting results from small samples.

  A black Englishman doing statistics on a blackboard.

I've run simulations, tens of thousands of them at a time, over and over as we developed chips. In one project I noticed that I could predict the final result after only a small number of results were in, which allowed me to halt the rest of the simulations, or make advance preparations for the final result.

I looked it up at the time and, indeed, there is an equation: if you want the pass rate of a "large" population to within a given accuracy, it will give you the minimum sample size to use.

To some, that's all gobbledygook, so I'll try to explain with some code.

Explanation

Let's say you have a large randomised population of pass/fails, or ones and zeroes:

from random import sample

population_size = 100_000    # must be large, > 65K?
sample_size = 123           # Actual
p_hat = 0.5                  # Population pass rate, 0.5 == 50%

_ones = int(population_size*p_hat)
_zeroes = population_size - _ones
population = [1] * _ones + [0] * _zeroes

And then we take a sample from it and compute the pass rate of that single, smaller sample:

def random_sample() -> list[int]:
    return sample(population, k=sample_size)

pass_rate = (sum(random_sample())  # how many ones
             / sample_size)      # convert to a pass rate

print(pass_rate)  # e.g. 0.5934959349593496

Every time we run that pass_rate expression we get a different value; it is random. We need to run the pass_rate calculation many times to get an idea of what the likely pass rate would be when using the given sample size:

runs_per_sample = 10_000     # each sample size is run this many times

pass_rates = [sum(random_sample()) / sample_size
              for _ in range(runs_per_sample)]

I have learnt that the question to ask is not how often the sample pass rate equals the population pass rate, but instead to define an acceptable margin of error, say 5%, and ask how confident we can be that sample pass rates fall within that margin:

epsilon = 0.05               # target margin of error in sample pass rate. 0.05 == +/-5%

p_hat_min, p_hat_max = p_hat * (1 - epsilon), p_hat * (1 + epsilon)
in_range_count = sum(p_hat_min <= pass_rate < p_hat_max
                     for pass_rate in pass_rates)
sample_confidence_level = in_range_count / runs_per_sample
print(f"{sample_confidence_level = }")  # = 0.4054

So for a sample size of 123 we could expect the pass rate of the sample to be within 5% of the actual pass rate of the population, 0.5 or 50%, only 0.4, or 40%, of the time!
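That simulated figure can be sanity-checked against the normal approximation to the binomial; a minimal sketch, reusing p_hat, sample_size, p_hat_min and p_hat_max from above:

from math import sqrt
from statistics import NormalDist

sigma = sqrt(p_hat * (1 - p_hat) / sample_size)  # std. dev. of the sample pass rate
dist = NormalDist(p_hat, sigma)
print(dist.cdf(p_hat_max) - dist.cdf(p_hat_min))  # ~0.42, close to the simulated 0.4054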

We need more!

What is actually done is we state what we think the population pass rate is, p_hat (choose closer to 50% if you are unsure); the margin of error around p_hat we want, epsilon, usually +/-5% or +/-3%; and the confidence_level in hitting within that margin of error.

There are calculators that will then give you n, the size of the sample needed to satisfy those conditions.
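Those calculators generally use the normal-approximation formula n = z**2 * p_hat * (1 - p_hat) / E**2, where E is the absolute margin of error (here epsilon * p_hat) and z is the critical value for the chosen confidence level; a minimal sketch, not tied to any particular calculator:

from math import ceil
from statistics import NormalDist

p_hat, epsilon, confidence_level = 0.5, 0.05, 0.95
z = NormalDist().inv_cdf((1 + confidence_level) / 2)  # ~1.96 for 95%
E = epsilon * p_hat                                   # absolute margin: +/-0.025
n = ceil(z**2 * p_hat * (1 - p_hat) / E**2)
print(n)  # 1537, in line with the ~1535 my search below finds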

Doing it myself

 I calculated for one specific sample size, above. Obviously, if I calculated pass_rates over a range of sample_sizes, while increasing runs_per_sample, I could search out the sample size needed.

That is done in my next program. I switch to using the numpy library for its speed, and sample_size becomes a range.

When the pass rate confidence levels are calculated I end up with a list of confidence levels for increasing sample sizes that is usually not monotonically increasing, due to the randomness, e.g.

range_hits = [...0.94, 0.95, 0.93, 0.954, ... 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, ...]  # confidence levels

The range of sample_size between the first occurrence of a confidence level >= the requested confidence level, and the last occurrence of a confidence level < the requested confidence level, is then slightly widened, and runs_per_sample increased, on another iteration to get a better result.
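In outline, each iteration brackets the answer like this; a simplified sketch of the logic inside sample_search below, with bracket as a hypothetical helper name:

def bracket(sample_sizes: range, range_hits: list[int],
            runs_per_sample: int, confidence_level: float) -> tuple[int, int]:
    "Bracket the sample sizes whose hit counts straddle the target confidence."
    target = confidence_level * runs_per_sample
    pairs = list(zip(sample_sizes, range_hits))
    n_min = next((n for n, hits in pairs if hits >= target),
                 sample_sizes.start)                       # first at-or-above target
    n_max = next((n + sample_sizes.step for n, hits in reversed(pairs)
                  if hits <= target),
                 sample_sizes.stop)                        # one step past last below target
    return n_min, n_max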

Here's a sample of the output I get when searching:

Sample output

$ time python3 sample_search.py
2098 <= n <= 2610
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(50, 5000, 512), 250)

2013 <= n <= 2525
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(2013, 2694, 256), 500)

1501 <= n <= 2781
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1501, 3037, 256), 500)

1757 <= n <= 2013
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(221, 4061, 256), 500)

1714 <= n <= 1970
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1714, 2055, 128), 1000)

1458 <= n <= 2098
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1458, 2226, 128), 1000)

1586 <= n <= 1714
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(818, 2738, 128), 1000)

1564 <= n <= 1692
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1564, 1735, 64), 2000)

1500 <= n <= 1564
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1436, 1820, 64), 2000)

1553 <= n <= 1585
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1489, 1575, 32), 4000)

1547 <= n <= 1579
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1547, 1590, 16), 8000)

1547 <= n <= 1579
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1515, 1611, 16), 8000)

1541 <= n <= 1581
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1541, 1584, 8), 16000)

1501 <= n <= 1533
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1501, 1621, 8), 16000)

1501 <= n <= 1533
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1389, 1538, 8), 16000)

1503 <= n <= 1575
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1495, 1677, 8), 16000)

1503 <= n <= 1535
  For p_hat, epsilon, confidence_level =(0.5, 0.05, 0.95)
  Using population_size, sample_size, runs_per_sample =(100000, range(1491, 1587, 4), 32000)

 Use a sample n = 1535 to predict population pass rate of 50.0% +/-5% with a confidence level of 95%.

real    3m49.023s
user    3m49.022s
sys     0m1.361s


My Code

# -*- coding: utf-8 -*-
"""
Created on Wed May  8 14:04:17 2024

@author: paddy
"""
# %%
from random import sample
import numpy as np


def sample_search(population_size, sample_size, p_hat, epsilon, confidence_level, runs_per_sample) -> range:
    """
    Arguments with example values:
        population_size = 100_000           # must be large, > 65K?
        sample_size = range(1400, 1750, 16) # (+min, +max, +step)
        p_hat = 0.5                         # Population pass rate, 0.5 == 50%
        epsilon = 0.05                      # target margin of error in sample pass rate. 0.05 == +/-5%
        confidence_level = 0.95             # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
        runs_per_sample = 10_000            # each sample size is run this many times ~2500
   
    Return:
        min,max range for the sample size, n, satisfying inputs.
   
    """
   
    def create_1_0_array(population_size=100_000, p_hat=0.5) -> np.ndarray:
        "Create numpy array of ones and zeroes with p_hat% as ones"
        ones = int(population_size*p_hat + 0.5)
        array10 = np.zeros(population_size, dtype=np.uint8)
        array10[:ones] = 1
       
        return array10


    def rates_of_samples(population: np.ndarray, sample_size_range: range, runs_per_sample: int
                        ) -> list[list[float]]:
        "Pass rates for range of sample sizes repeated runs_per_sample times."
        # Shuffle once per run; all sample sizes in that run share the shuffle.
        many_samples_many_rates = [[np.random.shuffle(population),          # shuffle (returns None)
                                    [population[:s_count].sum() / s_count   # pass rate of first s_count
                                     for s_count in sample_size_range]
                                    ][1]                                    # drop the shuffle's None
                                   for _ in range(runs_per_sample)]         # every run

        return list(zip(*many_samples_many_rates))      # Transpose to by_sample_size_then_runs


    population = create_1_0_array(population_size, p_hat)
    by_sample_size_then_runs = rates_of_samples(population, sample_size, runs_per_sample)

    # Pass rates within target

    target_pass_range = tmin, tmax = p_hat * (1 - epsilon), p_hat * (1 + epsilon)  # Looking for rates within the range

    range_hits = [sum(tmin <= sample_pass_rate < tmax for sample_pass_rate in single_sample_size)
                  for single_sample_size in by_sample_size_then_runs]
    runs_for_confidence_level = confidence_level * runs_per_sample

    for n_min, conf in zip(sample_size, range_hits):
        if conf >= runs_for_confidence_level:
            break
    else:
        n_min = sample_size.start

    for n_max, conf in list(zip(sample_size, range_hits))[::-1]:
        if conf <= runs_for_confidence_level:
            n_max += sample_size.step  # back a step
            break
    else:
        n_max = sample_size.stop
   
    if (n_min + sample_size.step) >= n_max and sample_size.step > 1:
        # Widen
        n_max = n_max + sample_size.step + 1

    return range(n_min, n_max, sample_size.step)


def min_max_mid_step(from_range: range) -> tuple[int, int, float, int]:
    "Extract from **increasing** from_range the min, max, middle, step"
    mn, st = from_range.start, from_range.step
   
    # Handle range where start == stop
    mx = from_range.stop
    for mx in from_range:
        pass
   
    md = (mn + mx) / 2
   
    return mn, mx, md, st
   
def next_sample_size(new_samples, last_samples,
                     runs_per_sample,
                     widener=1.33  # Widen range by
                    ):
    n_min, n_max, n_mid, n_step = min_max_mid_step(new_samples)
    l_min, l_max, l_mid, l_step = min_max_mid_step(last_samples)
   
    # Next range of samples computes in names with prefix s_
   
    increase_runs = True
    if n_max == l_max:
        # Major expand of high end
        s_max = l_max + (l_max - l_min)
        increase_runs = False
    else:
        s_max = (n_mid +( n_max - n_mid)* widener)

    if n_min == l_min:
        # Major expand of low end
        s_min = max(1, l_min + (l_min - l_max))
        increase_runs = False
    else:
        s_min = (n_mid +( n_min - n_mid)* widener)
   
    s_min, s_max = (max(1, int(s_min)), int(s_max + 0.5))
    s_step = n_step
    if s_min == s_max:
        if s_min > 2:
            s_min -= 1
        s_max += 1
       
    if increase_runs or n_max == n_min:
        runs_per_sample *= 2
        if n_max == n_min:
            s_step = 1
        else:
            s_step = max(1, (s_step + 1) // 2)  # Go finer
   
    next_sample_range = range(max(1, int(s_min)), int(s_max + 0.5), s_step)
   
    return next_sample_range, runs_per_sample

# %%

if __name__ == '__main__':

    population_size = 100_000           # must be large, > 65K?
    sample_size = range(50, 5_000, 512) # Increasing!
    p_hat = 0.50                        # Population pass rate, 0.5 == 50%
    epsilon = 0.05                      # target margin of error in sample pass rate. 0.05 == +/-5%
    confidence_level = 0.95             # sample to be within p_hat +/- epsilon, 0.95 == 95% of the time.
    runs_per_sample = 250               # each sample size is run this many time at start, ~250
    max_runs_per_sample = 35_000

    while runs_per_sample < max_runs_per_sample:
        new_range = sample_search(population_size, sample_size, p_hat, epsilon, confidence_level, runs_per_sample)
        n_min, n_max, n_mid, n_step = min_max_mid_step(new_range)
        print(f"{n_min} <= n <= {n_max}")
        print(f"  For {p_hat, epsilon, confidence_level =}\n"
            f"  Using {population_size, sample_size, runs_per_sample =}\n")
           
        sample_size, runs_per_sample = next_sample_size(new_range, sample_size, runs_per_sample)

print(f" Use a sample n = {n_max} to predict population pass rate of {p_hat*100.:.1f}% +/-{epsilon*100.:.0f}% "
      f"with a confidence level of {confidence_level*100.:.0f}%.")

END.


 

Sunday, April 21, 2024

Searching OEIS tables

  a Jamaican showing how to search a table by row in the Jamaican hills.

 A few months ago I submitted a sequence to OEIS* that was accepted; yes, but OEIS does not seem to leave my sequence searchable!

*OEIS is the On-Line Encyclopedia of Integer Sequences. I guess table is not in the name, but...

(best viewed on larger than a portrait phone)

Let me explain.

The documentation for OEIS explains that if you have a 2D triangle or table of values, rather than a one-dimensional strict sequence, then one should antidiagonalise the data and submit the sequence produced.

They give as an example A003987. This gives this table:

Table begins
   0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, ...
   1,  0,  3,  2,  5,  4,  7,  6,  9,  8, 11, 10, ...
   2,  3,  0,  1,  6,  7,  4,  5, 10, 11,  8, ...
   3,  2,  1,  0,  7,  6,  5,  4, 11, 10, ...
   4,  5,  6,  7,  0,  1,  2,  3, 12, ...
   5,  4,  7,  6,  1,  0,  3,  2, ...
   6,  7,  4,  5,  2,  3,  0, ...
   7,  6,  5,  4,  3,  2, ...
   8,  9, 10, 11, 12, ...
   9,  8, 11, 10, ...
  10, 11,  8, ...
  11, 10, ...
  12, ...
  ...

The above gets antidiagonalised to the sequence beginning:

  0, 1, 1, 2, 0, 2, 3, 3, 3, 3, 4, 2, 0, 2, 4, 5, 5, 1, 1, 5, 5, 6, 4,
  6, 0, 6, 4, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 6, 4, 6, 0, 6, 4, 6, 8, 9,
  9, 5, 5, 1, 1, 5, 5, 9, 9, 10, 8, 10, 4, 2, 0, 2, 4, 10, 8, 10, 11,
  11, 11, 11, 3, 3, 3, 3, 11, 11, 11, 11, 12, 10, 8, 10, 12, 2, 0, 2,
  12, 10, 8, 10, 12, 13, 13, 9, 9, 13, 13, 1, 1, 13, 13, 9, 9, 13, 13
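
A003987 is the table of n XOR k, so the start of that sequence is easy to reproduce; a minimal sketch:

xor_by_antidiagonals = [row ^ (d - row)      # T(row, col) = row XOR col, col = d - row
                        for d in range(14)   # anti-diagonal number
                        for row in range(d + 1)]
print(xor_by_antidiagonals[:23])
# [0, 1, 1, 2, 0, 2, 3, 3, 3, 3, 4, 2, 0, 2, 4, 5, 5, 1, 1, 5, 5, 6, 4]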

Searching...

If we search for a sub-sequence of the antidiagonalised table, we can find the correct entry.

If, however, we search for a row of values from the table, A003987 is not found!
The values chosen to search for, 5,4,7,6,1,0,3, appear near the end of the table, which shows that that row of numbers should be followed by a 2.
The table shows 13*13 / 2 ~ 85 values. OEIS has a list of 104 values, so it has the data to search through.
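
Continuing the sketch above, the chosen row values really are scattered across anti-diagonals, never contiguous in the submitted sequence:

row = [5, 4, 7, 6, 1, 0, 3]
print(any(xor_by_antidiagonals[i:i + len(row)] == row
          for i in range(len(xor_by_antidiagonals) - len(row) + 1)))
# False: the row's values never appear consecutively in anti-diagonal order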

No intuitive search of OEIS tables

It seems to me that the most intuitive way to search a table of values is by row, left to right. There are other ways to search a table (assuming an origin at top left and the table extends to the right and down):

  • By row, L2R, R2L
  • By column, T2B, B2T
  • By 45 degree diagonals, ↘, ↖, ↙, ↗

OEIS doesn't seem to do these searches on tabular data.
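
Given a table regenerated by the code below, an L2R row search is then straightforward; rows_containing is a hypothetical helper, just a sketch:

def rows_containing(table: list[list[int]], target: list[int]) -> list[int]:
    "Indices of table rows containing target as a contiguous run, left to right."
    ln = len(target)
    return [i for i, row in enumerate(table)
            if any(row[j:j + ln] == target for j in range(len(row) - ln + 1))]

# e.g. rows_containing(table, [5, 4, 7, 6, 1, 0, 3]) should report row 5 of A003987.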

Regenerating a 2D table from antidiagonalised data.

I did play around and created some code to recreate a table as a list of row-lists, in Python, given an OEIS B-file. The options handling is a work in progress, but the main part was being able to generate the table.

# -*- coding: utf-8 -*-
# %%
"""
adia_to_table.py file

Generate 2D table from OEIS type anti-diagonalised sequence held in file


Created on Thu Apr  4 18:16:07 2024

@author: paddy3118
"""

from itertools import zip_longest
import math
# from pprint import pp
from antidiagonals import antidiag_triangle_indices


def read_bfile(bfname: str) -> tuple[int | None,  # first index
                                     list[int]]:   # anti-diag values
    """
Read B-file bfname

## B-File format:
* Ref: https://oeis.org/SubmitB.html
* Blank lines ignored
* Lines beginning  with # ignored
* Lines of two, space-separated integers <index> <indexed-value>

It is assumed the index increments by one on subsequent lines.
    """
    first_index, values = None, []
    with open(bfname) as bfile:
        for line in bfile:
            ln = line.strip()
            if not ln or ln.startswith('#'):
                continue
            index, value = [int(field) for field in ln.split()[:2]]
            if first_index is None:
                first_index = index
            values.append(value)

    return first_index, values


def antidiag_to_table(sequence: list[int]) -> list[list[int]]:
    """
    Convert anti-diagonalised sequence back to infinite 2D table.

    Parameters
    ----------
    sequence : list[int]
        Anti-diagonalised values from table.

    Returns
    -------
    list[list[int]]
        Table of rows of ints.

    Table rows will fill in from successive sequence values like this:

     1  2  4  7 11 16 ...
     3  5  8 12 17
     6  9 13 18
    10 14 19
    15 20
    21
    .
    .
    .
    """

    # 1, 3, 6, 10, 15, 21, ... rows*(rows+1) / 2

    # min columns in triangular table generation. ~= min rows
    size = len(sequence)                                  # = rows*(rows+1)/2
    rows = math.ceil((-1 + math.sqrt(4 * 2 * size)) / 2)  # solve for rows
    # cols = rows  # last row may be deleted if last anti-diag is part filled.
    # print(f"{(size, cols) = }")

    # Empty (triangular) table of None's
    table = [[None] * (rows - i)
             for i in range(rows)]

    indices = antidiag_triangle_indices()
    col = 0  # for if sequence is empty
    for val, (row, col) in zip(sequence, indices):
        table[row][col] = val

    # Remove unfilled part of last anti-diag of table
    while col > 0:
        row, col = next(indices)
        table[row].pop(-1)
    # remove last row if present and empty
    if table and not table[-1]:
        table.pop(-1)

    return table


def pp_table(table: list[list[int]]) -> None:
    "Pretty-print table of differring row lengths"
    if not table:
        return
    col_width = max(max(len(str(val)) for val in row) for row in table)
    for row in table:
        print(''.join(f"{val:{col_width}}" for val in row))


def transpose(table: list[list[int]]) -> list[list[int]]:
    "Table of rows to x<>y transposed table of new rows"
    fv = math.nan
    tr = [list(new_row)
          for new_row in zip_longest(*table, fillvalue=fv)]
    # remove fillvalues in triangular transposition
    for row in tr:
        try:
            row[row.index(fv):] = []
        except ValueError:
            continue

    return tr


if __name__ == "__main__":
    print("# TEST FILL BY ANTI-DIAGONAL\n")
    for n in range(0, 8):
        print(f"{n = }:\n")
        ans = antidiag_to_table(list(range(1, n+1)))
        pp_table(ans)
        print()

    fname, m = 'b365096.txt', 505
    print(f"\n\n# Data from {fname}, first {m} values:\n")
    ad = read_bfile(fname)
    ans = antidiag_to_table(ad[1][:m])
    pp_table(ans)
    print("\n## Transposed:\n")
    pp_table((tr:=transpose(ans)))



And antidiagonals.py is this:

# -*- coding: utf-8 -*-
"""
Anti-diagonals:

0,0 0,1 0,2 0,3
1,0 1,1 1,2 1,3
2,0 2,1 2,2 2,3
3,0 3,1 3,2 3,3

Of Square:
0,0  0,1 1,0  0,2 1,1 2,0  0,3 1,2 2,1 3,0   1,3 2,2 3,1  2,3 3,2  3,3

of Infinite table:
0,0  0,1 1,0  0,2 1,1 2,0  0,3 1,2 2,1 3,0   0,4 1,3 2,2 3,1 4,0 ...


Created on Mon Aug 21 13:36:31 2023

@author: paddy

"""
# %% Triangles

from collections.abc import Iterator
from itertools import islice


def antidiag_triangle_indices() -> Iterator[tuple[int, int]]:
    "Yield (x, y) indices of an infinite table, read by anti-diagonals."
    x = y = 0
    while True:
        yield (x, y)
        x, y = (x+1, y-1) if y else (0, x+1)


list(islice(antidiag_triangle_indices(), 15))

# %% Rectangles

from itertools import islice


def antidiag_rectangle_indices(sizex: int = 4, sizey: int = 4) -> Iterator[tuple[int, int]]:
    "Yield (x, y) indices of a sizex-by-sizey table, read by anti-diagonals."
    x = y = 0
    while True:
        yield (x, y)
        if (x, y) == (sizex - 1, sizey - 1):
            break
        x, y = (x+1, y-1)
        if x == sizex or y < 0:
            u = x + y + 1
            x, y = (0, u) if u < sizey else (u - sizey + 1, sizey - 1)

list(antidiag_rectangle_indices(3, 4))


END.

Sunday, March 31, 2024

Finding a sub-list within a list, in Python

    a left-handed Jamaican woman finding a sublist in a list at a beach-front home-office

 

Existing?

 As part of a larger project, I thought I might need to search for a sub-list within a given list, and because I am lazy I did a quick Google and did not like the answers I found. I started with the thought that the best algorithm for me would be to start searching from the index of the first item in the sub-list, and so on, but none of the googled answers used list.index.

I decided then to create my own.

My version

Well, I want to use list.index. If the item is not in the list then it raises a ValueError, so I'll need a try-except block too.

I look for successive occurrences of the first item of the sub-list in the list and, if the full sub-list matches there, accumulate the index in the answer and move on to search for the next match.

It seemed easy to add flags to:

  1. Stop after finding a first index of the sub-list in the list.
  2. Allow overlapping matches or not: [1,0,1] is found twice in [1,0,1,0,1], at indices 0 and 2, but only once if overlapping is not allowed.

#!/bin/env python3
# %%
"""
Find instances of a sub-list in a list
"""
from typing import Any

def index_sublist(lst: list[Any],
                  sublst: list[Any],
                  only_first: bool=False,
                  non_overlapping=False,
                  ) -> list[int]:
    "Find instance of a (non-empty), sub-list in a list"
    if not sublst:
        raise ValueError("Empty sub-list")
    if not lst:
        return []
   
    first, ln = sublst[0], len(sublst)
   
    if len(lst) < ln:  # Reddit: check sizes
        return []
   
    ans, i = [], 0
    while True:
        try:
            i = lst.index(first, i)
        except ValueError:
            break
        if lst[i: i+ln] == sublst:
            ans.append(i)
            if only_first:  #Reddit: Only_first calc now fixed
                break
        i += ln if non_overlapping else 1
   
    return ans

#%%
def test():
    assert index_sublist([], [1], only_first=False) == []
    assert index_sublist([1], [1], only_first=False) == [0]
    assert index_sublist([1,0,1], [1], only_first=False) == [0, 2]
    assert index_sublist([2,1,0,1], [1], only_first=True) == [1]
    assert index_sublist([2,1,0,1], [1, 3], only_first=False) == []
   
    assert index_sublist([1,0,1,0,1], [1,0,1],
                         only_first=False,
                         non_overlapping=False) == [0, 2]
    assert index_sublist([1,0,1,0,1], [1,0,1],
                         only_first=False,
                         non_overlapping=True) == [0]

    # Reddit: Extra checks
    assert index_sublist([2,1,0,1,2,1,2], [1, 2],
                         only_first=True) == [3]
    assert index_sublist([2,1,0,1,2,1,2], [1, 2],
                         only_first=False) == [3,5]


#%%
if __name__ == '__main__':
    test()

# %%

STOP PRESS!

A Reddit commenter spotted a bug that I fixed and added a couple of extra tests for.

End.

 
