Sunday, November 10, 2019

Quasi-random Sobol sequence



In which I revisit low-discrepancy series, Hunt for an understanding of the Sobol sequence, then translate and optimise a C++ program into Python.

Re-visit VdC

I had left a mental note to get back to the Van der Corput sequence after creating the Rosetta Code task some years ago. Van der Corput is a type of low-discrepancy or quasi random sequence often used in Financial simulations; and with ever larger digital systems on automotive System on Chips, I keep a look-out for Verification techniques to add to the toolbox.

I decided to investigate the Sobol Sequence, understand the maths and write some Python!

Sobol Maths

I followed references; I googled; I read; and I just could not gain an understanding of what is happening, enough to write some code. What code I initially browsed gave me only snippets - Those snippets were bit twiddling so I coded those up for fun.

Fun bit twiddles

Highest 1 bit in binary number

def binhi1(b):
    "highest 1 bit of binary b. (== ceil(log2(b)))"
    pos = 0
    while b > 0:
        pos, b = pos + 1, b // 2
    return pos

Example 1
for n in range(18):
    print(f"{n:>5b} {binhi1(n)}")
    
    0 0
    1 1
   10 2
   11 2
  100 3
  101 3
  110 3
  111 3
 1000 4
 1001 4
 1010 4
 1011 4
 1100 4
 1101 4
 1110 4
 1111 4
10000 5
10001 5

Lowest 0 bit in binary number

def binlo0(b):
    "lowest 0 bit from right of binary b. Always >=1"
    pos, b2 = 1, b // 2
    while b2 * 2 != b:
        pos, b, b2 = pos + 1, b2, b2 // 2
    return pos

Example 2

for n in range(18):
    print(f"{n:<05b binlo0="" font="" n="">)
00000 1
00001 2
00010 1
00011 3
00100 1
00101 2
00110 1
00111 4
01000 1
01001 2
01010 1
01011 3
01100 1
01101 2
01110 1
01111 5
10000 1
10001 2


Code to translate

One of the Sobol references was this one by S. Joe and F. Y. Kuo. They had written academic papers on Sobol and had generated constants for the use of the sequence in up to 21201 dimensions. They also had a C++ program with BSD-style license and, crucially, sample outputs from their C++ code that could aid me in writing  Python conversion of their code.

I downloaded their C++ and their latest set of direction numbers and took a look. Eager to code something, I coded up a reader for their directions file:

Initial dirctions file reader


def direction_file_reader(fname="Sobol_new-joe-kuo-6.21201"):
    """\
    Reads direction file of format:
        d       s       a       m_i     
        2       1       0       1 
        3       2       1       1 3 
        4       3       1       1 3 1 
        5       3       2       1 1 1 
        6       4       1       1 1 3 3
        ...
    """
    with open(fname) as f:
        headers = f.readline().strip().split()
        assert headers == ['d', 's', 'a', 'm_i']
        directions = []
        for line in f:
            d, s, a, *m = [int(cell) for cell in line.strip().split()]
            directions.append((d, s, a, tuple([0] + m)))
        return directions

The direction file was three single columns then all the other columns made up the list m.

C++ code direct translation.

I decided to translate the C++ code near verbatim to Python, then run the example and ensure that the Python gave the same output as stated for the C++ program.

I cut-n-pasted sections of the C++ code as comments/unused triple-quoted strings into my Python then slowly translated from C++.
The C++ used several single-character, uppercase, variable names. I took them to be references to their papers or maybe earlier Fortran programs and either kept them, or made note of their re-naming.

I finally got my first working Python code which is below, but without the C++ code comments I followed:


1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
def sobol_points(N, D):
    """From: C++ Source of Frances Y. Kuo "sobol.cc" """
    pow2_32 = 2**32

    L = binhi1(N)
    C = [binlo0(i) for i in range(N)]
    points= [[0 for j in range(D)] for i in range(N)]
    V = [1 << (32- i) for i in range(L+1)]

    X = [0]
    for i in range(1, N):
        X.append(X[i-1] ^ V[C[i-1]])
        points[i][0] = X[i]/pow2_32;
    del V, X
    
    directions = direction_file_reader()
    for j in range(1, D):
        d, s, a, m = directions[j-1]
        #print(d, s, a, m)
        
        if L <= s:
            V = [0] + [m[i] << (32 - i) for i in range(1, L+1)]
        else:
            V = [0] + [m[i] << (32 - i) for i in range(1, s+1)]
            for i in range(s+1, L+1):
                V.append(V[i-s] ^ (V[i-s] >> s))
                for k in range(1, s):
                    V[i] ^= (((a >> (s-1-k)) & 1) * V[i-k])
        X = [0]
        for i in range(1, N):
            X.append(X[i-1] ^ V[C[i-1]])
            points[i][j] = X[i]/pow2_32
    
    return points

Issues with the directly translated Python/C++:

  • The calculation ov V[] in lines 21 to 28 I thought should become a function.
  • All points are computed and returned at once. It would be better as a generator function.
  • The calculation of X[] in lines 29 to 31 seems like a very inefficient loop within a loop. Furthermore, if turned into a generator; Iwould need to unscramble the loops to try and recalculate X for each point without another loop.

Final Python generator function version


1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
# -*- coding: utf-8 -*-
"""sobol_point_gen.py

 Usage:
    sobol_point_gen.py N D <dimensions_file>
    sobol_point_gen.py -h | --help

 Options:
     -h --help     Show this screen.     

 Arguments:
     N  Number of points to generate (<2**32)
     D  Number of dimensions for each point. (D <= max D in dimensions_file)
     dimensions_file    See: https://web.maths.unsw.edu.au/~fkuo/sobol/

 Created from an original c++ source by Frances Y. Kuo and Stephen Joe

Created on Fri Nov  1 09:13:12 2019

@author: Paddy3118
"""

from functools import lru_cache

#%%
def binhi1(b):
    "highest 1 bit of binary b. (== ceil(log2(b)))"
    pos = 0
    while b > 0:
        pos, b = pos + 1, b // 2
    return pos

def binlo0(b):
    "lowest 0 bit from right of binary b. Always >=1"
    pos, b2 = 1, b // 2
    while b2 * 2 != b:
        pos, b, b2 = pos + 1, b2, b2 // 2
    return pos

#%%
def direction_file_reader(d, fname="Sobol_new-joe-kuo-6.21201"):
    """\
    Reads d+header lines from direction file of format:
        d       s       a       m_i     
        2       1       0       1 
        3       2       1       1 3 
        4       3       1       1 3 1 
        5       3       2       1 1 1 
        6       4       1       1 1 3 3
        7       4       4       1 3 5 13 
        8       5       2       1 1 5 5 17 
        9       5       4       1 1 5 5 5 
        10      5       7       1 1 7 11 19 
        11      5       11      1 1 5 1 1 
        12      5       13      1 1 1 3 11 
        13      5       14      1 3 5 5 31 
        14      6       1       1 3 3 9 7 49 
        15      6       13      1 1 1 15 21 21 
        16      6       16      1 3 1 13 27 49 
        17      6       19      1 1 1 15 7 5 
        18      6       22      1 3 1 15 13 25 
        19      6       25      1 1 5 5 19 61 
        20      7       1       1 3 7 11 23 15 103 
        21      7       4       1 3 7 13 13 15 69 
        22      7       7       1 1 3 13 7 35 63 
        23      7       8       1 3 5 9 1 25 53 
        24      7       14      1 3 1 13 9 35 107 
        25      7       19      1 3 1 5 27 61 31 
        26      7       21      1 1 5 11 19 41 61 
        27      7       28      1 3 5 3 3 13 69 
        28      7       31      1 1 7 13 1 19 1 
        29      7       32      1 3 7 5 13 19 59 
        30      7       37      1 1 3 9 25 29 41 
        31      7       41      1 3 5 13 23 1 55 
        32      7       42      1 3 7 3 13 59 17 
        ...
    """
    with open(fname) as f:
        headers = f.readline().strip().split()
        assert headers == ['d', 's', 'a', 'm_i']
        directions = []
        for i, line in zip(range(d), f):
            d, s, a, *m = [int(cell) for cell in line.strip().split()]
            directions.append((d, s, a, tuple([0] + m)))
        return directions

#%%
@lru_cache(128)
def calc_dnum(s, a, m, mxbits):                      # V, direction numbers
    if mxbits <= s:
        dirnum = [0] + [m[i] << (32 - i) for i in range(1, mxbits+1)]
    else:
        dirnum = [0] + [m[i] << (32 - i) for i in range(1, s+1)]
        for i in range(s+1, mxbits+1):
            dirnum.append(dirnum[i-s] ^ (dirnum[i-s] >> s))
            for k in range(1, s):
                dirnum[i] ^= (((a >> (s-1-k)) & 1) * dirnum[i-k])
    return dirnum

def sobol_point_gen(N, D, fname="Sobol_new-joe-kuo-6.21201"):
    pow2_32 = 2**32
    mxbits = binhi1(N)                              # L, max bits needed
    dnum = [1 << (32- i) for i in range(mxbits+1)]  # V, direction numbers
    directions = direction_file_reader(D, fname)
    xd = [0 for j in range(D)]                      # X
    pt = [0 for j in range(D)]                      # POINTS
    yield pt
    for n in range(1, N):
        nfirst0 = binlo0(n-1)                       # C
        pt = [0 for j in range(D)]
        xd[0] = xd[0] ^ dnum[nfirst0]
        pt[0] = xd[0] / pow2_32
        for jj in range(1, D):
            _, s, a, m = directions[jj-1]
            dnum2 = calc_dnum(s, a, m, mxbits)      # V, direction numbers
            xd[jj] = xd[jj] ^ dnum2[nfirst0]
            pt[jj] = xd[jj]/pow2_32
        yield pt

#%%
if __name__ == '__main__':
    first_sobol_3D_points = [
         [0, 0, 0],
         [0.5, 0.5, 0.5],
         [0.75, 0.25, 0.25],
         [0.25, 0.75, 0.75],
         [0.375, 0.375, 0.625],
         [0.875, 0.875, 0.125],
         [0.625, 0.125, 0.875],
         [0.125, 0.625, 0.375],
         [0.1875, 0.3125, 0.9375],
         [0.6875, 0.8125, 0.4375]]
    generated = list(sobol_point_gen(10, 3))
    assert generated == first_sobol_3D_points


Function def direction_file_reader is changed to only read as much of thedirections file as is necessary.
Function def sobol_point_gen is now a generator function in which successive calls to next give successive points. The loops are swapped inside with an outer loop yielding a point every iteration, and an inner loop over dimensions > 1.
Function def calc_dnum is cached. Initial suspicions of a lot of recalculation proved true:

calc_dnum cache stats


In [1]: g = list(sobol_point_gen(10000, 3))
In [2]: calc_dnum.cache_info()
Out[2]: CacheInfo(hits=20012, misses=4, maxsize=128, currsize=4)



1 comment: