Wednesday, March 24, 2021

Using Sympy for linear interpolation.

Wednesday, February 17, 2021

Longest Common Substring: An investigation using Python.

(Best browsed on a larger than phone screen)

Someone updated the RC task, and I couldn't remember attempting it so took a closer look at the Python solution, (at least the more idiomatic of the two).

Task

The task is: Given two character strings, find the longest string of characters present anywhere in both strings.

I should point out that there is a different, well known, task which is to find the longest common subsequence between two strings.

A sub**sequence** if a string is any characters from the string, in the order that they appear in the string, *but intervening characters can be skipped*.
There is no such skipping allowed in a sub**string**

The "Rote search" method

I took a look at the Python on Rosetta Code, it is by user Halilmal:

s1 = "thisisatest"
s2 = "testing123testing"
len1, len2 = len(s1), len(s2)
ir, jr = 0, -1
for i1 in range(len1):
    i2 = s2.find(s1[i1])
    while i2 >= 0:
        j1, j2 = i1, i2
        while j1 < len1 and j2 < len2 and s2[j2] == s1[j1]:
            if j1-i1 >= jr-ir:
                ir, jr = i1, j1
            j1 += 1; j2 += 1
        i2 = s2.find(s1[i1], i2+1)
print (s1[ir:jr+1])

test

The code works. To understand it I decided to take the liberty of renaming and extracting a function from the original. I also named the input strings from zero rather than 1.

s0 = "thisisatest"
s1 = "testing123testing"

def lcs_rs(s0, s1):
    "Commented/longer names, RosettaCode entry by user Halilmali"
    len0, len1 = len(s0), len(s1)
    max_sub_lh, max_sub_rh = 0, -1   # Start LH and RH indices of max substring
    for i0 in range(len0):      # Start LH index of s0 char
        i1 = s1.find(s0[i0])    # Start LH index of (first) s1 char match
        while i1 >= 0:
            sub_end_s0, sub_end_s1 = i0, i1     # End indices of this substring match
            while (sub_end_s0 < len0 and
                   sub_end_s1 < len1 and
                   s1[sub_end_s1] == s0[sub_end_s0]):
                if sub_end_s0 - i0 >= max_sub_rh-max_sub_lh:
                    max_sub_lh, max_sub_rh = i0, sub_end_s0     # New Max
                sub_end_s0 += 1
                sub_end_s1 += 1
            i1 = s1.find(s0[i0], i1+1)   # Start LH index of next s1 char match
    return s0[max_sub_lh:max_sub_rh+1]

print (lcs_rs(s0, s1))

test

In this algorithm we have the two strings:

          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6

  s0[i0]: t h i s i s a t e s t

  s1[i1]: t e s t i n g 1 2 3 t e s t i n g

For every character in s0 starting from the left, it looks up that character in s1 then tries to extend the matching string to the right as much as possible; moving to second and subsequent matches in s1; all the while keeping track of the largest substring match so far.

It seems to do a lot of traversal of the strings!

Dynamic Programming.

I searched and found a site that had a dynamic programming solution.

I liked the algorithm. In this you start with an X-Y array of zeroes, what I call the dp array:

#s1[j] t  e  s  t  i  n  g  1  2  3  t  e  s  t  i  n  g;  # s0[i]  
dp = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # t
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # h
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # i
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # s
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # i
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # s
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # a
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # t
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # e
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # s
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]] # t

You compare every character of s0 with every character of s1 in order of increasing i and j and fill in the table like this:

if s0[i] == s1[j] then dp[i][j] = dp[i-1][j-1] + 1 You are comparing two characters from each string. If they are equal, then accumulate how many characters have been equal by adding one to the equality count for the dp coordinate corresponding to accumulated run of equivalences for the "previous" character.

There is an issue with what to do at the "edges". When i or jare zero then set dp[i][j] = 1

After filling the dp array, to get the longest substring search dp for a maximum value, mx, then mx is the number of characters in the longest substring and the coordinates of mx give the end index iof the string in s0 and s1 respectively.

An example:

s0 = 'ABBA'
s1 = 'AABB'

#s1[j] A  A  B  B;   # s0[i]

dp = [[1, 1, 0, 0],  # A
      [0, 0, 2, 1],  # B
      [0, 0, 1, 3],  # B
      [1, 1, 0, 0]]  # A

The above fills in dpafter using the equivalent to for i in range(len(s0)): for j in range(len(s1)): ... to traverse the array. The maximum value is clearly 3 at i,j == (2, 3). Using the i coordinate corresponding to the end of the longest substring being the first B in s0 knowing the length, the longest substring must be 'ABB'.

You can cross check that the same 3 character substring appears in s1 ending at s1[3]

Code

from itertools import product

def dp_array_print(s0, s1, dp, ans):
    print(f"\ns0 = {repr(s0)}\ns1 = {repr(s1)}\n")
    print(f"#s1[j] {'  '.join(c for c in s1)};   # s0[i]\n")
    for i, (dp_row, ch0) in enumerate(zip(dp, s0)):
        txt = f"      {dp_row},  # {ch0}"
        if i == 0:
            txt = "dp = [" + txt[6:]
        if i + 1 == len(s0):
            txt = txt.replace(',  #', ']  #')
        print(txt)
    print('\nAnswer = ', repr(ans),'\n=====================')

def lcs_dp(s0, s1):
    len0, len1 = len(s0), len(s1)

    # Bottom up Dynamic Programming table gen
    dp = [[0] * len1 for _ in range(len0)]
    for i, j in product(range(len0), range(len1)):
        dp[i][j] = (1 + (dp[i-1][j-1] if i and j else 0)   # 0 if off "edge"
                    if s0[i] == s1[j]   # Only increment if same chars
                    else 0)

    # Find max
    mx, ij_mx = 0, (-1, -1)
    for i, j in product(range(len0), range(len1)):
        if dp[i][j] > mx:
            mx, ij_mx = dp[i][j], (i, j)

    return s0[1 + ij_mx[0] - mx: 1 + ij_mx[0]], dp


s0, s1 = 'ABBA', 'AABB'

ans01, dp01 = lcs_dp(s0, s1)
dp_array_print(s0, s1, dp01, ans01)

s0 = 'ABBA'
s1 = 'AABB'

#s1[j] A  A  B  B;   # s0[i]

dp = [[1, 1, 0, 0],  # A
      [0, 0, 2, 1],  # B
      [0, 0, 1, 3],  # B
      [1, 1, 0, 0]]  # A

Answer =  'ABB' 
=====================

The code isn't too bad, but there is that nested if-expression needed to calculate the dp[i][j] value under all cases.

Dynamic Programming for more than two strings?

For three strings I thought of filling in a dp cube in a similar way to the dp array, should work.

After trying a few two-string examples I noticed that there could be a fair amount of zeroes in the dp array. I decided to therefore make the dp-cube a sparse array held in a dict with a tuple of the string indices of any point and only keeping non-zero counts in the dp sparse dict.
I chose not to use a defaultdict(int) for the dp dict as it would accumulate zero values for accesses to any tuple coordinate having a -1 in it.

It prooved to be easy enough to move from first doing a three-string function to actually going ahead and doing a two or more string version straight away.

def lcs_dpx(*strings):
    "Longest Common Substring: Dynamic Programming solution for many strings"
    index_ranges  = [range(len(s)) for s in strings]

    # Bottom up Dynamic Programming table gen
    dp = {}  
    for indices in product(*index_ranges):
        string_chars = (s[i] for i, s in zip(indices, strings))
        first_char = next(string_chars)
        if all(next_char == first_char for next_char in string_chars):
            prev = tuple(i-1 for i in indices) # Index previous chars
            dp[indices] = 1 + (dp[prev] if prev in dp else 0)

    # Find max
    mx, mx_ind = 0, tuple(-1 for _ in strings)
    for indices, val in dp.items():
        if val > mx:
            mx, mx_ind = val, indices

    return strings[0][1 + mx_ind[0] - mx: 1 + mx_ind[0]], dict(dp)


s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"

ans, dp = lcs_dpx(s0, s1)
print(f"\n{repr(s0)}, {repr(s1)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_dpx(s0, s2)
print(f"\n{repr(s0)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_dpx(s1, s2)
print(f"\n{repr(s1)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_dpx(s0, s1, s2)
print(f"\n{repr(s0)}, {repr(s1)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")

'thisisatest_stinger', 'testing123testingthing' ->> 'sting';  len(dp) = 48;

'thisisatest_stinger', 'thisis' ->> 'thisis';  len(dp) = 19;

'testing123testingthing', 'thisis' ->> 'thi';  len(dp) = 16;

'thisisatest_stinger', 'testing123testingthing', 'thisis' ->> 'thi';  len(dp) = 55;

This works too!
The code uses product to perform arbitrarily nested for loops. The logic for incrementing dp is simplified as dp starts without, and cannot contain an indices tuple with a -1 in it as that is generated in prev and dp[prev]is never assigned to.

Strings containing long runs of the samwe single character will have the largest number of matches and so the largest dpsize for this algorithm.

The longest of the common substrings of all substrings of each string

This algorithm, I saw mentioned somewhere. From just the menton of sets, i worked out what would be needed in Python, without bothering to understand the Perl original.

This method

Works out the set of all the possible sub-strings for each individual string.
Finds the set of substrings common to all the strings
Returns the common substring that is the longest.

The description of the algorithm "does exactly what it says on the tin". I decided to create a version that worked with two and more strings as it seemed the most straight-forward of extensions.

Code

function _set_of_substringsreturns the set of all possible substrings of its argument. Rather than continue to just generate all the substrings of all strings, function _set_of_common_substringsis used for subsequent strings to accumulate and return the set of only the common substrings. This latter function also makes use of the walrus operator to assign to this.

def _set_of_substrings(s:str) -> set:
    "_set_of_substrings('ABBA') == {'A', 'AB', 'ABB', 'ABBA', 'B', 'BA', 'BB', 'BBA'}"
    len_s = len(s)
    return {s[m: n] for m in range(len_s) for n in range(m+1, len_s +1)}

def _set_of_common_substrings(s:str, common: set) -> set:
    "Substrings of s that are also in common"
    len_s = len(s)
    return {this for m in range(len_s) for n in range(m+1, len_s +1)
            if (this := s[m:n]) in common}

def lcs_ss(*strings):
    "longest of the common substrings of all substrings of each string"
    strings_iter  = (s for s in strings)
    common = _set_of_substrings(next(strings_iter)) # First string substrings
    for s in strings_iter:
        if not common:
            break
        common = _set_of_common_substrings(s, common) # Accumulate the common

    return max(common, key= lambda x: len(x)) if common else ''


s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"

ans = lcs_ss(s0, s1)
print(f"\n{repr(s0)}, {repr(s1)} ->> {repr(ans)}")
ans = lcs_ss(s0, s2)
print(f"\n{repr(s0)}, {repr(s2)} ->> {repr(ans)}")
ans = lcs_ss(s1, s2)
print(f"\n{repr(s1)}, {repr(s2)} ->> {repr(ans)}")
ans = lcs_ss(s0, s1, s2)
print(f"\n{repr(s0)}, {repr(s1)}, {repr(s2)} ->> {repr(ans)}")

'thisisatest_stinger', 'testing123testingthing' ->> 'sting'

'thisisatest_stinger', 'thisis' ->> 'thisis'

'testing123testingthing', 'thisis' ->> 'thi'

'thisisatest_stinger', 'testing123testingthing', 'thisis' ->> 'thi'

Comparative Timings

Simplistic data so have a pinch of salt to hand:

s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"

%timeit _ = lcs_rs(s0, s1)

122 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit _ = lcs_dp(s0, s1)

315 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit _ = lcs_dpx(s0, s1)

1.4 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit _ = lcs_ss(s0, s1)

265 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now for timing 3-string runs:

%timeit _ = lcs_dpx(s0, s1, s2)

8.4 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit _ = lcs_ss(s0, s1, s2)

284 µs ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Stop Press! Incremental DP

It seems to me that in the multi-string, dynamic programming case, you should be able to create the dpsparse array for the first two strings then add in subsequent strings to successive dpsparse arrays incrementing dimensionality. When adding a dimension/string to dp, the search for string equality only needs to be done where there were matches for all previous strings/lower dimentionality dp.

def lcs_idp(*strings):
    "Longest Common Substring: Incremental Dynamic Programming for many strings"
    index_ranges  = [range(len(s)) for s in strings]

    ## Incremental Dynamic Programming table gen
    # First two strings
    dp = {}
    for indices in product(*index_ranges[:2]):
        string_chars = (s[i] for i, s in zip(indices, strings[:2]))
        first_char = next(string_chars)
        if all(next_char == first_char for next_char in string_chars):
            prev = tuple(i-1 for i in indices) # Index previous chars
            dp[indices] = 1 + (dp[prev] if prev in dp else 0)
    # Next strings
    for index_range, string in zip(index_ranges[2:], strings[2:]):
        dp_last, dp = dp, {}
        for last_match_index in dp_last:      # These are non-zero from before
            last_match_char = strings[0][last_match_index[0]]
            for index, this_char in enumerate(string):
                if last_match_char == this_char:
                    indices = tuple(list(last_match_index) + [index])
                    prev = tuple([i-1 for i in last_match_index] + [index - 1])
                    dp[indices] = 1 + (dp[prev] if prev in dp else 0)

    # Find max
    mx, mx_ind = 0, tuple(-1 for _ in strings)
    for indices, val in dp.items():
        if val > mx:
            mx, mx_ind = val, indices

    return strings[0][1 + mx_ind[0] - mx: 1 + mx_ind[0]], dict(dp)


s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"

ans, dp = lcs_idp(s0, s1)
print(f"\n{repr(s0)}, {repr(s1)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_idp(s0, s2)
print(f"\n{repr(s0)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_idp(s1, s2)
print(f"\n{repr(s1)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_idp(s0, s1, s2)
print(f"\n{repr(s0)}, {repr(s1)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")

'thisisatest_stinger', 'testing123testingthing' ->> 'sting';  len(dp) = 48;

'thisisatest_stinger', 'thisis' ->> 'thisis';  len(dp) = 19;

'testing123testingthing', 'thisis' ->> 'thi';  len(dp) = 16;

'thisisatest_stinger', 'testing123testingthing', 'thisis' ->> 'thi';  len(dp) = 55;

Timings

%timeit _ = lcs_idp(s0, s1)

1.5 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit _ = lcs_idp(s0, s1, s2)

1.73 ms ± 59.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This new Incremental Dynamic Programming approach seems to be best the straight DP algorithm for three and more strings as it prunes the (hyper) cube of comparisons.

Conclusion.

The timings are based on a laughably small example. You really need to try the routines on your own data. lcs_ssand lcs_idp should be checked with your own sample data to see what is right for you. They both try and cut down on memoryand computations, but your type of data might make one outclas the other.

Note: There could be another algorithm that makes use of suffix trees...

END.

Saturday, February 13, 2021

Python: Counting without Counter

Normally, if you are counting things then you need to first think of collections.Counter. It's in the standard library and I use it a lot myself, but lately, I've had two instances where I was counting, and Counter was not the best fit!

1. When the keys to count are known, adjacent, integers.

Or can be easily manipulated to be

In this case you can replace counter by an array of zeroes and increment array[key] whenever you see key.

I used this in code to bin numbers. The bin limits are set and sorted, so you know exactly how many bins are needed. numbers to bin are first compared with the limits to work out which bin they occupy and then that bin is incremented:

from bisect import bisect_right
 
def bin_it(limits: list, data: list) -> list:
    "Bin data according to (ascending) limits."
    bins = [0] * (len(limits) + 1)      # adds under/over range bins too
    for d in data:
        bins[bisect_right(limits, d)] += 1
    return bins

I have replaced hash lookup in Counter with array indexing.

2. When defaultdict(int) is better

Another way to count arbitrary objects is to use defaultdict(int) and just increment item counts with the defauldict automatically starting the count at zero from its initialising call of int().

I was splitting words in a book then working out the frequencies of what words follow others with an aim to randomly create sentencies where the following word would by a weighted random choice of the words in the book that followed a word.

So if the book were just two sentences it would be broken down as:

>>> book_text = 'The cat sat on the mat. The mat is that upon which the cat sat'
>>> words = remove_punctuation(book_text).split()
>>> words
['the', 'cat', 'sat', 'on', 'the', 'mat', '.', 'the', 'mat', 'is', 'that', 'upon', 'which', 'the', 'cat', 'sat', '.']
>>>

I need to count how many times 'cat' comes after 'the'; 'sat' comes after 'cat', ... I also want to be able to to access 'the' and find all the words that it precedes, together with a count of how many times that word directly succeeds 'the'.

I want a dictionary with key 'word' that returns a dictionary who's keys are all the immediate successor words, and who's values are the counts of how many times the successor word is found to the original word.

Something like the following nested dict:

{'the': {'cat': 2, 'mat': 2},
 'cat': {'sat': 2},
 'sat': {'on': 1, '.': 1},
 'on': {'the': 1},
 'mat': {'.': 1, 'is': 1},
 '.': {'the': 1},
 'is': {'that': 1},
 'that': {'upon': 1},
 'upon': {'which': 1},
 'which': {'the': 1}}

See how it shows that 'the' is followed twice by 'cat' and once by'mat'; 'sat' is followed once by 'on' and once by a full stop '.', etc.

The data structure needs to be an outer dict that will automatically create an inner counter dict. I use lowercase counter because that inner dict can be either a collections.Counter or a defaultdict(int) used as a counter. Since I need a defaultdict at the outer level, and don't need any of the Counter special methods, I chose to use defaultdict for the nested dicts too.

That leads to the following definition and population of the datastructure:

>>> word2next = defaultdict(lambda :defaultdict(int))
>>> for lh, rh in zip(words, words[1:]):
    word2next[lh][rh] += 1

    
>>> assert word2next == {'the': {'cat': 2, 'mat': 2},
                     'cat': {'sat': 2},
                     'sat': {'on': 1, '.': 1},
                     'on': {'the': 1},
                     'mat': {'.': 1, 'is': 1},
                     '.': {'the': 1},
                     'is': {'that': 1},
                     'that': {'upon': 1},
                     'upon': {'which': 1},
                     'which': {'the': 1}}
>>>

zip(*word2next[some_word].items()) then returns a list of successor words and a list of successor_word_counts that can then be used as arguments to random.choices() to randomly select the next word in a Markov Chain sentence generator based on the book text.

Conclusion

When counting odd stuff in Python then collections.Counter should always be considered, but it isn't the only way to count; arrays initialised to zero, and sometimes defaultdict(int), might also be worth considering.

End.

Tuesday, February 02, 2021

Pythone'er abuses walrus

(... the operator, that is 😊).

I haven't used the new walrus operator, := much in code. This is an example of a working mis-use of it.

A new task

I wrote a new task description on Rosetta Code, together with Python recursive and iterative solutions.

The gist of the task is:

Given a list of integers, >0, representing nesting levels, e.g. [1, 2, 4]
Generate a tree of those values, in order, in the stated depth of nesting. i.e. [1, [2, [[4]]]]

(Someone on Stackoverflow had that problem, and it seemed like it could form the basis of a good RC task).

Iterative solution

I posted the following code as my iterative solution.

def to_tree(x: list) -> list:
   nested = []
   stack = [nested]
   for this in x:
       while this != len(stack):
           if this > len(stack):
               innermost = []               # new level
               stack[-1].append(innermost)  # nest it
               stack.append(innermost)      # push it
           else: # this < stack:
               stack.pop(-1)
       stack[-1].append(this)

   return nested

It works using:

len(stack) to capture nesting depth; and
stack[-1] as a pointer into the active nest at that level; whilst
nested accumulates the whole tree.

The walrus hack.

I was admiring lines seven to nine above and was initially rthinking about name innermost. It is a kind of temporary variable, but is instrumental in modifying stack. I thought: "could I do lines seven to eight in one statement"?

It is awkward as appends return None, but then the walrus operator came to my rescue and I produced the equivalent:

def to_tree(x: list) -> list:
   nested = []
   stack = [nested]
   for this in x:
       while this != len(stack):
           if this > len(stack):
               stack.append(None if stack[-1].append(innermost:=[])
                            else innermost)
           else: # this < stack:
               stack.pop(-1)
       stack[-1].append(this)

   return nested

The one statement at lines seven and eight is equivalent. The whole expression is evaluated inside out:

An empty list is created and assigned to name innermost.
that same list is nested by appending it to the list at the top of the stack.
The inner append always returns None which is False in the boolean context of the if condition.
The else clause is always triggered, returning innermost - the newly created and nested list.
That newly created and nested list is appended to the head of the stack.

This use of the walrus operator, a.k.a the assignment expression is bad Python. It is much harder to read and maintain than the previous code.

Other uses are Pythonic!

STOP PRESS!

Steven D'Aprano gives valuable insight on this in a comment here.

Sunday, January 31, 2021

Counting types of neighbour - in Python.

(I tried using the Spyder IDE to write a more literal Python program. Output samples needed to be pasted in to the source, but I didn't particularly want many of Jupyters strengths, so tried Spyder together with http://hilite.me/)

Best viewed on larger than 7" screens.

# -*- coding: utf-8 -*-
"""
Created on Sat Jan 30 10:24:54 2021

@author: Paddy3118
"""

#from conway_cubes_2_plus_dims import *
#from conway_cubes_2_plus_dims import neighbours
from itertools import product


# %% #Full neighbourhood
"""
In my [previous post][1] I showed how the number of neighbours to a point in
N-Dimensional space increases exponentially:

[1]: https://paddy3118.blogspot.com/2021/01/conways-game-of-life-in-7-dimensions.html

"""

# %% ##Function

def neighbours(point):
    return {n
            for n in product(*((n-1, n, n+1)
                               for n in point))
            if n != point}

for dim in range(1,10):
    n = neighbours(tuple([0] * dim))
    print(f"A {dim}-D grid point has {len(n):_} neighbours.")
    assert len(n) == 3**dim - 1

# %% ##Output
"""
A 1-D grid point has 2 neighbours.
A 2-D grid point has 8 neighbours.
A 3-D grid point has 26 neighbours.
A 4-D grid point has 80 neighbours.
A 5-D grid point has 242 neighbours.
A 6-D grid point has 728 neighbours.
A 7-D grid point has 2_186 neighbours.
A 8-D grid point has 6_560 neighbours.
A 9-D grid point has 19_682 neighbours.
"""

# %% ##Visuals
"""
Here we define neighbours as all cells, n whose indices differ by at most 1,
i.e. by -1, 0, or +1 from the point X itself; apart from the point X.

In 1-D:

    .nXn.

In 2-D:

    .....
    .nnn.
    .nXn.
    .nnn.
    .....

# 'Axial' neighbours

 There is another definition of neighbour that "cuts out the diagonals" in 2-D
 to form:

     .....
     ..n..
     .nXn.
     ..n..
     .....

In 3-D this would add two extra cells, one directly above and one below X.

In 1-D this would be the two cells one either side of X.

I call this axial because if X is translated to the origin then axial
neighbours are the 2N cells in which only one coordinate can differ by either
+/- 1.

# Let's calclate them.
"""

# %% #Axial neighbourhood
from typing import Tuple, Set, Dict

Point = Tuple[int, int, int]
PointSet = Set[Point]


def axial_neighbours(point: Point) -> PointSet:
    dim, pt = len(point), list(point)
    return {tuple(pt[:d] + [pt[d] + delta] + pt[d+1:])
            for d in range(dim) for delta in (-1, +1)}

print("\n" + '='*60 + '\n')
for dim in range(1,10):
    n = axial_neighbours(tuple([0] * dim))
    text = ':\n  ' + str(sorted(n)) if dim <= 2 else ''
    print(f"A {dim}-D grid point has {len(n):_} axial neighbours{text}")
    assert len(n) == 2*dim

# %% ##Output
"""
============================================================

A 1-D grid point has 2 axial neighbours:
  [(-1,), (1,)]
A 2-D grid point has 4 axial neighbours:
  [(-1, 0), (0, -1), (0, 1), (1, 0)]
A 3-D grid point has 6 axial neighbours
A 4-D grid point has 8 axial neighbours
A 5-D grid point has 10 axial neighbours
A 6-D grid point has 12 axial neighbours
A 7-D grid point has 14 axial neighbours
A 8-D grid point has 16 axial neighbours
A 9-D grid point has 18 axial neighbours
"""

# %% #Another way of classifying...
"""
The 3-D axial neighbourhood around X can be described as:
    Think of X as a cube with six sides oriented with the axes of the 3
    dimensions. The axial neighbourhood is six similar cubes with one face
    aligned with each of X's faces. A kind of 3-D cross of cubes.

In 3-D, the "full" neighbourhood around point X describes a 3x3x3 units cube
centered on X.

In 3-D: Thinking of that 3x3x3 cube around X:

    What if we excluded the six corners of that cube?
    * Those excluded corner points have all their three coordinates different
      from those of X,
      i.e. if excluded e = Point(e_x, e_y, e_z) and X = Point(x_x, x_y, x_z):
          e_x != x_x and e_y != x_y and e_z != x_z
      Also:
          |e_x - x_x| == 1 and |e_y - x_y| == 1 and |e_z - x_z| == 1

    * We _Keep_ all points in the cube that have 1 and 2 different coords to
      those of X

    Another definition of the _axial_ neighbourhood case is
    * We _Keep_ all points in the cube that have only 1 different coord to
      those of X

This can be generalised to thinking in N dimensions of neighbourhood types
keeping from 1 to _up to_ N differences in coordinates (DinC).
"""
def axial_neighbours(point: Point) -> PointSet:
    dim, pt = len(point), list(point)
    return {tuple(pt[:d] + [pt[d] + delta] + pt[d+1:])
            for d in range(dim) for delta in (-1, +1)}

#%% #Neighbourhood by Differences in Coordinates

from collections import defaultdict
from pprint import pformat
from math import comb, factorial as fact


def d_in_c_neighbourhood(dim: int) -> Dict[int, PointSet]:
    """
    Split neighbourhood cube around origin point in dim-dimensions mapping
    count of coords not-equal to origin -> those neighbours
    """
    origin = tuple([0] * dim)
    cube = {n
            for n in product(*((n-1, n, n+1)
                               for n in origin))
            if n != origin}
    d_in_c = defaultdict(set)
    for n in cube:
        d_in_c[dim - n.count(0)].add(n)
    return dict(d_in_c)

def _check_counts(d: int, c: int, n: int) -> None:
    """
    Some checks on counts

    Parameters
    ----------
    d : int
        Dimensionality.
    c : int
        Count of neighbour coords UNequal to origin.
    n : int
        Number of neighbour points with exactly c off-origin coords.

    Returns
    -------
    None.

    """
    # # From OEIS
    # if   c == 1: assert n == d* 2
    # elif c == 2: assert n == d*(d-1)* 2
    # elif c == 3: assert n == d*(d-1)*(d-2)* 4 / 3
    # elif c == 4: assert n == d*(d-1)*(d-2)*(d-3)* 2 / 3
    # elif c == 5: assert n == d*(d-1)*(d-2)*(d-3)*(d-4)* 4 / 15
    # # Getting the hang of it
    # elif c == 6: assert n == d*(d-1)*(d-2)*(d-3)*(d-4)*(d-5)* 4 / 45

    # Noticed some powers of 2 leading to
    # if c == 6: assert n == d*(d-1)*(d-2)*(d-3)*(d-4)*(d-5)* 2**6 / fact(6)
    # if c == 6: assert n == fact(d) / fact(c) / fact(d - c) * 2**c

    # Finally:
    assert n == fact(d) / fact(c) / fact(d - c) * 2**c


print("\n" + '='*60 + '\n')
for dim in range(1,10):
    d = d_in_c_neighbourhood(dim)
    tot = sum(len(n_set) for n_set in d.values())
    print(f"A {dim}-D point has a total of {tot:_} full neighbours of which:")
    for diff_count in sorted(d):
        n_count = len(d[diff_count])
        print(f"  {n_count:4_} have exactly {diff_count:2} different coords.")
        _check_counts(dim, diff_count, n_count)

# %% ##Output
"""
============================================================

A 1-D point has a total of 2 full neighbours of which:
     2 have exactly  1 different coords.
A 2-D point has a total of 8 full neighbours of which:
     4 have exactly  1 different coords.
     4 have exactly  2 different coords.
A 3-D point has a total of 26 full neighbours of which:
     6 have exactly  1 different coords.
    12 have exactly  2 different coords.
     8 have exactly  3 different coords.
A 4-D point has a total of 80 full neighbours of which:
     8 have exactly  1 different coords.
    24 have exactly  2 different coords.
    32 have exactly  3 different coords.
    16 have exactly  4 different coords.
A 5-D point has a total of 242 full neighbours of which:
    10 have exactly  1 different coords.
    40 have exactly  2 different coords.
    80 have exactly  3 different coords.
    80 have exactly  4 different coords.
    32 have exactly  5 different coords.
A 6-D point has a total of 728 full neighbours of which:
    12 have exactly  1 different coords.
    60 have exactly  2 different coords.
   160 have exactly  3 different coords.
   240 have exactly  4 different coords.
   192 have exactly  5 different coords.
    64 have exactly  6 different coords.
A 7-D point has a total of 2_186 full neighbours of which:
    14 have exactly  1 different coords.
    84 have exactly  2 different coords.
   280 have exactly  3 different coords.
   560 have exactly  4 different coords.
   672 have exactly  5 different coords.
   448 have exactly  6 different coords.
   128 have exactly  7 different coords.
A 8-D point has a total of 6_560 full neighbours of which:
    16 have exactly  1 different coords.
   112 have exactly  2 different coords.
   448 have exactly  3 different coords.
  1_120 have exactly  4 different coords.
  1_792 have exactly  5 different coords.
  1_792 have exactly  6 different coords.
  1_024 have exactly  7 different coords.
   256 have exactly  8 different coords.
A 9-D point has a total of 19_682 full neighbours of which:
    18 have exactly  1 different coords.
   144 have exactly  2 different coords.
   672 have exactly  3 different coords.
  2_016 have exactly  4 different coords.
  4_032 have exactly  5 different coords.
  5_376 have exactly  6 different coords.
  4_608 have exactly  7 different coords.
  2_304 have exactly  8 different coords.
   512 have exactly  9 different coords.
"""

# %% #Manhattan
"""
What, up until now, I have called the Differences in Coordinates is better
known as the Manhattan distance, or [Taxicab geometry][2].


[2]: https://en.wikipedia.org/wiki/Taxicab_geometry
"""

# %% #Formula for Taxicab counts of cell neighbours
"""
Function _check_counts was originally set up as a crude check of taxicab
distance of 1 as the formula was obvious.

The formulea for taxcab distances 2, 3, and 4 were got from a search on [OEIS][3]

The need for factorials is obvious. The count for a taxicab distance of N in
N-D space is 2**N which allowed me to work out a final formula:

If:
    d is the number of spatial dimiensions.
    t is the taxicab distance of a neighbouring point to the origin.
Then:
    n, the count of neighbours with exactly this taxicab distance is

        n = f(d, t)
          = d! / t! / (d-t)! * 2**t


We can use the assertion from the exercising of function neighbours() at the
beginning to state that:

        sum(f(d, t)) for t from 0 .. d
          = g(d)
          = 3**d - 1


[3]: https://oeis.org/
"""

# %% ##Taxicab visualisation
"""
Lets create a visualisation of the difference in coords for neighbours in <= 4-D.
(The function is general, but I'm lost after 3-D)!

The origin will show as zero, 0; and neighbours surround it as digits which are
the taxicab distance from 0.
"""

def to_str(taxi: Dict[int, PointSet], indent: int=4) -> str:
    """


    Parameters
    ----------
    taxi : Dict[int, PointSet]
        Map taxicab distance from origin
        -> set off neighbours with that difference.
    indent : int
        indent output with spaces

    Returns
    -------
    str
        Visusalisation of region.

    """

    if not taxi:
        return ''

    ap = set()              # all points
    neighbour2taxi = {}
    for taxi_distance, nbrs in taxi.items():
        ap |= nbrs
        for n in nbrs:
            neighbour2taxi[n] = taxi_distance

    # Dimensionality
    dims = len(n)

    # Add in origin showing as 0
    origin = tuple([0] * dims)
    ap.add(origin)
    neighbour2taxi[origin] = 0

    # Plots neighbourhood of origin (plus extra space)
    space = 1
    minmax = [range(-(1 + space), (1 + space) + 1)
              for dim in range(dims)]

    txt = []
    indent_txt = ' ' * indent
    for plane_coords in product(*minmax[2:]):
        ptxt = ['\n' + indent_txt
                + ', '.join(f"dim{dim}={val}"
                            for dim, val in enumerate(plane_coords, 2))]

        ptxt += [''.join((str(neighbour2taxi[tuple([x, y] + list(plane_coords))])
                          if tuple([x, y] + list(plane_coords)) in ap
                          else '.')
                         for x in minmax[0])
                 for y in minmax[1]]

        # Don't plot planes with no neighbours (due to extra space).
        if ''.join(ptxt).count('.') < (3 + space*2)**2:
            txt += ptxt

    return '\n'.join(indent_txt + t for t in txt)


print("\n" + '='*60 + '\n')
for dim in range(2,5):
    d = d_in_c_neighbourhood(dim)
    tot = sum(len(n_set) for n_set in d.values())
    print(f"\nA {dim}-D point has a total of {tot:_} full neighbours of which:")
    for diff_count in sorted(d):
        n_count = len(d[diff_count])
        print(f"  {n_count:4_} have taxicab distance {diff_count:2} from the origin.")
        _check_counts(dim, diff_count, n_count)
    print(to_str(d))

# %% ##Output
"""
============================================================


A 2-D point has a total of 8 full neighbours of which:
     4 have taxicab distance  1 from the origin.
     4 have taxicab distance  2 from the origin.


    .....
    .212.
    .101.
    .212.
    .....

A 3-D point has a total of 26 full neighbours of which:
     6 have taxicab distance  1 from the origin.
    12 have taxicab distance  2 from the origin.
     8 have taxicab distance  3 from the origin.

    dim2=-1
    .....
    .323.
    .212.
    .323.
    .....

    dim2=0
    .....
    .212.
    .101.
    .212.
    .....

    dim2=1
    .....
    .323.
    .212.
    .323.
    .....

A 4-D point has a total of 80 full neighbours of which:
     8 have taxicab distance  1 from the origin.
    24 have taxicab distance  2 from the origin.
    32 have taxicab distance  3 from the origin.
    16 have taxicab distance  4 from the origin.

    dim2=-1, dim3=-1
    .....
    .434.
    .323.
    .434.
    .....

    dim2=-1, dim3=0
    .....
    .323.
    .212.
    .323.
    .....

    dim2=-1, dim3=1
    .....
    .434.
    .323.
    .434.
    .....

    dim2=0, dim3=-1
    .....
    .323.
    .212.
    .323.
    .....

    dim2=0, dim3=0
    .....
    .212.
    .101.
    .212.
    .....

    dim2=0, dim3=1
    .....
    .323.
    .212.
    .323.
    .....

    dim2=1, dim3=-1
    .....
    .434.
    .323.
    .434.
    .....

    dim2=1, dim3=0
    .....
    .323.
    .212.
    .323.
    .....

    dim2=1, dim3=1
    .....
    .434.
    .323.
    .434.
    .....

"""

Go deh!

Wednesday, March 24, 2021

Using Sympy for linear interpolation.

Wednesday, February 17, 2021

Longest Common Substring: An investigation using Python.

Task

The "Rote search" method

Dynamic Programming.

An example:

Code

Dynamic Programming for more than two strings?

The longest of the common substrings of all substrings of each string

Code

Comparative Timings

Now for timing 3-string runs:

Stop Press! Incremental DP

Timings

Conclusion.

Saturday, February 13, 2021

Python: Counting without Counter

1. When the keys to count are known, adjacent, integers.

2. When defaultdict(int) is better

Conclusion

End.

Tuesday, February 02, 2021

Pythone'er abuses walrus

A new task

Iterative solution

The walrus hack.

STOP PRESS!

Sunday, January 31, 2021

Counting types of neighbour - in Python.

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive