Sunday, October 31, 2021

Python: *args, **kwargs is hard work!

Normally one would start such a post with why parameters *args and **kwargs shouldn't be used, but that is incorrect I think you should use them as much as you can document their use!

Everywhere they are used, you should document:

Just what the expected values are.
What they do;.
And how they interact.

That's the information usually missing when it comes to maintenance time that should be present.

If your use comes from incorporating a library then at least document the args and kwargs that you use in your source code. This should put a damper on routines that pass the same *args and **kwargs to two different modules as the effort to trace used values and ensure no negative interactions between the two called modules is a lot of work and would take some time to document.

If you are not willing to document properly than that, there, is an indication that you are adding technical debt to your code, and that thought should have the effect of limiting their use.

Friday, October 08, 2021

Converting an algorithm to use Pythons' new Structural Pattern Matching

Python 3.10 is out and browsing what's new, I decided to convert one of my existing routines to use the new Structural Pattern Matching statement to try it out.

I used fgrep -c elif, as part of a long unix command to find my algorithms with a lot of elif statements, and then chose my routine that evaluates Ackermann's function without explicit recursive function calls and with a lot of optimisations.

from collections import deque

def ack_ix(m, n):
    "Paddy3118's iterative with optimisations on m"

    stack = deque([])
    stack.extend([m, n])

    while  len(stack) > 1:
        n, m = stack.pop(), stack.pop()

        if   m == 0:
            stack.append(n + 1)
        elif m == 1:
            stack.append(n + 2)
        elif m == 2:
            stack.append(2*n + 3)
        elif m == 3:
            stack.append(2**(n + 3) - 3)
        elif n == 0:
            stack.extend([m-1, 1])
        else:
            stack.extend([m-1, m, n-1])

    return stack[0]

Note that first n, then m is popped off the stack in the first line within the while loop.

Checking the match statement examples PEP-634, I create a match subject expression as a list of first the n then the m value and replace the if statement with case statements covering each of the if conditionals and also assigning to n and m as appropriate, to create:

def ack_im(m, n):
    "Paddy3118's iterative with optimisations on m using match of lists"

    stack = deque([])
    stack.extend([m, n])

    while  len(stack) > 1:
        match  [stack.pop(), stack.pop()]:
            case [n, 0]:
                stack.append(n + 1)
            case [n, 1]:
                stack.append(n + 2)
            case [n, 2]:
                stack.append(2*n + 3)
            case [n, 3]:
                stack.append(2**(n + 3) - 3)
            case [0, m]:
                stack.extend([m-1, 1])
            case [n, m]:
                stack.extend([m-1, m, n-1])

    return stack[0]

I had thought that it would be more natural to use tuples rather than lists for the subject expressions and case patterns as there are always only two things being matched, so created a third version:

def ack_imt(m, n):
    "Paddy3118's iterative with optimisations on m using match of tuples"

    stack = deque([])
    stack.extend([m, n])

    while  len(stack) > 1:
        match  (stack.pop(), stack.pop()):
            case (n, 0):
                stack.append(n + 1)
            case (n, 1):
                stack.append(n + 2)
            case (n, 2):
                stack.append(2*n + 3)
            case (n, 3):
                stack.append(2**(n + 3) - 3)
            case (0, m):
                stack.extend([m-1, 1])
            case (n, m):
                stack.extend([m-1, m, n-1])

    return stack[0]

Checks/Timings

if __name__ == "__main__":
    import timeit
    assert (ix := ack_ix(4, 2)) == (im := ack_im(4, 2)) == (im := ack_imt(4, 2)), "Whoops!"
    print(f"Answer has {len(str(ix))} decimal digits")
    n = 100_000
    print(f"Timings for {n:_} repetitions of ack(4, 2)")
    for func in 'ack_ix ack_im ack_imt'.split():
        print(f"  {func+':':8}",
              timeit.timeit(f"{func}(4, 2)",
                            number=n,
                            setup=f"from __main__ import {func}"))

Running the code produces the following output:

Answer has 19729 decimal digits
Timings for 100_000 repetitions of ack(4, 2)
  ack_ix:  15.179398099950049
  ack_im:  15.368259999959264
  ack_imt: 15.472230200015474

Personal conclusion

There isn't much difference in runtime between if and match statements. This example only touches on the match statements capabilities, and so seems equally expressive as the if statement variant to me.

Wednesday, June 09, 2021

Python hack: Creating local variables in a comprehension

Lets say you have data and want to create a list of the running sum of the data. For example if the data is [1,2,3] the running sum is [1, 3, 6]

Doing this outside a comprehension:

data = [1, 2, 3]

s = 0  # accumulator
sums = []

for x in data:
    s += x
    sums.append(s)
print(sums)  # [1, 3, 6]

Now this needs the accumulator s, initialised to zero, but comprehensions create their own local variables and the syntax does not allow you to write a simple assignment within it, and each item from the comprehension is stated first in its syntax .

The Hack

I'll just show the hack then work through it afterwards.

In [1]: data = [1, 2, 3]

In [2]: [s for s in [0] for x in data for s in [s + x]]
Out[2]: [1, 3, 6]

When converting a comprehension into similar for statements then the output expression at the beginning of the comprehension is thought of as moving to inside the rightmost if or for section of the comprehension, so we get:

# Comprehension over many lines
[s                          # output expression
 for s in [0]               # For clauses (nested)
     for x in data
         for s in [s + x]]

# Is similar too...
for s in [0]:                   # For clauses (nested)
    for x in data:
        for s in [s + x]:
            print(s, end=' ')   # output expresion: 1 3 6

Explanation

In the comprehension, the initial

[s for s in [0] ...

says:

Individual items of the comprehension will be the expression s.
(Remember the output expression is stated first , but from the environment at the right of the comprehension).
In the comprehensions local scope we use the one-entry outer for loop to set local s to zero.

The middle for loop of the comprehension just iterates over the data

The final for loop of the comprehension is special:

... for s in [s + x]]

s is set to itself plus the next item of data, x, using iteration over a one element list [s + x]:

For the first x, s was initialised to zero in the local scope via the outermost for.
s becomes 0 + data[0] in the inner loop and becomes the first output expression value, 1.
For the second iteration of the middle loop, x = data[1], so s then becomes 0 + data[0] + data[1]. The second evaluation of the the output expression for the comprehension, 3.
And so on...

Multiple local variables

We can generalise this. Here we generate running sums, and running sums of the squares which needs two local variables s and s2:

In [3]: data = [1, 2, 3]

In [4]: [(s, s2) for s, s2 in [(0, 0)] for x in data for s, s2 in [(s + x, s2 + x**2)]]
Out[4]: [(1, 1), (3, 5), (6, 14)]

Summary

You can satisfy the need for local variables in comprehensions.
Its a hard to understand hack!

UPDATE: Added Walrus:

The walrus operator can now be used to give a more readable equivalent.

This introduces the external initialised variable into the comprehension as well as keeping the running sums.

In [5]: # We had:

In [6]: data = [1, 2, 3]

In [7]: [s for s in [0] for x in data for s in [s + x]]
Out[7]: [1, 3, 6]

In [9]: # With :=

In [10]: s = 0

In [11]: [s := (s + x) for x in data]
Out[11]: [1, 3, 6]

In [12]: 

In [13]: # We then had:

In [14]: del s

In [15]: [(s, s2) for s, s2 in [(0, 0)] for x in data for s, s2 in [(s + x, s2 + x**2)]]
Out[15]: [(1, 1), (3, 5), (6, 14)]

In [16]: # Which becomes:

In [17]: s = s2 = 0

In [18]: [(s := s + x, s2 := s2 + x**2) for x in data]
Out[18]: [(1, 1), (3, 5), (6, 14)]

(Those external variables s and s2 are updated to the last running sum).

Source

This all came about because I re-read the "Whats new in Python 3.9" doc after upgrading Anaconda and came across code I couldn't initially fathom.

End.

Tuesday, April 27, 2021

Read PEP 636 on Structural Pattern Matching: The Tutorial

Hi, I am just reading through PEP 636 -- Structural Pattern Matching: Tutorial. I'll give my views as I go along...

I really liked the choice of example - a text adventure. I am old enough to relate to it, and I guess a quick google would inform others.

Section "Matching sequences" serves as a gentle introduction to a match statement with one case

Subsequent sections expand on aspects of match: literals seem to match literally; names, (also called variables), when matched are also bind-to for use in the stastements, (plueral), of the case block.

skipping a few sections to something that is stated, that is nevertheless a departure from previous uses, that is the treatment of the underscore.

Underscores

In section "Adding a wildcard" underscore becomes more than just a valid name, as it is outside the match statement.

Underscore always matches an object - just like any other identifier name; but does not get assigned the object it matches - like any other name.

Their example:

match command.split():
    case ["quit"]: ... # Code omitted for brevity
    case ["go", direction]: ...
    case ["drop", *objects]: ...
    ... # Other cases
    case _:
        print(f"Sorry, I couldn't understand {command!r}")

You cannot write the following

match command.split():
    case ["quit"]: ... # Code omitted for brevity
    case ["go", direction]: ...
    case ["drop", *objects]: ...
    ... # Other cases
    case _:
        print(f"Sorry, I couldn't understand {_!r}")

As _ is not assigned. Is it de-assigned in this block or will an earlier assignment outside the match statement still exist?

I am curious as to the decision for this extra non-assignment behaviour for underscore?

Testing Scope and Assignment

Lets expand on their example assigning names globally and trying different commands to match different case clauses:

_, direction, objects = "_ global", "direction global", "objects global"
print(f"{(_, direction, objects)=}")

command = "Nothing to match"
for command in ["quit", "quit me", "go south", "drop anvil arrows"]:
    print(f"\n## COMMAND IS: {command!r}\n")
    match command.split():
        case ["quit"]:
            print("quit")
            print(f"{(_, direction, objects)=}")
        case ["go", direction]:
            print("go", direction)
            print(f"{(_, direction, objects)=}")
        case ["drop", *objects]:
            print("drop", objects)
            print(f"{(_, direction, objects)=}")
        case _:
            print(f"Sorry, I couldn't understand {command!r}")
            print(f"{(_, direction, objects)=}")

Output:

Python 3.10.0a7 (tags/v3.10.0a7:53e5529, Apr  6 2021, 10:20:47) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> 
======== RESTART: C:/Users/paddy/Google Drive/Code/python_3_10_match.py ========
(_, direction, objects)=('_ global', 'direction global', 'objects global')

## COMMAND IS: 'quit'

quit
(_, direction, objects)=('_ global', 'direction global', 'objects global')

## COMMAND IS: 'quit me'

Sorry, I couldn't understand 'quit me'
(_, direction, objects)=('_ global', 'direction global', 'objects global')

## COMMAND IS: 'go south'

go south
(_, direction, objects)=('_ global', 'south', 'objects global')

## COMMAND IS: 'drop anvil arrows'

drop ['anvil', 'arrows']
(_, direction, objects)=('_ global', 'south', ['anvil', 'arrows'])
>>>

Concentrating on underscore, we see that it always has the initially assigned global value.

Once direction is changed in the case statement, however, it is this global value that is changed and the next iteration of the loop that does not match to it shows its new changed value.

The special treatment of underscore seems odd to me? Why not assign to it like normal and have only the special treatment of checking that if it is the only pattern then that pattern is the last pattern?

As

Reading section "Capturing matched sub-patterns" I kind of thought ahead and was thinking "another use for the walrus operator", until the use of the as keyword revealed itself :-)

"as" reads well, I like it.

Type and attribute matching

Yay! Duck type friendly matching of attribute names.

Outside of a match statement a call of MyObject might require arguments, but it seems that MyObject() is valid in a match statement, regardless - something to remember.

END?

I'll stop here for now although there is more of the doc to read.

Tuesday, April 06, 2021

Practical numbers

Best viewed on larger screens.

Wednesday, March 24, 2021

Using Sympy for linear interpolation.

Wednesday, February 17, 2021

Longest Common Substring: An investigation using Python.

(Best browsed on a larger than phone screen)

Someone updated the RC task, and I couldn't remember attempting it so took a closer look at the Python solution, (at least the more idiomatic of the two).

Task

The task is: Given two character strings, find the longest string of characters present anywhere in both strings.

I should point out that there is a different, well known, task which is to find the longest common subsequence between two strings.

A sub**sequence** if a string is any characters from the string, in the order that they appear in the string, *but intervening characters can be skipped*.
There is no such skipping allowed in a sub**string**

The "Rote search" method

I took a look at the Python on Rosetta Code, it is by user Halilmal:

s1 = "thisisatest"
s2 = "testing123testing"
len1, len2 = len(s1), len(s2)
ir, jr = 0, -1
for i1 in range(len1):
    i2 = s2.find(s1[i1])
    while i2 >= 0:
        j1, j2 = i1, i2
        while j1 < len1 and j2 < len2 and s2[j2] == s1[j1]:
            if j1-i1 >= jr-ir:
                ir, jr = i1, j1
            j1 += 1; j2 += 1
        i2 = s2.find(s1[i1], i2+1)
print (s1[ir:jr+1])

test

The code works. To understand it I decided to take the liberty of renaming and extracting a function from the original. I also named the input strings from zero rather than 1.

s0 = "thisisatest"
s1 = "testing123testing"

def lcs_rs(s0, s1):
    "Commented/longer names, RosettaCode entry by user Halilmali"
    len0, len1 = len(s0), len(s1)
    max_sub_lh, max_sub_rh = 0, -1   # Start LH and RH indices of max substring
    for i0 in range(len0):      # Start LH index of s0 char
        i1 = s1.find(s0[i0])    # Start LH index of (first) s1 char match
        while i1 >= 0:
            sub_end_s0, sub_end_s1 = i0, i1     # End indices of this substring match
            while (sub_end_s0 < len0 and
                   sub_end_s1 < len1 and
                   s1[sub_end_s1] == s0[sub_end_s0]):
                if sub_end_s0 - i0 >= max_sub_rh-max_sub_lh:
                    max_sub_lh, max_sub_rh = i0, sub_end_s0     # New Max
                sub_end_s0 += 1
                sub_end_s1 += 1
            i1 = s1.find(s0[i0], i1+1)   # Start LH index of next s1 char match
    return s0[max_sub_lh:max_sub_rh+1]

print (lcs_rs(s0, s1))

test

In this algorithm we have the two strings:

          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6

  s0[i0]: t h i s i s a t e s t

  s1[i1]: t e s t i n g 1 2 3 t e s t i n g

For every character in s0 starting from the left, it looks up that character in s1 then tries to extend the matching string to the right as much as possible; moving to second and subsequent matches in s1; all the while keeping track of the largest substring match so far.

It seems to do a lot of traversal of the strings!

Dynamic Programming.

I searched and found a site that had a dynamic programming solution.

I liked the algorithm. In this you start with an X-Y array of zeroes, what I call the dp array:

#s1[j] t  e  s  t  i  n  g  1  2  3  t  e  s  t  i  n  g;  # s0[i]  
dp = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # t
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # h
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # i
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # s
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # i
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # s
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # a
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # t
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # e
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # s
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]] # t

You compare every character of s0 with every character of s1 in order of increasing i and j and fill in the table like this:

if s0[i] == s1[j] then dp[i][j] = dp[i-1][j-1] + 1 You are comparing two characters from each string. If they are equal, then accumulate how many characters have been equal by adding one to the equality count for the dp coordinate corresponding to accumulated run of equivalences for the "previous" character.

There is an issue with what to do at the "edges". When i or jare zero then set dp[i][j] = 1

After filling the dp array, to get the longest substring search dp for a maximum value, mx, then mx is the number of characters in the longest substring and the coordinates of mx give the end index iof the string in s0 and s1 respectively.

An example:

s0 = 'ABBA'
s1 = 'AABB'

#s1[j] A  A  B  B;   # s0[i]

dp = [[1, 1, 0, 0],  # A
      [0, 0, 2, 1],  # B
      [0, 0, 1, 3],  # B
      [1, 1, 0, 0]]  # A

The above fills in dpafter using the equivalent to for i in range(len(s0)): for j in range(len(s1)): ... to traverse the array. The maximum value is clearly 3 at i,j == (2, 3). Using the i coordinate corresponding to the end of the longest substring being the first B in s0 knowing the length, the longest substring must be 'ABB'.

You can cross check that the same 3 character substring appears in s1 ending at s1[3]

Code

from itertools import product

def dp_array_print(s0, s1, dp, ans):
    print(f"\ns0 = {repr(s0)}\ns1 = {repr(s1)}\n")
    print(f"#s1[j] {'  '.join(c for c in s1)};   # s0[i]\n")
    for i, (dp_row, ch0) in enumerate(zip(dp, s0)):
        txt = f"      {dp_row},  # {ch0}"
        if i == 0:
            txt = "dp = [" + txt[6:]
        if i + 1 == len(s0):
            txt = txt.replace(',  #', ']  #')
        print(txt)
    print('\nAnswer = ', repr(ans),'\n=====================')

def lcs_dp(s0, s1):
    len0, len1 = len(s0), len(s1)

    # Bottom up Dynamic Programming table gen
    dp = [[0] * len1 for _ in range(len0)]
    for i, j in product(range(len0), range(len1)):
        dp[i][j] = (1 + (dp[i-1][j-1] if i and j else 0)   # 0 if off "edge"
                    if s0[i] == s1[j]   # Only increment if same chars
                    else 0)

    # Find max
    mx, ij_mx = 0, (-1, -1)
    for i, j in product(range(len0), range(len1)):
        if dp[i][j] > mx:
            mx, ij_mx = dp[i][j], (i, j)

    return s0[1 + ij_mx[0] - mx: 1 + ij_mx[0]], dp


s0, s1 = 'ABBA', 'AABB'

ans01, dp01 = lcs_dp(s0, s1)
dp_array_print(s0, s1, dp01, ans01)

s0 = 'ABBA'
s1 = 'AABB'

#s1[j] A  A  B  B;   # s0[i]

dp = [[1, 1, 0, 0],  # A
      [0, 0, 2, 1],  # B
      [0, 0, 1, 3],  # B
      [1, 1, 0, 0]]  # A

Answer =  'ABB' 
=====================

The code isn't too bad, but there is that nested if-expression needed to calculate the dp[i][j] value under all cases.

Dynamic Programming for more than two strings?

For three strings I thought of filling in a dp cube in a similar way to the dp array, should work.

After trying a few two-string examples I noticed that there could be a fair amount of zeroes in the dp array. I decided to therefore make the dp-cube a sparse array held in a dict with a tuple of the string indices of any point and only keeping non-zero counts in the dp sparse dict.
I chose not to use a defaultdict(int) for the dp dict as it would accumulate zero values for accesses to any tuple coordinate having a -1 in it.

It prooved to be easy enough to move from first doing a three-string function to actually going ahead and doing a two or more string version straight away.

def lcs_dpx(*strings):
    "Longest Common Substring: Dynamic Programming solution for many strings"
    index_ranges  = [range(len(s)) for s in strings]

    # Bottom up Dynamic Programming table gen
    dp = {}  
    for indices in product(*index_ranges):
        string_chars = (s[i] for i, s in zip(indices, strings))
        first_char = next(string_chars)
        if all(next_char == first_char for next_char in string_chars):
            prev = tuple(i-1 for i in indices) # Index previous chars
            dp[indices] = 1 + (dp[prev] if prev in dp else 0)

    # Find max
    mx, mx_ind = 0, tuple(-1 for _ in strings)
    for indices, val in dp.items():
        if val > mx:
            mx, mx_ind = val, indices

    return strings[0][1 + mx_ind[0] - mx: 1 + mx_ind[0]], dict(dp)


s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"

ans, dp = lcs_dpx(s0, s1)
print(f"\n{repr(s0)}, {repr(s1)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_dpx(s0, s2)
print(f"\n{repr(s0)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_dpx(s1, s2)
print(f"\n{repr(s1)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_dpx(s0, s1, s2)
print(f"\n{repr(s0)}, {repr(s1)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")

'thisisatest_stinger', 'testing123testingthing' ->> 'sting';  len(dp) = 48;

'thisisatest_stinger', 'thisis' ->> 'thisis';  len(dp) = 19;

'testing123testingthing', 'thisis' ->> 'thi';  len(dp) = 16;

'thisisatest_stinger', 'testing123testingthing', 'thisis' ->> 'thi';  len(dp) = 55;

This works too!
The code uses product to perform arbitrarily nested for loops. The logic for incrementing dp is simplified as dp starts without, and cannot contain an indices tuple with a -1 in it as that is generated in prev and dp[prev]is never assigned to.

Strings containing long runs of the samwe single character will have the largest number of matches and so the largest dpsize for this algorithm.

The longest of the common substrings of all substrings of each string

This algorithm, I saw mentioned somewhere. From just the menton of sets, i worked out what would be needed in Python, without bothering to understand the Perl original.

This method

Works out the set of all the possible sub-strings for each individual string.
Finds the set of substrings common to all the strings
Returns the common substring that is the longest.

The description of the algorithm "does exactly what it says on the tin". I decided to create a version that worked with two and more strings as it seemed the most straight-forward of extensions.

Code

function _set_of_substringsreturns the set of all possible substrings of its argument. Rather than continue to just generate all the substrings of all strings, function _set_of_common_substringsis used for subsequent strings to accumulate and return the set of only the common substrings. This latter function also makes use of the walrus operator to assign to this.

def _set_of_substrings(s:str) -> set:
    "_set_of_substrings('ABBA') == {'A', 'AB', 'ABB', 'ABBA', 'B', 'BA', 'BB', 'BBA'}"
    len_s = len(s)
    return {s[m: n] for m in range(len_s) for n in range(m+1, len_s +1)}

def _set_of_common_substrings(s:str, common: set) -> set:
    "Substrings of s that are also in common"
    len_s = len(s)
    return {this for m in range(len_s) for n in range(m+1, len_s +1)
            if (this := s[m:n]) in common}

def lcs_ss(*strings):
    "longest of the common substrings of all substrings of each string"
    strings_iter  = (s for s in strings)
    common = _set_of_substrings(next(strings_iter)) # First string substrings
    for s in strings_iter:
        if not common:
            break
        common = _set_of_common_substrings(s, common) # Accumulate the common

    return max(common, key= lambda x: len(x)) if common else ''


s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"

ans = lcs_ss(s0, s1)
print(f"\n{repr(s0)}, {repr(s1)} ->> {repr(ans)}")
ans = lcs_ss(s0, s2)
print(f"\n{repr(s0)}, {repr(s2)} ->> {repr(ans)}")
ans = lcs_ss(s1, s2)
print(f"\n{repr(s1)}, {repr(s2)} ->> {repr(ans)}")
ans = lcs_ss(s0, s1, s2)
print(f"\n{repr(s0)}, {repr(s1)}, {repr(s2)} ->> {repr(ans)}")

'thisisatest_stinger', 'testing123testingthing' ->> 'sting'

'thisisatest_stinger', 'thisis' ->> 'thisis'

'testing123testingthing', 'thisis' ->> 'thi'

'thisisatest_stinger', 'testing123testingthing', 'thisis' ->> 'thi'

Comparative Timings

Simplistic data so have a pinch of salt to hand:

s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"

%timeit _ = lcs_rs(s0, s1)

122 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit _ = lcs_dp(s0, s1)

315 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit _ = lcs_dpx(s0, s1)

1.4 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit _ = lcs_ss(s0, s1)

265 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now for timing 3-string runs:

%timeit _ = lcs_dpx(s0, s1, s2)

8.4 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit _ = lcs_ss(s0, s1, s2)

284 µs ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Stop Press! Incremental DP

It seems to me that in the multi-string, dynamic programming case, you should be able to create the dpsparse array for the first two strings then add in subsequent strings to successive dpsparse arrays incrementing dimensionality. When adding a dimension/string to dp, the search for string equality only needs to be done where there were matches for all previous strings/lower dimentionality dp.

def lcs_idp(*strings):
    "Longest Common Substring: Incremental Dynamic Programming for many strings"
    index_ranges  = [range(len(s)) for s in strings]

    ## Incremental Dynamic Programming table gen
    # First two strings
    dp = {}
    for indices in product(*index_ranges[:2]):
        string_chars = (s[i] for i, s in zip(indices, strings[:2]))
        first_char = next(string_chars)
        if all(next_char == first_char for next_char in string_chars):
            prev = tuple(i-1 for i in indices) # Index previous chars
            dp[indices] = 1 + (dp[prev] if prev in dp else 0)
    # Next strings
    for index_range, string in zip(index_ranges[2:], strings[2:]):
        dp_last, dp = dp, {}
        for last_match_index in dp_last:      # These are non-zero from before
            last_match_char = strings[0][last_match_index[0]]
            for index, this_char in enumerate(string):
                if last_match_char == this_char:
                    indices = tuple(list(last_match_index) + [index])
                    prev = tuple([i-1 for i in last_match_index] + [index - 1])
                    dp[indices] = 1 + (dp[prev] if prev in dp else 0)

    # Find max
    mx, mx_ind = 0, tuple(-1 for _ in strings)
    for indices, val in dp.items():
        if val > mx:
            mx, mx_ind = val, indices

    return strings[0][1 + mx_ind[0] - mx: 1 + mx_ind[0]], dict(dp)


s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"

ans, dp = lcs_idp(s0, s1)
print(f"\n{repr(s0)}, {repr(s1)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_idp(s0, s2)
print(f"\n{repr(s0)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_idp(s1, s2)
print(f"\n{repr(s1)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")
ans, dp = lcs_idp(s0, s1, s2)
print(f"\n{repr(s0)}, {repr(s1)}, {repr(s2)} ->> {repr(ans)};  len(dp) = {len(dp)};")

'thisisatest_stinger', 'testing123testingthing' ->> 'sting';  len(dp) = 48;

'thisisatest_stinger', 'thisis' ->> 'thisis';  len(dp) = 19;

'testing123testingthing', 'thisis' ->> 'thi';  len(dp) = 16;

'thisisatest_stinger', 'testing123testingthing', 'thisis' ->> 'thi';  len(dp) = 55;

Timings

%timeit _ = lcs_idp(s0, s1)

1.5 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit _ = lcs_idp(s0, s1, s2)

1.73 ms ± 59.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This new Incremental Dynamic Programming approach seems to be best the straight DP algorithm for three and more strings as it prunes the (hyper) cube of comparisons.

Conclusion.

The timings are based on a laughably small example. You really need to try the routines on your own data. lcs_ssand lcs_idp should be checked with your own sample data to see what is right for you. They both try and cut down on memoryand computations, but your type of data might make one outclas the other.

Note: There could be another algorithm that makes use of suffix trees...

END.

Sunday, October 31, 2021

Friday, October 08, 2021

Checks/Timings

Personal conclusion

Wednesday, June 09, 2021

The Hack

Explanation

Multiple local variables

Summary

UPDATE: Added Walrus:

Source

End.

Tuesday, April 27, 2021

Underscores

Testing Scope and Assignment

Type and attribute matching

END?

Tuesday, April 06, 2021

Wednesday, March 24, 2021

Wednesday, February 17, 2021

Task

The "Rote search" method

Dynamic Programming.

An example:

Code

Dynamic Programming for more than two strings?

The longest of the common substrings of all substrings of each string

Code

Comparative Timings

Now for timing 3-string runs:

Stop Press! Incremental DP

Timings

Conclusion.

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive