Go deh!

Mainly Tech projects on Python and Electronic Design Automation.

Saturday, May 21, 2022

Big nO, How the interpreted nature of Python could skew quoted Big-O use.

 Or: If time is an issue then measure!

I was recently introduced to the Boyer-Moore MJRTY algorithm for determining which, if any, of an arbitrary
number of candidates has received a majority of the votes cast in an election.

The linkedin post  by Prof. Wissam Fawaz gave an implementation of that algorithm and I gave my own version, reproduced below. 

from typing import Any
from collections import Counter
import random


def boyer_moore(seq: list[Any]) -> tuple[bool, Any]:
    "Returns IF there is a majority vote winner, and the winner."
    m, i = None, 0
    for x in seq:
        m, i = ((x, 1) if i == 0
                else ((m, i+1) if m == x
                      else (m, i-1)))
    return seq.count(m) > len(seq) // 2, m


 The algorithm, given a list of peoples votes such as if thirteen people are voting between parties A, B, and C then their votes might be represented by the list AAACCBBCCCBCC. The MJRTY algorithm takes just two passes through the list to determine if some party has more than half the votes, and who that is.

I decided to also code  a solution using collections.Counter, which is the first consideration when a problem includes counting in Python:

def counted(seq: list[Any]) -> tuple[bool, Any]:
    "Uses collections.Counter"
    m, count = Counter(seq).most_common(1)[0]
    return count > len(seq) // 2, m

Given the problem statement, it is more evident how the counted version works:

  1. Find the most common vote
  2. See if it amounts to more than half of all votes.
The description of how and why the MJRTY is given in the book and is a little more involved.

Testing.

Like all tests, it has holes, but I decided to concentrate on the run time for longer and longer lists that were randomised, and also had a 50/50 chance of producing a winning candidate.

First a generator of suitable data:

def list_gen(odd_count=11) -> list[int]:
    """
    Generates a list with either one item appearing more
    than all the rest or not.
    """
    half_count = odd_count // 2
    if random.choice((True, False)):
        lst = random.choices([2, 3], k=half_count)     + [1] * (half_count + 1)
    else:
        lst = random.choices([2, 3], k=half_count + 1) + [1] * half_count
    random.shuffle(lst)
    return lst

for i in range(5):
    print((lst := list_gen()), boyer_moore(lst), counted(lst))

"""
Sample output from loop above:
[1, 2, 1, 2, 1, 3, 1, 3, 1, 2, 2] (False, 2) (False, 1)
[2, 2, 3, 1, 1, 1, 1, 1, 2, 1, 3] (True, 1) (True, 1)
[1, 1, 1, 2, 1, 2, 2, 3, 1, 1, 3] (True, 1) (True, 1)
[1, 1, 1, 2, 1, 2, 3, 1, 2, 2, 1] (True, 1) (True, 1)
[2, 1, 2, 1, 1, 2, 1, 3, 3, 2, 1] (False, 1) (False, 1)
"""

Then wrap each implementation and timeit (Thanks IPython/Spyder).

#%% Timings

def timed_bm(count=1_000):
    boyer_moore(list_gen(count))

def timed_c(count=1_000):
    counted(list_gen(count))

for i in range(1, 7):
    print('\n##',10**i)
    print("# Boyer-Moore")
    %timeit timed_bm(10**i)  # Spyder/IPython IDE specific
    print("# Counted")
    %timeit timed_c(10**i)

"""
Sample timed output:
## 10
# Boyer-Moore
10.8 µs ± 44.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# Counted
13.6 µs ± 44.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

## 100
# Boyer-Moore
73.7 µs ± 89.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Counted
69.9 µs ± 518 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

## 1000
# Boyer-Moore
739 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Counted
661 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

## 10000
# Boyer-Moore
7.58 ms ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Counted
6.91 ms ± 80.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

## 100000
# Boyer-Moore
75.3 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Counted
67.8 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

## 1000000
# Boyer-Moore
780 ms ± 5.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Counted
710 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""

Results

The run times are comparable over from ten to a million voters given the restrictions of the test strategy.

the vote generator restricts the number of parties to three which favours the dict underlying Counter. A more accurate comparison might be made by taking different sized samples from actual data.

END.




Sunday, April 03, 2022

Easier Regexps

 Python, although a scripting language, does not bake regular expressions into the language syntax like other scripting languages such as AWK, Perl and Ruby. This can make people pause before using the re library that Python supplies, but it has its place, and regular expressions are a handy tool in results.

Over the decades, I have refined my use of regular expressions and want to pass on a useful style, and tools that I find very useful.

1. Use Regex101.com!

  • Open the site in another tab.
  • Reading down the main textual menu to the left, select the  Python language.

2. Make regexps readable.

  • In the  "REGULAR EXPRESSIONS" section, top centre:
    • Hover over the r" text to the left of the entry box and it will say "Change delimiter". Click and change the delimiter to r""".
    • Hover over the """gm text to the left of the central window which should then say "Set Regex Options". Select the g, m and x options. 

The above allows you to enter a regexp over multiple lines and with comments. Spaces are ignored on the whole, and you should use the pattern \s or \n etc to represent whatever whitespace you wish to match.

3. Get a representative selection of what you wish to match.

Know your data! Read it, and select sections that convey all the intricacies that your regexp needs to match. Careful when taking several excerpts from, say, a log, to ensure that the excerpts, together are valid.

  • Paste your sample text in the main "TEST STRING" section.

4. Use named capture groups.

When creating your exression in the top centre section any capture-group should be a named capture-group.

  • Use uppercase group names - this usually contrasts with other parts of the regexp aiding readability.
  • Use short, meaningful group names.

These names might become lowercased variable names in a program.

5. You can comment in the regexp!!!

Yes you can.

6. Debug on site.

Regex101 explains the regexp and all the matches of the regexp on the test string as you hover on parts of the regexp or test string. There is extra info in the right hand side sections too.

7. Export the code.

  • Under the lower LHS TOOLS menu select Code Generator.

Cut-n-paste the code into your favourite editor.

I tend to take only parts of the generated code: the regexp, the test_str and the re compile options are useful.

An Example: Harvesting colours.

I was looking for a list of distinctive colours and found this page: List of 20 Simple, Distinct Colors

If you click on the convenient button, the colours adjust to a nice order, but selecting a region around the colour table and pasting into an editor gives one long  concatenation of the table information amongst surrounding text.

I wanted something more readable so pasted the long string into regex101 here and created the regular expression to parse it. (Oh, that's another feature - you can create a regex101 account and save regexps).

The final script creates a nicely formatted Python list of the data:

# -*- coding: utf-8 -*-
"""
Parse Table grab from https://sashamaps.net/docs/resources/20-colors/

Created on Sat Apr  2 19:36:07 2022

@author: paddy
"""

_initial_screen_cut = """
Accessibility: Red#e6194B1Green#3cb44b2Yellow#ffe1193Blue#4363d84Orange#f582315Purple#911eb46Cyan#42d4f47Magenta#f032e68Lime#bfef459Pink#fabed410Teal#46999011Lavender#dcbeff12Brown#9A632413Beige#fffac814Maroon#80000015Mint#aaffc316Olive#80800017Apricot#ffd8b118Navy#00007519Grey#a9a9a920White#ffffff21Black#00000022
'#e6194B', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4', '#42d4f4', '#f032e6', '#bfef45', '#fabed4', '#469990', '#dcbeff', '#9A6324', '#fffac8', '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#a9a9a9', '#ffffff', '#000000'
Test:

"""
import re

regex = r"""
	# Parse Table grab from https://sashamaps.net/docs/resources/20-colors/
	(?:
	  (?P<NAME>[A-Za-z]+)
	  \# (?P<HEX>[0-9a-fA-F]{6})
	  (?P<ID>\d+)
	)
"""

test_str = ("Red#e6194B1Green#3cb44b2Yellow#ffe1193Blue#4363d84Orange#f582315Pu"
            "rple#911eb46Cyan#42d4f47Magenta#f032e68Lime#bfef459Pink#fabed410Te"
            "al#46999011Lavender#dcbeff12Brown#9A632413Beige#fffac814Maroon#800"
            "00015Mint#aaffc316Olive#80800017Apricot#ffd8b118Navy#00007519Grey#"
            "a9a9a920White#ffffff21Black#00000022")

distinct = [(int(match.groupdict()['ID']), match.groupdict()['NAME'], int(match.groupdict()['HEX'], 16))
            for match in  re.finditer(regex, test_str, re.MULTILINE | re.VERBOSE)]

#print(f"{distinct=}")

print('\ndistinct_colours = [\n# NAME          HEX_CODE')
for id, name, num in distinct:
    print(f" ({repr(name)+',':<12}  0x{num:06x}),")
print(']')

The output:

distinct_colours = [
# NAME          HEX_CODE
 ('Red',        0xe6194b),
 ('Green',      0x3cb44b),
 ('Yellow',     0xffe119),
 ('Blue',       0x4363d8),
 ('Orange',     0xf58231),
 ('Purple',     0x911eb4),
 ('Cyan',       0x42d4f4),
 ('Magenta',    0xf032e6),
 ('Lime',       0xbfef45),
 ('Pink',       0xfabed4),
 ('Teal',       0x469990),
 ('Lavender',   0xdcbeff),
 ('Brown',      0x9a6324),
 ('Beige',      0xfffac8),
 ('Maroon',     0x800000),
 ('Mint',       0xaaffc3),
 ('Olive',      0x808000),
 ('Apricot',    0xffd8b1),
 ('Navy',       0x000075),
 ('Grey',       0xa9a9a9),
 ('White',      0xffffff),
 ('Black',      0x000000),
]

 

END.

Sunday, October 31, 2021

Python: *args, **kwargs is hard work!

 Normally one would start such a post with why parameters *args and **kwargs shouldn't be used, but that is incorrect I think you should use them as much as you can document their use!

Everywhere they are used, you should document:

  1. Just what the expected values are.
  2. What they do;.
  3. And how they interact.
That's the information usually missing when it comes to maintenance time that should be present. 

If your use comes from incorporating a library then at least document the args and kwargs that you use in your source code. This should put a damper on routines that pass the same *args and **kwargs to two different modules as the effort to trace used values and ensure no negative interactions between the two called modules is a lot of work and would take some time to document.

If you are not willing to document properly than that, there, is an indication that you are adding technical debt to your code, and that thought should have the effect of limiting their use.

Friday, October 08, 2021

Converting an algorithm to use Pythons' new Structural Pattern Matching

 Python 3.10 is out and browsing what's new, I decided to convert one of my existing routines to use the new Structural Pattern Matching  statement  to try it out.

I used fgrep -c elif,  as part of a long unix command to find my algorithms with a lot of elif statements, and then chose my routine that evaluates Ackermann's function without explicit recursive function calls and with a lot of optimisations.


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
from collections import deque

def ack_ix(m, n):
    "Paddy3118's iterative with optimisations on m"

    stack = deque([])
    stack.extend([m, n])

    while  len(stack) > 1:
        n, m = stack.pop(), stack.pop()

        if   m == 0:
            stack.append(n + 1)
        elif m == 1:
            stack.append(n + 2)
        elif m == 2:
            stack.append(2*n + 3)
        elif m == 3:
            stack.append(2**(n + 3) - 3)
        elif n == 0:
            stack.extend([m-1, 1])
        else:
            stack.extend([m-1, m, n-1])

    return stack[0]

Note that first n, then m is popped off the stack in the first line within the while loop.

Checking the match statement examples PEP-634, I create a match subject expression as a list of first the n then the m value and replace the if statement with case statements covering each of the if conditionals and also assigning to n and m as appropriate, to create:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
def ack_im(m, n):
    "Paddy3118's iterative with optimisations on m using match of lists"

    stack = deque([])
    stack.extend([m, n])

    while  len(stack) > 1:
        match  [stack.pop(), stack.pop()]:
            case [n, 0]:
                stack.append(n + 1)
            case [n, 1]:
                stack.append(n + 2)
            case [n, 2]:
                stack.append(2*n + 3)
            case [n, 3]:
                stack.append(2**(n + 3) - 3)
            case [0, m]:
                stack.extend([m-1, 1])
            case [n, m]:
                stack.extend([m-1, m, n-1])

    return stack[0]

I had thought that it would be more natural to use tuples rather than lists for the subject expressions and case patterns as there are always only two things being matched, so created a third version:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
def ack_imt(m, n):
    "Paddy3118's iterative with optimisations on m using match of tuples"

    stack = deque([])
    stack.extend([m, n])

    while  len(stack) > 1:
        match  (stack.pop(), stack.pop()):
            case (n, 0):
                stack.append(n + 1)
            case (n, 1):
                stack.append(n + 2)
            case (n, 2):
                stack.append(2*n + 3)
            case (n, 3):
                stack.append(2**(n + 3) - 3)
            case (0, m):
                stack.extend([m-1, 1])
            case (n, m):
                stack.extend([m-1, m, n-1])

    return stack[0]


Checks/Timings

if __name__ == "__main__":
    import timeit
    assert (ix := ack_ix(4, 2)) == (im := ack_im(4, 2)) == (im := ack_imt(4, 2)), "Whoops!"
    print(f"Answer has {len(str(ix))} decimal digits")
    n = 100_000
    print(f"Timings for {n:_} repetitions of ack(4, 2)")
    for func in 'ack_ix ack_im ack_imt'.split():
        print(f"  {func+':':8}",
              timeit.timeit(f"{func}(4, 2)",
                            number=n,
                            setup=f"from __main__ import {func}"))

Running the code produces the following output:

Answer has 19729 decimal digits
Timings for 100_000 repetitions of ack(4, 2)
  ack_ix:  15.179398099950049
  ack_im:  15.368259999959264
  ack_imt: 15.472230200015474

 

Personal conclusion

There isn't much difference in runtime between if and match statements. This example only touches on the match statements capabilities, and so seems equally expressive as the if statement variant to me.

Wednesday, June 09, 2021

Python hack: Creating local variables in a comprehension

 Lets say you have data and want to create a list of the running sum of the data. For example if the data is [1,2,3] the running sum is [1, 3, 6]

Doing this outside a comprehension:

data = [1, 2, 3]

s = 0  # accumulator
sums = []

for x in data:
    s += x
    sums.append(s)
print(sums)  # [1, 3, 6]

Now this needs the accumulator s, initialised to zero, but comprehensions create their own local variables and the syntax does not allow you to write a simple assignment within it, and each item from the comprehension is stated first in its syntax .


The Hack

I'll just show the hack then work through it afterwards.

In [1]: data = [1, 2, 3]

In [2]: [s for s in [0] for x in data for s in [s + x]]
Out[2]: [1, 3, 6]


When converting a comprehension into  similar for statements then the output expression at the beginning of the comprehension is thought of as moving to inside the rightmost if or for section of the comprehension, so we get:

# Comprehension over many lines
[s                          # output expression
 for s in [0]               # For clauses (nested)
     for x in data
         for s in [s + x]]

# Is similar too...
for s in [0]:                   # For clauses (nested)
    for x in data:
        for s in [s + x]:
            print(s, end=' ')   # output expresion: 1 3 6


Explanation

In the comprehension, the initial

[s for s in [0] ...

says:

  • Individual items of the comprehension will be the expression s.
    (Remember the output expression is stated first , but from the environment at the right of the comprehension).
  • In the comprehensions local scope we use the one-entry outer for loop to set local s to zero.

The middle for loop of the comprehension just iterates over the data

The final for loop of the comprehension is special:

... for s in [s + x]]

s is set to itself plus the next item of data, x, using iteration over a one element list [s + x]:

  • For the first x, s was initialised to zero in the local scope via the outermost for.
  • s becomes 0 + data[0] in the inner loop and becomes the first output expression value, 1.
  • For the second iteration of the middle loop, x = data[1], so s then becomes 0 + data[0] + data[1]. The second evaluation of the the output expression for the comprehension, 3.
  • And so on...

Multiple local variables

 We can generalise this. Here we generate running sums, and running sums of the squares which needs two local variables s and s2:

In [3]: data = [1, 2, 3]

In [4]: [(s, s2) for s, s2 in [(0, 0)] for x in data for s, s2 in [(s + x, s2 + x**2)]]
Out[4]: [(1, 1), (3, 5), (6, 14)]

Summary

  1. You can satisfy the need for local variables in comprehensions.
  2. Its a hard to understand hack!
 

UPDATE: Added Walrus:

The walrus operator can now be used to give a more readable equivalent. 
This introduces the external initialised variable into the comprehension as well as  keeping the running sums.

In [5]: # We had:

In [6]: data = [1, 2, 3]

In [7]: [s for s in [0] for x in data for s in [s + x]]
Out[7]: [1, 3, 6]

In [9]: # With :=

In [10]: s = 0

In [11]: [s := (s + x) for x in data]
Out[11]: [1, 3, 6]

In [12]: 

In [13]: # We then had:

In [14]: del s

In [15]: [(s, s2) for s, s2 in [(0, 0)] for x in data for s, s2 in [(s + x, s2 + x**2)]]
Out[15]: [(1, 1), (3, 5), (6, 14)]

In [16]: # Which becomes:

In [17]: s = s2 = 0

In [18]: [(s := s + x, s2 := s2 + x**2) for x in data]
Out[18]: [(1, 1), (3, 5), (6, 14)]
(Those external variables s and s2 are updated to the last running sum).

Source

 This all came about because I re-read the "Whats new in Python 3.9" doc after upgrading Anaconda and came across code I couldn't initially fathom.

End.

Tuesday, April 27, 2021

Read PEP 636 on Structural Pattern Matching: The Tutorial

 Hi, I am just reading through PEP 636 -- Structural Pattern Matching: Tutorial. I'll give my views as I go along...

I really liked the choice of example - a text adventure. I am old enough to relate to it, and I guess a quick google would inform others.

Section "Matching sequences" serves as a gentle introduction to a match statement with one case

Subsequent sections expand on aspects of match: literals seem to match literally; names, (also called variables), when matched are also bind-to for use in the stastements, (plueral), of the case block.

skipping a few sections to something that is stated, that is nevertheless a departure from previous uses, that is the treatment of the underscore.

Underscores

In section "Adding a wildcard" underscore becomes more than just a valid name, as it is outside the match statement. 

Underscore always matches an object - just like any other identifier name; but does not get assigned the object it matches - like any other name.

Their example:

match command.split():
    case ["quit"]: ... # Code omitted for brevity
    case ["go", direction]: ...
    case ["drop", *objects]: ...
    ... # Other cases
    case _:
        print(f"Sorry, I couldn't understand {command!r}")

You cannot write the following

match command.split():
    case ["quit"]: ... # Code omitted for brevity
    case ["go", direction]: ...
    case ["drop", *objects]: ...
    ... # Other cases
    case _:
        print(f"Sorry, I couldn't understand {_!r}")

As _ is not assigned. Is it de-assigned in this block or will an earlier assignment outside the match statement still exist?


I am curious as to the decision for this extra non-assignment behaviour for underscore?

Testing Scope and Assignment

Lets expand on their example assigning names globally and trying different commands to match different case clauses:


_, direction, objects = "_ global", "direction global", "objects global"
print(f"{(_, direction, objects)=}")

command = "Nothing to match"
for command in ["quit", "quit me", "go south", "drop anvil arrows"]:
    print(f"\n## COMMAND IS: {command!r}\n")
    match command.split():
        case ["quit"]:
            print("quit")
            print(f"{(_, direction, objects)=}")
        case ["go", direction]:
            print("go", direction)
            print(f"{(_, direction, objects)=}")
        case ["drop", *objects]:
            print("drop", objects)
            print(f"{(_, direction, objects)=}")
        case _:
            print(f"Sorry, I couldn't understand {command!r}")
            print(f"{(_, direction, objects)=}")

Output:

xx

Python 3.10.0a7 (tags/v3.10.0a7:53e5529, Apr  6 2021, 10:20:47) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> 
======== RESTART: C:/Users/paddy/Google Drive/Code/python_3_10_match.py ========
(_, direction, objects)=('_ global', 'direction global', 'objects global')

## COMMAND IS: 'quit'

quit
(_, direction, objects)=('_ global', 'direction global', 'objects global')

## COMMAND IS: 'quit me'

Sorry, I couldn't understand 'quit me'
(_, direction, objects)=('_ global', 'direction global', 'objects global')

## COMMAND IS: 'go south'

go south
(_, direction, objects)=('_ global', 'south', 'objects global')

## COMMAND IS: 'drop anvil arrows'

drop ['anvil', 'arrows']
(_, direction, objects)=('_ global', 'south', ['anvil', 'arrows'])
>>> 

 Concentrating on underscore, we see that it always has the initially assigned global value.

Once direction is changed in the case statement, however, it is this global value that is changed and the next iteration of the loop that does not match to it shows its new changed value.


The special treatment of underscore seems odd to me? Why not assign to it like normal and have only the special treatment of checking that if it is the only pattern then that pattern is the last pattern?

As

Reading section "Capturing matched sub-patterns" I kind of thought ahead and was thinking "another use for the walrus operator", until the use of the as keyword revealed itself :-)

"as" reads well, I like it.


Type and attribute matching

Yay! Duck type friendly matching of attribute names.

Outside of a match statement a call of MyObject might require arguments, but it seems that MyObject() is valid in a match statement, regardless - something to remember.


END?

I'll stop here for now although there is more of the doc to read.





Tuesday, April 06, 2021

Wednesday, February 17, 2021

Longest Common Substring: An investigation using Python.

 (Best browsed on a larger than phone screen)

Followers

Subscribe Now: google

Add to Google Reader or Homepage

Go deh too!

whos.amung.us

Blog Archive