Wednesday, February 17, 2021

Longest Common Substring: An investigation using Python.

 (Best browsed on a larger than phone screen)

Saturday, February 13, 2021

Python: Counting without Counter

Normally, if you are counting things then you need to first think of collections.Counter. It's in the standard library and I use it a lot myself, but lately, I've had two instances where I was counting, and Counter was not the best fit!

1. When the keys to count are known, adjacent, integers. 

Or can be easily manipulated to be

In this case you can replace counter by an array of zeroes and increment array[key] whenever you see key.

I used this in code to bin numbers. The bin limits are set and sorted, so you know exactly how many bins are needed. numbers to bin are first compared with the limits to work out which bin they occupy and then that bin is incremented:

from bisect import bisect_right
 
def bin_it(limits: list, data: list) -> list:
    "Bin data according to (ascending) limits."
    bins = [0] * (len(limits) + 1)      # adds under/over range bins too
    for d in data:
        bins[bisect_right(limits, d)] += 1
    return bins

I have replaced hash lookup in Counter with array indexing.

2. When defaultdict(int) is better

Another way to count arbitrary objects is to use defaultdict(int) and just increment item counts with the defauldict automatically starting the count at zero from its initialising call of int().

I was splitting words in a book then working out the frequencies of what words follow others with an aim to randomly create sentencies where the following word would by a weighted random choice of the words in the book that followed a word.

So if the book were just two sentences it would be broken down as:

>>> book_text = 'The cat sat on the mat. The mat is that upon which the cat sat'
>>> words = remove_punctuation(book_text).split()
>>> words
['the', 'cat', 'sat', 'on', 'the', 'mat', '.', 'the', 'mat', 'is', 'that', 'upon', 'which', 'the', 'cat', 'sat', '.']
>>> 

I need to count how many times 'cat' comes after 'the'; 'sat' comes after 'cat', ... I also want to be able to to access 'the' and find all the words that it precedes, together with a count of how many times that word directly succeeds 'the'. 

I want a dictionary with key 'word' that returns a dictionary who's keys are all the immediate successor words, and who's values are the counts of how many times the successor word is found to the original word.

Something like the following nested dict:

{'the': {'cat': 2, 'mat': 2},
 'cat': {'sat': 2},
 'sat': {'on': 1, '.': 1},
 'on': {'the': 1},
 'mat': {'.': 1, 'is': 1},
 '.': {'the': 1},
 'is': {'that': 1},
 'that': {'upon': 1},
 'upon': {'which': 1},
 'which': {'the': 1}}

 

See how it shows that 'the' is followed twice by 'cat' and once by'mat'; 'sat' is followed once by 'on' and once by a full stop '.', etc.

The data structure needs to be an outer dict that will automatically create an inner counter dict. I use lowercase counter because that inner dict can be either a collections.Counter or a defaultdict(int) used as a counter. Since I need a defaultdict at the outer level, and don't need any of the Counter special methods, I chose to use defaultdict for the nested dicts too. 

That leads to the following definition and population of the datastructure:

>>> word2next = defaultdict(lambda :defaultdict(int))
>>> for lh, rh in zip(words, words[1:]):
    word2next[lh][rh] += 1

    
>>> assert word2next == {'the': {'cat': 2, 'mat': 2},
                     'cat': {'sat': 2},
                     'sat': {'on': 1, '.': 1},
                     'on': {'the': 1},
                     'mat': {'.': 1, 'is': 1},
                     '.': {'the': 1},
                     'is': {'that': 1},
                     'that': {'upon': 1},
                     'upon': {'which': 1},
                     'which': {'the': 1}}
>>> 
 
zip(*word2next[some_word].items()) then returns a list of successor words and a list of successor_word_counts that can then be used as arguments to random.choices() to randomly select the next word in a Markov Chain sentence generator based on the book text.  

Conclusion

When counting odd stuff in Python then collections.Counter should always be considered, but it isn't the only way to count; arrays initialised to zero, and  sometimes defaultdict(int), might also be worth considering.


End.




Tuesday, February 02, 2021

Pythone'er abuses walrus

(... the operator, that is 😊).

 

I haven't used the new walrus operator, := much in code. This is an example of a working mis-use of it. 

A new task

I wrote a new task description on Rosetta Code, together with Python recursive and iterative solutions.

The gist of the task is:

  • Given a list of integers, >0, representing nesting levels, e.g. [1, 2, 4]
  • Generate a tree of those values, in order, in the stated depth of nesting. i.e. [1, [2, [[4]]]]

(Someone on Stackoverflow had that problem, and it seemed like it could form the basis of a good RC task).

Iterative solution

I posted the following code as my iterative solution.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
def to_tree(x: list) -> list:
   nested = []
   stack = [nested]
   for this in x:
       while this != len(stack):
           if this > len(stack):
               innermost = []               # new level
               stack[-1].append(innermost)  # nest it
               stack.append(innermost)      # push it
           else: # this < stack:
               stack.pop(-1)
       stack[-1].append(this)

   return nested

It works using:

  1. len(stack) to capture nesting depth; and

  2. stack[-1] as a pointer into the active nest at that level; whilst

  3. nested accumulates the whole tree.

The walrus hack.

I was admiring lines seven to nine above and was initially rthinking about name innermost. It is a kind of temporary variable, but is instrumental in modifying stack. I thought: "could I do lines seven to eight in one statement"?

It is awkward as appends return None, but then the walrus operator came to my rescue and I produced the equivalent:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
def to_tree(x: list) -> list:
   nested = []
   stack = [nested]
   for this in x:
       while this != len(stack):
           if this > len(stack):
               stack.append(None if stack[-1].append(innermost:=[])
                            else innermost)
           else: # this < stack:
               stack.pop(-1)
       stack[-1].append(this)

   return nested

The one statement at lines seven and eight is equivalent. The whole expression is evaluated inside out:

  1. An empty list is created and assigned to name innermost.
  2. that same list is nested by appending it to the list at the top of the stack.
  3. The inner append always returns None which is False in the boolean context of the if condition.
  4. The else clause is always triggered, returning innermost - the newly created and nested list.
  5. That newly created and nested list is appended to the head of the stack.
This use of the walrus operator, a.k.a the assignment expression is bad Python. It is much harder to read and maintain than the previous code.

Other uses are Pythonic!

 

STOP PRESS!

Steven D'Aprano gives valuable insight on this in a comment here.