Saturday, May 21, 2022

Big nO, How the interpreted nature of Python could skew quoted Big-O use.

 Or: If time is an issue then measure!

I was recently introduced to the Boyer-Moore MJRTY algorithm for determining which, if any, of an arbitrary
number of candidates has received a majority of the votes cast in an election.

The linkedin post  by Prof. Wissam Fawaz gave an implementation of that algorithm and I gave my own version, reproduced below. 

from typing import Any
from collections import Counter
import random


def boyer_moore(seq: list[Any]) -> tuple[bool, Any]:
    "Returns IF there is a majority vote winner, and the winner."
    m, i = None, 0
    for x in seq:
        m, i = ((x, 1) if i == 0
                else ((m, i+1) if m == x
                      else (m, i-1)))
    return seq.count(m) > len(seq) // 2, m


 The algorithm, given a list of peoples votes such as if thirteen people are voting between parties A, B, and C then their votes might be represented by the list AAACCBBCCCBCC. The MJRTY algorithm takes just two passes through the list to determine if some party has more than half the votes, and who that is.

I decided to also code  a solution using collections.Counter, which is the first consideration when a problem includes counting in Python:

def counted(seq: list[Any]) -> tuple[bool, Any]:
    "Uses collections.Counter"
    m, count = Counter(seq).most_common(1)[0]
    return count > len(seq) // 2, m

Given the problem statement, it is more evident how the counted version works:

  1. Find the most common vote
  2. See if it amounts to more than half of all votes.
The description of how and why the MJRTY is given in the book and is a little more involved.

Testing.

Like all tests, it has holes, but I decided to concentrate on the run time for longer and longer lists that were randomised, and also had a 50/50 chance of producing a winning candidate.

First a generator of suitable data:

def list_gen(odd_count=11) -> list[int]:
    """
    Generates a list with either one item appearing more
    than all the rest or not.
    """
    half_count = odd_count // 2
    if random.choice((True, False)):
        lst = random.choices([2, 3], k=half_count)     + [1] * (half_count + 1)
    else:
        lst = random.choices([2, 3], k=half_count + 1) + [1] * half_count
    random.shuffle(lst)
    return lst

for i in range(5):
    print((lst := list_gen()), boyer_moore(lst), counted(lst))

"""
Sample output from loop above:
[1, 2, 1, 2, 1, 3, 1, 3, 1, 2, 2] (False, 2) (False, 1)
[2, 2, 3, 1, 1, 1, 1, 1, 2, 1, 3] (True, 1) (True, 1)
[1, 1, 1, 2, 1, 2, 2, 3, 1, 1, 3] (True, 1) (True, 1)
[1, 1, 1, 2, 1, 2, 3, 1, 2, 2, 1] (True, 1) (True, 1)
[2, 1, 2, 1, 1, 2, 1, 3, 3, 2, 1] (False, 1) (False, 1)
"""

Then wrap each implementation and timeit (Thanks IPython/Spyder).

#%% Timings

def timed_bm(count=1_000):
    boyer_moore(list_gen(count))

def timed_c(count=1_000):
    counted(list_gen(count))

for i in range(1, 7):
    print('\n##',10**i)
    print("# Boyer-Moore")
    %timeit timed_bm(10**i)  # Spyder/IPython IDE specific
    print("# Counted")
    %timeit timed_c(10**i)

"""
Sample timed output:
## 10
# Boyer-Moore
10.8 µs ± 44.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# Counted
13.6 µs ± 44.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

## 100
# Boyer-Moore
73.7 µs ± 89.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Counted
69.9 µs ± 518 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

## 1000
# Boyer-Moore
739 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Counted
661 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

## 10000
# Boyer-Moore
7.58 ms ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Counted
6.91 ms ± 80.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

## 100000
# Boyer-Moore
75.3 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Counted
67.8 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

## 1000000
# Boyer-Moore
780 ms ± 5.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Counted
710 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""

Results

The run times are comparable over from ten to a million voters given the restrictions of the test strategy.

the vote generator restricts the number of parties to three which favours the dict underlying Counter. A more accurate comparison might be made by taking different sized samples from actual data.

END.




No comments:

Post a Comment