Sunday, November 17, 2024

There's the easy way...

 

Best seen on a larger than landscape phone

Someone blogged about a particular problem:

From: https://theweeklychallenge.org/blog/perl-weekly-challenge-294/#TASK1

Given an unsorted array of integers, `ints`
Write a script to return the length of the longest consecutive elements sequence.
Return -1 if none found. *The algorithm must run in O(n) time.*

The solution they blogged used a sort which meant it could not be O(n) in time, but the problem looked good so I gave it some thought.

Sets! sets are O(1) in Python and are good for looking things up.

What if when looking at the inputted numbers, one at a time, you also looked for other ints in the input that would extend the int you have to form a longer  range?  Keep tab of the longest range so far and if you remove ints from the pool as they form ranges, when the pool is empty, you should know the longest range.

I added the printout of the longest range too.

My code

def consec_seq(ints) -> tuple[int, int, int]:
    "Extract longest_seq_length, its_min, its_max"
    pool = set(ints)
    longest, longest_mn, longest_mx = 0, 1, 0
    while pool:
        this = start = pool.pop()
        ln = 1
        # check down
        while (this:=(this - 1)) in pool:
            ln += 1
            pool.remove(this)
        mn = this + 1
        # check up
        this = start
        while (this:=(this + 1)) in pool:
            ln += 1
            pool.remove(this)
        mx = this - 1
        # check longest
        if ln > longest:
            longest, longest_mn, longest_mx = ln, mn, mx

    return longest,longest_mn,longest_mx

def _test():
    for ints in[(),
            (69,),
            (-20, 78, 79, 1, 100),
            (10, 4, 20, 1, 3, 2),
            (0, 6, 1, 8, 5, 2, 4, 3, 0, 7),
            (10, 30, 20),
            (2,4,3,1,0, 10,12,11,8,9),  # two runs of five
            (10,12,11,8,9, 2,4,3,1,0),  # two runs of five - reversed
            (2,4,3,1,0,-1, 10,12,11,8,9),  # runs of 6 and 5
            (2,4,3,1,0, 10,12,11,8,9,7),   # runs of 5 and 6
            ]:
        print(f"Input {ints = }")
        longest, longest_mn, longest_mx = consec_seq(ints)

        if longest <2:
            print("  -1")
        else:
            print(f"  The/A longest sequence has {longest} elements {longest_mn}..{longest_mx}")


# %%
if __name__ == '__main__':
    _test()

Sample output

Input ints = ()
  -1
Input ints = (69,)
  -1
Input ints = (-20, 78, 79, 1, 100)
  The/A longest sequence has 2 elements 78..79
Input ints = (10, 4, 20, 1, 3, 2)
  The/A longest sequence has 4 elements 1..4
Input ints = (0, 6, 1, 8, 5, 2, 4, 3, 0, 7)
  The/A longest sequence has 9 elements 0..8
Input ints = (10, 30, 20)
  -1
Input ints = (2, 4, 3, 1, 0, 10, 12, 11, 8, 9)
  The/A longest sequence has 5 elements 0..4
Input ints = (10, 12, 11, 8, 9, 2, 4, 3, 1, 0)
  The/A longest sequence has 5 elements 0..4
Input ints = (2, 4, 3, 1, 0, -1, 10, 12, 11, 8, 9)
  The/A longest sequence has 6 elements -1..4
Input ints = (2, 4, 3, 1, 0, 10, 12, 11, 8, 9, 7)
  The/A longest sequence has 6 elements 7..12

Another Algorithm

What if, you kept and extended ranges untill you amassed all ranges then chose the longest? I need to keep the hash lookup. dict key lookup should also be O(1).  What to look up? Look up ints that would extend a range!

If you have an existing (integer) range, say 1..3 inclusive of end points then finding 0 would extend the range to 0..3 or finding one more than the range maximum, 4 would extend the original range to 1..4 

So if you have ranges then they could be extended by finding rangemin - 1 or rangemax +1. I call then extends

If you do find that the next int from the input ints is also an extends value then you need to find the range that it extends, (by lookup), so you can modify that range. - use a dict to map extends to their range and checking if an int is in the extends dict keys should also take O(1) time.

I took that sketch of an algorithm and started to code. It took two evenings to finally get something that worked and I had to work out several details that were trying. The main problem was what about coalescing ranges? if you have ranges 1..2 and 4..5 what happens when you see a 3? the resultant is the single range 1..5. It took particular test cases and extensive debugging to work out that the extends2range mapping should map to potentially more than one range and that you need to combine ranges if two of them are present for any extend value being hit.


So for 1..2 the extends being looked for are 0 and 3. For 4..5 the extends being looked for are 3, again, and 6. The extends2ranges data structure for just this should look like:

{0: [[1, 2]],
 3: [[1, 2], [4, 5]],
 6: [[4, 5]]}

 

The Code #2

from collections import defaultdict


def combine_ranges(min1, max1, min2, max2):
    "Combine two overlapping ranges return the new range as [min, max], and a set of limits unused in the result"
    assert (min1 <= max1 and min2 <= max2          # Well formed
            and ( min1 <= max2 and min2 <= max1 )) # and ranges touch or overlap
    range_limits = set([min1, max1, min2, max2])
    new_mnmx = [min(range_limits), max(range_limits)]
    unused_limits = range_limits - set(new_mnmx)

    return new_mnmx, unused_limits

def consec_seq2(ints) -> tuple[int, int, int]:
    "Extract longest_seq_length, its_min, its_max"
    if not ints:
        return -1, 1, -1
    seen = set()  # numbers seen so far
    extends2ranges = defaultdict(list)  # map extends to its ranges
    for this in ints:
        if this in seen:
            continue
        else:
            seen.add(this)

        if this not in extends2ranges:
            # Start new range
            mnmx = [this, this]    # Range of one int
            # add in the extend points
            extends2ranges[this + 1].append(mnmx)
            extends2ranges[this - 1].append(mnmx)
        else:
            # Extend an existing range
            ranges = extends2ranges[this]  # The range(s) that could be extended by this
            if len(ranges) == 2:
                # this joins the two ranges
                extend_and_join_ranges(extends2ranges, this, ranges)
            else:
                # extend one range, copied
                extend_and_join_ranges(extends2ranges, this, [ranges[0], ranges[0].copy()])

    all_ranges = sum(extends2ranges.values(), start=[])
    longest_mn, longest_mx = max(all_ranges, key=lambda mnmx: mnmx[1] - mnmx[0])

    return (longest_mx - longest_mn + 1), longest_mn, longest_mx

def extend_and_join_ranges(extends2ranges, this, ranges):
    mnmx, mnmx2 = ranges
    mnmx_orig, mnmx2_orig = mnmx.copy(), mnmx2.copy() # keep copy of originals
    mn, mx = mnmx
    mn2, mx2 = mnmx2
    if this == mn - 1:
        mnmx[0] = mn = this  # Extend lower limit of the range
    if this == mn2 - 1:
        mnmx2[0] = mn2 = this  # Extend lower limit of the range
    if this == mx + 1:
        mnmx[1] = mx = this  # Extend upper limit of the range
    if this == mx2 + 1:
        mnmx2[1] = mx2 = this  # Extend lower limit of the range
    new_mnmx, _unused_limits = combine_ranges(mn, mx, mn2, mx2)

    remove_merged_from_extends(extends2ranges, this, mnmx, mnmx2)
    add_combined_range_to_extends(extends2ranges, new_mnmx)


def add_combined_range_to_extends(extends2ranges, new_mnmx):
    "Add in the combined of two ranges's extends"
    new_mn, new_mx = new_mnmx
    for extend in (new_mn - 1, new_mx + 1):
        r = extends2ranges[extend]  # ranges at new limit extension
        if new_mnmx not in r:
            r.append(new_mnmx)

def remove_merged_from_extends(extends2ranges, this, mnmx, mnmx2):
    "Remove original ranges that were merged from extends"
    for lohi in (mnmx, mnmx2):
        lo, hi = lohi
        for extend in (lo - 1, hi + 1):
            if extend in extends2ranges:
                r = extends2ranges[extend]
                for r_old in (mnmx, mnmx2):
                    if r_old in r:
                        r.remove(r_old)
                if not r:
                    del extends2ranges[extend]
    # remove joining extend, this
    del extends2ranges[this]


def _test():
    for ints in[
            (),
            (69,),
            (-20, 78, 79, 1, 100),
            (4, 1, 3, 2),
            (10, 4, 20, 1, 3, 2),
            (0, 6, 1, 8, 5, 2, 4, 3, 0, 7),
            (10, 30, 20),
            (2,4,3,1,0, 10,12,11,8,9),  # two runs of five
            (10,12,11,8,9, 2,4,3,1,0),  # two runs of five - reversed
            (2,4,3,1,0,-1, 10,12,11,8,9),  # runs of 6 and 5
            (2,4,3,1,0, 10,12,11,8,9,7),   # runs of 5 and 6
            ]:
        print(f"Input {ints = }")
        longest, longest_mn, longest_mx = consec_seq2(ints)

        if longest <2:
            print("  -1")
        else:
            print(f"  The/A longest sequence has {longest} elements {longest_mn}..{longest_mx}")


# %%
if __name__ == '__main__':
    _test()


Its Output

 Input ints = ()
  -1
Input ints = (69,)
  -1
Input ints = (-20, 78, 79, 1, 100)
  The/A longest sequence has 2 elements 78..79
Input ints = (4, 1, 3, 2)
  The/A longest sequence has 4 elements 1..4
Input ints = (10, 4, 20, 1, 3, 2)
  The/A longest sequence has 4 elements 1..4
Input ints = (0, 6, 1, 8, 5, 2, 4, 3, 0, 7)
  The/A longest sequence has 9 elements 0..8
Input ints = (10, 30, 20)
  -1
Input ints = (2, 4, 3, 1, 0, 10, 12, 11, 8, 9)
  The/A longest sequence has 5 elements 0..4
Input ints = (10, 12, 11, 8, 9, 2, 4, 3, 1, 0)
  The/A longest sequence has 5 elements 8..12
Input ints = (2, 4, 3, 1, 0, -1, 10, 12, 11, 8, 9)
  The/A longest sequence has 6 elements -1..4
Input ints = (2, 4, 3, 1, 0, 10, 12, 11, 8, 9, 7)
  The/A longest sequence has 6 elements 7..12


This second algorithm gives correct results but is harder to develop and explain. It's a testament to my stubbornness as I thought there was a solution there, and debugging was me flexing my skills to keep them honed.


END.


Monday, May 27, 2024

Recreating the CVM algorithm for estimating distinct elements gives problems

 a Jamaican teaching Python in the Blue Mountains. Image 2 of 4

 Someone at work posted a link to this Quanta Magazine article. It describes a novel, and seemingly straight-forward way to estimate the number of distinct elements in a datastream. 

Quanta describes the algorithm, and as an example gives "counting the number of distinct words in Hamlet".

Following Quanta

I looked at the description and decided to follow their text. They carefully described each round of the algorithm which I coded up and then looked for the generalizations and implemented a loop over alll items in the stream ....

It did not work! I got silly numbers. I could download Hamlet split it into words, (around 32,000), do len(set(words) to get the exact number of distinct words, (around 7,000), then run it through the algorithm and get a stupid result with tens of digits for the estimated number of distinct words.
I re-checked my implementation of the Quanta-described algorithm and couldn't see any mistake, but I had originally noticed a link to the original paper. I did not follow it at first as original papers can be heavily into maths notation and I prefer reading algorithms described in code/pseudocode. 

I decided to take a look at the original.

The CVM Original Paper

I scanned the paper.

I read the paper.

I looked at Algorithm 1 as a probable candidate to decypher into Python, but the description was cryptic. Heres that description taken from the paper:

AI To the rescue!?

I had a brainwave💡lets chuck it at two AI's and see what they do. I had Gemini and I had Copilot to hand and asked them each to express Algorithm 1 as Python. Gemini did something, and Copilot finally did something but I first had to open the page in Microsoft Edge.
There followed hours of me reading and cross-comparing between the algorithm and the AI's. If I did not understand where something came from I would ask the generating AI; If I found an error I would first, (and second and...), try to get the AI to make a fix I suggested.

At this stage I was also trying to get a feel for how the AI's could help me, (now way past what I thought the algorithm should be, just to see what it would take to get those AI's to cross T's and dot I's on a good solution).
Not a good use of time! I now know that asking questions to update one of the 20 to 30 lines of the Python function might fix that line, but unfix another line you had fixed before. Code from the AI does not have line numbers making it difficult to state what needs changing, and where.They can suggest type hints and create the beginnings of docstrings, but, for example, it pulled out the wrong authors for the name of the algorithm.
In line 1 of the algorithm, the initialisation of thresh is clearly shown, I thought, but both AI's had difficulty getting the Python right. eventually I cut-n-pasted the text into each AI, where they confidentially said "OF course...", made a change, and then I had to re-check for any other changes.

My Code

I first created this function:

def F0_Estimator(stream: Collection[Any], epsilon: float, delta: float) -> float:
    """
    ...
    """
    p = 1
    X = set()
    m = len(stream)
    thresh = math.ceil(12 / (epsilon ** 2) * math.log(8 * m / delta))

    for item in stream:
        X.discard(item)
        if random.random() < p:
            X.add(item)
        if len(X) == thresh:
            X = {x_item for x_item in X
                    if random.random() < 0.5}
            p /= 2
    return len(X) / p

I tested it with Hamlet data and it made OK estimates.

Elated, I took a break.

Hacker News

The next evening I decided to do a search to see If anyone else was talking about the algorithm and found a thread on Hacker News that was right up my street. People were discussing those same problems found in the Quanta Article - and getting similar ginormous answers. They had one of the original Authors of the paper making comments! And others had created code from the actual paper and said it was also easier than the Quanta description.

The author mentioned that no less than Donald Knuth had taken an interest in their algorithm and had noted that the expression starting `X = ...` four lines from the end could, thoretically, make no change to X, and the solution was to encase the assignment in a while loop that only exited if len(X) < thresh.

Code update

I decided to add that change:

def F0_Estimator(stream: Collection[Any], epsilon: float, delta: float) -> float:
    """
    Estimates the number of distinct elements in the input stream.

    This function implements the CVM algorithm for the problem of
    estimating the number of distinct elements in a stream of data.
   
    The stream object must support an initial call to __len__

    Parameters:
    stream (Collection[Any]): The input stream as a collection of hashable
        items.
    epsilon (float): The desired relative error in the estimate. It must be in
        the range (0, 1).
    delta (float): The desired probability of the estimate being within the
        relative error. It must be in the range (0, 1).

    Returns:
    float: An estimate of the number of distinct elements in the input stream.
    """
    p = 1
    X = set()
    m = len(stream)
    thresh = math.ceil(12 / (epsilon ** 2) * math.log(8 * m / delta))

    for item in stream:
        X.discard(item)
        if random.random() < p:
            X.add(item)
        if len(X) == thresh:
            while len(X) == thresh:  # Force a change
                X = {x_item for x_item in X
                     if random.random() < 0.5}  # Random, so could do nothing
            p /= 2
    return len(X) / p


thresh

In the code above, the variable thresh, (threshhold), named from Algorithm 1, is used in the Quanta article to describe the maximum storage available to keep items from the stream that have been seen before. You must know the length of the stream - m, epsilon, and delta to calculate thresh.

If you were to have just the stream and  thresh as the arguments you could return both the estimate of the number of distinct items in the stream as well as counting the number of total elements in the stream.
Epsilon could be calculated from the numbers we now know.

def F0_Estimator2(stream: Iterable[Any],
                 thresh: int,
                 ) -> tuple[float, int]:
    """
    Estimates the number of distinct elements in the input stream.

    This function implements the CVM algorithm for the problem of
    estimating the number of distinct elements in a stream of data.
   
    The stream object does NOT have to support a call to __len__

    Parameters:
    stream (Iterable[Any]): The input stream as an iterable of hashable
        items.
    thresh (int): The max threshhold of stream items used in the estimation.py

    Returns:
    tuple[float, int]: An estimate of the number of distinct elements in the
        input stream, and the count of the number of items in stream.
    """
    p = 1
    X = set()
    m = 0  # Count of items in stream

    for item in stream:
        m += 1
        X.discard(item)
        if random.random() < p:
            X.add(item)
        if len(X) == thresh:
            while len(X) == thresh:  # Force a change
                X = {x_item for x_item in X
                     if random.random() < 0.5}  # Random, so could do nothing
            p /= 2
           
    return len(X) / p, m

def F0_epsilon(
               thresh: int,
               m: int,
               delta: float=0.05,  #  0.05 is 95%
              ) -> float:
    """
    Calculate the relative error in the estimate from F0_Estimator2(...)

    Parameters:
    thresh (int): The thresh value used in the call TO F0_Estimator2.
    m (int): The count of items in the stream FROM F0_Estimator2.
    delta (float): The desired probability of the estimate being within the
        relative error. It must be in the range (0, 1) and is usually 0.05
        to 0.01, (95% to 99% certainty).

    Returns:
    float: The calculated relative error in the estimate

    """
    return math.sqrt(12 / thresh * math.log(8 * m / delta))

Testing

def stream_gen(k: int=30_000, r: int=7_000) -> list[int]:
    "Create a randomised list of k ints of up to r different values."
    return random.choices(range(r), k=k)

def stream_stats(s: list[Any]) -> tuple[int, int]:
    length, distinct = len(s), len(set(s))
    return length, distinct

# %%
print("CVM ALGORITHM ESTIMATION OF NUMBER OF UNIQUE VALUES IN A STREAM")

stream_size = 2**18
reps = 5
target_uniques = 1
while target_uniques < stream_size:
    the_stream = stream_gen(stream_size+1, target_uniques)
    target_uniques *= 4
    size, unique = stream_stats(the_stream)

    print(f"\n  Actual:\n    {size = :_}, {unique = :_}\n  Estimations:")

    delta = 0.05
    threshhold = 2
    print(f"    All runs using {delta = :.2f} and with estimate averaged from {reps} runs:")
    while threshhold < size:
        estimate, esize = F0_Estimator2(the_stream.copy(), threshhold)
        estimate = sum([estimate] +
                    [F0_Estimator2(the_stream.copy(), threshhold)[0]
                        for _ in range(reps - 1)]) / reps
        estimate = int(estimate + 0.5)
        epsilon = F0_epsilon(threshhold, esize, delta)
        print(f"      With {threshhold = :7_} -> "
            f"{estimate = :_}, +/-{epsilon*100:.0f}%"
            + (f" {esize = :_}" if esize != size else ""))
        threshhold *= 8

The algorithm generates an estimate based on random sampling, so I run it multiple times for the same input and report the mean estimate from those runs.

Sample output

 

CVM ALGORITHM ESTIMATION OF NUMBER OF UNIQUE VALUES IN A STREAM

  Actual:
    size = 262_145, unique = 1
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 1, +/-1026%
      With threshhold =      16 -> estimate = 1, +/-363%
      With threshhold =     128 -> estimate = 1, +/-128%
      With threshhold =   1_024 -> estimate = 1, +/-45%
      With threshhold =   8_192 -> estimate = 1, +/-16%
      With threshhold =  65_536 -> estimate = 1, +/-6%

  Actual:
    ...
 
  Actual:
    size = 262_145, unique = 1_024
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 16_384, +/-1026%
      With threshhold =      16 -> estimate = 768, +/-363%
      With threshhold =     128 -> estimate = 1_101, +/-128%
      With threshhold =   1_024 -> estimate = 1_018, +/-45%
      With threshhold =   8_192 -> estimate = 1_024, +/-16%
      With threshhold =  65_536 -> estimate = 1_024, +/-6%

  Actual:
    size = 262_145, unique = 4_096
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 13_107, +/-1026%
      With threshhold =      16 -> estimate = 3_686, +/-363%
      With threshhold =     128 -> estimate = 3_814, +/-128%
      With threshhold =   1_024 -> estimate = 4_083, +/-45%
      With threshhold =   8_192 -> estimate = 4_096, +/-16%
      With threshhold =  65_536 -> estimate = 4_096, +/-6%

  Actual:
    size = 262_145, unique = 16_384
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 0, +/-1026%
      With threshhold =      16 -> estimate = 15_155, +/-363%
      With threshhold =     128 -> estimate = 16_179, +/-128%
      With threshhold =   1_024 -> estimate = 16_986, +/-45%
      With threshhold =   8_192 -> estimate = 16_211, +/-16%
      With threshhold =  65_536 -> estimate = 16_384, +/-6%

  Actual:
    size = 262_145, unique = 64_347
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 26_214, +/-1026%
      With threshhold =      16 -> estimate = 73_728, +/-363%
      With threshhold =     128 -> estimate = 61_030, +/-128%
      With threshhold =   1_024 -> estimate = 64_422, +/-45%
      With threshhold =   8_192 -> estimate = 64_760, +/-16%
      With threshhold =  65_536 -> estimate = 64_347, +/-6%

 Looks good!

Wikipedia

Another day, and I decide to start writing this blog post. I searched again and found the Wikipedia article on what it called the Count-distinct problem

Looking through it, It had this wrong description of the CVM algorithm:

The, (or a?),  problem with the wikipedia entry is that it shows

...within the while loop. You need an enclosing if |B| >= s for the while loop and the  assignment to p outside the while loop, but inside this new if statement.

STOp PRESS! 02 May 2024: The Wikipedia entry was actually correct. 

It's tough!

Both Quanta Magazine, and whoever added the algorithm to Wikipedia got the algorithm wrong.

I've written around two hundred tasks on site Rosettacode.org for over a decade. Others had to read my description and create code in their chosen language to implement those tasks. I have learnt from the feedback I got on talk pages to hone that craft, but details matter. Examples matter. Constructive feedback matters.

END.