Monday, May 27, 2024

Recreating the CVM algorithm for estimating distinct elements gives problems

 a Jamaican teaching Python in the Blue Mountains. Image 2 of 4

 Someone at work posted a link to this Quanta Magazine article. It describes a novel, and seemingly straight-forward way to estimate the number of distinct elements in a datastream. 

Quanta describes the algorithm, and as an example gives "counting the number of distinct words in Hamlet".

Following Quanta

I looked at the description and decided to follow their text. They carefully described each round of the algorithm which I coded up and then looked for the generalizations and implemented a loop over alll items in the stream ....

It did not work! I got silly numbers. I could download Hamlet split it into words, (around 32,000), do len(set(words) to get the exact number of distinct words, (around 7,000), then run it through the algorithm and get a stupid result with tens of digits for the estimated number of distinct words.
I re-checked my implementation of the Quanta-described algorithm and couldn't see any mistake, but I had originally noticed a link to the original paper. I did not follow it at first as original papers can be heavily into maths notation and I prefer reading algorithms described in code/pseudocode. 

I decided to take a look at the original.

The CVM Original Paper

I scanned the paper.

I read the paper.

I looked at Algorithm 1 as a probable candidate to decypher into Python, but the description was cryptic. Heres that description taken from the paper:

AI To the rescue!?

I had a brainwave💡lets chuck it at two AI's and see what they do. I had Gemini and I had Copilot to hand and asked them each to express Algorithm 1 as Python. Gemini did something, and Copilot finally did something but I first had to open the page in Microsoft Edge.
There followed hours of me reading and cross-comparing between the algorithm and the AI's. If I did not understand where something came from I would ask the generating AI; If I found an error I would first, (and second and...), try to get the AI to make a fix I suggested.

At this stage I was also trying to get a feel for how the AI's could help me, (now way past what I thought the algorithm should be, just to see what it would take to get those AI's to cross T's and dot I's on a good solution).
Not a good use of time! I now know that asking questions to update one of the 20 to 30 lines of the Python function might fix that line, but unfix another line you had fixed before. Code from the AI does not have line numbers making it difficult to state what needs changing, and where.They can suggest type hints and create the beginnings of docstrings, but, for example, it pulled out the wrong authors for the name of the algorithm.
In line 1 of the algorithm, the initialisation of thresh is clearly shown, I thought, but both AI's had difficulty getting the Python right. eventually I cut-n-pasted the text into each AI, where they confidentially said "OF course...", made a change, and then I had to re-check for any other changes.

My Code

I first created this function:

def F0_Estimator(stream: Collection[Any], epsilon: float, delta: float) -> float:
    """
    ...
    """
    p = 1
    X = set()
    m = len(stream)
    thresh = math.ceil(12 / (epsilon ** 2) * math.log(8 * m / delta))

    for item in stream:
        X.discard(item)
        if random.random() < p:
            X.add(item)
        if len(X) == thresh:
            X = {x_item for x_item in X
                    if random.random() < 0.5}
            p /= 2
    return len(X) / p

I tested it with Hamlet data and it made OK estimates.

Elated, I took a break.

Hacker News

The next evening I decided to do a search to see If anyone else was talking about the algorithm and found a thread on Hacker News that was right up my street. People were discussing those same problems found in the Quanta Article - and getting similar ginormous answers. They had one of the original Authors of the paper making comments! And others had created code from the actual paper and said it was also easier than the Quanta description.

The author mentioned that no less than Donald Knuth had taken an interest in their algorithm and had noted that the expression starting `X = ...` four lines from the end could, thoretically, make no change to X, and the solution was to encase the assignment in a while loop that only exited if len(X) < thresh.

Code update

I decided to add that change:

def F0_Estimator(stream: Collection[Any], epsilon: float, delta: float) -> float:
    """
    Estimates the number of distinct elements in the input stream.

    This function implements the CVM algorithm for the problem of
    estimating the number of distinct elements in a stream of data.
   
    The stream object must support an initial call to __len__

    Parameters:
    stream (Collection[Any]): The input stream as a collection of hashable
        items.
    epsilon (float): The desired relative error in the estimate. It must be in
        the range (0, 1).
    delta (float): The desired probability of the estimate being within the
        relative error. It must be in the range (0, 1).

    Returns:
    float: An estimate of the number of distinct elements in the input stream.
    """
    p = 1
    X = set()
    m = len(stream)
    thresh = math.ceil(12 / (epsilon ** 2) * math.log(8 * m / delta))

    for item in stream:
        X.discard(item)
        if random.random() < p:
            X.add(item)
        if len(X) == thresh:
            while len(X) == thresh:  # Force a change
                X = {x_item for x_item in X
                     if random.random() < 0.5}  # Random, so could do nothing
            p /= 2
    return len(X) / p


thresh

In the code above, the variable thresh, (threshhold), named from Algorithm 1, is used in the Quanta article to describe the maximum storage available to keep items from the stream that have been seen before. You must know the length of the stream - m, epsilon, and delta to calculate thresh.

If you were to have just the stream and  thresh as the arguments you could return both the estimate of the number of distinct items in the stream as well as counting the number of total elements in the stream.
Epsilon could be calculated from the numbers we now know.

def F0_Estimator2(stream: Iterable[Any],
                 thresh: int,
                 ) -> tuple[float, int]:
    """
    Estimates the number of distinct elements in the input stream.

    This function implements the CVM algorithm for the problem of
    estimating the number of distinct elements in a stream of data.
   
    The stream object does NOT have to support a call to __len__

    Parameters:
    stream (Iterable[Any]): The input stream as an iterable of hashable
        items.
    thresh (int): The max threshhold of stream items used in the estimation.py

    Returns:
    tuple[float, int]: An estimate of the number of distinct elements in the
        input stream, and the count of the number of items in stream.
    """
    p = 1
    X = set()
    m = 0  # Count of items in stream

    for item in stream:
        m += 1
        X.discard(item)
        if random.random() < p:
            X.add(item)
        if len(X) == thresh:
            while len(X) == thresh:  # Force a change
                X = {x_item for x_item in X
                     if random.random() < 0.5}  # Random, so could do nothing
            p /= 2
           
    return len(X) / p, m

def F0_epsilon(
               thresh: int,
               m: int,
               delta: float=0.05,  #  0.05 is 95%
              ) -> float:
    """
    Calculate the relative error in the estimate from F0_Estimator2(...)

    Parameters:
    thresh (int): The thresh value used in the call TO F0_Estimator2.
    m (int): The count of items in the stream FROM F0_Estimator2.
    delta (float): The desired probability of the estimate being within the
        relative error. It must be in the range (0, 1) and is usually 0.05
        to 0.01, (95% to 99% certainty).

    Returns:
    float: The calculated relative error in the estimate

    """
    return math.sqrt(12 / thresh * math.log(8 * m / delta))

Testing

def stream_gen(k: int=30_000, r: int=7_000) -> list[int]:
    "Create a randomised list of k ints of up to r different values."
    return random.choices(range(r), k=k)

def stream_stats(s: list[Any]) -> tuple[int, int]:
    length, distinct = len(s), len(set(s))
    return length, distinct

# %%
print("CVM ALGORITHM ESTIMATION OF NUMBER OF UNIQUE VALUES IN A STREAM")

stream_size = 2**18
reps = 5
target_uniques = 1
while target_uniques < stream_size:
    the_stream = stream_gen(stream_size+1, target_uniques)
    target_uniques *= 4
    size, unique = stream_stats(the_stream)

    print(f"\n  Actual:\n    {size = :_}, {unique = :_}\n  Estimations:")

    delta = 0.05
    threshhold = 2
    print(f"    All runs using {delta = :.2f} and with estimate averaged from {reps} runs:")
    while threshhold < size:
        estimate, esize = F0_Estimator2(the_stream.copy(), threshhold)
        estimate = sum([estimate] +
                    [F0_Estimator2(the_stream.copy(), threshhold)[0]
                        for _ in range(reps - 1)]) / reps
        estimate = int(estimate + 0.5)
        epsilon = F0_epsilon(threshhold, esize, delta)
        print(f"      With {threshhold = :7_} -> "
            f"{estimate = :_}, +/-{epsilon*100:.0f}%"
            + (f" {esize = :_}" if esize != size else ""))
        threshhold *= 8

The algorithm generates an estimate based on random sampling, so I run it multiple times for the same input and report the mean estimate from those runs.

Sample output

 

CVM ALGORITHM ESTIMATION OF NUMBER OF UNIQUE VALUES IN A STREAM

  Actual:
    size = 262_145, unique = 1
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 1, +/-1026%
      With threshhold =      16 -> estimate = 1, +/-363%
      With threshhold =     128 -> estimate = 1, +/-128%
      With threshhold =   1_024 -> estimate = 1, +/-45%
      With threshhold =   8_192 -> estimate = 1, +/-16%
      With threshhold =  65_536 -> estimate = 1, +/-6%

  Actual:
    ...
 
  Actual:
    size = 262_145, unique = 1_024
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 16_384, +/-1026%
      With threshhold =      16 -> estimate = 768, +/-363%
      With threshhold =     128 -> estimate = 1_101, +/-128%
      With threshhold =   1_024 -> estimate = 1_018, +/-45%
      With threshhold =   8_192 -> estimate = 1_024, +/-16%
      With threshhold =  65_536 -> estimate = 1_024, +/-6%

  Actual:
    size = 262_145, unique = 4_096
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 13_107, +/-1026%
      With threshhold =      16 -> estimate = 3_686, +/-363%
      With threshhold =     128 -> estimate = 3_814, +/-128%
      With threshhold =   1_024 -> estimate = 4_083, +/-45%
      With threshhold =   8_192 -> estimate = 4_096, +/-16%
      With threshhold =  65_536 -> estimate = 4_096, +/-6%

  Actual:
    size = 262_145, unique = 16_384
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 0, +/-1026%
      With threshhold =      16 -> estimate = 15_155, +/-363%
      With threshhold =     128 -> estimate = 16_179, +/-128%
      With threshhold =   1_024 -> estimate = 16_986, +/-45%
      With threshhold =   8_192 -> estimate = 16_211, +/-16%
      With threshhold =  65_536 -> estimate = 16_384, +/-6%

  Actual:
    size = 262_145, unique = 64_347
  Estimations:
    All runs using delta = 0.05 and with estimate averaged from 5 runs:
      With threshhold =       2 -> estimate = 26_214, +/-1026%
      With threshhold =      16 -> estimate = 73_728, +/-363%
      With threshhold =     128 -> estimate = 61_030, +/-128%
      With threshhold =   1_024 -> estimate = 64_422, +/-45%
      With threshhold =   8_192 -> estimate = 64_760, +/-16%
      With threshhold =  65_536 -> estimate = 64_347, +/-6%

 Looks good!

Wikipedia

Another day, and I decide to start writing this blog post. I searched again and found the Wikipedia article on what it called the Count-distinct problem

Looking through it, It had this wrong description of the CVM algorithm:

The, (or a?),  problem with the wikipedia entry is that it shows

...within the while loop. You need an enclosing if |B| >= s for the while loop and the  assignment to p outside the while loop, but inside this new if statement.

STOp PRESS! 02 May 2024: The Wikipedia entry was actually correct. 

It's tough!

Both Quanta Magazine, and whoever added the algorithm to Wikipedia got the algorithm wrong.

I've written around two hundred tasks on site Rosettacode.org for over a decade. Others had to read my description and create code in their chosen language to implement those tasks. I have learnt from the feedback I got on talk pages to hone that craft, but details matter. Examples matter. Constructive feedback matters.

END.

 

7 comments:

  1. I just updated the Wikipedia article, and added to the talk page.

    ReplyDelete
  2. Here's how GPT-4o handled it: https://i.imgur.com/LBWXNS1.png

    ReplyDelete
    Replies
    1. Now that is good! Apart from the size of example leading to thresh = 8968, much larger than the eleven element example stream :-)

      Delete
    2. Hmm, my GPT-4o question of "Generate a python program to do Algorithm 1 on site: https://arxiv.org/pdf/2301.10191" gave nothing nearly as good as what you show; its function had stream and p, that's right, p; as parameters!

      Delete
    3. In my tests it seems to do better by asking it to first describe the algorithm.

      Delete
    4. Yep, I think all three AI worked much better if you just cut-n- pasted Agorithm 1 and asked them to generate the Python

      Delete
  3. https://skal65535.github.io/CVM/ for a live demo w/ code

    ReplyDelete