Wednesday, March 05, 2025

Intsets by AI


 

 I will be using sets a lot in a future algorithm and wanted to find out about the speed of operation of different set implementations.

I will be dealing with different sets of strings from a total of around 200_000, (but the strings being added throughout the calculations). Python sets can do that directly, but I can create an integer index of all the strings and then do calculations on sets of integers.

Sets of integers can be translated to ints with bits set to indicate which ints were in the set. Numpy can do this but needs either all the strings up-front, or the maximum integer up front.  Python integers are variable length, as the algorithm adds more strings, they can be looked up and either their index returned or an incrementing index assigned and returned if the string is new. if you want to set the seven thousandth and one bit of a Python int then the int will extend as necessary to allow this.

Code by AI 

I decided to talk to Gemini AI to create all code and after a couple of hours of asking, refining and my testing I got my code. (Mine as I expended effort in its creation and debugging, even if it was by questioning Gemini).

Result!

I found that intsets can be hundreds of times faster than the same operation with the same values but with sets. (Frozensets in this particular case ase I cache a conversion function).

max_int: 262144, set_count: 5, repeat_count: 1
Average time for intset operations: 0.001064 seconds
Average time for set operations: 0.921396 seconds
Intset operations are approximately 866.31 times faster than set operations.

 Code

"""
This module provides functions for generating random sets, performing set operations,
and timing the performance of set operations using both standard Python sets and
integer-based sets (intsets). It includes caching for intset conversions and calculates
the speedup of intset operations over standard set operations.
"""

import random
import time
import itertools
from functools import lru_cache
from typing import FrozenSet, Tuple, List

def generate_random_sets_with_sample(max_int: int, set_count: int) -> List[FrozenSet[int]]:
    """
    Generates a list of random frozensets of integers using random.sample.

    Args:
        max_int: The maximum integer value in the sets.
        set_count: The number of random sets to generate.

    Returns:
        A list of random frozensets of integers.
    """
    all_ints = frozenset(range(max_int + 1))
    random_sets = []

    for _ in range(set_count):
        population = range(max_int + 1)
        sample_size = random.randint(0, max_int + 1)
        current_set = frozenset(random.sample(population, sample_size))
        random_sets.append(current_set)

    random_sets.append(all_ints)  # Append all_ints to random_sets
    return random_sets

def intset_difference(intset1: int, intset2: int) -> int:
    """Calculates the difference of two intsets."""
    return intset1 & ~intset2

def intset_intersection(intset1: int, intset2: int) -> int:
    """Calculates the intersection of two intsets."""
    return intset1 & intset2

def intset_symmetric_difference(intset1: int, intset2: int) -> int:
    """Calculates the symmetric difference of two intsets."""
    return intset1 ^ intset2

def intset_union(intset1: int, intset2: int) -> int:
    """Calculates the union of two intsets."""
    return intset1 | intset2

def set_difference(set1: FrozenSet[int], set2: FrozenSet[int]) -> FrozenSet[int]:
    """Calculates the difference of two frozensets."""
    return set1.difference(set2)

def set_intersection(set1: FrozenSet[int], set2: FrozenSet[int]) -> FrozenSet[int]:
    """Calculates the intersection of two frozensets."""
    return set1.intersection(set2)

def set_symmetric_difference(set1: FrozenSet[int], set2: FrozenSet[int]) -> FrozenSet[int]:
    """Calculates the symmetric difference of two frozensets."""
    return set1.symmetric_difference(set2)

def set_union(set1: FrozenSet[int], set2: FrozenSet[int]) -> FrozenSet[int]:
    """Calculates the union of two frozensets."""
    return set1.union(set2)

@lru_cache(maxsize=None)
def set_to_intset(input_set: FrozenSet[int]) -> int:
    """
    Converts a frozenset of integers to an intset, with caching.

    Args:
        input_set: The frozenset of integers.

    Returns:
        The intset representation of the frozenset.
    """
    result = 0
    for num in input_set:
        result |= (1 << num)
    return result

def time_set_operations(random_sets: List[FrozenSet[int]], repeat_count: int = 1) -> Tuple[float, float]:
    """
    Times set and intset operations multiple times and returns average timings.

    Args:
        random_sets: A list of frozensets.
        repeat_count: The number of times to repeat the timing.

    Returns:
        A tuple containing the average intset operation time and the average set operation time.
    """
    total_intset_time = 0
    total_set_time = 0

    for _ in range(repeat_count):
        intset_time = 0
        set_time = 0

        for set1, set2 in itertools.permutations(random_sets, 2):
            # Time intset operations
            intset1 = set_to_intset(set1)
            intset2 = set_to_intset(set2)

            start_intset = time.time()
            intset_difference(intset1, intset2)
            intset_intersection(intset1, intset2)
            intset_symmetric_difference(intset1, intset2)
            intset_union(intset1, intset2)
            end_intset = time.time()
            intset_time += end_intset - start_intset

            # Time set operations
            start_set = time.time()
            set_difference(set1, set2)
            set_intersection(set1, set2)
            set_symmetric_difference(set1, set2)
            set_union(set1, set2)
            end_set = time.time()
            set_time += end_set - start_set

        total_intset_time += intset_time
        total_set_time += set_time

    avg_intset_time = total_intset_time / repeat_count
    avg_set_time = total_set_time / repeat_count

    return avg_intset_time, avg_set_time

# Example Usage:
max_int = 2**18
set_count = 5
repeat_count = 1

random_sets = generate_random_sets_with_sample(max_int, set_count)

avg_intset_time, avg_set_time = time_set_operations(random_sets, repeat_count)

print(f"max_int: {max_int}, set_count: {set_count}, repeat_count: {repeat_count}")
print(f"Average time for intset operations: {avg_intset_time:.6f} seconds")
print(f"Average time for set operations: {avg_set_time:.6f} seconds")

if avg_intset_time > 0:
    speedup = avg_set_time / avg_intset_time
    print(f"Intset operations are approximately {speedup:.2f} times faster than set operations.")
else:
    print("Set operations time was zero. Unable to calculate speedup.")
   

 

My Gemini  chat

1.  python function set_to_intset that given a set containing integers returns an
    int with just those binary bits set to one; and a reverse function called
    intset_to_set that given an int returns a set of ints where the ints are the
    positions of all the bits that are one in the intset argument.
2.  methods between two python sets: difference, intersection,
    symmetric_difference and union can be achieved between two intsets using
    logical operators and logical functions. What are they?
3.  add a functionwith arguments max_int, set_count, and repeats; generates
    set_count random sets of the ints 0 to max_int and the set of all ints from
    zero to max_int.
4.  use random.sample with a randomised sample count
5.  population should be the range
6.  append all_ints to random_sets then test the stated set functions on all
    permutations of two of the random sets. accumulate the execution time of the
    set functions
7.  remove list from variable permutations then do not use an intermediate
    variable permutations, just iterate on its expression.
8.  Do similar timings for intset transformations of all random_sets
9.  No, keep the method timings as before.
10. keep two time accumulators, one for timing the runs of the set functions; do
    not time the transformations between set and intset; use the other timer for
    timing the intset functions. return two two times
11. do not time the conversions. time the set operations too
12. time_set_operations should take a repeat count and repeat the for loop that
    many times then return the avrage timings for each of sets, intsets
13. do not print the sets. print max_int, set_count, and repeat_count
14. show how much faster intset is
15. speedup calculation is wrong
16. speedup of intset is intset time / set time
17. no it is with respect to set time so divide by set time
18. that is not what I typed!
19. I was wrong about the speedum calculation
20. use it
21. inline the set andintset calculations when timing instead of calling out to
    the set_* and intset_+ functions, (but keep their definitions)
22. use max_int = 3**18 and set_count = 10
23. use max_int = 2**18; set_count = 5; repeat_count = 1
24. use lru_cached versions of function set_to_intset
25. generate random_sets as frozensets directly
26. add function type_hints
27. only have type hinting on function arguments and return values
28. add docstrings to all functions and add a module docstring

End.