I will be using sets a lot in a future algorithm and wanted to find out about the speed of operation of different set implementations.
I will be dealing with different sets of strings from a total of around 200_000, (but the strings being added throughout the calculations). Python sets can do that directly, but I can create an integer index of all the strings and then do calculations on sets of integers.
Sets of integers can be translated to ints with bits set to indicate which ints were in the set. Numpy can do this but needs either all the strings up-front, or the maximum integer up front. Python integers are variable length, as the algorithm adds more strings, they can be looked up and either their index returned or an incrementing index assigned and returned if the string is new. if you want to set the seven thousandth and one bit of a Python int then the int will extend as necessary to allow this.
I decided to talk to Gemini AI to create all code and after a couple of hours of asking, refining and my testing I got my code. (Mine as I expended effort in its creation and debugging, even if it was by questioning Gemini).
I found that intsets can be hundreds of times faster than the same operation with the same values but with sets. (Frozensets in this particular case ase I cache a conversion function).
"""
This module provides functions for generating random sets, performing set operations,
and timing the performance of set operations using both standard Python sets and
integer-based sets (intsets). It includes caching for intset conversions and calculates
the speedup of intset operations over standard set operations.
"""
import random
import time
import itertools
from functools import lru_cache
from typing import FrozenSet, Tuple, List
def generate_random_sets_with_sample(max_int: int, set_count: int) -> List[FrozenSet[int]]:
"""
Generates a list of random frozensets of integers using random.sample.
Args:
max_int: The maximum integer value in the sets.
set_count: The number of random sets to generate.
Returns:
A list of random frozensets of integers.
"""
all_ints = frozenset(range(max_int + 1))
random_sets = []
for _ in range(set_count):
population = range(max_int + 1)
sample_size = random.randint(0, max_int + 1)
current_set = frozenset(random.sample(population, sample_size))
random_sets.append(current_set)
random_sets.append(all_ints) # Append all_ints to random_sets
return random_sets
def intset_difference(intset1: int, intset2: int) -> int:
"""Calculates the difference of two intsets."""
return intset1 & ~intset2
def intset_intersection(intset1: int, intset2: int) -> int:
"""Calculates the intersection of two intsets."""
return intset1 & intset2
def intset_symmetric_difference(intset1: int, intset2: int) -> int:
"""Calculates the symmetric difference of two intsets."""
return intset1 ^ intset2
def intset_union(intset1: int, intset2: int) -> int:
"""Calculates the union of two intsets."""
return intset1 | intset2
def set_difference(set1: FrozenSet[int], set2: FrozenSet[int]) -> FrozenSet[int]:
"""Calculates the difference of two frozensets."""
return set1.difference(set2)
def set_intersection(set1: FrozenSet[int], set2: FrozenSet[int]) -> FrozenSet[int]:
"""Calculates the intersection of two frozensets."""
return set1.intersection(set2)
def set_symmetric_difference(set1: FrozenSet[int], set2: FrozenSet[int]) -> FrozenSet[int]:
"""Calculates the symmetric difference of two frozensets."""
return set1.symmetric_difference(set2)
def set_union(set1: FrozenSet[int], set2: FrozenSet[int]) -> FrozenSet[int]:
"""Calculates the union of two frozensets."""
return set1.union(set2)
@lru_cache(maxsize=None)
def set_to_intset(input_set: FrozenSet[int]) -> int:
"""
Converts a frozenset of integers to an intset, with caching.
Args:
input_set: The frozenset of integers.
Returns:
The intset representation of the frozenset.
"""
result = 0
for num in input_set:
result |= (1 << num)
return result
def time_set_operations(random_sets: List[FrozenSet[int]], repeat_count: int = 1) -> Tuple[float, float]:
"""
Times set and intset operations multiple times and returns average timings.
Args:
random_sets: A list of frozensets.
repeat_count: The number of times to repeat the timing.
Returns:
A tuple containing the average intset operation time and the average set operation time.
"""
total_intset_time = 0
total_set_time = 0
for _ in range(repeat_count):
intset_time = 0
set_time = 0
for set1, set2 in itertools.permutations(random_sets, 2):
# Time intset operations
intset1 = set_to_intset(set1)
intset2 = set_to_intset(set2)
start_intset = time.time()
intset_difference(intset1, intset2)
intset_intersection(intset1, intset2)
intset_symmetric_difference(intset1, intset2)
intset_union(intset1, intset2)
end_intset = time.time()
intset_time += end_intset - start_intset
# Time set operations
start_set = time.time()
set_difference(set1, set2)
set_intersection(set1, set2)
set_symmetric_difference(set1, set2)
set_union(set1, set2)
end_set = time.time()
set_time += end_set - start_set
total_intset_time += intset_time
total_set_time += set_time
avg_intset_time = total_intset_time / repeat_count
avg_set_time = total_set_time / repeat_count
return avg_intset_time, avg_set_time
# Example Usage:
max_int = 2**18
set_count = 5
repeat_count = 1
random_sets = generate_random_sets_with_sample(max_int, set_count)
avg_intset_time, avg_set_time = time_set_operations(random_sets, repeat_count)
print(f"max_int: {max_int}, set_count: {set_count}, repeat_count: {repeat_count}")
print(f"Average time for intset operations: {avg_intset_time:.6f} seconds")
print(f"Average time for set operations: {avg_set_time:.6f} seconds")
if avg_intset_time > 0:
speedup = avg_set_time / avg_intset_time
print(f"Intset operations are approximately {speedup:.2f} times faster than set operations.")
else:
print("Set operations time was zero. Unable to calculate speedup.")
1. python function set_to_intset that given a set containing integers returns an
int with just those binary bits set to one; and a reverse function called
intset_to_set that given an int returns a set of ints where the ints are the
positions of all the bits that are one in the intset argument.
2. methods between two python sets: difference, intersection,
symmetric_difference and union can be achieved between two intsets using
logical operators and logical functions. What are they?
3. add a functionwith arguments max_int, set_count, and repeats; generates
set_count random sets of the ints 0 to max_int and the set of all ints from
zero to max_int.
4. use random.sample with a randomised sample count
5. population should be the range
6. append all_ints to random_sets then test the stated set functions on all
permutations of two of the random sets. accumulate the execution time of the
set functions
7. remove list from variable permutations then do not use an intermediate
variable permutations, just iterate on its expression.
8. Do similar timings for intset transformations of all random_sets
9. No, keep the method timings as before.
10. keep two time accumulators, one for timing the runs of the set functions; do
not time the transformations between set and intset; use the other timer for
timing the intset functions. return two two times
11. do not time the conversions. time the set operations too
12. time_set_operations should take a repeat count and repeat the for loop that
many times then return the avrage timings for each of sets, intsets
13. do not print the sets. print max_int, set_count, and repeat_count
14. show how much faster intset is
15. speedup calculation is wrong
16. speedup of intset is intset time / set time
17. no it is with respect to set time so divide by set time
18. that is not what I typed!
19. I was wrong about the speedum calculation
20. use it
21. inline the set andintset calculations when timing instead of calling out to
the set_* and intset_+ functions, (but keep their definitions)
22. use max_int = 3**18 and set_count = 10
23. use max_int = 2**18; set_count = 5; repeat_count = 1
24. use lru_cached versions of function set_to_intset
25. generate random_sets as frozensets directly
26. add function type_hints
27. only have type hinting on function arguments and return values
28. add docstrings to all functions and add a module docstring
No comments:
Post a Comment