Sunday, July 07, 2013

Set divisions/partitions

You all remember Venn diagrams don't you?


Well, after needing to split two sets into:

  1. Items unique to set1
  2. Items common to both sets
  3. items unique to set2
For yet another time, I thought that the obvious way of::
    exclusively1, common, exclusively2 = set1 - set2, set1 & set2, set2 - set1
was having to traverse each set three times, and, seemed like a useful bit of code to propose that Python add to its libraries.

I dutifully asked just that on comp.lang.python-ideas, and the debate is healthily progressing.

The two set case

Oscar Benjamin and Peter Otten hammered out this code for an algorithm to do the split whilst traversing each set only once, that is we are down to two traversals in total.

def partition(setx, sety):
    xonly, xandy, yonly = set(), set(), set()
    for x in setx:
        if x in sety:
            xandy.add(x)
        else:
            xonly.add(x)
    for y in sety:
        if y not in setx:
            yonly.add(y)
    return xonly, xandy, yonly

I liked the code. But then I thought, what about more than two sets? What would a partition of more than two sets even look like?

The N set case

This is where Venn diagrams come in. Given three sets you want to return sets of items corresponding to the seven regions of the three set Venn diagram. Looking at the partition function above, I couldn't work it out  and went to bed on it.

Next morning I read a lot more of the Wikipedia article on Venn diagrams and picked up on them being a visualization of good ol' truth tables. That spark helped me understand that the sets could represent the three columns of input, if I add a binary count below the three then it looks like the LHS of a truth table; but the truth table for three inputs has eight lines not seven?

If the ones in the truth table represent an item being in a set then you get the following relationship between the input sets set0, set1, set2; and the contents of the output bins:

SET
210
===
000 <-> ???
001 <-> bin_0
010 <-> bin_1
011 <-> bin_0&1
100 <-> bin_2
101 <-> bin_0&2
110 <-> bin_1&2
111 <-> bin_0&1&2

(Scan across any row. For every one bit set then the set of that column is in the bin)

I also worked out that the initial row of 000 must correspond to a bin for items that are in no set?! Since the items come from a set then this bin will always be empty.

O.K. I have a relationship between the number of input sets, binary counts and output binning, but no code.

I wrote the following few lines, got stumped, and decided to finish reading "The long war" at a Pub on the river Avon whilst sipping Thatchers Gold cider served so cold it comes as a slushy.

set0 = {1, 3, 5, 7}
set1 = {2, 3, 6, 7}
set2 = {4, 5, 6, 7}

def venndiv(*sets):
    bins = tuple(set() for n in range(2**len(sets)) - 1)
    for i, si in enumerate(sets):

The Slog

Back from the pub; a walk; playing with my youngest son; ...

I took another hard look at the code I had and eventually thought that I just had to get something down and play around with it until it felt right. I finally came up with the following function that when tested seemed to give the right answers:

def venndiv(*sets):
    bins = tuple(set() for n in range(2**len(sets)))
    for i, si in enumerate(sets): # Over every individual item of each set
        for item in si:
            binindex = 0
            for j, sj in enumerate(sets): # Test for membership of each set
                if item in sj:
                    binindex += 2**j # Binary count binning
            bins[binindex].add(item)
    return bins[1:] # bins[0] - an item in no set is impossible. 

print(venndiv({1,2,3}, {3,4,5})) # (set([1, 2]), set([4, 5]), set([3]))
assert venndiv(set0, set1, set2) == (set([1]), set([2]), set([3]), set([4]),
                                     set([5]), set([6]), set([7]))

setA = set('A AB AC ABC'.split())
setB = set('B AB BC ABC'.split())
setC = set('C AC BC ABC'.split())

print(venndiv(setA, setB, setC)) # (set(['A']), set(['B']), set(['AB']), set(['C']),
                                 #  set(['AC']), set(['BC']), set(['ABC']))

Of course, you also need to know the order for the result so I then came up with the following:

def bin_order(n):
    return ['bin' + ''.join(chr(65+j) for j in range(n) if i & 1<for
i in range(1,2**n)]

 print(bin_order(3)) # ['binA', 'binB', 'binAB', 'binC', 'binAC', 'binBC', 'binABC']

End bit

I am not asking for function venndiv to become part of the Python libraries. It is just an exercise to see if I could expand on the useful two-set case.

Cheers :-)






9 comments:

  1. Is there any way to indent code in these comments? I haven’t found one, so instead I put the code I wanted to post here on gist.

    It’s a different way to write your venndiv function, more to my personal taste (but perhaps less to yours).

    ReplyDelete
    Replies
    1. Bosker, Any chance of maybe you doing your own blog entry on how you developed your answer as well as expanding on how it works?

      Delete
    2. Sure! Thanks for the suggestion. http://bosker.wordpress.com/2013/07/10/venn-diagram-partitioning/

      Delete
  2. Hi bosker, saw your code and will take a look in the next 12 hour hopefully.

    As for the lack of decent editing facilities; It seems Google gets enough ad revenue with this level of support. I and others have asked years ago but not much changes. I wouldn't be surprised if they can blogger in the near future as not being "shiney-shiney" enough.

    ReplyDelete
  3. Hi paddy3118,
    Thast exactly what I need :)

    But, there has been a problem in the visualization of the bin_order function, could you paste it in plain text.
    Thanks in advance

    Peio

    ReplyDelete
    Replies
    1. Sorry for the late reply.

      def bin_order(n):

      ....return ['bin' + ''.join(chr(65+j) for j in range(n) if i & 1<<j)

      ............for i in range(1,2**n)]


      Delete
  4. Hello,

    I am wondering if there is a way to pass n-number of sets into this function, instead of specifying one by one?

    Thanks.

    ReplyDelete
    Replies
    1. if you have a list pf sets just unpack it in the call, something like:

      venndiv(*list_of_sets)

      Delete
    2. Thanks, this helps. One more question: say that I am interested in only finding unique elements to a set rather than computing all possible binary combinations. For example, given A, B and C sets this code returns A, B, AB, C, AC, BC, ABC. I am interested in calculating only A, B and C conditions. How can I modify the code to obtain this result?

      Delete