Sunday, July 07, 2013

Set divisions/partitions

You all remember Venn diagrams, don't you?


Well, after needing to split two sets into:

  1. Items unique to set1
  2. Items common to both sets
  3. Items unique to set2
for yet another time, I thought that the obvious way of:
    exclusively1, common, exclusively2 = set1 - set2, set1 & set2, set2 - set1
traverses each set three times, and seemed like a useful bit of code to propose that Python add to its libraries.

I dutifully asked just that on comp.lang.python-ideas, and the debate is healthily progressing.

The two-set case

Oscar Benjamin and Peter Otten hammered out this code for an algorithm to do the split whilst traversing each set only once; that is, we are down to two traversals in total.

def partition(setx, sety):
    # One pass over each set: classify every item of setx as common or
    # x-only, then collect the items of sety that are not in setx.
    xonly, xandy, yonly = set(), set(), set()
    for x in setx:
        if x in sety:
            xandy.add(x)
        else:
            xonly.add(x)
    for y in sety:
        if y not in setx:
            yonly.add(y)
    return xonly, xandy, yonly
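
A quick check, with example values of my own:

exclusively1, common, exclusively2 = partition({1, 2, 3}, {3, 4, 5})
assert (exclusively1, common, exclusively2) == ({1, 2}, {3}, {4, 5})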

I liked the code. But then I thought, what about more than two sets? What would a partition of more than two sets even look like?

The N-set case

This is where Venn diagrams come in. Given three sets, you want to return sets of items corresponding to the seven regions of the three-set Venn diagram. Looking at the partition function above, I couldn't work it out and went to bed on it.

Next morning I read a lot more of the Wikipedia article on Venn diagrams and picked up on them being a visualization of good ol' truth tables. That spark helped me understand that the sets could represent the three columns of input: if I add a binary count below the three columns, it looks like the LHS of a truth table. But the truth table for three inputs has eight lines, not seven?

If the ones in the truth table represent an item being in a set, then you get the following relationship between the input sets set0, set1, set2, and the contents of the output bins:

SET
210
===
000 <-> ???
001 <-> bin_0
010 <-> bin_1
011 <-> bin_0&1
100 <-> bin_2
101 <-> bin_0&2
110 <-> bin_1&2
111 <-> bin_0&1&2

(Scan across any row: a one in a column means that items in the bin are members of that column's set; a zero means they are not.)
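
That scan is easy to express directly; here is a minimal sketch of the binning rule (names and example values are mine, using the example sets that appear further down):

set0, set1, set2 = {1, 3, 5, 7}, {2, 3, 6, 7}, {4, 5, 6, 7}
item = 5
# Each membership test supplies one bit of the row (and bin) index:
binindex = ((item in set2) << 2) | ((item in set1) << 1) | (item in set0)
print(binindex) # 5, i.e. row 101, so item 5 belongs in bin_0&2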

I also worked out that the initial row of 000 must correspond to a bin for items that are in no set?! Since the items all come from one of the sets, this bin will always be empty.

O.K. I have a relationship between the number of input sets, binary counts and output binning, but no code.

I wrote the following few lines, got stumped, and decided to finish reading "The Long War" at a pub on the River Avon whilst sipping Thatchers Gold cider served so cold it comes as a slushy.

set0 = {1, 3, 5, 7}
set1 = {2, 3, 6, 7}
set2 = {4, 5, 6, 7}

def venndiv(*sets):
    bins = tuple(set() for n in range(2**len(sets) - 1))
    for i, si in enumerate(sets):

The Slog

Back from the pub; a walk; playing with my youngest son; ...

I took another hard look at the code I had and eventually decided that I just had to get something down and play around with it until it felt right. I finally came up with the following function that, when tested, seemed to give the right answers:

def venndiv(*sets):
    bins = tuple(set() for n in range(2**len(sets)))
    for i, si in enumerate(sets): # Over every individual item of each set
        for item in si:
            binindex = 0
            for j, sj in enumerate(sets): # Test for membership of each set
                if item in sj:
                    binindex += 2**j # Binary count binning
            bins[binindex].add(item)
    return bins[1:] # bins[0] - an item in no set is impossible. 

print(venndiv({1,2,3}, {3,4,5})) # (set([1, 2]), set([4, 5]), set([3]))
assert venndiv(set0, set1, set2) == (set([1]), set([2]), set([3]), set([4]),
                                     set([5]), set([6]), set([7]))

setA = set('A AB AC ABC'.split())
setB = set('B AB BC ABC'.split())
setC = set('C AC BC ABC'.split())

print(venndiv(setA, setB, setC)) # (set(['A']), set(['B']), set(['AB']), set(['C']),
                                 #  set(['AC']), set(['BC']), set(['ABC']))

Of course, you also need to know the order of the results, so I then came up with the following:

def bin_order(n):
    return ['bin' + ''.join(chr(65+j) for j in range(n) if i & 1<<j)
            for i in range(1, 2**n)]

print(bin_order(3)) # ['binA', 'binB', 'binAB', 'binC', 'binAC', 'binBC', 'binABC']
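
Tying the two together, a small wrapper of my own (venndict is a hypothetical name, not part of the exercise) labels each bin:

def venndict(*sets):
    # bin_order and venndiv share the same binary-count ordering,
    # so zip pairs each name with its bin.
    return dict(zip(bin_order(len(sets)), venndiv(*sets)))

print(venndict(setA, setB, setC)['binAC']) # set(['AC'])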

End bit

I am not asking for function venndiv to become part of the Python libraries. It is just an exercise to see if I could expand on the useful two-set case.

Cheers :-)

Tuesday, July 02, 2013

Job arrays: Nice idea, shame about the implementations.

We have a compute cluster at work that uses IBM Platform Computing's LSF to schedule jobs. It works well on the whole, but given the kind of jobs we mostly run (hundreds to tens of thousands of very similar simulations with differing input conditions), it would work better if LSF were based on the running of arrays of jobs rather than the single-job submission we have now.

LSF (and others, such as the Sun Grid Engine) does have its Job Array feature that allows me to submit an "array" of 25K jobs to a queue in one go, but internally they get administered as 25K single jobs. This leads to problems of scale: if I want to change the parameters of the job array then I can use the bmod command on all 25K elements of the array at once, but this may slow the whole LSF system, with others having to wait to submit jobs while LSF internally makes identical changes to 25K separate jobs. And if I have five million jobs to run in 300 job arrays over a week, I can't submit them all at once, as LSF gets a heart attack!
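
For concreteness, an LSF array submission looks something like this (from memory, so check your site's LSF documentation; the job and script names are made up):

bsub -J "sims[1-25000]" ./run_sim.sh
# One submission creates 25K elements; each element reads its own
# index from the LSB_JOBINDEX environment variable.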

I think it's time that job schedulers were written to work only on job arrays: a single job should be treated as a special-cased one-element array, with extra support from the LSF commands, but internally arrays should be king, allowing the better use of multi-core compute clusters. After all, with internally developed tools, open-source tools, and all-you-can-eat licensed proprietary software, together with self-checking / automatic triage of simulations, you can think differently about what it is possible to achieve on your company's compute farm; and at the moment, poor implementations of job arrays get in the way.

And whilst I am at it: if EDA companies want to create their own regression runners, then they should support job arrays too. And I mean proper support that actually works when tried, rather than being in the manual but not actually working (I won't name anyone/anytwo)!