Mainly Tech projects on Python and Electronic Design Automation.

Sunday, November 03, 2013

I have a book for review.

This is new to me. A publisher put out a request for reviewers and I thought I would give it a go. "Python in Practice" by Mark Summerfield has arrived.

I get a book, which is nice. I get to try my hand at writing a book review - a new experience, always welcome. But, best of all, someone thinks I'm worth giving the opportunity. (Although they did say they are in the humanities when they stated that my blog looked "suitably technical".) Oh well ... :-)



I'm gonna get a smartphone! And I'm gonna run Python on it!

I get by with my company mobile and serious WiFi at home from Virgin: 120Mb/s, which is fast for the UK. Although our two teenage kids (young adults now) have had (and lost) smartphones for years, I had not succumbed until yesterday, when I put in an order for the Nexus 5.

Hell, I haven't worked out if I will have to carry two mobiles or what carrier deal to go for, but I have been watching the market for a while and think I have made the right choice of phone.

From my present use of gadgets, I think I want to be able to:

  1. Read books.
  2. Do admin tasks for Rosetta Code whilst out and about.
  3. Capture code notes and snippets. I often write Python snippets with the core of an idea that I later expand. I am not thinking of doing nights of development on the Nexus 5, but I do want to be able to edit and run some code, or cut-n-paste fifty lines of code from the web and run it on my phone to answer a Stack Overflow question.
  4. Use that pedometer function of Android KitKat as I sit for too long :-)
  5. Create a WiFi hotspot in the car for those long journeys with family where everyone has a laptop and/or tablet, and the teenagers would rather use daddy's WiFi than expend their pay-as-you-go credit.
I have a script I wrote on my PC that uses wget to download pages from Rosetta Code, then a Python program to search for new pages and display them in a browser, as well as prompt me on whether a page should be deleted and its user blocked if they are spamming. The Python creates a script for iMacros, which then deletes pages/blocks users as requested.
It is a lot of functionality to duplicate on Android, but I'll probably give it a go sometime. If you know of anything that could help with or substitute for things like wget and iMacros on Android then please add to the comments, though I should stress that I haven't started looking myself as yet.
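For what it's worth, the page-fetching half has a possible stdlib substitute that should also run under Python-on-Android environments such as SL4A or QPython. The sketch below lists newly created Rosetta Code pages via the standard MediaWiki recentchanges API; the /mw/api.php endpoint and the namespace/limit parameters are my assumptions and untested on a phone. The deleting/blocking half would need an authenticated API session with admin rights, which I won't sketch here.

import json
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

# Assumed MediaWiki API endpoint for Rosetta Code; adjust if the wiki lives elsewhere.
URL = ('http://rosettacode.org/mw/api.php'
       '?action=query&list=recentchanges&rctype=new'
       '&rcnamespace=0&rclimit=50&format=json')

data = json.loads(urlopen(URL).read().decode('utf-8'))
for change in data['query']['recentchanges']:
    print(change['title'])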


Monday, October 21, 2013

Unifying Python's string and list sequences for fsplit

I was looking at a particular post A more functional split function by Simon Grondin and wondered what it would look like in Python.

Simon describes his function thus:
"Fsplit takes a single ‘predicate’ function as argument. It loops over the array and adds one element at a time to a ‘buffer’ array. After adding an element, it calls the predicate on the buffer. If the predicate returns true, the buffer without the last element is added to the ‘ret’ array. At the end, that ‘ret’ array is returned, filtered from any 0-length elements."
Now a Python version would take both the predicate and the array as arguments and, since it is to be in a functional style, I decided to make it a generator function.

fsplit_list

Simon's first fsplit function works on arrays and had as its first test to split an array of integers on the occurrence of the integer 3. The second test was to "get all consecutive intervals with a sum of 4 or less".

I coded that up as the following Python:

>>> def fsplit_list(predicate, array):
    buf = []
    for c in array:
        buf.append(c)
        if predicate(buf):
            if buf[:-1]: yield buf[:-1]
            buf = [] if predicate([c]) else [c]
    else:
        if buf: yield buf


>>> # Split on 3
>>> list(fsplit_list(lambda x: 3 in x, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [4, 5, 1, 2], [4, 5]]
>>> # Get all consecutive intervals with a sum of 4 or less
>>> list(fsplit_list(lambda x: sum(x) > 4, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [3], [4], [1, 2], [3], [3], [4]]
>>>

fsplit_string

Simon then needed another function in his typed language that took a predicate and a string as arguments and worked in a similar way for strings. The test in this case was to split a string into substrings with no more than two vowels in each. My Python code was the similar function fsplit_string:

>>> import re
>>> def fsplit_string(predicate, string):
    buf = ''
    for c in string:
        buf += c
        if predicate(buf):
            if buf[:-1]: yield buf[:-1]
            buf = '' if predicate(c) else c
    else:
        if buf: yield buf


>>> # String
>>> list(fsplit_string(lambda string: len(re.findall(r'[aeiou]', string)) > 2, "lorem ipsum dolor sit amet"))
['lorem ', 'ipsum d', 'olor s', 'it am', 'et']
>>>

Unification

I wanted to have one function that worked on both lists and strings, but without executing separate code dependent on the type of the sequence being split. Specifically, it should work equally on lists and strings without testing for the sequence type.

Comparing the two functions above, I saw differences in the initialization of buf, and in the fact that iterating through a string gives c as a string, whereas for a list c is not a list - it is whatever object is in the list. This last point affects how buf is extended: for strings you can use += but lists have to be appended to.

My solution with the same tests is as follows:

>>> def fsplit(predicate, sequence):
    buf = type(sequence)()
    for c in (sequence[i:i+1] for i in range(len(sequence))):
        buf += c
        if predicate(buf):
            if buf[:-1]: yield buf[:-1]
            buf = type(sequence)() if predicate(c) else c
    else:
        if buf: yield buf


>>> # Split on 3
>>> list(fsplit(lambda x: 3 in x, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [4, 5, 1, 2], [4, 5]]
>>> # Get all consecutive intervals with a sum of 4 or less
>>> list(fsplit(lambda x: sum(x) > 4, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [3], [4], [1, 2], [3], [3], [4]]
>>> # String
>>> list(fsplit(lambda string: len(re.findall(r'[aeiou]', string)) > 2, "lorem ipsum dolor sit amet"))
['lorem ', 'ipsum d', 'olor s', 'it am', 'et']
>>>

That strange initialization of buf works because calling the type of a sequence gives the empty sequence of that type.
I realised that if I could create successive one-element slices from a list as the name c, then they could be concatenated using += just like for strings; hence the for statement that generates successive items from the list as one-element lists. It works in a similar way for strings too, giving successive characters as one-character strings (which is what the "normal" for statement did in fsplit_string anyway).
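A quick interactive check of those two tricks (my addition, not from the original session): calling the type of a sequence gives its empty sequence, and one-element slices concatenate with += for both types:

>>> type('abc')(), type([1, 2, 3])()
('', [])
>>> 'abc'[1:2], [1, 2, 3][1:2]
('b', [2])
>>>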

I guess some of the fiddling about is because strings are immutable whereas lists are mutable in Python, but I wonder if there is a better/more pythonic way of writing function fsplit to the same constraints?

END.


Saturday, October 05, 2013

Project Euler #8 solution

Someone asked an innocuous question related to a Project Euler problem where you had to find the largest product made by multiplying five consecutive digits of a thousand-digit number.

The digits were presented as twenty rows of fifty characters and the question was about how to grab the web page and extract the number.

I decided to do a straightforward copy-paste of the digits into a string and go from there.

The algorithm I use recognises that finding a zero in the digits zeroes out any product, and it just traverses the digits in order, performing appropriate actions based on accumulated state:

# http://projecteuler.net/problem=8#!

xstring = '''73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450'''
x = int(xstring.replace('\n', ''))
xl = [int(ch) for ch in xstring if ch != '\n']

# current product; current maximum and index of where seen; number of digits contributing to prod
prod, mx, digits = 0, (0, -1), 0
for i, digit in enumerate(xl):
    if digit == 0:
        # Need to start again at next non-zero
        prod = digits = 0
    else:
        digits += 1
        if digits == 1:
            prod = digit
        else:
            prod *= digit
        if digits > 5:
            # Remove influence of the first of six digits currently in product
            prod //= xl[i - 5]
            digits -= 1
        if digits == 5 and prod > mx[0]:
            # New max at position i
            mx = (prod, i)
if mx[0]:
    print ("The greatest product of five consecutive digits in the 1000-digit number")
    print ("is %i for ...%s... occurring at digits in position up to %i" %
           (mx[0], ''.join(str(d) for d in xl[mx[1] - 4: mx[1] + 1]), mx[1]))
else:
    print ("The greatest product of five consecutive digits in the 1000-digit number does not occur")


The output is:

The greatest product of five consecutive digits in the 1000-digit number
is 40824 for ...99879... occurring at digits in position up to 368
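For comparison, a brute-force alternative (my addition, reusing xstring from above) simply takes the maximum over every five-digit window; it is shorter but does not show the zero-handling idea:

from functools import reduce

digits = [int(ch) for ch in xstring if ch != '\n']
best, at = max((reduce(lambda a, b: a * b, digits[i:i + 5]), i + 4)
               for i in range(len(digits) - 4))
print(best, ''.join(str(d) for d in digits[at - 4:at + 1]))  # expected: 40824 99879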

- END.

Tuesday, August 20, 2013

Recursive Unix bang-bang in Python

There is a new series that goes 2, 3, 5, 17, 161, 15681, … and has been added to OEIS as well as being blogged about by both Shadab Ahmed and Robin Houston.

The series comes from a recurrence relation that you get by sitting at a unix bash terminal session and first typing in
    : '!!'
Then
    : "!!" '!!'
Then after that just hitting the up arrow key to get the last command expanded by the shell, then hitting return; ad infinitum.

The series comes from counting the number of '!!' pairs in each successive line.

Now I don't always have unix and bash to hand so I decided to write a Python "shell" program with enough functionality to simulate the recurrence relation.

In the bash shell:

  1. !! is replaced by the previous command if it is outside of any quotes.
  2. !! is replaced by the previous command if it is inside a double quoted string, ("...").
  3. !! is NOT replaced by the previous command if it is inside a single quoted string, ('...').
  4. Quoted strings do not nest. Single quotes inside double quotes are treated as plain characters and equally double quotes inside single quotes are treated as plain characters; (they don't terminate the current type of quoted string).
  5. Apart from when !! is replaced as above, characters are copied to form the latest expanded command.
  6. The up-arrow key should put the previous expanded command on the input line.


The following works on Windows for me using Python 2.7 (Anaconda), which has a Windows version of the readline module.

# -*- coding: utf-8 -*-
"""
Created on Tue Aug 20 17:23:37 2013

@author: Paddy McCarthy
"""

try:    # Python 2/3 compatibility
    raw_input
except:
    raw_input = input
import readline


prev = ": '!!'"
print('$ ' + prev)
while True:
    cmd = raw_input('$ ').rstrip()
    dquote = squote = skipnext = False
    result = []
    for this, nxt in zip(cmd, cmd[1:] + ' '):
        if skipnext:
            skipnext = False
            continue
        elif this == '"' and not squote:
            dquote = not dquote
            result.append(this)
        elif this == "'" and not dquote:
            squote = not squote
            result.append(this)
        elif this == '!' == nxt and not squote:
            skipnext = True
            result.append(prev)
        else:
            result.append(this)
    result = ''.join(result)
    print('%s # bang-bang count = %i' % (result, result.count('!!')))
    readline.add_history(result)
    prev = result


I have used it to verify the first 6 terms of the series so far.

cmd.exe running the Python bang-bang shell simulator
-END.

Sunday, July 07, 2013

Set divisions/partitions

You all remember Venn diagrams don't you?


Well, after needing to split two sets into:

  1. Items unique to set1
  2. Items common to both sets
  3. Items unique to set2
for yet another time, I thought that the obvious way of:
    exclusively1, common, exclusively2 = set1 - set2, set1 & set2, set2 - set1
has to traverse each set three times, and that this seemed like a useful bit of code to propose that Python add to its libraries.

I dutifully asked just that on comp.lang.python-ideas, and the debate is healthily progressing.

The two set case

Oscar Benjamin and Peter Otten hammered out this code for an algorithm that does the split whilst traversing each set only once; that is, we are down to two traversals in total.

def partition(setx, sety):
    xonly, xandy, yonly = set(), set(), set()
    for x in setx:
        if x in sety:
            xandy.add(x)
        else:
            xonly.add(x)
    for y in sety:
        if y not in setx:
            yonly.add(y)
    return xonly, xandy, yonly
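For example (a quick check of my own, not from the python-ideas thread):

>>> partition({1, 2, 3}, {3, 4, 5})
(set([1, 2]), set([3]), set([4, 5]))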

I liked the code. But then I thought, what about more than two sets? What would a partition of more than two sets even look like?

The N set case

This is where Venn diagrams come in. Given three sets you want to return sets of items corresponding to the seven regions of the three-set Venn diagram. Looking at the partition function above, I couldn't work it out and went to bed on it.

Next morning I read a lot more of the Wikipedia article on Venn diagrams and picked up on them being a visualization of good ol' truth tables. That spark helped me understand that the sets could represent the three columns of input: if I add a binary count below the three then it looks like the LHS of a truth table; but the truth table for three inputs has eight lines, not seven?

If the ones in the truth table represent an item being in a set then you get the following relationship between the input sets set0, set1, set2; and the contents of the output bins:

SET
210
===
000 <-> ???
001 <-> bin_0
010 <-> bin_1
011 <-> bin_0&1
100 <-> bin_2
101 <-> bin_0&2
110 <-> bin_1&2
111 <-> bin_0&1&2

(Scan across any row. For every one bit set then the set of that column is in the bin)

I also worked out that the initial row of 000 must correspond to a bin for items that are in no set?! Since the items come from a set then this bin will always be empty.

O.K. I have a relationship between the number of input sets, binary counts and output binning, but no code.

I wrote the following few lines, got stumped, and decided to finish reading "The Long War" at a pub on the river Avon whilst sipping Thatchers Gold cider served so cold it comes as a slushy.

set0 = {1, 3, 5, 7}
set1 = {2, 3, 6, 7}
set2 = {4, 5, 6, 7}

def venndiv(*sets):
    bins = tuple(set() for n in range(2**len(sets)) - 1)
    for i, si in enumerate(sets):

The Slog

Back from the pub; a walk; playing with my youngest son; ...

I took another hard look at the code I had and eventually thought that I just had to get something down and play around with it until it felt right. I finally came up with the following function that when tested seemed to give the right answers:

def venndiv(*sets):
    bins = tuple(set() for n in range(2**len(sets)))
    for i, si in enumerate(sets): # Over every individual item of each set
        for item in si:
            binindex = 0
            for j, sj in enumerate(sets): # Test for membership of each set
                if item in sj:
                    binindex += 2**j # Binary count binning
            bins[binindex].add(item)
    return bins[1:] # bins[0] - an item in no set is impossible. 

print(venndiv({1,2,3}, {3,4,5})) # (set([1, 2]), set([4, 5]), set([3]))
assert venndiv(set0, set1, set2) == (set([1]), set([2]), set([3]), set([4]),
                                     set([5]), set([6]), set([7]))

setA = set('A AB AC ABC'.split())
setB = set('B AB BC ABC'.split())
setC = set('C AC BC ABC'.split())

print(venndiv(setA, setB, setC)) # (set(['A']), set(['B']), set(['AB']), set(['C']),
                                 #  set(['AC']), set(['BC']), set(['ABC']))

Of course, you also need to know the order for the result so I then came up with the following:

def bin_order(n):
    return ['bin' + ''.join(chr(65 + j) for j in range(n) if i & 1 << j)
            for i in range(1, 2**n)]

 print(bin_order(3)) # ['binA', 'binB', 'binAB', 'binC', 'binAC', 'binBC', 'binABC']
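Zipping the two together labels each region of the three-set Venn diagram (a quick illustration of my own):

for name, region in zip(bin_order(3), venndiv(setA, setB, setC)):
    print('%s: %s' % (name, sorted(region)))
# binA: ['A']
# binB: ['B']
# binAB: ['AB']
# binC: ['C']
# binAC: ['AC']
# binBC: ['BC']
# binABC: ['ABC']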

End bit

I am not asking for function venndiv to become part of the Python libraries. It is just an exercise to see if I could expand on the useful two-set case.

Cheers :-)






Tuesday, July 02, 2013

Job arrays: Nice idea, shame about the implementations.

We have a compute cluster at work that uses IBM platform computing's LSF to schedule jobs. It works well on the whole, but given the kind of jobs we run mostly: hundreds to tens of thousands of very similar simulations with differing input conditions; then it would work better if LSF was based on the running of arrays of jobs rather than the single job submission we have now.

LSF (and others such as the Sun Grid Engine) does have its Job Array feature that allows me to submit an "array" of 25K jobs to a queue in one go, but internally they get administered as 25K single jobs. This leads to problems of scale: if I want to change the parameters of the job array then I can use the bmod command on all 25K elements of the array at once, but this may cause the whole LSF system to slow down, with others having to wait to submit jobs as LSF internally makes identical changes to 25K separate jobs. If I have five million jobs to run in 300 job arrays over a week, I can't submit them all at once as LSF gets a heart attack!

I think it's time that job schedulers were written to work only on job arrays: a single job should be treated as a special-cased one-element array with extra support from the LSF commands, but internally arrays should be king, allowing better use of multi-core compute clusters. After all, with internally developed tools, opensource tools, and all-you-can-eat licensed proprietary software, together with self-checking / automatic triage of simulations, you can think differently about what it is possible to achieve on your company's compute farm; and at the moment, poor implementations of job arrays get in the way.

And whilst I am at it: if EDA companies want to create their own regression runners, then they should support job arrays too. And I mean proper support that actually works when tried, rather than being in the manual but not actually working (I won't name anyone/anytwo)!

Friday, June 14, 2013

Filtering an iterator into N parts lazily

I read "Filter a list into two parts" by Ned Batchelder where he mentions this solution by Peter Otten for a split into two based on the result of a boolean function (or predicate):

def partition(items, predicate=bool):
    a, b = itertools.tee((predicate(item), item) for item in items)
    return ((item for pred, item in a if not pred),
            (item for pred, item in b if pred))

Now it works, but the use of a and b seemed clumsy. I decided to make the function more functional and generalise it to a split into more than two parts, as a means of getting the technique to stick.

The above function uses a boolean predicate and, in the return statement, tests '...a if not pred' and '...b if pred'. This is the same as testing pred==False and then pred==True which, because of the duality between False/True and the integers 0/1, means we could check pred==0 in the first case and pred==1 in the second.

So, to split an iterator into n parts we need to pass n as an argument, and we swap to a predicate function that returns 0 to n-1 as the filter for each of the returned n iterators. itertools.tee takes an optional second integer argument to allow it to tee the input iterator into n parts, so, as I wrote in my comment on Ned's blog post, you can do the following:

import itertools

def partitionn(items, predicate=int, n=2):
    tees = itertools.tee( ((predicate(item), item)
                          for item in items), n )
    return ( (lambda i:(item for pred, item in tees[i] if pred==i))(x)
              for x in range(n) )

I left it there on Ned's blog, but looking at it tonight I wanted to see if I could rid myself of the tees name (it is just an iterator of n iterators), and maybe make the function even more functional?

A bit of refactoring left me with the solution below which has no names outside the generator expression, runs the predicate once on each item, and lazily works on the input iterator as well as returning lazy iterators:

>>> def partitionn2(items, predicate=int, n=2):
    return ( (lambda i, tee:(item for pred, item in tee if pred==i))(x, t)
              for x, t in enumerate(itertools.tee( ((predicate(item), item)
                                                    for item in items), n )) )

>>> partitions = partitionn(items=range(15), predicate=lambda x: x % 2, n=2)
>>> for p in partitions: print(p, list(p))

<generator object <genexpr> at 0x02D3C828> [0, 2, 4, 6, 8, 10, 12, 14]
<generator object <genexpr> at 0x02DCAEB8> [1, 3, 5, 7, 9, 11, 13]
>>> partitions = partitionn(items=range(15), predicate=lambda x: x % 4, n=4)
>>> for p in partitions: print(p, list(p))

<generator object <genexpr> at 0x02DCAD28> [0, 4, 8, 12]
<generator object <genexpr> at 0x02DCAE90> [1, 5, 9, 13]
<generator object <genexpr> at 0x02DCAFA8> [2, 6, 10, 14]
<generator object <genexpr> at 0x02DCAFD0> [3, 7, 11]
>>>
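One caveat worth adding (my note, not from Ned's post): because the split is built on itertools.tee, consuming one partition far ahead of the others makes tee buffer the pending items, so for a heavily skewed split a single pass appending to n lists may use less memory. Interleaved consumption behaves as you would expect:

>>> evens, odds = partitionn2(items=range(10), predicate=lambda x: x % 2, n=2)
>>> next(evens), next(odds), next(evens)
(0, 1, 2)
>>>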


Done:


Saturday, May 25, 2013

Greedy Ranking Algorithm in Python


I mentioned in an earlier post that I had written my own ranker and thought I'd revisit this with some code.

I verify and ensure the safety of microprocessors for my day job. One way that very complex CPUs are tested is to create another model of the chip which can be used to generate pseudo-random instruction streams to run on the CPU. The so-called ISG (instruction stream generator) can create thousands (millions!) of these tests in very little time, and the ISG is written in such a way that it can be 'tweaked' to give some control or steering to what the instruction streams will exercise on the CPU.

Now simulating these instruction streams and gathering information on just what parts of the CPU are exercised (called covered) by each individual test takes time, and multiple ISG-generated tests may cover the same regions of the CPU. To increase the overall coverage of the CPU we run what is called a regression: all the tests are run, and their coverage and the time they take to simulate are stored. At the end of the regression run you may have several thousands of tests that each cover only part of the CPU.

If you take the regression results and rank them you can find the subset of the tests that gives all the coverage. Usually thousands of pseudo-random tests might be ranked to generate a sub-list of only hundreds of tests that, when run, would give the same coverage. What we then usually do is look at what isn't covered and generate some more tests, by the ISG or other methods, to try and fill the gaps; run the new regression and rank again in a loop to fully exercise the CPU and hit some target coverage goal.

Ranking tests is an important part of the regression flow described above, and when it works well you forget about it. Unfortunately, sometimes I want to rank other data, for which the stock ranking program from the CAD tool vendors does not fit. So here are the guts of a ranking program that will scale to handling hundreds of thousands of tests and coverage points.

Input

Normally I have to parse my input from text or HTML files of results generated by other CAD programs - it is tedious work that I will skip by providing idealised inputs in the form of a Python dict. (Sometimes the code for parsing input files can be as large as, or larger than, the ranking algorithm.)

Let us assume that each ISG test has a name, runs for a certain 'time', and when simulated is shown to 'cover' a set of numbered features of the design. After the parsing, the gathered input data is represented by the results dict in the program.

  1 
  2 results = {
  3 #    'TEST': (  TIME, set([COVERED_POINT ...])),
  4   'test_00': (  2.08, set([2, 3, 5, 11, 12, 16, 19, 23, 25, 26, 29, 36, 38, 40])),
  5   'test_01': ( 58.04, set([0, 10, 13, 15, 17, 19, 20, 22, 27, 30, 31, 33, 34])),
  6   'test_02': ( 34.82, set([3, 4, 6, 12, 15, 21, 23, 25, 26, 33, 34, 40])),
  7   'test_03': ( 32.74, set([4, 5, 10, 16, 21, 22, 26, 39])),
  8   'test_04': (100.00, set([0, 1, 4, 6, 7, 8, 9, 11, 12, 18, 26, 27, 31, 36])),
  9   'test_05': (  4.46, set([1, 2, 6, 11, 14, 16, 17, 21, 22, 23, 30, 31])),
 10   'test_06': ( 69.57, set([10, 11, 15, 17, 19, 22, 26, 27, 30, 32, 38])),
 11   'test_07': ( 85.71, set([0, 2, 4, 5, 9, 10, 14, 17, 24, 34, 36, 39])),
 12   'test_08': (  5.73, set([0, 3, 8, 9, 13, 19, 23, 25, 28, 36, 38])),
 13   'test_09': ( 15.55, set([7, 15, 17, 25, 26, 30, 31, 33, 36, 38, 39])),
 14   'test_10': ( 12.05, set([0, 4, 13, 14, 15, 24, 31, 35, 39])),
 15   'test_11': ( 52.23, set([0, 3, 6, 10, 11, 13, 23, 34, 40])),
 16   'test_12': ( 26.79, set([0, 1, 4, 5, 7, 8, 10, 12, 13, 31, 32, 40])),
 17   'test_13': ( 16.07, set([2, 6, 9, 11, 13, 15, 17, 18, 34])),
 18   'test_14': ( 40.62, set([1, 2, 8, 15, 16, 19, 22, 26, 29, 31, 33, 34, 38])),
 19  }
 20 

Greedy ranking algorithm

The object of the algorithm is to select and order a subset of the tests that:

  1. Cover as many of the coverage points as possible by at least one test.
  2. After the above, reduce the number of tests needed to achieve that maximum coverage by as much as is possible.
  3. Generate a ranking of the tests selected to allow an even smaller set of tests to be selected if necessary.
  4. After all the above, which take priority, it would be good to also reduce the total 'time' accrued by the ranked tests.
  5. Of course it needs to work for large sets of tests and points to cover.
The greedy algorithm works by first choosing the test giving the most coverage to be the test of highest rank, then the test giving the most incremental additional coverage as the next highest-ranking test, and so on...
If more than one test gives the same incremental additional coverage at any stage, then the test taking the least 'time' is picked.

The following function implements the algorithm:


 21 def greedyranker(results):
 22     results = results.copy()
 23     ranked, coveredsofar, costsofar, round = [], set(), 0, 0
 24     noncontributing = []
 25     while results:
 26         round += 1
 27         # What each test can contribute to the pool of what is covered so far
 28         contributions = [(len(cover - coveredsofar), -cost, test)
 29                          for test, (cost, cover) in sorted(results.items()) ]
 30         # Greedy ranking by taking the next greatest contributor                 
 31         delta_cover, benefit, test = max( contributions )
 32         if delta_cover > 0:
 33             ranked.append((test, delta_cover))
 34             cost, cover = results.pop(test)
 35             coveredsofar.update(cover)
 36             costsofar += cost
 37         for delta_cover, benefit, test in contributions:
 38             if delta_cover == 0:
 39                 # this test cannot contribute anything
 40                 noncontributing.append( (test, round) )
 41                 results.pop(test)
 42     return coveredsofar, ranked, costsofar, noncontributing
 43 

Each time through the while loop (line 25), the next best test is appended to the ranking, and tests that can no longer contribute any extra coverage are discarded (lines 37-41).

The function above is a bit dry, so I took a bit of time to annotate it with a tutor capability that, when run, prints out just what it is doing along the way:

The function with tutor

It implements the same thing but does it noisily:

 44 def greedyranker(results, tutor=True):
 45     results = results.copy()
 46     ranked, coveredsofar, costsofar, round = [], set(), 0, 0
 47     noncontributing = []
 48     while results:
 49         round += 1
 50         # What each test can contribute to the pool of what is covered so far
 51         contributions = [(len(cover - coveredsofar), -cost, test)
 52                          for test, (cost, cover) in sorted(results.items()) ]
 53         if tutor:
 54             print('\n## Round %i' % round)
 55             print('  Covered so far: %2i points: ' % len(coveredsofar))
 56             print('  Ranked so far: ' + repr([t for t, d in ranked]))
 57             print('  What the remaining tests can contribute, largest contributors first:')
 58             print('    # DELTA, BENEFIT, TEST')
 59             deltas = sorted(contributions, reverse=True)
 60             for delta_cover, benefit, test in deltas:
 61                 print('     %2i,    %7.2f,    %s' % (delta_cover, benefit, test))
 62             if len(deltas)>=2 and deltas[0][0] == deltas[1][0]:
 63                 print('  Note: This time around, more than one test gives the same')
 64                 print('        maximum delta contribution of %i to the coverage so far'
 65                        % deltas[0][0])
 66                 if deltas[0][1] != deltas[1][1]:
 67                     print('        we order based on the next field of minimum cost')
 68                     print('        (equivalent to maximum negative cost).')
 69                 else:
 70                     print('        the next field of minimum cost is the same so')
 71                     print('        we arbitrarily order by test name.')
 72             zeroes = [test for delta_cover, benefit, test in deltas
 73                      if delta_cover == 0]
 74             if zeroes:
 75                 print('  The following test(s) cannot contribute more to coverage')
 76                 print('  and will be dropped:')
 77                 print('    ' + ', '.join(zeroes))
 78 
 79         # Greedy ranking by taking the next greatest contributor                 
 80         delta_cover, benefit, test = max( contributions )
 81         if delta_cover > 0:
 82             ranked.append((test, delta_cover))
 83             cost, cover = results.pop(test)
 84             if tutor:
 85                 print('  Ranking %s in round %2i giving extra coverage of: %r'
 86                        % (test, round, sorted(cover - coveredsofar)))
 87             coveredsofar.update(cover)
 88             costsofar += cost
 89 
 90         for delta_cover, benefit, test in contributions:
 91             if delta_cover == 0:
 92                 # this test cannot contribute anything
 93                 noncontributing.append( (test, round) )
 94                 results.pop(test)
 95     if tutor:
 96         print('\n## ALL TESTS NOW RANKED OR DISCARDED\n')
 97     return coveredsofar, ranked, costsofar, noncontributing

Every block starting if tutor: above has the added code.

Sample output

The code to call the ranker and print the results is:

 98 
 99 
100 totalcoverage, ranking, totalcost, nonranked = greedyranker(results)
101 print('''
102 A total of %i points were covered, 
103 using only %i of the initial %i tests,
104 and should take %g time units to run.
105 
106 The tests in order of coverage added:
107     
108     TEST  DELTA-COVERAGE'''
109  % (len(totalcoverage), len(ranking), len(results), totalcost))
110 print('\n'.join('  %6s  %i' % r for r in ranking))

The output has a lot of stuff from the tutor followed by the result at the end.

For this pseudo-randomly generated test case of 15 tests it shows that only seven are needed to give the maximum total coverage. (And if you were willing to lose the coverage of the three tests that each cover only one additional point, then 4 out of 15 tests would give 92.5% of the maximum coverage possible.)

## Round 1
  Covered so far:  0 points: 
  Ranked so far: []
  What the remaining tests can contribute, largest contributors first:
    # DELTA, BENEFIT, TEST
     14,      -2.08,    test_00
     14,    -100.00,    test_04
     13,     -40.62,    test_14
     13,     -58.04,    test_01
     12,      -4.46,    test_05
     12,     -26.79,    test_12
     12,     -34.82,    test_02
     12,     -85.71,    test_07
     11,      -5.73,    test_08
     11,     -15.55,    test_09
     11,     -69.57,    test_06
      9,     -12.05,    test_10
      9,     -16.07,    test_13
      9,     -52.23,    test_11
      8,     -32.74,    test_03
  Note: This time around, more than one test gives the same
        maximum delta contribution of 14 to the coverage so far
        we order based on the next field of minimum cost
        (equivalent to maximum negative cost).
  Ranking test_00 in round  1 giving extra coverage of: [2, 3, 5, 11, 12, 16, 19, 23, 25, 26, 29, 36, 38, 40]

## Round 2
  Covered so far: 14 points: 
  Ranked so far: ['test_00']
  What the remaining tests can contribute, largest contributors first:
    # DELTA, BENEFIT, TEST
     12,     -58.04,    test_01
     10,    -100.00,    test_04
      9,     -12.05,    test_10
      9,     -26.79,    test_12
      9,     -85.71,    test_07
      8,      -4.46,    test_05
      7,     -15.55,    test_09
      7,     -16.07,    test_13
      7,     -40.62,    test_14
      7,     -69.57,    test_06
      6,     -34.82,    test_02
      5,      -5.73,    test_08
      5,     -32.74,    test_03
      5,     -52.23,    test_11
  Ranking test_01 in round  2 giving extra coverage of: [0, 10, 13, 15, 17, 20, 22, 27, 30, 31, 33, 34]

## Round 3
  Covered so far: 26 points: 
  Ranked so far: ['test_00', 'test_01']
  What the remaining tests can contribute, largest contributors first:
    # DELTA, BENEFIT, TEST
      7,    -100.00,    test_04
      5,     -12.05,    test_10
      5,     -26.79,    test_12
      5,     -85.71,    test_07
      4,      -4.46,    test_05
      3,      -5.73,    test_08
      3,     -16.07,    test_13
      3,     -32.74,    test_03
      3,     -34.82,    test_02
      2,     -15.55,    test_09
      2,     -40.62,    test_14
      1,     -52.23,    test_11
      1,     -69.57,    test_06
  Ranking test_04 in round  3 giving extra coverage of: [1, 4, 6, 7, 8, 9, 18]

## Round 4
  Covered so far: 33 points: 
  Ranked so far: ['test_00', 'test_01', 'test_04']
  What the remaining tests can contribute, largest contributors first:
    # DELTA, BENEFIT, TEST
      4,     -12.05,    test_10
      3,     -85.71,    test_07
      2,      -4.46,    test_05
      2,     -32.74,    test_03
      1,      -5.73,    test_08
      1,     -15.55,    test_09
      1,     -26.79,    test_12
      1,     -34.82,    test_02
      1,     -69.57,    test_06
      0,     -16.07,    test_13
      0,     -40.62,    test_14
      0,     -52.23,    test_11
  The following test(s) cannot contribute more to coverage
  and will be dropped:
    test_13, test_14, test_11
  Ranking test_10 in round  4 giving extra coverage of: [14, 24, 35, 39]

## Round 5
  Covered so far: 37 points: 
  Ranked so far: ['test_00', 'test_01', 'test_04', 'test_10']
  What the remaining tests can contribute, largest contributors first:
    # DELTA, BENEFIT, TEST
      1,      -4.46,    test_05
      1,      -5.73,    test_08
      1,     -26.79,    test_12
      1,     -32.74,    test_03
      1,     -34.82,    test_02
      1,     -69.57,    test_06
      0,     -15.55,    test_09
      0,     -85.71,    test_07
  Note: This time around, more than one test gives the same
        maximum delta contribution of 1 to the coverage so far
        we order based on the next field of minimum cost
        (equivalent to maximum negative cost).
  The following test(s) cannot contribute more to coverage
  and will be dropped:
    test_09, test_07
  Ranking test_05 in round  5 giving extra coverage of: [21]

## Round 6
  Covered so far: 38 points: 
  Ranked so far: ['test_00', 'test_01', 'test_04', 'test_10', 'test_05']
  What the remaining tests can contribute, largest contributors first:
    # DELTA, BENEFIT, TEST
      1,      -5.73,    test_08
      1,     -26.79,    test_12
      1,     -69.57,    test_06
      0,     -32.74,    test_03
      0,     -34.82,    test_02
  Note: This time around, more than one test gives the same
        maximum delta contribution of 1 to the coverage so far
        we order based on the next field of minimum cost
        (equivalent to maximum negative cost).
  The following test(s) cannot contribute more to coverage
  and will be dropped:
    test_03, test_02
  Ranking test_08 in round  6 giving extra coverage of: [28]

## Round 7
  Covered so far: 39 points: 
  Ranked so far: ['test_00', 'test_01', 'test_04', 'test_10', 'test_05', 'test_08']
  What the remaining tests can contribute, largest contributors first:
    # DELTA, BENEFIT, TEST
      1,     -26.79,    test_12
      1,     -69.57,    test_06
  Note: This time around, more than one test gives the same
        maximum delta contribution of 1 to the coverage so far
        we order based on the next field of minimum cost
        (equivalent to maximum negative cost).
  Ranking test_12 in round  7 giving extra coverage of: [32]

## Round 8
  Covered so far: 40 points: 
  Ranked so far: ['test_00', 'test_01', 'test_04', 'test_10', 'test_05', 'test_08', 'test_12']
  What the remaining tests can contribute, largest contributors first:
    # DELTA, BENEFIT, TEST
      0,     -69.57,    test_06
  The following test(s) cannot contribute more to coverage
  and will be dropped:
    test_06

## ALL TESTS NOW RANKED OR DISCARDED


A total of 40 points were covered, 
using only 7 of the initial 15 tests,
and should take 209.15 time units to run.

The tests in order of coverage added:
    
    TEST  DELTA-COVERAGE
  test_00  14
  test_01  12
  test_04  7
  test_10  4
  test_05  1
  test_08  1
  test_12  1

What should be next

There is a new Unified Coverage Interoperability Standard (UCIS) for a database for storing test coverage data. Ideally the greedy ranker should be hooked up to a UCIS DB to get its inputs via its C interface, or maybe its XML output, instead of parsing text files.

Addendum

Random results dict creator

As used for testing:

def cover_creator(ntests=25, maxcoverpoints=100):
    import random
    results = {}
    coveredrange = (maxcoverpoints * 1 // 6, 
                    1 + maxcoverpoints * 2 // 6)
    print(coveredrange)
    for test in range(ntests):
        name = 'test_%02i' % test
        covered = sorted(set(random.randint(0, maxcoverpoints-1)
                         for i in range(random.randint(*coveredrange))))
        time = len(covered) * (100 + (random.random() - 0.5) * 40) / 100.0
        results[name] = ( float('%6.2f' % time), set(covered))
    return results
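To exercise the ranker on something bigger than the hand-written results dict, the two can be chained; a quick sketch of my own, assuming the tutor version of greedyranker above:

results = cover_creator(ntests=200, maxcoverpoints=300)
totalcoverage, ranking, totalcost, nonranked = greedyranker(results, tutor=False)
print(len(totalcoverage), len(ranking), totalcost)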


END.

