Go deh!: 2007

Tuesday, December 18, 2007

Batch Process Runner in bash shell

Picked up an interesting problem from a colleague:

How to run n batch processes but only m at a time?

It seems that resource constraints allow every four batch processes to share one license - so long as all four batch processes are started from the same terminal on the same machine. Users may have tens of processes to run and currently they are run individually on our compute farm.

I tend to code whilst having a cold for some reason? And needed to use more advanced bash scripting - I decided against Python for the task as I would like my colleague to take over further developments of the code and I can't rely on him knowing Python, (and I don't want to code in Perl today).

I came up with this little demonstrator which runs processes that go to sleep for a random amount of time. The script runs as many as you specify but only a maximum number of proc's at a time.

The script is followed by sample output and was tested on cygwin using bash version 3.2.9(11)-release (i686-pc-cygwin):


  1 #!/bin/bash
  2 
  3 ##
  4 ## process_runner.sh <concurrent> <total procs>
  5 ##
  6 ## Example script given the maximum number of processes to
  7 ## run concurrently, c, and the total number of processes, n
  8 ## runs at most c, processes in the background until all n
  9 ## Have been run.
 10 ##
 11 ## Author Donald 'Paddy' McCarthy Dec. 17 2007
 12 ##
 13 
 14 # how many processes to run in parallel
 15 concurrent=$1
 16 # total processes to run
 17 maxprocs=$2
 18 
 19 printf "\n## STARTING %i Processes, %i at a time\n\n" \
 20   $maxprocs $concurrent
 21 
 22 
 23 # main loop wait time between checking background procs.
 24 tick=1
 25 
 26 
 27 # dummy processes sleep for a random time
 28 function runproc {
 29   local -i n=$1
 30   local -i j
 31   (( j = 5+$RANDOM*10/32767 ))
 32   #(date; printf "#>>>JOB %i : Sleeping for %i\n" $n $j)
 33   printf "OUT JOB ,%03i, Sleep for ,%2i,  , @,%s\n" $n $j "`date`"
 34   sleep $j
 35   returned=$?
 36   printf "IN  JOB ,%03i, Slept for ,%2i, returned ,%i, @,%s\n" \
 37     $n $j $returned "`date`"
 38   #(date; printf "#<<<JOB %i : Slept for %i\n" $n $j)
 39 }
 40 
 41 function print_runstats {
 42   printf '# %i Jobs in background. %i/%i started\n\n'  \
 43     `jobs -r|wc -l` $ran $maxprocs
 44 }
 45 
 46 # Bash array running keeps track of the background process numbers
 47 # Start with nothing running (sentinel value will not be a process number
 48 for ((i=0; i<$concurrent; i+=1 )); do running[$i]=123456789; done
 49 
 50 ran=0
 51 until
 52   while [ $ran -lt $maxprocs ]; do
 53     for ((p=0; p<$concurrent; p+=1 )); do
 54       proc=${running[$p]}
 55       # Over all running processes...
 56       # $proc still running?
 57       ps -p $proc | fgrep $proc >/dev/null
 58       if [ $? -ne '0' ] ; then
 59         # Not found  i.e. finished
 60         # Run another in background and store the proc number
 61         runproc $ran &
 62         running[$p]=$!
 63         ((ran+=1))
 64         if [ $ran -ge $maxprocs ]; then break 1; fi
 65       fi
 66     done
 67     sleep $tick
 68     # Status
 69     print_runstats
 70   done
 71 
 72   sleep $tick
 73   # Status
 74   print_runstats
 75 do [  `jobs -r|wc -l` -eq 0 ]
 76 done
 77 wait
 78 printf "\n## FINISHED\n"
 79 
 80 exit 0
 81 
 82 sample_output=<<!
 83 bash$ ./process_runner.sh 2 5
 84 
 85 ## STARTING 5 Processes, 2 at a time
 86 
 87 OUT JOB ,000, Sleep for ,10,  , @,Tue Dec 18 09:26:00 GMTST 2007
 88 OUT JOB ,001, Sleep for , 8,  , @,Tue Dec 18 09:26:00 GMTST 2007
 89 # 2 Jobs in background. 2/5 started
 90 
 91 # 2 Jobs in background. 2/5 started
 92 
 93 # 2 Jobs in background. 2/5 started
 94 
 95 # 2 Jobs in background. 2/5 started
 96 
 97 # 2 Jobs in background. 2/5 started
 98 
 99 # 2 Jobs in background. 2/5 started
100 
101 IN  JOB ,001, Slept for , 8, returned ,0, @,Tue Dec 18 09:26:09 GMTST 2007
102 # 1 Jobs in background. 2/5 started
103 
104 OUT JOB ,002, Sleep for ,11,  , @,Tue Dec 18 09:26:09 GMTST 2007
105 IN  JOB ,000, Slept for ,10, returned ,0, @,Tue Dec 18 09:26:10 GMTST 2007
106 # 2 Jobs in background. 3/5 started
107 
108 OUT JOB ,003, Sleep for , 7,  , @,Tue Dec 18 09:26:11 GMTST 2007
109 # 2 Jobs in background. 4/5 started
110 
111 # 2 Jobs in background. 4/5 started
112 
113 # 2 Jobs in background. 4/5 started
114 
115 # 2 Jobs in background. 4/5 started
116 
117 # 2 Jobs in background. 4/5 started
118 
119 IN  JOB ,003, Slept for , 7, returned ,0, @,Tue Dec 18 09:26:18 GMTST 2007
120 # 1 Jobs in background. 4/5 started
121 
122 OUT JOB ,004, Sleep for , 6,  , @,Tue Dec 18 09:26:19 GMTST 2007
123 # 2 Jobs in background. 5/5 started
124 
125 IN  JOB ,002, Slept for ,11, returned ,0, @,Tue Dec 18 09:26:20 GMTST 2007
126 # 1 Jobs in background. 5/5 started
127 
128 IN  JOB ,004, Slept for , 6, returned ,0, @,Tue Dec 18 09:26:25 GMTST 2007
129 
130 ## FINISHED
131 !
132 
133 
134 
135 
136

Thursday, December 13, 2007

Terry Pratchett

I feel sad that someone who has given me so much pleasure over such an extended time, has medical problems. I wish him well.

Sunday, November 25, 2007

Python: Concise

I have spent several years both reading and contributing to comp.lang.python. Yet another thread on proposed language changes together with replies pointing out that Python tries to be both short and clear set me off on a tangent trying to find that one adjective that captured that property of Python.

After some online research using Chambers, (do they use Python on their website? The link includes /chref.py/); AskOxford; and Thesaurus.com, I introduced myself to the alt.usage.english newsgroup and asked them to help me improve on my then best candidate word - succinct.

After some useful dialogue I received a private email through the newsgroup from Orlando Enrique suggesting concise. Acting quite mechanically I went and did a search on AskOxford for concise and got many more hits than I expected, and then it dawned on me, the publishers of the Oxford dictionary use the word concise in the title of their abridged works to convey just what I wanted for Python: it is short, comprehensive, and clear. Any shorter would affect clarity. Any longer would not be necessary.

So in summary: Try Python, it's concise.

Saturday, October 13, 2007

What's in a name?

I wrote a tool for work that would order VHDL files for compilation. VHDL is one of those languages in which, (annoyingly), everything that file a.vhd depends on must be compiled before a.vhd. After scanning all the source files I extract a dependency list which I stored as a dictionary mapping source file names to the source files they depended on.

I then needed to sort the files so that no file was compiled before any file it depended on.

Now I did google for such a pre-written routine using phrases that included dependency, dependancy :-), depends etc; sort and maybe Python. I got nothing! Eventually I wrote my own which worked very well and I was proud of it.

A couple of months later, whilst reading a newsgroup reply on something or other, someone mentioned a topological sort and so I went and looked that up. If only I knew the term 'topological' I could have found a ready-made solution such as this (which has a good explanation too).

So, "What's in a name"?

Quite a lot actually.

:-) Paddy.

P.S. There is also the tsort Unix utility.

Thursday, October 11, 2007

Multi-Processing Design Pattern

I am currently off work with a cold, ~~nose dripping onto my keyboard~~, in front of a pretty old laptop, with no real way to test the following.

For the Wide Finder project, a way to realize the task would be to split the large file, fork sub-process to process each section in parallel then join all the outputs together. Fork/join is Verilog speak. You might also call it mapping file processors onto each file section in parallel then reducing their output.

I have the following bash shell that I need to test on a real machine:


    #!/bin/bash -x

    ##
    ## Multi-subprocessing of Wide Finder file
    ##

    # Number of sub processes
    #subprocs="$1"
    subprocs="4"
    # Input file
    #infile="$2"
    infile="o10k.ap"

    subprocscript="clv5.sh"
    splitfileprefix=_multi_proc_tmp

    rm -f ${splitfileprefix}*
    size=$(stat -c%s "$infile")
    splitsize=`gawk -v x=$size -v y=$subprocs 'BEGIN{printf\"%i\", (x/y)*1.01}'`

    ## Split
    split --line-bytes=$splitsize "$infile" $splitfileprefix

    for f in ${splitfileprefix}*; do
      $subprocscript $f > $f.sub
    done
    jobs
    wait

    ## Join
    gawk '{x[$2]+=$1}END{for (n in x){print x[n], n}}' ${splitfileprefix}*.sub \
        | sort -n \
        | tail

Wide Finder on the command line

Having followed articles on Wide Finder for the past week, I thought the command line pipe version here, gave fast results straight away but could be improved.

Unfortunately, my main computer at home, my 17" laptop had to go for repairs and so I decided to start using an old Toshiba Portege 7200 that runs XP but has a tiny disk.
The temporary laptop doesn't have the resources to run Cygwin as its unix-on-windows toolset which I am comfortable with so I installed GNU tools for windows fromhere and bash for windows from here then set about re-running and improving the command-line example.

   ##
   ## clv1.sh
   ##

   # Original 'command line tools version from
   # http://www.dalkescientific.com/writings/diary/archive/2007/10/10/other_wide_finder_implementations.html
   time (
       grep -E 'GET /ongoing/When/[0-9]{3}x/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+ ' o1000k.ap \
       | gawk '{print $7}' \
       | sort \
       | uniq -c \
       | sort -n \
       | tail -10\
       )

   ##
   ## clv2.sh
   ##

   # include grep functionality in gawk call
   time (
       gawk --re-interval '/GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{print $7}' o1000k.ap \
       | sort \
       | uniq -c \
       | sort -n \
       | tail -10\
       )

   ##
   ## clv3.sh
   ##

   # uniq -c function moved into gawk
   time (
       gawk --re-interval '/GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{x[$7]++}END{for (n in x){print x[n], n}}' o1000k.ap \
       | sort -n \
       | tail -10\
       )

   ##
   ## clv4.awk
   ##

   /GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{
       field2count[$7]++
       }
   END{
       # asort requires no duplicate array values
       # Accumulate fields with same count
       for (field in field2count){
           count = field2count[field]
           count2fields[count+0] = count2fields[count] count " " field "\n";
       }
       # invert and force value type to be numeric
       for (count in count2fields){
       allfields2count[count2fields[count]] = count+0
       }
       n = asort(allfields2count,orderedcounts)
       for (i=n-10; i<=n; i++){
           printf "%s", count2fields[orderedcounts[i]]
       }
   }

   ##
   ## clv4.sh
   ##

   # All but final tail moved into gawk
   time (
       gawk --re-interval -f clv4.awk o1000k.ap \
       | tail \
       )

   ##
   ## clv5.awk
   ##

   /GET \/ongoing\/When\/[0-9][0-9][0-9]x\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]\/[^ .]+ /{
       field2count[$7]++
       }
   END{
       # asort requires no duplicate array values
       # Accumulate fields with same count
       for (field in field2count){
           count = field2count[field]
           count2fields[count+0] = count2fields[count] count " " field "\n";
       }
       # invert and force value type to be numeric
       for (count in count2fields){
       allfields2count[count2fields[count]] = count+0
       }
       n = asort(allfields2count,orderedcounts)
       for (i=n-10; i<=n; i++){
           printf "%s", count2fields[orderedcounts[i]]
       }
   }

   ##
   ## clv5.sh
   ##

   # All but final tail moved into gawk
   # Change regexp to not use interval expressions
   time (
       gawk -f clv5.awk o1000k.ap \
       | tail \
       )

Here are the timings for each version (best of three runs):

clv1.sh 43.3s
clv2.sh 42.1s
clv3.sh 32.5s
clv4.sh 32.7s
clv5.sh 32.5s

Conclusion

The biggest gain came when I used gawk to do the frequency counting in clv3.sh which is the kind of thing I would naturally do and saves a third of the time which would make it the fastest in the table here with a proportionally scaled run time of 0.62seconds

STOP Press!

I decided to compare run times with Frederic Lundh's work. On my initial run I got times above 32s for everything, and with wf-6.py taking over 90s.
The memory mapping seemed to be sticky as a subsequent run gave the following timings:

wf-1-gala.py 30.4s
wf-2.py 8.2s
wf-3.py 9.2s
wf-4.py 9.5s
wf-5.py 10.0s
wf-6.py 6.9s
This is much faster. I decided to test the persistance of the memory mapping of the file by re-running the command-line tests which gave improved results of:

clv1.sh 39.5s
clv2.sh 43.0s
clv3.sh 17.0s
clv4.sh 14.8s
clv5.sh 14.8s

It seems that memory mapping a file decreases run time but on my slow laptop, if the file is to be used once then the time to make the file memory resident may kill any time savings.

If this task was the only thing I had to do in a cron job on the hour, every hour then, despite knowing Python and Perl I would probably go for an awk solution. I would have probably stopped at clv3.sh.

Tuesday, October 02, 2007

Indenting python for blog comments

Note: Script updated 5 Oct 2007

Normally I use vim to turn my python code into nicely formatted and syntax-coloured examples for my posts, but blog comment fields can swallow leading spaces leaving any posted Python code without indentation.

Most blog comment fields only accept a small set of HTML tags (if any), so I cannot use the vim output.

I have written blogger.py to turn leading spaces into HTML space characters which seems to work on blogger and wordpress.

Here is the code:



'''
blogspace.py
Turns leading spaces into HTML &nbsp tokens which shows 
as correct indentation on Blogger comment fields 
(and maybe other blogs).

Donald. 'Paddy' McCarthy Sept,Oct 2011
'''
import fileinput, re, cgi

def escape(lineiterator=fileinput.input,
           cgi_escape=True, tag_with_pre=True, tag_with_br=False):
  output = 
  if tag_with_pre: output.append('<pre>\n')
  for line in lineiterator():
    if cgi_escape: line = cgi.escape(line)
    line =  re.sub(r'^(\s*)\s', 
                 lambda m: '&nbsp;'*m.end(), 
                 line.rstrip())
    output.append(line)
    if tag_with_br:
      output.append('<br>\n')
    else:
      output.append('\n')
  if tag_with_pre: output.append('</pre>\n')
  return ''.join(output)

if __name__ == '__main__':
  print escape()

- Paddy.

( Also posted here and here)

Friday, September 28, 2007

I wish Perl6 were out!

And why, pray tell, would a self confessed 'Pythonista' be waiting on Perl6?

I'll tell you. I think that in the search for the language, Perl 6 looks as if it may be simulated annealing at work with the temperature turned up high. From studying Perl 6 and adapting/using what it gets right we might find ways to improve Python whilst keeping Pythons aims.

Tuesday, September 18, 2007

An Open Letter to the Blogspot Team

Your editors for creating new posts and especially for creating comments just are not good enough!

As a programmer I have to use external programs to highlight the syntax of code for a new post which is bad, but the list of usable tags in a comment field is so small that I am reduced to writing a program to parse indented code and spit out something that will look merely OK.

Contrast Blogger editing with this which is available on Wikipedia.

Blogger, catch up!

- Paddy.

Sunday, September 16, 2007

On assertions and regular expressions

Text editor is to regular expression as
Waveform is to ... assertion

What I mean is that bump in productivity you get when you've learnt to
use regular expressions with vi/vim/sed/awk/perl/grep... on the unix
command line should be matched by an equal jump in productivity if you
use assertions in your digital design flow.

At the moment, the main digital simulation vendors such as Cadence ncsim, and Mentor questasim allow you to embed or attach assertions to designs that are compiled for simulation, and will automatically extract coverage for these assertions.

What is not so widespread, and what needs to be improved, is the use of interpreted assertions at the command lines of waveform browsers/debugging environments to help navigate waveforms (go to the second write after a read of memory in the range ...),; and to help aggregate displayed waveforms for better understanding of what is shown (highlight any overlap of a write on bus1 with a write on bus2).

I'm currently doing a mind experiment on turning assertions on VCD files, into regular expressions on a string.

- Paddy.

Wednesday, August 22, 2007

Duck Typing Explained.

One rule:

Don't check/assert the type of any argument or part of an argument to a method/function

One suggestion:

Make it easy to see what method calls/parameter accesses are made on arguments.

Create and debug your function/method with the argument types you have in mind.
Future users might substitute arguments of a different type but compatible method/parameter signatures, for just those methods used on the arguments.

For example: Create an assembler as if it is taking in a file object, and calling readline to get a line then assembling that line, in a loop, for all lines in the file.

To test the assembler you could:

Generate pseudo-random assembler statements.
Write them to a tmp file
Close the file.
Reopen the file for reading.
And pass the file object to the assembler function.

You could instead:

Add a readline method to the pseudo-random assembler generator.
And pass this object to the unchanged assembler function.

This would be 'Duck Typing' as, to the assembler, the code generator looks as file-like as it needs.

With Python, Duck typing is a doddle. Tutorials state that checking the type of objects should not normally be done. It is standard practice.

In some statically and manifestly typed languages such as C++, objective C and Java, you can it seems get part or all of this behaviour, (by using Templates in C++ for example), but functions/methods have to be explicitly written to support this behaviour from the outset, and the means by which such behaviour is accessed is not the 'normal' way one would write functions/methods if the behaviour were not considered.

Duck typing's power is rarely seen unless normal coding practices allow it to be later applied to method/function calls without having to know up-front of the need.

Saturday, July 21, 2007

Python makes simple solutions.

The title is a slogan I thought of to counter those who say Python is simple. My slogan emphasises that :

Python solves problems.
Those solutions are simple to understand.
The code produced is simple to maintain.

Wednesday, July 11, 2007

I've been Crunched!

There is a new crunchy out. I just downloaded the release, unzipped to a directory, then double-clicked crunchy.py, and Firefox opened a new page of delight. Five minutes of reading and I was able to open this very blog and have all the Python code come alive. On this entry, I was able to re-run the example and have new times printed - in the browser, with just a click of an (added) button!
If I had a new algorithm I could have used the edit window supplied to write and test it as well as save my changes.

Here is what a crunchy version of this page looks like.

The added editor comes pre-loaded with my example code. The new execute button, when clicked inserts the output of running the code from the editor below it. (click on the image for a larger view).

Neat!

Sunday, July 01, 2007

bit twiddling extras

As a follow up to this
I commented on what I thought would be a better solution. Richard Jones pointed us at how pyglet solved how to find the net power of two and how to check if a number is a power of two using some nifty bit twiddling.

Will McGugan then timed
both his and pyglets code, but didn't implement my suggestion (I didn't give code).

I have added an implementation of my suggestion and timed all three sets.
The timing shows that although my implementation of is_power_of_2 is fastest by a smidgen; My implementation of next_power_of_2 by iterative comparison of a list is half way between Wills and pyglets implementation, with pyglet the winner.

I have not got timings with psyco.

The code


setup_1 = """
from math import log, ceil
 
def is_power_of_2(n):
    return log(n, 2) % 1.0 == 0.0
 
def next_power_of_2(n):
    return (2 ** ceil(log(n, 2)))
"""

setup_2 = """
def next_power_of_2(v):
    v -= 1
    v |= v >> 1
    v |= v >> 2
    v |= v >> 4
    v |= v >> 8
    v |= v >> 16
    return v + 1
 
def is_power_of_2(v):
    return (v & (v - 1)) == 0
"""

setup_3 = """
list_of_powers = [x**2 for x in range(32)]
set_of_powers = set(list_of_powers)

def next_power_of_2(v):
    for x in list_of_powers:
        if x >= v: return x

def is_power_of_2(n):
    return n in set_of_powers

"""


from timeit import Timer

t1 = Timer("is_power_of_2(128)", setup_1)
t2 = Timer("is_power_of_2(128)", setup_2)
t3 = Timer("next_power_of_2(125)", setup_1)
t4 = Timer("next_power_of_2(125)", setup_2)
t5 = Timer("is_power_of_2(128)", setup_3)
t6 = Timer("next_power_of_2(125)", setup_3)

print "float math is_power_of_2:", t1.timeit()
print "bit-twiddling is_power_of_2:", t2.timeit()
print "set-lookup is_power_of_2:", t5.timeit()
print
print "float math next power of 2:", t3.timeit()
print "bit-twiddling next power of 2:", t4.timeit()
print "list-comparison next power of 2:", t6.timeit()

The Timings


float math is_power_of_2: 2.33588961726
bit-twiddling is_power_of_2: 0.573784377623
set-lookup is_power_of_2: 0.54499859619

float math next power of 2: 3.06279429369
bit-twiddling next power of 2: 2.08060354039
list-comparison next power of 2: 2.58548279181

Wednesday, June 06, 2007

My First Python one-liner

I created my first Python one-liner, to be used in anger today!
I was working on Unix, and because I know and use the normal command line utilities sed/grep/awk/perl etc, as well as Python, I tend to not use Python for one-liners (even very long one-liners), as Pythons syntax isn't up for it, and the other tools are in their element.
Today I was installing four simulation/verification packages from Cadence. We use the modules utility to allow us to easily access tools, but these new tools had what I was told was a complex setup script to be sourced that did a lot before setting environment variables to appropriate values.

I decided I needed to see what the changes to the environment were so needed to create a sorted list of environment variables before & after sourcing the setup, and then doing a gvimdiff. I could not use env|sort as our environment has a few variables with multi-line string values which are printed on separate lines and so mess with the sort.

I remembered that pprint sorts dictionaries and prints string values with escaped versions of newlines so came up with:

 python -c 'import os, pprint; pprint.pprint(dict(os.environ))'

The dict() is needed as os.environ is not a true dictionary, it can interact with calls to putenv() and unsetenv.

Here is a sample session showing how it prints newlines in values:

bash$ export XX='Spam
Ham'
bash$ python -c 'import os, pprint; pprint.pprint(dict(os.environ))'|fgrep XX
 'XX': 'Spam\nHam',
bash$

- Paddy.

Friday, May 18, 2007

The Hat Problem

I found an item describing an intriguing mathematical game, that I immediately 'got' for the size of problem described and decided to write a Python program to help me explore it.

In the source, you can create different strategy functions that take the same arguments
and return either a colour or the global name ipass.

Enjoy!



'''
The hat problem

From: http://www.msri.org/people/members/sara/articles/hat.html

The hat problem goes like this:
    Three players enter a room and a red or blue hat is placed on each
    person's head. The color of each hat is determined by a coin toss,
    with the outcome of one coin toss having no effect on the others. Each
    person can see the other players' hats but not his own.

    No communication of any sort is allowed, except for an initial
    strategy session before the game begins. Once they have had a chance
    to look at the other hats, the players must simultaneously guess the
    color of their own hats or pass. The group shares a hypothetical $3
    million prize if at least one player guesses correctly and no players
    guess incorrectly.
'''

import random

verbose = True
numplayers = 3
hatcolours = 'red blue'.split()
ipass = 'pass'

def otherhatcolour(hat):
    ''' not(hat) when hat can have two or more colours'''
    return hatcolours[(hatcolours.index(hat)+1) % len(hatcolours)]

def playerhats(n = numplayers):
    ''' generate all hat combinations for the players, but scrambled.'''
    if n:
        hats = hatcolours[:]
        random.shuffle(hats)
        for h1 in hats:
            for h2 in playerhats(n-1):
                yield [h1] + h2
    else:
        yield []

def strategy0(otherhats, hatcolours):
    '''
    If there is a clear max total of hat colours:
        Yours is a random choice from the hat colours occuring the least
    else:
        pass
    '''
    # Count the total hats in each colour
    colourcount = [otherhats.count(c) for c in hatcolours]
    # What hat colours are most/least prevalent
    mn = min(colourcount)
    mx = max(colourcount)
    #
    if colourcount.count(mx) == 1:
        # only one colour occurs as the maximum
        return hatcolours[random.choice([ index
                                          for index, count in enumerate(colourcount)
                                          if count == mn]) ]
    else:
        return ipass

def strategy1(otherhats, hatcolours):
    '''
    If there is a clear max and min total of hat colours,
    Yours is the min; otherwise pass.
    '''
    # Count the total hats in each colour
    colourcount = [otherhats.count(c) for c in hatcolours]
    # What hat colours are most/least prevalent
    mn = min(colourcount)
    mx = max(colourcount)
    #
    if mn != mx and (mn == 0 or colourcount.count(mx) == colourcount.count(mn) == 1):
        # only one colour occurs as the maximum, another as minimum
        # So guess the minimum
        return hatcolours[colourcount.index(mn)]
    else:
        return ipass

def scoreplayers(ph, strategy, _verbose = verbose):
    ''' Compute score and return 1 if players win otherwise zero'''
    n = len(ph)
    strat = []
    playerguessed = []
    for playernum,hat in enumerate(ph):
        strat = strategy([h for i,h in enumerate(ph) if i != playernum], hatcolours)
        if _verbose:
            print "%sPlayer%i:%s under a %6s hat replies: %6s" % (
                "  "*(playernum+1), playernum, "  "*(n-playernum-1), hat, strat
                ),
        if strat == hat:
            if _verbose:
                print "Correct"
            playerguessed.append(True)
        elif strat == ipass:
            if _verbose:
                print "-"
            playerguessed.append(ipass)
        else:
            if _verbose:
                print "Incorrect"
            playerguessed.append(False)
    if (True in playerguessed) and not( False in playerguessed):
        if _verbose:
            print "   Players WIN\n"
        return 1
    else:
        if _verbose:
            print "   Players LOSE\n"
        return 0



def run_stratagem(strategy, _verbose= verbose, _numplayers = numplayers,
                   _hatcolours = hatcolours):
    '''Applies a strategy function to all possible games of a hat problem.

    Prints either a verbose output of every trial, or a summary of
    all trials in csv format
    '''

    global verbose, numplayers, hatcolours
    verbose, numplayers, hatcolours = _verbose, _numplayers, _hatcolours

    totaltries = 0
    totalwins = 0
    if verbose:
        print "\nSTRATEGY"
        print strategy.__name__
        print strategy.__doc__
    for ph in playerhats():
        totaltries +=1
        totalwins += scoreplayers(ph, strategy, verbose)
    if verbose:
        print "\n\n Players won %i/%i = %.2g%% of the time by using %s\n" % (
                totalwins, totaltries, float(totalwins)/totaltries*100.0,
                strategy.__name__)
    else:
        print "Strategy,%s,Hats,%i,Players,%i,Games,%i,Wins,%i,Win%%,%.4f" % (
                strategy.__name__, len(hatcolours), numplayers,
                totaltries, totalwins, float(totalwins)/totaltries)


colours = 'red blue orange yellow green indigo violet'.split()
print
run_stratagem(strategy0, verbose)
run_stratagem(strategy1, verbose)

Tuesday, March 27, 2007

Wheel of time...

Just read Invasion Of The Dynamic Language Weenies it seems that what goes around, comes around, in that here we have a Java programmer decrying an article on dynamic languages, when years earlier I could read similar articles from C++ adherents about this new Java thing from Sun. java must be cool. Java had its culture. Interminable articles on run-anywhere code and late night drinking (of high strength coffee).

Excuse me while I fetch a lemon - some crocodile tears are called for. ;-)

"The Wheel of Time turns, and Ages come and pass, leaving memories that become legend. Legend fades to myth, and even myth is long forgotten when the Age that gave it birth comes again".

Robert Jordan - "The Wheel of Time".

Saturday, March 24, 2007

Bash style environment variable expander in Python

At work we have a certain config file that has bash/tcsh style embedded environment variable references of two styles:
${VARNAME}
and
$VARNAME

I thought I would try my hand at a concise python expander for the syntax and came up with:


bash$  python -c '
import sys,os,re
s=sys.stdin.read()
s1 = re.subn(r"\$([a-z_A-Z][A-Za-z_0-9\-]*)", r"%(\1)s", s)[0]
s2 = re.subn(r"\${([^}]+)}", r"%(\1)s", s1)[0]
print s2 % os.environ ' < in_file > expanded_file

Sunday, March 04, 2007

Fizz-Buzz.py

I am having to set some extra work at home for my ten year old son. I found information on a game Fizz-Buzz and introduced it to my son, only to find that he had already played a variant at school.

Oh well.

By that time I had already created a python program to check his answers (written in simplistic style as I intend one day to get all of my kids to learn Python).

'''
Fizz-Buzz.py

Fizz-Buzz Game.
Instructions for two players, 3-5 game. (Easiest).

   1. Players take it in turns counting from one to ninety-nine.
      (Don't take too much or too little time between numbers
       - try for a steady pace)
   2. Player must say Fizz-Buzz instead of their number if the number is a
      multiple of three and of five.
   3. Otherwise, player must say Fizz instead of their number if the number
      is a multiple of three.
   4. Otherwise, player must say Buzz instead of their number if the number
      is a multiple of Five. 

Digit version
  Use basic instructions 1-4 as above, then add the following rules:

   5. Otherwise, player must say Fizz for each digit in the number that is
      a three,
      And player must also say Buzz for each digit in the number that is a
      five.
      (for example, say Fizz-one instead of 31, Fizz-two instead of thirty
      two)
   6. Remember, the rules must be applied in order so  thirty three is
      actually Fizz and not Fizz-Fizz as it is first a multiple of three
'''

fizz = 3
buzz = 5
digits_version = True

for tens in (0,1,2,3,4,5,6,7,8,9):
   for units in (0,1,2,3,4,5,6,7,8,9):
       number = tens*10 + units
       if number == 0:
           pass
       else:
           print "For", number, "Say",
           if (number % fizz) == 0 and (number % buzz) == 0:
               print "FizzBuzz"
           elif (number % fizz) == 0:
               print "Fizz"
           elif (number % buzz) == 0:
               print "Buzz"
           elif digits_version:
               saytens = tens
               sayunits = units
               digits_as_fizzbuzz = False
               if tens == fizz:
                   saytens = "Fizz"
                   digits_as_fizzbuzz = True
               if units == fizz:
                   sayunits = "Fizz"
                   digits_as_fizzbuzz = True
               if tens == buzz:
                   saytens = "Buzz"
                   digits_as_fizzbuzz = True
               if units == buzz:
                   sayunits = "Buzz"
                   digits_as_fizzbuzz = True
               if digits_as_fizzbuzz:
                   print saytens,sayunits
               else:
                   print number
           else:
               print number

Sunday, February 25, 2007

Pythons Function Nested Parameter Definitions

(And I don't mean nested function definitions).

Whilst reading PEP3107 on what's coming this year in Python 3000 I came across this section where it mentions nested parameters in function definitions.
I could not emember this feature so googled a bit, and still could not find it.
It was time to suck-it-and-see, so using Python2.5 I did:


>>> def x ((p0, p1), p2):
...     return p0,p1,p2
...
>>> x(('Does', 'this'), 'work')
('Does', 'this', 'work')
>>>

My current skunk-works project extracts data from VHDL source files as a list of nested tuples. I use each outer tuple as a record but unpack the record manually at each function call. If I used the above then the function parameter definition would reflect the record structure.

On closer inspection of the Python Function definition grammar, I find that nested parameters are represented by the sublist token.

Neat!

- Paddy.

Friday, February 09, 2007

unzip un-needed in Python

Someone blogged about Python not having an unzip function to go with zip().
unzip is straight-forward to calculate because:


>>> t1 = (0,1,2,3)
>>> t2 = (7,6,5,4)
>>> [t1,t2] == zip(*zip(t1,t2))
True

Explanation

In answer to a commentator, I have written a (large), program to explain the above.
unzip_explained.py:


'''
Explanation of unzip expression zip(*zip(A,B))

References:
 1: Unpacking argumment lists
    http://www.network-theory.co.uk/docs/pytut/UnpackingArgumentLists.html
 2: Zip
    >>> help(zip)
    Help on built-in function zip in module __builtin__:

    zip(...)
        zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]
        
        Return a list of tuples, where each tuple contains the i-th element
        from each of the argument sequences.  The returned list is truncated
        in length to the length of the shortest argument sequence.


'''

def show_args(*positional, **kwargs):
  "Straight-forward function to show its arguments"
  n = 0
  for p in positional:
    print "    positional argument", n, "is", p
    n += 1
  for k,v in sorted(kwargs.items()):
    print "    keyword argument", k, "is", v

A = tuple( "A%i" % n for n in range(3) )
print "\n\nTuple A is:"; print "   ", A
B = tuple( "B%i" % n for n in range(3) )
print "Tuple B is:"; print "   ", B

print "\nLets go slowly through the expression: [A,B] == zip(*zip(A,B))\n"

print "List [A,B] is:"
print "   ", [A,B]
print "zip(A,B) has arguments:"; show_args(A,B)
print "zip(A,B) returns:"
print "   ", zip(A,B)
print "The leftmost zip in zip(*zip(A,B)), due"
print " to the 'list unpacking' of the previous"
print " value has arguments of:"; show_args(*zip(A,B))
print "The outer zip therefore returns:"
print "   ", zip(*zip(A,B))
print "Which is the same as [A,B]\n"

And here is the program output:


Tuple A is:
    ('A0', 'A1', 'A2')
Tuple B is:
    ('B0', 'B1', 'B2')

Lets go slowly through the expression: [A,B] == zip(*zip(A,B))

List [A,B] is:
    [('A0', 'A1', 'A2'), ('B0', 'B1', 'B2')]
zip(A,B) has arguments:
    positional argument 0 is ('A0', 'A1', 'A2')
    positional argument 1 is ('B0', 'B1', 'B2')
zip(A,B) returns:
    [('A0', 'B0'), ('A1', 'B1'), ('A2', 'B2')]
The leftmost zip in zip(*zip(A,B)), due
 to the 'list unpacking' of the previous
 value has arguments of:
    positional argument 0 is ('A0', 'B0')
    positional argument 1 is ('A1', 'B1')
    positional argument 2 is ('A2', 'B2')
The outer zip therefore returns:
    [('A0', 'A1', 'A2'), ('B0', 'B1', 'B2')]
Which is the same as [A,B]

Saturday, January 20, 2007

Python example: from 1997 to now.

I stumbled on an old site: My Programming Language Crisis of ~1997.
There is a data-mining example, complete with sample input data and 'golden output' called invert:

Stdin consists of lines of two tab-separated fields. Call the first field A and the second field B.
The program must gather like values of B together, collecting the A's that go with them.
The output of the program must be lines consisting of an arbitrary number of tab-separated fields; the first field of each line is a unique B; all subsequent fields of that line are A's that were associated with that B in the input. There will be as many output lines as there were unique B's in the input. The output lines must be sorted on the first field (B's) and all subsequent fields of each line (each B's As) must be sorted.

...For example, suppose you wanted to gather together all the unique URLs in a set of web pages in order to validate them efficiently. Your report needs to be able to associate dead URLs with the files they were originally found in. Thus the A's in the input are filenames and the B's are URLs.

The Python program of its day brought back memories and I of course had to re-write it using some of the later additions to the language.

Things I used includes:

The use of the fileinput module to do looping over lines of input.
split as a string method.
The default split action of splitting on white space.
(A little dicey but input fields don't include spaces).
The setdefault dict method for providing an empty list when necessary, so the try...except block can go.
The new sorted function that returns a sorted value.
list unpacking of kv into k,v in the output for loop.
The join method to insert tab separators into the ordered list of output fields.

Overall I'd say that Python has become more clear over the years.

The prog:

# invert benchmark in Python
# see http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/
# This version by Donald 'Paddy' McCarthy

import fileinput

B = {}
for line in fileinput.input():
   fields = line.split()
   B.setdefault(fields[1], []).append(fields[0])

# In-memory data sort
# in-place sorting of values in key-value pairs
kv = sorted(B.items())
for k,v in kv:
   v.sort()
   print "\t".join([k]+v)


'''

## Here follows the original (1997?), program

#! /depot/python/arch/bin/python
# invert benchmark in Python
# see <url:http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/
# Original by Keith Waclena 
# Optimized by Tom Carroll 

from sys import stdin
from string import split, join

B = {}

while 1:
   line = stdin.readline()
   if not line: break
   fields = split(line[0:-1], "\t")
   try:  #assume this key is already present
       B[fields[1]].append(fields[0])
   except:  #it's not!? well then, we best put it in...
       B[fields[1]] = [fields[0],]

keys = B.keys()
keys.sort()

   values = B[key]
   values.sort()
   print key + "\t" + join(values, "\t")
'''

Tuesday, January 02, 2007

Data Mining: in three languages

I answered this
post with a reply written in AWK then wrote versions in Perl and
Python. You sshould be aware that I have writen more awk, perl and
python (in that order); but I think I know more awk, python then perl
(in that order).

The awk example:

# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{
  nodata = 0;             # Curret run of consecutive flags<0 in lines of file
  nodata_max=-1;          # Max consecutive flags<0 in lines of file
  nodata_maxline="!";     # ... and line number(s) where it occurs
}
FNR==1 {
  # Accumulate input file names
  if(infiles){
    infiles = infiles "," infiles
  } else {
    infiles = FILENAME
  }
}
{
  tot_line=0;             # sum of line data
  num_line=0;             # number of line data items with flag>0

  # extract field info, skipping initial date field
  for(field=2; field<=NF; field+=2){
    datum=$field;
    flag=$(field+1);
    if(flag<1){
      nodata++
    }else{
      # check run of data-absent fields
      if(nodata_max==nodata && (nodata>0)){
        nodata_maxline=nodata_maxline ", " $1
      }
      if(nodata_max<nodata && (nodata>0)){
        nodata_max=nodata
        nodata_maxline=$1
      }
      # re-initialise run of nodata counter
      nodata=0;
      # gather values for averaging
      tot_line+=datum
      num_line++;
    }
  }

  # totals for the file so far
  tot_file += tot_line
  num_file += num_line

  printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n", \
         $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0

  # debug prints of original data plus some of the computed values
  #printf "%s  %15.3g  %4i\n", $0, tot_line, num_line
  #printf "%s\n  %15.3f  %4i  %4i  %4i  %s\n", $0, tot_line, num_line,  nodata, nodata_max, nodata_maxline


}

END{
  printf "\n"
  printf "File(s)  = %s\n", infiles
  printf "Total    = %10.3f\n", tot_file
  printf "Readings = %6i\n", num_file
  printf "Average  = %10.3f\n", tot_file / num_file

  printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
}

The same functionality in perl is very similar to the awk program:

# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN {
  $nodata = 0;             # Curret run of consecutive flags<0 in lines of file
  $nodata_max=-1;          # Max consecutive flags<0 in lines of file
  $nodata_maxline="!";     # ... and line number(s) where it occurs
}
foreach (@ARGV) {
  # Accumulate input file names
  if($infiles ne ""){
    $infiles = "$infiles, $_";
  } else {
    $infiles = $_;
  }
}

while (<>){
  $tot_line=0;             # sum of line data
  $num_line=0;             # number of line data items with flag>0

  # extract field info, skipping initial date field
  chomp;
  @fields = split(/\s+/);
  $nf = @fields;
  $date = $fields[0];
  for($field=1; $field<$nf; $field+=2){
    $datum = $fields[$field] +0.0;
    $flag  = $fields[$field+1] +0;
    if(($flag+1<2)){
      $nodata++;
    }else{
      # check run of data-absent fields
      if($nodata_max==$nodata and ($nodata>0)){
        $nodata_maxline = "$nodata_maxline, $fields[0]";
      }
      if($nodata_max<$nodata and ($nodata>0)){
        $nodata_max = $nodata;
        $nodata_maxline=$fields[0];
      }
      # re-initialise run of nodata counter
      $nodata = 0;
      # gather values for averaging
      $tot_line += $datum;
      $num_line++;
    }
  }

  # totals for the file so far
  $tot_file += $tot_line;
  $num_file += $num_line;

  printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n",
         $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;

}

printf "\n";
printf "File(s)  = %s\n", $infiles;
printf "Total    = %10.3f\n", $tot_file;
printf "Readings = %6i\n", $num_file;
printf "Average  = %10.3f\n", $tot_file / $num_file;

printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
       $nodata_max, $nodata_maxline;

The python program however splits the fields in the line slightly
differently (although it could use the method used in the perl and
awk programs too):

# Author Donald 'Paddy' McCarthy Jan 01 2007

import fileinput
import sys

nodata = 0;             # Curret run of consecutive flags<0 in lines of file
nodata_max=-1;          # Max consecutive flags<0 in lines of file
nodata_maxline=[];      # ... and line number(s) where it occurs

tot_file = 0            # Sum of file data
num_file = 0            # Number of file data items with flag>0

infiles = sys.argv[1:]

for line in fileinput.input():
  tot_line=0;             # sum of line data
  num_line=0;             # number of line data items with flag>0

  # extract field info
  field = line.split()
  date  = field[0]
  data  = [float(f) for f in field[1::2]]
  flags = [int(f)   for f in field[2::2]]

  for datum, flag in zip(data, flags):
    if flag<1:
      nodata += 1
    else:
      # check run of data-absent fields
      if nodata_max==nodata and nodata>0:
        nodata_maxline.append(date)
      if nodata_max<nodata and nodata>0:
        nodata_max=nodata
        nodata_maxline=[date]
      # re-initialise run of nodata counter
      nodata=0;
      # gather values for averaging
      tot_line += datum
      num_line += 1

  # totals for the file so far
  tot_file += tot_line
  num_file += num_line

  print "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f" % (
        date,
        len(data) -num_line,
        num_line, tot_line,
        tot_line/num_line if (num_line>0) else 0)

print ""
print "File(s)  = %s" % (", ".join(infiles),)
print "Total    = %10.3f" % (tot_file,)
print "Readings = %6i" % (num_file,)
print "Average  = %10.3f" % (tot_file / num_file,)

print "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
    nodata_max, ", ".join(nodata_maxline))

Timings:

$ time gawk -f readings.awk readingsx.txt readings.txt|tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readingsx.txt,readingsx.txt
Total    = 1361259.300
Readings = 129579
Average  =     10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real    0m1.069s
user    0m0.904s
sys     0m0.061s

$ time perl readings.pl readingsx.txt readings.txt|tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readingsx.txt, readings.txt
Total    = 1361259.300
Readings = 129579
Average  =     10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real    0m2.450s
user    0m1.639s
sys     0m0.015s

$ time /cygdrive/c/Python25/python readings.py readingsx.txt readings.txt|tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readingsx.txt, readings.txt
Total    = 1361259.300
Readings = 129579
Average  =     10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real    0m1.138s
user    0m0.061s
sys     0m0.030s

$

The differences in the Python prog. are not done as an
optimisation. The nifty list indexing of [1::2] and the zip just flow
naturally, (to me), from the data format.

The data format consists of single
line records of this format:

<string:date> [
<float:data-n> <int:flag-n> ]*24

e.g.

1991-03-31      10.000  1       10.000  1       ... 20.000      1       35.000  1