Mainly Tech projects on Python and Electronic Design Automation.

Saturday, March 28, 2009

Huffman Encoding in Python



Huffman encoding came up on Rosetta Code.
Huffman encoding is a way to assign binary codes to symbols that reduces the overall number of bits used to encode a typical string of of those symbols.
For example, if you use letters as symbols and have details of the frequency of occurence of those letters in typical strings, then you could just encode each letter with a fixed number of bits, such as in ASCII codes. You can do better than this by encoding more frequently occurring letters such as e and a, with smaller bit strings; and less frequently occurring letters such as q and x with longer bit strings.
Any string of letters will be encoded as a string of bits that are no-longer of the same length per letter. To successfully decode such as string, the smaller codes assigned to letters such as 'e' cannot occur as a prefix in the larger codes such as that for 'x'.
If you were to assign a code 01 for 'e' and code 011 for 'x', then if the bits to decode started as 011... then you would not know iif you should decode an 'e' or an 'x'.
The Huffman coding scheme takes each symbol and its weight (or frequency of occurrence), and generates proper encodings for each symbol taking account of the weights of each symbol, so that higher weighted symbols have less bits in their encoding. (See the WP article for more information).
A Huffman encoding can be computed by first creating a tree of nodes:
  1. Create a leaf node for each symbol and add it to the priority queue.
  2. While there is more than one node in the queue:
    1. Remove the node of highest priority (lowest probability) twice to get two nodes.
    2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
    3. Add the new node to the queue.
  3. The remaining node is the root node and the tree is complete.
Traverse the constructed binary tree from root to leaves assigning and accumulating a '0' for one branch and a '1' for the other at each node. The accumulated zeroes and ones at each leaf constitute a Huffman encoding for those symbols and weights.

In Python

I orginally gave an an example that matched a definition that was later found to be insufficient, so substituted my own definition above.. My first Python solution  on RC to the wrong definition, did have the advantage, (as I saw it), of not having to traverse a tree.
My 'true' Huffman code creator assembles each symbol and its weight into the following structure initially (the leaf structure):
[ weight, [ symbol, []]]
The weight applies to every (in this case only one), of the [symbol, []] pairs after it in the same list.
The empty list is used to accumulate the Huffman code for the symbol as we manipulate the heap, without having to walk a constructed tree structure.

There are two types of input to the program that I am running examples with:

  1. A string of space separated symbol, weight pairs, as used in small examples.
  2. A sample of text for which letters and letter frequencies are extracted.
The if statement at line 23  allows me to switch between the two types of input whilst exploring the algorithm.
The tutor argument to the encode function shows what is happening in the loop around the heap pops
 1 
 2 from heapq import heappush, heappop, heapify
 3 
 4 def codecreate(symbol2weights, tutor= False):
 5     ''' Huffman encode the given dict mapping symbols to weights '''
 6     heap = [ [float(wt), [sym, []]] for sym, wt in symbol2weights.iteritems() ]
 7     heapify(heap)
 8     if tutor: print "ENCODING:", sorted(symbol2weights.iteritems())
 9     while len(heap) >1:
10         lo = heappop(heap)
11         hi = heappop(heap)
12         if tutor: print "  COMBINING:", lo, '\n        AND:', hi
13         for i in lo[1:]: i[1].insert(0, '0')
14         for i in hi[1:]: i[1].insert(0, '1')
15         lohi = [ lo[0] + hi[0] ] + lo[1:] + hi[1:]
16         if tutor: print "  PRODUCING:", lohi, '\n'
17         heappush(heap, lohi)
18     codes = heappop(heap)[1:]
19     for i in codes: i[1] = ''.join(i[1])
20     return sorted(codes, key=lambda x: (len(x[-1]), x))
21 
22 # Input types
23 if 1:
24     readin = "B 25   C 2.5 D  12.5 A 5 \n"
25     #readin = "a .1 b .15 c .3 d .16 e .29" # Wikipedia sample
26     #readin = "a1 .4 a2 .35 a3 .2 a4 .05" # Wikipedia sample
27     #readin = "A 50 B 25 C 12.5 D 12.5" # RC example
28 
29     cleaned = readin.strip().split()
30     symbol2weights = dict((symbol, wt)
31                          for symbol, wt in zip(cleaned[0::2], cleaned[1::2]) )
32 else:
33     astring = "this is an example for huffman encoding"
34     symbol2weights = dict((ch, astring.count(ch)) for ch in set(astring)) # for astring
35 
36 huff = codecreate(symbol2weights, True)
37 print "\nSYMBOL\tWEIGHT\tHUFFMAN CODE"
38 for h in huff:
39     print "%s\t%s\t%s" % (h[0], symbol2weights[h[0]], h[1])

A run, with the tutor enabled gives the following output:
ENCODING: [('A', '5'), ('B', '25'), ('C', '2.5'), ('D', '12.5')]
  COMBINING: [2.5, ['C', []]]
        AND: [5.0, ['A', []]]
  PRODUCING: [7.5, ['C', ['0']], ['A', ['1']]]

  COMBINING: [7.5, ['C', ['0']], ['A', ['1']]]
        AND: [12.5, ['D', []]]
  PRODUCING: [20.0, ['C', ['0', '0']], ['A', ['0', '1']], ['D', ['1']]]

  COMBINING: [20.0, ['C', ['0', '0']], ['A', ['0', '1']], ['D', ['1']]]
        AND: [25.0, ['B', []]]
  PRODUCING: [45.0, ['C', ['0', '0', '0']], ['A', ['0', '0', '1']], ['D', ['0', '1']], ['B', ['1']]]


SYMBOL    WEIGHT    HUFFMAN CODE
B         25        1
D         12.5      01
A         5         001
C         2.5       000


Encode/Decode Round-tripping

I realised that I could use a method similar to how I accumulate the codes in the heap loop, to generate a single function that can recognise a single symbol from the beginning of the encoded symbols. By using the function in a loop, I could regenerate the symbol list.

In the codecreate function, the leaf structure is modified:
[weight, [ [symbol, []] ], repr(sym)]
Their is an extra level of list around the [symbol, code accumulation list pair]  as well as a new item: 'repr(sym)' it is the third item in the outer list and will always be the  function to generate the item as accumulated so far.. This function starts off by returning just the symbol and an outer if/then/else expression is added as we go around the heap loop (see line 43)

After the outer expression is accumulated, it is turned into a lambda expression, the string eval'd, and assigned to a global variable (see line 47)

I use function probchoice (from earlier RC work), to create an arbitrary sequence of symbols to in the given weighting then encode and decode it as well as giving some stats on space saving.

  1 
  2 from heapq import heappush, heappop, heapify
  3 import random, bisect
  4 
  5 
  6 # Helper routine for generating test sequences
  7 def probchoice(items, probs):
  8   '''\
  9   Splits the interval 0.0-1.0 in proportion to probs
 10   then finds where each random.random() choice lies
 11   (This routine, probchoice, was released under the
 12   GNU Free Documentation License 1.2)
 13   '''
 14 
 15   prob_accumulator = 0
 16   accumulator = []
 17   for p in probs:
 18     prob_accumulator += p
 19     accumulator.append(prob_accumulator)
 20 
 21   while True:
 22     r = random.random()
 23     yield items[bisect.bisect(accumulator, r)]
 24 
 25 
 26 # placeholder
 27 decode = lambda : None
 28 
 29 def codecreate(symbol2weights, tutor= False):
 30     ''' Huffman encode the given dict mapping symbols to weights '''
 31     global decode
 32 
 33     heap = [ [float(wt), [[sym, []]], repr(sym)] for sym, wt in symbol2weights.iteritems() ]
 34     heapify(heap)
 35     if tutor: print "ENCODING:", sorted(symbol2weights.iteritems())
 36     while len(heap) >1:
 37         lo = heappop(heap)
 38         hi = heappop(heap)
 39         if tutor: print "  COMBINING:", lo, '\n        AND:', hi
 40         for i in lo[1]: i[1].insert(0, '0')
 41         for i in hi[1]: i[1].insert(0, '1')
 42         lohi = [ lo[0] + hi[0] ] + [lo[1] + hi[1]]
 43         lohi.append('(%s if nextbit() else %s)' % (hi[2], lo[2]))
 44         if tutor: print "  PRODUCING:", lohi, '\n'
 45         heappush(heap, lohi)
 46     wt, codes, decoder = heappop(heap)
 47     decode = eval('lambda :' + decoder, globals())
 48     decode.__doc__ = decoder
 49     for i in codes: i[1] = ''.join(i[1])
 50     #for i in codes: i[::] = i[:2]
 51     return sorted(codes, key=lambda x: (len(x[-1]), x))
 52 
 53 # Input types
 54 if 1:
 55     tutor = True
 56     sequencecount = 50
 57     readin = "B 25   C 2.5 D  12.5 A 5 \n"
 58     #readin = "a .1 b .15 c .3 d .16 e .29" # Wikipedia sample
 59     #readin = "a1 .4 a2 .35 a3 .2 a4 .05" # Wikipedia sample
 60     #readin = "A 50 B 25 C 12.5 D 12.5" # RC example
 61 
 62     cleaned = readin.strip().split()
 63     symbol2weights = dict((symbol, wt)
 64                          for symbol, wt in zip(cleaned[0::2], cleaned[1::2]) )
 65 else:
 66     tutor = False
 67     sequencecount = 500
 68     astring = "this is an example for huffman encoding"
 69     symbol2weights = dict((ch, astring.count(ch)) for ch in set(astring)) # for astring
 70 
 71 huff = codecreate(symbol2weights, tutor= tutor)
 72 print "\nSYMBOL\tWEIGHT\tHUFFMAN CODE"
 73 for h in huff:
 74     print "%s\t%s\t%s" % (h[0], symbol2weights[h[0]], h[1])
 75 
 76 ##
 77 ## encode-decode check
 78 ##
 79 symbol2code = dict(huff)
 80 symbols, weights = zip(*symbol2weights.iteritems())
 81 # normalize weights
 82 weights = [float(wt) for wt in weights]
 83 tot = sum(weights)
 84 weights = [wt/tot for wt in weights]
 85 # Generate a sequence
 86 nxt = probchoice(symbols, weights).next
 87 symbolsequence = [nxt() for i in range(sequencecount)]
 88 # encode it
 89 bitsequence = ''.join(symbol2code[sym] for sym in symbolsequence)
 90 
 91 sslen, slen, blen = len(symbolsequence), len(symbols), len(bitsequence)
 92 countlen = len(bin(slen-1)[2:])
 93 print '''
 94 
 95 
 96 ROUND-TRIPPING
 97 ==============
 98 I have generated a random sequence of %i symbols to the given weights.
 99 If I use a binary count to encode each of the %i symbols I would need
100 %i * %i = %i bits to encode the sequence.
101 Using the Huffman code, I need only %i bits.
102 ''' % (sslen, slen, sslen, countlen, sslen * countlen, blen )
103 
104 ## decoding
105 nextbit = (bit=='1' for bit in bitsequence).next
106 
107 decoded = []
108 try:
109     while 1:
110         decoded.append(decode())
111 except StopIteration:
112     pass
113 
114 print "Comparing the decoded sequence with the original I get:", decoded == symbolsequence

This short run is in tutor mode, so you can track the accumulation of the decode function:
ENCODING: [('A', '5'), ('B', '25'), ('C', '2.5'), ('D', '12.5')]
  COMBINING: [2.5, [['C', []]], "'C'"]
        AND: [5.0, [['A', []]], "'A'"]
  PRODUCING: [7.5, [['C', ['0']], ['A', ['1']]], "('A' if nextbit() else 'C')"]

  COMBINING: [7.5, [['C', ['0']], ['A', ['1']]], "('A' if nextbit() else 'C')"]
        AND: [12.5, [['D', []]], "'D'"]
  PRODUCING: [20.0, [['C', ['0', '0']], ['A', ['0', '1']], ['D', ['1']]], "('D' if nextbit() else ('A' if nextbit() else 'C'))"]

  COMBINING: [20.0, [['C', ['0', '0']], ['A', ['0', '1']], ['D', ['1']]], "('D' if nextbit() else ('A' if nextbit() else 'C'))"]
        AND: [25.0, [['B', []]], "'B'"]
  PRODUCING: [45.0, [['C', ['0', '0', '0']], ['A', ['0', '0', '1']], ['D', ['0', '1']], ['B', ['1']]], "('B' if nextbit() else ('D' if nextbit() else ('A' if nextbit() else 'C')))"]


SYMBOL    WEIGHT    HUFFMAN CODE
B         25        1
D         12.5      01
A         5         001
C         2.5       000



ROUND-TRIPPING
==============
I have generated a random sequence of 50 symbols to the given weights.
If I use a binary count to encode each of the 4 symbols I would need
50 * 2 = 100 bits to encode the sequence.
Using the Huffman code, I need only 90 bits.

Comparing the decoded sequence with the original I get: True

And if I change line 54 to be False, I get the following:
SYMBOL    WEIGHT    HUFFMAN CODE
     6    101
n    4    010
a    3    1001
e    3    1100
f    3    1101
h    2    0001
i    3    1110
m    2    0010
o    2    0011
s    2    0111
g    1    00000
l    1    00001
p    1    01100
r    1    01101
t    1    10000
u    1    10001
x    1    11110
c    1    111110
d    1    111111



ROUND-TRIPPING
==============
I have generated a random sequence of 500 symbols to the given weights.
If I use a binary count to encode each of the 19 symbols I would need
500 * 5 = 2500 bits to encode the sequence.
Using the Huffman code, I need only 2012 bits.

Comparing the decoded sequence with the original I get: True


Note: The purpose of the program is to teach me more about Huffman coding and is not an exercise in speed!

I am the author of all the Python on this page. The diagram is from Wikipedia. Please refrain from passing-off my code as your own (that one is mainly for students).

Saturday, March 21, 2009

Batch Process Runner in bash shell - Forgotten Enhancements

A comment on an href="http://paddy3118.blogspot.com/2007/12/batch-process-runner-in-bash-shell.html">earlier
post caused me to dig out this enhancement to my original
batch process runner.



What you do is create a file of one-liner commands that you would enter
at a command prompt, for example, this is file style="font-weight: bold;">process_list.txt :


1  color="#0000ff"># (Comments have '#' at the left margin)
color="#804040">2 sleep 4; echo slept for 4 at line 2; csdf_sdf_sd; exit 19
color="#804040">3 sleep 1; cause-an-error; echo slept for 1 at line 3
color="#804040">4 sleep 3; echo color="#ff00ff">'slept for 3 at line 4'
color="#804040">5 sleep 7; echo color="#ff00ff">"slept for 7 at line 5"
color="#804040">6 # Whee!
color="#804040">7 sleep 9; another-erro; echo slept for 9 at line 7
color="#804040">8 sleep 5; echo color="#ff00ff">'slept "for 5" at line 8'


A bit of warning, bash returns the exit code of this laast command it
executes, earlier exit codes won't be seen.



Give my script the number of processes to run in parallel, N, followed
by the above file, and it will create N '.job'  files out of
the non-comment lines of process_list.txt and execute them as
background jobs, with their output sent to '.n' files.:


bash$ ./process_list_runner.sh 2 process_list.txt

## STARTING 6 Processes from file: process_list.txt, 2 at a time with id plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP

# 2 Jobs in background. 6/6 started. 2009-03-21-05:36:36

## FINISHED, (stats in plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.csv)

bash$ style="font-weight: bold;">ls -1 plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.*
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.1
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.2
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.3
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.4
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.5
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.6
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.csv
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.job_0
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP.job_1
bash$




The script does what it can to create a csv file of run stats, which,
with a bit of tidying up in calc produces the following:




frame="void" rules="none">
width="340"> width="110"> width="115"> width="73">























































































































width="1409">##
STARTING 6 Processes from file: process_list.txt
size="4">2 at a time with id
plr_HPDV8025EA_2009-03-21-05_36_19_PADDYS-HPLAPTOP
























align="center" bgcolor="#ccffff" height="123"
valign="middle">JOB
align="right" bgcolor="#ccffff" valign="middle">LINE align="center" bgcolor="#ccffff" valign="middle">Command
being timed
align="center" bgcolor="#ccffff" valign="middle">User
time (seconds)
align="center" bgcolor="#ccffff" valign="middle">System
time (seconds)
align="center" bgcolor="#ccffff" valign="middle">Percent
of CPU this job got
align="center" bgcolor="#ccffff" valign="middle">Elapsed
(wall clock) time (h:mm:ss or m:ss)
align="center" bgcolor="#ccffff" valign="middle">Maximum
resident set size (kbytes)
align="center" bgcolor="#ccffff" valign="middle">Major
(requiring I/O) page faults
align="center" bgcolor="#ccffff" valign="middle">Signals
delivered
align="center" bgcolor="#ccffff" valign="middle">Page
size (bytes)
align="center" bgcolor="#ccffff" valign="middle">Exit
status
sdnum="2057;" align="center" bgcolor="#ffffcc"
height="18">1
sdnum="2057;" align="center" bgcolor="#ffffcc">2 bgcolor="#ffffcc">sleep 4; echo slept for 4 at line 2;
csdf_sdf_sd; exit 19
sdnum="2057;" align="center" bgcolor="#ffffcc">0.13 sdnum="2057;" align="center" bgcolor="#ffffcc">0.15 sdnum="2057;0;0.00%" align="center" bgcolor="#ffffcc">6.00% sdval="0.0000503472222222222" sdnum="2057;0;MM:SS.00"
align="center" bgcolor="#ffffcc">00:04.35
sdval="760832" sdnum="2057;" align="center"
bgcolor="#ffffcc">760832
sdnum="2057;" align="center" bgcolor="#ffffcc">3277 sdnum="2057;" align="center" bgcolor="#ffffcc">6 sdval="65536" sdnum="2057;" align="center"
bgcolor="#ffffcc">65536
sdnum="2057;" align="center" bgcolor="#ffffcc">19
sdnum="2057;" align="center" bgcolor="#ccccff"
height="17">2
sdnum="2057;" align="center" bgcolor="#ccccff">3 bgcolor="#ccccff">sleep 1; cause-an-error; echo slept for 1
at line 3
sdnum="2057;" align="center" bgcolor="#ccccff">0.12 sdnum="2057;" align="center" bgcolor="#ccccff">0.09 sdnum="2057;0;0.00%" align="center" bgcolor="#ccccff">16.00% sdval="0.0000152777777777778" sdnum="2057;0;MM:SS.00"
align="center" bgcolor="#ccccff">00:01.32
sdval="761088" sdnum="2057;" align="center"
bgcolor="#ccccff">761088
sdnum="2057;" align="center" bgcolor="#ccccff">3273 sdnum="2057;" align="center" bgcolor="#ccccff">6 sdval="65536" sdnum="2057;" align="center"
bgcolor="#ccccff">65536
sdnum="2057;" align="center" bgcolor="#ccccff">0
sdnum="2057;" align="center" bgcolor="#ffffcc"
height="17">3
sdnum="2057;" align="center" bgcolor="#ffffcc">4 bgcolor="#ffffcc">sleep 3; echo 'slept for 3 at line 4' sdnum="2057;" align="center" bgcolor="#ffffcc">0.07 sdnum="2057;" align="center" bgcolor="#ffffcc">0.06 sdnum="2057;0;0.00%" align="center" bgcolor="#ffffcc">4.00% sdval="0.0000377314814814815" sdnum="2057;0;MM:SS.00"
align="center" bgcolor="#ffffcc">00:03.26
sdval="617216" sdnum="2057;" align="center"
bgcolor="#ffffcc">617216
sdnum="2057;" align="center" bgcolor="#ffffcc">2639 sdnum="2057;" align="center" bgcolor="#ffffcc">3 sdval="65536" sdnum="2057;" align="center"
bgcolor="#ffffcc">65536
sdnum="2057;" align="center" bgcolor="#ffffcc">0
sdnum="2057;" align="center" bgcolor="#ccccff"
height="17">4
sdnum="2057;" align="center" bgcolor="#ccccff">5 bgcolor="#ccccff">sleep 7; echo "slept for 7 at line 5" sdnum="2057;" align="center" bgcolor="#ccccff">0.07 sdnum="2057;" align="center" bgcolor="#ccccff">0.07 sdnum="2057;0;0.00%" align="center" bgcolor="#ccccff">2.00% sdval="0.000084837962962963" sdnum="2057;0;MM:SS.00"
align="center" bgcolor="#ccccff">00:07.33
sdval="617216" sdnum="2057;" align="center"
bgcolor="#ccccff">617216
sdnum="2057;" align="center" bgcolor="#ccccff">2644 sdnum="2057;" align="center" bgcolor="#ccccff">3 sdval="65536" sdnum="2057;" align="center"
bgcolor="#ccccff">65536
sdnum="2057;" align="center" bgcolor="#ccccff">0
sdnum="2057;" align="center" bgcolor="#ffffcc"
height="18">5
sdnum="2057;" align="center" bgcolor="#ffffcc">7 bgcolor="#ffffcc">sleep 9; another-erro; echo slept for 9 at
line 7
sdnum="2057;" align="center" bgcolor="#ffffcc">0.09 sdnum="2057;" align="center" bgcolor="#ffffcc">0.13 sdnum="2057;0;0.00%" align="center" bgcolor="#ffffcc">2.00% sdval="0.000106597222222222" sdnum="2057;0;MM:SS.00"
align="center" bgcolor="#ffffcc">00:09.21
sdval="760064" sdnum="2057;" align="center"
bgcolor="#ffffcc">760064
sdnum="2057;" align="center" bgcolor="#ffffcc">3274 sdnum="2057;" align="center" bgcolor="#ffffcc">6 sdval="65536" sdnum="2057;" align="center"
bgcolor="#ffffcc">65536
sdnum="2057;" align="center" bgcolor="#ffffcc">0
sdnum="2057;" align="center" bgcolor="#ccccff"
height="17">6
sdnum="2057;" align="center" bgcolor="#ccccff">8 bgcolor="#ccccff">sleep 5; echo 'slept "for 5" at line 8' sdnum="2057;" align="center" bgcolor="#ccccff">0.1 sdnum="2057;" align="center" bgcolor="#ccccff">0.06 sdnum="2057;" align="center" bgcolor="#ccccff">0.03 sdval="0.0000622685185185185" sdnum="2057;0;MM:SS.00"
align="center" bgcolor="#ccccff">00:05.38
sdval="617216" sdnum="2057;" align="center"
bgcolor="#ccccff">617216
sdnum="2057;" align="center" bgcolor="#ccccff">2639 sdnum="2057;" align="center" bgcolor="#ccccff">3 sdval="65536" sdnum="2057;" align="center"
bgcolor="#ccccff">65536
sdnum="2057;" align="center" bgcolor="#ccccff">0




The script itself is:


 1  color="#0000ff">#!/bin/bash 
color="#804040"> 2 #!/opt/TWWfsw/bin/bash -
color="#804040"> 3
color="#804040"> 4 set color="#008080">-u
color="#804040"> 5
color="#804040"> 6 ##
color="#804040"> 7 ## process_runner.sh <concurrent> <total procs>
color="#804040"> 8 ##
color="#804040"> 9 ## Example script given the maximum number of processes to
color="#804040"> 10 ## run concurrently, c, and the total number of processes, n
color="#804040"> 11 ## runs at most c, processes in the background until all n
color="#804040"> 12 ## Have been run.
color="#804040"> 13 ##
color="#804040"> 14 ## Author Donald 'Paddy' McCarthy Dec. 17 2007
color="#804040"> 15 ##
color="#804040"> 16
color="#804040"> 17 # how many processes to run in parallel
color="#804040"> 18 concurrent= color="#a020f0">$1
color="#804040"> 19 # File of commands
color="#804040"> 20 proclist= color="#804040">" color="#a020f0">$2"
color="#804040"> 21
color="#804040"> 22 ##
color="#804040"> 23 ##
color="#804040"> 24
color="#804040"> 25 # main loop wait time between checking background procs.
color="#804040"> 26 tick= color="#ff00ff">1
color="#804040"> 27 # Unique_id for process files of this run
color="#804040"> 28 unique_id=plr_ color="#6a5acd">`date +${ color="#a020f0">USER// color="#804040">/ color="#a020f0">}_%F-%T_ color="#a020f0">${HOSTNAME color="#804040">// color="#804040">/ color="#a020f0">}`
color="#804040"> 29 unique_id= color="#a020f0">${unique_id color="#804040">//: color="#804040">/_ color="#a020f0">}
color="#804040"> 30
color="#804040"> 31 function read_proclistfile color="#6a5acd">{
color="#804040"> 32 # read processes from file ignoring comment lines
color="#804040"> 33 maxprocs= color="#ff00ff">0
color="#804040"> 34 local -i color="#008080">linenumber= color="#ff00ff">0
color="#804040"> 35 while color="#804040"> color="#804040">read color="#804040">; color="#804040"> color="#804040">do
color="#804040"> 36 (( color="#008080">linenumber+= color="#ff00ff">1))
color="#804040"> 37 # echo [ "${REPLY:0:1}" == "#" ] $maxprocs $REPLY
color="#804040"> 38 if color="#804040">[ color="#804040">" color="#a020f0">${REPLY color="#804040">: color="#ff00ff">0: color="#ff00ff">1} color="#804040">" color="#804040">!= color="#804040">" color="#ff00ff">#" color="#804040">] color="#804040">; color="#804040">then
color="#804040"> 39 ((maxprocs+ color="#804040">= color="#ff00ff">1))
color="#804040"> 40 allprocs color="#804040">[ color="#a020f0">$maxprocs color="#804040">] color="#804040">= color="#804040">" color="#a020f0">$REPLY"
color="#804040"> 41 allprocline color="#804040">[ color="#a020f0">$maxprocs color="#804040">] color="#804040">= color="#a020f0">$linenumber
color="#804040"> 42 # echo $linenumber $maxprocs $REPLY
color="#804040"> 43 fi
color="#804040"> 44 done color="#804040">< color="#a020f0">$proclist
color="#804040"> 45 }
color="#804040"> 46
color="#804040"> 47 read_proclistfile
color="#804040"> 48
color="#804040"> 49 printf color="#804040">" color="#6a5acd">\n## STARTING %i Processes from file: %s, %i at a time with id %s color="#6a5acd">\n\n" \
color="#804040"> 50 $maxprocs color="#a020f0">$proclist color="#a020f0">$concurrent color="#a020f0">$unique_id
color="#804040"> 51
color="#804040"> 52
color="#804040"> 53
color="#804040"> 54 function assemble_job color="#6a5acd">{
color="#804040"> 55 #
color="#804040"> 56 local color="#008080">plr_unique_id= color="#804040">" color="#a020f0">$1"; color="#804040">local color="#008080">plr_ran=" color="#a020f0">$2"; color="#804040">local color="#008080">plr_proc= color="#804040">" color="#a020f0">$3"; color="#804040">local color="#008080">plr_procline= color="#804040">" color="#a020f0">$4"; color="#804040">local color="#008080">plr_proclist= color="#804040">" color="#a020f0">$5"
color="#804040"> 57 cat color="#804040"><<!
color="#804040"> 58
color="#804040"> 59 # color="#a020f0">$plr_unique_id color="#ff00ff"> $plr_ran color="#ff00ff">
color="#804040"> 60 ` color="#804040">local color="#6a5acd">`
color="#804040"> 61
color="#804040"> 62 trap 'error= color="#6a5acd">\$?;printf " color="#6a5acd">\n\n##STATISTICS For Job %i color="#6a5acd">\n" color="#a020f0">$plr_ran; printf "# Line %i of file %s color="#6a5acd">\n\n" color="#a020f0">$plr_procline color="#ff00ff"> $plr_proclist color="#ff00ff">; exit \$ color="#ff00ff">error' EXIT
color="#804040"> 63
color="#804040"> 64
color="#804040"> 65 $plr_proc
color="#804040"> 66
color="#804040"> 67 !
color="#804040"> 68 }
color="#804040"> 69
color="#804040"> 70 function print_runstats color="#6a5acd">{
color="#804040"> 71 printf color="#804040">' color="#ff00ff"># %i Jobs in background. %i/%i started. %s\r color="#804040">' \
color="#804040"> 72 ` color="#804040">jobs color="#6a5acd"> -r color="#804040">| color="#6a5acd">wc -l` $ran color="#a020f0">$maxprocs color="#804040">" color="#6a5acd">`date +%F-%T` color="#804040">"
color="#804040"> 73 }
color="#804040"> 74
color="#804040"> 75 # Bash array running keeps track of the background process numbers
color="#804040"> 76 # Start with nothing running (sentinel value will not be a process number
color="#804040"> 77 for color="#6a5acd">((i= color="#ff00ff">0; i color="#804040">< color="#a020f0">$concurrent color="#804040">; i+ color="#804040">= color="#ff00ff">1 )) color="#804040">; color="#804040">do running color="#804040">[ color="#a020f0">$i] color="#804040">= color="#ff00ff">123456789 color="#804040">; color="#804040">done
color="#804040"> 78
color="#804040"> 79 ran= color="#ff00ff">0
color="#804040"> 80 until
color="#804040"> 81 while color="#804040">[ color="#a020f0">$ran -lt color="#a020f0">$maxprocs color="#804040">] color="#804040">; color="#804040">do
color="#804040"> 82 for color="#6a5acd">((p= color="#ff00ff">0; p color="#804040">< color="#a020f0">$concurrent color="#804040">; p+ color="#804040">= color="#ff00ff">1 )) color="#804040">; color="#804040">do
color="#804040"> 83 proc= color="#a020f0">${running color="#a020f0">[$p color="#a020f0">]}
color="#804040"> 84 # Over all running processes...
color="#804040"> 85 # $proc still running?
color="#804040"> 86 ps -p color="#a020f0">$proc | color="#804040">fgrep color="#a020f0">$proc >/dev/null
color="#804040"> 87 if color="#804040">[ color="#a020f0">$? -ne color="#804040">' color="#ff00ff">0' color="#804040">] color="#804040">; color="#804040">then
color="#804040"> 88 # Not found i.e. its finished
color="#804040"> 89 # So start the next
color="#804040"> 90 ((ran+ color="#804040">= color="#ff00ff">1))
color="#804040"> 91 #(assemble_job ) >$unique_id.job_$p
color="#804040"> 92 ( assemble_job color="#804040">" color="#a020f0">$unique_id color="#804040">" color="#804040">" color="#a020f0">$ran" color="#804040">" color="#a020f0">${allprocs color="#a020f0">[$ran color="#a020f0">]} color="#804040">" \
color="#804040"> 93 ${ color="#a020f0">allprocline color="#a020f0">[$ran color="#a020f0">]} color="#a020f0">$proclist color="#804040">) color="#804040">> color="#a020f0">$unique_id.job_ color="#a020f0">$p
color="#804040"> 94 chmod +x color="#a020f0">$unique_id.job_ color="#a020f0">$p
color="#804040"> 95 # run one job in background collecting run stats
color="#804040"> 96 (/bin/ color="#804040">time -v -a -o color="#a020f0">$unique_id. color="#a020f0">$ran < /dev/null color="#a020f0">$unique_id.job_ color="#a020f0">$p 2>&1 color="#804040">) color="#804040">> color="#a020f0">$unique_id. color="#a020f0">$ran &
color="#804040"> 97 running color="#804040">[ color="#a020f0">$p] color="#804040">= color="#a020f0">$!
color="#804040"> 98 runningprocnum color="#804040">[ color="#a020f0">$p] color="#804040">= color="#a020f0">$ran
color="#804040"> 99 if color="#804040">[ color="#a020f0">$ran -ge color="#a020f0">$maxprocs color="#804040">] color="#804040">; color="#804040">then color="#804040">break color="#ff00ff">1; color="#804040">fi
color="#804040">100 #exit
color="#804040">101 fi
color="#804040">102 done
color="#804040">103 sleep color="#a020f0">$tick
color="#804040">104 # Status
color="#804040">105 print_runstats
color="#804040">106 done
color="#804040">107
color="#804040">108 ## break 1 gets us here!
color="#804040">109
color="#804040">110 # Keep on printing status while there are background processes
color="#804040">111 # even though there are no more to start.
color="#804040">112 sleep color="#a020f0">$tick
color="#804040">113 # Status
color="#804040">114 print_runstats
color="#804040">115 do color="#804040">[ color="#6a5acd">`jobs color="#6a5acd"> -r color="#804040">| color="#6a5acd">wc -l` -eq color="#ff00ff">0 ]
color="#804040">116 done
color="#804040">117 wait
color="#804040">118
color="#804040">119 function stats2csv color="#6a5acd">{
color="#804040">120 # Gather overall stats in csv file
color="#804040">121
color="#804040">122 # CSV Heading
color="#804040">123 printf color="#804040">" color="#6a5acd">\n## STARTING %i Processes from file: %s color="#6a5acd">\n%i at a time with id %s color="#6a5acd">\n\n" \
color="#804040">124 $maxprocs color="#a020f0">$proclist color="#a020f0">$concurrent color="#a020f0">$unique_id color="#804040">> color="#a020f0">$unique_id.csv
color="#804040">125 # CSV Column labels
color="#804040">126 gawk color="#804040">' color="#ff00ff">/^##STATISTICS For Job/{s++}
color="#804040">127 s&&/^# Line .*of file/{printf "JOB,LINE";next}
color="#804040">128 s&&/: /{l=$0;sub(/: .*/,"");$1=$1;printf",%s",$0}
color="#804040">129 s&&l~/Exit status:/{s=0;print"";l="";exit}
color="#804040">130 color="#804040">' color="#a020f0">$unique_id. color="#804040">[ color="#ff00ff">0-9]* color="#804040">>> color="#a020f0">$unique_id.csv
color="#804040">131
color="#804040">132 # Stick allprocs into env for later gawk substitution of command
color="#804040">133 for i color="#804040">in color="#a020f0">${!allprocs color="#a020f0">[@] color="#a020f0">}; color="#804040">do
color="#804040">134 declare color="#008080">-x _ap color="#a020f0">$i= color="#804040">" color="#a020f0">${allprocs color="#a020f0">[$i color="#a020f0">]} color="#804040">"
color="#804040">135 done
color="#804040">136 #env|grep '_ap.=' ; debug
color="#804040">137 # CSV Rows
color="#804040">138 gawk color="#804040">' color="#ff00ff">/^##STATISTICS For Job/{s++;j=$NF}
color="#804040">139 s&&/^# Line .*of file/{printf "%s,%s",j,$3;next}
color="#804040">140 s&&/Command [^:]*: /{printf",%s",ENVIRON["_ap" j]; next}
color="#804040">141 s&&/: /{l=$0;sub(/.*: /,"");printf",%s",$0}
color="#804040">142 s&&l~/Exit status:/{s=0;print"";l="";nextfile}
color="#804040">143 color="#804040">' color="#a020f0">$unique_id. color="#804040">[ color="#ff00ff">0-9]* color="#804040">>> color="#a020f0">$unique_id.csv
color="#804040">144 }
color="#804040">145
color="#804040">146 stats2csv
color="#804040">147
color="#804040">148 printf color="#804040">" color="#6a5acd">\n## FINISHED, (stats in %s) color="#6a5acd">\n\n" color="#a020f0">$unique_id.csv
color="#804040">149
color="#804040">150 exit color="#ff00ff">0
color="#804040">151
color="#804040">152
color="#804040">153
color="#804040">154


I remember writing this with the help of the advanced bash tutorial
which is great, and looking back at this program, I would need the
tutorials help if I needed to extend it!



Why not in Python?


 Of course I thought of doing it in Python, but bash's job
control features were used to flesh out the initial idea and it grew to
become the finished article. Looking back at this program as it's over
a year old, I find that I like bash's job control, and gawk one-liners
 (but not the readability) :-)



- Paddy.



href="http://kompozer.net/"> alt="Document made with KompoZer"
src="http://kompozer.sourceforge.net/images/kompozer_80x15.png"
border="0">

Wednesday, March 04, 2009

Psion Netbook

I was just reading about Intel's claim that the term Netbook was generic. It seems to be bullying to me.



So, in support of 'our Psion' I got out my old Psion Series 3 for nostalgias sake, and again marvelled at its design.



P.S. Offers of a complementary Netbook would be gratefully received :-)



- Paddy.

Saturday, February 28, 2009

Extended Vanity Search on Rosetta Code

This is the href="http://paddy3118.blogspot.com/2009/02/vanity-search-on-rosetta-code.html">first,
extended so I can see how far up I am in the href="http://www.rosettacode.org/w/index.php?title=Special:PopularPages">league
table of page views.



Since that first blog entry I have created three new pages:' href="http://www.rosettacode.org/wiki/First-class_functions">First-class
functions ', href="http://www.rosettacode.org/wiki/Octal">'Octal'
which was created after I made a large edit to the href="http://www.rosettacode.org/wiki/Hexadecimal">Hex
page, and 'Y
combinator
', which is my current ' href="http://paddy3118.blogspot.com/2009/02/y-combinator-in-python.html">stretch
goal'.



Here is the code:


'''
Rosetta Code Vanity search:
color="#ff00ff"> How many new pages has someone created?
'''

color="#a020f0">import urllib, re

user = ' color="#ff00ff">Paddy3118'

site = ' color="#ff00ff">http://www.rosettacode.org'
nextpage = site + ' color="#ff00ff">/wiki/Special:Contributions/' + user
nextpage_re = re.compile(
r' color="#ff00ff"><a href="([^"]+)" title="[^"]+" rel="next">older ')

newpages = []
pagecount = 0
color="#804040">while nextpage:
page = urllib.urlopen(nextpage)
pagecount +=1
nextpage = ''
color="#804040">for line color="#804040">in page:
color="#804040">if color="#804040">not nextpage:
color="#0000ff"># Search for URL to next page of results for download
nextpage_match = re.search(nextpage_re, line)
color="#804040">if nextpage_match:
nextpage = (site + nextpage_match.groups()[0]).replace(' color="#ff00ff">&amp;', ' color="#ff00ff">&')
color="#0000ff">#print nextpage
npline=line
color="#804040">if ' color="#ff00ff"><span class="newpage">' color="#804040">in line:
color="#0000ff"># extract N page name from title
newpages.append(line.partition(' color="#ff00ff"> title="')[2].partition(' color="#ff00ff">"')[0])
page.close()

nontalk = [p color="#804040">for p color="#804040">in newpages color="#804040">if color="#804040">not ' color="#ff00ff">:' in p]

color="#804040">print " color="#ff00ff">User: %s has created %i new pages of which %i were not Talk: pages, from approx %i edits" % (
user, len(newpages), len(nontalk), pagecount*50 )
color="#804040">print " color="#ff00ff">New pages created, in order, are: color="#6a5acd">\n ",
color="#804040">print " color="#6a5acd">\n ".join(nontalk[::-1])




nextpage = site + ' color="#ff00ff">/w/index.php?title=Special:PopularPages'
nextpage_re = re.compile(
r' color="#ff00ff"><a href="([^"]+)" class="mw-nextlink">next ')

data_re = re.compile(
r' color="#ff00ff">^<li><a href="[^"]+" title="([^"]+)".*</a>.*\(([0-9,]+) views\)' )

title2rankviews = {}
rank = 1
pagecount = 0
color="#804040">while nextpage:
page = urllib.urlopen(nextpage)
pagecount +=1
nextpage = ''
color="#804040">for line color="#804040">in page:
color="#804040">if color="#804040">not nextpage:
color="#0000ff"># Search for URL to next page of results for download
nextpage_match = re.search(nextpage_re, line)
color="#804040">if nextpage_match:
nextpage = (site + nextpage_match.groups()[0]).replace(' color="#ff00ff">&amp;', ' color="#ff00ff">&')
color="#0000ff"># print nextpage
npline=line
datamatch = re.search(data_re, line)
color="#804040">if datamatch:
title, views = datamatch.groups()
views = int(views.replace(' color="#ff00ff">,', ''))
title2rankviews[title] = [rank, views]
rank += 1
page.close()

color="#804040">print " color="#6a5acd">\n\n Highest page Ranks for user pages:"
fmt = " color="#ff00ff"> %-4s %-6s %s" color="#0000ff"># rank, views, title
color="#804040">print fmt % (' color="#ff00ff">RANK', 'VIEWS', ' color="#ff00ff">TITLE')
highrank = [title2rankviews.get(t,[99999, 0]) + [t] color="#804040">for t color="#804040">in nontalk]
highrank.sort()
color="#804040">for x color="#804040">in highrank:
color="#804040">print fmt % tuple(x)




(I promise to restructure it if I go back to it again :-)



Here is the output:


User: Paddy3118 has created 33 new pages of which 20 were not Talk: pages, from approx 300 edits
New pages created, in order, are:
Spiral
Monty Hall simulation
Web Scraping
Sequence of Non-squares
Anagrams
Max Licenses In Use
One dimensional cellular automata
Conway's Game of Life
Data Munging
Data Munging 2
Column Aligner
Probabilistic Choice
Knapsack Problem
Yuletide Holiday
Common number base conversions
Octal
Integer literals
Command Line Interpreter
First-class functions
Y combinator


Highest page Ranks for user pages:
RANK VIEWS TITLE
107 1767 Monty Hall simulation
127 1409 Conway's Game of Life
171 1109 Anagrams
183 1037 Knapsack Problem
224 812 Max Licenses In Use
232 789 Web Scraping
239 717 Spiral
242 712 One dimensional cellular automata
289 536 Sequence of Non-squares
321 442 Yuletide Holiday
329 422 Column Aligner
333 418 Probabilistic Choice
347 389 Data Munging
351 382 Data Munging 2
427 175 Integer literals
448 128 Common number base conversions
454 110 Command Line Interpreter
469 61 First-class functions
480 42 Octal
634 2 Y combinator




A quick perusal of the results shows that although Conway's Game of
Life was created later, it is high in views; conversely Spiral, my
first page, is down in views.





- Paddy.

Friday, February 27, 2009

Y Combinator in Python

I've been thinking again, and it hurts (its a good hurt though).



Somehow I came across the term 'Y combinator' again and this time
followed it up. It seems that it is a way to create a function
 that adds recursion to functions that don't have recursion in
their definition.



I am still learning from mainly href="http://mvanier.livejournal.com/2897.html">this
explanation 'The Y Combinator (Slight Return)' by Mike Vanier. but I've
managed to follow some things.



Here is a definition of a factorial function in Python:


factorial = lambda n: (1  color="#804040">if n<2  color="#804040">else n*factorial(n-1))


it is easy to see that it is recursive as the lambda expression is
assigned to a name and the name, fibonacci, is used in the lambda
expression.

We can mechanically generate a related function by calling the recursed
call 'f' and by wrapping the expression in an outer lambda expression
of f:


fac = lambda f:  color="#804040">lambda n: (1  color="#804040">if n<2  color="#804040">else n*f(n-1))


Notice that fac is not a recursive function as fac is not called inside
its definition. The fac function does not, in itself, compute the
factirial, but wrapping it in the Y combinator function Y produces a
function that computes a factorial, i.e:


Y(fac)(i) == factorial(i)


 

I am not too clear on the whole derivation of Y but Y,is itself a non
recursive function and, in Python is given by:


Y = lambda f: ( color="#804040">lambda x: x(x))( color="#804040">lambda y: f( color="#804040">lambda z: y(y)(z)))


here follows my interactive session where I am using it to compute
factorials and fibonacci numbers by adding recursion to non-recursive
functions:


>>> Y = lambda f: ( color="#804040">lambda x: x(x))( color="#804040">lambda y: f( color="#804040">lambda z: y(y)(z)))
>>> fac = color="#804040">lambda f: color="#804040">lambda n: (1 color="#804040">if n<2 color="#804040">else n*f(n-1))
>>> [ Y(fac)(i) color="#804040">for i color="#804040">in range(10) ]
[1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]
>>> fib = color="#804040">lambda f: color="#804040">lambda n: 0 color="#804040">if n == 0 color="#804040">else (1 color="#804040">if n == 1 color="#804040">else f(n-1) + f(n-2))
>>> [ Y(fib)(i) color="#804040">for i color="#804040">in range(10) ]
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
>>>




I need to set this investigation aside, then come back to it.



- Paddy

Sunday, February 22, 2009

Vanity Search on Rosetta Code

A vanity search is usually when you look for your own name on Google to
show how popular you are (in those terms, no personal slight intended).



I wanted to know how many new pages on Rosetta Code that I had started
so wrote the following script that starts from a users first page of href="http://www.rosettacode.org/wiki/Special:Contributions/Paddy3118">contributions,
and downloads all further pages searching for new page creations, which
are specially marked in the HTML and show as style="font-weight: bold;">N in the table.



The code:



'''
Rosetta Code Vanity search:
color="#ff00ff"> How many new pages has someone created?
'''

color="#a020f0">import urllib, re

user = ' color="#ff00ff">Paddy3118'

site = ' color="#ff00ff">http://www.rosettacode.org'
nextpage = site + ' color="#ff00ff">/wiki/Special:Contributions/' + user
nextpage_re = re.compile(
r' color="#ff00ff"><a href="([^"]+)" title="[^"]+" rel="next">older ')

newpages = []
pagecount = 0
color="#804040">while nextpage:
page = urllib.urlopen(nextpage)
pagecount +=1
nextpage = ''
color="#804040">for line color="#804040">in page:
color="#804040">if color="#804040">not nextpage:
color="#0000ff"># Search for URL to next page of results for download
nextpage_match = re.search(nextpage_re, line)
color="#804040">if nextpage_match:
nextpage = (site + nextpage_match.groups()[0]).replace(' color="#ff00ff">&amp;', ' color="#ff00ff">&')
color="#0000ff">#print nextpage
npline=line
color="#804040">if ' color="#ff00ff"><span class="newpage">' color="#804040">in line:
color="#0000ff"># extract N page name from title
newpages.append(line.partition(' color="#ff00ff"> title="')[2].partition(' color="#ff00ff">"')[0])
page.close()

nontalk = [p color="#804040">for p color="#804040">in newpages color="#804040">if color="#804040">not p.startswith(' color="#ff00ff">Talk:')]

color="#804040">print " color="#ff00ff">User: %s has created %i new pages of which %i were not Talk: pages, from approx %i edits" % (
user, len(newpages), len(nontalk), pagecount*50 )
color="#804040">print " color="#ff00ff">New pages created, in order, are: color="#6a5acd">\n ",
color="#804040">print " color="#6a5acd">\n ".join(nontalk[::-1])






What I have created on RC


The output of the program shows all the pages I created , in order of
creation:


User: Paddy3118 has created 31 new pages of which 20 were not Talk: pages, from approx 300 edits
New pages created, in order, are:
href="http://paddy3118.blogspot.com/2008/08/spiral.html">Spiral
href="http://paddy3118.blogspot.com/2008/08/monty-hall-problem-simulations.html">Monty Hall simulation
Web Scraping
Sequence of Non-squares
Anagrams
User talk:Lupus
href="http://paddy3118.blogspot.com/2008/10/max-licenses-in-use.html">Max Licenses In Use
One dimensional cellular automata
Conway's Game of Life
Village Pump:Home/Foldable output
Data Munging
Data Munging 2
Column Aligner
Probabilistic Choice
href="http://paddy3118.blogspot.com/2008/12/knapsack-problem.html">Knapsack Problem
Yuletide Holiday
Common number base conversions
Octal
Integer literals
Command Line Interpreter




I have added links to show when I blogged about a task as well as
starting the RC page. I can see that I have a 'User talk:' page that
should also be filtered out.



I was always writing small examples that I thought might be useful
examples for a Python training course. I was looking for a public home
for them and initially thought that, after stumbling across RC, that RC
would be a good home for them. I am only partially right, but I have
found the discipline of writing for RC to be enjoyable in itself, so
continue to contribute.



I quite enjoyed the challenge of creating an RC task beginning with the
letters K and then Y, so they could complete their full alphabet of
 named tasks. Yuletide Holiday was created around xmas 2008
and is really about Y2k errors - but they seem to have stuck with my
name :-)

I need to re-visit Data Munging and add extra clarification to the task
as RC needs a good task description, wheras a lot of data munging tasks
don't. 



If you are interested in language comparison sites then you might want
to take a look at RC too!



- Paddy.


Saturday, February 14, 2009

Comparison of Python Solutions to The Josephus Problem

The
Problem

After coming across href="http://danvk.org/josephus.html">this
language comparison I decided to revisit the Josephus problem:
style="font-style: italic;">

"
Flavius Josephus was a roman historian of Jewish
origin. During the
Jewish-Roman wars of the first century AD, he was in a cave with fellow
soldiers,
40 men in all, surrounded by enemy Roman troops. They decided to commit
suicide by standing in a ring and counting off each third man. Each man
so
designated was to commit suicide...Josephus,
not wanting to die, managed to place himself in the position of the
last survivor.


In the general version of the problem, there are n soldiers
numbered from 1
to n and each m-th soldier will be eliminated. The count starts from
the first
soldier. What is the number of the last survivor?
"

Notice
that:

  1. The soldiers are numbered from one
    to n.
  2. If the number of soldiers, n, is more than m
    then the first soldier to be killed is the m'th 

Why
play with it?

The class based solution was created by Dan
purely to compare languages and worked well for him I guess. i on the
other hand, read his post and immediately started thinking of ways of
using Pythons lists for the task, and one thing lead to another ...

My
ramblings

Preamble

Just conditionally use
Psyco
  2  color="#804040">try:
color="#804040"> 3 import psyco
color="#804040"> 4 psyco.full()
color="#804040"> 5 except ImportError:
color="#804040"> 6 print ' color="#ff00ff">Psyco not installed, the program will just run slower'
color="#804040"> 7

Dan's Class based solution:

I
 made a slight change to it. I needed a cleaner interface so
created a function of (m,n) that returned the one-based index of the
last soldier to be killed.:
 color="#804040">  8 
color="#804040"> 9 class color="#008080">Person:
color="#804040"> 10 '''
color="#804040"> 11 # From: href="http://danvk.org/josephus.html">http://danvk.org/josephus.html
color="#804040"> 12 '''
color="#804040"> 13 def color="#008080">__init__(self,pos):
color="#804040"> 14 self.pos = pos
color="#804040"> 15 self.alive = 1
color="#804040"> 16 def color="#008080">__str__(self):
color="#804040"> 17 color="#804040">return " color="#ff00ff">Person #%d, %s" % (self.pos, self.alive)
color="#804040"> 18
color="#804040"> 19 # Creates a chain of linked people
color="#804040"> 20 # Returns the last one in the chain
color="#804040"> 21 def color="#008080">createChain(self,n):
color="#804040"> 22 color="#804040">if n>0:
color="#804040"> 23 self.succ = Person(self.pos+1)
color="#804040"> 24 color="#804040">return self.succ.createChain(n-1)
color="#804040"> 25 color="#804040">else:
color="#804040"> 26 color="#804040">return self
color="#804040"> 27
color="#804040"> 28 # Kills in a circle, getting every nth living person
color="#804040"> 29 # When there is only one remaining, the lone survivor is returned
color="#804040"> 30 def color="#008080">kill(self,pos,nth,remaining):
color="#804040"> 31 color="#804040">if self.alive == 0: color="#804040">return self.succ.kill(pos,nth,remaining)
color="#804040"> 32 color="#804040">if remaining == 1: color="#804040">return self
color="#804040"> 33 color="#804040">if pos == nth:
color="#804040"> 34 self.alive = 0
color="#804040"> 35 pos=0
color="#804040"> 36 remaining-=1
color="#804040"> 37 color="#0000ff">#print "Killing", self
color="#804040"> 38 color="#804040">return self.succ.kill(pos+1,nth,remaining)
color="#804040"> 39
color="#804040"> 40 def color="#008080">josephus_class_ring(m,n):
color="#804040"> 41 import sys
color="#804040"> 42
color="#804040"> 43 sys.setrecursionlimit(25000)
color="#804040"> 44
color="#804040"> 45 first = Person(1)
color="#804040"> 46 last = first.createChain(n-1)
color="#804040"> 47 last.succ = first
color="#804040"> 48 winner = first.kill(1,m,n)
color="#804040"> 49 return winner.pos

(Line 37 is also mine)

My
first Python solution:

I wanted to use Python lists and then
use the modulo operator, % on the indexing  so that a list
would wrap-around and become a ring. I got 'cocky' too and decided to
 kill all soldiers to be killed in each pass over the list in
one go by using Pythons list slicing together with del
 color="#804040"> 51 def  color="#008080">josephus_modulo(m,n):
color="#804040"> 52 ''' color="#ff00ff"> Use list indexing to do multiple deletes per 'round' '''
color="#804040"> 53 if n == 1:
color="#804040"> 54 color="#804040">return 1
color="#804040"> 55 if m == 1:
color="#804040"> 56 color="#804040">return n
color="#804040"> 57
color="#804040"> 58 soldiers = range(1,n+1)
color="#804040"> 59 startindex = m-1
color="#804040"> 60 while soldiers:
color="#804040"> 61 leng = len(soldiers)
color="#804040"> 62 killed = soldiers[startindex::m] color="#804040">or killed
color="#804040"> 63 color="#804040">del soldiers[startindex::m]
color="#804040"> 64 startindex = (m - leng%m + startindex) % m
color="#804040"> 65 return killed[-1]

I
admit, I was stumped over the correct equation necessary for
calculating the startindex for another 'round' at line 64 - it took
some time to get right and prompted the comparison code added to the
end of the file..

I got the above working , then
went to bed.

My second solution:

In the
light of day, I thought my first solution was opaque , and I wanted to
also try a solution that used a list, but just killed soldiers one by
one. I came up with:
 66 
color="#804040"> 67 def color="#008080">josephus_list_ring(m,n):
color="#804040"> 68 ''' color="#ff00ff"> Use list as ring via modulo '''
color="#804040"> 69 if n == 1:
color="#804040"> 70 color="#804040">return 1
color="#804040"> 71 if m == 1:
color="#804040"> 72 color="#804040">return n
color="#804040"> 73 soldiers = range(1,n+1)
color="#804040"> 74 leng = n
color="#804040"> 75 m1 = index = m-1
color="#804040"> 76 while leng>1:
color="#804040"> 77 color="#0000ff">#print "soldiers, leng, index", soldiers, leng, index
color="#804040"> 78 killed = soldiers.pop(index)
color="#804040"> 79 color="#0000ff">#print 'killed', killed
color="#804040"> 80 index += m1
color="#804040"> 81 leng -= 1
color="#804040"> 82 index %= leng
color="#804040"> 83 return soldiers[0]

href="http://www.python.org/doc/current/library/stdtypes.html?highlight=pop#mutable-sequence-types">list.pop
was the natural way to kill each soldier.

I liked
this function. I would find this relatively easy to both explain and
maintain.

A solution using a class based ring of
objects

I know Dan deliberately added recursion to his class
based solution, and it showed, in me hitting the recursion limit when
using some combinations of m and n. I decided to try using class
instaances to create a ring but iterate over the ring to kill soldiers
insteaad of using recursion.
 color="#804040"> 84 
color="#804040"> 85 def color="#008080">josephus_iter_class(m,n):
color="#804040"> 86 ''' color="#ff00ff"> Iterative class based ring solution '''
color="#804040"> 87
color="#804040"> 88 if n == 1:
color="#804040"> 89 color="#804040">return 1
color="#804040"> 90 if m == 1:
color="#804040"> 91 color="#804040">return n
color="#804040"> 92
color="#804040"> 93 class color="#008080">Soldier(object):
color="#804040"> 94 color="#804040">def color="#008080">__init__(self, number):
color="#804040"> 95 self.number = number
color="#804040"> 96 self.next = None
color="#804040"> 97
color="#804040"> 98 # Soldiers without position
color="#804040"> 99 soldiers = [Soldier(j+1) color="#804040">for j color="#804040">in xrange(n)]
color="#804040">100 # link soldiers in a ring
color="#804040">101 for j color="#804040">in xrange(n):
color="#804040">102 soldiers[j].next = soldiers[(j+1) % n]
color="#804040">103 last, index = soldiers[m-2], soldiers[m-1]
color="#804040">104 for i color="#804040">in xrange(n-1):
color="#804040">105 color="#0000ff"># Kill soldier at index
color="#804040">106 index = last.next = index.next
color="#804040">107 color="#0000ff"># advance
color="#804040">108 color="#804040">for j color="#804040">in xrange(m-1):
color="#804040">109 last, index = index, index.next
color="#804040">110
color="#804040">111 return index.number
Notice
that the class Soldier has just enough fields to form a ring and hold
its original position, (number), in the ring. The real heavy lifting is
done in the body of the function.

I liked the above
solution too, but not as much as josephus_list_ring

Timing

Note:
Dan's solution was not created with fast run times as a goal.

I
decided to run some timings on the various solutions:
 color="#804040">112 
color="#804040">113
color="#804040">114 def color="#008080">times():
color="#804040">115 ''' color="#ff00ff"> Time the different josephus functions
color="#804040">116 '''
color="#804040">117 import timeit
color="#804040">118
color="#804040">119 functions = [josephus_class_ring, josephus_modulo, josephus_list_ring, josephus_iter_class]
color="#804040">120 for f color="#804040">in functions:
color="#804040">121 color="#804040">print ' color="#ff00ff">Time for', f.__name__, timeit.Timer(
color="#804040">122 ' color="#ff00ff">[%s(m,n) for n in range(1,50) for m in range(1,n-1)]' % f.__name__,
color="#804040">123 ' color="#ff00ff">from josephus import %s' % f.__name__).timeit(number=1)

The
results were: (without psyco):
style="margin-left: 40px;"> style="font-family: monospace;">Time for josephus_class_ring
6.44339937059
style="font-family: monospace;">Time for josephus_modulo
0.141748741809
style="font-family: monospace;">Time for josephus_list_ring
0.0586138741097
style="font-family: monospace;">Time for josephus_iter_class
0.336112550617

With psyco enabled I
got:
style="font-family: monospace;">Time for josephus_class_ring
0.851871117698
style="font-family: monospace;">Time for josephus_modulo
0.0548374164873
style="font-family: monospace;">Time for josephus_list_ring
0.0254780984734
style="font-family: monospace;">Time for josephus_iter_class
2.86889059922

Notice how
josephus_iter_class goes slower
with psyco.

The other bits

 color="#804040">124 
color="#804040">125
color="#804040">126
color="#804040">127
color="#804040">128 if __name__ == ' color="#ff00ff">__main__':
color="#804040">129 n = 25 color="#0000ff"># n people in a circle
color="#804040">130 m = 3 color="#0000ff"># kill every mth person
color="#804040">131
color="#804040">132 print " color="#ff00ff">In a circle of %d people, killing number %d" % (n,m)
color="#804040">133
color="#804040">134 print " color="#ff00ff">Winner by josephus_class_ring: ", josephus_class_ring(m,n)
color="#804040">135 print " color="#ff00ff">Winner by josephus_modulo: ", josephus_modulo(m,n)
color="#804040">136 print " color="#ff00ff">Winner by josephus_list_ring: ", josephus_list_ring(m,n)
color="#804040">137 print " color="#ff00ff">Winner by josephus_iter_class: ", josephus_iter_class(m,n)
color="#804040">138
color="#804040">139 #raise Exception('Stop')
color="#804040">140 print
color="#804040">141 times()
color="#804040">142
color="#804040">143 # Equality check
color="#804040">144 for n color="#804040">in range(1,25):
color="#804040">145 color="#804040">for m color="#804040">in range(1,n-1):
color="#804040">146 color="#804040">assert (josephus_class_ring(m,n) == josephus_modulo(m,n)
color="#804040">147 == josephus_list_ring(m,n) == josephus_iter_class(m,n)), " color="#ff00ff">Whoops"
color="#804040">148
Lines 128 to
137 I used most when developing each function, together with copious
print statements.
Modifications to  lines 143 to
148 ensured that each function was giving the same results, (as well as
giving stack errors for the reccursive solution).

END.

style="font-style: italic;">

Tuesday, February 10, 2009

Carol Thatchers golliwog remark

Having
lived through "The Thatcher years", when Carol Thatcher's mother was
Prime Minister, I thought then that most of the racists were
 openly members of her Tory party. It is no surprise then to
hear that href="http://www.dailymail.co.uk/news/article-1136005/Chiles-reveals-truth-Carol-Thatchers-golliwog-gaffe.html">Carol
thinks it is OK to call black people Golliwogs when "where
all white here" - nudge-nudge. Together with href="http://news.bbc.co.uk/1/hi/world/africa/4169557.stm">her
brothers antics in Africa, you get a picture of the views
circulating in the Thatcher household during their formative years that
is sickeningly racist.

I guess we have advanced to
the stage that even someone who is the daughter of a Baroness; who's
mother still wields great power in the UK; felt it unwise to make her
comments on-screen, but by witnessing the cries of
'outrage' at her mere dismissal from the BBC. I think British
society can't be complacent on its way to an integrated society.

-
Paddy.

P.S. Big-up Jo Brand for walking out.
P.P.S.
There was a member of href="http://www.comicrelief.com/what_we_do/what_we_fund_and_why">Comic
Relief  in the green room too, yet still Carol
thinks its OK?

Thursday, January 22, 2009

DVClub yah!

I attended the second (as well as the first), meeting of the Bristol chapter of DVClub yesterday.



This is a meeting, organised by and for Verification Engineers.
Held at a local company (my workplace this time around - Infineon
Technologies
), it is a lunchtime meeting of approximately 55 engineers
this time and the subject for the day was Formal Verification (of
Hardware).







Once we all were seated, I couldn't help getting a feeling of
'belonging'. It isn't often you get to meet local engineers working in
similar fields to yourself. Mike Bartley, the organiser kept the event
on schedule, and the speakers knew their stuff. A question of
mine, lead to an interesting conversation with Douglas Fisher who is a
verification specialist at Synopsys - it's a rare opportunity when you
get to swap ideas with a developer from an EDA company so I must
thank DVClub and its Bristol members for a well organised gig.






Links to presented papers




Mike Bartley has all the presented papers available on his company's site for the Bristol Branch of the BCS








- Paddy.

Tuesday, January 06, 2009

word differ

My brother was over from France this new year and he asked me to write him a small script for the first time.



He had a problem in that he had some prose in a foreign language, and a list of words that the students should learn. He wanted stats on what words from the list were actually used in his prose.



I wrote the following small script. Interestingly enough, the bit that needed a bit of research was doing the graphical file opening for Windows - I really am that used to a Unix environment :-)



It is a very simple program but, on reflection, I can remember doing similar work twenty years ago in AWK: finding out what is different between two 'things' and printing the differences in a sorted order. Back then I would have used Suns original awk, keeping the two sets of data in associative arrays, doing the set arithmatic in explicit for loops, and calling out to the Unix sort utility to print sorted output. Such specialized diff'ing tasks have crept up regularly for me, and they are now very easy to do in Python (or Perl); and I have never had to resort to C for even the largest datasets I've encountered.



word_differ.pyw


Here is the program word_differ.pyw:


# -*- coding: cp1252 -*-

'''
Find the difference in words used in two documents

Asks for two files which then have their words extracted and compared.

Most punctuation, and all words of just numbers are dropped.

(C) 2009 Donald 'Paddy' McCarthy. paddy3118@gmail.com
'''

import Tkinter, tkFileDialog, re, datetime, sys
root = Tkinter.Tk()
root.withdraw()


def wordsinfile(f):
words = set()
txt = f.read()
words = set( w.lower() for w in re.split(r"""[ \t\n\-,;:\.\?@#~=+_!"£$%^&\*\(\)<>|\\\[\]{}`]""", txt)
if w and not all( c in '0123456789' for c in w) )
return words


f1 = tkFileDialog.askopenfile(parent=root,initialdir="/",title='First text file for word diffing')
if not f1: sys.exit()
f2 = tkFileDialog.askopenfile(parent=root,initialdir="/",title='Second text file for word diffing')
if not f2: sys.exit()
f3 = tkFileDialog.asksaveasfile(parent=root,initialdir="/",title='Save output as file (end with .txt to create a windows text file)')
if not f3: sys.exit()

set1 = wordsinfile(f1)
set2 = wordsinfile(f2)

print >>f3, "Output from program word_differ.py (Author Donald McCarthy: (C) 2009)."
print >>f3, " Generated on: " + datetime.datetime.now().isoformat()

print >>f3, " First File: " + f1.name
print >>f3, " Second File: " + f2.name
print >>f3, " Output to: " + f3.name

print >>f3, "\nWords in the first file that are not in the second:"
for w in sorted(set1 - set2):
print >>f3, " ", w
print >>f3, "\nWords in the second file that are not in the first:"
for w in sorted(set2 - set1):
print >>f3, " ", w
print >>f3, "\nWords common to both files:"
for w in sorted(set1 & set2):
print >>f3, " ", w

f1.close(); f2.close(); f3.close()


Example word list: word list.txt


  back
black
jump
bored
hair
missile
lamb
little
played
sent

Example prose: rhyme.txt


Mary had a little lamb
Who's hair was long and black.
She played with it,
Got bored with it
And then she sent it back!

By Paddy McCarthy - 2009-01-04.

Example output: diff.txt


Output from program word_differ.py (Author Donald McCarthy: (C) 2009).
Generated on: 2009-01-04T07:20:18.890000
First File: C:/Documents and Settings/All Users/Documents/Paddys/word_differ/word list.txt
Second File: C:/Documents and Settings/All Users/Documents/Paddys/word_differ/rhyme.txt
Output to: C:/Documents and Settings/All Users/Documents/Paddys/word_differ/diff.txt

Words in the first file that are not in the second:
jump
missile

Words in the second file that are not in the first:
a
and
by
got
had
it
long
mary
mccarthy
paddy
she
then
was
who's
with

Words common to both files:
back
black
bored
hair
lamb
little
played
sent

About Me

Followers

Subscribe Now: google

Add to Google Reader or Homepage

Go deh too!

whos.amung.us

Blog Archive