Sunday, October 18, 2009

Template for forking a Python program

I am doing a large regression run of some 30,000 simulations, producing around 30 gigabytes of logs, from which I need to extract the test parameters and simulation result of each run.

I have a program that goes through each log in turn extracting results, and this can take tens of minutes to complete.

A future optimisation will be to extract the pass/fail data for each regression run in the same job that ran the simulation; but at present the pass/fail criteria may still change, so extraction is run as a separate task after the regression simulations are all done.

I normally run the log extraction process on an 8-core, 16-thread machine with a fast connection to the file server, so I was thinking of ways to parallelise the task.

Enter this article, part of Doug Hellmann's PyMOTW series, about forking a process.

After running it, I decided to expand on it to be more like my situation, where you have a list of similar tasks to perform and want to split them between multiple processes via the POSIX fork() call. The following is just a framework; I hope to get time to actually use it and see what problems it has.
import os
import sys
import time

maxprocs = 4         # maxprocs - 1 children plus the parent do the work
jobarray = range(25) # Work to be split between processes

def do1job(job):
    'Process one item from the jobarray'
    return job

def picklesave(results):
    'Save partial results to file (dummy stand-in)'
    time.sleep(3)

def unpickleread(proc):
    'Read partial results previously saved by process number proc'
    return [] # dummy

def dowork(proc, maxprocs=maxprocs, jobarray=jobarray):
    'Split the jobarray tasks between maxprocs processes, by proc number'
    time.sleep(proc) # simulated processing time (also staggers the output)
    results = []
    for count, job in enumerate(jobarray):
        if count % maxprocs == proc:
            results.append(do1job(job))
    picklesave(results)
    print "Process: %i completed jobs: %s" % (proc, results)
    return results


def createworkerchildren(maxprocs=maxprocs):
    for proc in range(1, maxprocs):
        print 'PARENT: Forking %s' % proc
        worker_pid = os.fork()
        if not worker_pid:
            # This is a child process
            dowork(proc)
            # Children exit here!
            sys.exit(proc)

# Start child processes; don't wait for them to finish
createworkerchildren()

# Do the parent process's share of the work
results = []
results += dowork(0) # no need to pickle/unpickle the parent's share

# Wait for the children
for proc in range(1, maxprocs):
    print 'PARENT: Waiting for any child'
    done = os.wait()
    print 'PARENT: got child', done

# Read and integrate the child results
for proc in range(1, maxprocs):
    results += unpickleread(proc)

# Calculate and print data summary
pass

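The picklesave and unpickleread functions above are only stubs. A minimal sketch of what they might become, using the standard pickle module, could look like the following (the per-process filename scheme is my own assumption, and picklesave would then need the proc number passed to it too):

import pickle

def picklesave(results, proc):
    'Save partial results to a per-process file (sketch; filenames assumed)'
    with open('partial_%i.pkl' % proc, 'wb') as f:
        pickle.dump(results, f)

def unpickleread(proc):
    'Read partial results previously saved by process number proc'
    with open('partial_%i.pkl' % proc, 'rb') as f:
        return pickle.load(f)
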

Running the template on Cygwin produced:
bash$ python testfork.py
PARENT: Forking 1
PARENT: Forking 2
PARENT: Forking 3
Process: 0 completed jobs: [0, 4, 8, 12, 16, 20, 24]
PARENT: Waiting for any child
Process: 1 completed jobs: [1, 5, 9, 13, 17, 21]
PARENT: got child (5536, 256)
PARENT: Waiting for any child
Process: 2 completed jobs: [2, 6, 10, 14, 18, 22]
PARENT: got child (3724, 512)
PARENT: Waiting for any child
Process: 3 completed jobs: [3, 7, 11, 15, 19, 23]
PARENT: got child (4236, 768)
bash$
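
Incidentally, the tuples printed after os.wait() are (pid, status); the child's sys.exit(proc) value is packed into the high byte of the status, which is why 256, 512 and 768 appear above:

>>> [status >> 8 for status in (256, 512, 768)]  # equivalently os.WEXITSTATUS(status)
[1, 2, 3]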

Notice how each child gets to do a fair portion of the jobs (assuming all jobs need the same resources).
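
The round-robin share-out is done by the count % maxprocs == proc test in dowork. The same selection can be written as an extended slice, which might make a neater implementation:

>>> jobarray = range(25)
>>> maxprocs = 4
>>> jobarray[1::maxprocs]
[1, 5, 9, 13, 17, 21]

This matches the jobs completed by process 1 in the run above.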

Sunday, October 11, 2009

Extend functools.partial to nested function calls?

I am reading the 1977 ACM Turing Award Lecture "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs" by John Backus.

In section 5.1 he is contrasting a von Neumann program for inner product, which he gives as:
c := 0
for i := 1 step 1 until n do
    c := c + a[i] × b[i]
The above converts quite nicely to the Python equivalent:
>>> v1, v2 = [1,2,3], [6,5,4]
>>> def ip_von_neumann(a, b):
        c = 0
        for i in range(3):
            c = c + a[i] * b[i]
        return c

>>> ip_von_neumann(v1, v2)
28
Backus then goes on to contrast the von Neumann style with his functional program:
Def Innerproduct ≡ (Insert +)∘(ApplyToAll ×)∘Transpose
Which I converted to the following functional style Python:
>>> from operator import add, mul
>>> from functools import reduce as insert
>>>
>>> def ip_functional(*x):
        return insert(add, map(mul, *x))

>>> ip_functional(v1, v2)
28

Well, the Python works, but it still has this name x that I would like to eliminate.

I think I can go part of the way by using functools.partial. For example, for the single function call of map(mul, *x) I can do:
>>> from functools import partial
>>> p1 = partial(map, mul)
>>> assert p1(v1, v2) == map(mul, *(v1, v2))
>>> map(mul, *(v1, v2))
[6, 10, 12]
>>>
But I don't know how to go that extra step and remove x from the definition fully, i.e. end up with just some assignment like:
ip_purefunctional = {A pure function of functions without any argument placeholders}
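
The nearest I have got is to introduce an explicit compose helper, but that just trades the name x for compose's own argument names (this compose is my own two-liner, not something from the standard library):

>>> def compose(f, g):
        return lambda *args: f(g(*args))

>>> ip_purefunctional = compose(partial(insert, add), partial(map, mul))
>>> ip_purefunctional(v1, v2)
28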


All help would be appreciated. Thanks.

Friday, October 02, 2009

The Otter Algorithm

I was reading about a Richard Dawkins program that, through random mutation and selection of the best fit, would gradually converge a random string onto a target string. It is called the Weasel Program, and a rough outline of it is:
  1. Start with a random string of 28 characters
  2. Copy this string 100 times, with a 5% chance per character of that character being replaced with a random character
  3. Compare each new string with the target "METHINKS IT IS LIKE A WEASEL", and give each a score
  4. If any of the new strings has a perfect score, halt
  5. Otherwise, take the highest scoring string, and go to step 2
After coding it up, I found it just would not converge to a final match for me. I suspected I had a mistake somewhere, but I also just did not like how long it was taking to converge.

I decided to change things.

It seemed to me that if you are further away from the ideal then you should change more letters; and, after seeing the score of the current best string sometimes decrease, I made a modification to choose the best of both the copies and the string they are mutations of.

With that modification, convergence was swift.

Oh, I call it the Otter program, as it is more than the Weasel.

The program:

 1 
 2 '''
 3 From: http://en.wikipedia.org/wiki/Weasel_program
 4       With modifications
 5 '''
 6 
 7 from string import ascii_uppercase
 8 from random import choice, random
 9 
10 target = list("METHINKS IT IS LIKE A WEASEL")
11 chance = .05
12 times  = 100
13 
14 charset      = ascii_uppercase + ' '
15 iterations   = 0
16 highscore    = [choice(charset) for _ in range(len(target))]
17 perfectscore = len(target)
18 
19 def score(trial):
20     return sum(t==h for t,h in zip(trial, target))
21 
22 while highscore != target:
23     thischance =  1-((perfectscore - score(highscore)) / perfectscore * (1 - chance))
24     #thischance = chance
25     iterations += 1
26     if iterations % 50 == 0:
27         print (iterations, score(highscore), ''.join(highscore))
28     copies = [ [(ch if random() <= thischance else choice(charset)) for ch in highscore]
29                for _ in range(times) ]  + [highscore]
30     highscore = max(copies, key=score)
31 print (iterations, ''.join(highscore))

The function score at line 19 uses the fact that the booleans True and False are also the numbers 1 and 0, and so calculates how many characters are in the right position as its result.
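
For example, summing the booleans from the character comparisons counts the matches directly:

>>> sum(t == h for t, h in zip("WEASEL", "WEAPON"))
3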

Line 23 modifies the amount of randomness injected, based on how bad the last highscore was: change more if you are further away from the target, less as you approach it.
The name thischance is used in line 28. random() returns a float between 0.0 and 1.0; if it is less than or equal to thischance then a particular character is not changed.

The formula for thischance in line 23 is equal to chance when no characters match, and 1.0 if all characters were to match. This forces smaller changes as you approach the target string.

Here is a little snippet to show how thischance varies with the score (sc):


>>> for sc in range(0,29,7):
        print( "score=%2i, thischance=%5.2f" % (sc,
                1-((perfectscore - sc) / perfectscore * (1 - chance))) )


score= 0, thischance= 0.05
score= 7, thischance= 0.29
score=14, thischance= 0.53
score=21, thischance= 0.76
score=28, thischance= 1.00 
 
In line 29 the addition of highscore to copies means that the string chosen by max in line 30 can never score lower than the current highscore.

P.S. This post is only tangentially linked to Darwinism. I am secure in the knowledge that my children's school teaches it, and I am not really interested in it being mentioned in any comments. Treat this just as a way to start from a random choice of letters and home in on a target string by repeated random alteration, and selection of the "best".