I have a program that works through each log in turn, extracting results, and this can take tens of minutes to complete.
A future optimisation would be to extract the pass/fail data for each regression run in the same job that ran the simulation; at present, though, the pass/fail criteria may still change, so extraction is run as a separate task after all the regression simulations are done.
I normally run the log extraction process on an 8-core, 16-thread machine with a fast connection to the file server, so I was thinking of ways to parallelise the task.
Enter this article, part of Doug Hellmann's PyMOTW series, about forking a process.
After running it, I decided to expand on it to be more like my situation, where you have a list of similar tasks to perform and want to split them between multiple processes via the POSIX fork() call. The following is just a framework; I hope to get time to actually use it and see what its problems are.
import os
import sys
import time

maxprocs = 4         # maxprocs - 1 children, plus the parent, do the work
jobarray = range(25) # Work to be split between the processes

def do1job(job):
    'Process one item from the jobarray'
    return job

def picklesave(proc, results):
    'Save the partial results of process number proc to a file (stub)'
    time.sleep(3)

def unpickleread(proc):
    'Read the partial results previously saved by process number proc (stub)'
    return []   # dummy

def dowork(proc, maxprocs=maxprocs, jobarray=jobarray):
    'Do the jobarray tasks assigned to process number proc'
    time.sleep(proc)    # convenience processing time
    results = []
    for count, job in enumerate(jobarray):
        if count % maxprocs == proc:
            results.append(do1job(job))
    picklesave(proc, results)
    print("Process: %i completed jobs: %s" % (proc, results))
    return results


def createworkerchildren(maxprocs=maxprocs):
    for proc in range(1, maxprocs):
        print("PARENT: Forking %s" % proc)
        worker_pid = os.fork()
        if not worker_pid:
            # This is a child process
            dowork(proc)
            # Children exit here!
            sys.exit(proc)

# Start, but don't wait for, the child processes
createworkerchildren()

# Do the parent process's share of the work
results = []
results += dowork(0)    # no need to pickle/unpickle the parent's own share

# Wait for the children
for proc in range(1, maxprocs):
    print("PARENT: Waiting for any child")
    done = os.wait()
    print("PARENT: got child %s" % (done,))

# Read and integrate the child results
for proc in range(1, maxprocs):
    results += unpickleread(proc)

# Calculate and print a data summary
pass
Running the above on Cygwin produced:
bash$ python testfork.py
PARENT: Forking 1
PARENT: Forking 2
PARENT: Forking 3
Process: 0 completed jobs: [0, 4, 8, 12, 16, 20, 24]
PARENT: Waiting for any child
Process: 1 completed jobs: [1, 5, 9, 13, 17, 21]
PARENT: got child (5536, 256)
PARENT: Waiting for any child
Process: 2 completed jobs: [2, 6, 10, 14, 18, 22]
PARENT: got child (3724, 512)
PARENT: Waiting for any child
Process: 3 completed jobs: [3, 7, 11, 15, 19, 23]
PARENT: got child (4236, 768)
bash$
Notice how each process gets a fair portion of the jobs (assuming all jobs need roughly the same resources): process proc takes every maxprocs-th job, starting at offset proc. In each "PARENT: got child" tuple returned by os.wait(), the first element is the child's process ID and the second is its exit status, with the value passed to sys.exit(proc) encoded in the high byte, so 256, 512 and 768 correspond to children 1, 2 and 3.
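The picklesave and unpickleread stubs are the obvious missing pieces. Here is a minimal sketch of what they might look like, using the standard pickle module; the per-process file naming scheme (partial_<proc>.pickle) is my own invention for illustration, not something from the original article:

import pickle

def picklesave(proc, results):
    'Save the partial results of process number proc to its own file'
    # Hypothetical file name; any name unique to each process would do
    with open('partial_%i.pickle' % proc, 'wb') as f:
        pickle.dump(results, f)

def unpickleread(proc):
    'Read the partial results previously saved by process number proc'
    with open('partial_%i.pickle' % proc, 'rb') as f:
        return pickle.load(f)

Because each worker writes only its own file there is no need for any locking; the parent simply reads the maxprocs - 1 files back after os.wait() has confirmed that all the children have exited.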