Go deh!: Template for forking a Python program

Sunday, October 18, 2009

Template for forking a Python program

I am doing a large regression run of some 30,000 simulations, which produces around 30 gigabytes of logs, and I need to extract the test parameters and simulation result from each log.

I have a program that goes through each log in turn extracting results and this may take tens of minutes to complete.

A future optimisation will be to extract pass/fail data for each regression run in the same job that ran the simulation, but at present, there may be changes to the pass/fail criteria, so it is run as a separate task after the regression simulations are all done.

I normally run the log extraction process on an 8-core, 16-thread machine with a fast connection to the file server, so I was thinking of ways to parallelise the task.

Enter this article, part of Doug Hellmann's PyMOTW series, about forking a process.

After running it, I decided to expand on it to be more like my situation, where you have a list of similar tasks to perform and want to split them between multiple processes via the POSIX fork() call. The following is just a framework; I hope to get time to actually use it and see what the problems with it are.

import os
import sys
import time

maxprocs = 4          # maxprocs - 1 children plus the parent do the work
jobarray = range(25)  # Work to be split between processes

def do1job(job):
    'Process one item from the jobarray'
    return job

def picklesave(results):
    'Save partial results to file'
    time.sleep(3)  # dummy

def unpickleread(proc):
    'Read partial results previously saved by process number proc'
    return []  # dummy

def dowork(proc, maxprocs=maxprocs, jobarray=jobarray):
    'Split jobarray tasks between maxprocs processes by proc number'
    time.sleep(proc)  # convenience processing time
    results = []
    for count, job in enumerate(jobarray):
        if count % maxprocs == proc:
            results.append(do1job(job))
    picklesave(results)
    print("Process: %i completed jobs: %s" % (proc, results))
    return results


def createworkerchildren(maxprocs=maxprocs):
    for proc in range(1, maxprocs):
        print('PARENT: Forking %s' % proc)
        worker_pid = os.fork()
        if not worker_pid:
            # This is a child process
            dowork(proc)
            # Children exit here!
            sys.exit(proc)

# Start and don't wait for child processes to do their work
createworkerchildren()

# Do the parent process's share of the work
results = []
results += dowork(0)  # don't have to pickle/unpickle the parent process's share

# Wait for children
for proc in range(1, maxprocs):
    print('PARENT: Waiting for any child')
    done = os.wait()
    print('PARENT: got child %s' % (done,))

# Read and integrate child results
for proc in range(1, maxprocs):
    results += unpickleread(proc)

# Calculate and print data summary
pass
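
The picklesave and unpickleread stubs in the listing are left as dummies. A minimal sketch of how they might be filled in with the standard pickle module follows. It rests on assumptions that are not in the listing: the resultfile helper, the file naming scheme, and the extra proc argument to picklesave are all hypothetical, chosen so that each process writes its own file and the parent can find it again by process number.

```python
import os
import pickle
import tempfile

# Hypothetical location for the partial-result files (an assumption)
RESULTDIR = tempfile.gettempdir()

def resultfile(proc):
    'Name of the file holding the partial results of process number proc'
    return os.path.join(RESULTDIR, 'testfork_partial_%i.pickle' % proc)

def picklesave(proc, results):
    'Save the partial results of process number proc to its own file'
    with open(resultfile(proc), 'wb') as f:
        pickle.dump(results, f)

def unpickleread(proc):
    'Read the partial results previously saved by process number proc'
    with open(resultfile(proc), 'rb') as f:
        return pickle.load(f)

# Round trip within a single process, then tidy up
picklesave(1, [1, 5, 9])
print(unpickleread(1))   # prints [1, 5, 9]
os.remove(resultfile(1))
```

With this shape, dowork would need to call picklesave(proc, results), and the parent would only read files for processes 1 to maxprocs - 1 after os.wait() has reported each child as finished.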

Running the program as testfork.py under Cygwin produced:

bash$ python testfork.py
PARENT: Forking 1
PARENT: Forking 2
PARENT: Forking 3
Process: 0 completed jobs: [0, 4, 8, 12, 16, 20, 24]
PARENT: Waiting for any child
Process: 1 completed jobs: [1, 5, 9, 13, 17, 21]
PARENT: got child (5536, 256)
PARENT: Waiting for any child
Process: 2 completed jobs: [2, 6, 10, 14, 18, 22]
PARENT: got child (3724, 512)
PARENT: Waiting for any child
Process: 3 completed jobs: [3, 7, 11, 15, 19, 23]
PARENT: got child (4236, 768)
bash$
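
Each "got child" tuple is what os.wait() returns: the child's pid followed by its wait status. The status is an encoded value, not the exit code itself, which is why children that called sys.exit(1), sys.exit(2) and sys.exit(3) show up as 256, 512 and 768. A small sketch, assuming a POSIX platform, that decodes the three statuses copied from the run above:

```python
import os

# Wait statuses copied from the run above (pids omitted)
statuses = [256, 512, 768]

for status in statuses:
    if os.WIFEXITED(status):
        # The child terminated normally, so an exit code is available
        print('child exited with code %i' % os.WEXITSTATUS(status))
    else:
        print('child did not exit normally, status %i' % status)
```

Because every child exits with its own process number, the decoded code tells the parent which share of the work has just finished.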

Notice how each process gets a fair portion of the jobs (assuming all jobs need the same resources).
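
The count % maxprocs == proc test deals the jobs out round-robin, so no two shares differ in size by more than one job. When the jobs are held in a list, the same split can be written as an extended slice. The sketch below illustrates this; the share function is a hypothetical helper and not part of the program above.

```python
maxprocs = 4
jobarray = list(range(25))

def share(proc, maxprocs=maxprocs, jobarray=jobarray):
    'Jobs that process number proc would pick up (hypothetical helper)'
    return jobarray[proc::maxprocs]

for proc in range(maxprocs):
    print('Process: %i would get jobs: %s' % (proc, share(proc)))

# Every job is handed to exactly one process
alljobs = sorted(job for proc in range(maxprocs) for job in share(proc))
print(alljobs == jobarray)   # prints True
```

Process 0 gets seven jobs and the others six each, matching the output of the run above.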

3 comments:

lorg, Mon Oct 19, 12:01:00 am
Why not use the subprocess module, or something else which is platform independent/more Pythonic?

If I were in need of such a tool, I'd first look at subprocess, then pp (parallel python), then rpyc.

I'm not "just saying that", I'm really curious as to why you prefer os.fork().

Jesse, Mon Oct 19, 02:32:00 am
Just to echo the previous poster, why not just use the multiprocessing module in 2.6?

Paddy3118, Mon Oct 19, 07:59:00 am
Hi lorg, Jesse,
I do know the alternatives you mention exist, but was intrigued by the idea of Unix making a copy of the full current state of the parent in each child process. This means that the children only have to export their results back to the parent rather than also having to import shared state. But really, I wanted to dabble :-)

Would something similar be easier in multiprocessing/subprocess/pp/...? (As you can see, I'm trying to establish some kind of design pattern here.)

Thanks for your interest, Paddy.