16 CPUs running Red Hat Linux. I have my process runner script
(http://paddy3118.blogspot.com/search?q=process) that I use with a list
of 100,000 simulations to run overnight, and I have experimented with
running different numbers of simulations in parallel to get good
throughput.
The process runner is good, and I still use it, but I also wanted
something for more general command-line use. For example, I have around
750 directories, each containing files relating to one test. I organize
the directories to have regular names, and I am used to using shell for
loops to process the dataset.
For example, to gzip all the */transcript files I would do:
foreach f (*/transcript)
gzip $f
end
(Yep, I know it's cshell, but that is the 'standard' at work.)
This isn't making much use of the processing power at hand, so I went
googling for a lightweight way to run such tasks in parallel.
I found two methods. One, using make and its -j option to run several
jobs in parallel, would need a makefile to be created in each case,
which is onerous. The other, which was new to me rather than a
forgotten feature, is that xargs has a -P option that does a similar
thing: in combination with other options it will run up to a given
number of jobs in parallel:
/bin/ls */transcript \
| xargs -n1 -iX -P16 gzip X
The above lists all the files and pipes them to xargs, which takes them
one at a time (-n1) and forms a job by substituting each name for every
occurrence of the placeholder X (-iX) in the arguments that follow its
options. A maximum of 16 jobs (-P16) are then run at any one time.
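As an aside, current GNU xargs deprecates the lower-case -i in favour
of -I, which takes the placeholder as a separate argument and implies
one input line per command, so (if I have read the man page right) the
same job can also be written:
/bin/ls */transcript \
| xargs -I X -P16 gzip X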
When I want to simulate all 750 tests and create a coverage listing of
each test, I use xargs to call the bash shell and write something like:
/bin/ls -d * \
| xargs -n1 -iX -P16 bash \
-c 'cd X && vsim tb -c -coverage -do "coverage save -onexit -code t coverX.ucdb;do ../run.do" 2>&1 >/dev/null && vcover report -file coverX.rpt coverX.ucdb 2>&1 >/dev/null'
Yep, I do create such long monstrosities at times. It's because I am
exploring new avenues at the moment, and things are likely to change,
plus I need to know, and change, the details quite frequently. Although
I do make a note of these long one-liners (and keep a lot of history
in my shells), I tend to only wrap things in a (bash) script when they
become stable.
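When that coverage flow does settle, the wrapper will be something
along these lines (just a sketch; run_test.sh is a name I have made up
here, and the commands are the ones from the one-liner above):
#!/bin/bash
# run_test.sh - simulate one test directory and report its coverage.
# Takes the directory name as its single argument.
# Redirections are kept exactly as in the one-liner above.
set -e
t="$1"
cd "$t"
vsim tb -c -coverage \
    -do "coverage save -onexit -code t cover${t}.ucdb;do ../run.do" \
    2>&1 >/dev/null
vcover report -file "cover${t}.rpt" "cover${t}.ucdb" 2>&1 >/dev/null
It would then be driven in much the same way:
/bin/ls -d * \
| xargs -n1 -iX -P16 bash run_test.sh X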
Note:
Following the links from the Wikipedia page, it seems that the -P
option to xargs is not mandated by The Open Group
(http://www.opengroup.org/onlinepubs/9699919799/utilities/xargs.html)
and does not appear in Sun's version of xargs
(http://docs.sun.com/app/docs/doc/816-5165/xargs-1?a=view). I work in a
mixed Solaris/Linux environment and have been using Unix for many
years, which makes it hard to keep up with the GNU extensions to what
were familiar commands. But such is progress. Bring it on!
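For the Solaris boxes, the nearest lightweight substitute I know of is
plain shell backgrounding, something like the Bourne-shell fragment
below, although it launches every job at once rather than capping the
count at 16:
# start all the gzips in the background, then wait for them to finish
for f in */transcript
do
    gzip "$f" &
done
wait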
Hi:
I've got a cluster with multicore nodes and I had the need for easy parallelism too. I've developed a little tool in Python that is capable of splitting a job into pieces, running multiple subjobs and joining the outputs. I haven't released it yet, but maybe it could be of some interest to you. You can take a look at it at http://bioinf.comav.upv.es/svn/psubprocess/trunk/src/
jblanca _at__ btc dot upv dot es
Hi jblanca,
I took a look at psubprocess. Although it doesn't fit my needs, I wish you well with it.
I do have my process_runner script, we have LSF available, and now xargs -P too. I feel I am filling rapidly diminishing holes with parallelism solutions :-)
P.S.
Reddit just pointed me in the direction of http://vxargs.sourceforge.net/, which is aimed at multi-machine parallelism of arbitrary commands.
Both xargs and vxargs deal badly with filenames containing special characters. To see the problem try this:
touch important_file
touch 'not important_file'
ls not* | xargs rm
You may consider Parallel https://savannah.nongnu.org/projects/parallel/ instead.
Thanks, Ole, for your comment. Being "old school" Unix, I am used to only creating Unix-friendly file names, but I take your point.
I'll take a look at Parallel.
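For anyone else bitten by the filename problem, the GNU versions of
find and xargs can pass the names NUL-delimited, which avoids the word
splitting that catches 'not important_file' above. For my gzip example
it would be something like (untested as I type):
find . -mindepth 2 -maxdepth 2 -name transcript -print0 \
| xargs -0 -n1 -P16 gzip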