Thursday, October 11, 2007

Multi-Processing Design Pattern

I am currently off work with a cold, nose dripping onto my keyboard, in front of a pretty old laptop, with no real way to test the following.



For the Wide Finder project, one way to tackle the task would be to split the large file, fork a subprocess to process each section in parallel, then join all the outputs together. Fork/join is Verilog-speak. You might also call it mapping file processors onto each file section in parallel, then reducing their output.
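
To make the fork/join idea concrete, here is a minimal sketch in bash. It is only an illustration under assumptions: `process_chunk` is a hypothetical stand-in for the real per-section worker, the two chunk files are made-up data, and plain `awk` is used for the final sum.

```shell
#!/bin/bash
# Fork/join sketch: one background job per chunk, then wait for all of them.
# process_chunk is a hypothetical worker; here it just counts lines.
process_chunk() {
  wc -l < "$1"
}

# Made-up input chunks in a scratch directory
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/chunk1"
printf 'd\ne\n'    > "$workdir/chunk2"

# Fork: the trailing & puts each worker in the background
for f in "$workdir"/chunk?; do
  process_chunk "$f" > "$f.out" &
done

# Join: block until every background job has finished
wait

# Reduce: add up the per-chunk results
total=$(cat "$workdir"/chunk?.out | awk '{s+=$1} END{print s}')
echo "total lines: $total"
rm -rf "$workdir"
```

The trailing `&` is what makes the workers run concurrently; without it the loop would process the chunks one after another and `wait` would have nothing to wait for.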

I have the following bash shell script that I need to test on a real machine:

    #!/bin/bash -x

    ##
    ## Multi-subprocessing of Wide Finder file
    ##

    # Number of sub processes
    #subprocs="$1"
    subprocs="4"
    # Input file
    #infile="$2"
    infile="o10k.ap"

    subprocscript="clv5.sh"
    splitfileprefix=_multi_proc_tmp

    rm -f ${splitfileprefix}*
    size=$(stat -c%s "$infile")
    # Chunk size in bytes, with 1% headroom because split only breaks on whole lines
    splitsize=$(gawk -v x="$size" -v y="$subprocs" 'BEGIN{printf "%i", (x/y)*1.01}')

    ## Split
    split --line-bytes=$splitsize "$infile" $splitfileprefix

    ## Fork: run one background job per split file
    for f in ${splitfileprefix}*; do
      $subprocscript "$f" > "$f.sub" &
    done
    jobs
    wait

    ## Join
    gawk '{x[$2]+=$1}END{for (n in x){print x[n], n}}' ${splitfileprefix}*.sub \
        | sort -n \
        | tail
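
The join step can be tried on its own with a couple of made-up partial results. This sketch assumes, as the gawk one-liner above implies, that each sub-process writes lines of the form "count key"; the file names and values here are invented for illustration, and plain `awk` is used in place of gawk.

```shell
#!/bin/bash
# Join-step sketch: merge per-chunk "count key" lines into overall totals.
workdir=$(mktemp -d)
printf '3 /a\n1 /b\n' > "$workdir/part1.sub"   # made-up output of worker 1
printf '2 /a\n7 /c\n' > "$workdir/part2.sub"   # made-up output of worker 2

# Sum the counts per key across all partial files, then rank numerically
merged=$(awk '{x[$2]+=$1} END{for (n in x) print x[n], n}' "$workdir"/*.sub | sort -n)
echo "$merged"
rm -rf "$workdir"
```

Because `/a` appears in both partial files, its counts are added together (3 + 2), which is exactly the reduction needed when a key straddles more than one section of the split file.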
