Saturday, October 13, 2007

What's in a name?

I wrote a tool for work that orders VHDL files for compilation. VHDL is one of those languages in which (annoyingly) everything that file a.vhd depends on must be compiled before a.vhd itself. After scanning all the source files I extracted a dependency list, which I stored as a dictionary mapping each source file name to the source files it depended on.

I then needed to sort the files so that no file was compiled before any file it depended on.



Now, I did google for such a pre-written routine using phrases that included dependency, dependancy :-), depends, etc., along with sort and maybe Python. I got nothing! Eventually I wrote my own, which worked very well, and I was proud of it.



A couple of months later, whilst reading a newsgroup reply on something or other, someone mentioned a topological sort, so I went and looked that up. If only I had known the term 'topological' I could have found a ready-made solution such as this (which has a good explanation too).
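For the record, a topological sort over that kind of dependency dictionary only takes a few lines of Python. This is a minimal sketch using Kahn's algorithm, not the tool I wrote; the file names are made up:

```python
def topo_sort(deps):
    """Order items so that each appears after everything it depends on.

    deps maps each file to the set of files it depends on.
    Raises ValueError on a dependency cycle (Kahn's algorithm).
    """
    # Work on copies; track the unmet dependencies of each file.
    pending = {f: set(d) for f, d in deps.items()}
    # Files mentioned only as dependencies have no dependencies themselves.
    for d in deps.values():
        for f in d:
            pending.setdefault(f, set())
    order = []
    ready = [f for f, d in pending.items() if not d]
    while ready:
        f = ready.pop()
        order.append(f)
        # Releasing f may make other files ready to compile.
        for other, d in pending.items():
            if f in d:
                d.remove(f)
                if not d:
                    ready.append(other)
    if len(order) != len(pending):
        raise ValueError('dependency cycle detected')
    return order
```

So `topo_sort({'a.vhd': {'pkg.vhd'}, 'b.vhd': {'a.vhd'}})` puts pkg.vhd first, then a.vhd, then b.vhd.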



So, "What's in a name"?

Quite a lot actually.



:-) Paddy.



P.S. There is also the tsort Unix utility.

Thursday, October 11, 2007

Multi-Processing Design Pattern

I am currently off work with a cold, nose dripping onto my keyboard, in front of a pretty old laptop, with no real way to test the following.



For the Wide Finder project, one way to tackle the task would be to split the large file, fork a sub-process to handle each section in parallel, then join all the outputs together. (Fork/join is Verilog speak; you might also call it mapping file processors onto each file section in parallel, then reducing their output.)

I have the following bash script, which I still need to test on a real machine:

    #!/bin/bash -x

    ##
    ## Multi-subprocessing of Wide Finder file
    ##

    # Number of sub processes
    #subprocs="$1"
    subprocs="4"
    # Input file
    #infile="$2"
    infile="o10k.ap"

    subprocscript="clv5.sh"
    splitfileprefix=_multi_proc_tmp

    rm -f ${splitfileprefix}*
    size=$(stat -c%s "$infile")
    splitsize=`gawk -v x=$size -v y=$subprocs 'BEGIN{printf "%i", (x/y)*1.01}'`

    ## Split
    split --line-bytes=$splitsize "$infile" $splitfileprefix

    for f in ${splitfileprefix}*; do
      # Background each sub-process so the sections really run in parallel
      $subprocscript $f > $f.sub &
    done
    jobs
    wait

    ## Join
    gawk '{x[$2]+=$1}END{for (n in x){print x[n], n}}' ${splitfileprefix}*.sub \
        | sort -n \
        | tail
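The same split/fork/join shape can be sketched in Python with the multiprocessing module. This is only a sketch, not a tested equivalent of the script above: it maps a counting function over chunks of lines in parallel sub-processes, then merges the partial counts, mirroring the split, sub-process, and join steps:

```python
from collections import Counter
from multiprocessing import Pool

def count_field7(lines):
    """Map step: tally field 7 of each line, like the gawk sub-script."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 7:
            counts[fields[6]] += 1
    return counts

def parallel_count(lines, subprocs=4):
    """Split the lines into subprocs chunks, count each chunk in its own
    sub-process, then join the partial counts (split / fork / join)."""
    chunksize = max(1, (len(lines) + subprocs - 1) // subprocs)
    chunks = [lines[i:i + chunksize] for i in range(0, len(lines), chunksize)]
    with Pool(len(chunks)) as pool:
        partials = pool.map(count_field7, chunks)
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    return totals
```

The ten most common entries then come from `totals.most_common(10)`, matching the final `sort -n | tail`.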

Wide Finder on the command line


Having followed articles on Wide Finder for the past week, I thought the command-line pipe version here gave fast results straight away, but could be improved.

Unfortunately, my main computer at home, my 17" laptop, had to go for repairs, so I decided to start using an old Toshiba Portege 7200 that runs XP but has a tiny disk.
The temporary laptop doesn't have the resources to run Cygwin, the unix-on-windows toolset I am comfortable with, so I installed GNU tools for Windows from here and bash for Windows from here, then set about re-running and improving the command-line example.



   ##
   ## clv1.sh
   ##

   # Original 'command line tools' version from
   # http://www.dalkescientific.com/writings/diary/archive/2007/10/10/other_wide_finder_implementations.html
   time (
       grep -E 'GET /ongoing/When/[0-9]{3}x/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+ ' o1000k.ap \
       | gawk '{print $7}' \
       | sort \
       | uniq -c \
       | sort -n \
       | tail -10\
       )

   ##
   ## clv2.sh
   ##

   # include grep functionality in gawk call
   time (
       gawk --re-interval '/GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{print $7}' o1000k.ap \
       | sort \
       | uniq -c \
       | sort -n \
       | tail -10\
       )

   ##
   ## clv3.sh
   ##

   # uniq -c function moved into gawk
   time (
       gawk --re-interval '/GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{x[$7]++}END{for (n in x){print x[n], n}}' o1000k.ap \
       | sort -n \
       | tail -10\
       )

   ##
   ## clv4.awk
   ##

   /GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{
       field2count[$7]++
       }
   END{
       # asort requires no duplicate array values
       # Accumulate fields with same count 
       for (field in field2count){
           count = field2count[field]
           count2fields[count+0] = count2fields[count] count " " field "\n";
       }
       # invert and force value type to be numeric
       for (count in count2fields){
            allfields2count[count2fields[count]] = count+0
       }
       n = asort(allfields2count,orderedcounts)
       for (i=n-10; i<=n; i++){
           printf "%s", count2fields[orderedcounts[i]]
       }
   }


   ##
   ## clv4.sh
   ##

   # All but final tail moved into gawk
   time (
       gawk --re-interval -f clv4.awk o1000k.ap \
       | tail \
       )

   ##
   ## clv5.awk
   ##

   /GET \/ongoing\/When\/[0-9][0-9][0-9]x\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]\/[^ .]+ /{
       field2count[$7]++
       }
   END{
       # asort requires no duplicate array values
       # Accumulate fields with same count 
       for (field in field2count){
           count = field2count[field]
           count2fields[count+0] = count2fields[count] count " " field "\n";
       }
       # invert and force value type to be numeric
       for (count in count2fields){
            allfields2count[count2fields[count]] = count+0
       }
       n = asort(allfields2count,orderedcounts)
       for (i=n-10; i<=n; i++){
           printf "%s", count2fields[orderedcounts[i]]
       }
   }


   ##
   ## clv5.sh
   ##

   # All but final tail moved into gawk
   # Change regexp to not use interval expressions
   time (
       gawk -f clv5.awk o1000k.ap \
       | tail \
       )


Here are the timings for each version (best of three runs):


 clv1.sh 43.3s
 clv2.sh 42.1s
 clv3.sh 32.5s
 clv4.sh 32.7s
 clv5.sh 32.5s

Conclusion


The biggest gain came when I moved the frequency counting into gawk in clv3.sh, which is the kind of thing I would naturally do. It saves a third of the time, which would make it the fastest in the table here, with a proportionally scaled run time of 0.62 seconds.
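That frequency-counting trick, replacing `sort | uniq -c` over every line with a hash table keyed on the field, is the same idea a Python version would lean on. A minimal sketch:

```python
from collections import Counter

def top_counts(items, n=10):
    """One pass to count, then sort only the distinct keys:
    the hash-table equivalent of `sort | uniq -c | sort -n | tail`."""
    return Counter(items).most_common(n)
```

Sorting only the distinct keys instead of every input line is where the third of the run time goes.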

STOP Press!


I decided to compare run times with Fredrik Lundh's work. On my initial run I got times above 32s for everything, with wf-6.py taking over 90s.
The memory mapping seemed to be sticky, as a subsequent run gave the following timings:

 wf-1-gala.py 30.4s
 wf-2.py 8.2s
 wf-3.py 9.2s
 wf-4.py 9.5s
 wf-5.py 10.0s
 wf-6.py 6.9s

This is much faster. I decided to test the persistence of the memory mapping of the file by re-running the command-line tests, which gave these improved results:

 clv1.sh 39.5s
 clv2.sh 43.0s
 clv3.sh 17.0s
 clv4.sh 14.8s
 clv5.sh 14.8s


It seems that memory mapping a file decreases run time, but on my slow laptop, if the file is only going to be used once, the time taken to make the file memory resident may kill any savings.

If this task were the only thing I had to do in a cron job on the hour, every hour, then despite knowing Python and Perl I would probably go for an awk solution, and I would probably have stopped at clv3.sh.

Tuesday, October 02, 2007

Indenting python for blog comments

Note: Script updated 5 Oct 2007

Normally I use vim to turn my Python code into nicely formatted, syntax-coloured examples for my posts, but blog comment fields can swallow leading spaces, leaving any posted Python code without indentation.



Most blog comment fields only accept a small set of HTML tags (if any), so I cannot use the vim output.


I have written blogger.py to turn leading spaces into HTML non-breaking-space entities, which seems to work on Blogger and WordPress.


Here is the code:




'''
blogspace.py
Turns leading spaces into HTML &nbsp; tokens which show
as correct indentation on Blogger comment fields
(and maybe other blogs).

Donald. 'Paddy' McCarthy Sept,Oct 2011
'''
import fileinput, re, cgi

def escape(lineiterator=fileinput.input,
           cgi_escape=True, tag_with_pre=True, tag_with_br=False):
    output = []
    if tag_with_pre: output.append('<pre>\n')
    for line in lineiterator():
        if cgi_escape: line = cgi.escape(line)
        line = re.sub(r'^(\s*)\s',
                      lambda m: '&nbsp;'*m.end(),
                      line.rstrip())
        output.append(line)
        if tag_with_br:
            output.append('<br>\n')
        else:
            output.append('\n')
    if tag_with_pre: output.append('</pre>\n')
    return ''.join(output)

if __name__ == '__main__':
    print escape()



- Paddy.

(Also posted here and here.)