I wrote a tool for work that orders VHDL files for compilation. VHDL is one of those languages in which, (annoyingly), everything that file a.vhd depends on must be compiled before a.vhd itself. After scanning all the source files I extracted a dependency list, which I stored as a dictionary mapping each source file name to the source files it depends on.
I then needed to sort the files so that no file was compiled before any file it depended on.
Now, I did google for such a pre-written routine using phrases that included dependency, dependancy :-), depends etc., plus sort and maybe Python. I got nothing! Eventually I wrote my own, which worked very well, and I was proud of it.
A couple of months later, whilst reading a newsgroup reply on something or other, someone mentioned a topological sort, so I went and looked that up. If only I had known the term 'topological' I could have found a ready-made solution such as this (which has a good explanation too).
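For the record, here is a minimal sketch of the kind of topological sort I needed, driven by a dependency dictionary like the one extracted from the VHDL sources (the file names and dependencies below are invented for illustration):

# Depth-first topological sort: every file appears after all the files it
# depends on, so the result is a valid compilation order.
deps = {'top.vhd':  ['alu.vhd', 'regs.vhd'],
        'alu.vhd':  ['pkg.vhd'],
        'regs.vhd': ['pkg.vhd'],
        'pkg.vhd':  []}

def toposort(deps):
    ordered, done = [], set()
    def visit(name, seen):
        if name in done:
            return
        if name in seen:
            raise ValueError('circular dependency involving ' + name)
        for dep in deps.get(name, []):
            visit(dep, seen | set([name]))
        done.add(name)
        ordered.append(name)
    for name in deps:
        visit(name, set())
    return ordered

print(toposort(deps))  # e.g. ['pkg.vhd', 'alu.vhd', 'regs.vhd', 'top.vhd']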
So, "What's in a name"?
Quite a lot actually.
:-) Paddy.
P.S. There is also the tsort Unix utility.
Saturday, October 13, 2007
Thursday, October 11, 2007
Multi-Processing Design Pattern
I am currently off work with a cold, nose dripping onto my keyboard, in front of a pretty old laptop, with no real way to test the following.
For the Wide Finder project, a way to realize the task would be to split the large file, fork sub-processes to process each section in parallel, then join all the outputs together. Fork/join is Verilog speak. You might also call it mapping file processors onto each file section in parallel, then reducing their output.
I have the following bash shell script that I need to test on a real machine:
#!/bin/bash -x
##
## Multi-subprocessing of Wide Finder file
##
# Number of sub processes
#subprocs="$1"
subprocs="4"
# Input file
#infile="$2"
infile="o10k.ap"
subprocscript="clv5.sh"
splitfileprefix=_multi_proc_tmp
rm -f ${splitfileprefix}*
size=$(stat -c%s "$infile")
splitsize=`gawk -v x=$size -v y=$subprocs 'BEGIN{printf "%i", (x/y)*1.01}'`
## Split
split --line-bytes=$splitsize "$infile" $splitfileprefix
for f in ${splitfileprefix}*; do
  $subprocscript $f > $f.sub &
done
jobs
wait
## Join
gawk '{x[$2]+=$1}END{for (n in x){print x[n], n}}' ${splitfileprefix}*.sub \
| sort -n \
| tail
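As a rough comparison only, here is a Python sketch of the same fork/join (map/reduce) idea using the standard multiprocessing module; the chunking is done naively on whole lines, and the file name, worker count and regexp are assumptions copied from the scripts above:

# Fork/join in Python: split the log into line-aligned chunks, count the
# matching field in each chunk in a separate process (map), then merge the
# per-chunk counts and report the ten most frequent (join/reduce).
import re
from collections import Counter
from multiprocessing import Pool

HIT = re.compile(r'GET /ongoing/When/\d{3}x/\d{4}/\d{2}/\d{2}/[^ .]+ ')

def count_chunk(lines):
    counts = Counter()
    for line in lines:
        if HIT.search(line):
            counts[line.split()[6]] += 1   # awk's $7
    return counts

def chunks(path, nchunks=4):
    with open(path) as f:
        lines = f.readlines()
    step = len(lines) // nchunks + 1
    return [lines[i:i + step] for i in range(0, len(lines), step)]

if __name__ == '__main__':
    pool = Pool(4)                                     # fork
    partials = pool.map(count_chunk, chunks('o10k.ap'))
    totals = sum(partials, Counter())                  # join
    for field, count in totals.most_common(10):
        print('%s %s' % (count, field))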
Wide Finder on the command line
Having followed articles on Wide Finder for the past week, I thought the command-line pipe version here gave fast results straight away but could be improved.
Unfortunately, my main computer at home, my 17" laptop, had to go for repairs, and so I decided to start using an old Toshiba Portege 7200 that runs XP but has a tiny disk.
The temporary laptop doesn't have the resources to run Cygwin as its unix-on-windows toolset (which I am comfortable with), so I installed GNU tools for Windows from here and bash for Windows from here, then set about re-running and improving the command-line example.
##
## clv1.sh
##
# Original 'command line tools version from
# http://www.dalkescientific.com/writings/diary/archive/2007/10/10/other_wide_finder_implementations.html
time (
grep -E 'GET /ongoing/When/[0-9]{3}x/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+ ' o1000k.ap \
| gawk '{print $7}' \
| sort \
| uniq -c \
| sort -n \
| tail -10\
)
##
## clv2.sh
##
# include grep functionality in gawk call
time (
gawk --re-interval '/GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{print $7}' o1000k.ap \
| sort \
| uniq -c \
| sort -n \
| tail -10\
)
##
## clv3.sh
##
# uniq -c function moved into gawk
time (
gawk --re-interval '/GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{x[$7]++}END{for (n in x){print x[n], n}}' o1000k.ap \
| sort -n \
| tail -10\
)
##
## clv4.awk
##
/GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{
    field2count[$7]++
}
END{
    # asort requires no duplicate array values
    # Accumulate fields with same count
    for (field in field2count){
        count = field2count[field]
        count2fields[count+0] = count2fields[count] count " " field "\n";
    }
    # invert and force value type to be numeric
    for (count in count2fields){
        allfields2count[count2fields[count]] = count+0
    }
    n = asort(allfields2count,orderedcounts)
    for (i=n-10; i<=n; i++){
        printf "%s", count2fields[orderedcounts[i]]
    }
}
##
## clv4.sh
##
# All but final tail moved into gawk
time (
gawk --re-interval -f clv4.awk o1000k.ap \
| tail \
)
##
## clv5.awk
##
/GET \/ongoing\/When\/[0-9][0-9][0-9]x\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]\/[^ .]+ /{
    field2count[$7]++
}
END{
    # asort requires no duplicate array values
    # Accumulate fields with same count
    for (field in field2count){
        count = field2count[field]
        count2fields[count+0] = count2fields[count] count " " field "\n";
    }
    # invert and force value type to be numeric
    for (count in count2fields){
        allfields2count[count2fields[count]] = count+0
    }
    n = asort(allfields2count,orderedcounts)
    for (i=n-10; i<=n; i++){
        printf "%s", count2fields[orderedcounts[i]]
    }
}
##
## clv5.sh
##
# All but final tail moved into gawk
# Change regexp to not use interval expressions
time (
gawk -f clv5.awk o1000k.ap \
| tail \
)
Here are the timings for each version (best of three runs):
clv1.sh 43.3s
clv2.sh 42.1s
clv3.sh 32.5s
clv4.sh 32.7s
clv5.sh 32.5s
Conclusion
The biggest gain came when I used gawk to do the frequency counting in clv3.sh, which is the kind of thing I would naturally do. It saves a third of the time, which would make it the fastest in the table here, with a proportionally scaled run time of 0.62 seconds.
STOP Press!
I decided to compare run times with Fredrik Lundh's work. On my initial run I got times above 32s for everything, with wf-6.py taking over 90s.
The memory mapping seemed to be sticky, as a subsequent run gave the following timings:
wf-1-gala.py 30.4s
wf-2.py 8.2s
wf-3.py 9.2s
wf-4.py 9.5s
wf-5.py 10.0s
wf-6.py 6.9s
This is much faster. I decided to test the persistence of the memory mapping of the file by re-running the command-line tests, which gave improved results of:
clv1.sh 39.5s
clv2.sh 43.0s
clv3.sh 17.0s
clv4.sh 14.8s
clv5.sh 14.8s
It seems that memory mapping a file decreases run time, but on my slow laptop, if the file is only to be used once, then the time to make the file memory resident may kill any time savings.
If this task were the only thing I had to do in a cron job on the hour, every hour, then, despite knowing Python and Perl, I would probably go for an awk solution. I would probably have stopped at clv3.sh.
Tuesday, October 02, 2007
Indenting python for blog comments
Note: Script updated 5 Oct 2007
Normally I use vim to turn my Python code into nicely formatted and syntax-coloured examples for my posts, but blog comment fields can swallow leading spaces, leaving any posted Python code without indentation.
Most blog comment fields only accept a small set of HTML tags (if any), so I cannot use the vim output.
I have written blogger.py to turn leading spaces into HTML &nbsp; characters, which seems to work on Blogger and WordPress.
Here is the code:
'''
blogspace.py
Turns leading spaces into HTML &nbsp; tokens, which show
as correct indentation on Blogger comment fields
(and maybe other blogs).
Donald. 'Paddy' McCarthy Sept,Oct 2011
'''
import fileinput, re, cgi

def escape(lineiterator=fileinput.input,
           cgi_escape=True, tag_with_pre=True, tag_with_br=False):
    output = []
    if tag_with_pre: output.append('<pre>\n')
    for line in lineiterator():
        if cgi_escape: line = cgi.escape(line)
        line = re.sub(r'^(\s*)\s',
                      lambda m: '&nbsp;'*m.end(),
                      line.rstrip())
        output.append(line)
        if tag_with_br:
            output.append('<br>\n')
        else:
            output.append('\n')
    if tag_with_pre: output.append('</pre>\n')
    return ''.join(output)

if __name__ == '__main__':
    print escape()
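As a quick hypothetical usage example (not part of the script itself), escape() can also be fed lines directly instead of reading files via fileinput:

# Pass a callable returning lines; leading spaces come back as &nbsp; entities.
sample = ['for i in range(3):\n',
          '    print "line", i\n']
print escape(lambda: sample, tag_with_pre=False)
# for i in range(3):
# &nbsp;&nbsp;&nbsp;&nbsp;print "line", i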
- Paddy.
( Also posted here and here)