Mainly Tech projects on Python and Electronic Design Automation.
Wednesday, April 29, 2009
Fight for the community you want
Someone took offence at a puerile and sexist presentation given at what is supposed to be a technical event in the Rails community. Good on them. The community is what you make of it.
"Using sex to sell spanners" doesn't make me think they are good spanners.
- Paddy.
"Using sex to sell spanners" doesn't make me think they are good spanners.
- Paddy.
Tuesday, April 21, 2009
Monitoring a linux process as it reads files
I am processing 20 thousand files with some proprietary software and
needed to monitor how far it got in reading the files. In my own Python
version of the utility the reading of data was ten times faster than
the subsequent processing and i wanted to find out if this proprietary
solution, which was havinfg performance problems , was equally spending
most of its time reading data.
The proprietary program took a list of 20000 files to process as its
first argument and I remembered that on Linux, the /proc
directory had info on running processes and sure enough , the
/proc/<process id>/fd directory had info on all the file
descriptors currently open by the process as links. So by opening the
list of files to in my editor and searching within it for the file name
shown on one of the file descriptors, I could gauge how many
files been read so far.
I decided to automate the checking and wrote a shell script using
cat/fgrep/gawk/... that then told me what line in the list of files to
process the program was currently at.
Now I've had time to refine things to use mainly python but to
demonstrate its use I also have to generate a test environment
This script just holds each file open for reading for twenty seconds
before closing the file.
And here is a python script to monitor that fd directories
I watched the monitor output over the next couple of hours and found
out when file reading ended and processing of read data started.
END.
needed to monitor how far it got in reading the files. In my own Python
version of the utility the reading of data was ten times faster than
the subsequent processing and i wanted to find out if this proprietary
solution, which was havinfg performance problems , was equally spending
most of its time reading data.
The proprietary program took a list of 20000 files to process as its
first argument and I remembered that on Linux, the /proc
directory had info on running processes and sure enough , the
/proc/<process id>/fd directory had info on all the file
descriptors currently open by the process as links. So by opening the
list of files to in my editor and searching within it for the file name
shown on one of the file descriptors, I could gauge how many
files been read so far.
I decided to automate the checking and wrote a shell script using
cat/fgrep/gawk/... that then told me what line in the list of files to
process the program was currently at.
Now I've had time to refine things to use mainly python but to
demonstrate its use I also have to generate a test environment
First create some test files to process
style="font-family: monospace;">bash$ style="font-weight: bold;">mkdir -p /tmp/test
style="font-family: monospace;">
bash$ style="font-weight: bold;">for ((i=0; i < 100; i++))
do touch /tmp/test/file$i ;done
style="font-family: monospace;">
bash$ style="font-weight: bold;">/bin/ls /tmp/test/file* >
/tmp/test/all_files.lst
style="font-family: monospace;">
bash$ style="font-weight: bold;">head /tmp/test/all_files.lst
style="font-family: monospace;">
/tmp/test/file0
style="font-family: monospace;">
/tmp/test/file1
style="font-family: monospace;">
/tmp/test/file10
style="font-family: monospace;">
/tmp/test/file11
style="font-family: monospace;">
/tmp/test/file12
style="font-family: monospace;">
/tmp/test/file13
style="font-family: monospace;">
/tmp/test/file14
style="font-family: monospace;">
/tmp/test/file15
style="font-family: monospace;">
/tmp/test/file16
style="font-family: monospace;">
/tmp/test/file17
style="font-family: monospace;">
bash$
style="font-family: monospace;">
bash$ style="font-weight: bold;">for ((i=0; i < 100; i++))
do touch /tmp/test/file$i ;done
style="font-family: monospace;">
bash$ style="font-weight: bold;">/bin/ls /tmp/test/file* >
/tmp/test/all_files.lst
style="font-family: monospace;">
bash$ style="font-weight: bold;">head /tmp/test/all_files.lst
style="font-family: monospace;">
/tmp/test/file0
style="font-family: monospace;">
/tmp/test/file1
style="font-family: monospace;">
/tmp/test/file10
style="font-family: monospace;">
/tmp/test/file11
style="font-family: monospace;">
/tmp/test/file12
style="font-family: monospace;">
/tmp/test/file13
style="font-family: monospace;">
/tmp/test/file14
style="font-family: monospace;">
/tmp/test/file15
style="font-family: monospace;">
/tmp/test/file16
style="font-family: monospace;">
/tmp/test/file17
style="font-family: monospace;">
bash$
Now lets create a test executable to monitor
This script just holds each file open for reading for twenty seconds
before closing the file.
style="font-family: monospace;">bash$ style="font-weight: bold;">python -c ' style="color: rgb(0, 0, 153);">import sys,time
style="font-family: monospace; color: rgb(0, 0, 153);">
for
name in file(sys.argv[1]):
style="font-family: monospace; color: rgb(0, 0, 153);">
f = file(name.strip())
style="font-family: monospace; color: rgb(0, 0, 153);">
time.sleep(45)
style="font-family: monospace; color: rgb(0, 0, 153);">
f.close()
' style="font-weight: bold;">/tmp/test/all_files.lst
&
style="font-family: monospace;">
[2] style="font-weight: bold;"> style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984
style="font-family: monospace;">
bash$
style="font-family: monospace; color: rgb(0, 0, 153);">
for
name in file(sys.argv[1]):
style="font-family: monospace; color: rgb(0, 0, 153);">
f = file(name.strip())
style="font-family: monospace; color: rgb(0, 0, 153);">
time.sleep(45)
style="font-family: monospace; color: rgb(0, 0, 153);">
f.close()
' style="font-weight: bold;">/tmp/test/all_files.lst
&
style="font-family: monospace;">
[2] style="font-weight: bold;"> style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984
style="font-family: monospace;">
bash$
here is what the fd directory looks like
style="font-family: monospace;">bash$ ls -l /proc/ style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984 style="font-family: monospace;">/fd
style="font-family: monospace;">
total 0
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 0 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 1 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 2 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 3 -> /tmp/test/all_files.lst
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 4 -> /tmp/test/file0
style="font-family: monospace;">
bash$
style="font-family: monospace;">
total 0
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 0 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 1 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 2 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 3 -> /tmp/test/all_files.lst
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 4 -> /tmp/test/file0
style="font-family: monospace;">
bash$
And here is a python script to monitor that fd directories
link number 4 periodically
style="font-family: monospace;">bash$ style="font-weight: bold;">python -c ' style="color: rgb(0, 0, 153);">import sys,time,os,datetime
style="color: rgb(0, 0, 153);">
name2index =
dict((name.strip(), index) for index,name in
enumerate(file(sys.argv[1])))
style="color: rgb(0, 0, 153);">
all = len(name2index)
style="color: rgb(0, 0, 153);">
while True:
style="color: rgb(0, 0, 153);">
path =
os.path.realpath("/proc/7984/fd/ style="font-weight: bold; color: red;">4").strip()
style="color: rgb(0, 0, 153);">
print
name2index[path],"/",all, path, datetime.datetime.now().isoformat()
style="color: rgb(0, 0, 153);">
time.sleep(30)
' /tmp/test/all_files.lst
22 / 100 /tmp/test/file29 2009-04-21T22:34:07.817750
23 / 100 /tmp/test/file3 2009-04-21T22:34:37.820750
24 / 100 /tmp/test/file30 2009-04-21T22:35:07.825750
24 / 100 /tmp/test/file30 2009-04-21T22:35:37.834750
style="color: rgb(0, 0, 153);">
name2index =
dict((name.strip(), index) for index,name in
enumerate(file(sys.argv[1])))
style="color: rgb(0, 0, 153);">
all = len(name2index)
style="color: rgb(0, 0, 153);">
while True:
style="color: rgb(0, 0, 153);">
path =
os.path.realpath("/proc/7984/fd/ style="font-weight: bold; color: red;">4").strip()
style="color: rgb(0, 0, 153);">
name2index[path],"/",all, path, datetime.datetime.now().isoformat()
style="color: rgb(0, 0, 153);">
time.sleep(30)
' /tmp/test/all_files.lst
22 / 100 /tmp/test/file29 2009-04-21T22:34:07.817750
23 / 100 /tmp/test/file3 2009-04-21T22:34:37.820750
24 / 100 /tmp/test/file30 2009-04-21T22:35:07.825750
24 / 100 /tmp/test/file30 2009-04-21T22:35:37.834750
I watched the monitor output over the next couple of hours and found
out when file reading ended and processing of read data started.
END.
Thursday, April 16, 2009
Friday, April 10, 2009
Memoization and stack use.
I was caught out today when investigating how to apply memoization to
allow computation of larger values of a recursive function.
I had thought up a new task for Rosetta Code called href="http://www.rosettacode.org/wiki/Mutual_Recursion">Mutual
Recursion, as I had remembered that some languages needed
at least the signature of a function to be defined before it
could be called, and so creating mutually recursive functions would
allow you to compare the languages in an interesting way - which is
what R.C. is all about.
For the example for implementation I used href="http://en.wikipedia.org/wiki/Hofstadter_sequence#Hofstadter_Female_and_Male_sequences"
target="_blank">Hoffstadters Female and Male
sequences, and produced the following href="http://www.rosettacode.org/wiki/Mutual_Recursion#Python">Python
definitions:
It seems that other members of R.C. liked the task and soon added many
examples in other languages.
Apart from the computation of the series for 0<=n<20, I
did try to compute F(200) but it blew the stack so I went to bed and
decided to memoize the functions another day.
Memoization came very easily to mind, so I thought that Python would
have such available as a decorator waiting to be applied. No such luck,
I would have to craft my own . I did remember correctly however that
there was some helper function that made wrapped functions
look a lot more like what had been wrapped, and so used functools.wrap:
I haven't got over my (probably infantile) lust for attaching
attributes to functions so justified memoize0 as it allows you to look
at the contents of the cache. Combined with the fact that you cannot do
a memoization decorator without applying it to a href="http://en.wikipedia.org/wiki/Fibonacci_function">Fibonacci
number function:
It is important that caches are not shared for mutually recursive
functions.
I had initial success with the Male and Female sequence and was easily
able to compute F(200) with:
But I got cocky, as F(2000) ran out of stack.
A little thought made me realise that I only had a cached results for F
and M up to around 200 and so a call of F(2000) was going to eat up
huge amounts of stack before recursing down to cached values.
With that knowledge in mind I thought I might gradually jump to higher
values of n and, indeed, the following worked:
In real-world applicatrions, you might have to have some scheme to
reduce or limit cache size if memory rather than the stack becomes an
issue, and I have a great solution that I have just completed in the
margin :-)
- Paddy.
allow computation of larger values of a recursive function.
Mutual Recursion
I had thought up a new task for Rosetta Code called href="http://www.rosettacode.org/wiki/Mutual_Recursion">Mutual
Recursion, as I had remembered that some languages needed
at least the signature of a function to be defined before it
could be called, and so creating mutually recursive functions would
allow you to compare the languages in an interesting way - which is
what R.C. is all about.
For the example for implementation I used href="http://en.wikipedia.org/wiki/Hofstadter_sequence#Hofstadter_Female_and_Male_sequences"
target="_blank">Hoffstadters Female and Male
sequences, and produced the following href="http://www.rosettacode.org/wiki/Mutual_Recursion#Python">Python
definitions:
def color="#008080">F(n): return 1 color="#804040">if n == 0 color="#804040">else n - M(F(n-1))
color="#804040">def color="#008080">M(n): return 0 color="#804040">if n == 0 color="#804040">else n - F(M(n-1))
It seems that other members of R.C. liked the task and soon added many
examples in other languages.
Apart from the computation of the series for 0<=n<20, I
did try to compute F(200) but it blew the stack so I went to bed and
decided to memoize the functions another day.
target="_blank">Wot, no memoize?
Memoization came very easily to mind, so I thought that Python would
have such available as a decorator waiting to be applied. No such luck,
I would have to craft my own . I did remember correctly however that
there was some helper function that made wrapped functions
look a lot more like what had been wrapped, and so used functools.wrap:
from functools color="#a020f0">import wraps
color="#804040">def color="#008080">memoize0(func):
' color="#ff00ff"> Adds cache as attribute to wrapped function for ease of access'
color="#a020f0">@wraps(func)
color="#804040">def color="#008080">wrapper(*args):
argsl = tuple(args)
color="#804040">if argsl color="#804040">in wrapper._cache:
color="#804040">return wrapper._cache[argsl]
color="#804040">else:
color="#804040">return wrapper._cache.setdefault(argsl, func(*args))
wrapper._cache = dict()
color="#804040">return wrapper
color="#804040">def color="#008080">memoize(func):
' color="#ff00ff"> Creates closure around locally created cache'
cache = dict()
color="#a020f0">@wraps(func)
color="#804040">def color="#008080">wrapper(*args):
argsl = tuple(args)
color="#804040">if argsl color="#804040">in cache:
color="#804040">return cache[argsl]
color="#804040">else:
color="#804040">return cache.setdefault(argsl, func(*args))
color="#804040">return wrapper
I haven't got over my (probably infantile) lust for attaching
attributes to functions so justified memoize0 as it allows you to look
at the contents of the cache. Combined with the fact that you cannot do
a memoization decorator without applying it to a href="http://en.wikipedia.org/wiki/Fibonacci_function">Fibonacci
number function:
if __name__ == ' color="#ff00ff">__main__':
color="#a020f0">@memoize0
color="#804040">def color="#008080">fib(n): return n color="#804040">if n color="#804040">in (0,1) color="#804040">else fib(n - 1) + fib(n - 2)
color="#a020f0">@memoize0
color="#804040">def color="#008080">fib2(n): color="#804040">return n color="#804040">if n color="#804040">in (0,1) color="#804040">else fib2(n - 1) + fib2(n - 2)
color="#804040">assert fib._cache color="#804040">is color="#804040">not fib2._cache
fib(50); fib2(50)
color="#804040">assert fib._cache == fib2._cache color="#804040">and (fib._cache color="#804040">is color="#804040">not fib2._cache)
It is important that caches are not shared for mutually recursive
functions.
A blown stack
I had initial success with the Male and Female sequence and was easily
able to compute F(200) with:
@ color="#008080">memoize
color="#804040">def color="#008080">F(n):
' color="#ff00ff"> Hofstadters Female Male sequences'
color="#804040">return 1 color="#804040">if n == 0 color="#804040">else n - M(F(n-1))
color="#a020f0">@memoize
color="#804040">def color="#008080">M(n):
' color="#ff00ff"> Hofstadters Female Male sequences'
color="#804040">return 0 color="#804040">if n == 0 color="#804040">else n - F(M(n-1))
color="#804040">assert F(200) == 124
But I got cocky, as F(2000) ran out of stack.
A little thought made me realise that I only had a cached results for F
and M up to around 200 and so a call of F(2000) was going to eat up
huge amounts of stack before recursing down to cached values.
With that knowledge in mind I thought I might gradually jump to higher
values of n and, indeed, the following worked:
print ([F(n) color="#804040">for n color="#804040">in range(0,20000+1,100)])
In real-world applicatrions, you might have to have some scheme to
reduce or limit cache size if memory rather than the stack becomes an
issue, and I have a great solution that I have just completed in the
margin :-)
- Paddy.
Thursday, April 02, 2009
(Ab)use of Pythons in-built data structures
I like to use Pythons in-built data structures quit a lot, and tend to
force myself to ask whether I should create my own classes,
which allows you to use meaningful names for fields and add
comments/docstrings to the datastructure but usually at the cost of
adding more lines of text.
After writing about Huffman codes and posting solutions to href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Python">Rosetta
code and href="http://paddy3118.blogspot.com/2009/03/huffman-encoding-in-python.html">my
blog:
Someone posted a href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Java">
Java solution, that was over three times as long. Instead of
just counting and comparing line counts, I took a closer
look to see what the extra code was for.
I
noticed that the Java was tidy, and that they had defined their own
classes for a HuffmanLeaf structure used when constructing a
HuffmanTree.
In my original Python, I used a nested list as
the equivalent to the HuffmanLeaf of the Java example. It worked, but I
had to introduce 'magic constants' to access the fields of the Python
leaf , which is also un-named as a structure in the program (lline 33
of
the Python code creates a list of Leaf structures):
Now I didn't want to go to the trouble of creating my own
classes, but that's were the new href="http://docs.python.org/dev/py3k/library/collections.html#collections.namedtuple">namedtuple
class factory of Python 2.6 came to the rescue.
namedtuple
will generate a subtype of the tuple class that allows the fields of
the tuple to be named and accessed via subscription, [], or as if the
fields are instance variable names.
I modified my Python program to use two named tuples:
The
class generation is succinct in lines 3 and 4, and I use the field
names for clarity for example when creating the original list of Leaves
that will form the heap in line 8
The print statements stay the same, but before you got output like this:
Now you get the clearer:
Although
you can do so much with Python lists, if each position in the list has
a fixed meaning then you might be best to use tuples, and if you should
be using tuples, then namedtuples can make your code more readable
without bloating it with long class definitions.
- Paddy.
force myself to ask whether I should create my own classes,
which allows you to use meaningful names for fields and add
comments/docstrings to the datastructure but usually at the cost of
adding more lines of text.
After writing about Huffman codes and posting solutions to href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Python">Rosetta
code and href="http://paddy3118.blogspot.com/2009/03/huffman-encoding-in-python.html">my
blog:
29 color="#804040">def color="#008080">codecreate(symbol2weights, tutor= False):
color="#804040"> 30 ''' color="#ff00ff"> Huffman encode the given dict mapping symbols to weights '''
color="#804040"> 31 global decode
color="#804040"> 32
color="#804040"> 33 heap = [ [float(wt), [[sym, []]], repr(sym)] color="#804040">for sym, wt color="#804040">in symbol2weights.iteritems() ]
color="#804040"> 34 heapify(heap)
color="#804040"> 35 if tutor: color="#804040">print " color="#ff00ff">ENCODING:", sorted(symbol2weights.iteritems())
color="#804040"> 36 while len(heap) >1:
color="#804040"> 37 lo = heappop(heap)
color="#804040"> 38 hi = heappop(heap)
color="#804040"> 39 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> COMBINING:", lo, ' color="#6a5acd">\n AND:', hi
color="#804040"> 40 color="#804040">for i color="#804040">in lo[1]: i[1].insert(0, ' color="#ff00ff">0')
color="#804040"> 41 color="#804040">for i color="#804040">in hi[1]: i[1].insert(0, ' color="#ff00ff">1')
color="#804040"> 42 lohi = [ lo[0] + hi[0] ] + [lo[1] + hi[1]]
color="#804040"> 43 lohi.append(' color="#ff00ff">(%s if nextbit() else %s)' % (hi[2], lo[2]))
color="#804040"> 44 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> PRODUCING:", lohi, ' color="#6a5acd">\n'
color="#804040"> 45 heappush(heap, lohi)
color="#804040"> 46 wt, codes, decoder = heappop(heap)
color="#804040"> 47 decode = eval(' color="#ff00ff">lambda :' + decoder, globals())
color="#804040"> 48 decode.__doc__ = decoder
color="#804040"> 49 for i color="#804040">in codes: i[1] = ''.join(i[1])
color="#804040"> 50 #for i in codes: i[::] = i[:2]
color="#804040"> 51 return sorted(codes, key= color="#804040">lambda x: (len(x[-1]), x))
color="#804040"> 52
Someone posted a href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Java">
Java solution, that was over three times as long. Instead of
just counting and comparing line counts, I took a closer
look to see what the extra code was for.
import java.util.*;
color="#2e8b57">abstract color="#2e8b57">class HuffmanTree color="#2e8b57">implements Comparable<HuffmanTree> {
color="#2e8b57">public color="#2e8b57">int frequency; color="#0000ff">// the frequency of this tree
color="#2e8b57">public HuffmanTree( color="#2e8b57">int freq) { frequency = freq; }
color="#0000ff">// compares on the frequency
color="#2e8b57">public color="#2e8b57">int compareTo(HuffmanTree tree) {
color="#804040">return frequency - tree.frequency;
}
}
color="#2e8b57">class HuffmanLeaf color="#2e8b57">extends HuffmanTree {
color="#2e8b57">public color="#2e8b57">char value; color="#0000ff">// the character this leaf represents
color="#2e8b57">public HuffmanLeaf( color="#2e8b57">int freq, color="#2e8b57">char val) {
color="#2e8b57">super(freq);
value = val;
}
}
color="#2e8b57">class HuffmanNode color="#2e8b57">extends HuffmanTree {
color="#2e8b57">public HuffmanTree left, right; color="#0000ff">// subtrees
color="#2e8b57">public HuffmanNode(HuffmanTree l, HuffmanTree r) {
color="#2e8b57">super(l.frequency + r.frequency);
left = l;
right = r;
}
}
color="#2e8b57">public color="#2e8b57">class HuffmanCode {
color="#0000ff">// input is an array of frequencies, indexed by character code
color="#2e8b57">public color="#2e8b57">static HuffmanTree buildTree( color="#2e8b57">int[] charFreqs) {
PriorityQueue<HuffmanTree> trees = color="#804040">new PriorityQueue<HuffmanTree>();
color="#0000ff">// initially, we have a forest of leaves
color="#0000ff">// one for each non-empty character
color="#804040">for ( color="#2e8b57">int i = color="#ff00ff">0; i < charFreqs.length; i++)
color="#804040">if (charFreqs[i] > color="#ff00ff">0)
trees.offer( color="#804040">new HuffmanLeaf(charFreqs[i], ( color="#2e8b57">char)i));
color="#804040">assert trees.size() > color="#ff00ff">0;
color="#0000ff">// loop until there is only one tree left
color="#804040">while (trees.size() > color="#ff00ff">1) {
color="#0000ff">// two trees with least frequency
HuffmanTree a = trees.poll();
HuffmanTree b = trees.poll();
color="#0000ff">// put into new node and re-insert into queue
trees.offer( color="#804040">new HuffmanNode(a, b));
}
color="#804040">return trees.poll();
}
color="#2e8b57">public color="#2e8b57">static color="#2e8b57">void printCodes(HuffmanTree tree, Stack<Character> prefix) {
color="#804040">assert tree != color="#ff00ff">null;
color="#804040">if (tree color="#804040">instanceof HuffmanLeaf) {
HuffmanLeaf leaf = (HuffmanLeaf)tree;
color="#0000ff">// print out character and frequency
System.out.print(leaf.value + color="#ff00ff">"\t color="#ff00ff">" + leaf.frequency + color="#ff00ff">"\t color="#ff00ff">");
color="#0000ff">// print out code for this leaf, which is just the prefix
color="#804040">for ( color="#2e8b57">char bit : prefix)
System.out.print(bit);
System.out.println();
} color="#804040">else color="#804040">if (tree color="#804040">instanceof HuffmanNode) {
HuffmanNode node = (HuffmanNode)tree;
color="#0000ff">// traverse left
prefix.push( color="#ff00ff">'0');
printCodes(node.left, prefix);
prefix.pop();
color="#0000ff">// traverse right
prefix.push( color="#ff00ff">'1');
printCodes(node.right, prefix);
prefix.pop();
}
}
color="#2e8b57">public color="#2e8b57">static color="#2e8b57">void main(String[] args) {
String test = color="#ff00ff">"this is an example for huffman encoding";
color="#0000ff">// we will assume that all our characters will have
color="#0000ff">// code less than 256, for simplicity
color="#2e8b57">int[] charFreqs = color="#804040">new color="#2e8b57">int[ color="#ff00ff">256];
color="#0000ff">// read each character and record the frequencies
color="#804040">for ( color="#2e8b57">char c : test.toCharArray())
charFreqs[c]++;
color="#0000ff">// build tree
HuffmanTree tree = buildTree(charFreqs);
color="#0000ff">// print out results
System.out.println( color="#ff00ff">"SYMBOL\t color="#ff00ff">WEIGHT\t color="#ff00ff">HUFFMAN CODE");
printCodes(tree, color="#804040">new Stack<Character>());
}
}
I
noticed that the Java was tidy, and that they had defined their own
classes for a HuffmanLeaf structure used when constructing a
HuffmanTree.
In my original Python, I used a nested list as
the equivalent to the HuffmanLeaf of the Java example. It worked, but I
had to introduce 'magic constants' to access the fields of the Python
leaf , which is also un-named as a structure in the program (lline 33
of
the Python code creates a list of Leaf structures):
33 heap = [ [float(wt), [sym, []]] color="#804040">for sym, wt color="#804040">in symbol2weights.iteritems() ]
Now I didn't want to go to the trouble of creating my own
classes, but that's were the new href="http://docs.python.org/dev/py3k/library/collections.html#collections.namedtuple">namedtuple
class factory of Python 2.6 came to the rescue.
namedtuple
namedtuple
will generate a subtype of the tuple class that allows the fields of
the tuple to be named and accessed via subscription, [], or as if the
fields are instance variable names.
I modified my Python program to use two named tuples:
- For the Leaf structure as a whole in line 3.
- For a component of the Leaf structure that holds the symbol
and the Huffman code (accumulated so far), in line 4.
1 color="#a020f0">from collections color="#a020f0">import namedtuple
color="#804040"> 2
color="#804040"> 3 Leaf = namedtuple(' color="#ff00ff">Leaf', 'weight, symbols')
color="#804040"> 4 SH = namedtuple(' color="#ff00ff">SH', 'sym, huff')
color="#804040"> 5
color="#804040"> 6 def color="#008080">codecreate2(symbol2weights, tutor= False):
color="#804040"> 7 ''' color="#ff00ff"> Huffman codecreate2 the given dict mapping symbols to weights '''
color="#804040"> 8 heap = [ Leaf(weight=float(wt), symbols=[ SH(sym, []) ])
color="#804040"> 9 color="#804040">for sym, wt color="#804040">in symbol2weights.iteritems() ]
color="#804040">10 heapify(heap)
color="#804040">11 if tutor: color="#804040">print " color="#ff00ff">ENCODING:", sorted(symbol2weights.iteritems())
color="#804040">12 while len(heap) >1:
color="#804040">13 lo = heappop(heap)
color="#804040">14 hi = heappop(heap)
color="#804040">15 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> COMBINING:", lo, ' color="#6a5acd">\n AND:', hi
color="#804040">16 color="#804040">for sh color="#804040">in lo.symbols: sh.huff.insert(0, ' color="#ff00ff">0')
color="#804040">17 color="#804040">for sh color="#804040">in hi.symbols: sh.huff.insert(0, ' color="#ff00ff">1')
color="#804040">18 lohi = Leaf(weight = lo.weight + hi.weight,
color="#804040">19 symbols = lo.symbols + hi.symbols)
color="#804040">20 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> PRODUCING:", lohi, ' color="#6a5acd">\n'
color="#804040">21 heappush(heap, lohi)
color="#804040">22 symbols = heappop(heap).symbols
color="#804040">23 symbols = [SH(sym, ''.join(huff)) color="#804040">for sym, huff color="#804040">in symbols]
color="#804040">24 return sorted(symbols, key= color="#804040">lambda sh: (len(sh.huff), sh))
color="#804040">25
The
class generation is succinct in lines 3 and 4, and I use the field
names for clarity for example when creating the original list of Leaves
that will form the heap in line 8
Printing namedtuples
The print statements stay the same, but before you got output like this:
style="font-family: monospace;"> style="font-family: monospace;">...
COMBINING: [2.5, ['C', []]]
style="font-family: monospace;">
AND: [5.0, ['A', []]]
PRODUCING: [7.5, ['C', ['0']], ['A', ['1']]]
style="font-family: monospace;">
...
COMBINING: [2.5, ['C', []]]
style="font-family: monospace;">
AND: [5.0, ['A', []]]
PRODUCING: [7.5, ['C', ['0']], ['A', ['1']]]
style="font-family: monospace;">
...
Now you get the clearer:
style="font-family: monospace;"> ENCODING:
[('A', '5'), ('B', '25'), ('C', '2.5'), ('D', '12.5')]
COMBINING: Leaf(weight=2.5, symbols=[SH(sym='C', huff=[])])
AND: Leaf(weight=5.0, symbols=[SH(sym='A', huff=[])])
PRODUCING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])
COMBINING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])
AND: Leaf(weight=12.5, symbols=[SH(sym='D', huff=[])])
PRODUCING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])
COMBINING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])
AND: Leaf(weight=25.0, symbols=[SH(sym='B', huff=[])])
PRODUCING: Leaf(weight=45.0, symbols=[SH(sym='C', huff=['0', '0',
'0']), SH(sym='A', huff=['0', '0', '1']), SH(sym='D', huff=['0', '1']),
SH(sym='B', huff=['1'])])
[('A', '5'), ('B', '25'), ('C', '2.5'), ('D', '12.5')]
COMBINING: Leaf(weight=2.5, symbols=[SH(sym='C', huff=[])])
AND: Leaf(weight=5.0, symbols=[SH(sym='A', huff=[])])
PRODUCING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])
COMBINING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])
AND: Leaf(weight=12.5, symbols=[SH(sym='D', huff=[])])
PRODUCING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])
COMBINING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])
AND: Leaf(weight=25.0, symbols=[SH(sym='B', huff=[])])
PRODUCING: Leaf(weight=45.0, symbols=[SH(sym='C', huff=['0', '0',
'0']), SH(sym='A', huff=['0', '0', '1']), SH(sym='D', huff=['0', '1']),
SH(sym='B', huff=['1'])])
Conclusion
Although
you can do so much with Python lists, if each position in the list has
a fixed meaning then you might be best to use tuples, and if you should
be using tuples, then namedtuples can make your code more readable
without bloating it with long class definitions.
- Paddy.
Subscribe to:
Posts (Atom)