Mainly Tech projects on Python and Electronic Design Automation.

Wednesday, April 29, 2009

Fight for the community you want

Someone took offence at a puerile and sexist presentation given at what is supposed to be a technical event in the Rails community. Good on them. The community is what you make of it.

"Using sex to sell spanners" doesn't make me think they are good spanners.

- Paddy.

Tuesday, April 21, 2009

Monitoring a linux process as it reads files

I am processing 20 thousand files with some proprietary software and
needed to monitor how far it got in reading the files. In my own Python
version of the utility the reading of data was ten times faster than
the subsequent processing and i wanted to find out if this proprietary
solution, which was havinfg performance problems , was equally spending
most of its time reading data.



The proprietary program took a list of 20000 files to process as its
first argument and I remembered that  on Linux, the /proc
directory had info on running processes  and sure enough , the
/proc/<process id>/fd directory had info on all the file
descriptors currently open by the process as links. So by opening the
list of files to in my editor and searching within it for the file name
shown on  one of the file descriptors, I could gauge how many
files been read so far.



I decided to automate the checking and wrote a shell script using
cat/fgrep/gawk/... that then told me what line in the list of files to
process the program was currently at.



Now I've had time to refine things to use mainly python but to
demonstrate its use I also have to generate a test environment



First create some test files to process


style="font-family: monospace;">bash$ style="font-weight: bold;">mkdir -p /tmp/test style="font-family: monospace;">
bash$  style="font-weight: bold;">for ((i=0; i < 100; i++))
do touch /tmp/test/file$i ;done
style="font-family: monospace;">
bash$ style="font-weight: bold;">/bin/ls /tmp/test/file* >
/tmp/test/all_files.lst
style="font-family: monospace;">
bash$ style="font-weight: bold;">head /tmp/test/all_files.lst style="font-family: monospace;">
/tmp/test/file0 style="font-family: monospace;">
/tmp/test/file1 style="font-family: monospace;">
/tmp/test/file10 style="font-family: monospace;">
/tmp/test/file11 style="font-family: monospace;">
/tmp/test/file12 style="font-family: monospace;">
/tmp/test/file13 style="font-family: monospace;">
/tmp/test/file14 style="font-family: monospace;">
/tmp/test/file15 style="font-family: monospace;">
/tmp/test/file16 style="font-family: monospace;">
/tmp/test/file17 style="font-family: monospace;">
bash$            



Now lets create a test executable to monitor


This script just holds each file open for reading for twenty seconds
before closing the file.

style="font-family: monospace;">bash$ style="font-weight: bold;">python -c ' style="color: rgb(0, 0, 153);">import sys,time style="font-family: monospace; color: rgb(0, 0, 153);">
for
name in file(sys.argv[1]):
style="font-family: monospace; color: rgb(0, 0, 153);">
 
f = file(name.strip())
style="font-family: monospace; color: rgb(0, 0, 153);">
 
time.sleep(45)
style="font-family: monospace; color: rgb(0, 0, 153);">
 
f.close()


' style="font-weight: bold;">/tmp/test/all_files.lst 
&
style="font-family: monospace;">
[2]  style="font-weight: bold;"> style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984 style="font-family: monospace;">
bash$               



here is what the fd directory looks like


style="font-family: monospace;">bash$ ls -l /proc/ style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984 style="font-family: monospace;">/fd style="font-family: monospace;">
total 0 style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 0 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 1 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 2 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 3 -> /tmp/test/all_files.lst
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 4 -> /tmp/test/file0
style="font-family: monospace;">
bash$                                                                        



And here is a python script to monitor that fd directories
link number 4 periodically


style="font-family: monospace;">bash$ style="font-weight: bold;">python -c ' style="color: rgb(0, 0, 153);">import sys,time,os,datetime style="color: rgb(0, 0, 153);">
name2index =
dict((name.strip(), index) for index,name in
enumerate(file(sys.argv[1])))
style="color: rgb(0, 0, 153);">
all = len(name2index) style="color: rgb(0, 0, 153);">
while True: style="color: rgb(0, 0, 153);">
  path =
os.path.realpath("/proc/7984/fd/ style="font-weight: bold; color: red;">4
").strip() style="color: rgb(0, 0, 153);">
  print
name2index[path],"/",all, path, datetime.datetime.now().isoformat()
style="color: rgb(0, 0, 153);">
 
time.sleep(30)


' /tmp/test/all_files.lst

22 / 100 /tmp/test/file29 2009-04-21T22:34:07.817750

23 / 100 /tmp/test/file3 2009-04-21T22:34:37.820750

24 / 100 /tmp/test/file30 2009-04-21T22:35:07.825750

24 / 100 /tmp/test/file30 2009-04-21T22:35:37.834750

                             



I watched the monitor output over the next couple of hours and found
out when file reading ended and processing of read data started.



END.

Thursday, April 16, 2009

Bimap. Bi-directional mapping/dictionary for Python?

I had a chance encounter with the boost C++ library documentation and
came across a description of their href="http://cablemodem.fibertel.com.ar/mcape/boost/libs/bimap/index.html">bimap
(that is bimap without a 't') .



It seems to be a bi-directional mapping.



In a Python dict, keys map to values. In a bimap, values would also map
to keys. It seems they put keys/values on an equal footing by renaming
them as left and right members, and having methods that work on either
the right to left or the left to right mapping.



I had just re-visited a program of mine that dealt with what I would
consider large sets of values. I was mapping test names to test
coverage where their were tens of trhousands of test names, each of
~100 characters long and for each file, I had many thousands of code
coverage points, which themselves were strings of ~150 characters each.
I new I had to work with a dict mapping test_names to sets of covered
points for each test and do a lot of set arithmatic to rank the tests
in order of contribution to the overall coverage figure, so I decided
to use indices; replacing files by negative numbers and coverage points
by positive numbers. My program works, and is very fast but I note that
I create both a testname2index mapping dict and the inverse
index2testname mapping, (and coverpoint2index and index2coverpoint
mapping dicts).



What I had been constructing, unawares, was a bimap by using two dicts.



I then spent time googling for Python bimaps to no avail, (although I
think there are haskel and Java implementations out there - and Google
'helps' by sugesting you meant bitmap with a 't').



So my questions are:


  1.  Is their a Python bimap implementation out there?

  2.  Would anyone want one? (I already code around its
    absense).

  3.  What would its methods be?




Methods



On my third question, I guess you could have all the normal methods of
a dict but preceeded by either
left_
or right_
with calling of a method without those prefixes raising an exception,
as it seems important to not favour one mapping direction over another.



This would give:

style="font-family: monospace; font-weight: bold; text-decoration: underline;">      
DICT      BIMOD
(LEFT)  BIMOD (RIGHT) style="font-family: monospace; font-weight: bold;">
style="font-family: monospace;">     
clear       
left_clear  right_clear   
 

      
copy        
left_copy  right_copy    
 


  
fromkeys    
left_fromkeys  right_fromkeys  
style="font-family: monospace;">
       
get         
left_get 
right_get       
style="font-family: monospace;">
   
has_key     
left_has_key  right_has_key   
style="font-family: monospace;">
     
items       
left_items  right_items   
 


 
iteritems    left_iteritems 
right_iteritems


  
iterkeys    
left_iterkeys  right_iterkeys  
style="font-family: monospace;">
 itervalues  
left_itervalues  right_itervalues
style="font-family: monospace;">
      
keys        
left_keys  right_keys    
 


       
pop         
left_pop 
right_pop       
style="font-family: monospace;">
   
popitem     
left_popitem  right_popitem   
style="font-family: monospace;">
 setdefault  
left_setdefault  right_setdefault
style="font-family: monospace;">
    
update      
left_update  right_update    
style="font-family: monospace;">
    
values      
left_values  right_values    





The above couldn't be extended for indexing however, unless you did
something like:

  style="font-family: monospace;">bimod_instance.left[leftindex]
and bimod_instance..right[rightindex]


but then wouldn't it be better to use the dotted form for all the
methods and have:

style="font-family: monospace;"> 
bimod_instance.left.haskey  # (with a dot after left), ...




I note that their are redundancies above, for example, left_clear ==
right_clear ...



Strewth, I've only scratched the surface :-)



- Paddy.

Friday, April 10, 2009

Knapsack solution by OO Calc

I had previously solved the href="http://www.rosettacode.org/wiki/Knapsack_Problem"
target="_blank">knapsack problem in Python. I came
across the Solver function of OpenOffice Calc and so tried to use it to
solve knapsack.



A quick recap of the problem



A traveller gets diverted and has to make an unscheduled stop
in
what turns out to be Shangri La. Opting to leave, he is allowed to take
as much as he likes of the following items, so long as it will fit in
his knapsack, and he can carry it.
He knows that he can carry no more than 25 'weights' in total; and that
the capacity of his knapsack is 0.25 'cubic lengths'.


Looking just above the bar codes on the items he finds their
weights and volumes. He digs out his recent copy of a financial paper
and gets the value of each item.


cellpadding="2" cellspacing="2">





































nowrap="nowrap" valign="middle">Item nowrap="nowrap" valign="middle">Explanation nowrap="nowrap" valign="middle">Value (each) nowrap="nowrap" valign="middle">weight nowrap="nowrap" valign="middle">Volume (each)
panacea
(vials of)
Incredible
healing properties
3000 0.3 0.025
ichor
(ampules of)
Vampires
blood
1800 0.2 0.015
gold
(bars)
Shiney
shiney
2500 2.0 0.002
align="left" nowrap="nowrap" valign="middle">Knapsack align="left" nowrap="nowrap" valign="middle">For
the carrying of
align="left" nowrap="nowrap" valign="middle">- align="left" nowrap="nowrap" valign="middle"><=25 align="left" nowrap="nowrap" valign="middle"><=0.25 

He can only take whole units of any item,, but there is much
more of any item than he could ever carry


How many of each item does he take to maximise the
value of items he is carrying away with him?


Note:



  1. There are four solutions that maximise the value taken.
    Only one need be given.



OO Calc setup


I am using version 3.0.1. I opened a new spreadsheat and literally
cut-n-pasted the above table into the sheet. The solver needs to be
able to change some cells (the variables) that affect one cell
containing the value to optimise.. I added the table in blue, at the
right of the  figure below to calculate the value, weight and
volume of what is in the knapsack based on the amount of each item
chosen, so you can change what is in the style="font-weight: bold;">Number column and
get new totals calculated in the Totals
row

style="width: 914px; height: 444px;"
alt="Knapsack problem formulae" title="(Showing cell formulae)"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSTTwTk_nEMgY1qLVv_QXSamy1aaWjm-H73uDgKHNjbvVB0QeTQR-EVT5tCoZtaWPjmLw3RwEbFD5kBXUJ31NSYXgM-JeYFub6uyU1ZIjFKXj1e6ltYRnG3beA0o7jgjED6D3i/">

I show the formulae in the cells in the image above.



The Solver


Menu Tools-> Solver shows a form for filling in. from top to
bottom:


  • The target cell is the cell I want to maximise, which is
    the total value of all items in cell H8

  • I want to Maximise.

  • I want to change the Number
    of each item i.e. vary cells G4:G6

  • My limiting conditions are:


    • The weight is <= 25

    • The volume is <= 0.25

    • The Number of each item cannot be negative.



style="width: 455px; height: 373px;" alt="Solver Menu"
title="Solver menu"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_hFwnFTntRnmBiKs5Be81nQN6-LVoVgsi94Yl6FTgi1eM3a2IijSDPcmUOdjbh2ME_jNSpWqkScBW-9KHrl952p7MXhYT8lV_csPjRzeciF1i3Qij3K3m5JAgXQQF4by_bGoH/">



You also need to click the Solvers Options
button and ensure the options are set like this:

style="width: 431px; height: 286px;" alt="Solver Options"
title="Solver Options"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYg4BmKl0Ye99EgQ7bGvj4D_xEkf-oUNIug2hvpp9PAmjsa048nuFPSSdtMsYysNvh98NAMMpphy7jWQHRZhXWHg8fgvcUY_iH2WnsDAc5cUITezPm93HXGvNmQpx1Fj99Gxk-/">



Ok the Options, then hit Solve on the Solver window and eventually you
will get the following result:

style="width: 914px; height: 444px;"
alt="Result of Knapsack constraints problem"
title="Result of Knapsack constraints problem"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoq_CC0HfR9EzsFU6E9dXcxmLk1zXhqZxkuu2tCe1DWK-_REuVKnTHBGO9KhzvQ-cIJfskAb24DzRT8bMv2Wb5lshQw7ec5YGT6IN4VEgvR8MlCZNBM0wRJDOAgXqzn-tSZu2J/">



The solver has correctly determined that carrying no panacea, fifteen
ampules of ichor, and eleven gold bars will maximise the value, subject
to the given constraints..



Comment


Their are actually other values that give the same total value under
the same constraints but I can't find an easy way for the solver to
give them all.



END.

Memoization and stack use.

I was caught out today when investigating how to apply memoization to
allow computation of larger values of a recursive function.



Mutual Recursion


I had thought up a new task for Rosetta Code called href="http://www.rosettacode.org/wiki/Mutual_Recursion">Mutual
Recursion, as I had remembered that some languages needed
 at least the signature of a function to be defined before it
could be called, and so creating mutually recursive functions would
allow you to compare the languages in an interesting way - which is
what R.C. is all about.



For the example for implementation I used href="http://en.wikipedia.org/wiki/Hofstadter_sequence#Hofstadter_Female_and_Male_sequences"
target="_blank">Hoffstadters Female and Male
sequences, and produced the following href="http://www.rosettacode.org/wiki/Mutual_Recursion#Python">Python
definitions:


def  color="#008080">F(n): return 1  color="#804040">if n == 0  color="#804040">else n - M(F(n-1))
color="#804040">def color="#008080">M(n): return 0 color="#804040">if n == 0 color="#804040">else n - F(M(n-1))


It seems that other members of R.C. liked the task and soon added many
examples in other languages.



Apart from the computation of the series for 0<=n<20, I
did try to compute F(200) but it blew the stack so I went to bed and
decided to memoize the functions another day.



target="_blank">Wot, no memoize?


Memoization came very easily to mind, so I thought that Python would
have such available as a decorator waiting to be applied. No such luck,
I would have to craft my own . I did remember correctly however that
there was some  helper function that made wrapped functions
look a lot more like what had been wrapped, and so used functools.wrap:


from functools  color="#a020f0">import wraps

color="#804040">def color="#008080">memoize0(func):
' color="#ff00ff"> Adds cache as attribute to wrapped function for ease of access'
color="#a020f0">@wraps(func)
color="#804040">def color="#008080">wrapper(*args):
argsl = tuple(args)
color="#804040">if argsl color="#804040">in wrapper._cache:
color="#804040">return wrapper._cache[argsl]
color="#804040">else:
color="#804040">return wrapper._cache.setdefault(argsl, func(*args))
wrapper._cache = dict()
color="#804040">return wrapper

color="#804040">def color="#008080">memoize(func):
' color="#ff00ff"> Creates closure around locally created cache'
cache = dict()
color="#a020f0">@wraps(func)
color="#804040">def color="#008080">wrapper(*args):
argsl = tuple(args)
color="#804040">if argsl color="#804040">in cache:
color="#804040">return cache[argsl]
color="#804040">else:
color="#804040">return cache.setdefault(argsl, func(*args))
color="#804040">return wrapper




I haven't got over my (probably infantile) lust for attaching
attributes to functions so justified memoize0 as it allows you to look
at the contents of the cache. Combined with the fact that you cannot do
a memoization decorator without applying it to a href="http://en.wikipedia.org/wiki/Fibonacci_function">Fibonacci
number function:


if __name__ == ' color="#ff00ff">__main__':
color="#a020f0">@memoize0
color="#804040">def color="#008080">fib(n): return n color="#804040">if n color="#804040">in (0,1) color="#804040">else fib(n - 1) + fib(n - 2)
color="#a020f0">@memoize0
color="#804040">def color="#008080">fib2(n): color="#804040">return n color="#804040">if n color="#804040">in (0,1) color="#804040">else fib2(n - 1) + fib2(n - 2)
color="#804040">assert fib._cache color="#804040">is color="#804040">not fib2._cache
fib(50); fib2(50)
color="#804040">assert fib._cache == fib2._cache color="#804040">and (fib._cache color="#804040">is color="#804040">not fib2._cache)




It is important that caches are not shared for mutually recursive
functions.

A blown stack


I had initial success with the Male and Female sequence and was easily
able to compute F(200) with:



@ color="#008080">memoize
color="#804040">def color="#008080">F(n):
' color="#ff00ff"> Hofstadters Female Male sequences'
color="#804040">return 1 color="#804040">if n == 0 color="#804040">else n - M(F(n-1))

color="#a020f0">@memoize
color="#804040">def color="#008080">M(n):
' color="#ff00ff"> Hofstadters Female Male sequences'
color="#804040">return 0 color="#804040">if n == 0 color="#804040">else n - F(M(n-1))

color="#804040">assert F(200) == 124




But I got cocky, as F(2000) ran out of stack.



A little thought made me realise that I only had a cached results for F
and M up to around 200 and so a call of F(2000) was going to eat up
huge amounts of stack before recursing down to cached values.

With that knowledge in mind I thought I might gradually jump to higher
values of  n and, indeed, the following worked:


    print ([F(n)  color="#804040">for n  color="#804040">in range(0,20000+1,100)])




In real-world applicatrions, you might have to have some scheme to
reduce or limit cache size if memory rather than the stack becomes an
issue, and I have a great solution that I have just completed in the
margin :-)



- Paddy.

Thursday, April 02, 2009

(Ab)use of Pythons in-built data structures

I like to use Pythons in-built data structures quit a lot, and tend to
force myself to ask whether I  should create my own classes,
which allows you to use meaningful names for fields and add
comments/docstrings to the datastructure but usually at the cost of
adding more lines of text.



After writing about Huffman codes and posting solutions to href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Python">Rosetta
code and href="http://paddy3118.blogspot.com/2009/03/huffman-encoding-in-python.html">my
blog:


 29  color="#804040">def  color="#008080">codecreate(symbol2weights, tutor= False):
color="#804040"> 30 ''' color="#ff00ff"> Huffman encode the given dict mapping symbols to weights '''
color="#804040"> 31 global decode
color="#804040"> 32
color="#804040"> 33 heap = [ [float(wt), [[sym, []]], repr(sym)] color="#804040">for sym, wt color="#804040">in symbol2weights.iteritems() ]
color="#804040"> 34 heapify(heap)
color="#804040"> 35 if tutor: color="#804040">print " color="#ff00ff">ENCODING:", sorted(symbol2weights.iteritems())
color="#804040"> 36 while len(heap) >1:
color="#804040"> 37 lo = heappop(heap)
color="#804040"> 38 hi = heappop(heap)
color="#804040"> 39 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> COMBINING:", lo, ' color="#6a5acd">\n AND:', hi
color="#804040"> 40 color="#804040">for i color="#804040">in lo[1]: i[1].insert(0, ' color="#ff00ff">0')
color="#804040"> 41 color="#804040">for i color="#804040">in hi[1]: i[1].insert(0, ' color="#ff00ff">1')
color="#804040"> 42 lohi = [ lo[0] + hi[0] ] + [lo[1] + hi[1]]
color="#804040"> 43 lohi.append(' color="#ff00ff">(%s if nextbit() else %s)' % (hi[2], lo[2]))
color="#804040"> 44 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> PRODUCING:", lohi, ' color="#6a5acd">\n'
color="#804040"> 45 heappush(heap, lohi)
color="#804040"> 46 wt, codes, decoder = heappop(heap)
color="#804040"> 47 decode = eval(' color="#ff00ff">lambda :' + decoder, globals())
color="#804040"> 48 decode.__doc__ = decoder
color="#804040"> 49 for i color="#804040">in codes: i[1] = ''.join(i[1])
color="#804040"> 50 #for i in codes: i[::] = i[:2]
color="#804040"> 51 return sorted(codes, key= color="#804040">lambda x: (len(x[-1]), x))
color="#804040"> 52


Someone posted a href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Java">
Java solution, that was over three times as long. Instead of
just counting and comparing line counts,  I took a closer
look to see what the extra code was for.


import java.util.*;

color="#2e8b57">abstract color="#2e8b57">class HuffmanTree color="#2e8b57">implements Comparable<HuffmanTree> {
color="#2e8b57">public color="#2e8b57">int frequency; color="#0000ff">// the frequency of this tree
color="#2e8b57">public HuffmanTree( color="#2e8b57">int freq) { frequency = freq; }

color="#0000ff">// compares on the frequency
color="#2e8b57">public color="#2e8b57">int compareTo(HuffmanTree tree) {
color="#804040">return frequency - tree.frequency;
}
}

color="#2e8b57">class HuffmanLeaf color="#2e8b57">extends HuffmanTree {
color="#2e8b57">public color="#2e8b57">char value; color="#0000ff">// the character this leaf represents

color="#2e8b57">public HuffmanLeaf( color="#2e8b57">int freq, color="#2e8b57">char val) {
color="#2e8b57">super(freq);
value = val;
}
}

color="#2e8b57">class HuffmanNode color="#2e8b57">extends HuffmanTree {
color="#2e8b57">public HuffmanTree left, right; color="#0000ff">// subtrees

color="#2e8b57">public HuffmanNode(HuffmanTree l, HuffmanTree r) {
color="#2e8b57">super(l.frequency + r.frequency);
left = l;
right = r;
}
}

color="#2e8b57">public color="#2e8b57">class HuffmanCode {
color="#0000ff">// input is an array of frequencies, indexed by character code
color="#2e8b57">public color="#2e8b57">static HuffmanTree buildTree( color="#2e8b57">int[] charFreqs) {
PriorityQueue<HuffmanTree> trees = color="#804040">new PriorityQueue<HuffmanTree>();
color="#0000ff">// initially, we have a forest of leaves
color="#0000ff">// one for each non-empty character
color="#804040">for ( color="#2e8b57">int i = color="#ff00ff">0; i < charFreqs.length; i++)
color="#804040">if (charFreqs[i] > color="#ff00ff">0)
trees.offer( color="#804040">new HuffmanLeaf(charFreqs[i], ( color="#2e8b57">char)i));

color="#804040">assert trees.size() > color="#ff00ff">0;
color="#0000ff">// loop until there is only one tree left
color="#804040">while (trees.size() > color="#ff00ff">1) {
color="#0000ff">// two trees with least frequency
HuffmanTree a = trees.poll();
HuffmanTree b = trees.poll();

color="#0000ff">// put into new node and re-insert into queue
trees.offer( color="#804040">new HuffmanNode(a, b));
}
color="#804040">return trees.poll();
}

color="#2e8b57">public color="#2e8b57">static color="#2e8b57">void printCodes(HuffmanTree tree, Stack<Character> prefix) {
color="#804040">assert tree != color="#ff00ff">null;
color="#804040">if (tree color="#804040">instanceof HuffmanLeaf) {
HuffmanLeaf leaf = (HuffmanLeaf)tree;

color="#0000ff">// print out character and frequency
System.out.print(leaf.value + color="#ff00ff">"\t color="#ff00ff">" + leaf.frequency + color="#ff00ff">"\t color="#ff00ff">");

color="#0000ff">// print out code for this leaf, which is just the prefix
color="#804040">for ( color="#2e8b57">char bit : prefix)
System.out.print(bit);
System.out.println();

} color="#804040">else color="#804040">if (tree color="#804040">instanceof HuffmanNode) {
HuffmanNode node = (HuffmanNode)tree;

color="#0000ff">// traverse left
prefix.push( color="#ff00ff">'0');
printCodes(node.left, prefix);
prefix.pop();

color="#0000ff">// traverse right
prefix.push( color="#ff00ff">'1');
printCodes(node.right, prefix);
prefix.pop();
}
}

color="#2e8b57">public color="#2e8b57">static color="#2e8b57">void main(String[] args) {
String test = color="#ff00ff">"this is an example for huffman encoding";

color="#0000ff">// we will assume that all our characters will have
color="#0000ff">// code less than 256, for simplicity
color="#2e8b57">int[] charFreqs = color="#804040">new color="#2e8b57">int[ color="#ff00ff">256];
color="#0000ff">// read each character and record the frequencies
color="#804040">for ( color="#2e8b57">char c : test.toCharArray())
charFreqs[c]++;

color="#0000ff">// build tree
HuffmanTree tree = buildTree(charFreqs);

color="#0000ff">// print out results
System.out.println( color="#ff00ff">"SYMBOL\t color="#ff00ff">WEIGHT\t color="#ff00ff">HUFFMAN CODE");
printCodes(tree, color="#804040">new Stack<Character>());
}
}




I
noticed that the Java was tidy, and that they had defined their own
classes for a HuffmanLeaf structure used when constructing a
HuffmanTree.



In my original Python, I used a nested list as
the equivalent to the HuffmanLeaf of the Java example. It worked, but I
had to introduce 'magic constants' to access the fields of the Python
leaf , which is also un-named as a structure in the program (lline 33
of
the Python code creates a list of Leaf structures):


33  heap = [ [float(wt), [sym, []]]  color="#804040">for sym, wt  color="#804040">in symbol2weights.iteritems() ]




Now I didn't want to go to the trouble of creating my own
classes, but that's were the new href="http://docs.python.org/dev/py3k/library/collections.html#collections.namedtuple">namedtuple
class factory of Python 2.6 came to the rescue.

namedtuple


namedtuple
will generate a subtype of the tuple class that allows the fields of
the tuple to be named and accessed via subscription, [], or as if the
fields are instance variable names.



I modified my Python program to use two named tuples:


  1. For the Leaf structure as a whole in line 3.

  2. For a component of the Leaf structure that holds the symbol
    and the Huffman code (accumulated so far), in line 4.



 1  color="#a020f0">from collections  color="#a020f0">import namedtuple
color="#804040"> 2
color="#804040"> 3 Leaf = namedtuple(' color="#ff00ff">Leaf', 'weight, symbols')
color="#804040"> 4 SH = namedtuple(' color="#ff00ff">SH', 'sym, huff')
color="#804040"> 5
color="#804040"> 6 def color="#008080">codecreate2(symbol2weights, tutor= False):
color="#804040"> 7 ''' color="#ff00ff"> Huffman codecreate2 the given dict mapping symbols to weights '''
color="#804040"> 8 heap = [ Leaf(weight=float(wt), symbols=[ SH(sym, []) ])
color="#804040"> 9 color="#804040">for sym, wt color="#804040">in symbol2weights.iteritems() ]
color="#804040">10 heapify(heap)
color="#804040">11 if tutor: color="#804040">print " color="#ff00ff">ENCODING:", sorted(symbol2weights.iteritems())
color="#804040">12 while len(heap) >1:
color="#804040">13 lo = heappop(heap)
color="#804040">14 hi = heappop(heap)
color="#804040">15 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> COMBINING:", lo, ' color="#6a5acd">\n AND:', hi
color="#804040">16 color="#804040">for sh color="#804040">in lo.symbols: sh.huff.insert(0, ' color="#ff00ff">0')
color="#804040">17 color="#804040">for sh color="#804040">in hi.symbols: sh.huff.insert(0, ' color="#ff00ff">1')
color="#804040">18 lohi = Leaf(weight = lo.weight + hi.weight,
color="#804040">19 symbols = lo.symbols + hi.symbols)
color="#804040">20 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> PRODUCING:", lohi, ' color="#6a5acd">\n'
color="#804040">21 heappush(heap, lohi)
color="#804040">22 symbols = heappop(heap).symbols
color="#804040">23 symbols = [SH(sym, ''.join(huff)) color="#804040">for sym, huff color="#804040">in symbols]
color="#804040">24 return sorted(symbols, key= color="#804040">lambda sh: (len(sh.huff), sh))
color="#804040">25




The
class generation is succinct in lines 3 and 4, and I use the field
names for clarity for example when creating the original list of Leaves
that will form the heap in line 8

Printing namedtuples


The print statements stay the same, but before you got output like this:

style="font-family: monospace;"> style="font-family: monospace;">...

  COMBINING: [2.5, ['C', []]] style="font-family: monospace;">
       
AND: [5.0, ['A', []]]


 
PRODUCING: [7.5, ['C', ['0']], ['A', ['1']]]
style="font-family: monospace;">
...



Now you get the clearer:

style="font-family: monospace;"> ENCODING:
[('A', '5'), ('B', '25'), ('C', '2.5'), ('D', '12.5')]

  COMBINING: Leaf(weight=2.5, symbols=[SH(sym='C', huff=[])])

       
AND: Leaf(weight=5.0, symbols=[SH(sym='A', huff=[])])

  PRODUCING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])



  COMBINING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])

       
AND: Leaf(weight=12.5, symbols=[SH(sym='D', huff=[])])

 
PRODUCING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])



 
COMBINING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])

       
AND: Leaf(weight=25.0, symbols=[SH(sym='B', huff=[])])

 
PRODUCING: Leaf(weight=45.0, symbols=[SH(sym='C', huff=['0', '0',
'0']), SH(sym='A', huff=['0', '0', '1']), SH(sym='D', huff=['0', '1']),
SH(sym='B', huff=['1'])])



Conclusion


 Although
you can do so much with Python lists, if each position in the list has
a fixed meaning then you might be best to use tuples, and if you should
be using tuples, then namedtuples can make your code more readable
without bloating it with long class definitions.



- Paddy.

Followers

Subscribe Now: google

Add to Google Reader or Homepage

Go deh too!

whos.amung.us

Blog Archive