Go deh!: April 2009

Wednesday, April 29, 2009

Fight for the community you want

Someone took offence at a puerile and sexist presentation given at what is supposed to be a technical event in the Rails community. Good on them. The community is what you make of it.

"Using sex to sell spanners" doesn't make me think they are good spanners.

- Paddy.

Tuesday, April 21, 2009

Monitoring a linux process as it reads files

I am processing 20 thousand files with some proprietary software and
needed to monitor how far it got in reading the files. In my own Python
version of the utility the reading of data was ten times faster than
the subsequent processing and i wanted to find out if this proprietary
solution, which was havinfg performance problems , was equally spending
most of its time reading data.

The proprietary program took a list of 20000 files to process as its
first argument and I remembered that on Linux, the /proc
directory had info on running processes and sure enough , the
/proc/<process id>/fd directory had info on all the file
descriptors currently open by the process as links. So by opening the
list of files to in my editor and searching within it for the file name
shown on one of the file descriptors, I could gauge how many
files been read so far.

I decided to automate the checking and wrote a shell script using
cat/fgrep/gawk/... that then told me what line in the list of files to
process the program was currently at.

Now I've had time to refine things to use mainly python but to
demonstrate its use I also have to generate a test environment

First create some test files to process

style="font-family: monospace;">bash$ style="font-weight: bold;">mkdir -p /tmp/test style="font-family: monospace;">
bash$ style="font-weight: bold;">for ((i=0; i < 100; i++))
do touch /tmp/test/file$i ;done style="font-family: monospace;">
bash$ style="font-weight: bold;">/bin/ls /tmp/test/file* >
/tmp/test/all_files.lst style="font-family: monospace;">
bash$ style="font-weight: bold;">head /tmp/test/all_files.lst style="font-family: monospace;">
/tmp/test/file0 style="font-family: monospace;">
/tmp/test/file1 style="font-family: monospace;">
/tmp/test/file10 style="font-family: monospace;">
/tmp/test/file11 style="font-family: monospace;">
/tmp/test/file12 style="font-family: monospace;">
/tmp/test/file13 style="font-family: monospace;">
/tmp/test/file14 style="font-family: monospace;">
/tmp/test/file15 style="font-family: monospace;">
/tmp/test/file16 style="font-family: monospace;">
/tmp/test/file17 style="font-family: monospace;">
bash$

Now lets create a test executable to monitor

This script just holds each file open for reading for twenty seconds
before closing the file.

style="font-family: monospace;">bash$ style="font-weight: bold;">python -c ' style="color: rgb(0, 0, 153);">import sys,time style="font-family: monospace; color: rgb(0, 0, 153);">
for
name in file(sys.argv[1]): style="font-family: monospace; color: rgb(0, 0, 153);">

f = file(name.strip()) style="font-family: monospace; color: rgb(0, 0, 153);">

time.sleep(45) style="font-family: monospace; color: rgb(0, 0, 153);">

f.close()

' style="font-weight: bold;">/tmp/test/all_files.lst
& style="font-family: monospace;">
[2] style="font-weight: bold;"> style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984 style="font-family: monospace;">
bash$

here is what the fd directory looks like

style="font-family: monospace;">bash$ ls -l /proc/ style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984 style="font-family: monospace;">/fd style="font-family: monospace;">
total 0 style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 0 -> /dev/tty1 style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 1 -> /dev/tty1 style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 2 -> /dev/tty1 style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 3 -> /tmp/test/all_files.lst style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 4 -> /tmp/test/file0 style="font-family: monospace;">
bash$

And here is a python script to monitor that fd directories
link number 4 periodically

style="font-family: monospace;">bash$ style="font-weight: bold;">python -c ' style="color: rgb(0, 0, 153);">import sys,time,os,datetime style="color: rgb(0, 0, 153);">
name2index =
dict((name.strip(), index) for index,name in
enumerate(file(sys.argv[1]))) style="color: rgb(0, 0, 153);">
all = len(name2index) style="color: rgb(0, 0, 153);">
while True: style="color: rgb(0, 0, 153);">
path =
os.path.realpath("/proc/7984/fd/ style="font-weight: bold; color: red;">4").strip() style="color: rgb(0, 0, 153);">
print
name2index[path],"/",all, path, datetime.datetime.now().isoformat() style="color: rgb(0, 0, 153);">

time.sleep(30)

' /tmp/test/all_files.lst

22 / 100 /tmp/test/file29 2009-04-21T22:34:07.817750

23 / 100 /tmp/test/file3 2009-04-21T22:34:37.820750

24 / 100 /tmp/test/file30 2009-04-21T22:35:07.825750

24 / 100 /tmp/test/file30 2009-04-21T22:35:37.834750

I watched the monitor output over the next couple of hours and found
out when file reading ended and processing of read data started.

END.

Thursday, April 16, 2009

Bimap. Bi-directional mapping/dictionary for Python?

I had a chance encounter with the boost C++ library documentation and
came across a description of their href="http://cablemodem.fibertel.com.ar/mcape/boost/libs/bimap/index.html">bimap
(that is bimap without a 't') .

It seems to be a bi-directional mapping.

In a Python dict, keys map to values. In a bimap, values would also map
to keys. It seems they put keys/values on an equal footing by renaming
them as left and right members, and having methods that work on either
the right to left or the left to right mapping.

I had just re-visited a program of mine that dealt with what I would
consider large sets of values. I was mapping test names to test
coverage where their were tens of trhousands of test names, each of
~100 characters long and for each file, I had many thousands of code
coverage points, which themselves were strings of ~150 characters each.
I new I had to work with a dict mapping test_names to sets of covered
points for each test and do a lot of set arithmatic to rank the tests
in order of contribution to the overall coverage figure, so I decided
to use indices; replacing files by negative numbers and coverage points
by positive numbers. My program works, and is very fast but I note that
I create both a testname2index mapping dict and the inverse
index2testname mapping, (and coverpoint2index and index2coverpoint
mapping dicts).

What I had been constructing, unawares, was a bimap by using two dicts.

I then spent time googling for Python bimaps to no avail, (although I
think there are haskel and Java implementations out there - and Google
'helps' by sugesting you meant bitmap with a 't').

So my questions are:

Is their a Python bimap implementation out there?

Would anyone want one? (I already code around its
absense).

What would its methods be?

Methods

On my third question, I guess you could have all the normal methods of
a dict but preceeded by either
left_ or right_
with calling of a method without those prefixes raising an exception,
as it seems important to not favour one mapping direction over another.

This would give:

style="font-family: monospace; font-weight: bold; text-decoration: underline;">
DICT      BIMOD
(LEFT) BIMOD (RIGHT) style="font-family: monospace; font-weight: bold;">
style="font-family: monospace;">
clear
left_clear right_clear


copy
left_copy right_copy


fromkeys
left_fromkeys right_fromkeys style="font-family: monospace;">

get
left_get
right_get      style="font-family: monospace;">

has_key
left_has_key right_has_key style="font-family: monospace;">

items
left_items right_items

iteritems    left_iteritems
right_iteritems


iterkeys
left_iterkeys right_iterkeys style="font-family: monospace;">
itervalues
left_itervalues right_itervalues style="font-family: monospace;">

keys
left_keys right_keys


pop
left_pop
right_pop      style="font-family: monospace;">

popitem
left_popitem right_popitem style="font-family: monospace;">
setdefault
left_setdefault right_setdefault style="font-family: monospace;">

update
left_update right_update   style="font-family: monospace;">

values
left_values right_values

The above couldn't be extended for indexing however, unless you did
something like:

style="font-family: monospace;">bimod_instance.left[leftindex]
and bimod_instance..right[rightindex]

but then wouldn't it be better to use the dotted form for all the
methods and have:

style="font-family: monospace;">
bimod_instance.left.haskey # (with a dot after left), ...

I note that their are redundancies above, for example, left_clear ==
right_clear ...

Strewth, I've only scratched the surface :-)

- Paddy.

Friday, April 10, 2009

Knapsack solution by OO Calc

I had previously solved the href="http://www.rosettacode.org/wiki/Knapsack_Problem"
target="_blank">knapsack problem in Python. I came
across the Solver function of OpenOffice Calc and so tried to use it to
solve knapsack.

A quick recap of the problem

A traveller gets diverted and has to make an unscheduled stop
in
what turns out to be Shangri La. Opting to leave, he is allowed to take
as much as he likes of the following items, so long as it will fit in
his knapsack, and he can carry it.
He knows that he can carry no more than 25 'weights' in total; and that
the capacity of his knapsack is 0.25 'cubic lengths'.

Looking just above the bar codes on the items he finds their
weights and volumes. He digs out his recent copy of a financial paper
and gets the value of each item.

cellpadding="2" cellspacing="2">

nowrap="nowrap" valign="middle">Item	nowrap="nowrap" valign="middle">Explanation	nowrap="nowrap" valign="middle">Value (each)	nowrap="nowrap" valign="middle">weight	nowrap="nowrap" valign="middle">Volume (each)
panacea (vials of)	Incredible healing properties	3000	0.3	0.025
ichor (ampules of)	Vampires blood	1800	0.2	0.015
gold (bars)	Shiney shiney	2500	2.0	0.002
align="left" nowrap="nowrap" valign="middle">Knapsack	align="left" nowrap="nowrap" valign="middle">For the carrying of	align="left" nowrap="nowrap" valign="middle">-	align="left" nowrap="nowrap" valign="middle"><=25	align="left" nowrap="nowrap" valign="middle"><=0.25

He can only take whole units of any item,, but there is much
more of any item than he could ever carry

How many of each item does he take to maximise the
value of items he is carrying away with him?

Note:

There are four solutions that maximise the value taken.
Only one need be given.

OO Calc setup

I am using version 3.0.1. I opened a new spreadsheat and literally
cut-n-pasted the above table into the sheet. The solver needs to be
able to change some cells (the variables) that affect one cell
containing the value to optimise.. I added the table in blue, at the
right of the figure below to calculate the value, weight and
volume of what is in the knapsack based on the amount of each item
chosen, so you can change what is in the style="font-weight: bold;">Number column and
get new totals calculated in the Totals
row

style="width: 914px; height: 444px;"
alt="Knapsack problem formulae" title="(Showing cell formulae)"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSTTwTk_nEMgY1qLVv_QXSamy1aaWjm-H73uDgKHNjbvVB0QeTQR-EVT5tCoZtaWPjmLw3RwEbFD5kBXUJ31NSYXgM-JeYFub6uyU1ZIjFKXj1e6ltYRnG3beA0o7jgjED6D3i/">

I show the formulae in the cells in the image above.

The Solver

Menu Tools-> Solver shows a form for filling in. from top to
bottom:

The target cell is the cell I want to maximise, which is
the total value of all items in cell H8

I want to Maximise.

I want to change the Number
of each item i.e. vary cells G4:G6

My limiting conditions are:

The weight is <= 25

The volume is <= 0.25

The Number of each item cannot be negative.

style="width: 455px; height: 373px;" alt="Solver Menu"
title="Solver menu"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_hFwnFTntRnmBiKs5Be81nQN6-LVoVgsi94Yl6FTgi1eM3a2IijSDPcmUOdjbh2ME_jNSpWqkScBW-9KHrl952p7MXhYT8lV_csPjRzeciF1i3Qij3K3m5JAgXQQF4by_bGoH/">

You also need to click the Solvers Options
button and ensure the options are set like this:

style="width: 431px; height: 286px;" alt="Solver Options"
title="Solver Options"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYg4BmKl0Ye99EgQ7bGvj4D_xEkf-oUNIug2hvpp9PAmjsa048nuFPSSdtMsYysNvh98NAMMpphy7jWQHRZhXWHg8fgvcUY_iH2WnsDAc5cUITezPm93HXGvNmQpx1Fj99Gxk-/">

Ok the Options, then hit Solve on the Solver window and eventually you
will get the following result:

style="width: 914px; height: 444px;"
alt="Result of Knapsack constraints problem"
title="Result of Knapsack constraints problem"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoq_CC0HfR9EzsFU6E9dXcxmLk1zXhqZxkuu2tCe1DWK-_REuVKnTHBGO9KhzvQ-cIJfskAb24DzRT8bMv2Wb5lshQw7ec5YGT6IN4VEgvR8MlCZNBM0wRJDOAgXqzn-tSZu2J/">

The solver has correctly determined that carrying no panacea, fifteen
ampules of ichor, and eleven gold bars will maximise the value, subject
to the given constraints..

Comment

Their are actually other values that give the same total value under
the same constraints but I can't find an easy way for the solver to
give them all.

END.

Memoization and stack use.

I was caught out today when investigating how to apply memoization to
allow computation of larger values of a recursive function.

Mutual Recursion

I had thought up a new task for Rosetta Code called href="http://www.rosettacode.org/wiki/Mutual_Recursion">Mutual
Recursion, as I had remembered that some languages needed
at least the signature of a function to be defined before it
could be called, and so creating mutually recursive functions would
allow you to compare the languages in an interesting way - which is
what R.C. is all about.

For the example for implementation I used href="http://en.wikipedia.org/wiki/Hofstadter_sequence#Hofstadter_Female_and_Male_sequences"
target="_blank">Hoffstadters Female and Male
sequences, and produced the following href="http://www.rosettacode.org/wiki/Mutual_Recursion#Python">Python
definitions:

def  color="#008080">F(n): return 1  color="#804040">if n == 0  color="#804040">else n - M(F(n-1))
 color="#804040">def  color="#008080">M(n): return 0  color="#804040">if n == 0  color="#804040">else n - F(M(n-1))

It seems that other members of R.C. liked the task and soon added many
examples in other languages.

Apart from the computation of the series for 0<=n<20, I
did try to compute F(200) but it blew the stack so I went to bed and
decided to memoize the functions another day.

target="_blank">Wot, no memoize?

Memoization came very easily to mind, so I thought that Python would
have such available as a decorator waiting to be applied. No such luck,
I would have to craft my own . I did remember correctly however that
there was some helper function that made wrapped functions
look a lot more like what had been wrapped, and so used functools.wrap:

from functools  color="#a020f0">import wraps

 color="#804040">def  color="#008080">memoize0(func):
    ' color="#ff00ff"> Adds cache as attribute to wrapped function for ease of access'
     color="#a020f0">@wraps(func)
     color="#804040">def  color="#008080">wrapper(*args):
        argsl = tuple(args)
         color="#804040">if argsl  color="#804040">in wrapper._cache:
             color="#804040">return wrapper._cache[argsl]
         color="#804040">else:
             color="#804040">return wrapper._cache.setdefault(argsl, func(*args))
    wrapper._cache = dict()
     color="#804040">return wrapper

 color="#804040">def  color="#008080">memoize(func):
    ' color="#ff00ff"> Creates closure around locally created cache'
    cache = dict()
     color="#a020f0">@wraps(func)
     color="#804040">def  color="#008080">wrapper(*args):
        argsl = tuple(args)
         color="#804040">if argsl  color="#804040">in cache:
             color="#804040">return cache[argsl]
         color="#804040">else:
             color="#804040">return cache.setdefault(argsl, func(*args))
     color="#804040">return wrapper

I haven't got over my (probably infantile) lust for attaching
attributes to functions so justified memoize0 as it allows you to look
at the contents of the cache. Combined with the fact that you cannot do
a memoization decorator without applying it to a href="http://en.wikipedia.org/wiki/Fibonacci_function">Fibonacci
number function:

if __name__ == ' color="#ff00ff">__main__':
     color="#a020f0">@memoize0
     color="#804040">def  color="#008080">fib(n): return n  color="#804040">if n  color="#804040">in (0,1)  color="#804040">else fib(n - 1) + fib(n - 2)
     color="#a020f0">@memoize0
     color="#804040">def  color="#008080">fib2(n):  color="#804040">return n  color="#804040">if n  color="#804040">in (0,1)  color="#804040">else fib2(n - 1) + fib2(n - 2)
     color="#804040">assert fib._cache  color="#804040">is  color="#804040">not fib2._cache
    fib(50); fib2(50)
     color="#804040">assert fib._cache == fib2._cache  color="#804040">and (fib._cache  color="#804040">is  color="#804040">not fib2._cache)

It is important that caches are not shared for mutually recursive
functions.

A blown stack

I had initial success with the Male and Female sequence and was easily
able to compute F(200) with:


    @ color="#008080">memoize
     color="#804040">def  color="#008080">F(n):
        ' color="#ff00ff"> Hofstadters Female Male sequences'
         color="#804040">return 1  color="#804040">if n == 0  color="#804040">else n - M(F(n-1))

     color="#a020f0">@memoize
     color="#804040">def  color="#008080">M(n):
        ' color="#ff00ff"> Hofstadters Female Male sequences'
         color="#804040">return 0  color="#804040">if n == 0  color="#804040">else n - F(M(n-1))

     color="#804040">assert F(200) == 124

But I got cocky, as F(2000) ran out of stack.

A little thought made me realise that I only had a cached results for F
and M up to around 200 and so a call of F(2000) was going to eat up
huge amounts of stack before recursing down to cached values.

With that knowledge in mind I thought I might gradually jump to higher
values of n and, indeed, the following worked:

    print ([F(n)  color="#804040">for n  color="#804040">in range(0,20000+1,100)])

In real-world applicatrions, you might have to have some scheme to
reduce or limit cache size if memory rather than the stack becomes an
issue, and I have a great solution that I have just completed in the
margin :-)

- Paddy.

Thursday, April 02, 2009

(Ab)use of Pythons in-built data structures

I like to use Pythons in-built data structures quit a lot, and tend to
force myself to ask whether I should create my own classes,
which allows you to use meaningful names for fields and add
comments/docstrings to the datastructure but usually at the cost of
adding more lines of text.

After writing about Huffman codes and posting solutions to href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Python">Rosetta
code and href="http://paddy3118.blogspot.com/2009/03/huffman-encoding-in-python.html">my
blog:

 29  color="#804040">def  color="#008080">codecreate(symbol2weights, tutor= False):
 color="#804040"> 30     ''' color="#ff00ff"> Huffman encode the given dict mapping symbols to weights '''
 color="#804040"> 31     global decode
 color="#804040"> 32 
 color="#804040"> 33     heap = [ [float(wt), [[sym, []]], repr(sym)]  color="#804040">for sym, wt  color="#804040">in symbol2weights.iteritems() ]
 color="#804040"> 34     heapify(heap)
 color="#804040"> 35     if tutor:  color="#804040">print " color="#ff00ff">ENCODING:", sorted(symbol2weights.iteritems())
 color="#804040"> 36     while len(heap) >1:
 color="#804040"> 37         lo = heappop(heap)
 color="#804040"> 38         hi = heappop(heap)
 color="#804040"> 39          color="#804040">if tutor:  color="#804040">print " color="#ff00ff">  COMBINING:", lo, ' color="#6a5acd">\n        AND:', hi
 color="#804040"> 40          color="#804040">for i  color="#804040">in lo[1]: i[1].insert(0, ' color="#ff00ff">0')
 color="#804040"> 41          color="#804040">for i  color="#804040">in hi[1]: i[1].insert(0, ' color="#ff00ff">1')
 color="#804040"> 42         lohi = [ lo[0] + hi[0] ] + [lo[1] + hi[1]]
 color="#804040"> 43         lohi.append(' color="#ff00ff">(%s if nextbit() else %s)' % (hi[2], lo[2]))
 color="#804040"> 44          color="#804040">if tutor:  color="#804040">print " color="#ff00ff">  PRODUCING:", lohi, ' color="#6a5acd">\n'
 color="#804040"> 45         heappush(heap, lohi)
 color="#804040"> 46     wt, codes, decoder = heappop(heap)
 color="#804040"> 47     decode = eval(' color="#ff00ff">lambda :' + decoder, globals())
 color="#804040"> 48     decode.__doc__ = decoder
 color="#804040"> 49     for i  color="#804040">in codes: i[1] = ''.join(i[1])
 color="#804040"> 50     #for i in codes: i[::] = i[:2]
 color="#804040"> 51     return sorted(codes, key= color="#804040">lambda x: (len(x[-1]), x))
 color="#804040"> 52

Someone posted a href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Java">
Java solution, that was over three times as long. Instead of
just counting and comparing line counts, I took a closer
look to see what the extra code was for.

import java.util.*;

 color="#2e8b57">abstract  color="#2e8b57">class HuffmanTree  color="#2e8b57">implements Comparable<HuffmanTree> {
     color="#2e8b57">public  color="#2e8b57">int frequency;  color="#0000ff">// the frequency of this tree
     color="#2e8b57">public HuffmanTree( color="#2e8b57">int freq) { frequency = freq; }

     color="#0000ff">// compares on the frequency
     color="#2e8b57">public  color="#2e8b57">int compareTo(HuffmanTree tree) {
         color="#804040">return frequency - tree.frequency;
    }
}

 color="#2e8b57">class HuffmanLeaf  color="#2e8b57">extends HuffmanTree {
     color="#2e8b57">public  color="#2e8b57">char value;  color="#0000ff">// the character this leaf represents

     color="#2e8b57">public HuffmanLeaf( color="#2e8b57">int freq,  color="#2e8b57">char val) {
         color="#2e8b57">super(freq);
        value = val;
    }
}

 color="#2e8b57">class HuffmanNode  color="#2e8b57">extends HuffmanTree {
     color="#2e8b57">public HuffmanTree left, right;  color="#0000ff">// subtrees

     color="#2e8b57">public HuffmanNode(HuffmanTree l, HuffmanTree r) {
         color="#2e8b57">super(l.frequency + r.frequency);
        left = l;
        right = r;
    }
}

 color="#2e8b57">public  color="#2e8b57">class HuffmanCode {
     color="#0000ff">// input is an array of frequencies, indexed by character code
     color="#2e8b57">public  color="#2e8b57">static HuffmanTree buildTree( color="#2e8b57">int[] charFreqs) {
        PriorityQueue<HuffmanTree> trees =  color="#804040">new PriorityQueue<HuffmanTree>();
         color="#0000ff">// initially, we have a forest of leaves
         color="#0000ff">// one for each non-empty character
         color="#804040">for ( color="#2e8b57">int i =  color="#ff00ff">0; i < charFreqs.length; i++)
             color="#804040">if (charFreqs[i] >  color="#ff00ff">0)
                trees.offer( color="#804040">new HuffmanLeaf(charFreqs[i], ( color="#2e8b57">char)i));

         color="#804040">assert trees.size() >  color="#ff00ff">0;
         color="#0000ff">// loop until there is only one tree left
         color="#804040">while (trees.size() >  color="#ff00ff">1) {
             color="#0000ff">// two trees with least frequency
            HuffmanTree a = trees.poll();
            HuffmanTree b = trees.poll();

             color="#0000ff">// put into new node and re-insert into queue
            trees.offer( color="#804040">new HuffmanNode(a, b));
        }
         color="#804040">return trees.poll();
    }

     color="#2e8b57">public  color="#2e8b57">static  color="#2e8b57">void printCodes(HuffmanTree tree, Stack<Character> prefix) {
         color="#804040">assert tree !=  color="#ff00ff">null;
         color="#804040">if (tree  color="#804040">instanceof HuffmanLeaf) {
            HuffmanLeaf leaf = (HuffmanLeaf)tree;

             color="#0000ff">// print out character and frequency
            System.out.print(leaf.value +  color="#ff00ff">"\t color="#ff00ff">" + leaf.frequency +  color="#ff00ff">"\t color="#ff00ff">");

             color="#0000ff">// print out code for this leaf, which is just the prefix
             color="#804040">for ( color="#2e8b57">char bit : prefix)
                System.out.print(bit);
            System.out.println();

        }  color="#804040">else  color="#804040">if (tree  color="#804040">instanceof HuffmanNode) {
            HuffmanNode node = (HuffmanNode)tree;

             color="#0000ff">// traverse left
            prefix.push( color="#ff00ff">'0');
            printCodes(node.left, prefix);
            prefix.pop();

             color="#0000ff">// traverse right
            prefix.push( color="#ff00ff">'1');
            printCodes(node.right, prefix);
            prefix.pop();
        }
    }

     color="#2e8b57">public  color="#2e8b57">static  color="#2e8b57">void main(String[] args) {
        String test =  color="#ff00ff">"this is an example for huffman encoding";

         color="#0000ff">// we will assume that all our characters will have
         color="#0000ff">// code less than 256, for simplicity
         color="#2e8b57">int[] charFreqs =  color="#804040">new  color="#2e8b57">int[ color="#ff00ff">256];
         color="#0000ff">// read each character and record the frequencies
         color="#804040">for ( color="#2e8b57">char c : test.toCharArray())
            charFreqs[c]++;

         color="#0000ff">// build tree
        HuffmanTree tree = buildTree(charFreqs);

         color="#0000ff">// print out results
        System.out.println( color="#ff00ff">"SYMBOL\t color="#ff00ff">WEIGHT\t color="#ff00ff">HUFFMAN CODE");
        printCodes(tree,  color="#804040">new Stack<Character>());
    }
}

I
noticed that the Java was tidy, and that they had defined their own
classes for a HuffmanLeaf structure used when constructing a
HuffmanTree.

In my original Python, I used a nested list as
the equivalent to the HuffmanLeaf of the Java example. It worked, but I
had to introduce 'magic constants' to access the fields of the Python
leaf , which is also un-named as a structure in the program (lline 33
of
the Python code creates a list of Leaf structures):

33  heap = [ [float(wt), [sym, []]]  color="#804040">for sym, wt  color="#804040">in symbol2weights.iteritems() ]

Now I didn't want to go to the trouble of creating my own
classes, but that's were the new href="http://docs.python.org/dev/py3k/library/collections.html#collections.namedtuple">namedtuple
class factory of Python 2.6 came to the rescue.

namedtuple

namedtuple
will generate a subtype of the tuple class that allows the fields of
the tuple to be named and accessed via subscription, [], or as if the
fields are instance variable names.

I modified my Python program to use two named tuples:

For the Leaf structure as a whole in line 3.

For a component of the Leaf structure that holds the symbol
and the Huffman code (accumulated so far), in line 4.

 1  color="#a020f0">from collections  color="#a020f0">import namedtuple
 color="#804040"> 2 
 color="#804040"> 3 Leaf = namedtuple(' color="#ff00ff">Leaf', 'weight, symbols')
 color="#804040"> 4 SH = namedtuple(' color="#ff00ff">SH', 'sym, huff')
 color="#804040"> 5 
 color="#804040"> 6 def  color="#008080">codecreate2(symbol2weights, tutor= False):
 color="#804040"> 7     ''' color="#ff00ff"> Huffman codecreate2 the given dict mapping symbols to weights '''
 color="#804040"> 8     heap = [ Leaf(weight=float(wt), symbols=[ SH(sym, []) ])
 color="#804040"> 9               color="#804040">for sym, wt  color="#804040">in symbol2weights.iteritems() ]
 color="#804040">10     heapify(heap)
 color="#804040">11     if tutor:  color="#804040">print " color="#ff00ff">ENCODING:", sorted(symbol2weights.iteritems())
 color="#804040">12     while len(heap) >1:
 color="#804040">13         lo = heappop(heap)
 color="#804040">14         hi = heappop(heap)
 color="#804040">15          color="#804040">if tutor:  color="#804040">print " color="#ff00ff">  COMBINING:", lo, ' color="#6a5acd">\n        AND:', hi
 color="#804040">16          color="#804040">for sh  color="#804040">in lo.symbols: sh.huff.insert(0, ' color="#ff00ff">0')
 color="#804040">17          color="#804040">for sh  color="#804040">in hi.symbols: sh.huff.insert(0, ' color="#ff00ff">1')
 color="#804040">18         lohi = Leaf(weight  = lo.weight + hi.weight,
 color="#804040">19                     symbols = lo.symbols + hi.symbols)
 color="#804040">20          color="#804040">if tutor:  color="#804040">print " color="#ff00ff">  PRODUCING:", lohi, ' color="#6a5acd">\n'
 color="#804040">21         heappush(heap, lohi)
 color="#804040">22     symbols = heappop(heap).symbols
 color="#804040">23     symbols = [SH(sym, ''.join(huff))  color="#804040">for sym, huff  color="#804040">in symbols]
 color="#804040">24     return sorted(symbols, key= color="#804040">lambda sh: (len(sh.huff), sh))
 color="#804040">25

The
class generation is succinct in lines 3 and 4, and I use the field
names for clarity for example when creating the original list of Leaves
that will form the heap in line 8

Printing namedtuples

The print statements stay the same, but before you got output like this:

style="font-family: monospace;"> style="font-family: monospace;">...

COMBINING: [2.5, ['C', []]] style="font-family: monospace;">

AND: [5.0, ['A', []]]

PRODUCING: [7.5, ['C', ['0']], ['A', ['1']]] style="font-family: monospace;">
...

Now you get the clearer:

style="font-family: monospace;"> ENCODING:
[('A', '5'), ('B', '25'), ('C', '2.5'), ('D', '12.5')]

COMBINING: Leaf(weight=2.5, symbols=[SH(sym='C', huff=[])])


AND: Leaf(weight=5.0, symbols=[SH(sym='A', huff=[])])

PRODUCING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])

COMBINING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])


AND: Leaf(weight=12.5, symbols=[SH(sym='D', huff=[])])

PRODUCING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])

COMBINING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])


AND: Leaf(weight=25.0, symbols=[SH(sym='B', huff=[])])

PRODUCING: Leaf(weight=45.0, symbols=[SH(sym='C', huff=['0', '0',
'0']), SH(sym='A', huff=['0', '0', '1']), SH(sym='D', huff=['0', '1']),
SH(sym='B', huff=['1'])])

Conclusion

Although
you can do so much with Python lists, if each position in the list has
a fixed meaning then you might be best to use tuples, and if you should
be using tuples, then namedtuples can make your code more readable
without bloating it with long class definitions.

- Paddy.

Go deh!

Wednesday, April 29, 2009

Fight for the community you want

Tuesday, April 21, 2009

Monitoring a linux process as it reads files

First create some test files to process

Now lets create a test executable to monitor

here is what the fd directory looks like

And here is a python script to monitor that fd directories
link number 4 periodically

Thursday, April 16, 2009

Bimap. Bi-directional mapping/dictionary for Python?

Friday, April 10, 2009

Knapsack solution by OO Calc

A quick recap of the problem

OO Calc setup

The Solver

Comment

Memoization and stack use.

Mutual Recursion

target="_blank">Wot, no memoize?

A blown stack

Thursday, April 02, 2009

(Ab)use of Pythons in-built data structures

namedtuple

Printing namedtuples

Conclusion

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive

Go deh!

Wednesday, April 29, 2009

Fight for the community you want

Tuesday, April 21, 2009

Monitoring a linux process as it reads files

First create some test files to process

Now lets create a test executable to monitor

here is what the fd directory looks like

And here is a python script to monitor that fd directorieslink number 4 periodically

Thursday, April 16, 2009

Bimap. Bi-directional mapping/dictionary for Python?

Friday, April 10, 2009

Knapsack solution by OO Calc

A quick recap of the problem

OO Calc setup

The Solver

Comment

Memoization and stack use.

Mutual Recursion

target="_blank">Wot, no memoize?

A blown stack

Thursday, April 02, 2009

(Ab)use of Pythons in-built data structures

namedtuple

Printing namedtuples

Conclusion

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive

And here is a python script to monitor that fd directories
link number 4 periodically