Wednesday, April 29, 2009
Fight for the community you want
Someone took offence at a puerile and sexist presentation given at what is supposed to be a technical event in the Rails community. Good on them. The community is what you make of it.
"Using sex to sell spanners" doesn't make me think they are good spanners.
- Paddy.
"Using sex to sell spanners" doesn't make me think they are good spanners.
- Paddy.
Tuesday, April 21, 2009
Monitoring a linux process as it reads files
I am processing 20 thousand files with some proprietary software and
needed to monitor how far it got in reading the files. In my own Python
version of the utility the reading of data was ten times faster than
the subsequent processing and i wanted to find out if this proprietary
solution, which was havinfg performance problems , was equally spending
most of its time reading data.
The proprietary program took a list of 20000 files to process as its
first argument and I remembered that on Linux, the /proc
directory had info on running processes and sure enough , the
/proc/<process id>/fd directory had info on all the file
descriptors currently open by the process as links. So by opening the
list of files to in my editor and searching within it for the file name
shown on one of the file descriptors, I could gauge how many
files been read so far.
I decided to automate the checking and wrote a shell script using
cat/fgrep/gawk/... that then told me what line in the list of files to
process the program was currently at.
Now I've had time to refine things to use mainly python but to
demonstrate its use I also have to generate a test environment
This script just holds each file open for reading for twenty seconds
before closing the file.
And here is a python script to monitor that fd directories
I watched the monitor output over the next couple of hours and found
out when file reading ended and processing of read data started.
END.
needed to monitor how far it got in reading the files. In my own Python
version of the utility the reading of data was ten times faster than
the subsequent processing and i wanted to find out if this proprietary
solution, which was havinfg performance problems , was equally spending
most of its time reading data.
The proprietary program took a list of 20000 files to process as its
first argument and I remembered that on Linux, the /proc
directory had info on running processes and sure enough , the
/proc/<process id>/fd directory had info on all the file
descriptors currently open by the process as links. So by opening the
list of files to in my editor and searching within it for the file name
shown on one of the file descriptors, I could gauge how many
files been read so far.
I decided to automate the checking and wrote a shell script using
cat/fgrep/gawk/... that then told me what line in the list of files to
process the program was currently at.
Now I've had time to refine things to use mainly python but to
demonstrate its use I also have to generate a test environment
First create some test files to process
style="font-family: monospace;">bash$ style="font-weight: bold;">mkdir -p /tmp/test
style="font-family: monospace;">
bash$ style="font-weight: bold;">for ((i=0; i < 100; i++))
do touch /tmp/test/file$i ;done
style="font-family: monospace;">
bash$ style="font-weight: bold;">/bin/ls /tmp/test/file* >
/tmp/test/all_files.lst
style="font-family: monospace;">
bash$ style="font-weight: bold;">head /tmp/test/all_files.lst
style="font-family: monospace;">
/tmp/test/file0
style="font-family: monospace;">
/tmp/test/file1
style="font-family: monospace;">
/tmp/test/file10
style="font-family: monospace;">
/tmp/test/file11
style="font-family: monospace;">
/tmp/test/file12
style="font-family: monospace;">
/tmp/test/file13
style="font-family: monospace;">
/tmp/test/file14
style="font-family: monospace;">
/tmp/test/file15
style="font-family: monospace;">
/tmp/test/file16
style="font-family: monospace;">
/tmp/test/file17
style="font-family: monospace;">
bash$
style="font-family: monospace;">
bash$ style="font-weight: bold;">for ((i=0; i < 100; i++))
do touch /tmp/test/file$i ;done
style="font-family: monospace;">
bash$ style="font-weight: bold;">/bin/ls /tmp/test/file* >
/tmp/test/all_files.lst
style="font-family: monospace;">
bash$ style="font-weight: bold;">head /tmp/test/all_files.lst
style="font-family: monospace;">
/tmp/test/file0
style="font-family: monospace;">
/tmp/test/file1
style="font-family: monospace;">
/tmp/test/file10
style="font-family: monospace;">
/tmp/test/file11
style="font-family: monospace;">
/tmp/test/file12
style="font-family: monospace;">
/tmp/test/file13
style="font-family: monospace;">
/tmp/test/file14
style="font-family: monospace;">
/tmp/test/file15
style="font-family: monospace;">
/tmp/test/file16
style="font-family: monospace;">
/tmp/test/file17
style="font-family: monospace;">
bash$
Now lets create a test executable to monitor
This script just holds each file open for reading for twenty seconds
before closing the file.
style="font-family: monospace;">bash$ style="font-weight: bold;">python -c ' style="color: rgb(0, 0, 153);">import sys,time
style="font-family: monospace; color: rgb(0, 0, 153);">
for
name in file(sys.argv[1]):
style="font-family: monospace; color: rgb(0, 0, 153);">
f = file(name.strip())
style="font-family: monospace; color: rgb(0, 0, 153);">
time.sleep(45)
style="font-family: monospace; color: rgb(0, 0, 153);">
f.close()
' style="font-weight: bold;">/tmp/test/all_files.lst
&
style="font-family: monospace;">
[2] style="font-weight: bold;"> style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984
style="font-family: monospace;">
bash$
style="font-family: monospace; color: rgb(0, 0, 153);">
for
name in file(sys.argv[1]):
style="font-family: monospace; color: rgb(0, 0, 153);">
f = file(name.strip())
style="font-family: monospace; color: rgb(0, 0, 153);">
time.sleep(45)
style="font-family: monospace; color: rgb(0, 0, 153);">
f.close()
' style="font-weight: bold;">/tmp/test/all_files.lst
&
style="font-family: monospace;">
[2] style="font-weight: bold;"> style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984
style="font-family: monospace;">
bash$
here is what the fd directory looks like
style="font-family: monospace;">bash$ ls -l /proc/ style="font-family: monospace; color: rgb(0, 0, 153);"> style="font-weight: bold; color: red;">7984 style="font-family: monospace;">/fd
style="font-family: monospace;">
total 0
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 0 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 1 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 2 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 3 -> /tmp/test/all_files.lst
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 4 -> /tmp/test/file0
style="font-family: monospace;">
bash$
style="font-family: monospace;">
total 0
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 0 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 1 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 2 -> /dev/tty1
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 3 -> /tmp/test/all_files.lst
style="font-family: monospace;">
lrwxrwxrwx 1 HP
DV8025EA None 0 Apr 21 22:17 4 -> /tmp/test/file0
style="font-family: monospace;">
bash$
And here is a python script to monitor that fd directories
link number 4 periodically
style="font-family: monospace;">bash$ style="font-weight: bold;">python -c ' style="color: rgb(0, 0, 153);">import sys,time,os,datetime
style="color: rgb(0, 0, 153);">
name2index =
dict((name.strip(), index) for index,name in
enumerate(file(sys.argv[1])))
style="color: rgb(0, 0, 153);">
all = len(name2index)
style="color: rgb(0, 0, 153);">
while True:
style="color: rgb(0, 0, 153);">
path =
os.path.realpath("/proc/7984/fd/ style="font-weight: bold; color: red;">4").strip()
style="color: rgb(0, 0, 153);">
print
name2index[path],"/",all, path, datetime.datetime.now().isoformat()
style="color: rgb(0, 0, 153);">
time.sleep(30)
' /tmp/test/all_files.lst
22 / 100 /tmp/test/file29 2009-04-21T22:34:07.817750
23 / 100 /tmp/test/file3 2009-04-21T22:34:37.820750
24 / 100 /tmp/test/file30 2009-04-21T22:35:07.825750
24 / 100 /tmp/test/file30 2009-04-21T22:35:37.834750
style="color: rgb(0, 0, 153);">
name2index =
dict((name.strip(), index) for index,name in
enumerate(file(sys.argv[1])))
style="color: rgb(0, 0, 153);">
all = len(name2index)
style="color: rgb(0, 0, 153);">
while True:
style="color: rgb(0, 0, 153);">
path =
os.path.realpath("/proc/7984/fd/ style="font-weight: bold; color: red;">4").strip()
style="color: rgb(0, 0, 153);">
name2index[path],"/",all, path, datetime.datetime.now().isoformat()
style="color: rgb(0, 0, 153);">
time.sleep(30)
' /tmp/test/all_files.lst
22 / 100 /tmp/test/file29 2009-04-21T22:34:07.817750
23 / 100 /tmp/test/file3 2009-04-21T22:34:37.820750
24 / 100 /tmp/test/file30 2009-04-21T22:35:07.825750
24 / 100 /tmp/test/file30 2009-04-21T22:35:37.834750
I watched the monitor output over the next couple of hours and found
out when file reading ended and processing of read data started.
END.
Thursday, April 16, 2009
Bimap. Bi-directional mapping/dictionary for Python?
I had a chance encounter with the boost C++ library documentation and
came across a description of their href="http://cablemodem.fibertel.com.ar/mcape/boost/libs/bimap/index.html">bimap
(that is bimap without a 't') .
It seems to be a bi-directional mapping.
In a Python dict, keys map to values. In a bimap, values would also map
to keys. It seems they put keys/values on an equal footing by renaming
them as left and right members, and having methods that work on either
the right to left or the left to right mapping.
I had just re-visited a program of mine that dealt with what I would
consider large sets of values. I was mapping test names to test
coverage where their were tens of trhousands of test names, each of
~100 characters long and for each file, I had many thousands of code
coverage points, which themselves were strings of ~150 characters each.
I new I had to work with a dict mapping test_names to sets of covered
points for each test and do a lot of set arithmatic to rank the tests
in order of contribution to the overall coverage figure, so I decided
to use indices; replacing files by negative numbers and coverage points
by positive numbers. My program works, and is very fast but I note that
I create both a testname2index mapping dict and the inverse
index2testname mapping, (and coverpoint2index and index2coverpoint
mapping dicts).
What I had been constructing, unawares, was a bimap by using two dicts.
I then spent time googling for Python bimaps to no avail, (although I
think there are haskel and Java implementations out there - and Google
'helps' by sugesting you meant bitmap with a 't').
So my questions are:
Methods
On my third question, I guess you could have all the normal methods of
a dict but preceeded by either
left_ or right_
with calling of a method without those prefixes raising an exception,
as it seems important to not favour one mapping direction over another.
This would give:
The above couldn't be extended for indexing however, unless you did
something like:
but then wouldn't it be better to use the dotted form for all the
methods and have:
I note that their are redundancies above, for example, left_clear ==
right_clear ...
Strewth, I've only scratched the surface :-)
- Paddy.
came across a description of their href="http://cablemodem.fibertel.com.ar/mcape/boost/libs/bimap/index.html">bimap
(that is bimap without a 't') .
It seems to be a bi-directional mapping.
In a Python dict, keys map to values. In a bimap, values would also map
to keys. It seems they put keys/values on an equal footing by renaming
them as left and right members, and having methods that work on either
the right to left or the left to right mapping.
I had just re-visited a program of mine that dealt with what I would
consider large sets of values. I was mapping test names to test
coverage where their were tens of trhousands of test names, each of
~100 characters long and for each file, I had many thousands of code
coverage points, which themselves were strings of ~150 characters each.
I new I had to work with a dict mapping test_names to sets of covered
points for each test and do a lot of set arithmatic to rank the tests
in order of contribution to the overall coverage figure, so I decided
to use indices; replacing files by negative numbers and coverage points
by positive numbers. My program works, and is very fast but I note that
I create both a testname2index mapping dict and the inverse
index2testname mapping, (and coverpoint2index and index2coverpoint
mapping dicts).
What I had been constructing, unawares, was a bimap by using two dicts.
I then spent time googling for Python bimaps to no avail, (although I
think there are haskel and Java implementations out there - and Google
'helps' by sugesting you meant bitmap with a 't').
So my questions are:
- Is their a Python bimap implementation out there?
- Would anyone want one? (I already code around its
absense). - What would its methods be?
Methods
On my third question, I guess you could have all the normal methods of
a dict but preceeded by either
left_ or right_
with calling of a method without those prefixes raising an exception,
as it seems important to not favour one mapping direction over another.
This would give:
style="font-family: monospace; font-weight: bold; text-decoration: underline;">
DICT BIMOD
(LEFT) BIMOD (RIGHT)
style="font-family: monospace; font-weight: bold;">
style="font-family: monospace;">
clear
left_clear right_clear
copy
left_copy right_copy
fromkeys
left_fromkeys right_fromkeys
style="font-family: monospace;">
get
left_get
right_get
style="font-family: monospace;">
has_key
left_has_key right_has_key
style="font-family: monospace;">
items
left_items right_items
iteritems left_iteritems
right_iteritems
iterkeys
left_iterkeys right_iterkeys
style="font-family: monospace;">
itervalues
left_itervalues right_itervalues
style="font-family: monospace;">
keys
left_keys right_keys
pop
left_pop
right_pop
style="font-family: monospace;">
popitem
left_popitem right_popitem
style="font-family: monospace;">
setdefault
left_setdefault right_setdefault
style="font-family: monospace;">
update
left_update right_update
style="font-family: monospace;">
values
left_values right_values
DICT BIMOD
(LEFT) BIMOD (RIGHT)
style="font-family: monospace; font-weight: bold;">
style="font-family: monospace;">
clear
left_clear right_clear
copy
left_copy right_copy
fromkeys
left_fromkeys right_fromkeys
style="font-family: monospace;">
get
left_get
right_get
style="font-family: monospace;">
has_key
left_has_key right_has_key
style="font-family: monospace;">
items
left_items right_items
iteritems left_iteritems
right_iteritems
iterkeys
left_iterkeys right_iterkeys
style="font-family: monospace;">
itervalues
left_itervalues right_itervalues
style="font-family: monospace;">
keys
left_keys right_keys
pop
left_pop
right_pop
style="font-family: monospace;">
popitem
left_popitem right_popitem
style="font-family: monospace;">
setdefault
left_setdefault right_setdefault
style="font-family: monospace;">
update
left_update right_update
style="font-family: monospace;">
values
left_values right_values
The above couldn't be extended for indexing however, unless you did
something like:
style="font-family: monospace;">bimod_instance.left[leftindex]
and bimod_instance..right[rightindex]
and bimod_instance..right[rightindex]
but then wouldn't it be better to use the dotted form for all the
methods and have:
style="font-family: monospace;">
bimod_instance.left.haskey # (with a dot after left), ...
bimod_instance.left.haskey # (with a dot after left), ...
I note that their are redundancies above, for example, left_clear ==
right_clear ...
Strewth, I've only scratched the surface :-)
- Paddy.
Friday, April 10, 2009
Knapsack solution by OO Calc
I had previously solved the href="http://www.rosettacode.org/wiki/Knapsack_Problem"
target="_blank">knapsack problem in Python. I came
across the Solver function of OpenOffice Calc and so tried to use it to
solve knapsack.
cellpadding="2" cellspacing="2">
nowrap="nowrap" valign="middle">Item
nowrap="nowrap" valign="middle">Explanation
nowrap="nowrap" valign="middle">Value (each)
nowrap="nowrap" valign="middle">weight
nowrap="nowrap" valign="middle">Volume (each)
panacea
(vials of)
Incredible
healing properties
3000
0.3
0.025
ichor
(ampules of)
Vampires
blood
1800
0.2
0.015
gold
(bars)
Shiney
shiney
2500
2.0
0.002
align="left" nowrap="nowrap" valign="middle">Knapsack
align="left" nowrap="nowrap" valign="middle">For
the carrying of
align="left" nowrap="nowrap" valign="middle">-
align="left" nowrap="nowrap" valign="middle"><=25
align="left" nowrap="nowrap" valign="middle"><=0.25
I am using version 3.0.1. I opened a new spreadsheat and literally
cut-n-pasted the above table into the sheet. The solver needs to be
able to change some cells (the variables) that affect one cell
containing the value to optimise.. I added the table in blue, at the
right of the figure below to calculate the value, weight and
volume of what is in the knapsack based on the amount of each item
chosen, so you can change what is in the style="font-weight: bold;">Number column and
get new totals calculated in the Totals
row
I show the formulae in the cells in the image above.
Menu Tools-> Solver shows a form for filling in. from top to
bottom:
You also need to click the Solvers Options
button and ensure the options are set like this:
Ok the Options, then hit Solve on the Solver window and eventually you
will get the following result:
The solver has correctly determined that carrying no panacea, fifteen
ampules of ichor, and eleven gold bars will maximise the value, subject
to the given constraints..
Their are actually other values that give the same total value under
the same constraints but I can't find an easy way for the solver to
give them all.
END.
target="_blank">knapsack problem in Python. I came
across the Solver function of OpenOffice Calc and so tried to use it to
solve knapsack.
A quick recap of the problem
A traveller gets diverted and has to make an unscheduled stop
in
what turns out to be Shangri La. Opting to leave, he is allowed to take
as much as he likes of the following items, so long as it will fit in
his knapsack, and he can carry it.
He knows that he can carry no more than 25 'weights' in total; and that
the capacity of his knapsack is 0.25 'cubic lengths'.
Looking just above the bar codes on the items he finds their
weights and volumes. He digs out his recent copy of a financial paper
and gets the value of each item.
(vials of)
healing properties
(ampules of)
blood
(bars)
shiney
the carrying of
He can only take whole units of any item,, but there is much
more of any item than he could ever carry
How many of each item does he take to maximise the
value of items he is carrying away with him?
Note:
- There are four solutions that maximise the value taken.
Only one need be given.
OO Calc setup
I am using version 3.0.1. I opened a new spreadsheat and literally
cut-n-pasted the above table into the sheet. The solver needs to be
able to change some cells (the variables) that affect one cell
containing the value to optimise.. I added the table in blue, at the
right of the figure below to calculate the value, weight and
volume of what is in the knapsack based on the amount of each item
chosen, so you can change what is in the style="font-weight: bold;">Number column and
get new totals calculated in the Totals
row
style="width: 914px; height: 444px;"
alt="Knapsack problem formulae" title="(Showing cell formulae)"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSTTwTk_nEMgY1qLVv_QXSamy1aaWjm-H73uDgKHNjbvVB0QeTQR-EVT5tCoZtaWPjmLw3RwEbFD5kBXUJ31NSYXgM-JeYFub6uyU1ZIjFKXj1e6ltYRnG3beA0o7jgjED6D3i/">
alt="Knapsack problem formulae" title="(Showing cell formulae)"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSTTwTk_nEMgY1qLVv_QXSamy1aaWjm-H73uDgKHNjbvVB0QeTQR-EVT5tCoZtaWPjmLw3RwEbFD5kBXUJ31NSYXgM-JeYFub6uyU1ZIjFKXj1e6ltYRnG3beA0o7jgjED6D3i/">
I show the formulae in the cells in the image above.
The Solver
Menu Tools-> Solver shows a form for filling in. from top to
bottom:
- The target cell is the cell I want to maximise, which is
the total value of all items in cell H8 - I want to Maximise.
- I want to change the Number
of each item i.e. vary cells G4:G6 - My limiting conditions are:
- The weight is <= 25
- The volume is <= 0.25
- The Number of each item cannot be negative.
style="width: 455px; height: 373px;" alt="Solver Menu"
title="Solver menu"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_hFwnFTntRnmBiKs5Be81nQN6-LVoVgsi94Yl6FTgi1eM3a2IijSDPcmUOdjbh2ME_jNSpWqkScBW-9KHrl952p7MXhYT8lV_csPjRzeciF1i3Qij3K3m5JAgXQQF4by_bGoH/">
title="Solver menu"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_hFwnFTntRnmBiKs5Be81nQN6-LVoVgsi94Yl6FTgi1eM3a2IijSDPcmUOdjbh2ME_jNSpWqkScBW-9KHrl952p7MXhYT8lV_csPjRzeciF1i3Qij3K3m5JAgXQQF4by_bGoH/">
You also need to click the Solvers Options
button and ensure the options are set like this:
style="width: 431px; height: 286px;" alt="Solver Options"
title="Solver Options"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYg4BmKl0Ye99EgQ7bGvj4D_xEkf-oUNIug2hvpp9PAmjsa048nuFPSSdtMsYysNvh98NAMMpphy7jWQHRZhXWHg8fgvcUY_iH2WnsDAc5cUITezPm93HXGvNmQpx1Fj99Gxk-/">
title="Solver Options"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYg4BmKl0Ye99EgQ7bGvj4D_xEkf-oUNIug2hvpp9PAmjsa048nuFPSSdtMsYysNvh98NAMMpphy7jWQHRZhXWHg8fgvcUY_iH2WnsDAc5cUITezPm93HXGvNmQpx1Fj99Gxk-/">
Ok the Options, then hit Solve on the Solver window and eventually you
will get the following result:
style="width: 914px; height: 444px;"
alt="Result of Knapsack constraints problem"
title="Result of Knapsack constraints problem"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoq_CC0HfR9EzsFU6E9dXcxmLk1zXhqZxkuu2tCe1DWK-_REuVKnTHBGO9KhzvQ-cIJfskAb24DzRT8bMv2Wb5lshQw7ec5YGT6IN4VEgvR8MlCZNBM0wRJDOAgXqzn-tSZu2J/">
alt="Result of Knapsack constraints problem"
title="Result of Knapsack constraints problem"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoq_CC0HfR9EzsFU6E9dXcxmLk1zXhqZxkuu2tCe1DWK-_REuVKnTHBGO9KhzvQ-cIJfskAb24DzRT8bMv2Wb5lshQw7ec5YGT6IN4VEgvR8MlCZNBM0wRJDOAgXqzn-tSZu2J/">
The solver has correctly determined that carrying no panacea, fifteen
ampules of ichor, and eleven gold bars will maximise the value, subject
to the given constraints..
Comment
Their are actually other values that give the same total value under
the same constraints but I can't find an easy way for the solver to
give them all.
END.
Memoization and stack use.
I was caught out today when investigating how to apply memoization to
allow computation of larger values of a recursive function.
I had thought up a new task for Rosetta Code called href="http://www.rosettacode.org/wiki/Mutual_Recursion">Mutual
Recursion, as I had remembered that some languages needed
at least the signature of a function to be defined before it
could be called, and so creating mutually recursive functions would
allow you to compare the languages in an interesting way - which is
what R.C. is all about.
For the example for implementation I used href="http://en.wikipedia.org/wiki/Hofstadter_sequence#Hofstadter_Female_and_Male_sequences"
target="_blank">Hoffstadters Female and Male
sequences, and produced the following href="http://www.rosettacode.org/wiki/Mutual_Recursion#Python">Python
definitions:
It seems that other members of R.C. liked the task and soon added many
examples in other languages.
Apart from the computation of the series for 0<=n<20, I
did try to compute F(200) but it blew the stack so I went to bed and
decided to memoize the functions another day.
Memoization came very easily to mind, so I thought that Python would
have such available as a decorator waiting to be applied. No such luck,
I would have to craft my own . I did remember correctly however that
there was some helper function that made wrapped functions
look a lot more like what had been wrapped, and so used functools.wrap:
I haven't got over my (probably infantile) lust for attaching
attributes to functions so justified memoize0 as it allows you to look
at the contents of the cache. Combined with the fact that you cannot do
a memoization decorator without applying it to a href="http://en.wikipedia.org/wiki/Fibonacci_function">Fibonacci
number function:
It is important that caches are not shared for mutually recursive
functions.
I had initial success with the Male and Female sequence and was easily
able to compute F(200) with:
But I got cocky, as F(2000) ran out of stack.
A little thought made me realise that I only had a cached results for F
and M up to around 200 and so a call of F(2000) was going to eat up
huge amounts of stack before recursing down to cached values.
With that knowledge in mind I thought I might gradually jump to higher
values of n and, indeed, the following worked:
In real-world applicatrions, you might have to have some scheme to
reduce or limit cache size if memory rather than the stack becomes an
issue, and I have a great solution that I have just completed in the
margin :-)
- Paddy.
allow computation of larger values of a recursive function.
Mutual Recursion
I had thought up a new task for Rosetta Code called href="http://www.rosettacode.org/wiki/Mutual_Recursion">Mutual
Recursion, as I had remembered that some languages needed
at least the signature of a function to be defined before it
could be called, and so creating mutually recursive functions would
allow you to compare the languages in an interesting way - which is
what R.C. is all about.
For the example for implementation I used href="http://en.wikipedia.org/wiki/Hofstadter_sequence#Hofstadter_Female_and_Male_sequences"
target="_blank">Hoffstadters Female and Male
sequences, and produced the following href="http://www.rosettacode.org/wiki/Mutual_Recursion#Python">Python
definitions:
def color="#008080">F(n): return 1 color="#804040">if n == 0 color="#804040">else n - M(F(n-1))
color="#804040">def color="#008080">M(n): return 0 color="#804040">if n == 0 color="#804040">else n - F(M(n-1))
It seems that other members of R.C. liked the task and soon added many
examples in other languages.
Apart from the computation of the series for 0<=n<20, I
did try to compute F(200) but it blew the stack so I went to bed and
decided to memoize the functions another day.
target="_blank">Wot, no memoize?
Memoization came very easily to mind, so I thought that Python would
have such available as a decorator waiting to be applied. No such luck,
I would have to craft my own . I did remember correctly however that
there was some helper function that made wrapped functions
look a lot more like what had been wrapped, and so used functools.wrap:
from functools color="#a020f0">import wraps
color="#804040">def color="#008080">memoize0(func):
' color="#ff00ff"> Adds cache as attribute to wrapped function for ease of access'
color="#a020f0">@wraps(func)
color="#804040">def color="#008080">wrapper(*args):
argsl = tuple(args)
color="#804040">if argsl color="#804040">in wrapper._cache:
color="#804040">return wrapper._cache[argsl]
color="#804040">else:
color="#804040">return wrapper._cache.setdefault(argsl, func(*args))
wrapper._cache = dict()
color="#804040">return wrapper
color="#804040">def color="#008080">memoize(func):
' color="#ff00ff"> Creates closure around locally created cache'
cache = dict()
color="#a020f0">@wraps(func)
color="#804040">def color="#008080">wrapper(*args):
argsl = tuple(args)
color="#804040">if argsl color="#804040">in cache:
color="#804040">return cache[argsl]
color="#804040">else:
color="#804040">return cache.setdefault(argsl, func(*args))
color="#804040">return wrapper
I haven't got over my (probably infantile) lust for attaching
attributes to functions so justified memoize0 as it allows you to look
at the contents of the cache. Combined with the fact that you cannot do
a memoization decorator without applying it to a href="http://en.wikipedia.org/wiki/Fibonacci_function">Fibonacci
number function:
if __name__ == ' color="#ff00ff">__main__':
color="#a020f0">@memoize0
color="#804040">def color="#008080">fib(n): return n color="#804040">if n color="#804040">in (0,1) color="#804040">else fib(n - 1) + fib(n - 2)
color="#a020f0">@memoize0
color="#804040">def color="#008080">fib2(n): color="#804040">return n color="#804040">if n color="#804040">in (0,1) color="#804040">else fib2(n - 1) + fib2(n - 2)
color="#804040">assert fib._cache color="#804040">is color="#804040">not fib2._cache
fib(50); fib2(50)
color="#804040">assert fib._cache == fib2._cache color="#804040">and (fib._cache color="#804040">is color="#804040">not fib2._cache)
It is important that caches are not shared for mutually recursive
functions.
A blown stack
I had initial success with the Male and Female sequence and was easily
able to compute F(200) with:
@ color="#008080">memoize
color="#804040">def color="#008080">F(n):
' color="#ff00ff"> Hofstadters Female Male sequences'
color="#804040">return 1 color="#804040">if n == 0 color="#804040">else n - M(F(n-1))
color="#a020f0">@memoize
color="#804040">def color="#008080">M(n):
' color="#ff00ff"> Hofstadters Female Male sequences'
color="#804040">return 0 color="#804040">if n == 0 color="#804040">else n - F(M(n-1))
color="#804040">assert F(200) == 124
But I got cocky, as F(2000) ran out of stack.
A little thought made me realise that I only had a cached results for F
and M up to around 200 and so a call of F(2000) was going to eat up
huge amounts of stack before recursing down to cached values.
With that knowledge in mind I thought I might gradually jump to higher
values of n and, indeed, the following worked:
print ([F(n) color="#804040">for n color="#804040">in range(0,20000+1,100)])
In real-world applicatrions, you might have to have some scheme to
reduce or limit cache size if memory rather than the stack becomes an
issue, and I have a great solution that I have just completed in the
margin :-)
- Paddy.
Thursday, April 02, 2009
(Ab)use of Pythons in-built data structures
I like to use Pythons in-built data structures quit a lot, and tend to
force myself to ask whether I should create my own classes,
which allows you to use meaningful names for fields and add
comments/docstrings to the datastructure but usually at the cost of
adding more lines of text.
After writing about Huffman codes and posting solutions to href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Python">Rosetta
code and href="http://paddy3118.blogspot.com/2009/03/huffman-encoding-in-python.html">my
blog:
Someone posted a href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Java">
Java solution, that was over three times as long. Instead of
just counting and comparing line counts, I took a closer
look to see what the extra code was for.
I
noticed that the Java was tidy, and that they had defined their own
classes for a HuffmanLeaf structure used when constructing a
HuffmanTree.
In my original Python, I used a nested list as
the equivalent to the HuffmanLeaf of the Java example. It worked, but I
had to introduce 'magic constants' to access the fields of the Python
leaf , which is also un-named as a structure in the program (lline 33
of
the Python code creates a list of Leaf structures):
Now I didn't want to go to the trouble of creating my own
classes, but that's were the new href="http://docs.python.org/dev/py3k/library/collections.html#collections.namedtuple">namedtuple
class factory of Python 2.6 came to the rescue.
namedtuple
will generate a subtype of the tuple class that allows the fields of
the tuple to be named and accessed via subscription, [], or as if the
fields are instance variable names.
I modified my Python program to use two named tuples:
The
class generation is succinct in lines 3 and 4, and I use the field
names for clarity for example when creating the original list of Leaves
that will form the heap in line 8
The print statements stay the same, but before you got output like this:
Now you get the clearer:
Although
you can do so much with Python lists, if each position in the list has
a fixed meaning then you might be best to use tuples, and if you should
be using tuples, then namedtuples can make your code more readable
without bloating it with long class definitions.
- Paddy.
force myself to ask whether I should create my own classes,
which allows you to use meaningful names for fields and add
comments/docstrings to the datastructure but usually at the cost of
adding more lines of text.
After writing about Huffman codes and posting solutions to href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Python">Rosetta
code and href="http://paddy3118.blogspot.com/2009/03/huffman-encoding-in-python.html">my
blog:
29 color="#804040">def color="#008080">codecreate(symbol2weights, tutor= False):
color="#804040"> 30 ''' color="#ff00ff"> Huffman encode the given dict mapping symbols to weights '''
color="#804040"> 31 global decode
color="#804040"> 32
color="#804040"> 33 heap = [ [float(wt), [[sym, []]], repr(sym)] color="#804040">for sym, wt color="#804040">in symbol2weights.iteritems() ]
color="#804040"> 34 heapify(heap)
color="#804040"> 35 if tutor: color="#804040">print " color="#ff00ff">ENCODING:", sorted(symbol2weights.iteritems())
color="#804040"> 36 while len(heap) >1:
color="#804040"> 37 lo = heappop(heap)
color="#804040"> 38 hi = heappop(heap)
color="#804040"> 39 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> COMBINING:", lo, ' color="#6a5acd">\n AND:', hi
color="#804040"> 40 color="#804040">for i color="#804040">in lo[1]: i[1].insert(0, ' color="#ff00ff">0')
color="#804040"> 41 color="#804040">for i color="#804040">in hi[1]: i[1].insert(0, ' color="#ff00ff">1')
color="#804040"> 42 lohi = [ lo[0] + hi[0] ] + [lo[1] + hi[1]]
color="#804040"> 43 lohi.append(' color="#ff00ff">(%s if nextbit() else %s)' % (hi[2], lo[2]))
color="#804040"> 44 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> PRODUCING:", lohi, ' color="#6a5acd">\n'
color="#804040"> 45 heappush(heap, lohi)
color="#804040"> 46 wt, codes, decoder = heappop(heap)
color="#804040"> 47 decode = eval(' color="#ff00ff">lambda :' + decoder, globals())
color="#804040"> 48 decode.__doc__ = decoder
color="#804040"> 49 for i color="#804040">in codes: i[1] = ''.join(i[1])
color="#804040"> 50 #for i in codes: i[::] = i[:2]
color="#804040"> 51 return sorted(codes, key= color="#804040">lambda x: (len(x[-1]), x))
color="#804040"> 52
Someone posted a href="http://www.rosettacode.org/w/index.php?title=Huffman_codes&oldid=27804#Java">
Java solution, that was over three times as long. Instead of
just counting and comparing line counts, I took a closer
look to see what the extra code was for.
import java.util.*;
color="#2e8b57">abstract color="#2e8b57">class HuffmanTree color="#2e8b57">implements Comparable<HuffmanTree> {
color="#2e8b57">public color="#2e8b57">int frequency; color="#0000ff">// the frequency of this tree
color="#2e8b57">public HuffmanTree( color="#2e8b57">int freq) { frequency = freq; }
color="#0000ff">// compares on the frequency
color="#2e8b57">public color="#2e8b57">int compareTo(HuffmanTree tree) {
color="#804040">return frequency - tree.frequency;
}
}
color="#2e8b57">class HuffmanLeaf color="#2e8b57">extends HuffmanTree {
color="#2e8b57">public color="#2e8b57">char value; color="#0000ff">// the character this leaf represents
color="#2e8b57">public HuffmanLeaf( color="#2e8b57">int freq, color="#2e8b57">char val) {
color="#2e8b57">super(freq);
value = val;
}
}
color="#2e8b57">class HuffmanNode color="#2e8b57">extends HuffmanTree {
color="#2e8b57">public HuffmanTree left, right; color="#0000ff">// subtrees
color="#2e8b57">public HuffmanNode(HuffmanTree l, HuffmanTree r) {
color="#2e8b57">super(l.frequency + r.frequency);
left = l;
right = r;
}
}
color="#2e8b57">public color="#2e8b57">class HuffmanCode {
color="#0000ff">// input is an array of frequencies, indexed by character code
color="#2e8b57">public color="#2e8b57">static HuffmanTree buildTree( color="#2e8b57">int[] charFreqs) {
PriorityQueue<HuffmanTree> trees = color="#804040">new PriorityQueue<HuffmanTree>();
color="#0000ff">// initially, we have a forest of leaves
color="#0000ff">// one for each non-empty character
color="#804040">for ( color="#2e8b57">int i = color="#ff00ff">0; i < charFreqs.length; i++)
color="#804040">if (charFreqs[i] > color="#ff00ff">0)
trees.offer( color="#804040">new HuffmanLeaf(charFreqs[i], ( color="#2e8b57">char)i));
color="#804040">assert trees.size() > color="#ff00ff">0;
color="#0000ff">// loop until there is only one tree left
color="#804040">while (trees.size() > color="#ff00ff">1) {
color="#0000ff">// two trees with least frequency
HuffmanTree a = trees.poll();
HuffmanTree b = trees.poll();
color="#0000ff">// put into new node and re-insert into queue
trees.offer( color="#804040">new HuffmanNode(a, b));
}
color="#804040">return trees.poll();
}
color="#2e8b57">public color="#2e8b57">static color="#2e8b57">void printCodes(HuffmanTree tree, Stack<Character> prefix) {
color="#804040">assert tree != color="#ff00ff">null;
color="#804040">if (tree color="#804040">instanceof HuffmanLeaf) {
HuffmanLeaf leaf = (HuffmanLeaf)tree;
color="#0000ff">// print out character and frequency
System.out.print(leaf.value + color="#ff00ff">"\t color="#ff00ff">" + leaf.frequency + color="#ff00ff">"\t color="#ff00ff">");
color="#0000ff">// print out code for this leaf, which is just the prefix
color="#804040">for ( color="#2e8b57">char bit : prefix)
System.out.print(bit);
System.out.println();
} color="#804040">else color="#804040">if (tree color="#804040">instanceof HuffmanNode) {
HuffmanNode node = (HuffmanNode)tree;
color="#0000ff">// traverse left
prefix.push( color="#ff00ff">'0');
printCodes(node.left, prefix);
prefix.pop();
color="#0000ff">// traverse right
prefix.push( color="#ff00ff">'1');
printCodes(node.right, prefix);
prefix.pop();
}
}
color="#2e8b57">public color="#2e8b57">static color="#2e8b57">void main(String[] args) {
String test = color="#ff00ff">"this is an example for huffman encoding";
color="#0000ff">// we will assume that all our characters will have
color="#0000ff">// code less than 256, for simplicity
color="#2e8b57">int[] charFreqs = color="#804040">new color="#2e8b57">int[ color="#ff00ff">256];
color="#0000ff">// read each character and record the frequencies
color="#804040">for ( color="#2e8b57">char c : test.toCharArray())
charFreqs[c]++;
color="#0000ff">// build tree
HuffmanTree tree = buildTree(charFreqs);
color="#0000ff">// print out results
System.out.println( color="#ff00ff">"SYMBOL\t color="#ff00ff">WEIGHT\t color="#ff00ff">HUFFMAN CODE");
printCodes(tree, color="#804040">new Stack<Character>());
}
}
I
noticed that the Java was tidy, and that they had defined their own
classes for a HuffmanLeaf structure used when constructing a
HuffmanTree.
In my original Python, I used a nested list as
the equivalent to the HuffmanLeaf of the Java example. It worked, but I
had to introduce 'magic constants' to access the fields of the Python
leaf , which is also un-named as a structure in the program (lline 33
of
the Python code creates a list of Leaf structures):
33 heap = [ [float(wt), [sym, []]] color="#804040">for sym, wt color="#804040">in symbol2weights.iteritems() ]
Now I didn't want to go to the trouble of creating my own
classes, but that's were the new href="http://docs.python.org/dev/py3k/library/collections.html#collections.namedtuple">namedtuple
class factory of Python 2.6 came to the rescue.
namedtuple
namedtuple
will generate a subtype of the tuple class that allows the fields of
the tuple to be named and accessed via subscription, [], or as if the
fields are instance variable names.
I modified my Python program to use two named tuples:
- For the Leaf structure as a whole in line 3.
- For a component of the Leaf structure that holds the symbol
and the Huffman code (accumulated so far), in line 4.
1 color="#a020f0">from collections color="#a020f0">import namedtuple
color="#804040"> 2
color="#804040"> 3 Leaf = namedtuple(' color="#ff00ff">Leaf', 'weight, symbols')
color="#804040"> 4 SH = namedtuple(' color="#ff00ff">SH', 'sym, huff')
color="#804040"> 5
color="#804040"> 6 def color="#008080">codecreate2(symbol2weights, tutor= False):
color="#804040"> 7 ''' color="#ff00ff"> Huffman codecreate2 the given dict mapping symbols to weights '''
color="#804040"> 8 heap = [ Leaf(weight=float(wt), symbols=[ SH(sym, []) ])
color="#804040"> 9 color="#804040">for sym, wt color="#804040">in symbol2weights.iteritems() ]
color="#804040">10 heapify(heap)
color="#804040">11 if tutor: color="#804040">print " color="#ff00ff">ENCODING:", sorted(symbol2weights.iteritems())
color="#804040">12 while len(heap) >1:
color="#804040">13 lo = heappop(heap)
color="#804040">14 hi = heappop(heap)
color="#804040">15 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> COMBINING:", lo, ' color="#6a5acd">\n AND:', hi
color="#804040">16 color="#804040">for sh color="#804040">in lo.symbols: sh.huff.insert(0, ' color="#ff00ff">0')
color="#804040">17 color="#804040">for sh color="#804040">in hi.symbols: sh.huff.insert(0, ' color="#ff00ff">1')
color="#804040">18 lohi = Leaf(weight = lo.weight + hi.weight,
color="#804040">19 symbols = lo.symbols + hi.symbols)
color="#804040">20 color="#804040">if tutor: color="#804040">print " color="#ff00ff"> PRODUCING:", lohi, ' color="#6a5acd">\n'
color="#804040">21 heappush(heap, lohi)
color="#804040">22 symbols = heappop(heap).symbols
color="#804040">23 symbols = [SH(sym, ''.join(huff)) color="#804040">for sym, huff color="#804040">in symbols]
color="#804040">24 return sorted(symbols, key= color="#804040">lambda sh: (len(sh.huff), sh))
color="#804040">25
The
class generation is succinct in lines 3 and 4, and I use the field
names for clarity for example when creating the original list of Leaves
that will form the heap in line 8
Printing namedtuples
The print statements stay the same, but before you got output like this:
style="font-family: monospace;"> style="font-family: monospace;">...
COMBINING: [2.5, ['C', []]]
style="font-family: monospace;">
AND: [5.0, ['A', []]]
PRODUCING: [7.5, ['C', ['0']], ['A', ['1']]]
style="font-family: monospace;">
...
COMBINING: [2.5, ['C', []]]
style="font-family: monospace;">
AND: [5.0, ['A', []]]
PRODUCING: [7.5, ['C', ['0']], ['A', ['1']]]
style="font-family: monospace;">
...
Now you get the clearer:
style="font-family: monospace;"> ENCODING:
[('A', '5'), ('B', '25'), ('C', '2.5'), ('D', '12.5')]
COMBINING: Leaf(weight=2.5, symbols=[SH(sym='C', huff=[])])
AND: Leaf(weight=5.0, symbols=[SH(sym='A', huff=[])])
PRODUCING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])
COMBINING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])
AND: Leaf(weight=12.5, symbols=[SH(sym='D', huff=[])])
PRODUCING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])
COMBINING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])
AND: Leaf(weight=25.0, symbols=[SH(sym='B', huff=[])])
PRODUCING: Leaf(weight=45.0, symbols=[SH(sym='C', huff=['0', '0',
'0']), SH(sym='A', huff=['0', '0', '1']), SH(sym='D', huff=['0', '1']),
SH(sym='B', huff=['1'])])
[('A', '5'), ('B', '25'), ('C', '2.5'), ('D', '12.5')]
COMBINING: Leaf(weight=2.5, symbols=[SH(sym='C', huff=[])])
AND: Leaf(weight=5.0, symbols=[SH(sym='A', huff=[])])
PRODUCING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])
COMBINING: Leaf(weight=7.5, symbols=[SH(sym='C',
huff=['0']), SH(sym='A', huff=['1'])])
AND: Leaf(weight=12.5, symbols=[SH(sym='D', huff=[])])
PRODUCING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])
COMBINING: Leaf(weight=20.0, symbols=[SH(sym='C', huff=['0', '0']),
SH(sym='A', huff=['0', '1']), SH(sym='D', huff=['1'])])
AND: Leaf(weight=25.0, symbols=[SH(sym='B', huff=[])])
PRODUCING: Leaf(weight=45.0, symbols=[SH(sym='C', huff=['0', '0',
'0']), SH(sym='A', huff=['0', '0', '1']), SH(sym='D', huff=['0', '1']),
SH(sym='B', huff=['1'])])
Conclusion
Although
you can do so much with Python lists, if each position in the list has
a fixed meaning then you might be best to use tuples, and if you should
be using tuples, then namedtuples can make your code more readable
without bloating it with long class definitions.
- Paddy.