Thursday, December 03, 2009

What is a first-class function?

There is a debate about it on Wikipedia, and I am curious:
  1. Does a language have first-class functions if it can use its functions to solve particular problems in a certain manner, or
  2. Does a language have first-class functions if functions are just as flexible as other types in the language itself?
It probably means 1. above, as the second definition would omit most static languages, as they might be able, for example, to take textual input and change it into an internal representation of an integer, but not do the same for a function with similar effort. Someone who exclusively thinks in terms of static languages may not have realized a difference.

Saturday, November 28, 2009

Hamming Numbers (and Perl: Why I just ignored it)

I had read the following post: "Perl: Love it, or hate it, but don’t ignore
it."
, by Phillip Smith, in which the author wishes Perl
solutions were credited as much as solutions in other languages such as
Python and Ruby.



I agreed that Perl wasn't mentioned as much as it was, and a day or so
later I came across an example of why....



I monitor whats happening in Python, and came across a new recipe added
to Activestate code called: "Recipe 576961: Technique for cyclical
iteration"
, by Raymond Hettinger. On first, second, and third
readings, i just couldn't understand the need for the deferred_output
function; the whole recipe remained opaque:



Raymond's
code:



from itertools import tee, chain, islice
from heapq import merge

def hamming_numbers():
# Generate "5-smooth" numbers, also called "Hamming numbers"
# or "Regular numbers". See: http://en.wikipedia.org/wiki/Regular_number
# Finds solutions to 2**i * 3**j * 5**k for some integers i, j, and k.

def no_repeats(it):
last = None
for i in it:
if i is not last:
yield i
last = i

def deferred_output():
for i in output:
yield i

result, p2, p3, p5 = tee(deferred_output(), 4)
m2 = (2*x for x in p2) # multiples of 2
m3 = (3*x for x in p3) # multiples of 3
m5 = (5*x for x in p5) # multiples of 5
merged = merge(m2, m3, m5)
combined = chain([1], merged) # prepend a starting point
output = no_repeats(combined) # eliminate duplicates

return result

if __name__ == '__main__':
print(list(islice(hamming_numbers(), 1000))) # Show first 1000 hamming numbers




I was hooked by the series however and so went looking for more
information.



Perl


I came across a mention to the following page: "Infinite
lists in Perl"
by Mark Dominus which is copyright 1997 by the
Perl Journal. Normally I wouldn't mention the article, as it spends
more time creating an implementation of generators in Perl as it does
on addressing the computation of Hamming numbers. It also uses the term
"stream" instead of generators. but maybe this is down to the age of
the article, i.e. 1997?



Haskell


Looking a bit further, I found this Dr Dobbs Article, and this
commented Haskell
solution
:


hamming = 1 : map (2*) hamming `merge` map (3*) hamming `merge` map (5*) hamming
where merge (x:xs) (y:ys)
| x < y = x : xs `merge` (y:ys)
| x > y = y : (x:xs) `merge` ys
| otherwise = x : xs `merge` ys


The Haskell solution lead me to think that maybe Python has an issue
with defining recursive generators, and some experimentation lead me to
believe that yep, maybe that was why Raymond has the deferred_output
function.



My Python


Since I have trouble understanding Raymond s I went away and wrote my
own solution. It works. It can generate very large values of the
sequence quite quickly. But it does need memory to store
multiples:

from itertools import islice
from pprint import pprint as pp


def regularnumbergen(start = 1, multipliers=(2, 3, 5)):
'''\
Generates the Regular Number sequence defined here:
http://en.wikipedia.org/wiki/Regular_number
The sequence starts with 1 and consists of all multiples by 2, 3, or 5
of any number in the sequence in non-repeating, increasing order.
'''
multiples = [[] for i in multipliers]
this = start
while True:
yield this
# get multiples
for multiple, multiplier in zip(multiples, multipliers):
multiple.append(this * multiplier)
# get next
this = min(m[0] for m in multiples)
# remove this from multiples
for multiple in multiples:
if multiple[0] == this:
del multiple[0]

if __name__ == '__main__':
print(list(islice(regularnumbergen(), 10)))

Generating the following large values of the sequence took only seconds:


>>> print(list(islice(regularnumbergen(), 99999, 100001)))
[290142196707511001929482240000000000000, 290237644800000000000000000000000000000]
>>>




Wrapup


Well, I've learnt a bit more about generators, the limitatins in their
Python implementation w.r.t. Haskel, and I've put more meat on why I
tend to not quote much Perl In this case, i couldn't see the wood for
the trees! (But I do quote Perl).

Thursday, November 12, 2009

Curious about Go.

Verilog is to SytemVerilog as C is to C++ .
Verilog is to __________ as C is to Go ?

My main criticism of SytemVerilog was the size of the language compared to Verilog. Maybe we'll get SystemGo in five years time. (Or maybe Go itself will warp into some monstrosity; my crystal ball needs a new backlight).

Wednesday, November 04, 2009

More Words From Hex Letters (with extra zero's)

STOP PRESS! Someone commented that zeros could be used as ones so an update is at the bottom.

It was pointed out to me that I needn't restrict myself to four character words, and could use the number '1' as letter 'l', and the number '5' as the letter 'S'.

So here is the program and output of words of four or more characters that can be made from hexadecimal characters.

import urllib

words = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt'
).read().split()
#len(words) == 25104

def findwords(hexchars):
hexchars = set(hexchars)
hexwords = [w for w in words if 4 <= len(w) and all(ch in hexchars for ch in w)]
return hexwords

def printwords(hexwords):
for i in range(0, len(hexwords), 5):
print (' ' + ' '.join("%-8s" % w for w in hexwords[i:i+5]))

hexchars = 'abcdef'
#hexwords = findwords(hexchars)
#print "\nHEXWORDS"
#printwords(hexwords)

# Now to use one: 1 as ell: l; and five: 5 as ess: s
hexwords = findwords(hexchars + 'ls')
hexwords = [word.replace('s','5').replace('l','1') for word in hexwords]
print "\nHEXWORDS USING one: 1 AS ell: l; AND five: 5 AS ess: s"
printwords(hexwords)

The list of words found have some gems, such as babble, for when you know your generating data that no one will read; baffle, baseball, blab, bleed, bless, (and blessed if it were in the dictionary), cabal, cascade, cease, debacle, decade, decease, (and should have deceased), false, flee, label, ...

(Someone at work suggested we use 1abe1 or label to prefix a section of writes to a peripheral in an assembler prog which got me going).

The words:


HEXWORDS USING one: 1 AS ell: l; AND five: 5 AS ess: s
aaa5 ababa aba5e abba5 abbe
abed abe1 ab1e ab5ce55 accede
acce55 added add1e ade1e affab1e
a1ba a1ec a1fa1fa a11e1e a55e55
babb1e babe babe1 bade baff1e
ba1d ba1e ba11 ba11ad ba11ed
ba15a ba5a1 ba5e ba5eba11 ba5e1
ba55 bead bead1e beebe beef
befa11 befe11 be1a be11 be11a
be11e be55 be55e1 b1ab b1ade
b1ed b1eed b1e55 caba1 cab1e
cafe ca1eb ca1f ca11 ca11a
ca5cade ca5e cea5e cede ce1ebe5
ce11 c1ad c1a55 dabb1e dacca
dada dade da1e da11a5 dead
deaf dea1 debac1e deba5e decade
deca1 decca decea5e deed deface
de11 de11a ea5e ea5e1 ecc1e5
efface effaceab1e e1ba e11a e15e
fab1e facade face fade fa11
fa15e feeb1e feed fee1 fe11
f1ea f1ed f1ee f1eece 1abe1
1ace 1ad1e 1a5e 1a55 1ead
1eaf 1ea5e 1eed5 1e55 1e55ee
5ab1e 5accade 5add1e 5afe 5a1ad
5a1e 5a11e 5cab 5ca1a 5ca1d
5ca1e 5ea1 5ecede 5eeab1e 5eed
5eedbed 5e1f 5e11 51ab 51ed


Additional Words with an 'o'

Without Michaels comment I would have skipped all words that use zero for an o, so I made the following two changes:

hexwords = findwords(hexchars + 'lso')
hexwords = [word.replace('s','5').replace('l','1').replace('o', '0') for word in hexwords]

And created the following extra words:

  ab0de    acc01ade ad0be    a1c0a    a110cab1e
a10e a100f a150 ba1b0a ba550
b10b b10c b100d b0bb1e b0ca
b0de b01d b01dface b01e b010
b05e b055 caca0 c10d c105e
c0a1 c0a1e5ce c0bb c0bb1e c0b01
c0ca c0c0 c0c0a c0da c0dd1e
c0de c0ed c0ffee c01a c01d
c01e c01055a1 c001 c05ec dec0de
d0bb5 d0dd d0d0 d0ff d01ce
d01e d011 d00d1e d05e ec01e
ee0c fa110ff f10c f10e f100d
f0a1 f0ca1 f01d f00d f001
f055 1a05 1a550 10ad 10af
10be 10b0 10ca1 10ca1e 10eb
10e55 101a 1011 1005e 1005e1eaf
105ab1e 105e 1055 0a5e5 0be5e
0b0e 0b5e55 0de55a 0ffa1 0ff10ad
0ff5add1e 01af 00d1e5 0510 5caff01d
5c0ff 5c01d 5eaf00d 510b 510e
50da 50fa 501ace 501d 501e
5010

Sunday, November 01, 2009

The 24 game

I've just created a new task over on Rosetta Code for the 24 game, and a separate task to create a solver for the game.

It's a brute -force solver, but I find the game to be playable, so will get my sprogs to test their arithmetic skills with it later.

I've just realised that the solver could be easily modified to solve "The numbers game" from the UK Channel 4 program Countdown. (Although their could be an issue with division as, when given the digits 3 3 7 7, the solver finds the solution: ( 3 + 3 / 7 ) * 7 which may not be allowed by the countdown rules which state that "only whole numbers may be used").

Sunday, October 18, 2009

Template for forking a Python program

I am doing a large regression run of some 30,000 simulations producing aroung 30 Gigs of logs to extract test parameter and simulation result from each.

I have a program that goes through each log in turn extracting results and this may take tens of minutes to complete.

A future optimisation will be to extract pass/fail data for each regression run in the same job that ran the simulation, but at present, there may be changes to the pass/fail criteria, so it is run as a separate task after the regression simulations are all done.

I normally run the log extraction process on an 8core, 16 thread machine with a fast connection to the file server, so was thinking of ways to parallelise the task.

Enter this article, part of Doug Hellmann's PyMOTW series, about forking a process.

After running it, I decided to expand on it to be more like my situation where you have a list of similar tasks to perform and want to split it between multiple processes via the Posix fork() call. The following is just a framework - I hope to get time to acctually use it and see what the problems with it are.
 1 import os
2 import sys
3 import time
4
5 maxprocs = 4 # procs - 1 children and the parent to do the work
6 jobarray = range(25) # Work to be split between proccesses
7
8 def do1job(job):
9 'Process one item from the jobarray'
10 return job
11
12 def picklesave(results):
13 'save partialresults to file'
14 time.sleep(3)
15
16 def unpickleread(proc):
17 'read partial results previousely saved by process number procc'
18 return [] # dummy
19
20 def dowork(proc, maxprocs=maxprocs, jobarray=jobarray):
21 ' Split jobarray tasks between maxprocs by proci number'
22 time.sleep(proc) # convenience processing time
23 results = []
24 for count, job in enumerate(jobarray):
25 if count % maxprocs == proc:
26 results.append(do1job(job))
27 picklesave(results)
28 print ("Process: %i completed jobs: %s" % (proc, results))
29 return results
30
31
32 def createworkerchildren(maxprocs=maxprocs):
33 for proc in range(1, maxprocs):
34 print 'PARENT: Forking %s' % proc
35 worker_pid = os.fork()
36 if not worker_pid:
37 'This is a child process'
38 dowork(proc)
39 # Children exit here!
40 sys.exit(proc)
41
42 # Start and don't wait for child proccesses to do their work
43 createworkerchildren()
44
45 # Do parent processes share of the work
46 results = []
47 results += dowork(0) # don't have to pickle/unpickle Parent processes share
48
49 # wait for children
50 for proc in range(1, maxprocs):
51 print 'PARENT: Waiting for any child'
52 done = os.wait()
53 print 'PARENT: got child', done
54
55 # read and integrate child results
56 for proc in range(1, maxprocs):
57 results += unpickleread(proc)
58
59 # Calculate and print data summary
60 pass
61

Running the above on cygwin produced:
bash$ python testfork.py
PARENT: Forking 1
PARENT: Forking 2
PARENT: Forking 3
Process: 0 completed jobs: [0, 4, 8, 12, 16, 20, 24]
PARENT: Waiting for any child
Process: 1 completed jobs: [1, 5, 9, 13, 17, 21]
PARENT: got child (5536, 256)
PARENT: Waiting for any child
Process: 2 completed jobs: [2, 6, 10, 14, 18, 22]
PARENT: got child (3724, 512)
PARENT: Waiting for any child
Process: 3 completed jobs: [3, 7, 11, 15, 19, 23]
PARENT: got child (4236, 768)
bash$

Notice how each child gets to do a fair poition of the jobs, (assuming all jobs need the same resources).

Sunday, October 11, 2009

Extend functools.partial to nested function calls?

I am reading the 1977 ACM Turing Award Lecture "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs" by John Backus.

In section 5.1 he is contrasting a von Neumann program for inner product, which he gives as:
c := 0
for i := 1 step 1 until n do
c := c + c[i] x b[i]
The above converts quit nicely to the python equivalent:
>>> v1, v2 = [1,2,3], [6,5,4]
>>> def ip_von_neumann(a, b):
c = 0
for i in range(3):
c = c + a[i] * b[i]
return c

>>> ip_von_neumann(v1, v2)
28
Backus then goes on to contrast the von Neumann style with his functional program:
Def Innerproduct ≡ (Insert +)o(ApplyToAll x)oTranspose
Which I converted to the following functional style Python:
>>> from operator import add, mul
>>> from functools import reduce as insert
>>>
>>> def ip_functional(*x):
return insert(add, map(mul, *x))

>>> ip_functional(v1, v2)
28

Well, the Python works, but it still has this name x that I would like to eliminate.

I think I can go part of the way by using functools.partial. For example, for the single function call of map(mul, *x) I can do:
>>> from functools import partial
>>> p1 = partial(map, mul)
>>> assert p1(v1, v2) == map(mul, *(v1, v2))
>>> map(mul, *(v1, v2))
[6, 10, 12]
>>>
But I don't know how to go that extra step and remove x from the definition fully, ie end up with just some assignment like:
ip_purefunctional = {A pure function of functions without any argument placeholders}


All help would be appreciated, Thanks.

Friday, October 02, 2009

The Otter Algorithm

I was reading about a Richard Dawkins program that through random mutation and selection of best fit, would gradually converge a random string into a target string. It is called the Weasel Program,and a rough outline of it is:
  1. Start with a random string of 28 characters
  2. Copy this string 100 times, with a 5% chance per character of that character being replaced with a random character
  3. Compare each new string with the target "METHINKS IT IS LIKE A WEASEL", and give each a score
  4. If any of the new strings has a perfect score, halt
  5. Otherwise, take the highest scoring string, and go to step 2
After coding it up, I found it just would not converge to a final match for me. I suspected I had a mistake somewhere , but just did not like how long it was taking to converge.

I decided to change things.

It seemed to me that if you are further away from the ideal, then you should change more letters; and, after seeing the score for how good the current best string  is sometimes decrease, I made a modification to choose the best of the copies and the best string that they are mutations of.

With that modification, convergence was swift.

Oh, I call it the Otter program, as it is more than the Weasel.

The prog

 1 
 2 '''
 3 From: http://en.wikipedia.org/wiki/Weasel_program
 4       With modifications
 5 '''
 6 
 7 from string import ascii_uppercase
 8 from random import choice, random
 9 
10 target = list("METHINKS IT IS LIKE A WEASEL")
11 chance = .05
12 times  = 100
13 
14 charset      = ascii_uppercase + ' '
15 iterations   = 0
16 highscore    = [choice(charset) for _ in range(len(target))]
17 perfectscore = len(target)
18 
19 def score(trial):
20     return sum(t==h for t,h in zip(trial, target))
21 
22 while highscore != target:
23     thischance =  1-((perfectscore - score(highscore)) / perfectscore * (1 - chance))
24     #thischance = chance
25     iterations += 1
26     if iterations % 50 == 0:
27         print (iterations, score(highscore), ''.join(highscore))
28     copies = [ [(ch if random() <= thischance else choice(charset)) for ch in highscore]
29                for _ in range(times) ]  + [highscore]
30     highscore = max(copies, key=score)
31 print (iterations, ''.join(highscore))

Function score at line 19 uses the fact that booleans True and False are also numbers and calculates how many characters are in the right position as its result.

Line 23 modifies the amount of randomness injected, based on how bad the last highscore was: Change more if you are further away from the target - less as you approach.
Name thischance is used in  line 28. random returns a float between 0.0 and 1.0. if it is less than thischance then a particular character is not changed.

The function for thischance in line 23 is equal to chance when no characters match, and 1.0 if all characters were to match. This forces less change as you approach the target string.

Here is a little snippet to show how thischance varies with the score (sc):


>>> for sc in range(0,29,7):
        print( "score=%2i, thischance=%5.2f" % (sc,
                1-((perfectscore - sc) / perfectscore * (1 - chance))) )


score= 0, thischance= 0.05
score= 7, thischance= 0.29
score=14, thischance= 0.53
score=21, thischance= 0.76
score=28, thischance= 1.00 
 
In line 29 the addition of highscore into copies means that the max taken in line 30 can never decrease.

P.S. The post is only tangentially linked to Darwinism. I am secure in the knowledge that my children's school teaches it, and I am not really interested in it being mentioned in any comments. Treat it just as a way to go from a random choice of letters and home in on a target string by  repeated random alterations, and selection of the "best".




Thursday, September 24, 2009

Natural Sorting

Natural sorting orders items in some way that is more than just the
order dictated by computer character codes. It is called natural
because the new ordering is supposed to mean more to a human consumer
than an ordering on base character codes .



I guess, what is becoming the canonical example is the ordering of
numbered files in a directory listing. A normal sorted directory
listing might show:

style="font-family: monospace;">bash$ for (( i=0;
i<111;i+=5)); do touch f$i.txt; done style="font-family: monospace;">
bash$ ls -1 style="font-family: monospace;">
f0.txt style="font-family: monospace;">
f10.txt style="font-family: monospace;">
f100.txt style="font-family: monospace;">
f105.txt style="font-family: monospace;">
f110.txt style="font-family: monospace;">
f15.txt style="font-family: monospace;">
f20.txt style="font-family: monospace;">
f25.txt style="font-family: monospace;">
f30.txt style="font-family: monospace;">
f35.txt style="font-family: monospace;">
f40.txt style="font-family: monospace;">
f45.txt style="font-family: monospace;">
f5.txt style="font-family: monospace;">
f50.txt style="font-family: monospace;">
f55.txt style="font-family: monospace;">
f60.txt style="font-family: monospace;">
f65.txt style="font-family: monospace;">
f70.txt style="font-family: monospace;">
f75.txt style="font-family: monospace;">
f80.txt style="font-family: monospace;">
f85.txt style="font-family: monospace;">
f90.txt style="font-family: monospace;">
f95.txt

It might be more friendly if the files were ordered so that the numeric
section increased steadily.



In other situations, such as in Movie listings, you sometimes see
common words such as "the" ignored for sorting purposes (or even
transferred to the end of the item so "The Silence of the Lambs" would
be listed as "Silence of the Lambs, The")



Ignoring case, and treating white-space characters as the same - and
multiple strings of whitespace characters as a single whitespace, might
also be useful.



Since my initial tentative steps into working with non-ASCII characters
I tried to consider sorting accented characters as if the accent was
not present; ligatures as separate characters, and made a stab at
tackling  what I call 'replacements' - where certain
characters, such as the German href="http://en.wikipedia.org/wiki/Scharfes_S">scharfes S
might be sorted as SS,.



I had a quick stab at creating a natural sort routine  and
added several natural sorting abilities. They might not all be useful
at once, but it should be straight forward to pick and choose what you
want n your natural sort..



As always, when working with non-ASCII character sets, I find it
difficult to translate my working multi-byte Python program to HTML,
but there should be enough surrounding information to ensure that any
multi-byte character can be worked-out.




  1 
color="#804040"> 2 # -*- coding: utf-8 -*-
color="#804040"> 3 # Not Python 3.x (Can't compare str and int)
color="#804040"> 4
color="#804040"> 5 '''
color="#804040"> 6 Natural sorting: Sorting of text that does more than rely on the
color="#804040"> 7 order of individual characters codes to make the finding of
color="#804040"> 8 individual strings easier for a human reader.
color="#804040"> 9
color="#804040"> 10 Some 'natural' orderings might include:
color="#804040"> 11 1) Ignore leading, trailing and multiple adjacent spaces
color="#804040"> 12 2) All space types equivalent.
color="#804040"> 13 3) Sorting without regard to case.
color="#804040"> 14 4) Sorting numeric portions of strings in numeric order:
color="#804040"> 15 foo9.txt before foo10.txt
color="#804040"> 16 As well as ... x9y99 before x9y100, before x10y0
color="#804040"> 17 ... (for any number of groups of integers in a string).
color="#804040"> 18 5) Title sorts: without regard to a leading, very common, word such
color="#804040"> 19 as 'The' in "The thirty-nine steps".
color="#804040"> 20 6) Sort letters without regard to accents.
color="#804040"> 21 7) Sort ligatures as separate letters.
color="#804040"> 22 8) Replacements:
color="#804040"> 23 Sort german scharfes S (ß) as ss
color="#804040"> 24 Sort Å¿, LATIN SMALL LETTER LONG S as s
color="#804040"> 25 Sort Ê’, LATIN SMALL LETTER EZH as s
color="#804040"> 26 ...
color="#804040"> 27 '''
color="#804040"> 28
color="#804040"> 29 from itertools color="#a020f0">import groupby
color="#804040"> 30 from unicodedata color="#a020f0">import decomposition, name
color="#804040"> 31
color="#804040"> 32 commonleaders = [' color="#ff00ff">the'] # lowercase leading words to ignore
color="#804040"> 33 replacements = {' color="#ff00ff">ß': 'ss', color="#0000ff"># Map single char to replacement string
color="#804040"> 34 ' color="#ff00ff">Å¿': 's',
color="#804040"> 35 ' color="#ff00ff">Ê’': 's',
color="#804040"> 36 }
color="#804040"> 37
color="#804040"> 38 hexdigits = set(' color="#ff00ff">0123456789abcdef')
color="#804040"> 39 decdigits = set(' color="#ff00ff">0123456789') color="#0000ff"># Don't use str.isnumeric
color="#804040"> 40
color="#804040"> 41 def color="#008080">splitchar(c):
color="#804040"> 42 ' color="#ff00ff"> De-ligature. De-accent a char'
color="#804040"> 43 de = decomposition(c)
color="#804040"> 44 if de:
color="#804040"> 45 color="#0000ff"># Just the words that are also hex numbers
color="#804040"> 46 de = [d color="#804040">for d color="#804040">in de.split()
color="#804040"> 47 color="#804040">if all(c.lower()
color="#804040"> 48 color="#804040">in hexdigits color="#804040">for c color="#804040">in d)]
color="#804040"> 49 n = name(c, c).upper()
color="#804040"> 50 color="#0000ff"># (Gosh it's onerous)
color="#804040"> 51 color="#804040">if len(de)> 1 color="#804040">and ' color="#ff00ff">PRECEDE' color="#804040">in n:
color="#804040"> 52 color="#0000ff"># E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
color="#804040"> 53 de[1], de[0] = de[0], de[1]
color="#804040"> 54 tmp = [ unichr(int(k, 16)) color="#804040">for k color="#804040">in de]
color="#804040"> 55 base, others = tmp[0], tmp[1:]
color="#804040"> 56 color="#804040">if ' color="#ff00ff">LIGATURE' color="#804040">in n:
color="#804040"> 57 color="#0000ff"># Assume two character ligature
color="#804040"> 58 base += others.pop(0)
color="#804040"> 59 else:
color="#804040"> 60 base = c
color="#804040"> 61 return base
color="#804040"> 62
color="#804040"> 63
color="#804040"> 64 def color="#008080">sortkeygen(s):
color="#804040"> 65 ''' color="#ff00ff">Generate 'natural' sort key for s
color="#804040"> 66
color="#804040"> 67 Doctests:
color="#804040"> 68 >>> sortkeygen(' some extra spaces ')
color="#804040"> 69 [u'some extra spaces']
color="#804040"> 70 >>> sortkeygen('CasE InseNsItIve')
color="#804040"> 71 [u'case insensitive']
color="#804040"> 72 >>> sortkeygen('The Wind in the Willows')
color="#804040"> 73 [u'wind in the willows']
color="#804040"> 74 >>> sortkeygen(u' color="#6a5acd">\462 ligature')
color="#804040"> 75 [u'ij ligature']
color="#804040"> 76 >>> sortkeygen(u' color="#6a5acd">\335\375 upper/lower case Y with acute accent')
color="#804040"> 77 [u'yy upper/lower case y with acute accent']
color="#804040"> 78 >>> sortkeygen('foo9.txt')
color="#804040"> 79 [u'foo', 9, u'.txt']
color="#804040"> 80 >>> sortkeygen('x9y99')
color="#804040"> 81 [u'x', 9, u'y', 99]
color="#804040"> 82 '''
color="#804040"> 83 # Ignore leading and trailing spaces
color="#804040"> 84 s = unicode(s).strip()
color="#804040"> 85 # All space types are equivalent
color="#804040"> 86 s = ' color="#ff00ff"> '.join(s.split())
color="#804040"> 87 # case insentsitive
color="#804040"> 88 s = s.lower()
color="#804040"> 89 # Title
color="#804040"> 90 words = s.split()
color="#804040"> 91 if len(words) > 1 color="#804040">and words[0] color="#804040">in commonleaders:
color="#804040"> 92 s = ' color="#ff00ff"> '.join( words[1:])
color="#804040"> 93 # accent and ligatures
color="#804040"> 94 s = ''.join(splitchar(c) color="#804040">for c color="#804040">in s)
color="#804040"> 95 # Replacements (single char replaced by one or more)
color="#804040"> 96 s = ''.join( replacements.get(ch, ch) color="#804040">for ch color="#804040">in s )
color="#804040"> 97 # Numeric sections as numerics
color="#804040"> 98 s = [ int("".join(g)) color="#804040">if isinteger color="#804040">else "".join(g)
color="#804040"> 99 color="#804040">for isinteger,g color="#804040">in groupby(s, color="#804040">lambda x: x color="#804040">in decdigits)]
color="#804040">100
color="#804040">101 return s
color="#804040">102
color="#804040">103 def color="#008080">naturalsort(items):
color="#804040">104 ''' color="#ff00ff"> Naturally sort a series of strings
color="#804040">105
color="#804040">106 Doctests:
color="#804040">107 >>> naturalsort(['The Wind in the Willows','The 40th step more',
color="#804040">108 'The 39 steps', 'Wanda'])
color="#804040">109 ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
color="#804040">110
color="#804040">111 '''
color="#804040">112 return sorted(items, key=sortkeygen)
color="#804040">113
color="#804040">114
color="#804040">115



color="#804040">116 # Extras...
color="#804040">117
color="#804040">118 '''
color="#804040">119 Help on built-in function decomposition in module unicodedata:
color="#804040">120
color="#804040">121 decomposition(...)
color="#804040">122 decomposition(unichr)
color="#804040">123
color="#804040">124 Returns the character decomposition mapping assigned to the Unicode
color="#804040">125 character unichr as string. An empty string is returned in case no
color="#804040">126 such mapping is defined.
color="#804040">127
color="#804040">128 Help on built-in function name in module unicodedata:
color="#804040">129
color="#804040">130 name(...)
color="#804040">131 name(unichr[, default])
color="#804040">132 Returns the name assigned to the Unicode character unichr as a
color="#804040">133 string. If no name is defined, default is returned, or, if not
color="#804040">134 given, ValueError is raised.
color="#804040">135 '''
color="#804040">136
color="#804040">137
color="#804040">138 def color="#008080">_temp():
color="#804040">139 # for c in u'\370\157\117\463\462\511\733\337\337':
color="#804040">140 for c color="#804040">in u' class="Constant">øoOijIJʼnÇßÝý':
color="#804040">141 de = decomposition(c)
color="#804040">142 color="#804040">if de:
color="#804040">143 de = [d color="#804040">for d color="#804040">in de.split()
color="#804040">144 color="#804040">if all(c.upper()
color="#804040">145 color="#804040">in ' color="#ff00ff">0123456789ABCDEF' color="#804040">for c color="#804040">in d)]
color="#804040">146 n = name(c, c).upper()
color="#804040">147 color="#804040">if len(de)> 1 color="#804040">and ' color="#ff00ff">PRECEDE' color="#804040">in n:
color="#804040">148 color="#0000ff"># E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
color="#804040">149 de[1], de[0] = de[0], de[1]
color="#804040">150 tmp = [ unichr(int(k, 16)) color="#804040">for k color="#804040">in de]
color="#804040">151 base, others = tmp[0], tmp[1:]
color="#804040">152 color="#804040">else:
color="#804040">153 base, others = c, []
color="#804040">154 color="#804040">print(" color="#ff00ff">%s %o %s\n color="#ff00ff"> %s %s" %
color="#804040">155 (c,ord(c), name(c,' color="#ff00ff">-'),
color="#804040">156 name(base, base),
color="#804040">157 ' color="#ff00ff"> '.join(name(o,o) color="#804040">for o color="#804040">in others)))
color="#804040">158
color="#804040">159
color="#804040">160



Monday, September 14, 2009

Easy Command-line Parallelism

I am trying out a dual chip Xeon server with 8 cores that shows up as
16 CPU's to Redhat Linux. I have my href="http://paddy3118.blogspot.com/search?q=process">process
runner
script, that I use with a list of 100000 simulations
to do overnight and have experimented with running different numbers of
simulations in parallel to get good throughput.



The process runner is good, I still use it, but I also wanted something
for more general command-line use. for example, I have around 750
directories, each containing files relating to one test. I organize the
directories to have regular names, and am used to using shell for loops
to process the dataset.



For example, to gzip all the */transcript files I would do:


foreach f (*/transcript)
    gzip $f
end


(Yep, I know it's cshell, but that is the 'standard' at work.)



This isn't making much use of the processing power at hand, so I went
googling for a lightweight way to run such tasks in parallel.



I found two methods. One, to use make and its -P option to run several
jobs in parallel would need the creation of a make file in each case,
which is onerous. The other, which was new to me, rather than a
forgotten feature, is that xargs has a -P option that does a similar
thing - in combination with other options it will run  up to a
given number of jobs in parallel:


/bin/ls */transcript \
| xargs -n1 -iX -P16 gzip X


The above lists all the files, pipes them to xargs which takes them one
at a time (-n1) to form a job by substituting all occurrences of
character X (-iX) in its arguments after its options. a maximum of 16
(-P16) jobs are then run at any one time.



When I want to simulate all 750 tests and create a coverage listing of
each test I use xargs to call the bash shell and write something like:


/bin/ls -d * \
| xargs -n1 -iX -P16 bash \
-c 'cd X && vsim tb -c -coverage -do "coverage save -onexit -code t coverX.ucdb;do ../run.do" 2>&1 >/dev/null && vcover report -file coverX.rpt coverX.ucdb 2>&1 >/dev/null'


Yep, I do create such long monstrosities at times. It's because I am
exploring new avenues at the moment, and things are likely to change,
plus I need to know, and change, the details quite frequently. Although
i do make a note of these long one-liners, (and keep a lot of history
in my shells), I tend to only wrap things in a (bash), script when it
becomes stable.



Note:

Following the links from the Wikipedia page, it seems that the -P
option to xargs is not mandated by href="http://www.opengroup.org/onlinepubs/9699919799/utilities/xargs.html">The
Open Group, and does not appear in href="http://docs.sun.com/app/docs/doc/816-5165/xargs-1?a=view">Suns
version of xargs. I work in a mixed Solaris/Linux environment
and have been using Unix for many years which makes it hard to keep up
with the GNU extensions to what were familiar commands. But such is
progress. Bring it on!

Wednesday, September 02, 2009

J for a Py Guy

I just started reading a little bit more about the J programming language, after a lot of J activity on RC. I read the forward to "J for C Programmers" where I picked out an acknowledgement to Ken Iverson whose name rang a bell.



It seems that the J-language is from the APL family tree, but uses the ASCII character set and with later ideas from function-level programming added. It is good for MIMD machines it says, so I wonder how well it works for todays cheaper four to sixteen cores in a box machines, or whether it needs massively MIMD machines to really shine?



It has tickled my fancy. I'll read some more of J for C.

Thursday, August 27, 2009

Saw this, and thought of you...

Here in the UK, it is back-to-school time with many parents buying new school computers for their children. Windows 7 is not released yet, but here is an alternative opinion on the operating system from a competitor:



http://windows7sins.org/#7

Wednesday, August 26, 2009

The Story of the Regexp and the Primes

If you are all sitting comfortably, then I shall begin...



Once upon a time, in a land an internet away, A python programmer was
browsing the web when he StumbledUpon a post with a curious regular
expression purporting to test numbers for primality. On further
investigation, the regexp was attributed in more than one place to
"Abigail" and took the form:

style="font-weight: bold; font-family: monospace;">^1?$|^(11+?)\1+$



Wanting to find out how it worked, the programmer was stumped, as all
the posts mentioning the regexp asked you to visit a page with a
supposedly excellent explanation of its inner workings, but the domain
name was no longer present.



The programmer decided to work it out for himself and came up with the
following...



A prime number is a
positive integer that is only divisible by itself and one. Zero and one
are not prime.



Let's say a number N, greater than one,  is not prime. Then
there exists two numbers, S and T, that when multiplied together equal
N. We have:

S x T = N



Lets us further assume
that T is greater than, or equal to S. (They can be swapped if
necessary to make this so). Then a small bit of manipulation gives:

(S x (T-1)) + S = N


(You take one less S in
the multiplication, then you need to add the S back).



Lets swap the terms and write this as equation (a):

style="font-weight: bold;">S + (S x (T-1)) = N  style="font-weight: bold;"> when S,T,N are integers greater
than one, and N is style="font-style: italic; font-weight: bold;">not style="font-weight: bold;"> prime



Abigail's regexp relies on representing integers as strings of that
number of ones:

Zero would be
    ''

One would be     '1'

two becomes     '11'

and so on...





The regexp gives a match if the string of ones does not represent a
prime.



To match zero or one occurrences of a 1 we use the first half of the
regexp:

style="font-family: monospace;">^1?$

Where ^anchors the following to a match from the beginning of the
string, ? will match zero or one occurrences of the preceding 1, and
the  trailing $ matches the end of the string, i.e. it matches
a string that contains only zero or one 1 characters.



Going back to equation (a), greater than one, is the same as two or
more.

So S will be represented by two or more 1's - we don't know how many,
but we do know that it is two or more. S, in a regexp, becomes
something like:

style="font-family: monospace;">11+

1 followed by one or more 1's. S is less than or equal to T,
which translates to using a less greedy form of the zero-or-one
matcher, which is +? giving a representation of S as:

style="font-family: monospace;">11+?

We want to find S followed by T-1 identical copies of S, so lets form a
group out of what matches S:

style="font-family: monospace;">(11+?)

Then by referring to this group, it is the first group in the regexp so
can be referred to as \1, we can search for extra copies of the group
by:

style="font-family: monospace;">\1+

S and its copies must cover the whole of the number so must start at ^
and finish at $:

style="font-weight: bold; font-family: monospace;">^(11+?)\1+$



Together with the earlier regexp fraction that covers the numbers zero
and one, they can be or'd together to cover all non primes as
the following:



style="font-family: monospace;">^1?$|^(11+?)\1+$






...Not knowing when to quit, the programmer opted to display some of
his Python code that defines a primality tester using the regexp:


>>> import re

>>> def isprime(n):

    return not re.match(r' style="font-weight: bold;">^1?$|^(11+?)\1+$', '1' * n)

>>> [i for i in range(40) if isprime(i)]

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]




And another function that shows S and T  from the regexp match:


>>> def is style="font-weight: bold;">notprime(n):

    notprime = re.match(r'^(1?)$|^(11+?)(\2+)$', '1' * n)

    if notprime:

        if n <=1:

            return (len( notprime.groups()[0]), )

        else:

            S, SxT1 = notprime.groups()[1:]

            s = len(S)

            t = 1 + len(SxT1)//s

            return (s,t)

    else:

        return ()


>>> [(i, isnotprime(i)) for i in range(15)]

[(0, (0,)), (1, (1,)), (2, ()), (3, ()), (4, (2, 2)), (5, ()), (6, (2, 3)), (7, ()), (8, (2, 4)), (9, (3, 3)), (10, (2, 5)), (11, ()), (12, (2, 6)), (13, ()), (14, (2, 7))]




... And so to bed.



THE END.

Saturday, August 08, 2009

Words From Hex Letters

Enter the six hexadecimal letters into an href="http://rosettacode.org/wiki/Anagrams#Python">anagram
solver with a large scrabble-like href="http://www.puzzlers.org/pub/wordlists/unixdict.txt">dictionary
and you can generate a list of sixteen 'English' words that are also
hex numbers. Good for sprinkling around your assembler listings to
spice up the usual DEADBEEF. CAFÉABBÉ anyone? (confess over a coffee).



style="text-align: left; width: 189px; height: 648px; margin-left: 40px;"
border="1" cellpadding="2" cellspacing="2">


WORD
MEANING


ABBE
priest/clergyman


ABED
in bed/ asleep


BABE
infant


BADE
bid, appeal


BEAD
droplet


BEEF
meat from cow. complaint.


CAFE
small restaurant


CEDE
surrender, abdicate


DADA
father, dad


DADE
<None found>


DEAD
"pining for the href="http://en.wikipedia.org/wiki/Fjords" title="Fjords"
class="mw-redirect">fjords", stunned,
 snuffed-it.


DEAF
"hard of hearing"


DEED
contract. accomplishment


FACE
boldness. front


FADE
disappear gradually.


FEED
food. give food







P.S. Anyone have a meaning for DADE?


Sunday, August 02, 2009

Facebook User-Clustering Puzzle

I came across the following href="http://www.facebook.com/careers/puzzles.php?puzzle_id=8">puzzle
at facebook. The task is to generate clusters of users based on who
they send emails to, as extracted from their large log file.



They give a strict format for the log file entries of:

style="font-family: monospace;"><date> '\t'
<send email address> '\t' <to email address>




Some lines from the log might look like:


Thu Dec 11 18:17:01 PST 2008    Finley.Cox@xx.com    James.Harrison@xx.com
Thu Dec 11 18:21:01 PST 2008    Alfie.Khan@xx.com    Tyler.Blanchet@xx.com
Thu Dec 11 18:23:01 PST 2008    Harry.White@xx.com   Finley.Ali@xx.com




They then state that a cluster exists if everyone in the group has sent
at least one email to everyone else in the group, so a cluster of users
A B and C, would have had to have sent the following emails at some
time:

from: A to B

from: A to C

from: B to A

from: B to C

from: C to A

from: C to B



You should extract the clusters from the log and report only maximal
clusters, i.e. in the example above, don't report A,C as a cluster if
you are also reporting A,B,C.



Their is also a format for how clusters are to be reported:


  • Only clusters of three or more people are to be reported

  • Reports are to be printed to stdout.

  • A cluster to a line.

  • Emails in the cluster to appear in sorted order; style="font-style: italic;">separated by the
    two characters of a single comma followed by a single space.

  • The lines of clusters to also be sorted on the characters
    appearing in the each cluster line.




An output for some log file might look like:


Aaron.Lewis@xx.com, Callum.Bennett@xx.com, Jayden.Mills@xx.com, Lucas.Lewis@xx.com, Matthew.Kennedy@xx.com
Alfie.Clarke@xx.com, Amy.Russell@xx.com, Connor.King@xx.com, Ethan.Thomas@xx.com, Joshua.Mcdonald@xx.com




The log files are said to be large.



The above is the original problem, re-phrased.



The Reverse Problem


Early on in thinking about how to solve the problem I thought I would
need  much more than the ten lines or so of sample log lines
that they gave. I then thought that style="font-style: italic;">generating a log
file would be a good  task too and set out to solve that
instead!



I need to:


  1. Generate pseudo-random email addresses

  2. Generate clusters of users

  3. Generate a log from the clusters


 And in that order too. When I have emails, I can create
clusters of emails of different sizes, then expand the clusters into a
log file structure, then print the log in the correct format.



The code follows, this description.


  • Line 9 is a debug remnant.

  • Although have only run the code in Python 3, I think it
    should work in Python 2.6 as well.

    Lines 10-13 are because intern is moved to the sys module in Python 3.

  • Lines 15-40: I needed thousands of names so trawled WP for href="http://en.wikipedia.org/wiki/List_of_most_popular_given_names">first
    names and href="http://en.wikipedia.org/wiki/List_of_most_common_surnames">family
    names to create their product.

  • Lines 27 and 41: I intern'd the strings because I could -
    It's a possible premature style="text-decoration: line-through;">ejacu
    optimization, but hey.

  • Lines 43-45: I will be building the log without times but
    in order, then adding times to the ordered log entries. to do that i
    need the time to start, and some possible time increments between log
    entries that I will randomly choose between.

  • Line 48; The size of cluster I generate is a random choice
    between the sizes mentioned here.. For example, a cluster size of 2, ( style="font-family: monospace;">[2]*20), will
    predominate.

  • Line 51 controls how many style="font-style: italic;">extra times an
    email might be sent between two people.

  • Line 53: namegen
    just pairs every first name with every family name. If you want more,
    you could use two first names for example, as well as extending the
    name lists. (or hyphenated family names). Anyone for Imogen Ethan
    Islam-Singh :-)

  • Line 60: name2email
    changes the name from namegen into an email address string

  • Lines 64-69: clustergen
    generates clustercount clusters of names. It randomly chooses how many
    names to put in each cluster using clustersizefreq;, then generates
    that many names per cluster. nextname feeds new names are required, and
    the sorts ensure the clusters are generated in the correct order for
    printing to satisfy the original puzzles output criteria.

  • Lines 71-74: clusterprint will correctly format and print
    the clustergen data structure.

  • Lines 76-95: logggen.
    The largest function 


    • Line 81 creates a number of clusters

    • Line 83 takes the cluster of users and turns each cluster
      in turn into the expanded set of correspondances between every member
      of the cluster, in order.

    • Line 86 randomly duplicates some entries of the log using
      repeatmsgfraction.

    • Line 89 adds some authenticity by randomising the order
      of the emails

    • Lines 91-94 add the time information to the log.

    • Line 95 returns the cluster info as well as the log.


  • Lines 97-101 function logprint can correctly format and
    print the log data. Note the handling of the PST timezone in line 100.





 1 
color="#804040"> 2 '''
color="#804040"> 3 Random Data generator for:
color="#804040"> 4 href="http://www.facebook.com/careers/puzzles.php?puzzle_id=8">http://www.facebook.com/careers/puzzles.php?puzzle_id=8
color="#804040"> 5 '''
color="#804040"> 6
color="#804040"> 7
color="#804040"> 8 import itertools, random, datetime
color="#804040"> 9 from pprint color="#a020f0">import pprint color="#a020f0">as pp
color="#804040"> 10 try: color="#0000ff"># Python 3/2 compatibility
color="#804040"> 11 from sys color="#a020f0">import intern
color="#804040"> 12 except:
color="#804040"> 13 pass
color="#804040"> 14
color="#804040"> 15 firstname = ''' color="#ff00ff">
color="#804040"> 16 Lola Imogen Jasmine Leah Daniel Isla Millie Lilly Benjamin Evie
color="#804040"> 17 Finlay Ella Jacob Grace Jayden Freya Summer James Adam Henry
color="#804040"> 18 Poppy Joshua Charlotte Thomas Noah Maisie Tyler Liam George Mason
color="#804040"> 19 Matthew Joseph Nathan Stan Scarlett Jessica Samuel Leo William Olivia
color="#804040"> 20 Ruby Isabella Emma Isabelle Charlie Phoebe Alfie Owen Erin Oscar
color="#804040"> 21 Cameron Katie Harvey Aaron Ethan Lily Finley Riley Jamie Megan
color="#804040"> 22 Caitlin Amy Connor Alexander Ben Eva Bethany Brooke Layla Harry
color="#804040"> 23 Chloe Harrison Lauren Ava Oliver Mia Abigail Ellie Holly Jake
color="#804040"> 24 Hannah Luke Molly Jack Amelia Dylan Alex Lucas Lucy Ryan
color="#804040"> 25 Daisy Sophie Sophia Archie Callum Lewis Isabel Logan Max Emily
color="#804040"> 26 '''.split()
color="#804040"> 27 firstname = [intern(n) color="#804040">for n color="#804040">in firstname]
color="#804040"> 28
color="#804040"> 29 familyname = list(set( ''' color="#ff00ff">
color="#804040"> 30 Wilson Russell Smith Phillips Dixon White Gill Brown Statham Oneill
color="#804040"> 31 Harris Doyle James Ahmad Roberts Mann Moore Thomas Saunders Barker
color="#804040"> 32 Bright Allen George Hill Jackson Rose Malley Mills Sheppard Mason
color="#804040"> 33 Wright Holt Pearce Bull Richards Ward Lopez Mathey King Stone
color="#804040"> 34 Johnson Baker Dupont Green Davis Blanchet Taylor Dean Donnelly Evans
color="#804040"> 35 Woods Patel Meacher Simon Clarke Young Ellis Palmer Lynch Alexander
color="#804040"> 36 Sutton Ahmed Adams Kennedy Davies Walker Austin Fernandez Mustafa Rodriguez
color="#804040"> 37 Bennett Murray Powell Harrison Campbell Cole Khan Mcdonald Edwards Wood
color="#804040"> 38 Thompson Singh Williams Power Price Watson Hall Jones Cox Chapman
color="#804040"> 39 Arnold Reid Garcia Morgan Mitchell Ali Lewis Stubbs Scott Islam
color="#804040"> 40 '''.split() ))
color="#804040"> 41 familyname = [intern(n) color="#804040">for n color="#804040">in familyname]
color="#804040"> 42
color="#804040"> 43 starttime = datetime.datetime(2008, 12, 11, 17, 53, 1, 0)
color="#804040"> 44 # Choice of delta times between log entries
color="#804040"> 45 timedeltafreq = [ datetime.timedelta(0, secs) color="#804040">for secs color="#804040">in [60]*3 + [120]*10 + [240]*10 ]
color="#804040"> 46
color="#804040"> 47 # Choice of cluster sizes
color="#804040"> 48 clustersizefreq = [2]*20 + [3]*8 + [4]*3 + [5]*2 + [6]*1 +[7]*1 + [8]*1
color="#804040"> 49
color="#804040"> 50 # Many messages between the same two people?
color="#804040"> 51 repeatmsgfraction = 2.0 color="#0000ff"># Extra 200%
color="#804040"> 52
color="#804040"> 53 def color="#008080">namegen():
color="#804040"> 54 'Generates (first, family) name tuples'
color="#804040"> 55 names = list(itertools.product(firstname, familyname))
color="#804040"> 56 random.shuffle(names)
color="#804040"> 57 for name color="#804040">in names:
color="#804040"> 58 yield name
color="#804040"> 59
color="#804040"> 60 def color="#008080">name2email(name):
color="#804040"> 61 'format (first, family) name as email address'
color="#804040"> 62 return ' color="#ff00ff">%s.%s@xx.com' % name
color="#804040"> 63
color="#804040"> 64 def color="#008080">clustergen(clustercount, clustersizefreq=clustersizefreq,
color="#804040"> 65 firstname=firstname, familyname=familyname):
color="#804040"> 66 'Generate clustercount clusters of unique users'
color="#804040"> 67 nextname = namegen().__next__
color="#804040"> 68 return sorted( sorted(nextname() color="#804040">for i color="#804040">in range(random.choice(clustersizefreq)))
color="#804040"> 69 for j color="#804040">in range(clustercount) )
color="#804040"> 70
color="#804040"> 71 def color="#008080">clusterprint(clusters):
color="#804040"> 72 'Print already sorted clusters in output format'
color="#804040"> 73 for cluster color="#804040">in clusters:
color="#804040"> 74 print ( ' color="#ff00ff">, '.join(name2email(n) color="#804040">for n color="#804040">in cluster) )
color="#804040"> 75
color="#804040"> 76 def color="#008080">loggen(clustercount, clustersizefreq=clustersizefreq,
color="#804040"> 77 firstname=firstname, familyname=familyname,
color="#804040"> 78 starttime=starttime, timedeltafreq=timedeltafreq,
color="#804040"> 79 repeatmsgfraction=repeatmsgfraction):
color="#804040"> 80 'Generate a log of clustercount clusters of unique users'
color="#804040"> 81 clusters = clustergen(clustercount)
color="#804040"> 82 # clustered emails
color="#804040"> 83 log = sum([ list(itertools.permutations(cluster, 2)) color="#804040">for cluster color="#804040">in clusters ],
color="#804040"> 84 [])
color="#804040"> 85 # repeats
color="#804040"> 86 for i color="#804040">in range(int(repeatmsgfraction*len(log))):
color="#804040"> 87 log.append(random.choice(log))
color="#804040"> 88 # dispersed emails
color="#804040"> 89 random.shuffle(log)
color="#804040"> 90 # log times for emails
color="#804040"> 91 logtime = starttime
color="#804040"> 92 for n, (send, to) color="#804040">in enumerate(log):
color="#804040"> 93 log[n] = (logtime, send, to)
color="#804040"> 94 logtime += random.choice(timedeltafreq)
color="#804040"> 95 return clusters, log
color="#804040"> 96
color="#804040"> 97 def color="#008080">logprint(log):
color="#804040"> 98 'Print log in input format'
color="#804040"> 99 for logtime, send, to color="#804040">in log:
color="#804040">100 print (' color="#ff00ff">%s\t color="#ff00ff">%s\t color="#ff00ff">%s' % (logtime.strftime(" color="#ff00ff">%a %b %d %H:%M:%S PST %Y")
color="#804040">101 , name2email(send), name2email(to)))
color="#804040">102


The Program in use


Switching to the command
line shell, I generate a log and its clustering and print the results.
Because of the way the log is generated from the clustering, the
clustering of the log should be as stated.

style="font-family: monospace;">>>> style="font-weight: bold;">clusters, log = loggen(3);
clusterprint(clusters); print(); logprint(log); print(); len(log) style="font-family: monospace;">
Alfie.Khan@xx.com,
Lily.Rose@xx.com, Tyler.Blanchet@xx.com
style="font-family: monospace;">
Finley.Ali@xx.com,
Harry.White@xx.com


Finley.Cox@xx.com,
James.Harrison@xx.com




Thu Dec 11 17:53:01
PST 2008   
Finley.Cox@xx.com    James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 17:55:01
PST 2008   
Lily.Rose@xx.com    Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 17:59:01
PST 2008   
Finley.Cox@xx.com    James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 18:03:01
PST 2008   
Finley.Ali@xx.com    Harry.White@xx.com
style="font-family: monospace;">
Thu Dec 11 18:07:01
PST 2008   
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 18:09:01
PST 2008   
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 18:13:01
PST 2008   
Alfie.Khan@xx.com    Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 18:17:01
PST 2008   
Finley.Cox@xx.com    James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 18:21:01
PST 2008   
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 18:23:01
PST 2008   
Harry.White@xx.com    Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:27:01
PST 2008   
Harry.White@xx.com    Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:29:01
PST 2008   
James.Harrison@xx.com    Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 18:33:01
PST 2008   
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 18:37:01
PST 2008   
Alfie.Khan@xx.com    Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 18:39:01
PST 2008   
James.Harrison@xx.com    Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 18:43:01
PST 2008   
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 18:45:01
PST 2008   
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 18:47:01
PST 2008   
Harry.White@xx.com    Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:49:01
PST 2008   
Harry.White@xx.com    Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:53:01
PST 2008   
James.Harrison@xx.com    Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 18:57:01
PST 2008   
Finley.Cox@xx.com    James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 18:59:01
PST 2008   
Harry.White@xx.com    Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 19:03:01
PST 2008   
Lily.Rose@xx.com    Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 19:04:01
PST 2008   
Tyler.Blanchet@xx.com    Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 19:06:01
PST 2008   
James.Harrison@xx.com    Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 19:10:01
PST 2008   
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 19:12:01
PST 2008   
Lily.Rose@xx.com    Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 19:16:01
PST 2008   
Finley.Cox@xx.com    James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 19:20:01
PST 2008   
Tyler.Blanchet@xx.com    Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 19:22:01
PST 2008   
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com
style="font-family: monospace;">


30



To Do



  • The cluster info for the puzzle should, of course, have the
    clusters of size two filtered out

  • I haave a solution to the original puzzle, but: " style="font-style: italic;" title="Insert schoolboy excuse here">Blue
    Martians forcibly wiped it from my brain" :-)




- Paddy.




Tuesday, July 28, 2009

The case of the disappearing over-bar

Rosetta-code has a task
which asks you to reverse a Unicode string
correctly. I had glanced at the Python
solution
and thought nothing of
it until someone did something similar in the R language and stated
that it may be incorrect for the given pattern which includes an
over-bar (my description) over the f in "as⃝df̅" (See the
orginal article
for the true Unicode string).




I cut-n-pasted the Python solution and the test string into Python 3.1
idle and decided to test it:




Python 3.1 (r31:73574, Jun 26 2009, 20:21:35) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> # I cut and paste the following string with an over-bar on the f
>>> x = input()
as⃝df̅
>>> # Just showing x gives the over-bar on the f
>>> x
'as⃝df̅'
>>> # Print it and it is fine.
>>> print(x)
as⃝df̅
>>> # Reverse x though, and the over-bar movesover the quote!
>>> x[::-1]
'̅fd⃝sa'
>>> # print the reversed x and it disappears altogether!
>>> print(x[::-1])
̅fd⃝sa
>>>



|NowCut and paste the unicode characters from firefox for the input
statement, and make sure that:


>>> ['%x' % ord(char) for char in x]
['61', '73', '20dd', '64', '66', '305']


As the unicode might otherwise be lost in my editing to try and present
it to you above.



A bit of scary reading introduced me to
Unicode character classes. It seems that Unicode characters are
sometimes composed, and that if I know what character gets composed
with another, (always the last non-composed character to the left),
then you should be able to form composed groups of characters and then
reverse them.




The WP article gave me the vocabulary, so a quick search in Pythons
library gave me the unicodedata
module which has the name
function. I am not sure if this will work in every case, but from the
WP article and this experimentation:


>>> [(c,'%s' % unicodedata.name(c,0xfffff)) for c in x]
[('a', 'LATIN SMALL LETTER A'), ('s', 'LATIN SMALL LETTER S'), ('⃝', 'COMBINING ENCLOSING CIRCLE'), ('d', 'LATIN SMALL LETTER D'), ('f', 'LATIN SMALL LETTER F'), ('̅', 'COMBINING OVERLINE')]


I think I can group by the presence of the word COMBINING in the name
of a character and so produced the following reversal function:


'''
Reverse a Unicode string with proper handling of combining characters
'''

import unicodedata

def ureverse(ustring):
'''
Reverse a string including unicode combining characters

Example:
>>> ucode = ''.join( chr(int(n, 16))
for n in ['61', '73', '20dd', '64', '66', '305'] )
>>> ucoderev = ureverse(ucode)
>>> ['%x' % ord(char) for char in ucoderev]
['66', '305', '64', '73', '20dd', '61']
>>>
'''
groupedchars = []
uchar = list(ustring)
while uchar:
if 'COMBINING' in unicodedata.name(uchar[0], ''):
groupedchars[-1] += uchar.pop(0)
else:
groupedchars.append(uchar.pop(0))
# Grouped reversal
groupedchars = groupedchars[::-1]

return ''.join(groupedchars)

if __name__ == '__main__':
ucode = ''.join( chr(int(n, 16))
for n in ['61', '73', '20dd', '64', '66', '305'] )
ucoderev = ureverse(ucode)
print (ucode)
print (ucoderev)


 

It works for the given example text (Try running it to see the output -
I've given up trying to work out what characters you might see).

Gosh, an attractive Unicode issue! Whatever next.