Go deh!: 2009

Thursday, December 03, 2009

What is a first-class function?

There is a debate about it on Wikipedia, and I am curious:

Does a language have first-class functions if it can use its functions to solve particular problems in a certain manner, or
Does a language have first-class functions if functions are just as flexible as other types in the language itself?

It probably means 1. above, as the second definition would omit most static languages, as they might be able, for example, to take textual input and change it into an internal representation of an integer, but not do the same for a function with similar effort. Someone who exclusively thinks in terms of static languages may not have realized a difference.

Saturday, November 28, 2009

Hamming Numbers (and Perl: Why I just ignored it)

I had read the following post: "Perl: Love it, or hate it, but don’t ignore
it.", by Phillip Smith, in which the author wishes Perl
solutions were credited as much as solutions in other languages such as
Python and Ruby.

I agreed that Perl wasn't mentioned as much as it was, and a day or so
later I came across an example of why....

I monitor whats happening in Python, and came across a new recipe added
to Activestate code called: "Recipe 576961: Technique for cyclical
iteration", by Raymond Hettinger. On first, second, and third
readings, i just couldn't understand the need for the deferred_output
function; the whole recipe remained opaque:

Raymond's
code:

from itertools import tee, chain, islice
from heapq import merge

def hamming_numbers():
   # Generate "5-smooth" numbers, also called "Hamming numbers"
   # or "Regular numbers". See: http://en.wikipedia.org/wiki/Regular_number
   # Finds solutions to 2**i * 3**j * 5**k for some integers i, j, and k.

   def no_repeats(it):
       last = None
       for i in it:
           if i is not last:
                yield i
                last = i

    def deferred_output():
       for i in output:
           yield i

   result, p2, p3, p5 = tee(deferred_output(), 4)
   m2 = (2*x for x in p2) # multiples of 2
   m3 = (3*x for x in p3) # multiples of 3
   m5 = (5*x for x in p5) # multiples of 5
   merged = merge(m2, m3, m5)
   combined = chain([1], merged) # prepend a starting point
   output = no_repeats(combined) # eliminate duplicates

   return result

if __name__ == '__main__':
   print(list(islice(hamming_numbers(), 1000))) # Show first 1000 hamming numbers

I was hooked by the series however and so went looking for more
information.

Perl

I came across a mention to the following page: "Infinite
lists in Perl" by Mark Dominus which is copyright 1997 by the
Perl Journal. Normally I wouldn't mention the article, as it spends
more time creating an implementation of generators in Perl as it does
on addressing the computation of Hamming numbers. It also uses the term
"stream" instead of generators. but maybe this is down to the age of
the article, i.e. 1997?

Haskell

Looking a bit further, I found this Dr Dobbs Article, and this
commented Haskell
solution:

hamming = 1 : map (2*) hamming `merge` map (3*) hamming `merge` map (5*) hamming
    where merge (x:xs) (y:ys)
           | x < y = x : xs `merge` (y:ys)
           | x > y = y : (x:xs) `merge` ys
           | otherwise = x : xs `merge` ys

The Haskell solution lead me to think that maybe Python has an issue
with defining recursive generators, and some experimentation lead me to
believe that yep, maybe that was why Raymond has the deferred_output
function.

My Python

Since I have trouble understanding Raymond s I went away and wrote my
own solution. It works. It can generate very large values of the
sequence quite quickly. But it does need memory to store
multiples:

from itertools import islice
from pprint import pprint as pp


def regularnumbergen(start = 1, multipliers=(2, 3, 5)):
   '''\
    Generates the Regular Number sequence defined here:
        http://en.wikipedia.org/wiki/Regular_number
    The sequence starts with 1 and consists of all multiples by 2, 3, or 5
    of any number in the sequence in non-repeating, increasing  order.
    '''
   multiples = [[] for i in multipliers]
   this = start
   while True:
       yield this
       # get multiples
       for multiple, multiplier in zip(multiples, multipliers):
           multiple.append(this * multiplier)
       # get next
       this = min(m[0] for m in multiples)
       # remove this from multiples
       for multiple in multiples:
           if multiple[0] == this:
               del multiple[0]

if __name__ == '__main__':
   print(list(islice(regularnumbergen(), 10)))

Generating the following large values of the sequence took only seconds:

>>> print(list(islice(regularnumbergen(), 99999, 100001)))
[290142196707511001929482240000000000000, 290237644800000000000000000000000000000]
>>>

Wrapup

Well, I've learnt a bit more about generators, the limitatins in their
Python implementation w.r.t. Haskel, and I've put more meat on why I
tend to not quote much Perl In this case, i couldn't see the wood for
the trees! (But I do quote Perl).

Thursday, November 12, 2009

Curious about Go.

Verilog is to SytemVerilog as C is to C++ .
Verilog is to __________ as C is to Go ?

My main criticism of SytemVerilog was the size of the language compared to Verilog. Maybe we'll get SystemGo in five years time. (Or maybe Go itself will warp into some monstrosity; my crystal ball needs a new backlight).

Wednesday, November 04, 2009

More Words From Hex Letters (with extra zero's)

STOP PRESS! Someone commented that zeros could be used as ones so an update is at the bottom.

It was pointed out to me that I needn't restrict myself to four character words, and could use the number '1' as letter 'l', and the number '5' as the letter 'S'.

So here is the program and output of words of four or more characters that can be made from hexadecimal characters.


import urllib

words = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt'
                  ).read().split()
#len(words) == 25104

def findwords(hexchars):
hexchars = set(hexchars)
hexwords = [w for w in words if 4 <= len(w) and all(ch in hexchars for ch in w)]
return hexwords

def printwords(hexwords):
for i in range(0, len(hexwords), 5):
   print ('  ' + ' '.join("%-8s" % w for w in hexwords[i:i+5]))

hexchars = 'abcdef'
#hexwords = findwords(hexchars)
#print "\nHEXWORDS"
#printwords(hexwords)

# Now to use one: 1 as ell: l; and five: 5 as ess: s
hexwords = findwords(hexchars + 'ls')
hexwords = [word.replace('s','5').replace('l','1') for word in hexwords]
print "\nHEXWORDS USING one: 1 AS ell: l; AND five: 5 AS ess: s"
printwords(hexwords)

The list of words found have some gems, such as babble, for when you know your generating data that no one will read; baffle, baseball, blab, bleed, bless, (and blessed if it were in the dictionary), cabal, cascade, cease, debacle, decade, decease, (and should have deceased), false, flee, label, ...

(Someone at work suggested we use 1abe1 or label to prefix a section of writes to a peripheral in an assembler prog which got me going).

The words:


HEXWORDS USING one: 1 AS ell: l; AND five: 5 AS ess: s
aaa5     ababa    aba5e    abba5    abbe
abed     abe1     ab1e     ab5ce55  accede
acce55   added    add1e    ade1e    affab1e
a1ba     a1ec     a1fa1fa  a11e1e   a55e55
babb1e   babe     babe1    bade     baff1e
ba1d     ba1e     ba11     ba11ad   ba11ed
ba15a    ba5a1    ba5e     ba5eba11 ba5e1
ba55     bead     bead1e   beebe    beef
befa11   befe11   be1a     be11     be11a
be11e    be55     be55e1   b1ab     b1ade
b1ed     b1eed    b1e55    caba1    cab1e
cafe     ca1eb    ca1f     ca11     ca11a
ca5cade  ca5e     cea5e    cede     ce1ebe5
ce11     c1ad     c1a55    dabb1e   dacca
dada     dade     da1e     da11a5   dead
deaf     dea1     debac1e  deba5e   decade
deca1    decca    decea5e  deed     deface
de11     de11a    ea5e     ea5e1    ecc1e5
efface   effaceab1e e1ba     e11a     e15e
fab1e    facade   face     fade     fa11
fa15e    feeb1e   feed     fee1     fe11
f1ea     f1ed     f1ee     f1eece   1abe1
1ace     1ad1e    1a5e     1a55     1ead
1eaf     1ea5e    1eed5    1e55     1e55ee
5ab1e    5accade  5add1e   5afe     5a1ad
5a1e     5a11e    5cab     5ca1a    5ca1d
5ca1e    5ea1     5ecede   5eeab1e  5eed
5eedbed  5e1f     5e11     51ab     51ed

Additional Words with an 'o'

Without Michaels comment I would have skipped all words that use zero for an o, so I made the following two changes:

hexwords = findwords(hexchars + 'lso')
hexwords = [word.replace('s','5').replace('l','1').replace('o', '0') for word in hexwords]

And created the following extra words:

  ab0de    acc01ade ad0be    a1c0a    a110cab1e
a10e     a100f    a150     ba1b0a   ba550 
b10b     b10c     b100d    b0bb1e   b0ca  
b0de     b01d     b01dface b01e     b010  
b05e     b055     caca0    c10d     c105e 
c0a1     c0a1e5ce c0bb     c0bb1e   c0b01 
c0ca     c0c0     c0c0a    c0da     c0dd1e
c0de     c0ed     c0ffee   c01a     c01d  
c01e     c01055a1 c001     c05ec    dec0de
d0bb5    d0dd     d0d0     d0ff     d01ce 
d01e     d011     d00d1e   d05e     ec01e 
ee0c     fa110ff  f10c     f10e     f100d 
f0a1     f0ca1    f01d     f00d     f001  
f055     1a05     1a550    10ad     10af  
10be     10b0     10ca1    10ca1e   10eb  
10e55    101a     1011     1005e    1005e1eaf
105ab1e  105e     1055     0a5e5    0be5e 
0b0e     0b5e55   0de55a   0ffa1    0ff10ad
0ff5add1e 01af     00d1e5   0510     5caff01d
5c0ff    5c01d    5eaf00d  510b     510e  
50da     50fa     501ace   501d     501e  
5010

Sunday, November 01, 2009

The 24 game

I've just created a new task over on Rosetta Code for the 24 game, and a separate task to create a solver for the game.

It's a brute -force solver, but I find the game to be playable, so will get my sprogs to test their arithmetic skills with it later.

I've just realised that the solver could be easily modified to solve "The numbers game" from the UK Channel 4 program Countdown. (Although their could be an issue with division as, when given the digits 3 3 7 7, the solver finds the solution: ( 3 + 3 / 7 ) * 7 which may not be allowed by the countdown rules which state that "only whole numbers may be used").

Sunday, October 18, 2009

Template for forking a Python program

I am doing a large regression run of some 30,000 simulations producing aroung 30 Gigs of logs to extract test parameter and simulation result from each.

I have a program that goes through each log in turn extracting results and this may take tens of minutes to complete.

A future optimisation will be to extract pass/fail data for each regression run in the same job that ran the simulation, but at present, there may be changes to the pass/fail criteria, so it is run as a separate task after the regression simulations are all done.

I normally run the log extraction process on an 8core, 16 thread machine with a fast connection to the file server, so was thinking of ways to parallelise the task.

Enter this article, part of Doug Hellmann's PyMOTW series, about forking a process.

After running it, I decided to expand on it to be more like my situation where you have a list of similar tasks to perform and want to split it between multiple processes via the Posix fork() call. The following is just a framework - I hope to get time to acctually use it and see what the problems with it are.

 1 import os
 2 import sys
 3 import time
 4 
 5 maxprocs = 4 # procs - 1 children and the parent to do the work
 6 jobarray = range(25) # Work to be split between proccesses
 7 
 8 def do1job(job):
 9  'Process one item from the jobarray'
10  return job
11 
12 def picklesave(results):
13  'save partialresults to file'
14  time.sleep(3)
15 
16 def unpickleread(proc):
17  'read partial results previousely saved by process number procc'
18  return [] # dummy
19 
20 def dowork(proc, maxprocs=maxprocs, jobarray=jobarray):
21  ' Split jobarray tasks between maxprocs by proci number'
22  time.sleep(proc) # convenience processing time 
23  results = []
24  for count, job in enumerate(jobarray):
25  if count % maxprocs == proc:
26  results.append(do1job(job))
27  picklesave(results)
28  print ("Process: %i completed jobs: %s" % (proc, results))
29  return results
30 
31 
32 def createworkerchildren(maxprocs=maxprocs):
33  for proc in range(1, maxprocs):
34  print 'PARENT: Forking %s' % proc
35  worker_pid = os.fork()
36  if not worker_pid:
37  'This is a child process'
38  dowork(proc)
39  # Children exit here!
40  sys.exit(proc)
41 
42 # Start and don't wait for child proccesses to do their work
43 createworkerchildren()
44 
45 # Do parent processes share of the work
46 results = []
47 results += dowork(0) # don't have to pickle/unpickle Parent processes share
48 
49 # wait for children
50 for proc in range(1, maxprocs):
51  print 'PARENT: Waiting for any child'
52  done = os.wait()
53  print 'PARENT: got child', done
54 
55 # read and integrate child results
56 for proc in range(1, maxprocs):
57  results += unpickleread(proc)
58 
59 # Calculate and print data summary
60 pass
61

Running the above on cygwin produced:

bash$ python testfork.py
PARENT: Forking 1
PARENT: Forking 2
PARENT: Forking 3
Process: 0 completed jobs: [0, 4, 8, 12, 16, 20, 24]
PARENT: Waiting for any child
Process: 1 completed jobs: [1, 5, 9, 13, 17, 21]
PARENT: got child (5536, 256)
PARENT: Waiting for any child
Process: 2 completed jobs: [2, 6, 10, 14, 18, 22]
PARENT: got child (3724, 512)
PARENT: Waiting for any child
Process: 3 completed jobs: [3, 7, 11, 15, 19, 23]
PARENT: got child (4236, 768)
bash$

Notice how each child gets to do a fair poition of the jobs, (assuming all jobs need the same resources).

Sunday, October 11, 2009

Extend functools.partial to nested function calls?

I am reading the 1977 ACM Turing Award Lecture "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs" by John Backus.

In section 5.1 he is contrasting a von Neumann program for inner product, which he gives as:

c := 0
for i := 1 step 1 until n do
   c := c + c[i] x b[i]

The above converts quit nicely to the python equivalent:

>>> v1, v2 = [1,2,3], [6,5,4]
>>> def ip_von_neumann(a, b):
    c = 0
    for i in range(3):
       c = c  + a[i] * b[i]
    return c

>>> ip_von_neumann(v1, v2)
28

Backus then goes on to contrast the von Neumann style with his functional program:

Def Innerproduct ≡ (Insert +)o(ApplyToAll x)oTranspose

Which I converted to the following functional style Python:

>>> from operator import add, mul
>>> from functools import reduce as insert
>>>
>>> def ip_functional(*x):
    return insert(add, map(mul, *x))

>>> ip_functional(v1, v2)
28

Well, the Python works, but it still has this name x that I would like to eliminate.

I think I can go part of the way by using functools.partial. For example, for the single function call of map(mul, *x) I can do:

>>> from functools import partial
>>> p1 = partial(map, mul)
>>> assert p1(v1, v2) == map(mul, *(v1, v2))
>>> map(mul, *(v1, v2))
[6, 10, 12]
>>>

But I don't know how to go that extra step and remove x from the definition fully, ie end up with just some assignment like:

ip_purefunctional = {A pure function of functions without any argument placeholders}

All help would be appreciated, Thanks.

Friday, October 02, 2009

The Otter Algorithm

I was reading about a Richard Dawkins program that through random mutation and selection of best fit, would gradually converge a random string into a target string. It is called the Weasel Program,and a rough outline of it is:

Start with a random string of 28 characters
Copy this string 100 times, with a 5% chance per character of that character being replaced with a random character
Compare each new string with the target "METHINKS IT IS LIKE A WEASEL", and give each a score
If any of the new strings has a perfect score, halt
Otherwise, take the highest scoring string, and go to step 2

After coding it up, I found it just would not converge to a final match for me. I suspected I had a mistake somewhere , but just did not like how long it was taking to converge.

I decided to change things.

It seemed to me that if you are further away from the ideal, then you should change more letters; and, after seeing the score for how good the current best string is sometimes decrease, I made a modification to choose the best of the copies and the best string that they are mutations of.

With that modification, convergence was swift.

Oh, I call it the Otter program, as it is more than the Weasel.

The prog

 1 
 2 '''
 3 From: http://en.wikipedia.org/wiki/Weasel_program
 4       With modifications
 5 '''
 6 
 7 from string import ascii_uppercase
 8 from random import choice, random
 9 
10 target = list("METHINKS IT IS LIKE A WEASEL")
11 chance = .05
12 times  = 100
13 
14 charset      = ascii_uppercase + ' '
15 iterations   = 0
16 highscore    = [choice(charset) for _ in range(len(target))]
17 perfectscore = len(target)
18 
19 def score(trial):
20     return sum(t==h for t,h in zip(trial, target))
21 
22 while highscore != target:
23     thischance =  1-((perfectscore - score(highscore)) / perfectscore * (1 - chance))
24     #thischance = chance
25     iterations += 1
26     if iterations % 50 == 0:
27         print (iterations, score(highscore), ''.join(highscore))
28     copies = [ [(ch if random() <= thischance else choice(charset)) for ch in highscore]
29                for _ in range(times) ]  + [highscore]
30     highscore = max(copies, key=score)
31 print (iterations, ''.join(highscore))

Function score at line 19 uses the fact that booleans True and False are also numbers and calculates how many characters are in the right position as its result.

Line 23 modifies the amount of randomness injected, based on how bad the last highscore was: Change more if you are further away from the target - less as you approach.
Name thischance is used in line 28. random returns a float between 0.0 and 1.0. if it is less than thischance then a particular character is not changed.

The function for thischance in line 23 is equal to chance when no characters match, and 1.0 if all characters were to match. This forces less change as you approach the target string.

Here is a little snippet to show how thischance varies with the score (sc):

>>> for sc in range(0,29,7):
        print( "score=%2i, thischance=%5.2f" % (sc,
                1-((perfectscore - sc) / perfectscore * (1 - chance))) )


score= 0, thischance= 0.05
score= 7, thischance= 0.29
score=14, thischance= 0.53
score=21, thischance= 0.76
score=28, thischance= 1.00

In line 29 the addition of highscore into copies means that the max taken in line 30 can never decrease.

P.S. The post is only tangentially linked to Darwinism. I am secure in the knowledge that my children's school teaches it, and I am not really interested in it being mentioned in any comments. Treat it just as a way to go from a random choice of letters and home in on a target string by repeated random alterations, and selection of the "best".

Thursday, September 24, 2009

Natural Sorting

Natural sorting orders items in some way that is more than just the
order dictated by computer character codes. It is called natural
because the new ordering is supposed to mean more to a human consumer
than an ordering on base character codes .

I guess, what is becoming the canonical example is the ordering of
numbered files in a directory listing. A normal sorted directory
listing might show:

style="font-family: monospace;">bash$ for (( i=0;
i<111;i+=5)); do touch f$i.txt; done style="font-family: monospace;">
bash$ ls -1 style="font-family: monospace;">
f0.txt style="font-family: monospace;">
f10.txt style="font-family: monospace;">
f100.txt style="font-family: monospace;">
f105.txt style="font-family: monospace;">
f110.txt style="font-family: monospace;">
f15.txt style="font-family: monospace;">
f20.txt style="font-family: monospace;">
f25.txt style="font-family: monospace;">
f30.txt style="font-family: monospace;">
f35.txt style="font-family: monospace;">
f40.txt style="font-family: monospace;">
f45.txt style="font-family: monospace;">
f5.txt style="font-family: monospace;">
f50.txt style="font-family: monospace;">
f55.txt style="font-family: monospace;">
f60.txt style="font-family: monospace;">
f65.txt style="font-family: monospace;">
f70.txt style="font-family: monospace;">
f75.txt style="font-family: monospace;">
f80.txt style="font-family: monospace;">
f85.txt style="font-family: monospace;">
f90.txt style="font-family: monospace;">
f95.txt

It might be more friendly if the files were ordered so that the numeric
section increased steadily.

In other situations, such as in Movie listings, you sometimes see
common words such as "the" ignored for sorting purposes (or even
transferred to the end of the item so "The Silence of the Lambs" would
be listed as "Silence of the Lambs, The")

Ignoring case, and treating white-space characters as the same - and
multiple strings of whitespace characters as a single whitespace, might
also be useful.

Since my initial tentative steps into working with non-ASCII characters
I tried to consider sorting accented characters as if the accent was
not present; ligatures as separate characters, and made a stab at
tackling what I call 'replacements' - where certain
characters, such as the German href="http://en.wikipedia.org/wiki/Scharfes_S">scharfes S
might be sorted as SS,.

I had a quick stab at creating a natural sort routine and
added several natural sorting abilities. They might not all be useful
at once, but it should be straight forward to pick and choose what you
want n your natural sort..

As always, when working with non-ASCII character sets, I find it
difficult to translate my working multi-byte Python program to HTML,
but there should be enough surrounding information to ensure that any
multi-byte character can be worked-out.

  1 
 color="#804040">  2 # -*- coding: utf-8 -*-
 color="#804040">  3 # Not Python 3.x (Can't compare str and int)
 color="#804040">  4 
 color="#804040">  5 '''
 color="#804040">  6     Natural sorting: Sorting of text that does more than rely on the
 color="#804040">  7     order of individual characters codes to make the finding of
 color="#804040">  8     individual strings easier for a human reader.
 color="#804040">  9     
 color="#804040"> 10     Some 'natural' orderings might include:
 color="#804040"> 11         1) Ignore leading, trailing and multiple adjacent spaces
 color="#804040"> 12         2) All space types equivalent.
 color="#804040"> 13         3) Sorting without regard to case.
 color="#804040"> 14         4) Sorting numeric portions of strings in numeric order:
 color="#804040"> 15             foo9.txt before foo10.txt
 color="#804040"> 16             As well as ... x9y99 before x9y100, before x10y0
 color="#804040"> 17             ... (for any number of groups of integers in a string).
 color="#804040"> 18         5) Title sorts: without regard to a leading, very common, word such
 color="#804040"> 19            as 'The' in "The thirty-nine steps".
 color="#804040"> 20         6) Sort letters without regard to accents.
 color="#804040"> 21         7) Sort ligatures as separate letters.
 color="#804040"> 22         8) Replacements:
 color="#804040"> 23             Sort german scharfes S (ÃŸ) as ss
 color="#804040"> 24             Sort Å¿,  LATIN SMALL LETTER LONG S as s
 color="#804040"> 25             Sort Ê’,  LATIN SMALL LETTER EZH as s
 color="#804040"> 26             ...
 color="#804040"> 27 '''
 color="#804040"> 28 
 color="#804040"> 29 from itertools  color="#a020f0">import groupby
 color="#804040"> 30 from unicodedata  color="#a020f0">import decomposition, name
 color="#804040"> 31 
 color="#804040"> 32 commonleaders = [' color="#ff00ff">the'] # lowercase leading words to ignore
 color="#804040"> 33 replacements = {' color="#ff00ff">ÃŸ': 'ss',   color="#0000ff"># Map single char to replacement string
 color="#804040"> 34                 ' color="#ff00ff">Å¿': 's',
 color="#804040"> 35                 ' color="#ff00ff">Ê’': 's',
 color="#804040"> 36                 }
 color="#804040"> 37 
 color="#804040"> 38 hexdigits = set(' color="#ff00ff">0123456789abcdef')
 color="#804040"> 39 decdigits = set(' color="#ff00ff">0123456789')    color="#0000ff"># Don't use str.isnumeric
 color="#804040"> 40 
 color="#804040"> 41 def  color="#008080">splitchar(c):
 color="#804040"> 42     ' color="#ff00ff"> De-ligature. De-accent a char'
 color="#804040"> 43     de = decomposition(c)
 color="#804040"> 44     if de:
 color="#804040"> 45          color="#0000ff"># Just the words that are also hex numbers
 color="#804040"> 46         de = [d  color="#804040">for d  color="#804040">in de.split()
 color="#804040"> 47                    color="#804040">if all(c.lower()
 color="#804040"> 48                           color="#804040">in hexdigits  color="#804040">for c  color="#804040">in d)]
 color="#804040"> 49         n = name(c, c).upper()
 color="#804040"> 50          color="#0000ff"># (Gosh it's onerous)
 color="#804040"> 51          color="#804040">if len(de)> 1  color="#804040">and ' color="#ff00ff">PRECEDE'  color="#804040">in n:
 color="#804040"> 52              color="#0000ff"># E.g. Å‰  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
 color="#804040"> 53             de[1], de[0] = de[0], de[1]
 color="#804040"> 54         tmp = [ unichr(int(k, 16))  color="#804040">for k  color="#804040">in de]
 color="#804040"> 55         base, others = tmp[0], tmp[1:]
 color="#804040"> 56          color="#804040">if ' color="#ff00ff">LIGATURE'  color="#804040">in n:
 color="#804040"> 57              color="#0000ff"># Assume two character ligature
 color="#804040"> 58             base += others.pop(0)
 color="#804040"> 59     else:
 color="#804040"> 60         base = c
 color="#804040"> 61     return base
 color="#804040"> 62 
 color="#804040"> 63 
 color="#804040"> 64 def  color="#008080">sortkeygen(s):
 color="#804040"> 65     ''' color="#ff00ff">Generate 'natural' sort key for s
 color="#804040"> 66 
 color="#804040"> 67     Doctests:
 color="#804040"> 68         >>> sortkeygen('  some extra    spaces  ')
 color="#804040"> 69         [u'some extra spaces']
 color="#804040"> 70         >>> sortkeygen('CasE InseNsItIve')
 color="#804040"> 71         [u'case insensitive']
 color="#804040"> 72         >>> sortkeygen('The Wind in the Willows')
 color="#804040"> 73         [u'wind in the willows']
 color="#804040"> 74         >>> sortkeygen(u' color="#6a5acd">\462 ligature')
 color="#804040"> 75         [u'ij ligature']
 color="#804040"> 76         >>> sortkeygen(u' color="#6a5acd">\335\375 upper/lower case Y with acute accent')
 color="#804040"> 77         [u'yy upper/lower case y with acute accent']
 color="#804040"> 78         >>> sortkeygen('foo9.txt')
 color="#804040"> 79         [u'foo', 9, u'.txt']
 color="#804040"> 80         >>> sortkeygen('x9y99')
 color="#804040"> 81         [u'x', 9, u'y', 99]
 color="#804040"> 82     '''
 color="#804040"> 83     # Ignore leading and trailing spaces
 color="#804040"> 84     s = unicode(s).strip()
 color="#804040"> 85     # All space types are equivalent
 color="#804040"> 86     s = ' color="#ff00ff"> '.join(s.split())
 color="#804040"> 87     # case insentsitive
 color="#804040"> 88     s = s.lower()
 color="#804040"> 89     # Title
 color="#804040"> 90     words = s.split()
 color="#804040"> 91     if len(words) > 1  color="#804040">and words[0]  color="#804040">in commonleaders:
 color="#804040"> 92         s = ' color="#ff00ff"> '.join( words[1:])
 color="#804040"> 93     # accent and ligatures
 color="#804040"> 94     s = ''.join(splitchar(c)  color="#804040">for c  color="#804040">in s)
 color="#804040"> 95     # Replacements (single char replaced by one or more)
 color="#804040"> 96     s = ''.join( replacements.get(ch, ch)  color="#804040">for ch  color="#804040">in s )
 color="#804040"> 97     # Numeric sections as numerics
 color="#804040"> 98     s = [ int("".join(g))  color="#804040">if isinteger  color="#804040">else "".join(g)
 color="#804040"> 99            color="#804040">for isinteger,g  color="#804040">in groupby(s,  color="#804040">lambda x: x  color="#804040">in decdigits)]
 color="#804040">100 
 color="#804040">101     return s
 color="#804040">102 
 color="#804040">103 def  color="#008080">naturalsort(items):
 color="#804040">104     ''' color="#ff00ff"> Naturally sort a series of strings
 color="#804040">105 
 color="#804040">106     Doctests:
 color="#804040">107         >>> naturalsort(['The Wind in the Willows','The 40th step more',
 color="#804040">108                          'The 39 steps', 'Wanda'])
 color="#804040">109         ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
 color="#804040">110 
 color="#804040">111     '''
 color="#804040">112     return sorted(items, key=sortkeygen)
 color="#804040">113 
 color="#804040">114 
 color="#804040">115 



 color="#804040">116 # Extras...
 color="#804040">117 
 color="#804040">118 '''
 color="#804040">119 Help on built-in function decomposition in module unicodedata:
 color="#804040">120 
 color="#804040">121 decomposition(...)
 color="#804040">122     decomposition(unichr)
 color="#804040">123     
 color="#804040">124     Returns the character decomposition mapping assigned to the Unicode
 color="#804040">125     character unichr as string. An empty string is returned in case no
 color="#804040">126     such mapping is defined.
 color="#804040">127 
 color="#804040">128 Help on built-in function name in module unicodedata:
 color="#804040">129 
 color="#804040">130 name(...)
 color="#804040">131     name(unichr[, default])
 color="#804040">132     Returns the name assigned to the Unicode character unichr as a
 color="#804040">133     string. If no name is defined, default is returned, or, if not
 color="#804040">134     given, ValueError is raised.
 color="#804040">135 '''
 color="#804040">136 
 color="#804040">137 
 color="#804040">138 def  color="#008080">_temp():
 color="#804040">139     # for c in u'\370\157\117\463\462\511\733\337\337':
 color="#804040">140     for c  color="#804040">in u' class="Constant">Ã¸oOÄ³Ä²Å‰ÇÃŸÃÃ½':
 color="#804040">141         de = decomposition(c)
 color="#804040">142          color="#804040">if de:
 color="#804040">143             de = [d  color="#804040">for d  color="#804040">in de.split()
 color="#804040">144                        color="#804040">if all(c.upper()
 color="#804040">145                           color="#804040">in ' color="#ff00ff">0123456789ABCDEF'  color="#804040">for c  color="#804040">in d)]
 color="#804040">146             n = name(c, c).upper()
 color="#804040">147              color="#804040">if len(de)> 1  color="#804040">and ' color="#ff00ff">PRECEDE'  color="#804040">in n:
 color="#804040">148                  color="#0000ff"># E.g. Å‰  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
 color="#804040">149                 de[1], de[0] = de[0], de[1]
 color="#804040">150             tmp = [ unichr(int(k, 16))  color="#804040">for k  color="#804040">in de]
 color="#804040">151             base, others = tmp[0], tmp[1:]
 color="#804040">152          color="#804040">else:
 color="#804040">153             base, others = c, []
 color="#804040">154          color="#804040">print(" color="#ff00ff">%s %o %s\n color="#ff00ff">  %s  %s" %
 color="#804040">155                   (c,ord(c), name(c,' color="#ff00ff">-'),
 color="#804040">156                    name(base, base),
 color="#804040">157                    ' color="#ff00ff"> '.join(name(o,o)  color="#804040">for o  color="#804040">in others)))
 color="#804040">158 
 color="#804040">159 
 color="#804040">160

Monday, September 14, 2009

Easy Command-line Parallelism

I am trying out a dual chip Xeon server with 8 cores that shows up as
16 CPU's to Redhat Linux. I have my href="http://paddy3118.blogspot.com/search?q=process">process
runner script, that I use with a list of 100000 simulations
to do overnight and have experimented with running different numbers of
simulations in parallel to get good throughput.

The process runner is good, I still use it, but I also wanted something
for more general command-line use. for example, I have around 750
directories, each containing files relating to one test. I organize the
directories to have regular names, and am used to using shell for loops
to process the dataset.

For example, to gzip all the */transcript files I would do:

foreach f (*/transcript)
    gzip $f
end

(Yep, I know it's cshell, but that is the 'standard' at work.)

This isn't making much use of the processing power at hand, so I went
googling for a lightweight way to run such tasks in parallel.

I found two methods. One, to use make and its -P option to run several
jobs in parallel would need the creation of a make file in each case,
which is onerous. The other, which was new to me, rather than a
forgotten feature, is that xargs has a -P option that does a similar
thing - in combination with other options it will run up to a
given number of jobs in parallel:

/bin/ls */transcript \
  | xargs -n1 -iX -P16 gzip X

The above lists all the files, pipes them to xargs which takes them one
at a time (-n1) to form a job by substituting all occurrences of
character X (-iX) in its arguments after its options. a maximum of 16
(-P16) jobs are then run at any one time.

When I want to simulate all 750 tests and create a coverage listing of
each test I use xargs to call the bash shell and write something like:

/bin/ls -d * \
 | xargs -n1 -iX -P16 bash \
 -c 'cd X && vsim tb -c -coverage -do "coverage save -onexit -code t coverX.ucdb;do ../run.do" 2>&1 >/dev/null && vcover report -file coverX.rpt coverX.ucdb 2>&1 >/dev/null'

Yep, I do create such long monstrosities at times. It's because I am
exploring new avenues at the moment, and things are likely to change,
plus I need to know, and change, the details quite frequently. Although
i do make a note of these long one-liners, (and keep a lot of history
in my shells), I tend to only wrap things in a (bash), script when it
becomes stable.

Note:

Following the links from the Wikipedia page, it seems that the -P
option to xargs is not mandated by href="http://www.opengroup.org/onlinepubs/9699919799/utilities/xargs.html">The
Open Group, and does not appear in href="http://docs.sun.com/app/docs/doc/816-5165/xargs-1?a=view">Suns
version of xargs. I work in a mixed Solaris/Linux environment
and have been using Unix for many years which makes it hard to keep up
with the GNU extensions to what were familiar commands. But such is
progress. Bring it on!

Wednesday, September 02, 2009

J for a Py Guy

I just started reading a little bit more about the J programming language, after a lot of J activity on RC. I read the forward to "J for C Programmers" where I picked out an acknowledgement to Ken Iverson whose name rang a bell.

It seems that the J-language is from the APL family tree, but uses the ASCII character set and with later ideas from function-level programming added. It is good for MIMD machines it says, so I wonder how well it works for todays cheaper four to sixteen cores in a box machines, or whether it needs massively MIMD machines to really shine?

It has tickled my fancy. I'll read some more of J for C.

Thursday, August 27, 2009

Saw this, and thought of you...

Here in the UK, it is back-to-school time with many parents buying new school computers for their children. Windows 7 is not released yet, but here is an alternative opinion on the operating system from a competitor:

http://windows7sins.org/#7

Wednesday, August 26, 2009

The Story of the Regexp and the Primes

If you are all sitting comfortably, then I shall begin...

Once upon a time, in a land an internet away, A python programmer was
browsing the web when he StumbledUpon a post with a curious regular
expression purporting to test numbers for primality. On further
investigation, the regexp was attributed in more than one place to
"Abigail" and took the form:

style="font-weight: bold; font-family: monospace;">^1?$|^(11+?)\1+$

Wanting to find out how it worked, the programmer was stumped, as all
the posts mentioning the regexp asked you to visit a page with a
supposedly excellent explanation of its inner workings, but the domain
name was no longer present.

The programmer decided to work it out for himself and came up with the
following...

A prime number is a
positive integer that is only divisible by itself and one. Zero and one
are not prime.

Let's say a number N, greater than one, is not prime. Then
there exists two numbers, S and T, that when multiplied together equal
N. We have:

S x T = N

Lets us further assume
that T is greater than, or equal to S. (They can be swapped if
necessary to make this so). Then a small bit of manipulation gives:

(S x (T-1)) + S = N

(You take one less S in
the multiplication, then you need to add the S back).

Lets swap the terms and write this as equation (a):

style="font-weight: bold;">S + (S x (T-1)) = N style="font-weight: bold;"> when S,T,N are integers greater
than one, and N is style="font-style: italic; font-weight: bold;">not style="font-weight: bold;"> prime

Abigail's regexp relies on representing integers as strings of that
number of ones:

Zero would be
    ''

One would be     '1'

two becomes     '11'

and so on...

The regexp gives a match if the string of ones does not represent a
prime.

To match zero or one occurrences of a 1 we use the first half of the
regexp:

style="font-family: monospace;">^1?$

Where ^anchors the following to a match from the beginning of the
string, ? will match zero or one occurrences of the preceding 1, and
the trailing $ matches the end of the string, i.e. it matches
a string that contains only zero or one 1 characters.

Going back to equation (a), greater than one, is the same as two or
more.

So S will be represented by two or more 1's - we don't know how many,
but we do know that it is two or more. S, in a regexp, becomes
something like:

style="font-family: monospace;">11+

1 followed by one or more 1's. S is less than or equal to T,
which translates to using a less greedy form of the zero-or-one
matcher, which is +? giving a representation of S as:

style="font-family: monospace;">11+?

We want to find S followed by T-1 identical copies of S, so lets form a
group out of what matches S:

style="font-family: monospace;">(11+?)

Then by referring to this group, it is the first group in the regexp so
can be referred to as \1, we can search for extra copies of the group
by:

style="font-family: monospace;">\1+

S and its copies must cover the whole of the number so must start at ^
and finish at $:

style="font-weight: bold; font-family: monospace;">^(11+?)\1+$

Together with the earlier regexp fraction that covers the numbers zero
and one, they can be or'd together to cover all non primes as
the following:

style="font-family: monospace;">^1?$|^(11+?)\1+$

...Not knowing when to quit, the programmer opted to display some of
his Python code that defines a primality tester using the regexp:

>>> import re

>>> def isprime(n):

    return not re.match(r' style="font-weight: bold;">^1?$|^(11+?)\1+$', '1' * n)

>>> [i for i in range(40) if isprime(i)]

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

And another function that shows S and T from the regexp match:

>>> def is style="font-weight: bold;">notprime(n):

    notprime = re.match(r'^(1?)$|^(11+?)(\2+)$', '1' * n)

    if notprime:

        if n <=1:

            return (len( notprime.groups()[0]), )

        else:

            S, SxT1 = notprime.groups()[1:]

            s = len(S)

            t = 1 + len(SxT1)//s

            return (s,t)

    else:

        return ()

>>> [(i, isnotprime(i)) for i in range(15)]

[(0, (0,)), (1, (1,)), (2, ()), (3, ()), (4, (2, 2)), (5, ()), (6, (2, 3)), (7, ()), (8, (2, 4)), (9, (3, 3)), (10, (2, 5)), (11, ()), (12, (2, 6)), (13, ()), (14, (2, 7))]

... And so to bed.

THE END.

Saturday, August 08, 2009

Words From Hex Letters

Enter the six hexadecimal letters into an href="http://rosettacode.org/wiki/Anagrams#Python">anagram
solver with a large scrabble-like href="http://www.puzzlers.org/pub/wordlists/unixdict.txt">dictionary
and you can generate a list of sixteen 'English' words that are also
hex numbers. Good for sprinkling around your assembler listings to
spice up the usual DEADBEEF. CAFÉABBÉ anyone? (confess over a coffee).

style="text-align: left; width: 189px; height: 648px; margin-left: 40px;"
border="1" cellpadding="2" cellspacing="2">

WORD
MEANING

ABBE
priest/clergyman

ABED
in bed/ asleep

BABE
infant

BADE
bid, appeal

BEAD
droplet

BEEF
meat from cow. complaint.

CAFE
small restaurant

CEDE
surrender, abdicate

DADA
father, dad

DADE
<None found>

DEAD
"pining for the href="http://en.wikipedia.org/wiki/Fjords" title="Fjords"
class="mw-redirect">fjords", stunned,
snuffed-it.

DEAF
"hard of hearing"

DEED
contract. accomplishment

FACE
boldness. front

FADE
disappear gradually.

FEED
food. give food

P.S. Anyone have a meaning for DADE?

Sunday, August 02, 2009

Facebook User-Clustering Puzzle

I came across the following href="http://www.facebook.com/careers/puzzles.php?puzzle_id=8">puzzle
at facebook. The task is to generate clusters of users based on who
they send emails to, as extracted from their large log file.

They give a strict format for the log file entries of:

style="font-family: monospace;"><date> '\t'
<send email address> '\t' <to email address>

Some lines from the log might look like:

Thu Dec 11 18:17:01 PST 2008    Finley.Cox@xx.com    James.Harrison@xx.com
Thu Dec 11 18:21:01 PST 2008    Alfie.Khan@xx.com    Tyler.Blanchet@xx.com
Thu Dec 11 18:23:01 PST 2008    Harry.White@xx.com   Finley.Ali@xx.com

They then state that a cluster exists if everyone in the group has sent
at least one email to everyone else in the group, so a cluster of users
A B and C, would have had to have sent the following emails at some
time:

from: A to B

from: A to C

from: B to A

from: B to C

from: C to A

from: C to B

You should extract the clusters from the log and report only maximal
clusters, i.e. in the example above, don't report A,C as a cluster if
you are also reporting A,B,C.

Their is also a format for how clusters are to be reported:

Only clusters of three or more people are to be reported

Reports are to be printed to stdout.

A cluster to a line.

Emails in the cluster to appear in sorted order; style="font-style: italic;">separated by the
two characters of a single comma followed by a single space.

The lines of clusters to also be sorted on the characters
appearing in the each cluster line.

An output for some log file might look like:

Aaron.Lewis@xx.com, Callum.Bennett@xx.com, Jayden.Mills@xx.com, Lucas.Lewis@xx.com, Matthew.Kennedy@xx.com
Alfie.Clarke@xx.com, Amy.Russell@xx.com, Connor.King@xx.com, Ethan.Thomas@xx.com, Joshua.Mcdonald@xx.com

The log files are said to be large.

The above is the original problem, re-phrased.

The Reverse Problem

Early on in thinking about how to solve the problem I thought I would
need much more than the ten lines or so of sample log lines
that they gave. I then thought that style="font-style: italic;">generating a log
file would be a good task too and set out to solve that
instead!

I need to:

Generate pseudo-random email addresses

Generate clusters of users

Generate a log from the clusters

And in that order too. When I have emails, I can create
clusters of emails of different sizes, then expand the clusters into a
log file structure, then print the log in the correct format.

The code follows, this description.

Line 9 is a debug remnant.

Although have only run the code in Python 3, I think it
should work in Python 2.6 as well.

Lines 10-13 are because intern is moved to the sys module in Python 3.

Lines 15-40: I needed thousands of names so trawled WP for href="http://en.wikipedia.org/wiki/List_of_most_popular_given_names">first
names and href="http://en.wikipedia.org/wiki/List_of_most_common_surnames">family
names to create their product.

Lines 27 and 41: I intern'd the strings because I could -
It's a possible premature style="text-decoration: line-through;">ejacu
optimization, but hey.

Lines 43-45: I will be building the log without times but
in order, then adding times to the ordered log entries. to do that i
need the time to start, and some possible time increments between log
entries that I will randomly choose between.

Line 48; The size of cluster I generate is a random choice
between the sizes mentioned here.. For example, a cluster size of 2, ( style="font-family: monospace;">[2]*20), will
predominate.

Line 51 controls how many style="font-style: italic;">extra times an
email might be sent between two people.

Line 53: namegen
just pairs every first name with every family name. If you want more,
you could use two first names for example, as well as extending the
name lists. (or hyphenated family names). Anyone for Imogen Ethan
Islam-Singh :-)

Line 60: name2email
changes the name from namegen into an email address string

Lines 64-69: clustergen
generates clustercount clusters of names. It randomly chooses how many
names to put in each cluster using clustersizefreq;, then generates
that many names per cluster. nextname feeds new names are required, and
the sorts ensure the clusters are generated in the correct order for
printing to satisfy the original puzzles output criteria.

Lines 71-74: clusterprint will correctly format and print
the clustergen data structure.

Lines 76-95: logggen.
The largest function

Line 81 creates a number of clusters

Line 83 takes the cluster of users and turns each cluster
in turn into the expanded set of correspondances between every member
of the cluster, in order.

Line 86 randomly duplicates some entries of the log using
repeatmsgfraction.

Line 89 adds some authenticity by randomising the order
of the emails

Lines 91-94 add the time information to the log.

Line 95 returns the cluster info as well as the log.

Lines 97-101 function logprint can correctly format and
print the log data. Note the handling of the PST timezone in line 100.

 1 
 color="#804040"> 2 '''
 color="#804040"> 3  Random Data generator for:
 color="#804040"> 4   href="http://www.facebook.com/careers/puzzles.php?puzzle_id=8">http://www.facebook.com/careers/puzzles.php?puzzle_id=8
 color="#804040"> 5 '''
 color="#804040"> 6 
 color="#804040"> 7 
 color="#804040"> 8 import itertools, random, datetime
 color="#804040"> 9 from pprint  color="#a020f0">import pprint  color="#a020f0">as pp
 color="#804040"> 10 try:  color="#0000ff"># Python 3/2 compatibility
 color="#804040"> 11  from sys  color="#a020f0">import intern
 color="#804040"> 12 except:
 color="#804040"> 13  pass
 color="#804040"> 14 
 color="#804040"> 15 firstname = ''' color="#ff00ff"> 
 color="#804040"> 16  Lola Imogen Jasmine Leah Daniel Isla Millie Lilly Benjamin Evie
 color="#804040"> 17  Finlay Ella Jacob Grace Jayden Freya Summer James Adam Henry
 color="#804040"> 18  Poppy Joshua Charlotte Thomas Noah Maisie Tyler Liam George Mason
 color="#804040"> 19  Matthew Joseph Nathan Stan Scarlett Jessica Samuel Leo William Olivia
 color="#804040"> 20  Ruby Isabella Emma Isabelle Charlie Phoebe Alfie Owen Erin Oscar
 color="#804040"> 21  Cameron Katie Harvey Aaron Ethan Lily Finley Riley Jamie Megan
 color="#804040"> 22  Caitlin Amy Connor Alexander Ben Eva Bethany Brooke Layla Harry
 color="#804040"> 23  Chloe Harrison Lauren Ava Oliver Mia Abigail Ellie Holly Jake
 color="#804040"> 24  Hannah Luke Molly Jack Amelia Dylan Alex Lucas Lucy Ryan
 color="#804040"> 25  Daisy Sophie Sophia Archie Callum Lewis Isabel Logan Max Emily
 color="#804040"> 26  '''.split()
 color="#804040"> 27 firstname = [intern(n)  color="#804040">for n  color="#804040">in firstname]
 color="#804040"> 28 
 color="#804040"> 29 familyname = list(set( ''' color="#ff00ff"> 
 color="#804040"> 30  Wilson Russell Smith Phillips Dixon White Gill Brown Statham Oneill
 color="#804040"> 31  Harris Doyle James Ahmad Roberts Mann Moore Thomas Saunders Barker
 color="#804040"> 32  Bright Allen George Hill Jackson Rose Malley Mills Sheppard Mason
 color="#804040"> 33  Wright Holt Pearce Bull Richards Ward Lopez Mathey King Stone
 color="#804040"> 34  Johnson Baker Dupont Green Davis Blanchet Taylor Dean Donnelly Evans
 color="#804040"> 35  Woods Patel Meacher Simon Clarke Young Ellis Palmer Lynch Alexander
 color="#804040"> 36  Sutton Ahmed Adams Kennedy Davies Walker Austin Fernandez Mustafa Rodriguez
 color="#804040"> 37  Bennett Murray Powell Harrison Campbell Cole Khan Mcdonald Edwards Wood
 color="#804040"> 38  Thompson Singh Williams Power Price Watson Hall Jones Cox Chapman
 color="#804040"> 39  Arnold Reid Garcia Morgan Mitchell Ali Lewis Stubbs Scott Islam
 color="#804040"> 40  '''.split() ))
 color="#804040"> 41 familyname = [intern(n)  color="#804040">for n  color="#804040">in familyname]
 color="#804040"> 42 
 color="#804040"> 43 starttime = datetime.datetime(2008, 12, 11, 17, 53, 1, 0)
 color="#804040"> 44 # Choice of delta times between log entries
 color="#804040"> 45 timedeltafreq = [ datetime.timedelta(0, secs)  color="#804040">for secs  color="#804040">in [60]*3 + [120]*10 + [240]*10 ]
 color="#804040"> 46 
 color="#804040"> 47 # Choice of cluster sizes
 color="#804040"> 48 clustersizefreq = [2]*20 + [3]*8 + [4]*3 + [5]*2 + [6]*1 +[7]*1 + [8]*1
 color="#804040"> 49 
 color="#804040"> 50 # Many messages between the same two people?
 color="#804040"> 51 repeatmsgfraction = 2.0  color="#0000ff"># Extra 200%
 color="#804040"> 52 
 color="#804040"> 53 def  color="#008080">namegen():
 color="#804040"> 54  'Generates (first, family) name tuples'
 color="#804040"> 55  names = list(itertools.product(firstname, familyname))
 color="#804040"> 56  random.shuffle(names)
 color="#804040"> 57  for name  color="#804040">in names:
 color="#804040"> 58  yield name
 color="#804040"> 59 
 color="#804040"> 60 def  color="#008080">name2email(name):
 color="#804040"> 61  'format (first, family) name as email address'
 color="#804040"> 62  return ' color="#ff00ff">%s.%s@xx.com' % name
 color="#804040"> 63 
 color="#804040"> 64 def  color="#008080">clustergen(clustercount, clustersizefreq=clustersizefreq,
 color="#804040"> 65  firstname=firstname, familyname=familyname):
 color="#804040"> 66  'Generate clustercount clusters of unique users'
 color="#804040"> 67  nextname = namegen().__next__
 color="#804040"> 68  return sorted( sorted(nextname()  color="#804040">for i  color="#804040">in range(random.choice(clustersizefreq)))
 color="#804040"> 69  for j  color="#804040">in range(clustercount) )
 color="#804040"> 70 
 color="#804040"> 71 def  color="#008080">clusterprint(clusters):
 color="#804040"> 72  'Print already sorted clusters in output format'
 color="#804040"> 73  for cluster  color="#804040">in clusters:
 color="#804040"> 74  print ( ' color="#ff00ff">, '.join(name2email(n)  color="#804040">for n  color="#804040">in cluster) )
 color="#804040"> 75 
 color="#804040"> 76 def  color="#008080">loggen(clustercount, clustersizefreq=clustersizefreq,
 color="#804040"> 77  firstname=firstname, familyname=familyname,
 color="#804040"> 78  starttime=starttime, timedeltafreq=timedeltafreq,
 color="#804040"> 79  repeatmsgfraction=repeatmsgfraction):
 color="#804040"> 80  'Generate a log of clustercount clusters of unique users'
 color="#804040"> 81  clusters = clustergen(clustercount)
 color="#804040"> 82  # clustered emails
 color="#804040"> 83  log = sum([ list(itertools.permutations(cluster, 2))  color="#804040">for cluster  color="#804040">in clusters ],
 color="#804040"> 84  [])
 color="#804040"> 85  # repeats
 color="#804040"> 86  for i  color="#804040">in range(int(repeatmsgfraction*len(log))):
 color="#804040"> 87  log.append(random.choice(log))
 color="#804040"> 88  # dispersed emails
 color="#804040"> 89  random.shuffle(log)
 color="#804040"> 90  # log times for emails
 color="#804040"> 91  logtime = starttime
 color="#804040"> 92  for n, (send, to)  color="#804040">in enumerate(log):
 color="#804040"> 93  log[n] = (logtime, send, to)
 color="#804040"> 94  logtime += random.choice(timedeltafreq)
 color="#804040"> 95  return clusters, log
 color="#804040"> 96 
 color="#804040"> 97 def  color="#008080">logprint(log):
 color="#804040"> 98  'Print log in input format'
 color="#804040"> 99  for logtime, send, to  color="#804040">in log:
 color="#804040">100  print (' color="#ff00ff">%s\t color="#ff00ff">%s\t color="#ff00ff">%s' % (logtime.strftime(" color="#ff00ff">%a %b %d %H:%M:%S PST %Y")
 color="#804040">101  , name2email(send), name2email(to)))
 color="#804040">102

The Program in use

Switching to the command
line shell, I generate a log and its clustering and print the results.
Because of the way the log is generated from the clustering, the
clustering of the log should be as stated.

style="font-family: monospace;">>>> style="font-weight: bold;">clusters, log = loggen(3);
clusterprint(clusters); print(); logprint(log); print(); len(log) style="font-family: monospace;">
Alfie.Khan@xx.com,
Lily.Rose@xx.com, Tyler.Blanchet@xx.com style="font-family: monospace;">
Finley.Ali@xx.com,
Harry.White@xx.com

Finley.Cox@xx.com,
James.Harrison@xx.com

Thu Dec 11 17:53:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 17:55:01
PST 2008
Lily.Rose@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 17:59:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 18:03:01
PST 2008
Finley.Ali@xx.com    Harry.White@xx.com style="font-family: monospace;">
Thu Dec 11 18:07:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 18:09:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 18:13:01
PST 2008
Alfie.Khan@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 18:17:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 18:21:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 18:23:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:27:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:29:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 18:33:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 18:37:01
PST 2008
Alfie.Khan@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 18:39:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 18:43:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 18:45:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 18:47:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:49:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:53:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 18:57:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 18:59:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 19:03:01
PST 2008
Lily.Rose@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 19:04:01
PST 2008
Tyler.Blanchet@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 19:06:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 19:10:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 19:12:01
PST 2008
Lily.Rose@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 19:16:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 19:20:01
PST 2008
Tyler.Blanchet@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 19:22:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">

30

To Do

The cluster info for the puzzle should, of course, have the
clusters of size two filtered out

I haave a solution to the original puzzle, but: " style="font-style: italic;" title="Insert schoolboy excuse here">Blue
Martians forcibly wiped it from my brain" :-)

- Paddy.

Tuesday, July 28, 2009

The case of the disappearing over-bar

Rosetta-code has a task
which asks you to reverse a Unicode string
correctly. I had glanced at the Python
solution and thought nothing of
it until someone did something similar in the R language and stated
that it may be incorrect for the given pattern which includes an
over-bar (my description) over the f in "as⃝df̅" (See the
orginal article for the true Unicode string).

I cut-n-pasted the Python solution and the test string into Python 3.1
idle and decided to test it:

Python 3.1 (r31:73574, Jun 26 2009, 20:21:35) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> # I cut and paste the following string with an over-bar on the f
>>> x = input()
as⃝df̅
>>> # Just showing x gives the over-bar on the f
>>> x
'asâƒdfÌ…'
>>> # Print it and it is fine.
>>> print(x)
asâƒdfÌ…
>>> # Reverse x though, and the over-bar movesover the quote!
>>> x[::-1]
'Ì…fdâƒsa'
>>> # print the reversed x and it disappears altogether!
>>> print(x[::-1])
Ì…fdâƒsa
>>>

|NowCut and paste the unicode characters from firefox for the input
statement, and make sure that:

>>> ['%x' % ord(char) for char in x]
['61', '73', '20dd', '64', '66', '305']

As the unicode might otherwise be lost in my editing to try and present
it to you above.

A bit of scary reading introduced me to
Unicode character classes. It seems that Unicode characters are
sometimes composed, and that if I know what character gets composed
with another, (always the last non-composed character to the left),
then you should be able to form composed groups of characters and then
reverse them.

The WP article gave me the vocabulary, so a quick search in Pythons
library gave me the unicodedata
module which has the name
function. I am not sure if this will work in every case, but from the
WP article and this experimentation:

>>> [(c,'%s' % unicodedata.name(c,0xfffff)) for c in x]
[('a', 'LATIN SMALL LETTER A'), ('s', 'LATIN SMALL LETTER S'), ('⃝', 'COMBINING ENCLOSING CIRCLE'), ('d', 'LATIN SMALL LETTER D'), ('f', 'LATIN SMALL LETTER F'), ('̅', 'COMBINING OVERLINE')]

I think I can group by the presence of the word COMBINING in the name
of a character and so produced the following reversal function:

'''
  Reverse a Unicode string with proper handling of combining characters
'''

import unicodedata

def ureverse(ustring):
    '''
    Reverse a string including unicode combining characters

    Example:
        >>> ucode = ''.join( chr(int(n, 16))
                             for n in ['61', '73', '20dd', '64', '66', '305'] )
        >>> ucoderev = ureverse(ucode)
        >>> ['%x' % ord(char) for char in ucoderev]
        ['66', '305', '64', '73', '20dd', '61']
        >>> 
    '''
    groupedchars = []
    uchar = list(ustring)
    while uchar:
        if 'COMBINING' in unicodedata.name(uchar[0], ''):
            groupedchars[-1] += uchar.pop(0)
        else:
            groupedchars.append(uchar.pop(0))
    # Grouped reversal
    groupedchars = groupedchars[::-1]

    return ''.join(groupedchars)

if __name__ == '__main__':
    ucode = ''.join( chr(int(n, 16))
                     for n in ['61', '73', '20dd', '64', '66', '305'] )
    ucoderev = ureverse(ucode)
    print (ucode)
    print (ucoderev)

It works for the given example text (Try running it to see the output -
I've given up trying to work out what characters you might see).

Gosh, an attractive Unicode issue! Whatever next.