Mainly Tech projects on Python and Electronic Design Automation.

Thursday, August 27, 2009

Saw this, and thought of you...

Here in the UK, it is back-to-school time with many parents buying new school computers for their children. Windows 7 is not released yet, but here is an alternative opinion on the operating system from a competitor:

Wednesday, August 26, 2009

The Story of the Regexp and the Primes

If you are all sitting comfortably, then I shall begin...

Once upon a time, in a land an internet away, A python programmer was
browsing the web when he StumbledUpon a post with a curious regular
expression purporting to test numbers for primality. On further
investigation, the regexp was attributed in more than one place to
"Abigail" and took the form:

style="font-weight: bold; font-family: monospace;">^1?$|^(11+?)\1+$

Wanting to find out how it worked, the programmer was stumped, as all
the posts mentioning the regexp asked you to visit a page with a
supposedly excellent explanation of its inner workings, but the domain
name was no longer present.

The programmer decided to work it out for himself and came up with the

A prime number is a
positive integer that is only divisible by itself and one. Zero and one
are not prime.

Let's say a number N, greater than one,  is not prime. Then
there exists two numbers, S and T, that when multiplied together equal
N. We have:

S x T = N

Lets us further assume
that T is greater than, or equal to S. (They can be swapped if
necessary to make this so). Then a small bit of manipulation gives:

(S x (T-1)) + S = N

(You take one less S in
the multiplication, then you need to add the S back).

Lets swap the terms and write this as equation (a):

style="font-weight: bold;">S + (S x (T-1)) = N  style="font-weight: bold;"> when S,T,N are integers greater
than one, and N is style="font-style: italic; font-weight: bold;">not style="font-weight: bold;"> prime

Abigail's regexp relies on representing integers as strings of that
number of ones:

Zero would be

One would be     '1'

two becomes     '11'

and so on...

The regexp gives a match if the string of ones does not represent a

To match zero or one occurrences of a 1 we use the first half of the

style="font-family: monospace;">^1?$

Where ^anchors the following to a match from the beginning of the
string, ? will match zero or one occurrences of the preceding 1, and
the  trailing $ matches the end of the string, i.e. it matches
a string that contains only zero or one 1 characters.

Going back to equation (a), greater than one, is the same as two or

So S will be represented by two or more 1's - we don't know how many,
but we do know that it is two or more. S, in a regexp, becomes
something like:

style="font-family: monospace;">11+

1 followed by one or more 1's. S is less than or equal to T,
which translates to using a less greedy form of the zero-or-one
matcher, which is +? giving a representation of S as:

style="font-family: monospace;">11+?

We want to find S followed by T-1 identical copies of S, so lets form a
group out of what matches S:

style="font-family: monospace;">(11+?)

Then by referring to this group, it is the first group in the regexp so
can be referred to as \1, we can search for extra copies of the group

style="font-family: monospace;">\1+

S and its copies must cover the whole of the number so must start at ^
and finish at $:

style="font-weight: bold; font-family: monospace;">^(11+?)\1+$

Together with the earlier regexp fraction that covers the numbers zero
and one, they can be or'd together to cover all non primes as
the following:

style="font-family: monospace;">^1?$|^(11+?)\1+$

...Not knowing when to quit, the programmer opted to display some of
his Python code that defines a primality tester using the regexp:

>>> import re

>>> def isprime(n):

    return not re.match(r' style="font-weight: bold;">^1?$|^(11+?)\1+$', '1' * n)

>>> [i for i in range(40) if isprime(i)]

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

And another function that shows S and T  from the regexp match:

>>> def is style="font-weight: bold;">notprime(n):

    notprime = re.match(r'^(1?)$|^(11+?)(\2+)$', '1' * n)

    if notprime:

        if n <=1:

            return (len( notprime.groups()[0]), )


            S, SxT1 = notprime.groups()[1:]

            s = len(S)

            t = 1 + len(SxT1)//s

            return (s,t)


        return ()

>>> [(i, isnotprime(i)) for i in range(15)]

[(0, (0,)), (1, (1,)), (2, ()), (3, ()), (4, (2, 2)), (5, ()), (6, (2, 3)), (7, ()), (8, (2, 4)), (9, (3, 3)), (10, (2, 5)), (11, ()), (12, (2, 6)), (13, ()), (14, (2, 7))]

... And so to bed.


Saturday, August 08, 2009

Words From Hex Letters

Enter the six hexadecimal letters into an href="">anagram
solver with a large scrabble-like href="">dictionary
and you can generate a list of sixteen 'English' words that are also
hex numbers. Good for sprinkling around your assembler listings to
spice up the usual DEADBEEF. CAFÉABBÉ anyone? (confess over a coffee).

style="text-align: left; width: 189px; height: 648px; margin-left: 40px;"
border="1" cellpadding="2" cellspacing="2">



in bed/ asleep


bid, appeal


meat from cow. complaint.

small restaurant

surrender, abdicate

father, dad

<None found>

"pining for the href="" title="Fjords"
class="mw-redirect">fjords", stunned,

"hard of hearing"

contract. accomplishment

boldness. front

disappear gradually.

food. give food

P.S. Anyone have a meaning for DADE?

Sunday, August 02, 2009

Facebook User-Clustering Puzzle

I came across the following href="">puzzle
at facebook. The task is to generate clusters of users based on who
they send emails to, as extracted from their large log file.

They give a strict format for the log file entries of:

style="font-family: monospace;"><date> '\t'
<send email address> '\t' <to email address>

Some lines from the log might look like:

Thu Dec 11 18:17:01 PST 2008
Thu Dec 11 18:21:01 PST 2008
Thu Dec 11 18:23:01 PST 2008

They then state that a cluster exists if everyone in the group has sent
at least one email to everyone else in the group, so a cluster of users
A B and C, would have had to have sent the following emails at some

from: A to B

from: A to C

from: B to A

from: B to C

from: C to A

from: C to B

You should extract the clusters from the log and report only maximal
clusters, i.e. in the example above, don't report A,C as a cluster if
you are also reporting A,B,C.

Their is also a format for how clusters are to be reported:

  • Only clusters of three or more people are to be reported

  • Reports are to be printed to stdout.

  • A cluster to a line.

  • Emails in the cluster to appear in sorted order; style="font-style: italic;">separated by the
    two characters of a single comma followed by a single space.

  • The lines of clusters to also be sorted on the characters
    appearing in the each cluster line.

An output for some log file might look like:,,,,,,,,

The log files are said to be large.

The above is the original problem, re-phrased.

The Reverse Problem

Early on in thinking about how to solve the problem I thought I would
need  much more than the ten lines or so of sample log lines
that they gave. I then thought that style="font-style: italic;">generating a log
file would be a good  task too and set out to solve that

I need to:

  1. Generate pseudo-random email addresses

  2. Generate clusters of users

  3. Generate a log from the clusters

 And in that order too. When I have emails, I can create
clusters of emails of different sizes, then expand the clusters into a
log file structure, then print the log in the correct format.

The code follows, this description.

  • Line 9 is a debug remnant.

  • Although have only run the code in Python 3, I think it
    should work in Python 2.6 as well.

    Lines 10-13 are because intern is moved to the sys module in Python 3.

  • Lines 15-40: I needed thousands of names so trawled WP for href="">first
    names and href="">family
    names to create their product.

  • Lines 27 and 41: I intern'd the strings because I could -
    It's a possible premature style="text-decoration: line-through;">ejacu
    optimization, but hey.

  • Lines 43-45: I will be building the log without times but
    in order, then adding times to the ordered log entries. to do that i
    need the time to start, and some possible time increments between log
    entries that I will randomly choose between.

  • Line 48; The size of cluster I generate is a random choice
    between the sizes mentioned here.. For example, a cluster size of 2, ( style="font-family: monospace;">[2]*20), will

  • Line 51 controls how many style="font-style: italic;">extra times an
    email might be sent between two people.

  • Line 53: namegen
    just pairs every first name with every family name. If you want more,
    you could use two first names for example, as well as extending the
    name lists. (or hyphenated family names). Anyone for Imogen Ethan
    Islam-Singh :-)

  • Line 60: name2email
    changes the name from namegen into an email address string

  • Lines 64-69: clustergen
    generates clustercount clusters of names. It randomly chooses how many
    names to put in each cluster using clustersizefreq;, then generates
    that many names per cluster. nextname feeds new names are required, and
    the sorts ensure the clusters are generated in the correct order for
    printing to satisfy the original puzzles output criteria.

  • Lines 71-74: clusterprint will correctly format and print
    the clustergen data structure.

  • Lines 76-95: logggen.
    The largest function 

    • Line 81 creates a number of clusters

    • Line 83 takes the cluster of users and turns each cluster
      in turn into the expanded set of correspondances between every member
      of the cluster, in order.

    • Line 86 randomly duplicates some entries of the log using

    • Line 89 adds some authenticity by randomising the order
      of the emails

    • Lines 91-94 add the time information to the log.

    • Line 95 returns the cluster info as well as the log.

  • Lines 97-101 function logprint can correctly format and
    print the log data. Note the handling of the PST timezone in line 100.

color="#804040"> 2 '''
color="#804040"> 3 Random Data generator for:
color="#804040"> 4 href="">
color="#804040"> 5 '''
color="#804040"> 6
color="#804040"> 7
color="#804040"> 8 import itertools, random, datetime
color="#804040"> 9 from pprint color="#a020f0">import pprint color="#a020f0">as pp
color="#804040"> 10 try: color="#0000ff"># Python 3/2 compatibility
color="#804040"> 11 from sys color="#a020f0">import intern
color="#804040"> 12 except:
color="#804040"> 13 pass
color="#804040"> 14
color="#804040"> 15 firstname = ''' color="#ff00ff">
color="#804040"> 16 Lola Imogen Jasmine Leah Daniel Isla Millie Lilly Benjamin Evie
color="#804040"> 17 Finlay Ella Jacob Grace Jayden Freya Summer James Adam Henry
color="#804040"> 18 Poppy Joshua Charlotte Thomas Noah Maisie Tyler Liam George Mason
color="#804040"> 19 Matthew Joseph Nathan Stan Scarlett Jessica Samuel Leo William Olivia
color="#804040"> 20 Ruby Isabella Emma Isabelle Charlie Phoebe Alfie Owen Erin Oscar
color="#804040"> 21 Cameron Katie Harvey Aaron Ethan Lily Finley Riley Jamie Megan
color="#804040"> 22 Caitlin Amy Connor Alexander Ben Eva Bethany Brooke Layla Harry
color="#804040"> 23 Chloe Harrison Lauren Ava Oliver Mia Abigail Ellie Holly Jake
color="#804040"> 24 Hannah Luke Molly Jack Amelia Dylan Alex Lucas Lucy Ryan
color="#804040"> 25 Daisy Sophie Sophia Archie Callum Lewis Isabel Logan Max Emily
color="#804040"> 26 '''.split()
color="#804040"> 27 firstname = [intern(n) color="#804040">for n color="#804040">in firstname]
color="#804040"> 28
color="#804040"> 29 familyname = list(set( ''' color="#ff00ff">
color="#804040"> 30 Wilson Russell Smith Phillips Dixon White Gill Brown Statham Oneill
color="#804040"> 31 Harris Doyle James Ahmad Roberts Mann Moore Thomas Saunders Barker
color="#804040"> 32 Bright Allen George Hill Jackson Rose Malley Mills Sheppard Mason
color="#804040"> 33 Wright Holt Pearce Bull Richards Ward Lopez Mathey King Stone
color="#804040"> 34 Johnson Baker Dupont Green Davis Blanchet Taylor Dean Donnelly Evans
color="#804040"> 35 Woods Patel Meacher Simon Clarke Young Ellis Palmer Lynch Alexander
color="#804040"> 36 Sutton Ahmed Adams Kennedy Davies Walker Austin Fernandez Mustafa Rodriguez
color="#804040"> 37 Bennett Murray Powell Harrison Campbell Cole Khan Mcdonald Edwards Wood
color="#804040"> 38 Thompson Singh Williams Power Price Watson Hall Jones Cox Chapman
color="#804040"> 39 Arnold Reid Garcia Morgan Mitchell Ali Lewis Stubbs Scott Islam
color="#804040"> 40 '''.split() ))
color="#804040"> 41 familyname = [intern(n) color="#804040">for n color="#804040">in familyname]
color="#804040"> 42
color="#804040"> 43 starttime = datetime.datetime(2008, 12, 11, 17, 53, 1, 0)
color="#804040"> 44 # Choice of delta times between log entries
color="#804040"> 45 timedeltafreq = [ datetime.timedelta(0, secs) color="#804040">for secs color="#804040">in [60]*3 + [120]*10 + [240]*10 ]
color="#804040"> 46
color="#804040"> 47 # Choice of cluster sizes
color="#804040"> 48 clustersizefreq = [2]*20 + [3]*8 + [4]*3 + [5]*2 + [6]*1 +[7]*1 + [8]*1
color="#804040"> 49
color="#804040"> 50 # Many messages between the same two people?
color="#804040"> 51 repeatmsgfraction = 2.0 color="#0000ff"># Extra 200%
color="#804040"> 52
color="#804040"> 53 def color="#008080">namegen():
color="#804040"> 54 'Generates (first, family) name tuples'
color="#804040"> 55 names = list(itertools.product(firstname, familyname))
color="#804040"> 56 random.shuffle(names)
color="#804040"> 57 for name color="#804040">in names:
color="#804040"> 58 yield name
color="#804040"> 59
color="#804040"> 60 def color="#008080">name2email(name):
color="#804040"> 61 'format (first, family) name as email address'
color="#804040"> 62 return ' color="#ff00ff">' % name
color="#804040"> 63
color="#804040"> 64 def color="#008080">clustergen(clustercount, clustersizefreq=clustersizefreq,
color="#804040"> 65 firstname=firstname, familyname=familyname):
color="#804040"> 66 'Generate clustercount clusters of unique users'
color="#804040"> 67 nextname = namegen().__next__
color="#804040"> 68 return sorted( sorted(nextname() color="#804040">for i color="#804040">in range(random.choice(clustersizefreq)))
color="#804040"> 69 for j color="#804040">in range(clustercount) )
color="#804040"> 70
color="#804040"> 71 def color="#008080">clusterprint(clusters):
color="#804040"> 72 'Print already sorted clusters in output format'
color="#804040"> 73 for cluster color="#804040">in clusters:
color="#804040"> 74 print ( ' color="#ff00ff">, '.join(name2email(n) color="#804040">for n color="#804040">in cluster) )
color="#804040"> 75
color="#804040"> 76 def color="#008080">loggen(clustercount, clustersizefreq=clustersizefreq,
color="#804040"> 77 firstname=firstname, familyname=familyname,
color="#804040"> 78 starttime=starttime, timedeltafreq=timedeltafreq,
color="#804040"> 79 repeatmsgfraction=repeatmsgfraction):
color="#804040"> 80 'Generate a log of clustercount clusters of unique users'
color="#804040"> 81 clusters = clustergen(clustercount)
color="#804040"> 82 # clustered emails
color="#804040"> 83 log = sum([ list(itertools.permutations(cluster, 2)) color="#804040">for cluster color="#804040">in clusters ],
color="#804040"> 84 [])
color="#804040"> 85 # repeats
color="#804040"> 86 for i color="#804040">in range(int(repeatmsgfraction*len(log))):
color="#804040"> 87 log.append(random.choice(log))
color="#804040"> 88 # dispersed emails
color="#804040"> 89 random.shuffle(log)
color="#804040"> 90 # log times for emails
color="#804040"> 91 logtime = starttime
color="#804040"> 92 for n, (send, to) color="#804040">in enumerate(log):
color="#804040"> 93 log[n] = (logtime, send, to)
color="#804040"> 94 logtime += random.choice(timedeltafreq)
color="#804040"> 95 return clusters, log
color="#804040"> 96
color="#804040"> 97 def color="#008080">logprint(log):
color="#804040"> 98 'Print log in input format'
color="#804040"> 99 for logtime, send, to color="#804040">in log:
color="#804040">100 print (' color="#ff00ff">%s\t color="#ff00ff">%s\t color="#ff00ff">%s' % (logtime.strftime(" color="#ff00ff">%a %b %d %H:%M:%S PST %Y")
color="#804040">101 , name2email(send), name2email(to)))

The Program in use

Switching to the command
line shell, I generate a log and its clustering and print the results.
Because of the way the log is generated from the clustering, the
clustering of the log should be as stated.

style="font-family: monospace;">>>> style="font-weight: bold;">clusters, log = loggen(3);
clusterprint(clusters); print(); logprint(log); print(); len(log) style="font-family: monospace;">,,
style="font-family: monospace;">,,

Thu Dec 11 17:53:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 17:55:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 17:59:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:03:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:07:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:09:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:13:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:17:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:21:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:23:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:27:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:29:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:33:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:37:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:39:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:43:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:45:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:47:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:49:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:53:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:57:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:59:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:03:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:04:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:06:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:10:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:12:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:16:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:20:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:22:01
PST 2008
style="font-family: monospace;">


To Do

  • The cluster info for the puzzle should, of course, have the
    clusters of size two filtered out

  • I haave a solution to the original puzzle, but: " style="font-style: italic;" title="Insert schoolboy excuse here">Blue
    Martians forcibly wiped it from my brain" :-)

- Paddy.


Subscribe Now: google

Add to Google Reader or Homepage

Go deh too!

Blog Archive