Go deh!: August 2009

Thursday, August 27, 2009

Saw this, and thought of you...

Here in the UK, it is back-to-school time with many parents buying new school computers for their children. Windows 7 is not released yet, but here is an alternative opinion on the operating system from a competitor:

http://windows7sins.org/#7

Wednesday, August 26, 2009

The Story of the Regexp and the Primes

If you are all sitting comfortably, then I shall begin...

Once upon a time, in a land an internet away, A python programmer was
browsing the web when he StumbledUpon a post with a curious regular
expression purporting to test numbers for primality. On further
investigation, the regexp was attributed in more than one place to
"Abigail" and took the form:

style="font-weight: bold; font-family: monospace;">^1?$|^(11+?)\1+$

Wanting to find out how it worked, the programmer was stumped, as all
the posts mentioning the regexp asked you to visit a page with a
supposedly excellent explanation of its inner workings, but the domain
name was no longer present.

The programmer decided to work it out for himself and came up with the
following...

A prime number is a
positive integer that is only divisible by itself and one. Zero and one
are not prime.

Let's say a number N, greater than one, is not prime. Then
there exists two numbers, S and T, that when multiplied together equal
N. We have:

S x T = N

Lets us further assume
that T is greater than, or equal to S. (They can be swapped if
necessary to make this so). Then a small bit of manipulation gives:

(S x (T-1)) + S = N

(You take one less S in
the multiplication, then you need to add the S back).

Lets swap the terms and write this as equation (a):

style="font-weight: bold;">S + (S x (T-1)) = N style="font-weight: bold;"> when S,T,N are integers greater
than one, and N is style="font-style: italic; font-weight: bold;">not style="font-weight: bold;"> prime

Abigail's regexp relies on representing integers as strings of that
number of ones:

Zero would be
    ''

One would be     '1'

two becomes     '11'

and so on...

The regexp gives a match if the string of ones does not represent a
prime.

To match zero or one occurrences of a 1 we use the first half of the
regexp:

style="font-family: monospace;">^1?$

Where ^anchors the following to a match from the beginning of the
string, ? will match zero or one occurrences of the preceding 1, and
the trailing $ matches the end of the string, i.e. it matches
a string that contains only zero or one 1 characters.

Going back to equation (a), greater than one, is the same as two or
more.

So S will be represented by two or more 1's - we don't know how many,
but we do know that it is two or more. S, in a regexp, becomes
something like:

style="font-family: monospace;">11+

1 followed by one or more 1's. S is less than or equal to T,
which translates to using a less greedy form of the zero-or-one
matcher, which is +? giving a representation of S as:

style="font-family: monospace;">11+?

We want to find S followed by T-1 identical copies of S, so lets form a
group out of what matches S:

style="font-family: monospace;">(11+?)

Then by referring to this group, it is the first group in the regexp so
can be referred to as \1, we can search for extra copies of the group
by:

style="font-family: monospace;">\1+

S and its copies must cover the whole of the number so must start at ^
and finish at $:

style="font-weight: bold; font-family: monospace;">^(11+?)\1+$

Together with the earlier regexp fraction that covers the numbers zero
and one, they can be or'd together to cover all non primes as
the following:

style="font-family: monospace;">^1?$|^(11+?)\1+$

...Not knowing when to quit, the programmer opted to display some of
his Python code that defines a primality tester using the regexp:

>>> import re

>>> def isprime(n):

    return not re.match(r' style="font-weight: bold;">^1?$|^(11+?)\1+$', '1' * n)

>>> [i for i in range(40) if isprime(i)]

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

And another function that shows S and T from the regexp match:

>>> def is style="font-weight: bold;">notprime(n):

    notprime = re.match(r'^(1?)$|^(11+?)(\2+)$', '1' * n)

    if notprime:

        if n <=1:

            return (len( notprime.groups()[0]), )

        else:

            S, SxT1 = notprime.groups()[1:]

            s = len(S)

            t = 1 + len(SxT1)//s

            return (s,t)

    else:

        return ()

>>> [(i, isnotprime(i)) for i in range(15)]

[(0, (0,)), (1, (1,)), (2, ()), (3, ()), (4, (2, 2)), (5, ()), (6, (2, 3)), (7, ()), (8, (2, 4)), (9, (3, 3)), (10, (2, 5)), (11, ()), (12, (2, 6)), (13, ()), (14, (2, 7))]

... And so to bed.

THE END.

Saturday, August 08, 2009

Words From Hex Letters

Enter the six hexadecimal letters into an href="http://rosettacode.org/wiki/Anagrams#Python">anagram
solver with a large scrabble-like href="http://www.puzzlers.org/pub/wordlists/unixdict.txt">dictionary
and you can generate a list of sixteen 'English' words that are also
hex numbers. Good for sprinkling around your assembler listings to
spice up the usual DEADBEEF. CAFÉABBÉ anyone? (confess over a coffee).

style="text-align: left; width: 189px; height: 648px; margin-left: 40px;"
border="1" cellpadding="2" cellspacing="2">

WORD
MEANING

ABBE
priest/clergyman

ABED
in bed/ asleep

BABE
infant

BADE
bid, appeal

BEAD
droplet

BEEF
meat from cow. complaint.

CAFE
small restaurant

CEDE
surrender, abdicate

DADA
father, dad

DADE
<None found>

DEAD
"pining for the href="http://en.wikipedia.org/wiki/Fjords" title="Fjords"
class="mw-redirect">fjords", stunned,
snuffed-it.

DEAF
"hard of hearing"

DEED
contract. accomplishment

FACE
boldness. front

FADE
disappear gradually.

FEED
food. give food

P.S. Anyone have a meaning for DADE?

Sunday, August 02, 2009

Facebook User-Clustering Puzzle

I came across the following href="http://www.facebook.com/careers/puzzles.php?puzzle_id=8">puzzle
at facebook. The task is to generate clusters of users based on who
they send emails to, as extracted from their large log file.

They give a strict format for the log file entries of:

style="font-family: monospace;"><date> '\t'
<send email address> '\t' <to email address>

Some lines from the log might look like:

Thu Dec 11 18:17:01 PST 2008    Finley.Cox@xx.com    James.Harrison@xx.com
Thu Dec 11 18:21:01 PST 2008    Alfie.Khan@xx.com    Tyler.Blanchet@xx.com
Thu Dec 11 18:23:01 PST 2008    Harry.White@xx.com   Finley.Ali@xx.com

They then state that a cluster exists if everyone in the group has sent
at least one email to everyone else in the group, so a cluster of users
A B and C, would have had to have sent the following emails at some
time:

from: A to B

from: A to C

from: B to A

from: B to C

from: C to A

from: C to B

You should extract the clusters from the log and report only maximal
clusters, i.e. in the example above, don't report A,C as a cluster if
you are also reporting A,B,C.

Their is also a format for how clusters are to be reported:

Only clusters of three or more people are to be reported

Reports are to be printed to stdout.

A cluster to a line.

Emails in the cluster to appear in sorted order; style="font-style: italic;">separated by the
two characters of a single comma followed by a single space.

The lines of clusters to also be sorted on the characters
appearing in the each cluster line.

An output for some log file might look like:

Aaron.Lewis@xx.com, Callum.Bennett@xx.com, Jayden.Mills@xx.com, Lucas.Lewis@xx.com, Matthew.Kennedy@xx.com
Alfie.Clarke@xx.com, Amy.Russell@xx.com, Connor.King@xx.com, Ethan.Thomas@xx.com, Joshua.Mcdonald@xx.com

The log files are said to be large.

The above is the original problem, re-phrased.

The Reverse Problem

Early on in thinking about how to solve the problem I thought I would
need much more than the ten lines or so of sample log lines
that they gave. I then thought that style="font-style: italic;">generating a log
file would be a good task too and set out to solve that
instead!

I need to:

Generate pseudo-random email addresses

Generate clusters of users

Generate a log from the clusters

And in that order too. When I have emails, I can create
clusters of emails of different sizes, then expand the clusters into a
log file structure, then print the log in the correct format.

The code follows, this description.

Line 9 is a debug remnant.

Although have only run the code in Python 3, I think it
should work in Python 2.6 as well.

Lines 10-13 are because intern is moved to the sys module in Python 3.

Lines 15-40: I needed thousands of names so trawled WP for href="http://en.wikipedia.org/wiki/List_of_most_popular_given_names">first
names and href="http://en.wikipedia.org/wiki/List_of_most_common_surnames">family
names to create their product.

Lines 27 and 41: I intern'd the strings because I could -
It's a possible premature style="text-decoration: line-through;">ejacu
optimization, but hey.

Lines 43-45: I will be building the log without times but
in order, then adding times to the ordered log entries. to do that i
need the time to start, and some possible time increments between log
entries that I will randomly choose between.

Line 48; The size of cluster I generate is a random choice
between the sizes mentioned here.. For example, a cluster size of 2, ( style="font-family: monospace;">[2]*20), will
predominate.

Line 51 controls how many style="font-style: italic;">extra times an
email might be sent between two people.

Line 53: namegen
just pairs every first name with every family name. If you want more,
you could use two first names for example, as well as extending the
name lists. (or hyphenated family names). Anyone for Imogen Ethan
Islam-Singh :-)

Line 60: name2email
changes the name from namegen into an email address string

Lines 64-69: clustergen
generates clustercount clusters of names. It randomly chooses how many
names to put in each cluster using clustersizefreq;, then generates
that many names per cluster. nextname feeds new names are required, and
the sorts ensure the clusters are generated in the correct order for
printing to satisfy the original puzzles output criteria.

Lines 71-74: clusterprint will correctly format and print
the clustergen data structure.

Lines 76-95: logggen.
The largest function

Line 81 creates a number of clusters

Line 83 takes the cluster of users and turns each cluster
in turn into the expanded set of correspondances between every member
of the cluster, in order.

Line 86 randomly duplicates some entries of the log using
repeatmsgfraction.

Line 89 adds some authenticity by randomising the order
of the emails

Lines 91-94 add the time information to the log.

Line 95 returns the cluster info as well as the log.

Lines 97-101 function logprint can correctly format and
print the log data. Note the handling of the PST timezone in line 100.

 1 
 color="#804040"> 2 '''
 color="#804040"> 3  Random Data generator for:
 color="#804040"> 4   href="http://www.facebook.com/careers/puzzles.php?puzzle_id=8">http://www.facebook.com/careers/puzzles.php?puzzle_id=8
 color="#804040"> 5 '''
 color="#804040"> 6 
 color="#804040"> 7 
 color="#804040"> 8 import itertools, random, datetime
 color="#804040"> 9 from pprint  color="#a020f0">import pprint  color="#a020f0">as pp
 color="#804040"> 10 try:  color="#0000ff"># Python 3/2 compatibility
 color="#804040"> 11  from sys  color="#a020f0">import intern
 color="#804040"> 12 except:
 color="#804040"> 13  pass
 color="#804040"> 14 
 color="#804040"> 15 firstname = ''' color="#ff00ff"> 
 color="#804040"> 16  Lola Imogen Jasmine Leah Daniel Isla Millie Lilly Benjamin Evie
 color="#804040"> 17  Finlay Ella Jacob Grace Jayden Freya Summer James Adam Henry
 color="#804040"> 18  Poppy Joshua Charlotte Thomas Noah Maisie Tyler Liam George Mason
 color="#804040"> 19  Matthew Joseph Nathan Stan Scarlett Jessica Samuel Leo William Olivia
 color="#804040"> 20  Ruby Isabella Emma Isabelle Charlie Phoebe Alfie Owen Erin Oscar
 color="#804040"> 21  Cameron Katie Harvey Aaron Ethan Lily Finley Riley Jamie Megan
 color="#804040"> 22  Caitlin Amy Connor Alexander Ben Eva Bethany Brooke Layla Harry
 color="#804040"> 23  Chloe Harrison Lauren Ava Oliver Mia Abigail Ellie Holly Jake
 color="#804040"> 24  Hannah Luke Molly Jack Amelia Dylan Alex Lucas Lucy Ryan
 color="#804040"> 25  Daisy Sophie Sophia Archie Callum Lewis Isabel Logan Max Emily
 color="#804040"> 26  '''.split()
 color="#804040"> 27 firstname = [intern(n)  color="#804040">for n  color="#804040">in firstname]
 color="#804040"> 28 
 color="#804040"> 29 familyname = list(set( ''' color="#ff00ff"> 
 color="#804040"> 30  Wilson Russell Smith Phillips Dixon White Gill Brown Statham Oneill
 color="#804040"> 31  Harris Doyle James Ahmad Roberts Mann Moore Thomas Saunders Barker
 color="#804040"> 32  Bright Allen George Hill Jackson Rose Malley Mills Sheppard Mason
 color="#804040"> 33  Wright Holt Pearce Bull Richards Ward Lopez Mathey King Stone
 color="#804040"> 34  Johnson Baker Dupont Green Davis Blanchet Taylor Dean Donnelly Evans
 color="#804040"> 35  Woods Patel Meacher Simon Clarke Young Ellis Palmer Lynch Alexander
 color="#804040"> 36  Sutton Ahmed Adams Kennedy Davies Walker Austin Fernandez Mustafa Rodriguez
 color="#804040"> 37  Bennett Murray Powell Harrison Campbell Cole Khan Mcdonald Edwards Wood
 color="#804040"> 38  Thompson Singh Williams Power Price Watson Hall Jones Cox Chapman
 color="#804040"> 39  Arnold Reid Garcia Morgan Mitchell Ali Lewis Stubbs Scott Islam
 color="#804040"> 40  '''.split() ))
 color="#804040"> 41 familyname = [intern(n)  color="#804040">for n  color="#804040">in familyname]
 color="#804040"> 42 
 color="#804040"> 43 starttime = datetime.datetime(2008, 12, 11, 17, 53, 1, 0)
 color="#804040"> 44 # Choice of delta times between log entries
 color="#804040"> 45 timedeltafreq = [ datetime.timedelta(0, secs)  color="#804040">for secs  color="#804040">in [60]*3 + [120]*10 + [240]*10 ]
 color="#804040"> 46 
 color="#804040"> 47 # Choice of cluster sizes
 color="#804040"> 48 clustersizefreq = [2]*20 + [3]*8 + [4]*3 + [5]*2 + [6]*1 +[7]*1 + [8]*1
 color="#804040"> 49 
 color="#804040"> 50 # Many messages between the same two people?
 color="#804040"> 51 repeatmsgfraction = 2.0  color="#0000ff"># Extra 200%
 color="#804040"> 52 
 color="#804040"> 53 def  color="#008080">namegen():
 color="#804040"> 54  'Generates (first, family) name tuples'
 color="#804040"> 55  names = list(itertools.product(firstname, familyname))
 color="#804040"> 56  random.shuffle(names)
 color="#804040"> 57  for name  color="#804040">in names:
 color="#804040"> 58  yield name
 color="#804040"> 59 
 color="#804040"> 60 def  color="#008080">name2email(name):
 color="#804040"> 61  'format (first, family) name as email address'
 color="#804040"> 62  return ' color="#ff00ff">%s.%s@xx.com' % name
 color="#804040"> 63 
 color="#804040"> 64 def  color="#008080">clustergen(clustercount, clustersizefreq=clustersizefreq,
 color="#804040"> 65  firstname=firstname, familyname=familyname):
 color="#804040"> 66  'Generate clustercount clusters of unique users'
 color="#804040"> 67  nextname = namegen().__next__
 color="#804040"> 68  return sorted( sorted(nextname()  color="#804040">for i  color="#804040">in range(random.choice(clustersizefreq)))
 color="#804040"> 69  for j  color="#804040">in range(clustercount) )
 color="#804040"> 70 
 color="#804040"> 71 def  color="#008080">clusterprint(clusters):
 color="#804040"> 72  'Print already sorted clusters in output format'
 color="#804040"> 73  for cluster  color="#804040">in clusters:
 color="#804040"> 74  print ( ' color="#ff00ff">, '.join(name2email(n)  color="#804040">for n  color="#804040">in cluster) )
 color="#804040"> 75 
 color="#804040"> 76 def  color="#008080">loggen(clustercount, clustersizefreq=clustersizefreq,
 color="#804040"> 77  firstname=firstname, familyname=familyname,
 color="#804040"> 78  starttime=starttime, timedeltafreq=timedeltafreq,
 color="#804040"> 79  repeatmsgfraction=repeatmsgfraction):
 color="#804040"> 80  'Generate a log of clustercount clusters of unique users'
 color="#804040"> 81  clusters = clustergen(clustercount)
 color="#804040"> 82  # clustered emails
 color="#804040"> 83  log = sum([ list(itertools.permutations(cluster, 2))  color="#804040">for cluster  color="#804040">in clusters ],
 color="#804040"> 84  [])
 color="#804040"> 85  # repeats
 color="#804040"> 86  for i  color="#804040">in range(int(repeatmsgfraction*len(log))):
 color="#804040"> 87  log.append(random.choice(log))
 color="#804040"> 88  # dispersed emails
 color="#804040"> 89  random.shuffle(log)
 color="#804040"> 90  # log times for emails
 color="#804040"> 91  logtime = starttime
 color="#804040"> 92  for n, (send, to)  color="#804040">in enumerate(log):
 color="#804040"> 93  log[n] = (logtime, send, to)
 color="#804040"> 94  logtime += random.choice(timedeltafreq)
 color="#804040"> 95  return clusters, log
 color="#804040"> 96 
 color="#804040"> 97 def  color="#008080">logprint(log):
 color="#804040"> 98  'Print log in input format'
 color="#804040"> 99  for logtime, send, to  color="#804040">in log:
 color="#804040">100  print (' color="#ff00ff">%s\t color="#ff00ff">%s\t color="#ff00ff">%s' % (logtime.strftime(" color="#ff00ff">%a %b %d %H:%M:%S PST %Y")
 color="#804040">101  , name2email(send), name2email(to)))
 color="#804040">102

The Program in use

Switching to the command
line shell, I generate a log and its clustering and print the results.
Because of the way the log is generated from the clustering, the
clustering of the log should be as stated.

style="font-family: monospace;">>>> style="font-weight: bold;">clusters, log = loggen(3);
clusterprint(clusters); print(); logprint(log); print(); len(log) style="font-family: monospace;">
Alfie.Khan@xx.com,
Lily.Rose@xx.com, Tyler.Blanchet@xx.com style="font-family: monospace;">
Finley.Ali@xx.com,
Harry.White@xx.com

Finley.Cox@xx.com,
James.Harrison@xx.com

Thu Dec 11 17:53:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 17:55:01
PST 2008
Lily.Rose@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 17:59:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 18:03:01
PST 2008
Finley.Ali@xx.com    Harry.White@xx.com style="font-family: monospace;">
Thu Dec 11 18:07:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 18:09:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 18:13:01
PST 2008
Alfie.Khan@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 18:17:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 18:21:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 18:23:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:27:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:29:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 18:33:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 18:37:01
PST 2008
Alfie.Khan@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 18:39:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 18:43:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 18:45:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 18:47:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:49:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:53:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 18:57:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 18:59:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 19:03:01
PST 2008
Lily.Rose@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 19:04:01
PST 2008
Tyler.Blanchet@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 19:06:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 19:10:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 19:12:01
PST 2008
Lily.Rose@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 19:16:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 19:20:01
PST 2008
Tyler.Blanchet@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 19:22:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">

30

To Do

The cluster info for the puzzle should, of course, have the
clusters of size two filtered out

I haave a solution to the original puzzle, but: " style="font-style: italic;" title="Insert schoolboy excuse here">Blue
Martians forcibly wiped it from my brain" :-)

- Paddy.

Go deh!

Thursday, August 27, 2009

Saw this, and thought of you...

Wednesday, August 26, 2009

The Story of the Regexp and the Primes

Saturday, August 08, 2009

Words From Hex Letters

Sunday, August 02, 2009

Facebook User-Clustering Puzzle

The Reverse Problem

The Program in use

To Do

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive