Go deh!: Facebook User-Clustering Puzzle

I came across the following href="http://www.facebook.com/careers/puzzles.php?puzzle_id=8">puzzle
at facebook. The task is to generate clusters of users based on who
they send emails to, as extracted from their large log file.

They give a strict format for the log file entries of:

style="font-family: monospace;"><date> '\t'
<send email address> '\t' <to email address>

Some lines from the log might look like:

Thu Dec 11 18:17:01 PST 2008    Finley.Cox@xx.com    James.Harrison@xx.com
Thu Dec 11 18:21:01 PST 2008    Alfie.Khan@xx.com    Tyler.Blanchet@xx.com
Thu Dec 11 18:23:01 PST 2008    Harry.White@xx.com   Finley.Ali@xx.com

They then state that a cluster exists if everyone in the group has sent
at least one email to everyone else in the group, so a cluster of users
A B and C, would have had to have sent the following emails at some
time:

from: A to B

from: A to C

from: B to A

from: B to C

from: C to A

from: C to B

You should extract the clusters from the log and report only maximal
clusters, i.e. in the example above, don't report A,C as a cluster if
you are also reporting A,B,C.

Their is also a format for how clusters are to be reported:

Only clusters of three or more people are to be reported

Reports are to be printed to stdout.

A cluster to a line.

Emails in the cluster to appear in sorted order; style="font-style: italic;">separated by the
two characters of a single comma followed by a single space.

The lines of clusters to also be sorted on the characters
appearing in the each cluster line.

An output for some log file might look like:

Aaron.Lewis@xx.com, Callum.Bennett@xx.com, Jayden.Mills@xx.com, Lucas.Lewis@xx.com, Matthew.Kennedy@xx.com
Alfie.Clarke@xx.com, Amy.Russell@xx.com, Connor.King@xx.com, Ethan.Thomas@xx.com, Joshua.Mcdonald@xx.com

The log files are said to be large.

The above is the original problem, re-phrased.

The Reverse Problem

Early on in thinking about how to solve the problem I thought I would
need much more than the ten lines or so of sample log lines
that they gave. I then thought that style="font-style: italic;">generating a log
file would be a good task too and set out to solve that
instead!

I need to:

Generate pseudo-random email addresses

Generate clusters of users

Generate a log from the clusters

And in that order too. When I have emails, I can create
clusters of emails of different sizes, then expand the clusters into a
log file structure, then print the log in the correct format.

The code follows, this description.

Line 9 is a debug remnant.

Although have only run the code in Python 3, I think it
should work in Python 2.6 as well.

Lines 10-13 are because intern is moved to the sys module in Python 3.

Lines 15-40: I needed thousands of names so trawled WP for href="http://en.wikipedia.org/wiki/List_of_most_popular_given_names">first
names and href="http://en.wikipedia.org/wiki/List_of_most_common_surnames">family
names to create their product.

Lines 27 and 41: I intern'd the strings because I could -
It's a possible premature style="text-decoration: line-through;">ejacu
optimization, but hey.

Lines 43-45: I will be building the log without times but
in order, then adding times to the ordered log entries. to do that i
need the time to start, and some possible time increments between log
entries that I will randomly choose between.

Line 48; The size of cluster I generate is a random choice
between the sizes mentioned here.. For example, a cluster size of 2, ( style="font-family: monospace;">[2]*20), will
predominate.

Line 51 controls how many style="font-style: italic;">extra times an
email might be sent between two people.

Line 53: namegen
just pairs every first name with every family name. If you want more,
you could use two first names for example, as well as extending the
name lists. (or hyphenated family names). Anyone for Imogen Ethan
Islam-Singh :-)

Line 60: name2email
changes the name from namegen into an email address string

Lines 64-69: clustergen
generates clustercount clusters of names. It randomly chooses how many
names to put in each cluster using clustersizefreq;, then generates
that many names per cluster. nextname feeds new names are required, and
the sorts ensure the clusters are generated in the correct order for
printing to satisfy the original puzzles output criteria.

Lines 71-74: clusterprint will correctly format and print
the clustergen data structure.

Lines 76-95: logggen.
The largest function

Line 81 creates a number of clusters

Line 83 takes the cluster of users and turns each cluster
in turn into the expanded set of correspondances between every member
of the cluster, in order.

Line 86 randomly duplicates some entries of the log using
repeatmsgfraction.

Line 89 adds some authenticity by randomising the order
of the emails

Lines 91-94 add the time information to the log.

Line 95 returns the cluster info as well as the log.

Lines 97-101 function logprint can correctly format and
print the log data. Note the handling of the PST timezone in line 100.

 1 
 color="#804040"> 2 '''
 color="#804040"> 3  Random Data generator for:
 color="#804040"> 4   href="http://www.facebook.com/careers/puzzles.php?puzzle_id=8">http://www.facebook.com/careers/puzzles.php?puzzle_id=8
 color="#804040"> 5 '''
 color="#804040"> 6 
 color="#804040"> 7 
 color="#804040"> 8 import itertools, random, datetime
 color="#804040"> 9 from pprint  color="#a020f0">import pprint  color="#a020f0">as pp
 color="#804040"> 10 try:  color="#0000ff"># Python 3/2 compatibility
 color="#804040"> 11  from sys  color="#a020f0">import intern
 color="#804040"> 12 except:
 color="#804040"> 13  pass
 color="#804040"> 14 
 color="#804040"> 15 firstname = ''' color="#ff00ff"> 
 color="#804040"> 16  Lola Imogen Jasmine Leah Daniel Isla Millie Lilly Benjamin Evie
 color="#804040"> 17  Finlay Ella Jacob Grace Jayden Freya Summer James Adam Henry
 color="#804040"> 18  Poppy Joshua Charlotte Thomas Noah Maisie Tyler Liam George Mason
 color="#804040"> 19  Matthew Joseph Nathan Stan Scarlett Jessica Samuel Leo William Olivia
 color="#804040"> 20  Ruby Isabella Emma Isabelle Charlie Phoebe Alfie Owen Erin Oscar
 color="#804040"> 21  Cameron Katie Harvey Aaron Ethan Lily Finley Riley Jamie Megan
 color="#804040"> 22  Caitlin Amy Connor Alexander Ben Eva Bethany Brooke Layla Harry
 color="#804040"> 23  Chloe Harrison Lauren Ava Oliver Mia Abigail Ellie Holly Jake
 color="#804040"> 24  Hannah Luke Molly Jack Amelia Dylan Alex Lucas Lucy Ryan
 color="#804040"> 25  Daisy Sophie Sophia Archie Callum Lewis Isabel Logan Max Emily
 color="#804040"> 26  '''.split()
 color="#804040"> 27 firstname = [intern(n)  color="#804040">for n  color="#804040">in firstname]
 color="#804040"> 28 
 color="#804040"> 29 familyname = list(set( ''' color="#ff00ff"> 
 color="#804040"> 30  Wilson Russell Smith Phillips Dixon White Gill Brown Statham Oneill
 color="#804040"> 31  Harris Doyle James Ahmad Roberts Mann Moore Thomas Saunders Barker
 color="#804040"> 32  Bright Allen George Hill Jackson Rose Malley Mills Sheppard Mason
 color="#804040"> 33  Wright Holt Pearce Bull Richards Ward Lopez Mathey King Stone
 color="#804040"> 34  Johnson Baker Dupont Green Davis Blanchet Taylor Dean Donnelly Evans
 color="#804040"> 35  Woods Patel Meacher Simon Clarke Young Ellis Palmer Lynch Alexander
 color="#804040"> 36  Sutton Ahmed Adams Kennedy Davies Walker Austin Fernandez Mustafa Rodriguez
 color="#804040"> 37  Bennett Murray Powell Harrison Campbell Cole Khan Mcdonald Edwards Wood
 color="#804040"> 38  Thompson Singh Williams Power Price Watson Hall Jones Cox Chapman
 color="#804040"> 39  Arnold Reid Garcia Morgan Mitchell Ali Lewis Stubbs Scott Islam
 color="#804040"> 40  '''.split() ))
 color="#804040"> 41 familyname = [intern(n)  color="#804040">for n  color="#804040">in familyname]
 color="#804040"> 42 
 color="#804040"> 43 starttime = datetime.datetime(2008, 12, 11, 17, 53, 1, 0)
 color="#804040"> 44 # Choice of delta times between log entries
 color="#804040"> 45 timedeltafreq = [ datetime.timedelta(0, secs)  color="#804040">for secs  color="#804040">in [60]*3 + [120]*10 + [240]*10 ]
 color="#804040"> 46 
 color="#804040"> 47 # Choice of cluster sizes
 color="#804040"> 48 clustersizefreq = [2]*20 + [3]*8 + [4]*3 + [5]*2 + [6]*1 +[7]*1 + [8]*1
 color="#804040"> 49 
 color="#804040"> 50 # Many messages between the same two people?
 color="#804040"> 51 repeatmsgfraction = 2.0  color="#0000ff"># Extra 200%
 color="#804040"> 52 
 color="#804040"> 53 def  color="#008080">namegen():
 color="#804040"> 54  'Generates (first, family) name tuples'
 color="#804040"> 55  names = list(itertools.product(firstname, familyname))
 color="#804040"> 56  random.shuffle(names)
 color="#804040"> 57  for name  color="#804040">in names:
 color="#804040"> 58  yield name
 color="#804040"> 59 
 color="#804040"> 60 def  color="#008080">name2email(name):
 color="#804040"> 61  'format (first, family) name as email address'
 color="#804040"> 62  return ' color="#ff00ff">%s.%s@xx.com' % name
 color="#804040"> 63 
 color="#804040"> 64 def  color="#008080">clustergen(clustercount, clustersizefreq=clustersizefreq,
 color="#804040"> 65  firstname=firstname, familyname=familyname):
 color="#804040"> 66  'Generate clustercount clusters of unique users'
 color="#804040"> 67  nextname = namegen().__next__
 color="#804040"> 68  return sorted( sorted(nextname()  color="#804040">for i  color="#804040">in range(random.choice(clustersizefreq)))
 color="#804040"> 69  for j  color="#804040">in range(clustercount) )
 color="#804040"> 70 
 color="#804040"> 71 def  color="#008080">clusterprint(clusters):
 color="#804040"> 72  'Print already sorted clusters in output format'
 color="#804040"> 73  for cluster  color="#804040">in clusters:
 color="#804040"> 74  print ( ' color="#ff00ff">, '.join(name2email(n)  color="#804040">for n  color="#804040">in cluster) )
 color="#804040"> 75 
 color="#804040"> 76 def  color="#008080">loggen(clustercount, clustersizefreq=clustersizefreq,
 color="#804040"> 77  firstname=firstname, familyname=familyname,
 color="#804040"> 78  starttime=starttime, timedeltafreq=timedeltafreq,
 color="#804040"> 79  repeatmsgfraction=repeatmsgfraction):
 color="#804040"> 80  'Generate a log of clustercount clusters of unique users'
 color="#804040"> 81  clusters = clustergen(clustercount)
 color="#804040"> 82  # clustered emails
 color="#804040"> 83  log = sum([ list(itertools.permutations(cluster, 2))  color="#804040">for cluster  color="#804040">in clusters ],
 color="#804040"> 84  [])
 color="#804040"> 85  # repeats
 color="#804040"> 86  for i  color="#804040">in range(int(repeatmsgfraction*len(log))):
 color="#804040"> 87  log.append(random.choice(log))
 color="#804040"> 88  # dispersed emails
 color="#804040"> 89  random.shuffle(log)
 color="#804040"> 90  # log times for emails
 color="#804040"> 91  logtime = starttime
 color="#804040"> 92  for n, (send, to)  color="#804040">in enumerate(log):
 color="#804040"> 93  log[n] = (logtime, send, to)
 color="#804040"> 94  logtime += random.choice(timedeltafreq)
 color="#804040"> 95  return clusters, log
 color="#804040"> 96 
 color="#804040"> 97 def  color="#008080">logprint(log):
 color="#804040"> 98  'Print log in input format'
 color="#804040"> 99  for logtime, send, to  color="#804040">in log:
 color="#804040">100  print (' color="#ff00ff">%s\t color="#ff00ff">%s\t color="#ff00ff">%s' % (logtime.strftime(" color="#ff00ff">%a %b %d %H:%M:%S PST %Y")
 color="#804040">101  , name2email(send), name2email(to)))
 color="#804040">102

The Program in use

Switching to the command
line shell, I generate a log and its clustering and print the results.
Because of the way the log is generated from the clustering, the
clustering of the log should be as stated.

style="font-family: monospace;">>>> style="font-weight: bold;">clusters, log = loggen(3);
clusterprint(clusters); print(); logprint(log); print(); len(log) style="font-family: monospace;">
Alfie.Khan@xx.com,
Lily.Rose@xx.com, Tyler.Blanchet@xx.com style="font-family: monospace;">
Finley.Ali@xx.com,
Harry.White@xx.com

Finley.Cox@xx.com,
James.Harrison@xx.com

Thu Dec 11 17:53:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 17:55:01
PST 2008
Lily.Rose@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 17:59:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 18:03:01
PST 2008
Finley.Ali@xx.com    Harry.White@xx.com style="font-family: monospace;">
Thu Dec 11 18:07:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 18:09:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 18:13:01
PST 2008
Alfie.Khan@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 18:17:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 18:21:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 18:23:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:27:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:29:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 18:33:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 18:37:01
PST 2008
Alfie.Khan@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 18:39:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 18:43:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 18:45:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 18:47:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:49:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 18:53:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 18:57:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 18:59:01
PST 2008
Harry.White@xx.com    Finley.Ali@xx.com style="font-family: monospace;">
Thu Dec 11 19:03:01
PST 2008
Lily.Rose@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 19:04:01
PST 2008
Tyler.Blanchet@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 19:06:01
PST 2008
James.Harrison@xx.com    Finley.Cox@xx.com style="font-family: monospace;">
Thu Dec 11 19:10:01
PST 2008
Tyler.Blanchet@xx.com    Alfie.Khan@xx.com style="font-family: monospace;">
Thu Dec 11 19:12:01
PST 2008
Lily.Rose@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">
Thu Dec 11 19:16:01
PST 2008
Finley.Cox@xx.com    James.Harrison@xx.com style="font-family: monospace;">
Thu Dec 11 19:20:01
PST 2008
Tyler.Blanchet@xx.com    Lily.Rose@xx.com style="font-family: monospace;">
Thu Dec 11 19:22:01
PST 2008
Alfie.Khan@xx.com    Tyler.Blanchet@xx.com style="font-family: monospace;">

30

To Do

The cluster info for the puzzle should, of course, have the
clusters of size two filtered out

I haave a solution to the original puzzle, but: " style="font-style: italic;" title="Insert schoolboy excuse here">Blue
Martians forcibly wiped it from my brain" :-)

- Paddy.

3 comments:

Paddy3118Sun Aug 02, 06:39:00 pm
Of course lines 29 and 40 should be like lines 15 and 26.

- Paddy.
AnonymousMon Aug 03, 12:51:00 am
This is the maximal clique problem, and it's NP complete. Cliquer (http://users.tkk.fi/pat/cliquer.html) provides a good implementation in the general case.

Of course there are a few things in your favour here:

1) You can start by removing every link that doesn't have a reciprocal link (A->B exists, but B->A doesn't) and then transform the problem into an undirected graph problem.

This is important because it will greatly reduce the artificial connectivity created by mass mailings.

2) I'd expect the data to be largely tree-like. That is, the cycles in the graph will tend to be pretty short. Clusters of nodes (cliques or not) will tend to be disconnected from each other, or connected only by a small number of links.
Paddy3118Mon Aug 03, 06:36:00 am
Aha, emails without replies, I would need to sprinkle my log with a certain percentage of those too. Something like inserting between lines 87 and 88:

for i in range(int(noreplymsgfraction*len(log))):
....log.append( (nextname(),nextname()) )

Assuming I return nextname from function clustergen, and assign a suitable fraction to noreplymsgfraction.

Thanks.

Go deh!

Sunday, August 02, 2009

Facebook User-Clustering Puzzle

The Reverse Problem

The Program in use

To Do

3 comments:

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive