Mainly Tech projects on Python and Electronic Design Automation.

Sunday, August 02, 2009

Facebook User-Clustering Puzzle

I came across the following href="">puzzle
at facebook. The task is to generate clusters of users based on who
they send emails to, as extracted from their large log file.

They give a strict format for the log file entries of:

style="font-family: monospace;"><date> '\t'
<send email address> '\t' <to email address>

Some lines from the log might look like:

Thu Dec 11 18:17:01 PST 2008
Thu Dec 11 18:21:01 PST 2008
Thu Dec 11 18:23:01 PST 2008

They then state that a cluster exists if everyone in the group has sent
at least one email to everyone else in the group, so a cluster of users
A B and C, would have had to have sent the following emails at some

from: A to B

from: A to C

from: B to A

from: B to C

from: C to A

from: C to B

You should extract the clusters from the log and report only maximal
clusters, i.e. in the example above, don't report A,C as a cluster if
you are also reporting A,B,C.

Their is also a format for how clusters are to be reported:

  • Only clusters of three or more people are to be reported

  • Reports are to be printed to stdout.

  • A cluster to a line.

  • Emails in the cluster to appear in sorted order; style="font-style: italic;">separated by the
    two characters of a single comma followed by a single space.

  • The lines of clusters to also be sorted on the characters
    appearing in the each cluster line.

An output for some log file might look like:,,,,,,,,

The log files are said to be large.

The above is the original problem, re-phrased.

The Reverse Problem

Early on in thinking about how to solve the problem I thought I would
need  much more than the ten lines or so of sample log lines
that they gave. I then thought that style="font-style: italic;">generating a log
file would be a good  task too and set out to solve that

I need to:

  1. Generate pseudo-random email addresses

  2. Generate clusters of users

  3. Generate a log from the clusters

 And in that order too. When I have emails, I can create
clusters of emails of different sizes, then expand the clusters into a
log file structure, then print the log in the correct format.

The code follows, this description.

  • Line 9 is a debug remnant.

  • Although have only run the code in Python 3, I think it
    should work in Python 2.6 as well.

    Lines 10-13 are because intern is moved to the sys module in Python 3.

  • Lines 15-40: I needed thousands of names so trawled WP for href="">first
    names and href="">family
    names to create their product.

  • Lines 27 and 41: I intern'd the strings because I could -
    It's a possible premature style="text-decoration: line-through;">ejacu
    optimization, but hey.

  • Lines 43-45: I will be building the log without times but
    in order, then adding times to the ordered log entries. to do that i
    need the time to start, and some possible time increments between log
    entries that I will randomly choose between.

  • Line 48; The size of cluster I generate is a random choice
    between the sizes mentioned here.. For example, a cluster size of 2, ( style="font-family: monospace;">[2]*20), will

  • Line 51 controls how many style="font-style: italic;">extra times an
    email might be sent between two people.

  • Line 53: namegen
    just pairs every first name with every family name. If you want more,
    you could use two first names for example, as well as extending the
    name lists. (or hyphenated family names). Anyone for Imogen Ethan
    Islam-Singh :-)

  • Line 60: name2email
    changes the name from namegen into an email address string

  • Lines 64-69: clustergen
    generates clustercount clusters of names. It randomly chooses how many
    names to put in each cluster using clustersizefreq;, then generates
    that many names per cluster. nextname feeds new names are required, and
    the sorts ensure the clusters are generated in the correct order for
    printing to satisfy the original puzzles output criteria.

  • Lines 71-74: clusterprint will correctly format and print
    the clustergen data structure.

  • Lines 76-95: logggen.
    The largest function 

    • Line 81 creates a number of clusters

    • Line 83 takes the cluster of users and turns each cluster
      in turn into the expanded set of correspondances between every member
      of the cluster, in order.

    • Line 86 randomly duplicates some entries of the log using

    • Line 89 adds some authenticity by randomising the order
      of the emails

    • Lines 91-94 add the time information to the log.

    • Line 95 returns the cluster info as well as the log.

  • Lines 97-101 function logprint can correctly format and
    print the log data. Note the handling of the PST timezone in line 100.

color="#804040"> 2 '''
color="#804040"> 3 Random Data generator for:
color="#804040"> 4 href="">
color="#804040"> 5 '''
color="#804040"> 6
color="#804040"> 7
color="#804040"> 8 import itertools, random, datetime
color="#804040"> 9 from pprint color="#a020f0">import pprint color="#a020f0">as pp
color="#804040"> 10 try: color="#0000ff"># Python 3/2 compatibility
color="#804040"> 11 from sys color="#a020f0">import intern
color="#804040"> 12 except:
color="#804040"> 13 pass
color="#804040"> 14
color="#804040"> 15 firstname = ''' color="#ff00ff">
color="#804040"> 16 Lola Imogen Jasmine Leah Daniel Isla Millie Lilly Benjamin Evie
color="#804040"> 17 Finlay Ella Jacob Grace Jayden Freya Summer James Adam Henry
color="#804040"> 18 Poppy Joshua Charlotte Thomas Noah Maisie Tyler Liam George Mason
color="#804040"> 19 Matthew Joseph Nathan Stan Scarlett Jessica Samuel Leo William Olivia
color="#804040"> 20 Ruby Isabella Emma Isabelle Charlie Phoebe Alfie Owen Erin Oscar
color="#804040"> 21 Cameron Katie Harvey Aaron Ethan Lily Finley Riley Jamie Megan
color="#804040"> 22 Caitlin Amy Connor Alexander Ben Eva Bethany Brooke Layla Harry
color="#804040"> 23 Chloe Harrison Lauren Ava Oliver Mia Abigail Ellie Holly Jake
color="#804040"> 24 Hannah Luke Molly Jack Amelia Dylan Alex Lucas Lucy Ryan
color="#804040"> 25 Daisy Sophie Sophia Archie Callum Lewis Isabel Logan Max Emily
color="#804040"> 26 '''.split()
color="#804040"> 27 firstname = [intern(n) color="#804040">for n color="#804040">in firstname]
color="#804040"> 28
color="#804040"> 29 familyname = list(set( ''' color="#ff00ff">
color="#804040"> 30 Wilson Russell Smith Phillips Dixon White Gill Brown Statham Oneill
color="#804040"> 31 Harris Doyle James Ahmad Roberts Mann Moore Thomas Saunders Barker
color="#804040"> 32 Bright Allen George Hill Jackson Rose Malley Mills Sheppard Mason
color="#804040"> 33 Wright Holt Pearce Bull Richards Ward Lopez Mathey King Stone
color="#804040"> 34 Johnson Baker Dupont Green Davis Blanchet Taylor Dean Donnelly Evans
color="#804040"> 35 Woods Patel Meacher Simon Clarke Young Ellis Palmer Lynch Alexander
color="#804040"> 36 Sutton Ahmed Adams Kennedy Davies Walker Austin Fernandez Mustafa Rodriguez
color="#804040"> 37 Bennett Murray Powell Harrison Campbell Cole Khan Mcdonald Edwards Wood
color="#804040"> 38 Thompson Singh Williams Power Price Watson Hall Jones Cox Chapman
color="#804040"> 39 Arnold Reid Garcia Morgan Mitchell Ali Lewis Stubbs Scott Islam
color="#804040"> 40 '''.split() ))
color="#804040"> 41 familyname = [intern(n) color="#804040">for n color="#804040">in familyname]
color="#804040"> 42
color="#804040"> 43 starttime = datetime.datetime(2008, 12, 11, 17, 53, 1, 0)
color="#804040"> 44 # Choice of delta times between log entries
color="#804040"> 45 timedeltafreq = [ datetime.timedelta(0, secs) color="#804040">for secs color="#804040">in [60]*3 + [120]*10 + [240]*10 ]
color="#804040"> 46
color="#804040"> 47 # Choice of cluster sizes
color="#804040"> 48 clustersizefreq = [2]*20 + [3]*8 + [4]*3 + [5]*2 + [6]*1 +[7]*1 + [8]*1
color="#804040"> 49
color="#804040"> 50 # Many messages between the same two people?
color="#804040"> 51 repeatmsgfraction = 2.0 color="#0000ff"># Extra 200%
color="#804040"> 52
color="#804040"> 53 def color="#008080">namegen():
color="#804040"> 54 'Generates (first, family) name tuples'
color="#804040"> 55 names = list(itertools.product(firstname, familyname))
color="#804040"> 56 random.shuffle(names)
color="#804040"> 57 for name color="#804040">in names:
color="#804040"> 58 yield name
color="#804040"> 59
color="#804040"> 60 def color="#008080">name2email(name):
color="#804040"> 61 'format (first, family) name as email address'
color="#804040"> 62 return ' color="#ff00ff">' % name
color="#804040"> 63
color="#804040"> 64 def color="#008080">clustergen(clustercount, clustersizefreq=clustersizefreq,
color="#804040"> 65 firstname=firstname, familyname=familyname):
color="#804040"> 66 'Generate clustercount clusters of unique users'
color="#804040"> 67 nextname = namegen().__next__
color="#804040"> 68 return sorted( sorted(nextname() color="#804040">for i color="#804040">in range(random.choice(clustersizefreq)))
color="#804040"> 69 for j color="#804040">in range(clustercount) )
color="#804040"> 70
color="#804040"> 71 def color="#008080">clusterprint(clusters):
color="#804040"> 72 'Print already sorted clusters in output format'
color="#804040"> 73 for cluster color="#804040">in clusters:
color="#804040"> 74 print ( ' color="#ff00ff">, '.join(name2email(n) color="#804040">for n color="#804040">in cluster) )
color="#804040"> 75
color="#804040"> 76 def color="#008080">loggen(clustercount, clustersizefreq=clustersizefreq,
color="#804040"> 77 firstname=firstname, familyname=familyname,
color="#804040"> 78 starttime=starttime, timedeltafreq=timedeltafreq,
color="#804040"> 79 repeatmsgfraction=repeatmsgfraction):
color="#804040"> 80 'Generate a log of clustercount clusters of unique users'
color="#804040"> 81 clusters = clustergen(clustercount)
color="#804040"> 82 # clustered emails
color="#804040"> 83 log = sum([ list(itertools.permutations(cluster, 2)) color="#804040">for cluster color="#804040">in clusters ],
color="#804040"> 84 [])
color="#804040"> 85 # repeats
color="#804040"> 86 for i color="#804040">in range(int(repeatmsgfraction*len(log))):
color="#804040"> 87 log.append(random.choice(log))
color="#804040"> 88 # dispersed emails
color="#804040"> 89 random.shuffle(log)
color="#804040"> 90 # log times for emails
color="#804040"> 91 logtime = starttime
color="#804040"> 92 for n, (send, to) color="#804040">in enumerate(log):
color="#804040"> 93 log[n] = (logtime, send, to)
color="#804040"> 94 logtime += random.choice(timedeltafreq)
color="#804040"> 95 return clusters, log
color="#804040"> 96
color="#804040"> 97 def color="#008080">logprint(log):
color="#804040"> 98 'Print log in input format'
color="#804040"> 99 for logtime, send, to color="#804040">in log:
color="#804040">100 print (' color="#ff00ff">%s\t color="#ff00ff">%s\t color="#ff00ff">%s' % (logtime.strftime(" color="#ff00ff">%a %b %d %H:%M:%S PST %Y")
color="#804040">101 , name2email(send), name2email(to)))

The Program in use

Switching to the command
line shell, I generate a log and its clustering and print the results.
Because of the way the log is generated from the clustering, the
clustering of the log should be as stated.

style="font-family: monospace;">>>> style="font-weight: bold;">clusters, log = loggen(3);
clusterprint(clusters); print(); logprint(log); print(); len(log) style="font-family: monospace;">,,
style="font-family: monospace;">,,

Thu Dec 11 17:53:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 17:55:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 17:59:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:03:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:07:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:09:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:13:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:17:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:21:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:23:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:27:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:29:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:33:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:37:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:39:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:43:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:45:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:47:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:49:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:53:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:57:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 18:59:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:03:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:04:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:06:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:10:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:12:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:16:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:20:01
PST 2008
style="font-family: monospace;">
Thu Dec 11 19:22:01
PST 2008
style="font-family: monospace;">


To Do

  • The cluster info for the puzzle should, of course, have the
    clusters of size two filtered out

  • I haave a solution to the original puzzle, but: " style="font-style: italic;" title="Insert schoolboy excuse here">Blue
    Martians forcibly wiped it from my brain" :-)

- Paddy.


  1. Of course lines 29 and 40 should be like lines 15 and 26.

    - Paddy.

  2. This is the maximal clique problem, and it's NP complete. Cliquer ( provides a good implementation in the general case.

    Of course there are a few things in your favour here:

    1) You can start by removing every link that doesn't have a reciprocal link (A->B exists, but B->A doesn't) and then transform the problem into an undirected graph problem.

    This is important because it will greatly reduce the artificial connectivity created by mass mailings.

    2) I'd expect the data to be largely tree-like. That is, the cycles in the graph will tend to be pretty short. Clusters of nodes (cliques or not) will tend to be disconnected from each other, or connected only by a small number of links.

  3. Aha, emails without replies, I would need to sprinkle my log with a certain percentage of those too. Something like inserting between lines 87 and 88:

    for i in range(int(noreplymsgfraction*len(log))):
    ....log.append( (nextname(),nextname()) )

    Assuming I return nextname from function clustergen, and assign a suitable fraction to noreplymsgfraction.




Subscribe Now: google

Add to Google Reader or Homepage

Go deh too!

Blog Archive