at facebook. The task is to generate clusters of users based on who
they send emails to, as extracted from their large log file.
They give a strict format for the log file entries of:
style="font-family: monospace;"><date> '\t'
<send email address> '\t' <to email address>
<send email address> '\t' <to email address>
Some lines from the log might look like:
Thu Dec 11 18:17:01 PST 2008 Finley.Cox@xx.com James.Harrison@xx.com
Thu Dec 11 18:21:01 PST 2008 Alfie.Khan@xx.com Tyler.Blanchet@xx.com
Thu Dec 11 18:23:01 PST 2008 Harry.White@xx.com Finley.Ali@xx.com
They then state that a cluster exists if everyone in the group has sent
at least one email to everyone else in the group, so a cluster of users
A B and C, would have had to have sent the following emails at some
time:
from: A to B
from: A to C
from: B to A
from: B to C
from: C to A
from: C to B
from: A to C
from: B to A
from: B to C
from: C to A
from: C to B
You should extract the clusters from the log and report only maximal
clusters, i.e. in the example above, don't report A,C as a cluster if
you are also reporting A,B,C.
Their is also a format for how clusters are to be reported:
- Only clusters of three or more people are to be reported
- Reports are to be printed to stdout.
- A cluster to a line.
- Emails in the cluster to appear in sorted order; style="font-style: italic;">separated by the
two characters of a single comma followed by a single space. - The lines of clusters to also be sorted on the characters
appearing in the each cluster line.
An output for some log file might look like:
Aaron.Lewis@xx.com, Callum.Bennett@xx.com, Jayden.Mills@xx.com, Lucas.Lewis@xx.com, Matthew.Kennedy@xx.com
Alfie.Clarke@xx.com, Amy.Russell@xx.com, Connor.King@xx.com, Ethan.Thomas@xx.com, Joshua.Mcdonald@xx.com
The log files are said to be large.
The above is the original problem, re-phrased.
The Reverse Problem
Early on in thinking about how to solve the problem I thought I would
need much more than the ten lines or so of sample log lines
that they gave. I then thought that style="font-style: italic;">generating a log
file would be a good task too and set out to solve that
instead!
I need to:
- Generate pseudo-random email addresses
- Generate clusters of users
- Generate a log from the clusters
And in that order too. When I have emails, I can create
clusters of emails of different sizes, then expand the clusters into a
log file structure, then print the log in the correct format.
The code follows, this description.
- Line 9 is a debug remnant.
- Although have only run the code in Python 3, I think it
should work in Python 2.6 as well.
Lines 10-13 are because intern is moved to the sys module in Python 3. - Lines 15-40: I needed thousands of names so trawled WP for href="http://en.wikipedia.org/wiki/List_of_most_popular_given_names">first
names and href="http://en.wikipedia.org/wiki/List_of_most_common_surnames">family
names to create their product. - Lines 27 and 41: I intern'd the strings because I could -
It's a possible premature style="text-decoration: line-through;">ejacu
optimization, but hey. - Lines 43-45: I will be building the log without times but
in order, then adding times to the ordered log entries. to do that i
need the time to start, and some possible time increments between log
entries that I will randomly choose between. - Line 48; The size of cluster I generate is a random choice
between the sizes mentioned here.. For example, a cluster size of 2, ( style="font-family: monospace;">[2]*20), will
predominate. - Line 51 controls how many style="font-style: italic;">extra times an
email might be sent between two people. - Line 53: namegen
just pairs every first name with every family name. If you want more,
you could use two first names for example, as well as extending the
name lists. (or hyphenated family names). Anyone for Imogen Ethan
Islam-Singh :-) - Line 60: name2email
changes the name from namegen into an email address string - Lines 64-69: clustergen
generates clustercount clusters of names. It randomly chooses how many
names to put in each cluster using clustersizefreq;, then generates
that many names per cluster. nextname feeds new names are required, and
the sorts ensure the clusters are generated in the correct order for
printing to satisfy the original puzzles output criteria. - Lines 71-74: clusterprint will correctly format and print
the clustergen data structure. - Lines 76-95: logggen.
The largest function - Line 81 creates a number of clusters
- Line 83 takes the cluster of users and turns each cluster
in turn into the expanded set of correspondances between every member
of the cluster, in order. - Line 86 randomly duplicates some entries of the log using
repeatmsgfraction. - Line 89 adds some authenticity by randomising the order
of the emails - Lines 91-94 add the time information to the log.
- Line 95 returns the cluster info as well as the log.
- Lines 97-101 function logprint can correctly format and
print the log data. Note the handling of the PST timezone in line 100.
1
color="#804040"> 2 '''
color="#804040"> 3 Random Data generator for:
color="#804040"> 4 href="http://www.facebook.com/careers/puzzles.php?puzzle_id=8">http://www.facebook.com/careers/puzzles.php?puzzle_id=8
color="#804040"> 5 '''
color="#804040"> 6
color="#804040"> 7
color="#804040"> 8 import itertools, random, datetime
color="#804040"> 9 from pprint color="#a020f0">import pprint color="#a020f0">as pp
color="#804040"> 10 try: color="#0000ff"># Python 3/2 compatibility
color="#804040"> 11 from sys color="#a020f0">import intern
color="#804040"> 12 except:
color="#804040"> 13 pass
color="#804040"> 14
color="#804040"> 15 firstname = ''' color="#ff00ff">
color="#804040"> 16 Lola Imogen Jasmine Leah Daniel Isla Millie Lilly Benjamin Evie
color="#804040"> 17 Finlay Ella Jacob Grace Jayden Freya Summer James Adam Henry
color="#804040"> 18 Poppy Joshua Charlotte Thomas Noah Maisie Tyler Liam George Mason
color="#804040"> 19 Matthew Joseph Nathan Stan Scarlett Jessica Samuel Leo William Olivia
color="#804040"> 20 Ruby Isabella Emma Isabelle Charlie Phoebe Alfie Owen Erin Oscar
color="#804040"> 21 Cameron Katie Harvey Aaron Ethan Lily Finley Riley Jamie Megan
color="#804040"> 22 Caitlin Amy Connor Alexander Ben Eva Bethany Brooke Layla Harry
color="#804040"> 23 Chloe Harrison Lauren Ava Oliver Mia Abigail Ellie Holly Jake
color="#804040"> 24 Hannah Luke Molly Jack Amelia Dylan Alex Lucas Lucy Ryan
color="#804040"> 25 Daisy Sophie Sophia Archie Callum Lewis Isabel Logan Max Emily
color="#804040"> 26 '''.split()
color="#804040"> 27 firstname = [intern(n) color="#804040">for n color="#804040">in firstname]
color="#804040"> 28
color="#804040"> 29 familyname = list(set( ''' color="#ff00ff">
color="#804040"> 30 Wilson Russell Smith Phillips Dixon White Gill Brown Statham Oneill
color="#804040"> 31 Harris Doyle James Ahmad Roberts Mann Moore Thomas Saunders Barker
color="#804040"> 32 Bright Allen George Hill Jackson Rose Malley Mills Sheppard Mason
color="#804040"> 33 Wright Holt Pearce Bull Richards Ward Lopez Mathey King Stone
color="#804040"> 34 Johnson Baker Dupont Green Davis Blanchet Taylor Dean Donnelly Evans
color="#804040"> 35 Woods Patel Meacher Simon Clarke Young Ellis Palmer Lynch Alexander
color="#804040"> 36 Sutton Ahmed Adams Kennedy Davies Walker Austin Fernandez Mustafa Rodriguez
color="#804040"> 37 Bennett Murray Powell Harrison Campbell Cole Khan Mcdonald Edwards Wood
color="#804040"> 38 Thompson Singh Williams Power Price Watson Hall Jones Cox Chapman
color="#804040"> 39 Arnold Reid Garcia Morgan Mitchell Ali Lewis Stubbs Scott Islam
color="#804040"> 40 '''.split() ))
color="#804040"> 41 familyname = [intern(n) color="#804040">for n color="#804040">in familyname]
color="#804040"> 42
color="#804040"> 43 starttime = datetime.datetime(2008, 12, 11, 17, 53, 1, 0)
color="#804040"> 44 # Choice of delta times between log entries
color="#804040"> 45 timedeltafreq = [ datetime.timedelta(0, secs) color="#804040">for secs color="#804040">in [60]*3 + [120]*10 + [240]*10 ]
color="#804040"> 46
color="#804040"> 47 # Choice of cluster sizes
color="#804040"> 48 clustersizefreq = [2]*20 + [3]*8 + [4]*3 + [5]*2 + [6]*1 +[7]*1 + [8]*1
color="#804040"> 49
color="#804040"> 50 # Many messages between the same two people?
color="#804040"> 51 repeatmsgfraction = 2.0 color="#0000ff"># Extra 200%
color="#804040"> 52
color="#804040"> 53 def color="#008080">namegen():
color="#804040"> 54 'Generates (first, family) name tuples'
color="#804040"> 55 names = list(itertools.product(firstname, familyname))
color="#804040"> 56 random.shuffle(names)
color="#804040"> 57 for name color="#804040">in names:
color="#804040"> 58 yield name
color="#804040"> 59
color="#804040"> 60 def color="#008080">name2email(name):
color="#804040"> 61 'format (first, family) name as email address'
color="#804040"> 62 return ' color="#ff00ff">%s.%s@xx.com' % name
color="#804040"> 63
color="#804040"> 64 def color="#008080">clustergen(clustercount, clustersizefreq=clustersizefreq,
color="#804040"> 65 firstname=firstname, familyname=familyname):
color="#804040"> 66 'Generate clustercount clusters of unique users'
color="#804040"> 67 nextname = namegen().__next__
color="#804040"> 68 return sorted( sorted(nextname() color="#804040">for i color="#804040">in range(random.choice(clustersizefreq)))
color="#804040"> 69 for j color="#804040">in range(clustercount) )
color="#804040"> 70
color="#804040"> 71 def color="#008080">clusterprint(clusters):
color="#804040"> 72 'Print already sorted clusters in output format'
color="#804040"> 73 for cluster color="#804040">in clusters:
color="#804040"> 74 print ( ' color="#ff00ff">, '.join(name2email(n) color="#804040">for n color="#804040">in cluster) )
color="#804040"> 75
color="#804040"> 76 def color="#008080">loggen(clustercount, clustersizefreq=clustersizefreq,
color="#804040"> 77 firstname=firstname, familyname=familyname,
color="#804040"> 78 starttime=starttime, timedeltafreq=timedeltafreq,
color="#804040"> 79 repeatmsgfraction=repeatmsgfraction):
color="#804040"> 80 'Generate a log of clustercount clusters of unique users'
color="#804040"> 81 clusters = clustergen(clustercount)
color="#804040"> 82 # clustered emails
color="#804040"> 83 log = sum([ list(itertools.permutations(cluster, 2)) color="#804040">for cluster color="#804040">in clusters ],
color="#804040"> 84 [])
color="#804040"> 85 # repeats
color="#804040"> 86 for i color="#804040">in range(int(repeatmsgfraction*len(log))):
color="#804040"> 87 log.append(random.choice(log))
color="#804040"> 88 # dispersed emails
color="#804040"> 89 random.shuffle(log)
color="#804040"> 90 # log times for emails
color="#804040"> 91 logtime = starttime
color="#804040"> 92 for n, (send, to) color="#804040">in enumerate(log):
color="#804040"> 93 log[n] = (logtime, send, to)
color="#804040"> 94 logtime += random.choice(timedeltafreq)
color="#804040"> 95 return clusters, log
color="#804040"> 96
color="#804040"> 97 def color="#008080">logprint(log):
color="#804040"> 98 'Print log in input format'
color="#804040"> 99 for logtime, send, to color="#804040">in log:
color="#804040">100 print (' color="#ff00ff">%s\t color="#ff00ff">%s\t color="#ff00ff">%s' % (logtime.strftime(" color="#ff00ff">%a %b %d %H:%M:%S PST %Y")
color="#804040">101 , name2email(send), name2email(to)))
color="#804040">102
The Program in use
Switching to the command
line shell, I generate a log and its clustering and print the results.
Because of the way the log is generated from the clustering, the
clustering of the log should be as stated.
line shell, I generate a log and its clustering and print the results.
Because of the way the log is generated from the clustering, the
clustering of the log should be as stated.
style="font-family: monospace;">>>> style="font-weight: bold;">clusters, log = loggen(3);
clusterprint(clusters); print(); logprint(log); print(); len(log)
style="font-family: monospace;">
Alfie.Khan@xx.com,
Lily.Rose@xx.com, Tyler.Blanchet@xx.com
style="font-family: monospace;">
Finley.Ali@xx.com,
Harry.White@xx.com
Finley.Cox@xx.com,
James.Harrison@xx.com
Thu Dec 11 17:53:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 17:55:01
PST 2008
Lily.Rose@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 17:59:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 18:03:01
PST 2008
Finley.Ali@xx.com Harry.White@xx.com
style="font-family: monospace;">
Thu Dec 11 18:07:01
PST 2008
Tyler.Blanchet@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 18:09:01
PST 2008
Tyler.Blanchet@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 18:13:01
PST 2008
Alfie.Khan@xx.com Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 18:17:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 18:21:01
PST 2008
Alfie.Khan@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 18:23:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:27:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:29:01
PST 2008
James.Harrison@xx.com Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 18:33:01
PST 2008
Alfie.Khan@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 18:37:01
PST 2008
Alfie.Khan@xx.com Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 18:39:01
PST 2008
James.Harrison@xx.com Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 18:43:01
PST 2008
Tyler.Blanchet@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 18:45:01
PST 2008
Alfie.Khan@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 18:47:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:49:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:53:01
PST 2008
James.Harrison@xx.com Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 18:57:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 18:59:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 19:03:01
PST 2008
Lily.Rose@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 19:04:01
PST 2008
Tyler.Blanchet@xx.com Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 19:06:01
PST 2008
James.Harrison@xx.com Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 19:10:01
PST 2008
Tyler.Blanchet@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 19:12:01
PST 2008
Lily.Rose@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 19:16:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 19:20:01
PST 2008
Tyler.Blanchet@xx.com Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 19:22:01
PST 2008
Alfie.Khan@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
30
clusterprint(clusters); print(); logprint(log); print(); len(log)
style="font-family: monospace;">
Alfie.Khan@xx.com,
Lily.Rose@xx.com, Tyler.Blanchet@xx.com
style="font-family: monospace;">
Finley.Ali@xx.com,
Harry.White@xx.com
Finley.Cox@xx.com,
James.Harrison@xx.com
Thu Dec 11 17:53:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 17:55:01
PST 2008
Lily.Rose@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 17:59:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 18:03:01
PST 2008
Finley.Ali@xx.com Harry.White@xx.com
style="font-family: monospace;">
Thu Dec 11 18:07:01
PST 2008
Tyler.Blanchet@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 18:09:01
PST 2008
Tyler.Blanchet@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 18:13:01
PST 2008
Alfie.Khan@xx.com Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 18:17:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 18:21:01
PST 2008
Alfie.Khan@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 18:23:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:27:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:29:01
PST 2008
James.Harrison@xx.com Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 18:33:01
PST 2008
Alfie.Khan@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 18:37:01
PST 2008
Alfie.Khan@xx.com Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 18:39:01
PST 2008
James.Harrison@xx.com Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 18:43:01
PST 2008
Tyler.Blanchet@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 18:45:01
PST 2008
Alfie.Khan@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 18:47:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:49:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 18:53:01
PST 2008
James.Harrison@xx.com Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 18:57:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 18:59:01
PST 2008
Harry.White@xx.com Finley.Ali@xx.com
style="font-family: monospace;">
Thu Dec 11 19:03:01
PST 2008
Lily.Rose@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 19:04:01
PST 2008
Tyler.Blanchet@xx.com Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 19:06:01
PST 2008
James.Harrison@xx.com Finley.Cox@xx.com
style="font-family: monospace;">
Thu Dec 11 19:10:01
PST 2008
Tyler.Blanchet@xx.com Alfie.Khan@xx.com
style="font-family: monospace;">
Thu Dec 11 19:12:01
PST 2008
Lily.Rose@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
Thu Dec 11 19:16:01
PST 2008
Finley.Cox@xx.com James.Harrison@xx.com
style="font-family: monospace;">
Thu Dec 11 19:20:01
PST 2008
Tyler.Blanchet@xx.com Lily.Rose@xx.com
style="font-family: monospace;">
Thu Dec 11 19:22:01
PST 2008
Alfie.Khan@xx.com Tyler.Blanchet@xx.com
style="font-family: monospace;">
30
To Do
- The cluster info for the puzzle should, of course, have the
clusters of size two filtered out - I haave a solution to the original puzzle, but: " style="font-style: italic;" title="Insert schoolboy excuse here">Blue
Martians forcibly wiped it from my brain" :-)
- Paddy.
Of course lines 29 and 40 should be like lines 15 and 26.
ReplyDelete- Paddy.
This is the maximal clique problem, and it's NP complete. Cliquer (http://users.tkk.fi/pat/cliquer.html) provides a good implementation in the general case.
ReplyDeleteOf course there are a few things in your favour here:
1) You can start by removing every link that doesn't have a reciprocal link (A->B exists, but B->A doesn't) and then transform the problem into an undirected graph problem.
This is important because it will greatly reduce the artificial connectivity created by mass mailings.
2) I'd expect the data to be largely tree-like. That is, the cycles in the graph will tend to be pretty short. Clusters of nodes (cliques or not) will tend to be disconnected from each other, or connected only by a small number of links.
Aha, emails without replies, I would need to sprinkle my log with a certain percentage of those too. Something like inserting between lines 87 and 88:
ReplyDeletefor i in range(int(noreplymsgfraction*len(log))):
....log.append( (nextname(),nextname()) )
Assuming I return nextname from function clustergen, and assign a suitable fraction to noreplymsgfraction.
Thanks.