Mainly Tech projects on Python and Electronic Design Automation.

Sunday, February 22, 2009

Vanity Search on Rosetta Code

A vanity search is usually when you look for your own name on Google to
show how popular you are (in those terms, no personal slight intended).



I wanted to know how many new pages on Rosetta Code I had started, so I
wrote the following script. It starts from a user's first page of
contributions (http://www.rosettacode.org/wiki/Special:Contributions/Paddy3118)
and downloads all further pages, searching for new page creations, which
are specially marked in the HTML and show as a bold N in the table.



The code:



'''
Rosetta Code Vanity search:
  How many new pages has someone created?
'''

# Written for Python 2: uses urllib.urlopen and the print statement.
import urllib, re

user = 'Paddy3118'

site = 'http://www.rosettacode.org'
nextpage = site + '/wiki/Special:Contributions/' + user
nextpage_re = re.compile(
    r'<a href="([^"]+)" title="[^"]+" rel="next">older ')

newpages = []
pagecount = 0
while nextpage:
    page = urllib.urlopen(nextpage)
    pagecount += 1
    nextpage = ''
    for line in page:
        if not nextpage:
            # Search for URL to next page of results for download
            nextpage_match = re.search(nextpage_re, line)
            if nextpage_match:
                # The href is HTML-escaped, so turn &amp; back into &
                nextpage = (site + nextpage_match.groups()[0]).replace('&amp;', '&')
                #print nextpage
        if '<span class="newpage">' in line:
            # extract N page name from title
            newpages.append(line.partition(' title="')[2].partition('"')[0])
    page.close()

nontalk = [p for p in newpages if not p.startswith('Talk:')]

# Each contributions page lists up to 50 edits, hence the approximation
print "User: %s has created %i new pages of which %i were not Talk: pages, from approx %i edits" % (
    user, len(newpages), len(nontalk), pagecount*50 )
print "New pages created, in order, are:\n  ",
print "\n  ".join(nontalk[::-1])
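To see the two extraction steps in isolation, here is a minimal Python 3 sketch that runs against made-up HTML lines rather than the live site. The sample markup and the helper names find_next_url and find_new_page_title are assumptions for illustration only; the real contributions page may be laid out differently.

```python
import re

SITE = 'http://www.rosettacode.org'

# Same pattern as the script above: capture the href of the "older" link.
NEXTPAGE_RE = re.compile(
    r'<a href="([^"]+)" title="[^"]+" rel="next">older ')


def find_next_url(line):
    """Return the absolute URL of the next results page, or '' if absent."""
    match = NEXTPAGE_RE.search(line)
    if not match:
        return ''
    # hrefs in HTML source escape '&' as '&amp;'; undo that before fetching.
    return (SITE + match.group(1)).replace('&amp;', '&')


def find_new_page_title(line):
    """Return the page title from a line marked as a new page, else None."""
    if '<span class="newpage">' not in line:
        return None
    # Take the text between the first ' title="' and the next double quote.
    return line.partition(' title="')[2].partition('"')[0]


# Hypothetical sample lines, invented for this sketch.
sample_nav = ('(<a href="/mw/index.php?title=Special:Contributions'
              '&amp;offset=2009&amp;limit=50" title="Special:Contributions"'
              ' rel="next">older 50</a>)')
sample_row = ('<li>12:00, 1 January 2009 <span class="newpage">N</span>'
              ' <a href="/wiki/Anagrams" title="Anagrams">Anagrams</a></li>')

print(find_next_url(sample_nav))
print(find_new_page_title(sample_row))
```

Keeping the two checks as small functions makes them easy to try on saved copies of a page before pointing the loop at the real site.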






What I have created on RC


The output of the program shows all the pages I created, in order of
creation:


User: Paddy3118 has created 31 new pages of which 20 were not Talk: pages, from approx 300 edits
New pages created, in order, are:
Spiral (http://paddy3118.blogspot.com/2008/08/spiral.html)
Monty Hall simulation (http://paddy3118.blogspot.com/2008/08/monty-hall-problem-simulations.html)
Web Scraping
Sequence of Non-squares
Anagrams
User talk:Lupus
Max Licenses In Use (http://paddy3118.blogspot.com/2008/10/max-licenses-in-use.html)
One dimensional cellular automata
Conway's Game of Life
Village Pump:Home/Foldable output
Data Munging
Data Munging 2
Column Aligner
Probabilistic Choice
Knapsack Problem (http://paddy3118.blogspot.com/2008/12/knapsack-problem.html)
Yuletide Holiday
Common number base conversions
Octal
Integer literals
Command Line Interpreter




I have added links to show when I blogged about a task as well as
starting the RC page. I can see that I have a 'User talk:' page that
should also be filtered out.
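One way to drop those as well is to test against several prefixes at once, since str.startswith accepts a tuple. The Python 3 sketch below is only an assumption about how the filter could be extended, not a change made to the script that produced the output above. The name SKIP_PREFIXES and the short sample list (including the made-up 'Talk:Octal' entry) are illustrative.

```python
# Namespace prefixes to leave out of the report (illustrative choice).
SKIP_PREFIXES = ('Talk:', 'User talk:')

# A few titles of the kind the script collects, newest first.
newpages = ['Octal', 'Talk:Octal', 'User talk:Lupus', 'Anagrams', 'Spiral']

# str.startswith accepts a tuple and returns True if any prefix matches.
tasks = [p for p in newpages if not p.startswith(SKIP_PREFIXES)]

# Reverse so the oldest page comes first, as in the original report.
print('\n'.join(tasks[::-1]))
```

Adding another namespace to skip then only means adding one more string to the tuple.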



I was always writing small examples that I thought might be useful for
a Python training course. I was looking for a public home for them and,
after stumbling across RC, initially thought that it would be a good
fit. I was only partially right, but I have found the discipline of
writing for RC to be enjoyable in itself, so I continue to contribute.



I quite enjoyed the challenge of creating RC tasks beginning with the
letters K and then Y, so that RC could complete its full alphabet of
named tasks. Yuletide Holiday was created around Xmas 2008 and is
really about Y2K errors - but they seem to have stuck with my
name :-)

I need to revisit Data Munging and add extra clarification to the task,
as RC needs a good task description, whereas a lot of data munging tasks
don't have one.



If you are interested in language comparison sites then you might want
to take a look at RC too!



- Paddy.

