Mainly Tech projects on Python and Electronic Design Automation.

Saturday, January 20, 2007

Python example: from 1997 to now.

I stumbled on an old site: My Programming Language Crisis of ~1997.
There is a data-mining example, complete with sample input data and 'golden output' called invert:
Stdin consists of lines of two tab-separated fields. Call the first field A and the second field B.
The program must gather like values of B together, collecting the A's that go with them.
The output of the program must be lines consisting of an arbitrary number of tab-separated fields; the first field of each line is a unique B; all subsequent fields of that line are A's that were associated with that B in the input. There will be as many output lines as there were unique B's in the input. The output lines must be sorted on the first field (B's) and all subsequent fields of each line (each B's As) must be sorted.

...For example, suppose you wanted to gather together all the unique URLs in a set of web pages in order to validate them efficiently. Your report needs to be able to associate dead URLs with the files they were originally found in. Thus the A's in the input are filenames and the B's are URLs.
The Python program of its day brought back memories and I of course had to re-write it using some of the later additions to the language.

Things I used includes:
  • The use of the fileinput module to do looping over lines of input.
  • split as a string method.
  • The default split action of splitting on white space.
    (A little dicey but input fields don't include spaces).
  • The setdefault dict method for providing an empty list when necessary, so the try...except block can go.
  • The new sorted function that returns a sorted value.
  • list unpacking of kv into k,v in the output for loop.
  • The join method to insert tab separators into the ordered list of output fields.
Overall I'd say that Python has become more clear over the years.

The prog:

# invert benchmark in Python
# see http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/
# This version by Donald 'Paddy' McCarthy

import fileinput

B = {}
for line in fileinput.input():
fields = line.split()
B.setdefault(fields[1], []).append(fields[0])

# In-memory data sort
# in-place sorting of values in key-value pairs
kv = sorted(B.items())
for k,v in kv:
v.sort()
print "\t".join([k]+v)


'''

## Here follows the original (1997?), program

#! /depot/python/arch/bin/python
# invert benchmark in Python
# see <url:http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/
# Original by Keith Waclena
# Optimized by Tom Carroll

from sys import stdin
from string import split, join

B = {}

while 1:
line = stdin.readline()
if not line: break
fields = split(line[0:-1], "\t")
try: #assume this key is already present
B[fields[1]].append(fields[0])
except: #it's not!? well then, we best put it in...
B[fields[1]] = [fields[0],]

keys = B.keys()
keys.sort()

values = B[key]
values.sort()
print key + "\t" + join(values, "\t")
'''

6 comments:

  1. I think tsort(1) does almost the same, but sorts by the first column and the fields are colon separated.

    This is probably the shell pipeline (untested):
    <file awk 'print $2: $1' | tsort

    ReplyDelete
  2. Unfortunately Brian tsort does not work. I used the following awk command that at least gave no awk errors:
    awk '{print $2 $1}' INPUT.txt |tsort >out3.txt

    tsort gave many warnings aboutthe input containing loops and did not give the correct output.

    I don't use tsort myself, and could not work out why it might have been thought to work from looking at the info page but that could be just me.

    The point of the post however wasn't to find other ways of creating the right output - I was interested in how the language evolved over around a decade.
    maybe someone could post an 'idiomatic' solution in Perl 5.8?

    Well I'll be B&*%!r'd!!
    I just now took a look at the TCL solution from all those years ago and it looks just like how I just updated the Python solution.
    This Pythonista has been TiCkLed ;-)
    - Paddy.

    ReplyDelete
  3. The last stanza can be shorter and clearer:

    for k,v in sorted(B.items()):
    print "\t".join([k]+sorted(v))

    ReplyDelete
  4. With a current python I would have written it like this:

    import fileinput
    from collections import defaultdict
    B=defaultdict(list)
    for fields in (line.split() for line in fileinput.input()):
    ...B[fields[1]].append(fields[0])
    for k,v in sorted(B.items()):
    ...print "\t".join([k]+sorted(v))

    But I am not willing to compact much more, and some people would consider this already to be too much: Sparse is better than dense and all that...

    ReplyDelete
  5. I agree about the use of defaultdict instead of setdefault - the idea was to show language changes over time and that would fit right in.

    - Paddy.

    ReplyDelete
  6. It's interesting that nobody used unpacking where it really would help, in "fields".

    for line in fileinput.input():
    key, value = line.split()
    B[key].append(value)

    ReplyDelete

Followers

Subscribe Now: google

Add to Google Reader or Homepage

Go deh too!

whos.amung.us

Blog Archive