Saturday, January 20, 2007

Python example: from 1997 to now.

I stumbled on an old site: My Programming Language Crisis of ~1997.
There is a data-mining example, complete with sample input data and 'golden output' called invert:
Stdin consists of lines of two tab-separated fields. Call the first field A and the second field B.
The program must gather like values of B together, collecting the A's that go with them.
The output of the program must be lines consisting of an arbitrary number of tab-separated fields; the first field of each line is a unique B; all subsequent fields of that line are A's that were associated with that B in the input. There will be as many output lines as there were unique B's in the input. The output lines must be sorted on the first field (B's) and all subsequent fields of each line (each B's As) must be sorted.

...For example, suppose you wanted to gather together all the unique URLs in a set of web pages in order to validate them efficiently. Your report needs to be able to associate dead URLs with the files they were originally found in. Thus the A's in the input are filenames and the B's are URLs.
The Python program of its day brought back memories and I of course had to re-write it using some of the later additions to the language.

Things I used include:
  • The use of the fileinput module to do looping over lines of input.
  • split as a string method.
  • The default split action of splitting on white space.
    (A little dicey but input fields don't include spaces).
  • The setdefault dict method for providing an empty list when necessary, so the try...except block can go.
  • The new sorted function that returns a sorted value.
  • list unpacking of kv into k,v in the output for loop.
  • The join method to insert tab separators into the ordered list of output fields.
Overall I'd say that Python has become more clear over the years.
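
To try the idioms listed above without an input file, here is a minimal sketch that works on an in-memory list of lines. It is written in Python 3 syntax (an assumption; the listings in this post are Python 2), the name invert_lines and the sample data are made up for illustration, and it splits on tabs explicitly rather than on any white space.

```python
def invert_lines(lines):
    """Group the A fields by their B field and return the output lines."""
    groups = {}
    for line in lines:
        a, b = line.rstrip("\n").split("\t")
        # setdefault supplies an empty list the first time a B is seen,
        # so no try...except is needed around the append
        groups.setdefault(b, []).append(a)
    # sorted() orders the (B, [A, ...]) pairs by B; each list of A's is sorted too
    return ["\t".join([b] + sorted(a_list))
            for b, a_list in sorted(groups.items())]


# Hypothetical sample: filenames (A) and the URLs (B) found in them
sample = [
    "index.html\thttp://example.com/b\n",
    "about.html\thttp://example.com/a\n",
    "about.html\thttp://example.com/b\n",
]
for out_line in invert_lines(sample):
    print(out_line)
```

Each printed line starts with a unique URL followed by the sorted filenames that mentioned it.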

The prog:

# invert benchmark in Python
# see
# This version by Donald 'Paddy' McCarthy

import fileinput

B = {}
for line in fileinput.input():
    fields = line.split()
    B.setdefault(fields[1], []).append(fields[0])

# In-memory data sort
# in-place sorting of values in key-value pairs
kv = sorted(B.items())
for k,v in kv:
    v.sort()
    print "\t".join([k]+v)


## Here follows the original (1997?) program

#! /depot/python/arch/bin/python
# invert benchmark in Python
# see <url:
# Original by Keith Waclena
# Optimized by Tom Carroll

from sys import stdin
from string import split, join

B = {}

while 1:
    line = stdin.readline()
    if not line: break
    fields = split(line[0:-1], "\t")
    try: #assume this key is already present
        B[fields[1]].append(fields[0])
    except: #it's not!? well then, we best put it in...
        B[fields[1]] = [fields[0],]

keys = B.keys()
keys.sort()

for key in keys:
    values = B[key]
    values.sort()
    print key + "\t" + join(values, "\t")

Tuesday, January 02, 2007

Data Mining: in three languages

I answered this
with a reply written in AWK, then wrote versions in Perl and
Python. You should be aware that I have written more awk, perl and
python (in that order); but I think I know more awk, python then perl
(in that order).

The awk example:

# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{
  nodata = 0;             # Current run of consecutive flags<0 in lines of file
  nodata_max=-1;          # Max consecutive flags<0 in lines of file
  nodata_maxline="!";     # ... and line number(s) where it occurs
}
FNR==1 {
  # Accumulate input file names
  if(infiles){
    # NOTE: this appends infiles to itself (FILENAME was probably intended);
    # the File(s) line in the recorded gawk output reflects it
    infiles = infiles "," infiles
  } else {
    infiles = FILENAME
  }
}
{
  tot_line=0;             # sum of line data
  num_line=0;             # number of line data items with flag>0

  # extract field info, skipping initial date field
  for(field=2; field<=NF; field+=2){
    datum=$field;
    flag=$(field+1);
    if(flag<1){
      nodata++
    }else{
      # check run of data-absent fields
      if(nodata_max==nodata && (nodata>0)){
        nodata_maxline=nodata_maxline ", " $1
      }
      if(nodata_max<nodata && (nodata>0)){
        nodata_max=nodata
        nodata_maxline=$1
      }
      # re-initialise run of nodata counter
      nodata=0;
      # gather values for averaging
      tot_line+=datum
      num_line++;
    }
  }

  # totals for the file so far
  tot_file += tot_line
  num_file += num_line

  printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n", \
         $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0

  # debug prints of original data plus some of the computed values
  #printf "%s %15.3g %4i\n", $0, tot_line, num_line
  #printf "%s\n %15.3f %4i %4i %4i %s\n", $0, tot_line, num_line, nodata, nodata_max, nodata_maxline
}

END{
  printf "\n"
  printf "File(s) = %s\n", infiles
  printf "Total = %10.3f\n", tot_file
  printf "Readings = %6i\n", num_file
  printf "Average = %10.3f\n", tot_file / num_file

  printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
}

The same functionality in perl is very similar to the awk program:

# Author Donald 'Paddy' McCarthy Jan 01 2007

$nodata = 0;             # Current run of consecutive flags<0 in lines of file
$nodata_max=-1;          # Max consecutive flags<0 in lines of file
$nodata_maxline="!";     # ... and line number(s) where it occurs
foreach (@ARGV) {
  # Accumulate input file names
  if($infiles ne ""){
    $infiles = "$infiles, $_";
  } else {
    $infiles = $_;
  }
}

while (<>){
  $tot_line=0;             # sum of line data
  $num_line=0;             # number of line data items with flag>0

  # extract field info, skipping initial date field
  @fields = split(/\s+/);
  $nf = @fields;
  $date = $fields[0];
  for($field=1; $field<$nf; $field+=2){
    $datum = $fields[$field] +0.0;
    $flag = $fields[$field+1] +0;
    if($flag<1){
      $nodata++;
    }else{
      # check run of data-absent fields
      if($nodata_max==$nodata and ($nodata>0)){
        $nodata_maxline = "$nodata_maxline, $fields[0]";
      }
      if($nodata_max<$nodata and ($nodata>0)){
        $nodata_max = $nodata;
        $nodata_maxline = $fields[0];
      }
      # re-initialise run of nodata counter
      $nodata = 0;
      # gather values for averaging
      $tot_line += $datum;
      $num_line++;
    }
  }

  # totals for the file so far
  $tot_file += $tot_line;
  $num_file += $num_line;

  printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n",
         $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;
}

printf "\n";
printf "File(s) = %s\n", $infiles;
printf "Total = %10.3f\n", $tot_file;
printf "Readings = %6i\n", $num_file;
printf "Average = %10.3f\n", $tot_file / $num_file;

printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
       $nodata_max, $nodata_maxline;

The python program however splits the fields in the line slightly
differently (although it could use the method used in the perl and
awk programs too):

# Author Donald 'Paddy' McCarthy Jan 01 2007

import fileinput
import sys

nodata = 0;             # Current run of consecutive flags<0 in lines of file
nodata_max=-1;          # Max consecutive flags<0 in lines of file
nodata_maxline=[];      # ... and line number(s) where it occurs

tot_file = 0            # Sum of file data
num_file = 0            # Number of file data items with flag>0

infiles = sys.argv[1:]

for line in fileinput.input():
    tot_line=0;             # sum of line data
    num_line=0;             # number of line data items with flag>0

    # extract field info
    field = line.split()
    date = field[0]
    data = [float(f) for f in field[1::2]]
    flags = [int(f) for f in field[2::2]]

    for datum, flag in zip(data, flags):
        if flag<1:
            nodata += 1
        else:
            # check run of data-absent fields
            if nodata_max==nodata and nodata>0:
                nodata_maxline.append(date)
            if nodata_max<nodata and nodata>0:
                nodata_max=nodata
                nodata_maxline=[date]
            # re-initialise run of nodata counter
            nodata=0;
            # gather values for averaging
            tot_line += datum
            num_line += 1

    # totals for the file so far
    tot_file += tot_line
    num_file += num_line

    print "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f" % (
        date,
        len(data) -num_line,
        num_line, tot_line,
        tot_line/num_line if (num_line>0) else 0)

print ""
print "File(s) = %s" % (", ".join(infiles),)
print "Total = %10.3f" % (tot_file,)
print "Readings = %6i" % (num_file,)
print "Average = %10.3f" % (tot_file / num_file,)

print "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
    nodata_max, ", ".join(nodata_maxline))
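
The run-tracking part of these programs is easier to see when pulled out on its own. The following is only a sketch in Python 3 syntax (an assumption; the listings here are Python 2), with a made-up function name longest_bad_runs and made-up sample records. Like the programs above, it lets a run carry over from one record to the next and only closes a run when the next good reading appears, so a run still open at end of input is not counted.

```python
def longest_bad_runs(records):
    """records: iterable of (date, flags) pairs, in file order.

    Returns (longest_run, dates), where dates holds the date of the record
    on which each run of that length ended.
    """
    run = 0        # current run of consecutive flags < 1
    longest = -1   # longest completed run seen so far
    dates = []     # dates on which a longest run ended
    for date, flags in records:
        for flag in flags:
            if flag < 1:
                run += 1
                continue
            # a good reading closes whatever run was in progress
            if run > 0:
                if run == longest:
                    dates.append(date)
                elif run > longest:
                    longest, dates = run, [date]
            run = 0
    return longest, dates


# Hypothetical flags for three records
records = [
    ("d1", [1, -1, -1, 1]),
    ("d2", [-1, 1, 1, -1]),
    ("d3", [-1, 1]),
]
print(longest_bad_runs(records))
```

In the sample, the second run of two bad readings starts at the end of d2 and is only closed by the good reading in d3, which is why d3 is reported as its end date.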


$ time gawk -f readings.awk readingsx.txt readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt,readingsx.txt
Total = 1361259.300
Readings = 129579
Average = 10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real 0m1.069s
user 0m0.904s
sys 0m0.061s

$ time perl readingsx.txt readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt, readings.txt
Total = 1361259.300
Readings = 129579
Average = 10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real 0m2.450s
user 0m1.639s
sys 0m0.015s

$ time /cygdrive/c/Python25/python readingsx.txt readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt, readings.txt
Total = 1361259.300
Readings = 129579
Average = 10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real 0m1.138s
user 0m0.061s
sys 0m0.030s


The differences in the Python prog. were not made as an
optimisation. The nifty list slicing of [1::2] and the zip just flow
naturally (to me) from the data format.

The data format consists of single-line records of this format:

<string:date> [ <float:data-n> <int:flag-n> ]*24


1991-03-31      10.000  1       10.000  1       ... 20.000      1       35.000  1
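
To make the [1::2] slicing and the zip concrete, here is a small sketch on a shortened, made-up record with three data/flag pairs instead of 24. It uses Python 3 syntax (an assumption; the program above is Python 2), and the variable names are only for illustration.

```python
# Shortened, hypothetical record: a date followed by (data, flag) pairs
record = "1991-03-31\t10.000\t1\t20.000\t-2\t35.000\t1"

fields = record.split()
date = fields[0]
data = [float(f) for f in fields[1::2]]   # every second field, starting at index 1
flags = [int(f) for f in fields[2::2]]    # every second field, starting at index 2

# zip pairs each datum with its flag; keep the readings whose flag is positive
accepted = [datum for datum, flag in zip(data, flags) if flag > 0]

print(date, data, flags, accepted)
```

Slicing from index 1 skips the date, so the two slices line up pairwise and zip can walk them together.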