Tuesday, January 02, 2007

Data Mining: in three languages

I answered this post with a reply written in AWK, then wrote versions
in Perl and Python. You should be aware that I have written more awk,
perl and python (in that order), but I think I know awk best, then
python, then perl.


The awk example:


# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{
  nodata = 0;             # Current run of consecutive flags<1 in lines of file
  nodata_max=-1;          # Max consecutive flags<1 in lines of file
  nodata_maxline="!";     # ... and date(s) of the line(s) where it ends
}
FNR==1 {
  # Accumulate input file names
  if(infiles){
    infiles = infiles "," FILENAME
  } else {
    infiles = FILENAME
  }
}
{
  tot_line=0;             # sum of line data
  num_line=0;             # number of line data items with flag>0

  # extract field info, skipping initial date field
  for(field=2; field<=NF; field+=2){
    datum=$field;
    flag=$(field+1);
    if(flag<1){
      nodata++
    }else{
      # check run of data-absent fields
      if(nodata_max==nodata && (nodata>0)){
        nodata_maxline=nodata_maxline ", " $1
      }
      if(nodata_max<nodata && (nodata>0)){
        nodata_max=nodata
        nodata_maxline=$1
      }
      # re-initialise run of nodata counter
      nodata=0;
      # gather values for averaging
      tot_line+=datum
      num_line++;
    }
  }

  # totals for the file so far
  tot_file += tot_line
  num_file += num_line

  printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n", \
         $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0

  # debug prints of original data plus some of the computed values
  #printf "%s %15.3g %4i\n", $0, tot_line, num_line
  #printf "%s\n %15.3f %4i %4i %4i %s\n", $0, tot_line, num_line, nodata, nodata_max, nodata_maxline
}

END{
  printf "\n"
  printf "File(s) = %s\n", infiles
  printf "Total = %10.3f\n", tot_file
  printf "Readings = %6i\n", num_file
  printf "Average = %10.3f\n", tot_file / num_file

  printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
}


The same functionality in perl is very similar to the awk program:


# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN {
  $nodata = 0;            # Current run of consecutive flags<1 in lines of file
  $nodata_max=-1;         # Max consecutive flags<1 in lines of file
  $nodata_maxline="!";    # ... and date(s) of the line(s) where it ends
}
foreach (@ARGV) {
  # Accumulate input file names
  if($infiles ne ""){
    $infiles = "$infiles, $_";
  } else {
    $infiles = $_;
  }
}

while (<>){
  $tot_line=0;            # sum of line data
  $num_line=0;            # number of line data items with flag>0

  # extract field info, skipping initial date field
  chomp;
  @fields = split(/\s+/);
  $nf = @fields;
  $date = $fields[0];
  for($field=1; $field<$nf; $field+=2){
    $datum = $fields[$field] +0.0;
    $flag = $fields[$field+1] +0;
    if($flag<1){
      $nodata++;
    }else{
      # check run of data-absent fields
      if($nodata_max==$nodata and ($nodata>0)){
        $nodata_maxline = "$nodata_maxline, $fields[0]";
      }
      if($nodata_max<$nodata and ($nodata>0)){
        $nodata_max = $nodata;
        $nodata_maxline=$fields[0];
      }
      # re-initialise run of nodata counter
      $nodata = 0;
      # gather values for averaging
      $tot_line += $datum;
      $num_line++;
    }
  }

  # totals for the file so far
  $tot_file += $tot_line;
  $num_file += $num_line;

  printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n",
         $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;

}

printf "\n";
printf "File(s) = %s\n", $infiles;
printf "Total = %10.3f\n", $tot_file;
printf "Readings = %6i\n", $num_file;
printf "Average = %10.3f\n", $tot_file / $num_file;

printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
       $nodata_max, $nodata_maxline;


The Python program, however, splits the fields in the line slightly
differently (although it could use the method used in the perl and
awk programs too):


# Author Donald 'Paddy' McCarthy Jan 01 2007

import fileinput
import sys

nodata = 0              # Current run of consecutive flags<1 in lines of file
nodata_max = -1         # Max consecutive flags<1 in lines of file
nodata_maxline = []     # ... and date(s) of the line(s) where it ends

tot_file = 0            # Sum of file data
num_file = 0            # Number of file data items with flag>0

infiles = sys.argv[1:]

for line in fileinput.input():
    tot_line = 0        # sum of line data
    num_line = 0        # number of line data items with flag>0

    # extract field info
    field = line.split()
    date = field[0]
    data = [float(f) for f in field[1::2]]
    flags = [int(f) for f in field[2::2]]

    for datum, flag in zip(data, flags):
        if flag < 1:
            nodata += 1
        else:
            # check run of data-absent fields
            if nodata_max == nodata and nodata > 0:
                nodata_maxline.append(date)
            if nodata_max < nodata and nodata > 0:
                nodata_max = nodata
                nodata_maxline = [date]
            # re-initialise run of nodata counter
            nodata = 0
            # gather values for averaging
            tot_line += datum
            num_line += 1

    # totals for the file so far
    tot_file += tot_line
    num_file += num_line

    print("Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f" % (
        date,
        len(data) - num_line,
        num_line, tot_line,
        tot_line/num_line if (num_line > 0) else 0))

print("")
print("File(s) = %s" % (", ".join(infiles),))
print("Total = %10.3f" % (tot_file,))
print("Readings = %6i" % (num_file,))
print("Average = %10.3f" % (tot_file / num_file,))

print("\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
    nodata_max, ", ".join(nodata_maxline)))
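
All three programs share the same run-tracking idea: count consecutive readings with flag<1, and only compare the count against the maximum when a good reading ends the run. As a rough sketch of that logic on its own (the function name longest_bad_runs and the sample records are made up for illustration and are not part of the programs above), it might look like this in Python 3 syntax:

```python
def longest_bad_runs(records):
    """Return (max_run, dates) for runs of consecutive readings with flag < 1.

    `records` is an iterable of (date, flags) pairs. A run may span several
    records and is only counted once a good reading (flag >= 1) ends it,
    so a run that is still open at the end of the input is ignored.
    """
    run = 0          # length of the current run of bad readings
    max_run = -1     # longest completed run seen so far
    dates = []       # date(s) of the record(s) in which a longest run ended
    for date, flags in records:
        for flag in flags:
            if flag < 1:
                run += 1
                continue
            if run > 0:
                if run == max_run:
                    dates.append(date)
                elif run > max_run:
                    max_run, dates = run, [date]
            run = 0
    return max_run, dates


# Made-up sample: three records of four flags each.
sample = [
    ("2007-01-01", [1, -2, -2, 1]),   # run of 2 ends on this line
    ("2007-01-02", [1, 1, -2, -2]),   # run of 2 still open at end of line...
    ("2007-01-03", [-2, 1, 1, 1]),    # ...grows to 3 and ends here
]
print(longest_bad_runs(sample))
```

Note that, as in the programs above, a run of bad readings that reaches the very end of the input is never compared against the maximum, because no good reading follows it.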







Timings:


$ time gawk -f readings.awk readingsx.txt readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt,readingsx.txt
Total = 1361259.300
Readings = 129579
Average = 10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real 0m1.069s
user 0m0.904s
sys 0m0.061s

$ time perl readings.pl readingsx.txt readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt, readings.txt
Total = 1361259.300
Readings = 129579
Average = 10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real 0m2.450s
user 0m1.639s
sys 0m0.015s

$ time /cygdrive/c/Python25/python readings.py readingsx.txt readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt, readings.txt
Total = 1361259.300
Readings = 129579
Average = 10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real 0m1.138s
user 0m0.061s
sys 0m0.030s

$


The differences in the Python program were not made as an
optimisation. The nifty list slicing with [1::2] and the zip just flow
naturally (to me) from the data format.
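
To make the comparison concrete, here is a small sketch (Python 3 syntax) that extracts the value/flag pairs both ways: by slicing and zip as in the Python program, and by stepping an index two fields at a time as in the awk and perl programs. The record is a made-up one with three pairs rather than 24, and the names by_slicing and by_stepping are only for illustration.

```python
# Hypothetical record with three (value, flag) pairs instead of 24.
line = "1991-03-31 10.000 1 20.000 -2 35.000 1"
field = line.split()

# Slicing, as in the Python program: values sit at positions 1, 3, 5, ...
# and flags at positions 2, 4, 6, ...
data = [float(f) for f in field[1::2]]
flags = [int(f) for f in field[2::2]]
by_slicing = list(zip(data, flags))

# Index stepping, as in the awk and perl programs.
by_stepping = []
for i in range(1, len(field), 2):
    by_stepping.append((float(field[i]), int(field[i + 1])))

print(by_slicing)
print(by_slicing == by_stepping)
```

Both approaches should yield the same list of pairs; the slicing form simply avoids the explicit index arithmetic.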





The data format consists of single-line records of this form:


<string:date> [ <float:data-n> <int:flag-n> ]*24


e.g.


1991-03-31      10.000  1       10.000  1       ... 20.000      1       35.000  1
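
As an illustration of the format, a record could be parsed and checked along these lines. This is only a sketch in Python 3 syntax: the helper names parse_record and summarise are hypothetical, the field-count check is an addition that the programs above do not perform, and the sample record is made up.

```python
def parse_record(line, pairs=24):
    """Split one record into (date, [(value, flag), ...]).

    Raises ValueError if the line does not hold exactly `pairs` pairs.
    """
    fields = line.split()
    expected = 1 + 2 * pairs
    if len(fields) != expected:
        raise ValueError("expected %d fields, got %d" % (expected, len(fields)))
    date = fields[0]
    values = [float(f) for f in fields[1::2]]
    flags = [int(f) for f in fields[2::2]]
    return date, list(zip(values, flags))


def summarise(readings):
    """Return (rejected, accepted, total, average) for one record's readings."""
    good = [value for value, flag in readings if flag >= 1]
    total = sum(good)
    average = total / len(good) if good else 0
    return len(readings) - len(good), len(good), total, average


# Made-up record: 23 accepted readings of 2.0 and one rejected reading.
line = "2004-12-31" + " 2.000 1" * 23 + " 0.000 -2"
date, readings = parse_record(line)
print(date, summarise(readings))
```

The four values returned by summarise correspond to the Reject, Accept, Line_tot and Line_avg columns printed by the programs above.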



