Mainly Tech projects on Python and Electronic Design Automation.

Tuesday, January 02, 2007

Data Mining: in three languages

I answered this post with a reply written in AWK, then wrote versions in Perl and
Python. You should be aware that I have written the most awk, then perl, then
python; but I think I know awk best, then python, then perl.


The awk example:


# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{
  nodata = 0;           # Current run of consecutive flags<0 in lines of file
  nodata_max=-1;        # Max consecutive flags<0 in lines of file
  nodata_maxline="!";   # ... and line number(s) where it occurs
}
FNR==1 {
  # Accumulate input file names
  if(infiles){
    infiles = infiles "," FILENAME
  } else {
    infiles = FILENAME
  }
}
{
  tot_line=0;           # sum of line data
  num_line=0;           # number of line data items with flag>0

  # extract field info, skipping initial date field
  for(field=2; field<=NF; field+=2){
    datum=$field;
    flag=$(field+1);
    if(flag<1){
      nodata++
    }else{
      # check run of data-absent fields
      if(nodata_max==nodata && (nodata>0)){
        nodata_maxline=nodata_maxline ", " $1
      }
      if(nodata_max<nodata && (nodata>0)){
        nodata_max=nodata
        nodata_maxline=$1
      }
      # re-initialise run of nodata counter
      nodata=0;
      # gather values for averaging
      tot_line+=datum
      num_line++;
    }
  }

  # totals for the file so far
  tot_file += tot_line
  num_file += num_line

  printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n", \
    $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0

  # debug prints of original data plus some of the computed values
  #printf "%s %15.3g %4i\n", $0, tot_line, num_line
  #printf "%s\n %15.3f %4i %4i %4i %s\n", $0, tot_line, num_line, nodata, nodata_max, nodata_maxline
}

END{
  printf "\n"
  printf "File(s) = %s\n", infiles
  printf "Total = %10.3f\n", tot_file
  printf "Readings = %6i\n", num_file
  printf "Average = %10.3f\n", tot_file / num_file

  printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
}


The same functionality in perl is very similar to the awk program:


# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN {
  $nodata = 0;          # Current run of consecutive flags<0 in lines of file
  $nodata_max=-1;       # Max consecutive flags<0 in lines of file
  $nodata_maxline="!";  # ... and line number(s) where it occurs
}
foreach (@ARGV) {
  # Accumulate input file names
  if($infiles ne ""){
    $infiles = "$infiles, $_";
  } else {
    $infiles = $_;
  }
}

while (<>){
  $tot_line=0;          # sum of line data
  $num_line=0;          # number of line data items with flag>0

  # extract field info, skipping initial date field
  chomp;
  @fields = split(/\s+/);
  $nf = @fields;
  $date = $fields[0];
  for($field=1; $field<$nf; $field+=2){
    $datum = $fields[$field] +0.0;
    $flag = $fields[$field+1] +0;
    if($flag < 1){
      $nodata++;
    }else{
      # check run of data-absent fields
      if($nodata_max==$nodata and ($nodata>0)){
        $nodata_maxline = "$nodata_maxline, $fields[0]";
      }
      if($nodata_max<$nodata and ($nodata>0)){
        $nodata_max = $nodata;
        $nodata_maxline=$fields[0];
      }
      # re-initialise run of nodata counter
      $nodata = 0;
      # gather values for averaging
      $tot_line += $datum;
      $num_line++;
    }
  }

  # totals for the file so far
  $tot_file += $tot_line;
  $num_file += $num_line;

  printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n",
    $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;
}

printf "\n";
printf "File(s) = %s\n", $infiles;
printf "Total = %10.3f\n", $tot_file;
printf "Readings = %6i\n", $num_file;
printf "Average = %10.3f\n", $tot_file / $num_file;

printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
  $nodata_max, $nodata_maxline;


The python program, however, splits the fields of the line slightly
differently (although it could use the method of the perl and awk
programs too):


# Author Donald 'Paddy' McCarthy Jan 01 2007

import fileinput
import sys

nodata = 0;           # Current run of consecutive flags<0 in lines of file
nodata_max=-1;        # Max consecutive flags<0 in lines of file
nodata_maxline=[];    # ... and line number(s) where it occurs

tot_file = 0          # Sum of file data
num_file = 0          # Number of file data items with flag>0

infiles = sys.argv[1:]

for line in fileinput.input():
    tot_line=0;       # sum of line data
    num_line=0;       # number of line data items with flag>0

    # extract field info
    field = line.split()
    date = field[0]
    data = [float(f) for f in field[1::2]]
    flags = [int(f) for f in field[2::2]]

    for datum, flag in zip(data, flags):
        if flag<1:
            nodata += 1
        else:
            # check run of data-absent fields
            if nodata_max==nodata and nodata>0:
                nodata_maxline.append(date)
            if nodata_max<nodata and nodata>0:
                nodata_max=nodata
                nodata_maxline=[date]
            # re-initialise run of nodata counter
            nodata=0;
            # gather values for averaging
            tot_line += datum
            num_line += 1

    # totals for the file so far
    tot_file += tot_line
    num_file += num_line

    print "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f" % (
        date,
        len(data) -num_line,
        num_line, tot_line,
        tot_line/num_line if (num_line>0) else 0)

print ""
print "File(s) = %s" % (", ".join(infiles),)
print "Total = %10.3f" % (tot_file,)
print "Readings = %6i" % (num_file,)
print "Average = %10.3f" % (tot_file / num_file,)

print "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
    nodata_max, ", ".join(nodata_maxline))







Timings:


$ time gawk -f readings.awk readingsx.txt readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt,readings.txt
Total = 1361259.300
Readings = 129579
Average = 10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real 0m1.069s
user 0m0.904s
sys 0m0.061s

$ time perl readings.pl readingsx.txt readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt, readings.txt
Total = 1361259.300
Readings = 129579
Average = 10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real 0m2.450s
user 0m1.639s
sys 0m0.015s

$ time /cygdrive/c/Python25/python readings.py readingsx.txt readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt, readings.txt
Total = 1361259.300
Readings = 129579
Average = 10.505

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05

real 0m1.138s
user 0m0.061s
sys 0m0.030s

$


The differences in the Python program are not an
optimisation. The nifty list indexing of [1::2] and the zip just flow
naturally (to me) from the data format.
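

As a purely illustrative sketch (using a made-up, shortened line of three
(datum, flag) pairs rather than the full 24), the slices and zip pair up
like this:


# Illustration only: how [1::2], [2::2] and zip carve up a record.
line = "1991-03-31 10.000 1 12.000 -2 13.000 1"
field = line.split()
date = field[0]                          # '1991-03-31'
data = [float(f) for f in field[1::2]]   # [10.0, 12.0, 13.0]
flags = [int(f) for f in field[2::2]]    # [1, -2, 1]
print zip(data, flags)                   # [(10.0, 1), (12.0, -2), (13.0, 1)]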





The data format consists of single-line records of this format:


<string:date> [ <float:data-n> <int:flag-n> ]*24


e.g.


1991-03-31      10.000  1       10.000  1       ... 20.000      1       35.000  1
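

Each record therefore has 1 + 2*24 = 49 whitespace-separated fields. A
small, hypothetical checker (not part of any of the programs above) could
validate a record like this:


# Hypothetical helper, not used by the programs above: check that a line
# matches the format (a date plus 24 (float datum, int flag) pairs).
def looks_like_record(line):
    f = line.split()
    if len(f) != 1 + 2*24:
        return False
    try:
        [float(d) for d in f[1::2]]   # every datum parses as a float
        [int(g) for g in f[2::2]]     # every flag parses as an int
    except ValueError:
        return False
    return True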



