Thursday, October 11, 2007

Wide Finder on the command line


Having followed articles on Wide Finder for the past week, I thought the command-line pipe version here gave fast results straight away but could be improved.

Unfortunately, my main computer at home, my 17" laptop, had to go in for repairs, so I decided to start using an old Toshiba Portege 7200 that runs XP but has a tiny disk.
The temporary laptop doesn't have the resources to run Cygwin, the Unix-on-Windows toolset I am comfortable with, so I installed GNU tools for Windows from here and bash for Windows from here, then set about re-running and improving the command-line example.
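
A quick sanity check, assuming the GnuWin32 binaries and bash ended up on the PATH, is to ask each tool for its version (a sketch only; the exact version strings will vary):

    ## Not one of the timed scripts: just confirm the GNU tools are being picked up
    gawk --version | head -1
    sort --version | head -1
    bash --version | head -1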



   ##
   ## clv1.sh
   ##

    # Original 'command line tools' version from:
    # http://www.dalkescientific.com/writings/diary/archive/2007/10/10/other_wide_finder_implementations.html
   time (
       grep -E 'GET /ongoing/When/[0-9]{3}x/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+ ' o1000k.ap \
       | gawk '{print $7}' \
       | sort \
       | uniq -c \
       | sort -n \
        | tail -10 \
       )

   ##
   ## clv2.sh
   ##

   # include grep functionality in gawk call
   time (
       gawk --re-interval '/GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{print $7}' o1000k.ap \
       | sort \
       | uniq -c \
       | sort -n \
        | tail -10 \
       )

   ##
   ## clv3.sh
   ##

   # uniq -c function moved into gawk
   time (
       gawk --re-interval '/GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{x[$7]++}END{for (n in x){print x[n], n}}' o1000k.ap \
       | sort -n \
        | tail -10 \
       )

   ##
   ## clv4.awk
   ##

   /GET \/ongoing\/When\/[0-9]{3}x\/[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/[^ .]+ /{
       field2count[$7]++
       }
   END{
       # asort requires no duplicate array values
       # Accumulate fields with same count 
       for (field in field2count){
           count = field2count[field]
           count2fields[count+0] = count2fields[count] count " " field "\n";
       }
       # invert and force value type to be numeric
       for (count in count2fields){
            allfields2count[count2fields[count]] = count+0
       }
       n = asort(allfields2count,orderedcounts)
       for (i=n-10; i<=n; i++){
           printf "%s", count2fields[orderedcounts[i]]
       }
   }


   ##
   ## clv4.sh
   ##

   # All but final tail moved into gawk
   time (
       gawk --re-interval -f clv4.awk o1000k.ap \
       | tail \
       )

   ##
   ## clv5.awk
   ##

   /GET \/ongoing\/When\/[0-9][0-9][0-9]x\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]\/[^ .]+ /{
       field2count[$7]++
       }
   END{
       # asort requires no duplicate array values
       # Accumulate fields with same count 
       for (field in field2count){
           count = field2count[field]
           count2fields[count+0] = count2fields[count] count " " field "\n";
       }
       # invert and force value type to be numeric
       for (count in count2fields){
            allfields2count[count2fields[count]] = count+0
       }
       n = asort(allfields2count,orderedcounts)
       for (i=n-10; i<=n; i++){
           printf "%s", count2fields[orderedcounts[i]]
       }
   }


   ##
   ## clv5.sh
   ##

   # All but final tail moved into gawk
   # Change regexp to not use interval expressions
   time (
       gawk -f clv5.awk o1000k.ap \
       | tail \
       )


Here are the timings for each version (best of three runs):


 clv1.sh 43.3s
 clv2.sh 42.1s
 clv3.sh 32.5s
 clv4.sh 32.7s
 clv5.sh 32.5s

Conclusion


The biggest gain came when I used gawk to do the frequency counting in clv3.sh, which is the kind of thing I would naturally do anyway and saves a third of the time. That would make it the fastest in the table here, with a proportionally scaled run time of 0.62 seconds.
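
For reference, the change in clv3.sh boils down to replacing the 'sort | uniq -c' pair with an associative-array count inside gawk; stripped of the Wide Finder regexp, and with a made-up file name, the idiom is roughly:

    # Count occurrences of field 7 in an associative array,
    # then let sort/tail pick out the ten most frequent values.
    gawk '{x[$7]++} END{for (n in x) print x[n], n}' some.log \
    | sort -n \
    | tail -10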

STOP Press!


I decided to compare run times with Fredrik Lundh's work. On my initial run I got times above 32s for everything, with wf-6.py taking over 90s.
The memory mapping seemed to be sticky, as a subsequent run gave the following timings:

 wf-1-gala.py 30.4s
 wf-2.py 8.2s
 wf-3.py 9.2s
 wf-4.py 9.5s
 wf-5.py 10.0s
 wf-6.py 6.9s

This is much faster. I decided to test the persistence of the memory mapping of the file by re-running the command-line tests, which gave improved results of:

 clv1.sh 39.5s
 clv2.sh 43.0s
 clv3.sh 17.0s
 clv4.sh 14.8s
 clv5.sh 14.8s


It seems that memory-mapping a file decreases run time, but on my slow laptop, if the file is only to be used once, then the time taken to make the file memory resident may kill any time savings.
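
One way to separate the cold-cache and warm-cache cases would be to read the file once before timing, so the operating system already has it cached; a rough sketch, assuming a plain sequential read is enough to warm the cache on this setup:

    # Hypothetical warm-up: read the file once and discard the output,
    # then time the pipeline again against the (hopefully) cached file.
    cat o1000k.ap > /dev/null
    time (gawk --re-interval -f clv4.awk o1000k.ap | tail)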

If this task were the only thing I had to do in a cron job on the hour, every hour, then despite knowing Python and Perl I would probably go for an awk solution. I would probably have stopped at clv3.sh.
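
Purely as an illustration, under a Unix-style cron an hourly run might look like the entry below; the paths are hypothetical, and on the XP laptop the Task Scheduler would have to stand in for cron.

    # Hypothetical crontab entry: at minute 0 of every hour, append the
    # top-ten report to a log file (both paths are made up).
    0 * * * * /home/paddy/bin/clv3.sh >> /home/paddy/widefinder_top10.log 2>&1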

6 comments:

  1. Cool! Until I got comfortable with perl, I wrote a lot of awk and gawk code.

    I didn't have gawk on my machine so I installed the latest version, 3.1.5. My wall-clock time for clv5.awk was 3.36 seconds (2.234u 0.948s 0:03.36 94.3%). That's a lot faster than normal awk, which died after 30 seconds saying it knew nothing about asort. But it's slower than most of the versions I tested.

  2. Thanks Andrew for the feedback!
     I'm still comfortable with awk (I learnt it well before I knew Perl or Python).

  3. Thanks for showing that awk still rocks.

    How about a parallel version now, like here?

  4. Perhaps you could still speed things up by only keeping a top 10 of the most frequent requests as you go (and forgetting about the rest).

    Something like this (sorry about the formatting):

     match($0, /GET [^ .]+ /) {
         counts[substr($0, RSTART, RLENGTH)]++
     }
     END {
         for (req in counts) {
             x = counts[req]
             y = req
             for (i=1; i<=n; i++)
                 if (top[i,0] < x) {
                     xt = top[i,0]
                     yt = top[i,1]
                     top[i,0] = x
                     top[i,1] = y
                     x = xt
                     y = yt
                 }
             if (n<=10) {
                 n++
                 top[n,0] = x
                 top[n,1] = y
             }
         }
         OFS=": "
         for (i=1; i<=10; i++)
             print top[i,0], top[i,1]
     }

  5. It seems to me that using head instead of tail and giving the second sort a "-r" should speed things up a little bit.

    In addition, it would be very interesting to see how much time the different programs take (e.g., I suspect the second sort to be more costly than the first, but how does the first sort relate to the grep?)

  6. Please, anonymous people (anonymi?): could you add some sort of distinguishing name/number so I can refer to you individually?

    For a parallel version, I hope to get some time one evening to try this at work on their dual & quad core machines.

     On the top 10 suggestion, I can't fully follow your example - I hate Blogger's comment box formatting too - but I tried to use the built-in sort to find the top 10 rather than explicit AWK loops, as the underlying C code should be faster.


     I like the idea of using 'sort -r | head' instead of 'sort | tail'; I should time it :-)

    - Paddy.
