Tuesday, April 21, 2009

Monitoring a Linux process as it reads files

I was processing 20 thousand files with some proprietary software and
needed to monitor how far it had got in reading the files. In my own Python
version of the utility, reading the data was ten times faster than
the subsequent processing, and I wanted to find out whether this proprietary
solution, which was having performance problems, divided its time the same
way or was spending most of its time reading data.



The proprietary program took a list of 20000 files to process as its
first argument. I remembered that on Linux the /proc
directory has information on running processes and, sure enough, the
/proc/<process id>/fd directory shows every file
descriptor currently open by the process as a symbolic link. So by opening the
list of files in my editor and searching it for the file name
shown on one of the file descriptors, I could gauge how many
files had been read so far.
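
As a rough illustration of that manual check, the links in the fd directory can be read from Python. This is only a sketch: the function name open_files is made up for this example, and it assumes a Linux /proc filesystem and permission to inspect the target process.

```python
import os

def open_files(pid):
    """Return {fd_number: target_path} for the links in /proc/<pid>/fd."""
    fd_dir = "/proc/%d/fd" % pid
    targets = {}
    for entry in os.listdir(fd_dir):
        try:
            targets[int(entry)] = os.readlink(os.path.join(fd_dir, entry))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets

if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    # Inspect this very process; pass another PID to watch a different one.
    for fd, target in sorted(open_files(os.getpid()).items()):
        print("%d -> %s" % (fd, target))
```

Each value is the path the descriptor points at, which is the name I was searching for in the list of files.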



I decided to automate the checking and wrote a shell script using
cat/fgrep/gawk/... that then told me what line in the list of files to
process the program was currently at.
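
That shell script is not reproduced here. A Python stand-in for its lookup step might look like the sketch below; the name line_number_of is hypothetical, and it assumes the list file holds one path per line, written exactly as it appears on the file descriptor link.

```python
def line_number_of(list_path, wanted):
    """Return the 1-based line number of `wanted` in the list file, or None."""
    with open(list_path) as listing:
        for number, line in enumerate(listing, 1):
            if line.strip() == wanted:
                return number
    return None
```

Dividing the returned line number by the length of the list gives a rough idea of how far the reading has got.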



Now I've had time to refine things to use mainly Python, but to
demonstrate its use I also have to generate a test environment.



First create some test files to process:


bash$ mkdir -p /tmp/test
bash$ for ((i=0; i < 100; i++)); do touch /tmp/test/file$i; done
bash$ /bin/ls /tmp/test/file* > /tmp/test/all_files.lst
bash$ head /tmp/test/all_files.lst
/tmp/test/file0
/tmp/test/file1
/tmp/test/file10
/tmp/test/file11
/tmp/test/file12
/tmp/test/file13
/tmp/test/file14
/tmp/test/file15
/tmp/test/file16
/tmp/test/file17
bash$
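
Note that the list is in string order, not numeric order, so file10 comes before file2. The small sketch below reproduces that ordering with a plain Python sort, on the assumption that it matches what ls produced here, and shows where a few names end up in the list.

```python
# Build the same 100 names and sort them as strings, not as numbers.
names = ["/tmp/test/file%d" % i for i in range(100)]
listing = sorted(names)

print(listing[:3])
# ['/tmp/test/file0', '/tmp/test/file1', '/tmp/test/file10']

# 0-based positions of three names that are neighbours in string order.
for name in ("/tmp/test/file29", "/tmp/test/file3", "/tmp/test/file30"):
    print("%d %s" % (listing.index(name), name))
# 22 /tmp/test/file29
# 23 /tmp/test/file3
# 24 /tmp/test/file30
```

This is why a progress report that steps from file29 to file3 to file30 is moving forwards through the list, not jumping around.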



Now let's create a test executable to monitor.


This script just holds each file open for reading for 45 seconds
before closing the file.

bash$ python -c 'import sys,time
for name in open(sys.argv[1]):
    f = open(name.strip())
    time.sleep(45)
    f.close()
' /tmp/test/all_files.lst &
[2] 7984
bash$



Here is what the fd directory looks like:


bash$ ls -l /proc/7984/fd
total 0
lrwxrwxrwx 1 HP DV8025EA None 0 Apr 21 22:17 0 -> /dev/tty1
lrwxrwxrwx 1 HP DV8025EA None 0 Apr 21 22:17 1 -> /dev/tty1
lrwxrwxrwx 1 HP DV8025EA None 0 Apr 21 22:17 2 -> /dev/tty1
lrwxrwxrwx 1 HP DV8025EA None 0 Apr 21 22:17 3 -> /tmp/test/all_files.lst
lrwxrwxrwx 1 HP DV8025EA None 0 Apr 21 22:17 4 -> /tmp/test/file0
bash$



And here is a Python script that periodically checks link number 4
in that fd directory:


bash$ python -c 'import sys,time,os,datetime
name2index = dict((name.strip(), index) for index,name in enumerate(open(sys.argv[1])))
total = len(name2index)
while True:
    path = os.path.realpath("/proc/7984/fd/4").strip()
    print("%d / %d %s %s" % (name2index[path], total, path, datetime.datetime.now().isoformat()))
    time.sleep(30)
' /tmp/test/all_files.lst
22 / 100 /tmp/test/file29 2009-04-21T22:34:07.817750
23 / 100 /tmp/test/file3 2009-04-21T22:34:37.820750
24 / 100 /tmp/test/file30 2009-04-21T22:35:07.825750
24 / 100 /tmp/test/file30 2009-04-21T22:35:37.834750
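
The one-liner above hard-codes both the process id and descriptor number 4, and it stops with a KeyError as soon as that link points at something outside the list. A slightly more forgiving variant might look like the sketch below. The function names (load_index, current_position, monitor) are made up for illustration, and it assumes the paths in the list match the link targets exactly.

```python
import datetime
import os
import time

def load_index(list_path):
    """Map each listed file name to its 0-based position in the list."""
    with open(list_path) as listing:
        return dict((line.strip(), i) for i, line in enumerate(listing))

def current_position(pid, name2index):
    """Return (index, path) for an open descriptor found in the list, else None."""
    fd_dir = "/proc/%d/fd" % pid
    try:
        entries = os.listdir(fd_dir)
    except OSError:
        return None  # process has gone, or we may not inspect it
    for entry in entries:
        try:
            path = os.readlink(os.path.join(fd_dir, entry))
        except OSError:
            continue  # descriptor closed while we were looking
        if path in name2index:
            return name2index[path], path
    return None

def monitor(pid, list_path, interval=30):
    """Print progress every `interval` seconds until the process exits."""
    name2index = load_index(list_path)
    total = len(name2index)
    while os.path.isdir("/proc/%d" % pid):
        found = current_position(pid, name2index)
        stamp = datetime.datetime.now().isoformat()
        if found is None:
            print("no listed file open %s" % stamp)
        else:
            print("%d / %d %s %s" % (found[0], total, found[1], stamp))
        time.sleep(interval)
```

For the test process above, calling monitor(7984, "/tmp/test/all_files.lst") would print lines in the same format as the one-liner and return once the process has gone.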

                             



I watched the monitor output over the next couple of hours and found
out when file reading ended and processing of read data started.



END.

1 comment:

  1. strace -e open -p pid

    will tell you the same kind of information.
