Tuesday, January 06, 2009

word differ

My brother was over from France this new year and he asked me to write him a small script for the first time.



His problem was that he had some prose in a foreign language, and a list of words that the students should learn. He wanted stats on which words from the list were actually used in his prose.



I wrote the following small script. Interestingly enough, the bit that needed a bit of research was doing the graphical file opening for Windows - I really am that used to a Unix environment :-)



It is a very simple program but, on reflection, I can remember doing similar work twenty years ago in AWK: finding out what is different between two 'things' and printing the differences in sorted order. Back then I would have used Sun's original awk, keeping the two sets of data in associative arrays, doing the set arithmetic in explicit for loops, and calling out to the Unix sort utility to print sorted output. Such specialized diff'ing tasks have cropped up regularly for me, and they are now very easy to do in Python (or Perl); I have never had to resort to C for even the largest datasets I've encountered.
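
To make that comparison concrete, here is a rough sketch of the same "what is in the first but not the second" calculation done both ways: awk-style, with dictionaries standing in for associative arrays and explicit for loops, and then with Python sets. The function names and the tiny word lists are made up for illustration, and it is written in Python 3 syntax rather than the Python 2 of the script further down.

```python
def diff_with_loops(first_words, second_words):
    """Awk-style: dictionaries as associative arrays, explicit loops."""
    seen_first = {}
    for w in first_words:
        seen_first[w] = True
    seen_second = {}
    for w in second_words:
        seen_second[w] = True
    # The "set arithmetic": keep keys of the first array missing from the second.
    only_first = []
    for w in seen_first:
        if w not in seen_second:
            only_first.append(w)
    only_first.sort()  # awk would have piped the output to the Unix sort utility
    return only_first


def diff_with_sets(first_words, second_words):
    """The same calculation using built-in set difference."""
    return sorted(set(first_words) - set(second_words))


# Hypothetical sample data, not taken from the example files in this post.
word_list = ["back", "black", "jump", "missile"]
prose_words = ["back", "black", "mary", "lamb"]

loop_result = diff_with_loops(word_list, prose_words)
set_result = diff_with_sets(word_list, prose_words)
print(loop_result)
print(set_result)
```

Both functions should give the same sorted list; the set version just lets the language do the bookkeeping.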



word_differ.pyw


Here is the program word_differ.pyw:


# -*- coding: cp1252 -*-

'''
Find the difference in words used in two documents

Asks for two files which then have their words extracted and compared.

Most punctuation, and all words of just numbers are dropped.

(C) 2009 Donald 'Paddy' McCarthy. paddy3118@gmail.com
'''

import Tkinter, tkFileDialog, re, datetime, sys
root = Tkinter.Tk()
root.withdraw()


def wordsinfile(f):
    words = set()
    txt = f.read()
    words = set( w.lower() for w in re.split(r"""[ \t\n\-,;:\.\?@#~=+_!"£$%^&\*\(\)<>|\\\[\]{}`]""", txt)
                 if w and not all( c in '0123456789' for c in w) )
    return words


f1 = tkFileDialog.askopenfile(parent=root,initialdir="/",title='First text file for word diffing')
if not f1: sys.exit()
f2 = tkFileDialog.askopenfile(parent=root,initialdir="/",title='Second text file for word diffing')
if not f2: sys.exit()
f3 = tkFileDialog.asksaveasfile(parent=root,initialdir="/",title='Save output as file (end with .txt to create a windows text file)')
if not f3: sys.exit()

set1 = wordsinfile(f1)
set2 = wordsinfile(f2)

print >>f3, "Output from program word_differ.py (Author Donald McCarthy: (C) 2009)."
print >>f3, " Generated on: " + datetime.datetime.now().isoformat()

print >>f3, " First File: " + f1.name
print >>f3, " Second File: " + f2.name
print >>f3, " Output to: " + f3.name

print >>f3, "\nWords in the first file that are not in the second:"
for w in sorted(set1 - set2):
    print >>f3, " ", w
print >>f3, "\nWords in the second file that are not in the first:"
for w in sorted(set2 - set1):
    print >>f3, " ", w
print >>f3, "\nWords common to both files:"
for w in sorted(set1 & set2):
    print >>f3, " ", w

f1.close(); f2.close(); f3.close()


Example word list: word list.txt


back
black
jump
bored
hair
missile
lamb
little
played
sent

Example prose: rhyme.txt


Mary had a little lamb
Who's hair was long and black.
She played with it,
Got bored with it
And then she sent it back!

By Paddy McCarthy - 2009-01-04.

Example output: diff.txt


Output from program word_differ.py (Author Donald McCarthy: (C) 2009).
Generated on: 2009-01-04T07:20:18.890000
First File: C:/Documents and Settings/All Users/Documents/Paddys/word_differ/word list.txt
Second File: C:/Documents and Settings/All Users/Documents/Paddys/word_differ/rhyme.txt
Output to: C:/Documents and Settings/All Users/Documents/Paddys/word_differ/diff.txt

Words in the first file that are not in the second:
jump
missile

Words in the second file that are not in the first:
a
and
by
got
had
it
long
mary
mccarthy
paddy
she
then
was
who's
with

Words common to both files:
back
black
bored
hair
lamb
little
played
sent
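
For anyone who wants to check the three word lists above without going through the file dialogs, the following is a minimal sketch that applies the same splitting rule to the two example texts held in strings. It is written in Python 3 syntax, and the names `SPLIT_CHARS`, `words_in_text`, `unused`, `extra` and `common` are my own for this sketch; they do not appear in word_differ.pyw.

```python
import re

# Same character class as in wordsinfile(): split on whitespace and most punctuation.
# The apostrophe is not in the class, so "Who's" survives as a single word.
SPLIT_CHARS = re.compile(r"""[ \t\n\-,;:\.\?@#~=+_!"£$%^&\*\(\)<>|\\\[\]{}`]""")


def words_in_text(txt):
    """Return the lower-cased words in txt, dropping empty strings and pure numbers."""
    return set(w.lower() for w in SPLIT_CHARS.split(txt)
               if w and not all(c in '0123456789' for c in w))


word_list = """back
black
jump
bored
hair
missile
lamb
little
played
sent
"""

rhyme = """Mary had a little lamb
Who's hair was long and black.
She played with it,
Got bored with it
And then she sent it back!

By Paddy McCarthy - 2009-01-04.
"""

set1 = words_in_text(word_list)
set2 = words_in_text(rhyme)

unused = sorted(set1 - set2)   # words in the list that the prose never uses
extra = sorted(set2 - set1)    # words in the prose that are not in the list
common = sorted(set1 & set2)   # words found in both

for title, words in (("Only in word list", unused),
                     ("Only in rhyme", extra),
                     ("Common", common)):
    print(title + ":")
    for w in words:
        print("  ", w)
```

Note that the date at the end of the rhyme is split on its hyphens into all-digit pieces, which the numbers rule then drops.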

1 comment:

  1. FWIW, if you want to avoid the GUI work in the future you could try making your Python code into a utility at Utility Mill. It sets up a web interface for you so you just put in the Python code that actually does the work. About 5 lines in your case :-)

    (Disclaimer: I'm the utility mill developer.)
