My brother was over from France this new year and he asked me to write him a small script for the first time.
His problem was that he had some prose in a foreign language and a list of words that his students should learn, and he wanted stats on which words from the list were actually used in the prose.
I wrote the following small script. Interestingly enough, the part that needed some research was the graphical file-open dialog on Windows - I really am that used to a Unix environment :-)
It is a very simple program but, on reflection, I can remember doing similar work twenty years ago in AWK: finding out what is different between two 'things' and printing the differences in sorted order. Back then I would have used Sun's original awk, keeping the two sets of data in associative arrays, doing the set arithmetic in explicit for loops, and calling out to the Unix sort utility to print sorted output. Such specialized diffing tasks have cropped up regularly for me, and they are now very easy to do in Python (or Perl); I have never had to resort to C for even the largest datasets I've encountered.
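To make that contrast concrete, here is a rough sketch in Python 3 (the script below is Python 2). The first function mimics the AWK approach, with dictionaries standing in for associative arrays and the set arithmetic done in explicit loops; the second uses Python's built-in set type. The function names and the sample data are made up for illustration.

```python
def awk_style_difference(first, second):
    """Items in `first` but not in `second`, done the associative-array way."""
    seen_first = {}
    for item in first:
        seen_first[item] = 1
    seen_second = {}
    for item in second:
        seen_second[item] = 1
    only_first = []
    for item in seen_first:
        if item not in seen_second:
            only_first.append(item)
    # AWK would hand this step to the external Unix sort utility.
    only_first.sort()
    return only_first


def set_style_difference(first, second):
    """The same result using set subtraction and the built-in sorted()."""
    return sorted(set(first) - set(second))


# Hypothetical sample data, loosely based on the example further down.
wordlist = ["back", "jump", "lamb", "missile"]
prose = ["mary", "had", "a", "little", "lamb", "back"]

print(awk_style_difference(wordlist, prose))
print(set_style_difference(wordlist, prose))
```

Both calls should print the same sorted list; the set version simply leaves the bookkeeping to the language.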
word_differ.pyw
Here is the program word_differ.pyw:
# -*- coding: cp1252 -*-
'''
Find the difference in words used in two documents
Asks for two files which then have their words extracted and compared.
Most punctuation, and all words of just numbers are dropped.
(C) 2009 Donald 'Paddy' McCarthy. paddy3118@gmail.com
'''
import Tkinter, tkFileDialog, re, datetime, sys
root = Tkinter.Tk()
root.withdraw()
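def wordsinfile(f):
    'Return the set of lower-cased words in the open file f'
    txt = f.read()
    # Split on whitespace and most punctuation, then drop empty strings
    # and any word made up only of digits
    words = set( w.lower() for w in re.split(r"""[ \t\n\-,;:\.\?@#~=+_!"£$%^&\*\(\)<>|\\\[\]{}`]""", txt)
                 if w and not all( c in '0123456789' for c in w) )
    return words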
f1 = tkFileDialog.askopenfile(parent=root,initialdir="/",title='First text file for word diffing')
if not f1: sys.exit()
f2 = tkFileDialog.askopenfile(parent=root,initialdir="/",title='Second text file for word diffing')
if not f2: sys.exit()
f3 = tkFileDialog.asksaveasfile(parent=root,initialdir="/",title='Save output as file (end with .txt to create a windows text file)')
if not f3: sys.exit()
set1 = wordsinfile(f1)
set2 = wordsinfile(f2)
print >>f3, "Output from program word_differ.py (Author Donald McCarthy: (C) 2009)."
print >>f3, " Generated on: " + datetime.datetime.now().isoformat()
print >>f3, " First File: " + f1.name
print >>f3, " Second File: " + f2.name
print >>f3, " Output to: " + f3.name
print >>f3, "\nWords in the first file that are not in the second:"
for w in sorted(set1 - set2):
    print >>f3, " ", w
print >>f3, "\nWords in the second file that are not in the first:"
for w in sorted(set2 - set1):
    print >>f3, " ", w
print >>f3, "\nWords common to both files:"
for w in sorted(set1 & set2):
    print >>f3, " ", w
f1.close(); f2.close(); f3.close()
Example word list: word list.txt
back
black
jump
bored
hair
missile
lamb
little
played
sent
Example prose: rhyme.txt
Mary had a little lamb
Who's hair was long and black.
She played with it,
Got bored with it
And then she sent it back!
By Paddy McCarthy - 2009-01-04.
Example output: diff.txt
Output from program word_differ.py (Author Donald McCarthy: (C) 2009).
Generated on: 2009-01-04T07:20:18.890000
First File: C:/Documents and Settings/All Users/Documents/Paddys/word_differ/word list.txt
Second File: C:/Documents and Settings/All Users/Documents/Paddys/word_differ/rhyme.txt
Output to: C:/Documents and Settings/All Users/Documents/Paddys/word_differ/diff.txt
Words in the first file that are not in the second:
jump
missile
Words in the second file that are not in the first:
a
and
by
got
had
it
long
mary
mccarthy
paddy
she
then
was
who's
with
Words common to both files:
back
black
bored
hair
lamb
little
played
sent
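
For anyone who wants to check the example without the file dialogs, the following is a minimal sketch in Python 3 (the script above is Python 2) that applies the same splitting rule to the word list and rhyme, held here as strings. The names SEPARATORS, words_in_text, WORD_LIST and RHYME are introduced only for this sketch and are not part of the original program.

```python
import re

# Same separator characters as the regular expression in word_differ.pyw.
SEPARATORS = re.compile(r"""[ \t\n\-,;:\.\?@#~=+_!"£$%^&\*\(\)<>|\\\[\]{}`]""")


def words_in_text(txt):
    """Return the set of lower-cased words in txt, dropping all-digit words."""
    return {
        w.lower()
        for w in SEPARATORS.split(txt)
        if w and not all(c in "0123456789" for c in w)
    }


WORD_LIST = """back
black
jump
bored
hair
missile
lamb
little
played
sent
"""

RHYME = """Mary had a little lamb
Who's hair was long and black.
She played with it,
Got bored with it
And then she sent it back!
By Paddy McCarthy - 2009-01-04.
"""

set1 = words_in_text(WORD_LIST)
set2 = words_in_text(RHYME)

print("Words in the first that are not in the second:", sorted(set1 - set2))
print("Words in the second that are not in the first:", sorted(set2 - set1))
print("Words common to both:", sorted(set1 & set2))
```

Note that the apostrophe is not among the separators, which is why "who's" survives as a single word, while the date on the last line splits into all-digit pieces that are dropped.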