I've been evolving a script and thought
I would share one aspect of the design that helped in development and
maintenance.
The Python script takes around five or
six machine-generated files as input, manipulates and checks the data
and derives new data from it, then stuffs the new data into a database.
Parse
My input data files all have static
path names, so I have a parser for each file that takes a file-name
argument whose default is the correct name. This leads to functions
looking like:
def read_all_colours(fname='all_colours.lst'):
    '''{A description of what is held in the file, how it is
    organised, and what is to be extracted}'''
    # Parsing code
    with open(fname) as f:
        # Depending on the complexity of the task
        # parsing might include regexps (using kodos)
        # or not.
        …
    # Only elementary data checks on a single file's
    # data are done here.
    …
    return all_colours

def read_xyz_file(fname='xyz_file.lst'):
    …
    # Additionally, for dictionaries I adopt a naming
    # scheme of <key>2<value>, so a mapping
    # of kites to colours would be called kite2colour.
    return xyz2abc
…
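Fleshed out, a parser in this style might look like the following sketch of read_kite_info (which gets called from the top level later). The two-column file layout and the elementary checks are invented here purely for illustration:

def read_kite_info(fname='kite_info.lst'):
    '''Kite data, one kite per line: the kite's name then its
    colour, whitespace separated. Extracts the set of kite names
    and the kite-to-colour mapping.

    Example lines:
        box_kite   red
        delta      blue
    '''
    with open(fname) as f:
        pairs = [line.split() for line in f if line.strip()]
    # Only elementary checks on a single file's data are done here
    assert all(len(p) == 2 for p in pairs), 'Malformed line in ' + fname
    all_kites = set(p[0] for p in pairs)
    kite2colour = dict(pairs)
    assert len(all_kites) == len(pairs), 'Duplicate kite name in ' + fname
    return all_kites, kite2colour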
I try to preserve the names returned
from the read_* functions and call them from the top level. (Those
read_* functions may call other functions of course but for the sake
of this blog entry, I've simplified things).
My global calls might look like:
if __name__ == '__main__':
    all_colours = read_all_colours()
    xyz2abc = read_xyz_file()
    …
    all_kites, kite2colour = read_kite_info()
I can switch from vim and run this up in IDLE, checking and sampling
the global variables to get the parsers debugged, and also starting
to write interactive code snippets for the next phase – cross
checks.
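A typical probe of the globals in the IDLE shell might look like this (outputs elided; the expressions are only examples):

>>> all_colours[:5]                               # sample the parsed data
>>> sorted(kite2colour)[:3]                       # spot-check a mapping's keys
>>> set(kite2colour.values()) - set(all_colours)  # a budding cross-check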
Consistency checking
def check_consistency(all_colours, xyz2abc, …, all_kites, kite2colour):
    'Take all the parsed data and apply consistency checks'
    # Maybe all the colours in kite2colour must be members of all_colours
    assert set(kite2colour.values()).issubset(set(all_colours)), \
        'Some kite has an unknown colour'
    # Further checks with associated assertions
    …
The individual consistency checks can be developed in the editor and
the whole script then executed, or just the check_consistency updates
cut-n-pasted into IDLE. Alternatively, after the parsing is done all
the data is available globally in IDLE, so I can also just mess with
the data interactively, being careful not to change the parsed data,
develop checks there, then paste them into the check_consistency
function in the editor.
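For example, checks like these start life as one-liners in the shell before being pasted into check_consistency (the specific rules here are invented for illustration):

assert set(kite2colour) == set(all_kites), \
    'Kite list and kite-to-colour mapping disagree'
assert all(colour == colour.lower() for colour in all_colours), \
    'Colour names are expected in lower case'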
When I'm done, I add a global
call to check_consistency(...) after all the parsing steps.
Munging
This is the part where I take the
parsed and checked input data and apply the business logic to
create the derived data in a form that makes it easy to output. It
may be split into several functions called from the global context,
or be just one. If intermediate data is generated that I think may be
“key”, then I like to return that to the global namespace too.
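A minimal sketch of such a munging function, matching the names used later in the first draft structure; the business rule in the body is invented purely for illustration:

def munge(all_colours, xyz2abc, all_kites, kite2colour):
    'Apply the business logic to the parsed and checked data'
    # Invented rule: we only deal in kites whose colour appears
    # among the abc values
    wanted_colours = set(xyz2abc.values())
    our_kites = set(k for k in all_kites
                    if kite2colour[k] in wanted_colours)
    our_kite_colours = dict((k, kite2colour[k]) for k in our_kites)
    # "Key" intermediate data is returned as well, so it lands in
    # the global namespace for inspection in IDLE
    best_colours = sorted(set(our_kite_colours.values()))
    return our_kites, our_kite_colours, best_colours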
Once the data munging function is
created and the program run in IDLE, intermediate and final data can
be inspected and further tests done quite easily, as most of what you
need is available in the IDLE REPL environment when the program ends.
Output
A straightforward call to a function (or functions)
that does only simple data manipulation, if necessary, to dump the
derived data in the correct output format.
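So write_db might be no more than this kind of sketch; the table layout is invented for illustration:

def write_db(our_kites, our_kite_colours, fname='business_db.sql'):
    'Dump the derived data as SQL INSERT statements'
    with open(fname, 'w') as f:
        for kite in sorted(our_kites):
            f.write("INSERT INTO kite_colours VALUES ('%s', '%s');\n"
                    % (kite, our_kite_colours[kite]))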
First draft program structure
## namedtuple definitions are great for program structure
## if you can keep to short field names
…

## Parser functions
…

## Checker functions
…

## Business logic functions
…

## Output functions
def write_db(our_kites, our_kite_colours, fname='business_db.sql'):
    …

## Global environment
if __name__ == '__main__':
    ## Parse
    all_colours = read_all_colours()
    xyz2abc = read_xyz_file()
    …
    all_kites, kite2colour = read_kite_info()

    ## Check
    check_consistency(all_colours, xyz2abc, …, all_kites, kite2colour)

    ## Munge
    our_kites, our_kite_colours, best_colours = \
        munge(all_colours, xyz2abc, …, all_kites, kite2colour)

    ## Output
    write_db(our_kites, our_kite_colours)
Testing
I then have the first script that I
can slot into my design flow – I have other programs and scripts
that generate its inputs, and tools ready to use the database it
creates. I can test it using real-world, historical, and pre-computed
data, and data generated by alternative tools that do the same thing.
Others can test it too, and from the feedback I get, I debug and
maintain it.
Debugging/Maintenance
I keep examples of what the parsers are
set up to parse in comments in their read_* functions. This helps
when a bug is due to some extra complication in the structure of an
input file. Usually I first run the script in IDLE, then
cut-n-paste the guts of the parser function onto the command line to
debug parser failures. I'm used to it; it works for me. From the
nature of the problem reported, you can run the script using the
problem data and then interrogate the global data on the IDLE command
line to trace errors to their cause.
If I've used namedtuples then that
makes getting back into the code much easier. Listing just one
element of a compound homogeneous variable, such as a list/set/dict
of items, is then enough to remind you of what data you have and how
it is structured.
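Something like this, with deliberately short field names (the fields are invented for illustration):

from collections import namedtuple

Kite = namedtuple('Kite', 'name colour area')

# In IDLE, displaying a single element of a list of Kites,
# e.g. Kite(name='box', colour='red', area=1.5),
# is enough to remind you of the record's structure.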
Because my source is under version
control, I insert print statements for debug purposes and stick '0/0'
on a line to stop the interpreter, as well as setting breakpoints
(horses for courses and all that).
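The sort of throwaway lines I mean, deleted again once the bug is found (version control makes that safe):

# Temporary debug aids
print('kite2colour sample: %r' % list(kite2colour.items())[:5])
0/0  # ZeroDivisionError stops the run here; globals stay inspectable in IDLE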
Production
For that I add a better command-line
interface and documentation, and ensure sufficient tests are under
version control too.
End bit.
This is written because I've just been
working on another way to do a task, wrote the new script in the
same way as I wrote the last, and realized that I had converged on a
structure that allowed the previous (and other) scripts to be
easily maintained.
I regularly get to “eat my own dog
food”, i.e. maintain what I write over the years, and have to
explain my code. 'Parse inputs → Munge → Dump to output(s)' is a
recurring theme, so I guess it is no surprise that I seem to have
converged on a development template. Maybe it's time to apply
some heat to this simulated annealing :-)