Thursday, February 07, 2013

Parse inputs → Munge → Dump


I've been evolving a script and thought I would share one aspect of the design that helped in development and maintenance.

The Python script takes around five or six machine generated files as input, manipulates, checks and derives new data, then stuffs the new data into a database.

Parse

My input data files all have fixed path names, so I write one parser function per file. Each takes the file name as an argument, with the usual, correct name as its default. This leads to functions looking like:

def read_all_colours(fname='all_colours.lst'):
    '''{A description of what is held in the file, 
    how it is organised, and what is to be extracted}'''

    # Parsing code
    with open(fname) as f:
        # Depending on the complexity of the task
        # parsing might include regexps (using kodos)
        # or not.
        …
        # Only elementary checks on a single file's
        # data are done here.
        …
    return all_colours

def read_xyz_file(fname='xyz_file.lst'):
    …
    # Additionally, for dictionaries I adopt a naming
    # scheme of '<key>2<value>', so a mapping
    # of kites to colours would be called kite2colour.

    return xyz2abc

I try to preserve the names returned from the read_* functions when I call them from the top level. (Those read_* functions may call other functions of course, but for the sake of this blog entry I've simplified things.)
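As a concrete illustration, here is a minimal sketch of one such parser. The file format (one "kite colour" pair per line, with `#` comment lines) and the default name `kite_info.lst` are assumptions made up for this example, not the real input format.

```python
import os
import tempfile

def read_kite_info(fname='kite_info.lst'):
    '''Hypothetical format: one "kite colour" pair per line;
    blank lines and lines starting with # are skipped.'''
    all_kites = []
    kite2colour = {}
    with open(fname) as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            fields = line.split()
            # Elementary check on this one file only.
            assert len(fields) == 2, \
                '%s:%i: expected "kite colour", got %r' % (fname, lineno, line)
            kite, colour = fields
            all_kites.append(kite)
            kite2colour[kite] = colour
    return all_kites, kite2colour

# Usage: write a small sample file, then parse it.
sample = '# kite colour\ndelta red\n\nbox blue\n'
with tempfile.NamedTemporaryFile('w', suffix='.lst', delete=False) as tmp:
    tmp.write(sample)
try:
    all_kites, kite2colour = read_kite_info(tmp.name)
finally:
    os.remove(tmp.name)
print(all_kites)
print(kite2colour)
```

Passing an explicit name, as the usage lines do, is what makes the parser easy to point at a small test file while the default still covers the normal run.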

My global calls might look like:

if __name__ == '__main__':
    all_colours = read_all_colours()
    xyz2abc = read_xyz_file()
    …
    all_kites, kite2colour = read_kite_info()

I can switch from vim and run this up in IDLE, checking and sampling the global variables to get the parsers debugged, and also start writing interactive code snippets for the next phase: cross checks.

Consistency checking


def check_consistency(all_colours, xyz2abc, …, all_kites, kite2colour):
    'Take all the parsed data and apply consistency checks'
    # maybe all the colours in kite2colour must be a member of all_colours
    assert set(kite2colour.values()).issubset(set(all_colours)), \
        'Some kite has an unknown colour'
    # Further checks with associated assertions
    …

The individual consistency checks can be developed in the editor and the whole script re-run, or just the updated check_consistency function can be cut-n-pasted into IDLE. Alternatively, once the parsing is done, all the data is available globally in IDLE, so I also just mess with the data interactively, being careful not to change the parsed data, develop checks there, then paste them into the check_consistency function in the editor.

When I'm done, I add a global call to check_consistency(...) after all the parsing steps.
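A runnable sketch of such a checker, on made-up sample data, might look like the following. The second check (every kite has a colour) is a hypothetical addition for illustration, and the argument list is cut down to three names.

```python
def check_consistency(all_colours, all_kites, kite2colour):
    'Cross-file checks on already-parsed data (sample checks only)'
    # Every colour used by a kite must be a known colour.
    unknown = set(kite2colour.values()) - set(all_colours)
    assert not unknown, 'Kites with unknown colours: %s' % sorted(unknown)
    # Hypothetical extra check: every kite must have a colour.
    uncoloured = set(all_kites) - set(kite2colour)
    assert not uncoloured, 'Kites with no colour: %s' % sorted(uncoloured)

all_colours = ['red', 'blue', 'green']
all_kites = ['delta', 'box']
kite2colour = {'delta': 'red', 'box': 'blue'}
check_consistency(all_colours, all_kites, kite2colour)  # passes silently

# A bad mapping trips the first assertion.
try:
    check_consistency(all_colours, all_kites, {'delta': 'red', 'box': 'mauve'})
except AssertionError as err:
    message = str(err)
    print(message)
```

Using a set difference rather than issubset lets the assertion message name the offending values, which saves a round trip to the REPL when a check fails.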

Munging

This is the part where I take the parsed and checked input data and apply the business logic to create the derived data in a form that makes it easy to output. It may be split into several functions called from the global context, or just one. If there is intermediate data generated that I think may be “key” then I like to return that to the global namespace too.

Once the data munging function is created and the program run in IDLE, intermediate and final data can be inspected and further tests done quite easily, as most of what you need is available in the IDLE REPL when the program ends.
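To make the shape of a munge function concrete, here is a small sketch. The business rule (keep only kites in the most common colours) and the `top_n` parameter are invented purely for illustration; the point is that the intermediate `best_colours` is returned alongside the final data so it can be inspected afterwards.

```python
from collections import Counter

def munge(all_kites, kite2colour, top_n=2):
    '''Hypothetical business rule: keep only kites whose colour
    is one of the top_n most common colours.'''
    colour_count = Counter(kite2colour[kite] for kite in all_kites)
    # Intermediate "key" data, returned so it can be inspected later.
    best_colours = [colour for colour, _ in colour_count.most_common(top_n)]
    our_kites = [kite for kite in all_kites
                 if kite2colour[kite] in best_colours]
    our_kite_colours = {kite: kite2colour[kite] for kite in our_kites}
    return our_kites, our_kite_colours, best_colours

# Made-up sample data: red x3, blue x2, green x1.
kite2colour = {'delta': 'red', 'box': 'blue', 'diamond': 'red',
               'sled': 'green', 'parafoil': 'blue', 'rokkaku': 'red'}
all_kites = list(kite2colour)

our_kites, our_kite_colours, best_colours = munge(all_kites, kite2colour)
print(best_colours)
print(our_kites)
```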

Output

A straightforward call to one or more functions that do only simple data manipulations, if necessary, to dump the derived data in the correct output format.
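One possible shape for such an output function is sketched below. It assumes the output is a plain file of SQL INSERT statements; the table name `kites`, its columns, and the `sql_quote` helper are hypothetical, and a real database load may well look different.

```python
import os
import tempfile

def sql_quote(text):
    'Quote a string as an SQL literal by doubling embedded single quotes'
    return "'" + text.replace("'", "''") + "'"

def write_db(our_kites, our_kite_colours, fname='business_db.sql'):
    'Dump derived data as SQL INSERT statements (table name is hypothetical)'
    with open(fname, 'w') as f:
        for kite in our_kites:
            f.write('INSERT INTO kites (name, colour) VALUES (%s, %s);\n'
                    % (sql_quote(kite), sql_quote(our_kite_colours[kite])))

# Usage: dump to a temporary directory and read the result back.
with tempfile.TemporaryDirectory() as tmpdir:
    out_name = os.path.join(tmpdir, 'business_db.sql')
    write_db(['delta', "o'hara"], {'delta': 'red', "o'hara": 'blue'}, out_name)
    with open(out_name) as f:
        dumped = f.read()
print(dumped)
```

The function only formats and writes; any decision about which kites to include has already been made in the munge step.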

First draft program structure


## namedtuple definitions are great for program structure
## if you can keep to short field names
…

## Parser functions
…

## Checker functions
…

## Business logic functions
…

## Output functions
def write_db(our_kites, our_kite_colours, fname='business_db.sql'):
    …

## Global environment

if __name__ == '__main__':
    ## Parse
    all_colours = read_all_colours()
    xyz2abc = read_xyz_file()
    …
    all_kites, kite2colour = read_kite_info()
    ## Check
    check_consistency(all_colours, xyz2abc, …, all_kites, kite2colour)
    ## Munge
    our_kites, our_kite_colours, best_colours = \
        munge(all_colours, xyz2abc, …, all_kites, kite2colour)
    ## Output
    write_db(our_kites, our_kite_colours)

Testing

I then have a first script that I can slot into my design flow: I have other programs and scripts that generate its inputs, and tools ready to use the database it creates. I can test it with real-world data, historical data, pre-computed data, and data generated by alternative tools that do the same thing. Others can test it too, and from their feedback I have to debug and maintain it.

Debugging/Maintenance

I have examples of what the parsers are set up to parse in comments in their read_* functions. This helps when a bug is due to some extra complication in the structure of an input file. Usually I have to first run the script in IDLE, then cut-n-paste the parser function guts onto the command line to debug parser failures. I'm used to it. It works for me. Depending on the nature of the problem reported, you can run the script using the problem data and then interrogate the global data on the IDLE command line to trace errors to their cause.

If I've used namedtuples then that makes getting back into the code much easier. Listing just one element of a compound homogeneous variable, such as a list/set/dict of items, then reminds you of what data you have and its structure.
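For example, with a namedtuple the printed form of a single element carries its own field names. The `Kite` type and its fields here are hypothetical, chosen only to show the effect.

```python
from collections import namedtuple

# Hypothetical record type with short field names.
Kite = namedtuple('Kite', 'name colour span')

all_kites = [Kite('delta', 'red', 1.8), Kite('box', 'blue', 1.2)]

# Looking at just one element shows the structure of the whole list.
first = all_kites[0]
print(first)          # Kite(name='delta', colour='red', span=1.8)
print(first.colour)   # red
```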

Because my source is under version control, I insert print statements for debug purposes and stick '0/0' on a line to stop the interpreter as well as set breakpoints (horses for courses and all that).

Production

For that I add a better command-line interface and documentation and ensure sufficient tests are under version control too.
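A minimal sketch of what that command-line interface could look like, using argparse from the standard library. The option names are assumptions; the defaults simply reuse the file names from the parser and output function signatures above.

```python
import argparse

def build_parser():
    'Command-line options that override the default file names'
    parser = argparse.ArgumentParser(
        description='Parse inputs, check, munge, then dump to a database file.')
    # Option names are hypothetical; defaults match the function defaults.
    parser.add_argument('--colours', default='all_colours.lst',
                        help='colour list input (default: %(default)s)')
    parser.add_argument('--xyz', default='xyz_file.lst',
                        help='xyz input (default: %(default)s)')
    parser.add_argument('--out', default='business_db.sql',
                        help='output file (default: %(default)s)')
    return parser

# With no arguments the defaults apply, as in the original script.
args = build_parser().parse_args([])
print(vars(args))

# A single option can be overridden for a test run.
override = build_parser().parse_args(['--out', 'test_db.sql'])
print(override.out)
```

The parsed values would then be passed as the fname arguments of the read_* and write_db calls in the `if __name__ == '__main__':` block.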

End bit.

This is written because I've just been working on another way to do a task, wrote the new script in the same way as I wrote the last, and realized that I had converged on a structure that allowed the previous (and other) scripts to be easily maintained.

I regularly get to “eat my own dog food”, i.e. maintain what I write over the years, and have to explain my code. 'Parse inputs → Munge → Dump to output(s)' is a recurring theme so I guess it is no surprise that I seem to have converged on a development template. Maybe it's time to apply some heat to this simulated annealing :-)






2 comments:

  1. That 0/0 is much better written as `__import__("pdb").set_trace()`. Put it into some abbreviation to autoexpand, and be cool. :-)

    If you really want a short one, just use l (a lowercase letter L). It is almost certainly a NameError. You shouldn't use one-letter variables in complex systems, and surely not letter l. :-D
