Thursday, February 07, 2013

Parse inputs → Munge → Dump


I've been evolving a script and thought I would share one aspect of the design that helped in development and maintenance.

The Python script takes around five or six machine generated files as input, manipulates, checks and derives new data, then stuffs the new data into a database.

Parse

My input data files all have fixed path names, so I write one parser per file; each takes a file name argument whose default is the correct name. This leads to functions looking like:

def read_all_colours(fname='all_colours.lst'):
    '''{A description of what is held in the file, 
    how it is organised, and what is to be extracted}'''

    # Parsing code
    with open(fname) as f:
        # Depending on the complexity of the task
        # parsing might include regexps (using kodos)
        # or not.
        …
        # Only elementary data checks on a single file's
        # data are done here.
        …
    return all_colours

def read_xyz_file(fname='xyz_file.lst'):
    …
    # Additionally, for dictionaries I adopt a naming
    # scheme using '2' for 'to', so a mapping
    # of kites to colours would be called kite2colour.

    return xyz2abc

I try and preserve those names returned from the read_* functions and call them from the top level. (Those read_* functions may call other functions of course but for the sake of this blog entry, I've simplified things).
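As an illustration of the shape described above, here is a minimal sketch of one such parser. The line format ("kite_name: colour"), the default file name kite_info.lst and the regexp are all assumptions made up for this example; they are not the real input format.

```python
import re

# Hypothetical line format, assumed for illustration only: "kite_name: colour"
KITE_LINE = re.compile(r'^\s*(?P<kite>\w+)\s*:\s*(?P<colour>\w+)\s*$')

def read_kite_info(fname='kite_info.lst'):
    '''Parse lines such as "delta: red" (assumed format).

    Returns (all_kites, kite2colour): a list of kite names in file order
    and a dict mapping each kite name to its colour.'''
    all_kites, kite2colour = [], {}
    with open(fname) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip() or line.lstrip().startswith('#'):
                continue  # skip blank lines and comment lines
            m = KITE_LINE.match(line)
            # Only elementary checks on this one file's data are done here
            assert m, '%s:%i: cannot parse %r' % (fname, lineno, line)
            kite = m.group('kite')
            assert kite not in kite2colour, \
                '%s:%i: duplicate kite %r' % (fname, lineno, kite)
            all_kites.append(kite)
            kite2colour[kite] = m.group('colour')
    return all_kites, kite2colour
```

Passing a different file name, for example read_kite_info('problem_case.lst'), is then enough to point the same parser at a troublesome input.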

My global calls might look like:

if __name__ == '__main__':
    all_colours = read_all_colours()
    xyz2abc = read_xyz_file()
    …
    all_kites, kite2colour = read_kite_info()

I can switch from vim and run this in IDLE, checking and sampling the global variables to get the parsers debugged, and also start writing interactive code snippets for the next phase: cross checks.

Consistency checking


def check_consistency(all_colours, xyz2abc, …, all_kites, kite2colour):
    'Take all the parsed data and apply consistency checks'
    # maybe all the colours in kite2colour must be a member of all_colours
    assert set(kite2colour.values()).issubset(set(all_colours)), \
        'Some kite has an unknown colour'
    # Further checks with associated assertions
    …

The individual consistency checks can be developed in the editor, after which either the whole script is re-run or just the updated check_consistency function is cut-n-pasted into IDLE. Alternatively, once the parsing is done all the data is available globally in IDLE, so I also just explore the data interactively, being careful not to change the parsed data, develop checks there, and then paste them into the check_consistency function in the editor.

When I'm done, I add a global call to check_consistency(...) after all the parsing steps.
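A runnable sketch of such a checker follows, cut down to three arguments. The sample data and the second check are invented for illustration; only the first check comes from the outline above.

```python
def check_consistency(all_colours, all_kites, kite2colour):
    'Apply cross-file consistency checks to the parsed data'
    # All the colours in kite2colour must be members of all_colours
    unknown = set(kite2colour.values()) - set(all_colours)
    assert not unknown, 'Unknown kite colour(s): %s' % sorted(unknown)
    # Hypothetical extra check: every kite must have a colour entry
    missing = set(all_kites) - set(kite2colour)
    assert not missing, 'Kite(s) without a colour: %s' % sorted(missing)

# Made-up sample data standing in for the parser results
all_colours = ['red', 'green', 'blue']
all_kites = ['delta', 'box']
kite2colour = {'delta': 'red', 'box': 'blue'}
check_consistency(all_colours, all_kites, kite2colour)  # passes silently
```

Reporting the offending values in the assertion message saves a round trip to the REPL when a check fails. Note that assert statements are skipped when Python runs with -O, so this style suits scripts that are never run optimised.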

Munging

This is the part where I take the parsed & checked input data and apply the business logic to create the derived data in a form that makes it easy to output. It may be split into several functions called from the global context or just one. If there is intermediate data generated that I think may be “key”, then I like to return that to the global namespace too.

Once the data munging function is written and the program is run in IDLE, intermediate and final data can be inspected and further tests done quite easily, as most of what you need is still available in the IDLE REPL when the program ends.
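To make the idea of returning "key" intermediate data concrete, here is a sketch of a munge function. The business rule (keep the kites whose colour is the most common one) is entirely made up for this example; only the shape of the return value follows the structure described here.

```python
from collections import Counter

def munge(all_colours, all_kites, kite2colour):
    '''Hypothetical business logic: keep the kites whose colour is the
    most common one. Returns the derived data plus the intermediate
    best_colours so it can be inspected from the global namespace.'''
    counts = Counter(kite2colour[k] for k in all_kites)
    top = max(counts.values()) if counts else None
    # Intermediate "key" data, kept in the order given by all_colours
    best_colours = [c for c in all_colours if counts.get(c, 0) == top]
    our_kites = [k for k in all_kites if kite2colour[k] in best_colours]
    our_kite_colours = {k: kite2colour[k] for k in our_kites}
    return our_kites, our_kite_colours, best_colours
```

Because best_colours comes back alongside the final data, it is still there to poke at in the REPL after the script has finished.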

Output

A straightforward call to one or more functions that do only simple data manipulation, if any is needed, to dump the derived data in the correct output format.
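A minimal sketch of such an output function, assuming the database is loaded from a file of SQL statements. The table name, columns and quoting helper are hypothetical; a real database may well want a proper client library instead of hand-written SQL text.

```python
def write_db(our_kites, our_kite_colours, fname='business_db.sql'):
    'Dump the derived data as SQL statements (hypothetical schema)'
    def quote(text):
        # Minimal SQL string-literal quoting: double any single quote
        return "'" + str(text).replace("'", "''") + "'"

    with open(fname, 'w') as f:
        f.write('CREATE TABLE kites (name TEXT PRIMARY KEY, colour TEXT);\n')
        for kite in our_kites:
            f.write('INSERT INTO kites (name, colour) VALUES (%s, %s);\n'
                    % (quote(kite), quote(our_kite_colours[kite])))
```

The function only formats and writes; anything resembling a decision about what goes into the database belongs in the munging step.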

First draft program structure


## namedtuple definitions are great for program structure
## if you can keep to short field names
…

## Parser functions
…

## Checker functions
…

## Business logic functions
…

## Output functions
def write_db(our_kites, our_kite_colours, fname='business_db.sql'):
    …

## Global environment

if __name__ == '__main__':
    ## Parse
    all_colours = read_all_colours()
    xyz2abc = read_xyz_file()
    …
    all_kites, kite2colour = read_kite_info()
    ## Check
    check_consistency(all_colours, xyz2abc, …, all_kites, kite2colour)
    ## Munge
    our_kites, our_kite_colours, best_colours = \
        munge(all_colours, xyz2abc, …, all_kites, kite2colour)
    ## Output
    write_db(our_kites, our_kite_colours)

Testing

I then have a first script that I can slot into my design flow: I have other programs and scripts that generate its inputs, and tools ready to use the database it creates. I can test it with real-world data, historical data, pre-computed data, and data generated by alternative tools that do the same thing. Others can test it too, and from their feedback I debug and maintain it.

Debugging/Maintenance

I have examples of what the parsers are set up to parse in comments in their read_* functions. This helps when a bug is due to some extra complication in the structure of an input file. Usually I have to first run the script in IDLE, then cut-n-paste the guts of the parser function onto the command line to debug parser failures. I'm used to it. It works for me. Guided by the nature of the problem reported, you can run the script on the problem data and then interrogate the global data on the IDLE command line to trace errors to their cause.

If I've used namedtuples then that makes getting back into the code much easier. Listing just one element of a compound homogeneous variable, such as a list/set/dict of items, then reminds you of what data you have and its structure.
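A small example of why this works; the Kite record and its fields are invented for illustration.

```python
from collections import namedtuple

# Hypothetical record type; short field names keep the output readable
Kite = namedtuple('Kite', 'name colour span')

all_kites = [Kite('delta', 'red', 1.2), Kite('box', 'blue', 0.9)]

# Listing one element shows both the data and its structure
print(all_kites[0])   # prints: Kite(name='delta', colour='red', span=1.2)
```

With plain tuples the same line would print ('delta', 'red', 1.2), leaving you to work out what each position means.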

Because my source is under version control, I insert print statements for debug purposes and stick '0/0' on a line to stop the interpreter as well as set breakpoints (horses for courses and all that).

Production

For that I add a better command-line interface and documentation and ensure sufficient tests are under version control too.
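One way the better command-line interface might look, sketched with the standard argparse module. The option names are assumptions chosen for this example; the defaults reuse the file names from the parser and output functions above, so running with no arguments behaves like the development version.

```python
import argparse

def parse_args(argv=None):
    'Command-line options; defaults are the usual fixed file names'
    parser = argparse.ArgumentParser(
        description='Parse inputs, munge, then dump to a database file')
    parser.add_argument('--colours', default='all_colours.lst',
                        help='colour list file (default: %(default)s)')
    parser.add_argument('--xyz', default='xyz_file.lst',
                        help='xyz input file (default: %(default)s)')
    parser.add_argument('--out', default='business_db.sql',
                        help='output file (default: %(default)s)')
    # Hypothetical flag: stop after the consistency checks
    parser.add_argument('--check-only', action='store_true',
                        help='run the parsers and checks, write nothing')
    return parser.parse_args(argv)
```

The parsed values can then be handed straight to the fname arguments, for example read_all_colours(args.colours) and write_db(our_kites, our_kite_colours, fname=args.out).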

End bit.

I wrote this because I have just been working on another way to do a task, wrote the new script in the same way as I wrote the last, and realized that I had converged on a structure that allowed the previous (and other) scripts to be easily maintained.

I regularly get to “eat my own dog food”, i.e. maintain what I write over the years, and have to explain my code. 'Parse inputs → Munge → Dump to output(s)' is a recurring theme, so I guess it is no surprise that I seem to have converged on a development template. Maybe it's time to apply some heat to this simulated annealing :-)






Tuesday, February 05, 2013

NAARP - Gangsta style. (No Python whatsoever)

...Or what happens if you put this blog post through the Gizoogle on-line gangsta translater:


No Attempt At Rigorous Proof


I had flu n' so read/skimmed a shitload mo' than usual dis week n' came across two conference vizzlez dat basically holla'd dat there is straight-up lil engineerin n' science up in tha books, papers, n' conference proceedingz of tha software industry up there.
Many hyped and/or well-read authors sheezy lil and no scientific and engineerin rigour up in tha formulation of they conclusions. They is basically sayin "Do dis 'cos I holla'd you would git mo' betta thangs up in dis biatch".

So nuff B-ta-tha-L-O-Gizzay posts is exactly tha same fo' realz. After readin Be neat, n' tha rest will follow I was moved ta make just dat point up in tha comments. It aint nuthin but not dat I don't smoke and disagree wit what tha fuck they is saying, itz just dat I came ta realize what tha fuck was missin from tha B-ta-tha-L-O-Gizzay post.

Now mah crazy ass freestylin long commentz of dis kind of muthafuckin thang ta blogs:




Hmm. Wherez yo' evidence n' scientific method biatch? What you say may well be true but you have no science ta back it up. Yo ass might as well be harpin on on some gangbangin' flat earth!

  • Do you have a impartial n' reproducible tool ta determine neatnizz fo' example?
  • Do others perception of neatnizz coincizzle wit yours?
  • Any emperical statistics ta back up yo' statement?
  • Where is tha graph of neatnizz vs "cost" ta back up what tha fuck yo ass is saying?

Yo ass might be muthafuckin right but you have not shown scientifically dat yo ass is right. Yo ass could be pushin snake oil!


... Is not snappy n' aint likely ta grow so What I be thinkin is needed be a anagram.

Naarp!

If you be thinkin dat a post be all bout tryin ta push you tha dopest way ta do somethang when programmin yo, but tha post makes no attempt ta sheezy you reproducible evidence supportin they claim then just slap a

    NAARP!

In tha comments, where NAARP standz for:


No
Attempt
At
Rigorous
Proof

If spoken, Naarp is pronounced as if spoken by a hood idiot/ghetto bumpkin. I aint talkin' bout chicken n' gravy biatch. I was thankin especially of tha big-ass playa up in Hot fuzz whoz ass says Yaarp fo' yes, n' of Semen Pegg thankin up Naarp when impersonatin his ass ta mean no.

Hopefully it can sheezy dat gangstas can recognise tha lack of scientific method up in tha industry n' maybe nudge authors ta be aware dat they crew is now aware of dis too.

- Peace.

Monday, February 04, 2013

No Attempt At Rigorous Proof

I had flu and so read/skimmed a lot more than usual this week and came across two conference videos that basically said that there is very little engineering and science in the books, papers, and conference proceedings of the software industry out there.
Many famous and/or well-read authors show little or no scientific or engineering rigour in the formulation of their conclusions. They are basically saying "Do this 'cos I said you would get better results".

So many blog posts are exactly the same. After reading Be neat, and the rest will follow I was moved to make just that point in the comments. It's not that I agree or disagree with what they are saying; it's just that I came to realize what was missing from the blog post.

Now, me writing long comments of this kind on blogs:


Hmm. Where's your evidence and scientific method? What you say may well be true but you have no science to back it up. You might as well be harping on about a flat earth!
  • Do you have an impartial and reproducible tool to determine neatness for example?
  • Do others' perceptions of neatness coincide with yours?
  • Any empirical statistics to back up your statement?
  • Where is the graph of neatness vs "cost" to back up what you are saying?
You might be right but you have not shown scientifically that you are right. You could be selling snake oil!

... is not snappy and is not likely to grow, so what I think is needed is an acronym.

Naarp!

If you think that a post is all about trying to sell you the best way to do something when programming, but the post makes no attempt to show you reproducible evidence supporting its claim, then just slap a

    NAARP!

In the comments, where NAARP stands for:

No
Attempt
At
Rigorous
Proof


If spoken, Naarp is pronounced as if spoken by a village idiot/country bumpkin. I was thinking especially of the large guy in Hot Fuzz who says Yaarp for yes, and of Simon Pegg thinking up Naarp when impersonating him to mean no.

Hopefully it can show that people can recognise the lack of scientific method in the industry and maybe nudge authors to be aware that their audience is now aware of this too.

- Peace.