I am doing some accelerated testing of a microcontroller that runs diagnostic software in a harsh environment and reports its state, and any firmware-determinable errors it detects, back to a controlling PC via a USB port.
Ultimately I decode from the data stream a series of appropriately named records interspersed with a few corrupted bytes from the remains of unrecoverable records.
When trying to assess what the microcontroller was doing (as opposed to what it should be doing in normal operation), I found myself at one stage with a list of received records that simplify to something like this:
RESET1
RESET3
RESET1
RESET3
RESET1
RESET3
ERROR3
DATUM
ERROR3
DATUM
ERROR3
DATUM
CHANGE
RESET1
CHANGE
ERROR1
ERROR1
ERROR3
RESET1
ERROR3
RESET1
ERROR3
RESET1
ERROR1
STAGED
ERROR1
STAGED
DATUM
ERROR1
RESET1
DATUM
RESET1
ERROR2
DATUM
ERROR1
RESET1
DATUM
RESET1
ERROR2
DATUM
ERROR1
RESET1
DATUM
RESET1
ERROR2
ERROR2
RESET3
ERROR2
RESET3
RESET2
RESET1
RESET2
RESET1
ERROR2
RESET3
DATUM
CHANGE
RESET2
RESET1
RESET2
DATUM
CHANGE
DATUM
CHANGE
DATUM
CHANGE
RESET1
RESET1
ERROR1
ERROR3
ERROR2
RESET2
DATUM
RESET2
RESET3
STAGED
ERROR1
ERROR3
ERROR2
RESET2
DATUM
RESET2
RESET3
STAGED
ERROR1
ERROR3
ERROR2
RESET2
DATUM
RESET2
RESET3
STAGED
STAGED
CHANGE
ERROR3
ERROR2
RESET3
CHANGE
ERROR1
ERROR1
Meaning from chaos
When discussing with colleagues what was wrong with the above record sequence and what could be the cause, we spent a lot of time looking for repeated patterns in the series of records by eye.
I searched for, but could not find, an existing tool to find repeated blocks of whole lines in a file. Stack Overflow gave me no more than my own suggestions, so I decided to develop my own utility.
Fun with Regexps
My idea was to investigate if some regexp could extract repeated blocks of lines. If it's regexp exploration, then it is out with Kodos.
Cut'n'paste the example records into the Search string box, then I can develop the regexp in the top window.
The essence of this first regexp of:
(?ms)(?P<repeat>(?P<lines>^.*?$)(?:\n(?P=lines))+)
is read in sections from the inside out as:
^.*?$ # Match a series of whole lines, non-greedily due to the ?
(?P<lines>^.*?$) # Give the matched lines the name "lines"
(?:\n(?P=lines))+ # Match repetitions of '\n' followed by the group named "lines"
(?P<repeat>(?P<lines>^.*?$)(?:\n(?P=lines))+) # call the repeated block of lines "repeat"
The Kodos Group window shows the first match, where the two "lines" RESET1 and RESET3 are repeated three times to give the "repeat" group.
Calling findall with the above regexp gave me the heart of my script.
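A minimal sketch of how that regexp can drive a script, assuming Python's re module (this is not the utility's actual code). It uses finditer rather than findall so that the named groups can be read from each match. The function name find_repeats and the way the repeat count is recovered from the string lengths are assumptions made for illustration.

```python
import re

# The regexp from the post: a block of whole lines followed by
# one or more newline-separated copies of itself.
REPEATED_BLOCK = re.compile(
    r'(?ms)(?P<repeat>(?P<lines>^.*?$)(?:\n(?P=lines))+)')


def find_repeats(text):
    """Yield (block_lines, count) for each repeated block, in order."""
    for m in REPEATED_BLOCK.finditer(text):
        block = m.group('lines')
        repeat = m.group('repeat')
        # 'repeat' is 'block' joined to itself by '\n' count times,
        # so the count follows from the two lengths.
        count = (len(repeat) + 1) // (len(block) + 1)
        yield block.split('\n'), count


records = """RESET1
RESET3
RESET1
RESET3
RESET1
RESET3
ERROR3
DATUM
ERROR3
DATUM
ERROR3
DATUM
CHANGE"""

for block, count in find_repeats(records):
    print(count, block)
# 3 ['RESET1', 'RESET3']
# 3 ['ERROR3', 'DATUM']
```

Note that the backreference is not anchored to a line end, so a block followed by a line that merely starts with the same text (RESET1 followed by RESET10, say) would also match. Appending $ after the repetition is one way to guard against that.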
The block_lines utility
The final utility was in three parts:
- block_gen, to create test data of records, with optional debug information showing the repeats in the generated blocks.
- block_rep, to extract and show the repeated blocks.
- block_exp, to expand output from block_rep so you can round-trip the data.
block_gen output
In this sample, generated in debug mode, everything matching ' # .*' at the end of any line is a debug comment showing what is repeated (the lines between curly braces) and how many times.
For example in this:
The comment of ' # 3 {}' means that the first line 'RESET1' is repeated three times.
The fourth line starts a block that occurs only 1 time and ends at line 6 because of the closing curly brace.
block_rep output
Although block_gen can show what repetitions were used when generating record sequences, block_rep is likely to find its own set of repeated blocks, as the generator might generate a record repeated twice and then generate the same record repeated three more times. block_rep may treat that as one block repeated five times.
block_rep takes a file of lines and, in debug mode, strips off anything after the '#' on each input line before looking for repeated blocks.
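The debug-mode stripping can be sketched as a single substitution, assuming the ' # .*' comment form described for block_gen. The function name strip_debug is hypothetical, and the sample lines only reuse the ' # 3 {}' style of annotation mentioned earlier.

```python
import re


def strip_debug(text):
    """Remove a trailing ' #...' debug comment from every line of text."""
    # (?m) lets $ match at each line end; '.' does not cross newlines here.
    return re.sub(r'(?m) #.*$', '', text)


annotated = "RESET1 # 3 {}\nDATUM\nERROR1 # 2 {}"
print(strip_debug(annotated))
# RESET1
# DATUM
# ERROR1
```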
Note that in this form the repetition information is on the left, and although the repeat count of each block is present, only the block that is repeated is shown, not the expanded repetitions.
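To make the round trip concrete, here is a rough sketch of a rep-style condenser and a matching expander. The output layout is an assumption, not the utility's real format: the count sits in a fixed-width field on the first line of each block, continuation lines leave the field blank, and runs of unrepeated lines become a block with a count of 1. The names find_blocks, format_rep and expand are hypothetical, the regexp has an extra $ added so that a match always ends on a line boundary, and the sketch assumes no blank input lines and counts that fit in the field.

```python
import re

# The post's regexp with a trailing $ added (an assumption of this sketch)
# so a repeated block can never end part-way through a line.
REPEATED_BLOCK = re.compile(
    r'(?ms)(?P<repeat>(?P<lines>^.*?$)(?:\n(?P=lines))+$)')


def find_blocks(text):
    """Return [(count, block_lines), ...] covering every line of text."""
    blocks, pos = [], 0
    for m in REPEATED_BLOCK.finditer(text):
        # Lines between matches occur once; group them as one block.
        single = [ln for ln in text[pos:m.start()].split('\n') if ln]
        if single:
            blocks.append((1, single))
        block, repeat = m.group('lines'), m.group('repeat')
        blocks.append(((len(repeat) + 1) // (len(block) + 1),
                       block.split('\n')))
        pos = m.end()
    single = [ln for ln in text[pos:].split('\n') if ln]
    if single:
        blocks.append((1, single))
    return blocks


def format_rep(blocks, width=4):
    """Count in a width-wide field on a block's first line, blank after."""
    out = []
    for count, lines in blocks:
        out.append(f'{count:{width}d} {lines[0]}')
        out.extend(' ' * width + ' ' + line for line in lines[1:])
    return '\n'.join(out)


def expand(rep_text, width=4):
    """Invert format_rep, reproducing the original lines."""
    out, count, block = [], 0, []
    for row in rep_text.split('\n'):
        field, line = row[:width], row[width + 1:]
        if field.strip():              # a count starts a new block
            out.extend(block * count)
            count, block = int(field), [line]
        else:                          # continuation of the current block
            block.append(line)
    out.extend(block * count)
    return '\n'.join(out)


records = ("RESET1\nRESET3\nRESET1\nRESET3\n"
           "DATUM\nCHANGE\nCHANGE\nCHANGE\nERROR1")
condensed = format_rep(find_blocks(records))
print(condensed)
#    2 RESET1
#      RESET3
#    1 DATUM
#    3 CHANGE
#    1 ERROR1
assert expand(condensed) == records
```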
Docopt
I used the docopt Python library for the argument handling of the final utility:
PS C:\Users\Paddy\Google Drive\Code> python .\block_lines -h
Extract repeated blocks of lines information from a file

Usage:
  block_lines (rep|exp) [options] [INPUT-FILE [OUTPUT-FILE]]
  block_lines gen [options] [OUTPUT-FILE]
  block_lines (-h|--help|--version)

Modes:
  rep               Find the repeated blocks of lines.
  exp               Expand the output from rep.
  gen               (Random gen of output suitable for example rep mode processing).

Options:
  -h --help         Show this screen.
  -v --version      Show version.
  -w N, --width N   Digits in field for repetition count [default: 4].
  -d --debug        Debug:
                      * In rep mode remove anything after '#' on all input lines.
                      * In gen mode annotate with repeat counts after '#'.

I/O Defaults:
  INPUT-FILE        Defaults to stdin.
  OUTPUT-FILE       Defaults to stdout.
- Note that I did not have to think too hard about an algorithm for extracting repeated blocks - I make the regexp library take the strain.
- It is fast enough for my purposes - I do slurp the whole file into memory at once for example, but that is fine for my real world data so no (over) optimization is needed.
- The utility proved invaluable in being able to discuss and come to a consensus with colleagues about some experimental data we had. We could condense tens of thousands of lines, without bias, and reason about the result.
- Real-world data might need pre-conditioning before the utility can be usefully applied. For example, fixed fields of varying value, such as indices, timestamps, and maybe usernames, might need to be deleted or replaced with a fixed dummy value.
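As an illustration of the pre-conditioning in the last point, here is a small sketch that deletes a leading index and replaces timestamps with a fixed dummy value, so that otherwise identical records compare equal. The log layout ("<index> <ISO timestamp> <record>"), the <TIME> placeholder and the function name precondition are all invented for the example; real data would need its own patterns.

```python
import re

# Hypothetical log layout: "<index> <ISO timestamp> <record>"
INDEX = re.compile(r'^\d+[ \t]+', re.MULTILINE)
TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}')


def precondition(text):
    """Drop leading indices and mask timestamps with a fixed dummy value."""
    text = INDEX.sub('', text)
    return TIMESTAMP.sub('<TIME>', text)


raw = ("17 2013-05-01T10:00:01 RESET1\n"
       "18 2013-05-01T10:00:02 RESET1")
print(precondition(raw))
# <TIME> RESET1
# <TIME> RESET1
```

After this step the two lines are identical, so a repeated-block search can treat them as one record repeated twice.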
Show me the code!
License: It's mine, mine, mine, do you hear me!