Mainly Tech projects on Python and Electronic Design Automation.

Sunday, February 15, 2015

Find repeated blocks of characters in a string: char_rep().

I saw this Stack Overflow (SO) question which I thought was asking how to show the repeated characters in a string.

I immediately thought about my earlier code from post "Find repeated blocks of lines in a text file: The block_lines tool." and immediately thought of  modifying it to get a similar result i.e. using the in-built regular expression engine to find the repetitions.

in function char_rep, a regular expression finds a section of repeated characters in a string:

We find all those repeated sections in rightward order through the string as  well as detecting intervening non-repeated blocks of characters and format them appropriately and extend the output string.

The _format_* functions allow the output to be expressed as a either a regular expression that matches the input text; or alternatively as output like that asked for in the SO question.

The code

#!/bin/env python
Extract repeated blocks of chars from a string
from __future__ import print_function
import re
__all__ = ['char_rep']
__version__ = '0.0.1'
DEBUG = False
def _format_re(bchars, brep):
    'annotated in regexp-like format, repeated block of chars'
    bchars = re.escape(bchars)
    if brep > 1:
        return '(%s){%i}' % (bchars, brep)
        return bchars
def _format_so(bchars, brep):
    'brep(bchars) format for Stack Overflow:'
    return '%i(%s)' % (brep, bchars) if bchars else ''
def char_rep(txt, debug=False, _format=_format_re):
    'return repeated blocks of chars in text with their repeat count'
    output,  lastend = [], 0
    for match in re.finditer(r"""(?ms)(?P<repeat>(?P<chars>.+?)(?:(?P=chars))+)""", txt):
        if debug: print('\n  lastend, txt[lastend:] = ',lastend, repr(txt[lastend:]))
        beginpos, endpos = match.span()
        repeat, chars ='repeat'),'chars')
        if lastend < beginpos:  # Skipped a non-repeated, 'single' block
            output.append(_format(txt[lastend:beginpos], 1))
        output.append(_format(chars, repeat.count(chars)))
        lastend = endpos
        if debug:
            print('  match.span() = ', match.span())
            print("['repeat'] = ", repeat)
            print("['chars'] = ", chars)
            print("  repeatcount = ", repeat.count(chars))
            print("  output so far = %r" % ''.join(output))
    output = ''.join(output) + _format(txt[lastend:], 1)
    if debug: print("return %r" % output)
    return output
if __name__ == '__main__':
    for n, txt in enumerate('''
    '''.strip().split(), 1):
        output_re = char_rep(txt, debug=DEBUG, _format=_format_re)
        assert re.match('^%s$' % output_re, txt), "Output is not an RE to match all of input string!"
        output_so = char_rep(txt, debug=DEBUG, _format=_format_so)
        print("String %i: %r\n  re_formatted = %r\n  so_formatted = %r\n"
               % (n, txt, output_re, output_so))

Code output

String 1: 'ULULUL'
  re_formatted = '(UL){3}'
  so_formatted = '3(UL)'
  re_formatted = '(LDD){4}L(X){2}(LCC){3}'
  so_formatted = '4(LDD)1(L)2(X)3(LCC)'
  re_formatted = '(LDD){4}L(X){2}(LCC){3}C'
  so_formatted = '4(LDD)1(L)2(X)3(LCC)1(C)'
  re_formatted = '(D){2}(LDD){9}L'
  so_formatted = '2(D)9(LDD)1(L)'
  re_formatted = '(LDD){10}'
  so_formatted = '10(LDD)'
  re_formatted = '(L){4}(DLLLL){2}D'
  so_formatted = '4(L)2(DLLLL)1(D)'

You should note that we are at the mercy of the regular expression so that string 6 is actually 'LLLLD' repeated three times, but the program matches it differently. I guess I could have several repeat finding regular expressions and use the one that gives the most succinct value.

Debug mode

set DEBUG to True or call char_rep(txt, debug=True) in the code and you get a listing of the intermediate values from function char_rep.



Subscribe Now: google

Add to Google Reader or Homepage

Go deh too!

Blog Archive