Sunday, February 15, 2015

Find repeated blocks of characters in a string: char_rep().


I saw this Stack Overflow (SO) question which I thought was asking how to show the repeated characters in a string.

I immediately thought about my earlier code from post "Find repeated blocks of lines in a text file: The block_lines tool." and immediately thought of  modifying it to get a similar result i.e. using the in-built regular expression engine to find the repetitions.

in function char_rep, a regular expression finds a section of repeated characters in a string:
    r"""(?ms)(?P<repeat>(?P<chars>.+?)(?:(?P=chars))+)"""

We find all those repeated sections in rightward order through the string as  well as detecting intervening non-repeated blocks of characters and format them appropriately and extend the output string.

The _format_* functions allow the output to be expressed as a either a regular expression that matches the input text; or alternatively as output like that asked for in the SO question.

The code

#!/bin/env python
"""
Extract repeated blocks of chars from a string
"""
from __future__ import print_function
 
import re
 
 
__all__ = ['char_rep']
__version__ = '0.0.1'
DEBUG = False
 
 
def _format_re(bchars, brep):
    'annotated in regexp-like format, repeated block of chars'
    bchars = re.escape(bchars)
    if brep > 1:
        return '(%s){%i}' % (bchars, brep)
    else:
        return bchars
 
def _format_so(bchars, brep):
    'brep(bchars) format for Stack Overflow: http://stackoverflow.com/questions/28519559/compressing-string-with-repeating-chars'
    return '%i(%s)' % (brep, bchars) if bchars else ''
 
def char_rep(txt, debug=False, _format=_format_re):
    'return repeated blocks of chars in text with their repeat count'
    output,  lastend = [], 0
    for match in re.finditer(r"""(?ms)(?P<repeat>(?P<chars>.+?)(?:(?P=chars))+)""", txt):
        if debug: print('\n  lastend, txt[lastend:] = ',lastend, repr(txt[lastend:]))
        beginpos, endpos = match.span()
        repeat, chars = match.group('repeat'), match.group('chars')
        if lastend < beginpos:  # Skipped a non-repeated, 'single' block
            output.append(_format(txt[lastend:beginpos], 1))
        output.append(_format(chars, repeat.count(chars)))
        lastend = endpos
        if debug:
            print('  match.span() = ', match.span())
            print("  match.group['repeat'] = ", repeat)
            print("  match.group['chars'] = ", chars)
            print("  repeatcount = ", repeat.count(chars))
            print("  output so far = %r" % ''.join(output))
    output = ''.join(output) + _format(txt[lastend:], 1)
    if debug: print("return %r" % output)
    return output
 
 
if __name__ == '__main__':
    for n, txt in enumerate('''
    ULULUL 
    LDDLDDLDDLDDLXXLCCLCCLCC
    LDDLDDLDDLDDLXXLCCLCCLCCC
    DDLDDLDDLDDLDDLDDLDDLDDLDDLDDL
    LDDLDDLDDLDDLDDLDDLDDLDDLDDLDD
    LLLLDLLLLDLLLLD 
    '''.strip().split(), 1):
        output_re = char_rep(txt, debug=DEBUG, _format=_format_re)
        assert re.match('^%s$' % output_re, txt), "Output is not an RE to match all of input string!"
        output_so = char_rep(txt, debug=DEBUG, _format=_format_so)
        print("String %i: %r\n  re_formatted = %r\n  so_formatted = %r\n"
               % (n, txt, output_re, output_so))
 


Code output


String 1: 'ULULUL'
  re_formatted = '(UL){3}'
  so_formatted = '3(UL)'
 
String 2: 'LDDLDDLDDLDDLXXLCCLCCLCC'
  re_formatted = '(LDD){4}L(X){2}(LCC){3}'
  so_formatted = '4(LDD)1(L)2(X)3(LCC)'
 
String 3: 'LDDLDDLDDLDDLXXLCCLCCLCCC'
  re_formatted = '(LDD){4}L(X){2}(LCC){3}C'
  so_formatted = '4(LDD)1(L)2(X)3(LCC)1(C)'
 
String 4: 'DDLDDLDDLDDLDDLDDLDDLDDLDDLDDL'
  re_formatted = '(D){2}(LDD){9}L'
  so_formatted = '2(D)9(LDD)1(L)'
 
String 5: 'LDDLDDLDDLDDLDDLDDLDDLDDLDDLDD'
  re_formatted = '(LDD){10}'
  so_formatted = '10(LDD)'
 
String 6: 'LLLLDLLLLDLLLLD'
  re_formatted = '(L){4}(DLLLL){2}D'
  so_formatted = '4(L)2(DLLLL)1(D)'
 

You should note that we are at the mercy of the regular expression so that string 6 is actually 'LLLLD' repeated three times, but the program matches it differently. I guess I could have several repeat finding regular expressions and use the one that gives the most succinct value.

Debug mode

set DEBUG to True or call char_rep(txt, debug=True) in the code and you get a listing of the intermediate values from function char_rep.

END.