I saw this Stack Overflow (SO) question which I thought was asking how to show the repeated characters in a string.
I immediately thought about my earlier code from post "Find repeated blocks of lines in a text file: The block_lines tool." and immediately thought of modifying it to get a similar result i.e. using the in-built regular expression engine to find the repetitions.
in function char_rep, a regular expression finds a section of repeated characters in a string:
r"""(?ms)(?P<repeat>(?P<chars>.+?)(?:(?P=chars))+)"""
We find all those repeated sections in rightward order through the string as well as detecting intervening non-repeated blocks of characters and format them appropriately and extend the output string.
The _format_* functions allow the output to be expressed as a either a regular expression that matches the input text; or alternatively as output like that asked for in the SO question.
The code
#!/bin/env python """ Extract repeated blocks of chars from a string """ from __future__ import print_function import re __all__ = ['char_rep'] __version__ = '0.0.1' DEBUG = False def _format_re(bchars, brep): 'annotated in regexp-like format, repeated block of chars' bchars = re.escape(bchars) if brep > 1: return '(%s){%i}' % (bchars, brep) else: return bchars def _format_so(bchars, brep): 'brep(bchars) format for Stack Overflow: http://stackoverflow.com/questions/28519559/compressing-string-with-repeating-chars' return '%i(%s)' % (brep, bchars) if bchars else '' def char_rep(txt, debug=False, _format=_format_re): 'return repeated blocks of chars in text with their repeat count' output, lastend = [], 0 for match in re.finditer(r"""(?ms)(?P<repeat>(?P<chars>.+?)(?:(?P=chars))+)""", txt): if debug: print('\n lastend, txt[lastend:] = ',lastend, repr(txt[lastend:])) beginpos, endpos = match.span() repeat, chars = match.group('repeat'), match.group('chars') if lastend < beginpos: # Skipped a non-repeated, 'single' block output.append(_format(txt[lastend:beginpos], 1)) output.append(_format(chars, repeat.count(chars))) lastend = endpos if debug: print(' match.span() = ', match.span()) print(" match.group['repeat'] = ", repeat) print(" match.group['chars'] = ", chars) print(" repeatcount = ", repeat.count(chars)) print(" output so far = %r" % ''.join(output)) output = ''.join(output) + _format(txt[lastend:], 1) if debug: print("return %r" % output) return output if __name__ == '__main__': for n, txt in enumerate(''' ULULUL LDDLDDLDDLDDLXXLCCLCCLCC LDDLDDLDDLDDLXXLCCLCCLCCC DDLDDLDDLDDLDDLDDLDDLDDLDDLDDL LDDLDDLDDLDDLDDLDDLDDLDDLDDLDD LLLLDLLLLDLLLLD '''.strip().split(), 1): output_re = char_rep(txt, debug=DEBUG, _format=_format_re) assert re.match('^%s$' % output_re, txt), "Output is not an RE to match all of input string!" output_so = char_rep(txt, debug=DEBUG, _format=_format_so) print("String %i: %r\n re_formatted = %r\n so_formatted = %r\n" % (n, txt, output_re, output_so))
Code output
String 1: 'ULULUL' re_formatted = '(UL){3}' so_formatted = '3(UL)' String 2: 'LDDLDDLDDLDDLXXLCCLCCLCC' re_formatted = '(LDD){4}L(X){2}(LCC){3}' so_formatted = '4(LDD)1(L)2(X)3(LCC)' String 3: 'LDDLDDLDDLDDLXXLCCLCCLCCC' re_formatted = '(LDD){4}L(X){2}(LCC){3}C' so_formatted = '4(LDD)1(L)2(X)3(LCC)1(C)' String 4: 'DDLDDLDDLDDLDDLDDLDDLDDLDDLDDL' re_formatted = '(D){2}(LDD){9}L' so_formatted = '2(D)9(LDD)1(L)' String 5: 'LDDLDDLDDLDDLDDLDDLDDLDDLDDLDD' re_formatted = '(LDD){10}' so_formatted = '10(LDD)' String 6: 'LLLLDLLLLDLLLLD' re_formatted = '(L){4}(DLLLL){2}D' so_formatted = '4(L)2(DLLLL)1(D)'
You should note that we are at the mercy of the regular expression so that string 6 is actually 'LLLLD' repeated three times, but the program matches it differently. I guess I could have several repeat finding regular expressions and use the one that gives the most succinct value.
Debug mode
set DEBUG to True or call char_rep(txt, debug=True) in the code and you get a listing of the intermediate values from function char_rep.END.
No comments:
Post a Comment