Monday, October 21, 2013

Unifying Pythons string and list sequences for fsplit

I was looking at a particular post A more functional split function by Simon Grondin and wondered what it would look like in Python.

Simon describes his function thus:
"Fsplit takes a single ‘predicate’ function as argument. It loops over the array and adds one element at a time to a ‘buffer’ array. After adding an element, it calls the predicate on the buffer. If the predicate returns true, the buffer without the last element is added to the ‘ret’ array. At the end, that ‘ret’ array is returned, filtered from any 0-length elements."
Now a Python version would take both the predicate and array as arguments and, since it is to be in a functional style I decided to make it a generator function.

fsplit_list

Simon's first fsplit function works on arrays and had as its first test to split an array of integers on the occurrence of  integer 3. The second test was to split to " get all consecutive intervals with a sum of 4 or less"

I coded that up as the following Python:

>>> def fsplit_list(predicate, array):
    buf = []
    for c in array:
        buf.append(c)
        if predicate(buf):
            if buf[:-1]: yield buf[:-1]
            buf = [] if predicate([c]) else [c]
    else:
        if buf: yield buf


>>> # Split on 3
>>> list(fsplit_list(lambda x: 3 in x, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [4, 5, 1, 2], [4, 5]]
>>> # Get all consecutive intervals with a sum of 4 or less
>>> list(fsplit_list(lambda x: sum(x) > 4, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [3], [4], [1, 2], [3], [3], [4]]
>>>

fsplit_string

SImon then needed another function in his typed language that took a predicate and a string as arguments and worked in a similar way for strings. The test in this case was to split a string into substreings with no more than two vowels in them. My Python code was the similar function fsplit_string:

>>> import re
>>> def fsplit_string(predicate, string):
    buf = ''
    for c in string:
        buf += c
        if predicate(buf):
            if buf[:-1]: yield buf[:-1]
            buf = '' if predicate(c) else c
    else:
        if buf: yield buf


>>> # String
>>> list(fsplit_string(lambda string: len(re.findall(r'[aeiou]', string)) > 2, "lorem ipsum dolor sit amet"))
['lorem ', 'ipsum d', 'olor s', 'it am', 'et']
>>>

Unification

I wanted to have one function that worked on both lists and strings but without executing separate code dependant on the type of the sequence being split. Specifically it should work equally on lists and strings without testing for sequence type.

Comparing the two functions above I saw differences in the initialization of buf; the fact that iterating through a string leads to c being a string but for a list c is not a list - it is whatever object is in the list. This last point affects how buf is extended - for strings you can use += but lists have to be appended to.

My solution with the same tests is as follows:

>>> def fsplit(predicate, sequence):
    buf = type(sequence)()
    for c in (sequence[i:i+1] for i in range(len(sequence))):
        buf += c
        if predicate(buf):
            if buf[:-1]: yield buf[:-1]
            buf = type(sequence)() if predicate(c) else c
    else:
        if buf: yield buf


>>> # Split on 3
>>> list(fsplit(lambda x: 3 in x, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [4, 5, 1, 2], [4, 5]]
>>> # Get all consecutive intervals with a sum of 4 or less
>>> list(fsplit(lambda x: sum(x) > 4, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [3], [4], [1, 2], [3], [3], [4]]
>>> # String
>>> list(fsplit(lambda string: len(re.findall(r'[aeiou]', string)) > 2, "lorem ipsum dolor sit amet"))
['lorem ', 'ipsum d', 'olor s', 'it am', 'et']
>>>

That strange initialization of buf works because calling the type of a sequence gives the empty sequence of that type.
I realised that if I could create successive one member slices from a list as name c then they could be concatenated using += just like for strings hence the for statement that generates successive items from a list as one element lists - it works in a similar way for strings too - giving successive elements of a string as one element strings (but that is what the "normal" for statement did in fsplit_string).

I guess some of the fiddling about is because strings are immutable whereas lists are mutable in Python, but I wonder if there is a better/more pythonic way of writing function fsplit to the same constraints?

END.


Saturday, October 05, 2013

Project Euler #8 solution

Someone asked some innocuous question related to a Project Euler question where you had to find the largest number that is produced by multiplying five consecutive digits of a thousand digit number.

The digits were presented as twenty rows of fifty characters and the question was about how to grab the web page and extract the number.

I decided to do a straight-forward copy-paste of the digits into a string and going from there.

The algorithm I use recognises that finding a zero in the digits zeroes out any product and just traverses the digits in order performing appropriate actions based on accumulated state:

# http://projecteuler.net/problem=8#!

xstring = '''73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450'''
x = int(xstring.replace('\n', ''))
xl = [int(ch) for ch in xstring if ch != '\n']

# current product; current maximum and index of where seen; number of digits contributing to prod
prod, mx, digits = 0, (0, -1), 0
for i, digit in enumerate(xl):
    if digit == 0:
        # Need to start again at next non-zero
        prod = digits = 0
    else:
        digits += 1
        if digits == 1:
            prod = digit
        else:
            prod *= digit
        if digits > 5:
            # Remove influence of the first of six digits currently in product
            prod //= xl[i - 5]
            digits -= 1
        if digits == 5 and prod > mx[0]:
            # New max at position i
            mx = (prod, i)
if mx[0]:
    print ("The greatest product of five consecutive digits in the 1000-digit number")
    print ("is %i for ...%s... occurring at digits in position up to %i" %
           (mx[0], ''.join(str(i) for i in xl[mx[1] - 5: mx[1]]), mx[1]))
else:
    print ("The greatest product of five consecutive digits in the 1000-digit number does not occur")


The output is:

The greatest product of five consecutive digits in the 1000-digit number
is 40824 for ...39987... occurring at digits in position up to 368

- END.