Monday, June 30, 2014

Why do static languages make it difficult to maintain small codebases?

Prompted by this post on Reddit

So many projects are abandoned after wasting millions often due to leaders believing that they can adequately sub-divide a problem to harness large development teams using the features of static languages to deliver huge requirements lists.

Dynamic languages make it plain that someone has to know what is going on  at all times in the software and have use features, such as modularisation and reuse - not just of what is written in one language, but often making it straightforward to integrate best-in-class libraries written in other languages - be they static or dynamic; together with the support of the language community in doing so.

Dynamic languages are excellent in supporting the Unix philosophy of  using small interconnected specialist programs to solve large problems without having to totally reinvent the wheel. Quick prototyping and duck typing often allow several solutions paths to be explored and so dead-ends avoided earlier rather than the usual static language approach which can often lead to inferior solutions having to be pursued due to the general increased time-to-prototype of static languages.

Tuesday, June 24, 2014

Indexing a sublist of a list the way you index a substring of a string


Following this question on reddit, I decided to create a script to index a sublist within a list similar to the way index works for Python strings.

Python Strings

Strings are a concatenation of characters, (although there is no separate character type in Python - a one element string suffices). You can find a substring of characters within a string using the str.index method and the substring can be more than one character. If the substring is not present then ValueError is raised.
In [21]:
help(str.index)
Help on method_descriptor:

index(...)
    S.index(sub [,start [,end]]) -> int
    
    Like S.find() but raise ValueError when the substring is not found.


In [22]:
s = 'abcabcd'
s.index('b')
Out[22]:
1
In [23]:
s.index('ca')
Out[23]:
2
In [24]:
s.index('x')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
{ipython-input-24-a677ba2c36a5} in ()
----> 1 s.index('x')

ValueError: substring not found

Python Lists

Single, individual values may be searched for within a list using the list.index method, but a sequence, or sublist of items cannot be searched for:
In [25]:
help(list.index)
Help on method_descriptor:

index(...)
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.


In [26]:
x = [1, 2, 3, 4, 5]
x.index(3)
Out[26]:
2
Try looking for value 3 followed by value 4
In [27]:
x.index([3, 4])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
{ipython-input-27-3cb52b6ef201} in ()
----> 1 x.index([3, 4])

ValueError: [3, 4] is not in list
But note:
In [28]:
y = [1, 2, [3, 4], 5]
y.index([3, 4])
Out[28]:
2
Note: the sublist [3,4] exists as a top-level element of y and so in index-able, but the value 3 is not
In [29]:
y.index(3)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
{ipython-input-29-f0b9f73a108f} in ()
----> 1 y.index(3)

ValueError: 3 is not in list

The index-ing function

In [30]:
def indexsublist(sub, lst):
    """
    Find a sublist in a list    
    """
    if type(sub) == list:
        sublen = len(sub)
        first = sub[0] if sub else []
    else:
        raise ValueError
    indx = -1
    while True:
        try:
            indx = lst.index(first, indx + 1)
        except ValueError:
            break
        if sub == lst[indx: indx + sublen]:
            return indx
    raise ValueError
Give the function a sub-list together with a list to find it in and for every occurrence of the first item of the sub-list in the list it checks to see if the next items of the list are also the same from that position of the first values match. The while loop tries all matches of the first value of the sub-list until they are all tried or a match is found.

Tests

In [31]:
for x in [[1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]], ]:
    for sublist in [[5, [1, 2]], [4, 5, [1, 2]], [3, 5, [1, 2]],
                    [[1, 2]], [[1, 2], 3], [[1, 2], 4], [], 4, [4], [[4]]]:
        try:
            ans = indexsublist(sublist, x)
        except ValueError:
            ans = '{ValueError}'
        print('%12s == indexsublist(%14s, %s)' % (ans, sublist, x))
           8 == indexsublist(   [5, [1, 2]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           7 == indexsublist([4, 5, [1, 2]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
          10 == indexsublist([3, 5, [1, 2]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           9 == indexsublist(      [[1, 2]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           9 == indexsublist(   [[1, 2], 3], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
{ValueError} == indexsublist(   [[1, 2], 4], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           4 == indexsublist(            [], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
{ValueError} == indexsublist(             4, [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           3 == indexsublist(           [4], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           5 == indexsublist(         [[4]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])

A Find all indices function

There isn't a method on str to find all indices where a substring occurs in a string, but is is a part of the re.module and I decided to augment the above function to have that functionality for lists.
In [32]:
def findall(sub, lst, overlap=True):
    """
    Find all positions of sublist in a list    
    """
    if type(sub) == list:
        sublen = len(sub)
        first = sub[0] if sub else []
    else:
        raise ValueError
    indices, indx = [], -1
    while True:
        try:
            indx = lst.index(first, indx + 1)
        except ValueError:
            break
        if sub == lst[indx: indx + sublen]:
            indices.append(indx)
            if not overlap:
                indx += sublen - 1
    return indices
List indices in function findall just accumulates all the indices as it tries along the lst. Argument overlap chooses between overlapping and nonoverlapping searches. The difference is seen in the last two test outputs for searches for [4, 4].

Tests

In [33]:
for x in [[3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4], ]:
    for sublist in [[5, [1, 2]], [4, 5, [1, 2]], [3, 5, [1, 2]],
                    [[1, 2]], [[1, 2], 3], [[1, 2], 4], [], 4, [[4]], [4, 4]]:
        try:
            ans = findall(sublist, x)
        except ValueError:
            ans = ''
        print('%12s == findall(%14s, %s)' % (ans, sublist, x))
for x in [[3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4], ]:
    for sublist in [[4, 4], ]:
        try:
            ans = findall(sublist, x, overlap=False)
        except ValueError:
            ans = '{ValueError}'
        print('%12s == findall(%14s, %s, overlap=False)' % (ans, sublist, x))
      [6, 9] == findall(   [5, [1, 2]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [5] == findall([4, 5, [1, 2]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [8] == findall([3, 5, [1, 2]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
     [7, 10] == findall(      [[1, 2]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [7] == findall(   [[1, 2], 3], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
        [10] == findall(   [[1, 2], 4], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [2] == findall(            [], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
{ValueError} == findall(             4, [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [3] == findall(         [[4]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
[11, 12, 13] == findall(        [4, 4], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
    [11, 13] == findall(        [4, 4], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4], overlap=False)

End.

Sunday, June 15, 2014

Recursive Collections in Python

If you are settled, then lets begin...

A recursive data structure is a collection of data where when you try and show the items of the collection, you end up in an infinite loop and could go on showing items without end.
It is best shown with Python lists, (although other mutable types that can hold mutable types work too).

A simple recursive list

In [55]:
x = [0, 1, 2]
x
Out[55]:
[0, 1, 2]
X is at present a straightforward list.
A python list object is a container of, (internal references to), other objects. lists are mutable, which means that the same list object may have its contents altered but still remain the same list object, just with updated contents. (Analogous to a chest of drawers being able to have socks in the top drawer one week but be rearranged to have ties in the top drawer later on, but still be refered to as the same chest of drawers).
What if the list x object contained itself? Contained just means that one of the referenced objects of x is x itself.
We can do that:
In [56]:
x[2] = x
Lets give python a chance to give the value of x. It should be straightforward to display the x[0], and x[1] items but what happens when it reaches x[2]? We have said that it is x and we can display x[0], and x[1], but what happens whith this x[2]...
In [57]:
x
Out[57]:
[0, 1, [...]]
Yay!
Python is smart enough to find the point of recursion and give an indication of the point of recursion by the symbol [...].
Hmm, I wonder what pprint has to offer when displaying x?
In [58]:
from pprint import pprint as pp

pp(x)
[0, 1, (Recursion on list with id=86231560)]

pprint goes further and tries to show what is being recursed into by telling you the object id.
The id of an object is a unique integer representing every object of Python. If two Python names refer to values with the same id, then they are in fact referring to the same object:
In [59]:
s = t = object()
s == t, s is t, id(s) == id(t)
Out[59]:
(True, True, True)
So if x refers to x then the id of x should equal the id of x[2]
In [60]:
id(x), id(x[2]), id(x) == id(x[2])
Out[60]:
(86231560L, 86231560L, True)

Mutual recursion

We can also have more complex recursions. In this case we will have x refer to y which refers to x...
In [61]:
x = [0, 1, 2]
y = [3, 4, 5]

x[2], y[2] = y, x

print('x = %r; y= %r' % (x, y))
x = [0, 1, [3, 4, [...]]]; y= [3, 4, [0, 1, [...]]]

This time the recursions are found where x is in y for x and where y is in x for y.
Showing the id's will help clarify things
In [62]:
print('id(x) == %i; id(y) == %i' % (id(x), id(y)))
id(x) == 86190216; id(y) == 86215816

In [63]:
pp(x)
[0, 1, [3, 4, (Recursion on list with id=86190216)]]

In [64]:
pp(y)
[3, 4, [0, 1, (Recursion on list with id=86215816)]]

Wacky indexing

In [65]:
x[2][2][2][2][0]
Out[65]:
0
In [66]:
x[2][2][2][2][2][2][2][2] == x
Out[66]:
True

Is this list recursive?

One way of finding if a lst is recursive would be to look for the '[...] sequence in the repr of a list, but it would be fragile; easily broken by a list with a string item containing '[...]'.
Lets write a better function to show if a list is recursive.
In [68]:
x1 = [0, 1, 2]
z1 = [0, 1, [2, x1], 3, [4, 5, x1]] # Not recursive
z1
Out[68]:
[0, 1, [2, [0, 1, 2]], 3, [4, 5, [0, 1, 2]]]
In [69]:
x2 = [0, 1, 2]
z2 = [0, 1, [2, x2], 3, [4, 5, x2]]
x2.append(z2) # Now make x2, and z2 mutually recursive
x2
Out[69]:
[0, 1, 2, [0, 1, [2, [...]], 3, [4, 5, [...]]]]
In [70]:
z2
Out[70]:
[0, 1, [2, [0, 1, 2, [...]]], 3, [4, 5, [0, 1, 2, [...]]]]

The isRecursive test function

In [71]:
def isRecursive(lst, seen=None):
    if seen is None:
        seen = {id(lst)}
    for item in lst: # for only works for iterables; otherwise raise TypeError
        iditem = id(item)
        if iditem in seen:
            return iditem
        else:
            seen.add(iditem)
            try:
                foundrecursion = isRecursive(item, seen)
            except TypeError:
                continue
            else:
                if foundrecursion:
                    return foundrecursion
            finally:
                seen.discard(iditem)
    return False
In [72]:
isRecursive(x1)
Out[72]:
False
In [73]:
isRecursive(z1)
Out[73]:
False
In [74]:
isRecursive(x2)
Out[74]:
87163336L
In [75]:
isRecursive(z2)
Out[75]:
85609608L
In [76]:
id(x2), id(z2)
Out[76]:
(87163336L, 85609608L)

End.