Tuesday, June 24, 2014

Indexing a sublist of a list the way you index a substring of a string


Following this question on reddit, I decided to create a script to index a sublist within a list similar to the way index works for Python strings.

Python Strings

Strings are a concatenation of characters, (although there is no separate character type in Python - a one element string suffices). You can find a substring of characters within a string using the str.index method and the substring can be more than one character. If the substring is not present then ValueError is raised.
In [21]:
help(str.index)
Help on method_descriptor:

index(...)
    S.index(sub [,start [,end]]) -> int
    
    Like S.find() but raise ValueError when the substring is not found.


In [22]:
s = 'abcabcd'
s.index('b')
Out[22]:
1
In [23]:
s.index('ca')
Out[23]:
2
In [24]:
s.index('x')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
{ipython-input-24-a677ba2c36a5} in ()
----> 1 s.index('x')

ValueError: substring not found

Python Lists

Single, individual values may be searched for within a list using the list.index method, but a sequence, or sublist of items cannot be searched for:
In [25]:
help(list.index)
Help on method_descriptor:

index(...)
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.


In [26]:
x = [1, 2, 3, 4, 5]
x.index(3)
Out[26]:
2
Try looking for value 3 followed by value 4
In [27]:
x.index([3, 4])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
{ipython-input-27-3cb52b6ef201} in ()
----> 1 x.index([3, 4])

ValueError: [3, 4] is not in list
But note:
In [28]:
y = [1, 2, [3, 4], 5]
y.index([3, 4])
Out[28]:
2
Note: the sublist [3,4] exists as a top-level element of y and so in index-able, but the value 3 is not
In [29]:
y.index(3)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
{ipython-input-29-f0b9f73a108f} in ()
----> 1 y.index(3)

ValueError: 3 is not in list

The index-ing function

In [30]:
def indexsublist(sub, lst):
    """
    Find a sublist in a list    
    """
    if type(sub) == list:
        sublen = len(sub)
        first = sub[0] if sub else []
    else:
        raise ValueError
    indx = -1
    while True:
        try:
            indx = lst.index(first, indx + 1)
        except ValueError:
            break
        if sub == lst[indx: indx + sublen]:
            return indx
    raise ValueError
Give the function a sub-list together with a list to find it in and for every occurrence of the first item of the sub-list in the list it checks to see if the next items of the list are also the same from that position of the first values match. The while loop tries all matches of the first value of the sub-list until they are all tried or a match is found.

Tests

In [31]:
for x in [[1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]], ]:
    for sublist in [[5, [1, 2]], [4, 5, [1, 2]], [3, 5, [1, 2]],
                    [[1, 2]], [[1, 2], 3], [[1, 2], 4], [], 4, [4], [[4]]]:
        try:
            ans = indexsublist(sublist, x)
        except ValueError:
            ans = '{ValueError}'
        print('%12s == indexsublist(%14s, %s)' % (ans, sublist, x))
           8 == indexsublist(   [5, [1, 2]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           7 == indexsublist([4, 5, [1, 2]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
          10 == indexsublist([3, 5, [1, 2]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           9 == indexsublist(      [[1, 2]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           9 == indexsublist(   [[1, 2], 3], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
{ValueError} == indexsublist(   [[1, 2], 4], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           4 == indexsublist(            [], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
{ValueError} == indexsublist(             4, [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           3 == indexsublist(           [4], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])
           5 == indexsublist(         [[4]], [1, 2, 3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2]])

A Find all indices function

There isn't a method on str to find all indices where a substring occurs in a string, but is is a part of the re.module and I decided to augment the above function to have that functionality for lists.
In [32]:
def findall(sub, lst, overlap=True):
    """
    Find all positions of sublist in a list    
    """
    if type(sub) == list:
        sublen = len(sub)
        first = sub[0] if sub else []
    else:
        raise ValueError
    indices, indx = [], -1
    while True:
        try:
            indx = lst.index(first, indx + 1)
        except ValueError:
            break
        if sub == lst[indx: indx + sublen]:
            indices.append(indx)
            if not overlap:
                indx += sublen - 1
    return indices
List indices in function findall just accumulates all the indices as it tries along the lst. Argument overlap chooses between overlapping and nonoverlapping searches. The difference is seen in the last two test outputs for searches for [4, 4].

Tests

In [33]:
for x in [[3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4], ]:
    for sublist in [[5, [1, 2]], [4, 5, [1, 2]], [3, 5, [1, 2]],
                    [[1, 2]], [[1, 2], 3], [[1, 2], 4], [], 4, [[4]], [4, 4]]:
        try:
            ans = findall(sublist, x)
        except ValueError:
            ans = ''
        print('%12s == findall(%14s, %s)' % (ans, sublist, x))
for x in [[3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4], ]:
    for sublist in [[4, 4], ]:
        try:
            ans = findall(sublist, x, overlap=False)
        except ValueError:
            ans = '{ValueError}'
        print('%12s == findall(%14s, %s, overlap=False)' % (ans, sublist, x))
      [6, 9] == findall(   [5, [1, 2]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [5] == findall([4, 5, [1, 2]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [8] == findall([3, 5, [1, 2]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
     [7, 10] == findall(      [[1, 2]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [7] == findall(   [[1, 2], 3], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
        [10] == findall(   [[1, 2], 4], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [2] == findall(            [], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
{ValueError} == findall(             4, [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
         [3] == findall(         [[4]], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
[11, 12, 13] == findall(        [4, 4], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4])
    [11, 13] == findall(        [4, 4], [3, 4, [], [4], 3, 4, 5, [1, 2], 3, 5, [1, 2], 4, 4, 4, 4], overlap=False)

End.

1 comment: